[PATCH] kvm-intel: Auto-load on CPUs with VMX.

2012-03-21 Thread Josh Triplett
Enable x86 feature-based autoloading for the kvm-intel module on CPUs
with X86_FEATURE_VMX.

Signed-off-by: Josh Triplett j...@joshtriplett.org
---

I've tested this and it works correctly.  I can also supply the obvious
equivalent patch for kvm-amd on AMD CPUs with SVM, but I don't have any
such CPUs to test that on.
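For reference, a sketch of what that kvm-amd change would presumably look
like in svm.c (untested; svm_cpu_id is just my name for the table):

    static const struct x86_cpu_id svm_cpu_id[] = {
        X86_FEATURE_MATCH(X86_FEATURE_SVM),
        {}
    };
    MODULE_DEVICE_TABLE(x86cpu, svm_cpu_id);

With such a device table in place, udev can match the x86cpu modalias
exported for the CPU and autoload the module.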

 arch/x86/kvm/vmx.c |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3b4c8d8..bb3bcff 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -27,6 +27,7 @@
 #include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/moduleparam.h>
+#include <linux/mod_devicetable.h>
 #include <linux/ftrace_event.h>
 #include <linux/slab.h>
 #include <linux/tboot.h>
@@ -51,6 +52,12 @@
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
 
+static const struct x86_cpu_id vmx_cpu_id[] = {
+   X86_FEATURE_MATCH(X86_FEATURE_VMX),
+   {}
+};
+MODULE_DEVICE_TABLE(x86cpu, vmx_cpu_id);
+
 static bool __read_mostly enable_vpid = 1;
 module_param_named(vpid, enable_vpid, bool, 0444);
 
-- 
1.7.9.1



RE: [Qemu-devel] [PATCH v4 4/7] RTC: Set internal millisecond register to 500ms when reset divider

2012-03-21 Thread Zhang, Yang Z
 -Original Message-
 From: Stefano Stabellini [mailto:stefano.stabell...@eu.citrix.com]
 Sent: Wednesday, March 21, 2012 1:39 AM
 To: Zhang, Yang Z
 Cc: qemu-de...@nongnu.org; Paolo Bonzini; aligu...@us.ibm.com;
 kvm@vger.kernel.org
 Subject: Re: [Qemu-devel] [PATCH v4 4/7] RTC: Set internal millisecond 
 register to
 500ms when reset divider
 
 On Mon, 19 Mar 2012, Zhang, Yang Z wrote:
 The first update cycle begins one-half second after the divider reset is
 removed.
 
  Signed-off-by: Yang Zhang yang.z.zh...@intel.com
  ---
   hw/mc146818rtc.c |   38 +-
   1 files changed, 33 insertions(+), 5 deletions(-)
 
  diff --git a/hw/mc146818rtc.c b/hw/mc146818rtc.c
  index 6ebb8f6..5e7fbb5 100644
  --- a/hw/mc146818rtc.c
  +++ b/hw/mc146818rtc.c
  @@ -110,6 +110,8 @@ static void rtc_set_time(RTCState *s);
   static void rtc_calibrate_time(RTCState *s);
   static void rtc_set_cmos(RTCState *s);
 
  +static int32_t divider_reset;
 
 this should be part of RTCState
 
 
   static uint64_t get_guest_rtc_us(RTCState *s)
   {
   int64_t host_usec, offset_usec, guest_usec;
  @@ -220,16 +222,24 @@ static void rtc_periodic_timer(void *opaque)
   }
   }
 
  -static void rtc_set_offset(RTCState *s)
  +static void rtc_set_offset(RTCState *s, int32_t start_usec)
 
 I noticed that you are always passing a positive number or 0 as
 start_usec: it might be worth turning start_usec into a uint32_t or
 uint64_t to avoid integer signedness errors.
Agree.

 Also it is not clear to me what this start_usec is supposed to be: if it
 is a delay to be added to the guest time, then it is best to rename it
 to delay_usec and add a comment on top to explain what it is for.
Actually, start_usec is only used when the divider is changed from reset to a 
valid time base. When that change happens, the first update cycle is 500ms 
later, so start_usec equals 500ms. Passing 0 as start_usec means the rtc 
internal millisecond register is not changed; it stays synchronized with the 
host's time.

 
   struct tm *tm = s->current_tm;
  -int64_t host_usec, guest_sec, guest_usec;
  +int64_t host_usec, guest_sec, guest_usec, offset_usec, old_guest_usec;
 
   host_usec = qemu_get_clock_ns(host_clock) / NS_PER_USEC;
  +offset_usec = s->offset_sec * USEC_PER_SEC + s->offset_usec;
  +old_guest_usec = (host_usec + offset_usec) % USEC_PER_SEC;
 
   guest_sec = mktimegm(tm);
  -guest_usec = guest_sec * USEC_PER_SEC;
 
  +/* start_usec == 0 means the rtc internal millisecond is
  + * the same as before */
  +if (start_usec == 0) {
  +guest_usec = guest_sec * USEC_PER_SEC + old_guest_usec;
  +} else {
  +guest_usec = guest_sec * USEC_PER_SEC + start_usec;
  +}
 
 This looks like a mistake to me: before guest_usec was derived
 exclusively from mktimegm (take or leave USEC_PER_SEC), while now
 guest_usec is the sum of the value returned by mktimegm and
 old_guest_usec, even when start_usec is 0.
 Are you sure that this is correct?
The logic is right. Before this patch, we assumed the rtc internal millisecond 
register was the same as the host time, so we didn't need to consider it and 
using mktimegm was enough. Now the rtc internal millisecond can be changed, so 
I use old_guest_usec to record the current internal millisecond. When 
start_usec is 0, it means we don't need to change the internal millisecond 
register, and offset_sec is the same as before.
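(A worked example with made-up numbers: suppose host_usec is 1000300000, i.e. 
1000.3s, and the stored offset is 2.5s, so offset_usec = 2500000 and 
old_guest_usec = (1000300000 + 2500000) % 1000000 = 800000. With 
start_usec == 0, the new guest_usec keeps that 0.8s fraction; after a divider 
reset, start_usec = 500000 forces the fraction to 0.5s instead.)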

best regards
yang


RE: [Qemu-devel] [PATCH v4 2/7] RTC: Update the RTC clock only when reading it

2012-03-21 Thread Zhang, Yang Z
 -Original Message-
 From: Stefano Stabellini [mailto:stefano.stabell...@eu.citrix.com]
 Sent: Tuesday, March 20, 2012 10:16 PM
 To: Zhang, Yang Z
 Cc: qemu-de...@nongnu.org; Paolo Bonzini; aligu...@us.ibm.com;
 kvm@vger.kernel.org
 Subject: Re: [Qemu-devel] [PATCH v4 2/7] RTC: Update the RTC clock only when
 reading it
 
 On Mon, 19 Mar 2012, Zhang, Yang Z wrote:
  There is no need to use two periodic timers to update the RTC time. In
  this patch, we only update it when the guest reads it.
 
 So the basic idea here is that we don't need two periodic timers
 because we are going to calculate the RTC guest time from QEMU's
 host_clock.
 
 I only have a couple of observations:
 
 - shouldn't we use QEMU rtc_clock, rather than host_clock?
Right. It should be rtc_clock not host_clock

 - it would be better to use shifts rather than divisions whenever
 possible, they are much cheaper;
Agree.
 
 - rtc_calibrate_time seems to be the new function that updates the
 guest rtc time based on QEMU host_clock. Are you sure we are calling
 it all the times we need to call it? Could we just call it at the
 beginning of cmos_ioport_write and cmos_ioport_read?
No. If RTC_B_SET is set or the divider reset is held, we should not call 
rtc_calibrate_time.

best regards
yang


Re: [Qemu-devel] [PATCH v4 5/7] RTC:Add RTC update-ended interrupt support

2012-03-21 Thread Paolo Bonzini
On 20/03/2012 19:35, Stefano Stabellini wrote:
 This is the function that is used to figure out whether we need the
 timers or not, the condition seems to be:
 
 (Not (REG_C_UF | REG_C_AF)) And (Not (REG_B_SET))
 
 Shouldn't it actually check for UIE being enabled?

No, you need to set UF in case the code observes it without actually
enabling interrupt delivery on the ISA bus.
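A minimal sketch of those semantics, using the register macros from
hw/mc146818rtc.c (illustrative only, not the literal patch):

    /* always record the update-ended event ... */
    s->cmos_data[RTC_REG_C] |= REG_C_UF;
    /* ... but only assert the interrupt when UIE is enabled */
    if (s->cmos_data[RTC_REG_B] & REG_B_UIE) {
        s->cmos_data[RTC_REG_C] |= REG_C_IRQF;
        qemu_irq_raise(s->irq);
    }

This way a guest that polls REG_C still observes UF even when interrupt
delivery is disabled.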

Paolo


Re: [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-03-21 Thread Gleb Natapov
On Wed, Mar 21, 2012 at 08:56:03AM +0800, Wen Congyang wrote:
 At 03/20/2012 11:45 PM, Gleb Natapov Wrote:
  On Tue, Mar 20, 2012 at 05:59:16PM +0800, Wen Congyang wrote:
  At 03/19/2012 03:33 PM, Wen Congyang Wrote:
  At 03/08/2012 03:57 PM, Wen Congyang Wrote:
   We can know the guest is panicked when the guest runs on xen.
   But we do not have such a feature on kvm.
  
   Another purpose of this feature is: a management app (for example,
   libvirt) can do an auto dump when the guest crashes. If the management
   app does not do an auto dump, the guest's user can dump by hand if
   he sees the guest is panicked.
  
   I touch the hypervisor instead of using virtio-serial, because
   1. it is simple
   2. virtio-serial is an optional device, and the guest may
  not have such a device.
 
  Changes from v2 to v3:
  1. correct spelling
 
  Changes from v1 to v2:
  1. split up host and guest-side changes
  2. introduce new request flag to avoid changing return values.
 
 
 
  Hi all:
 
   we need this feature, but we have not decided how to implement it.
   We have two solutions:
   1. use vmcall
   2. use virtio-serial.
  
   I will not change this patch set before we decide how to do it.
   Can we make a decision in the next few days?
 
  Anybody can decide which solution to use?
 
   To make an informed decision we need to have at least a raw idea of how
   the virtio-serial variant will look.
 
  Hmm, I think we can do this:
  1. reset the virtio-serial device or reset the port we use to notify
 the host that the guest is panicked.
  2. write some specific messages to the port
  
  So the port should have a fixed name. If this port is opened by userspace
  before the guest is panicked, I am not sure whether we can use it (because
  a port can only be opened once at a time).
Yes, IMO we should dedicate one virtio-serial port for panic
notifications. Just like we dedicate one for a console.

  We cannot call any function in the module, so we may need to write a simple
  driver for virtio-serial (like diskdump's disk driver).
 
netconsole uses standard NIC drivers in polling mode to send OOPSes
over the network, and it mostly works. So I think using the virtio-serial
driver is not out of the question, but with IRQs disabled of course.

 I do not know how to implement it now. But I guess that it may be complicated.
 
Look at drivers/char/ipmi/ipmi_msghandler.c. It has code to send a panic
event over IPMI. The code is pretty complex. Of course, if we are going to
implement something more complex than a simple hypercall for panic
notification, we had better do something more interesting with it than just
saying "panic happened", like sending stack traces on all cpus, for
instance.

--
Gleb.


Re: [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-03-21 Thread Wen Congyang
At 03/21/2012 05:11 PM, Gleb Natapov Wrote:
 On Wed, Mar 21, 2012 at 08:56:03AM +0800, Wen Congyang wrote:
 At 03/20/2012 11:45 PM, Gleb Natapov Wrote:
 On Tue, Mar 20, 2012 at 05:59:16PM +0800, Wen Congyang wrote:
 At 03/19/2012 03:33 PM, Wen Congyang Wrote:
 At 03/08/2012 03:57 PM, Wen Congyang Wrote:
  We can know the guest is panicked when the guest runs on xen.
  But we do not have such a feature on kvm.

  Another purpose of this feature is: a management app (for example,
  libvirt) can do an auto dump when the guest crashes. If the management
  app does not do an auto dump, the guest's user can dump by hand if
  he sees the guest is panicked.

  I touch the hypervisor instead of using virtio-serial, because
  1. it is simple
  2. virtio-serial is an optional device, and the guest may
 not have such a device.

 Changes from v2 to v3:
 1. correct spelling

 Changes from v1 to v2:
 1. split up host and guest-side changes
 2. introduce new request flag to avoid changing return values.



 Hi all:

  we need this feature, but we have not decided how to implement it.
  We have two solutions:
  1. use vmcall
  2. use virtio-serial.

  I will not change this patch set before we decide how to do it.
  Can we make a decision in the next few days?

 Anybody can decide which solution to use?

  To make an informed decision we need to have at least a raw idea of how
  the virtio-serial variant will look.

  Hmm, I think we can do this:
  1. reset the virtio-serial device or reset the port we use to notify
 the host that the guest is panicked.
  2. write some specific messages to the port

  So the port should have a fixed name. If this port is opened by userspace
  before the guest is panicked, I am not sure whether we can use it (because
  a port can only be opened once at a time).
 Yes, IMO we should dedicate one virtio-serial port for panic
 notifications. Just like we dedicate one for a console.
 
  We cannot call any function in the module, so we may need to write a simple
  driver for virtio-serial (like diskdump's disk driver).

  netconsole uses standard NIC drivers in polling mode to send OOPSes
  over the network, and it mostly works. So I think using the virtio-serial
  driver is not out of the question, but with IRQs disabled of course.

Which file is the netconsole code in?
Another question: we cannot call a module's functions directly from the
kernel.

 
 I do not know how to implement it now. But I guess that it may be complicated.

  Look at drivers/char/ipmi/ipmi_msghandler.c. It has code to send a panic
  event over IPMI. The code is pretty complex. Of course, if we are going to
  implement something more complex than a simple hypercall for panic
  notification, we had better do something more interesting with it than just
  saying "panic happened", like sending stack traces on all cpus, for
  instance.

If we implement it via virtio-serial, I agree with passing a more useful
message to the host.

Thanks
Wen Congyang

 
 --
   Gleb.
 



Re: [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-03-21 Thread Gleb Natapov
On Wed, Mar 21, 2012 at 05:35:49PM +0800, Wen Congyang wrote:
 At 03/21/2012 05:11 PM, Gleb Natapov Wrote:
  On Wed, Mar 21, 2012 at 08:56:03AM +0800, Wen Congyang wrote:
  At 03/20/2012 11:45 PM, Gleb Natapov Wrote:
  On Tue, Mar 20, 2012 at 05:59:16PM +0800, Wen Congyang wrote:
  At 03/19/2012 03:33 PM, Wen Congyang Wrote:
  At 03/08/2012 03:57 PM, Wen Congyang Wrote:
   We can know the guest is panicked when the guest runs on xen.
   But we do not have such a feature on kvm.
  
   Another purpose of this feature is: a management app (for example,
   libvirt) can do an auto dump when the guest crashes. If the management
   app does not do an auto dump, the guest's user can dump by hand if
   he sees the guest is panicked.
  
   I touch the hypervisor instead of using virtio-serial, because
   1. it is simple
   2. virtio-serial is an optional device, and the guest may
  not have such a device.
 
  Changes from v2 to v3:
  1. correct spelling
 
  Changes from v1 to v2:
  1. split up host and guest-side changes
  2. introduce new request flag to avoid changing return values.
 
 
 
  Hi all:
 
   we need this feature, but we have not decided how to implement it.
   We have two solutions:
   1. use vmcall
   2. use virtio-serial.
  
   I will not change this patch set before we decide how to do it.
   Can we make a decision in the next few days?
 
  Anybody can decide which solution to use?
 
   To make an informed decision we need to have at least a raw idea of how
   the virtio-serial variant will look.
 
   Hmm, I think we can do this:
   1. reset the virtio-serial device or reset the port we use to notify
  the host that the guest is panicked.
   2. write some specific messages to the port
  
   So the port should have a fixed name. If this port is opened by userspace
   before the guest is panicked, I am not sure whether we can use it (because
   a port can only be opened once at a time).
  Yes, IMO we should dedicate one virtio-serial port for panic
  notifications. Just like we dedicate one for a console.
  
   We cannot call any function in the module, so we may need to write a simple
   driver for virtio-serial (like diskdump's disk driver).
 
   netconsole uses standard NIC drivers in polling mode to send OOPSes
   over the network, and it mostly works. So I think using the virtio-serial
   driver is not out of the question, but with IRQs disabled of course.
 
 Which file is the netconsole code in?
drivers/net/netconsole.c naturally.

 Another question: we cannot call a module's functions directly from the
 kernel.
True, I think. netconsole and the actual NIC driver have a layer of
abstraction between them. Modules can call functions from other modules.
Your module can register with panic_notifier_list (like IPMI does) and call
functions from virtio-serial. Or panic handling can be added directly
to virtio-serial, since it will have to be modified anyway. Something
like the netpoll API will have to be added to it.
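A minimal sketch of that registration; the function and variable names
here are made up for illustration:

    #include <linux/kernel.h>
    #include <linux/notifier.h>

    /* called on the atomic panic notifier chain */
    static int panic_event(struct notifier_block *nb,
                           unsigned long event, void *ptr)
    {
            /* e.g. poke the dedicated virtio-serial port here */
            return NOTIFY_DONE;
    }

    static struct notifier_block panic_block = {
            .notifier_call = panic_event,
    };

    /* at module init time: */
    atomic_notifier_chain_register(&panic_notifier_list, &panic_block);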

  
   I do not know how to implement it now. But I guess that it may be
   complicated.
  
   Look at drivers/char/ipmi/ipmi_msghandler.c. It has code to send a panic
   event over IPMI. The code is pretty complex. Of course, if we are going to
   implement something more complex than a simple hypercall for panic
   notification, we had better do something more interesting with it than just
   saying "panic happened", like sending stack traces on all cpus, for
   instance.
 
 If we implement it via virtio-serial, I agree with passing a more useful
 message to the host.
 

--
Gleb.


[PATCH RFC V6 1/11] x86/spinlock: replace pv spinlocks with pv ticketlocks

2012-03-21 Thread Raghavendra K T
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com

Rather than outright replacing the entire spinlock implementation in
order to paravirtualize it, keep the ticket lock implementation but add
a couple of pvops hooks on the slow path (long spin on lock, unlocking
a contended lock).

Ticket locks have a number of nice properties, but they also have some
surprising behaviours in virtual environments.  They enforce a strict
FIFO ordering on cpus trying to take a lock; however, if the hypervisor
scheduler does not schedule the cpus in the correct order, the system can
waste a huge amount of time spinning until the next cpu can take the lock.

(See Thomas Friebel's talk "Prevent Guests from Spinning Around",
http://www.xen.org/files/xensummitboston08/LHP.pdf, for more details.)

To address this, we add two hooks:
 - __ticket_spin_lock which is called after the cpu has been
   spinning on the lock for a significant number of iterations but has
   failed to take the lock (presumably because the cpu holding the lock
   has been descheduled).  The lock_spinning pvop is expected to block
   the cpu until it has been kicked by the current lock holder.
 - __ticket_spin_unlock, which, on releasing a contended lock
   (there are more cpus with tail tickets), looks to see if the next
   cpu is blocked and wakes it if so.

When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub
functions causes all the extra code to go away.

Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/include/asm/paravirt.h   |   32 
 arch/x86/include/asm/paravirt_types.h |   10 ++
 arch/x86/include/asm/spinlock.h   |   53 ++--
 arch/x86/include/asm/spinlock_types.h |4 --
 arch/x86/kernel/paravirt-spinlocks.c  |   15 +
 arch/x86/xen/spinlock.c   |8 -
 6 files changed, 61 insertions(+), 61 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a7d2db9..0fa4553 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -750,36 +750,16 @@ static inline void __set_fixmap(unsigned /* enum 
fixed_addresses */ idx,
 
 #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
 
-static inline int arch_spin_is_locked(struct arch_spinlock *lock)
+static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
+   __ticket_t ticket)
 {
-   return PVOP_CALL1(int, pv_lock_ops.spin_is_locked, lock);
+   PVOP_VCALL2(pv_lock_ops.lock_spinning, lock, ticket);
 }
 
-static inline int arch_spin_is_contended(struct arch_spinlock *lock)
+static __always_inline void ticket_unlock_kick(struct arch_spinlock *lock,
+   __ticket_t ticket)
 {
-   return PVOP_CALL1(int, pv_lock_ops.spin_is_contended, lock);
-}
-#define arch_spin_is_contended arch_spin_is_contended
-
-static __always_inline void arch_spin_lock(struct arch_spinlock *lock)
-{
-   PVOP_VCALL1(pv_lock_ops.spin_lock, lock);
-}
-
-static __always_inline void arch_spin_lock_flags(struct arch_spinlock *lock,
- unsigned long flags)
-{
-   PVOP_VCALL2(pv_lock_ops.spin_lock_flags, lock, flags);
-}
-
-static __always_inline int arch_spin_trylock(struct arch_spinlock *lock)
-{
-   return PVOP_CALL1(int, pv_lock_ops.spin_trylock, lock);
-}
-
-static __always_inline void arch_spin_unlock(struct arch_spinlock *lock)
-{
-   PVOP_VCALL1(pv_lock_ops.spin_unlock, lock);
+   PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
 
 #endif
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 8e8b9a4..005e24d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -327,13 +327,11 @@ struct pv_mmu_ops {
 };
 
 struct arch_spinlock;
+#include <asm/spinlock_types.h>
+
 struct pv_lock_ops {
-   int (*spin_is_locked)(struct arch_spinlock *lock);
-   int (*spin_is_contended)(struct arch_spinlock *lock);
-   void (*spin_lock)(struct arch_spinlock *lock);
-   void (*spin_lock_flags)(struct arch_spinlock *lock, unsigned long 
flags);
-   int (*spin_trylock)(struct arch_spinlock *lock);
-   void (*spin_unlock)(struct arch_spinlock *lock);
+   void (*lock_spinning)(struct arch_spinlock *lock, __ticket_t ticket);
+   void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index a82c2bf..7e66b85 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -37,6 +37,35 @@
 # define UNLOCK_LOCK_PREFIX
 #endif
 
+/* How 

[PATCH RFC V6 2/11] x86/ticketlock: don't inline _spin_unlock when using paravirt spinlocks

2012-03-21 Thread Raghavendra K T
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com

The code size expands somewhat, and it's probably better to just call
a function rather than inline it.

Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/Kconfig |3 +++
 kernel/Kconfig.locks |2 +-
 2 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5bed94e..10c28ec 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -623,6 +623,9 @@ config PARAVIRT_SPINLOCKS
 
  If you are unsure how to answer this question, answer N.
 
+config ARCH_NOINLINE_SPIN_UNLOCK
+   def_bool PARAVIRT_SPINLOCKS
+
 config PARAVIRT_CLOCK
bool
 
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 5068e2a..584637b 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -125,7 +125,7 @@ config INLINE_SPIN_LOCK_IRQSAVE
 ARCH_INLINE_SPIN_LOCK_IRQSAVE
 
 config INLINE_SPIN_UNLOCK
-	def_bool !DEBUG_SPINLOCK && (!PREEMPT || ARCH_INLINE_SPIN_UNLOCK)
+	def_bool !DEBUG_SPINLOCK && (!PREEMPT || ARCH_INLINE_SPIN_UNLOCK) && !ARCH_NOINLINE_SPIN_UNLOCK
 
 config INLINE_SPIN_UNLOCK_BH
	def_bool !DEBUG_SPINLOCK && ARCH_INLINE_SPIN_UNLOCK_BH



[PATCH RFC V6 6/11] xen/pvticketlocks: add xen_nopvspin parameter to disable xen pv ticketlocks

2012-03-21 Thread Raghavendra K T
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com

Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/xen/spinlock.c |   14 ++
 1 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 0e4d057..5dce49d 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -223,12 +223,26 @@ void xen_uninit_lock_cpu(int cpu)
unbind_from_irqhandler(per_cpu(lock_kicker_irq, cpu), NULL);
 }
 
+static bool xen_pvspin __initdata = true;
+
 void __init xen_init_spinlocks(void)
 {
+   if (!xen_pvspin) {
+   printk(KERN_DEBUG "xen: PV spinlocks disabled\n");
+   return;
+   }
+
pv_lock_ops.lock_spinning = xen_lock_spinning;
pv_lock_ops.unlock_kick = xen_unlock_kick;
 }
 
+static __init int xen_parse_nopvspin(char *arg)
+{
+   xen_pvspin = false;
+   return 0;
+}
+early_param("xen_nopvspin", xen_parse_nopvspin);
+
 #ifdef CONFIG_XEN_DEBUG_FS
 
 static struct dentry *d_spin_debug;



[PATCH RFC V6 10/11] xen/pvticketlock: allow interrupts to be enabled while blocking

2012-03-21 Thread Raghavendra K T
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com

If interrupts were enabled when taking the spinlock, we can leave them
enabled while blocking to get the lock.

If we can enable interrupts while waiting for the lock to become
available, and we take an interrupt before entering the poll,
and the handler takes a spinlock which ends up going into
the slow state (invalidating the per-cpu lock and want values),
then when the interrupt handler returns the event channel will
remain pending so the poll will return immediately, causing it to
return out to the main spinlock loop.

Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/xen/spinlock.c |   46 --
 1 files changed, 40 insertions(+), 6 deletions(-)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d4a9ec2..4926974 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -142,7 +142,20 @@ static void xen_lock_spinning(struct arch_spinlock *lock, 
__ticket_t want)
 * partially setup state.
 */
local_irq_save(flags);
-
+   /*
+* We don't really care if we're overwriting some other
+* (lock,want) pair, as that would mean that we're currently
+* in an interrupt context, and the outer context had
+* interrupts enabled.  That has already kicked the VCPU out
+* of xen_poll_irq(), so it will just return spuriously and
+* retry with newly setup (lock,want).
+*
+* The ordering protocol on this is that the lock pointer
+* may only be set non-NULL if the want ticket is correct.
+* If we're updating want, we must first clear lock.
+*/
+   w->lock = NULL;
+   smp_wmb();
	w->want = want;
	smp_wmb();
	w->lock = lock;
@@ -157,24 +170,43 @@ static void xen_lock_spinning(struct arch_spinlock *lock, 
__ticket_t want)
/* Only check lock once pending cleared */
barrier();
 
-   /* Mark entry to slowpath before doing the pickup test to make
-  sure we don't deadlock with an unlocker. */
+   /*
+* Mark entry to slowpath before doing the pickup test to make
+* sure we don't deadlock with an unlocker.
+*/
__ticket_enter_slowpath(lock);
 
-   /* check again make sure it didn't become free while
-  we weren't looking  */
+   /*
+* check again make sure it didn't become free while
+* we weren't looking
+*/
	if (ACCESS_ONCE(lock->tickets.head) == want) {
add_stats(TAKEN_SLOW_PICKUP, 1);
goto out;
}
+
+   /* Allow interrupts while blocked */
+   local_irq_restore(flags);
+
+   /*
+* If an interrupt happens here, it will leave the wakeup irq
+* pending, which will cause xen_poll_irq() to return
+* immediately.
+*/
+
/* Block until irq becomes pending (or perhaps a spurious wakeup) */
xen_poll_irq(irq);
add_stats(TAKEN_SLOW_SPURIOUS, !xen_test_irq_pending(irq));
+
+   local_irq_save(flags);
+
kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq));
 out:
cpumask_clear_cpu(cpu, waiting_cpus);
	w->lock = NULL;
+
local_irq_restore(flags);
+
spin_time_accum_blocked(start);
 }
 PV_CALLEE_SAVE_REGS_THUNK(xen_lock_spinning);
@@ -188,7 +220,9 @@ static void xen_unlock_kick(struct arch_spinlock *lock, 
__ticket_t next)
for_each_cpu(cpu, waiting_cpus) {
const struct xen_lock_waiting *w = per_cpu(lock_waiting, cpu);
 
-   if (w->lock == lock && w->want == next) {
+   /* Make sure we read lock before want */
+   if (ACCESS_ONCE(w->lock) == lock &&
+   ACCESS_ONCE(w->want) == next) {
add_stats(RELEASED_SLOW_KICKED, 1);
xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
break;



[PATCH RFC V6 11/11] xen: enable PV ticketlocks on HVM Xen

2012-03-21 Thread Raghavendra K T
From: Stefano Stabellini stefano.stabell...@eu.citrix.com

Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/xen/smp.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index e85d3ee..8dc2574 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -568,4 +568,5 @@ void __init xen_hvm_smp_init(void)
smp_ops.cpu_die = xen_hvm_cpu_die;
smp_ops.send_call_func_ipi = xen_smp_send_call_function_ipi;
smp_ops.send_call_func_single_ipi = 
xen_smp_send_call_function_single_ipi;
+   xen_init_spinlocks();
 }



[PATCH RFC V6 3/11] x86/ticketlock: collapse a layer of functions

2012-03-21 Thread Raghavendra K T
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com

Now that the paravirtualization layer doesn't exist at the spinlock
level any more, we can collapse the __ticket_ functions into the arch_
functions.

Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/include/asm/spinlock.h |   35 +--
 1 files changed, 5 insertions(+), 30 deletions(-)
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 7e66b85..f6442f4 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -79,7 +79,7 @@ static __always_inline void __ticket_unlock_kick(struct 
arch_spinlock *lock,
  * in the high part, because a wide xadd increment of the low part would carry
  * up and contaminate the high part.
  */
-static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock)
+static __always_inline void arch_spin_lock(struct arch_spinlock *lock)
 {
register struct __raw_tickets inc = { .tail = 1 };
 
@@ -99,7 +99,7 @@ static __always_inline void __ticket_spin_lock(struct 
arch_spinlock *lock)
 out:   barrier();  /* make sure nothing creeps before the lock is taken */
 }
 
-static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
+static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
 {
arch_spinlock_t old, new;
 
@@ -113,7 +113,7 @@ static __always_inline int 
__ticket_spin_trylock(arch_spinlock_t *lock)
	return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
old.head_tail;
 }
 
-static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
+static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
	__ticket_t next = lock->tickets.head + 1;
 
@@ -121,46 +121,21 @@ static __always_inline void 
__ticket_spin_unlock(arch_spinlock_t *lock)
__ticket_unlock_kick(lock, next);
 }
 
-static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
+static inline int arch_spin_is_locked(arch_spinlock_t *lock)
 {
	struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
 
return !!(tmp.tail ^ tmp.head);
 }
 
-static inline int __ticket_spin_is_contended(arch_spinlock_t *lock)
+static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 {
	struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
 
	return ((tmp.tail - tmp.head) & TICKET_MASK) > 1;
 }
-
-static inline int arch_spin_is_locked(arch_spinlock_t *lock)
-{
-   return __ticket_spin_is_locked(lock);
-}
-
-static inline int arch_spin_is_contended(arch_spinlock_t *lock)
-{
-   return __ticket_spin_is_contended(lock);
-}
 #define arch_spin_is_contended arch_spin_is_contended
 
-static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
-{
-   __ticket_spin_lock(lock);
-}
-
-static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
-{
-   return __ticket_spin_trylock(lock);
-}
-
-static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
-{
-   __ticket_spin_unlock(lock);
-}
-
 static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
  unsigned long flags)
 {



[PATCH RFC V6 9/11] x86/ticketlock: add slowpath logic

2012-03-21 Thread Raghavendra K T
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com

Maintain a flag in the LSB of the ticket lock tail which indicates
whether anyone is in the lock slowpath and may need kicking when
the current holder unlocks.  The flags are set when the first locker
enters the slowpath, and cleared when unlocking to an empty queue (ie,
no contention).

In the specific implementation of lock_spinning(), make sure to set
the slowpath flags on the lock just before blocking.  We must do
this before the last-chance pickup test to prevent a deadlock
with the unlocker:

Unlocker                        Locker
                                test for lock pickup
                                    -> fail
unlock
test slowpath
    -> false
                                set slowpath flags
                                block

Whereas this works in any ordering:

Unlocker                        Locker
                                set slowpath flags
                                test for lock pickup
                                    -> fail
                                block
unlock
test slowpath
    -> true, kick

If the unlocker finds that the lock has the slowpath flag set but it is
actually uncontended (ie, head == tail, so nobody is waiting), then it
clears the slowpath flag.

The unlock code uses a locked add to update the head counter.  This also
acts as a full memory barrier so that its safe to subsequently
read back the slowflag state, knowing that the updated lock is visible
to the other CPUs.  If it were an unlocked add, then the flag read may
just be forwarded from the store buffer before it was visible to the other
CPUs, which could result in a deadlock.

Unfortunately this means we need to do a locked instruction when
unlocking with PV ticketlocks.  However, if PV ticketlocks are not
enabled, then the old non-locked add is the only unlocking code.

Note: this code relies on gcc making sure that unlikely() code is out of
line of the fastpath, which only happens when OPTIMIZE_SIZE=n.  If it
doesn't, the generated code isn't too bad, but it's definitely suboptimal.

Thanks to Srivatsa Vaddagiri for providing a bugfix to the original
version of this change, which has been folded in.
Thanks to Stephan Diestelhorst for commenting on some code which relied
on an inaccurate reading of the x86 memory ordering rules.

Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
Signed-off-by: Srivatsa Vaddagiri va...@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Cc: Stephan Diestelhorst stephan.diestelho...@amd.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/include/asm/paravirt.h   |2 +-
 arch/x86/include/asm/spinlock.h   |   86 +++-
 arch/x86/include/asm/spinlock_types.h |2 +
 arch/x86/kernel/paravirt-spinlocks.c  |3 +
 arch/x86/xen/spinlock.c   |6 ++
 5 files changed, 74 insertions(+), 25 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4343419..f2f1700 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -756,7 +756,7 @@ static __always_inline void __ticket_lock_spinning(struct 
arch_spinlock *lock,
PVOP_VCALLEE2(pv_lock_ops.lock_spinning, lock, ticket);
 }
 
-static __always_inline void ticket_unlock_kick(struct arch_spinlock *lock,
+static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
__ticket_t ticket)
 {
PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index c14263a..e98cfa8 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -1,11 +1,14 @@
 #ifndef _ASM_X86_SPINLOCK_H
 #define _ASM_X86_SPINLOCK_H
 
+#include <linux/jump_label.h>
 #include <linux/atomic.h>
 #include <asm/page.h>
 #include <asm/processor.h>
 #include <linux/compiler.h>
 #include <asm/paravirt.h>
+#include <asm/bitops.h>
+
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
  *
@@ -40,32 +43,28 @@
 /* How long a lock should spin before we consider blocking */
 #define SPIN_THRESHOLD	(1 << 11)
 
-#ifndef CONFIG_PARAVIRT_SPINLOCKS
+extern struct jump_label_key paravirt_ticketlocks_enabled;
+static __always_inline bool static_branch(struct jump_label_key *key);
 
-static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
-   __ticket_t ticket)
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
 {
+   set_bit(0, (volatile unsigned long *)&lock->tickets.tail);
 }
 
-static __always_inline void ticket_unlock_kick(struct arch_spinlock *lock,
-__ticket_t ticket)
+#else  /* !CONFIG_PARAVIRT_SPINLOCKS */
+static __always_inline 

[PATCH RFC V6 7/11] x86/pvticketlock: use callee-save for lock_spinning

2012-03-21 Thread Raghavendra K T
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com

Although the lock_spinning calls in the spinlock code are on the
uncommon path, their presence can cause the compiler to generate many
more register save/restores in the function pre/postamble, which is in
the fast path.  To avoid this, convert it to using the pvops callee-save
calling convention, which defers all the save/restores until the actual
function is called, keeping the fastpath clean.

Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/include/asm/paravirt.h   |2 +-
 arch/x86/include/asm/paravirt_types.h |2 +-
 arch/x86/kernel/paravirt-spinlocks.c  |2 +-
 arch/x86/xen/spinlock.c   |3 ++-
 4 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 0fa4553..4343419 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -753,7 +753,7 @@ static inline void __set_fixmap(unsigned /* enum 
fixed_addresses */ idx,
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
__ticket_t ticket)
 {
-   PVOP_VCALL2(pv_lock_ops.lock_spinning, lock, ticket);
+   PVOP_VCALLEE2(pv_lock_ops.lock_spinning, lock, ticket);
 }
 
 static __always_inline void ticket_unlock_kick(struct arch_spinlock *lock,
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 005e24d..5e0c138 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -330,7 +330,7 @@ struct arch_spinlock;
 #include asm/spinlock_types.h
 
 struct pv_lock_ops {
-   void (*lock_spinning)(struct arch_spinlock *lock, __ticket_t ticket);
+   struct paravirt_callee_save lock_spinning;
void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
 };
 
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index c2e010e..4251c1d 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -9,7 +9,7 @@
 
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
-   .lock_spinning = paravirt_nop,
+   .lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
.unlock_kick = paravirt_nop,
 #endif
 };
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 5dce49d..176a554 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -173,6 +173,7 @@ out:
local_irq_restore(flags);
spin_time_accum_blocked(start);
 }
+PV_CALLEE_SAVE_REGS_THUNK(xen_lock_spinning);
 
 static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
 {
@@ -232,7 +233,7 @@ void __init xen_init_spinlocks(void)
return;
}
 
-   pv_lock_ops.lock_spinning = xen_lock_spinning;
+   pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning);
pv_lock_ops.unlock_kick = xen_unlock_kick;
 }
 



[PATCH RFC V6 8/11] x86/pvticketlock: when paravirtualizing ticket locks, increment by 2

2012-03-21 Thread Raghavendra K T
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com

Increment ticket head/tails by 2 rather than 1 to leave the LSB free
to store an "is in slowpath state" bit.  This halves the number
of possible CPUs for a given ticket size, but this shouldn't matter
in practice - kernels built for 32k+ CPU systems are probably
specially built for the hardware rather than a generic distro
kernel.

Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/include/asm/spinlock.h   |   10 +-
 arch/x86/include/asm/spinlock_types.h |   10 +-
 2 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index f6442f4..c14263a 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -81,7 +81,7 @@ static __always_inline void __ticket_unlock_kick(struct 
arch_spinlock *lock,
  */
 static __always_inline void arch_spin_lock(struct arch_spinlock *lock)
 {
-   register struct __raw_tickets inc = { .tail = 1 };
+   register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
 
	inc = xadd(&lock->tickets, inc);
 
@@ -107,7 +107,7 @@ static __always_inline int 
arch_spin_trylock(arch_spinlock_t *lock)
if (old.tickets.head != old.tickets.tail)
return 0;
 
-   new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
+   new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
 
/* cmpxchg is a full barrier, so nothing can move before it */
	return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
old.head_tail;
@@ -115,9 +115,9 @@ static __always_inline int 
arch_spin_trylock(arch_spinlock_t *lock)
 
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
-   __ticket_t next = lock->tickets.head + 1;
+   __ticket_t next = lock->tickets.head + TICKET_LOCK_INC;
 
-   __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
+   __add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
__ticket_unlock_kick(lock, next);
 }
 
@@ -132,7 +132,7 @@ static inline int arch_spin_is_contended(arch_spinlock_t 
*lock)
 {
	struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
 
-   return ((tmp.tail - tmp.head) & TICKET_MASK) > 1;
+   return ((tmp.tail - tmp.head) & TICKET_MASK) > TICKET_LOCK_INC;
 }
 #define arch_spin_is_contended arch_spin_is_contended
 
diff --git a/arch/x86/include/asm/spinlock_types.h 
b/arch/x86/include/asm/spinlock_types.h
index dbe223d..aa9a205 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -3,7 +3,13 @@
 
 #include linux/types.h
 
-#if (CONFIG_NR_CPUS < 256)
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define __TICKET_LOCK_INC  2
+#else
+#define __TICKET_LOCK_INC  1
+#endif
+
+#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
 typedef u8  __ticket_t;
 typedef u16 __ticketpair_t;
 #else
@@ -11,6 +17,8 @@ typedef u16 __ticket_t;
 typedef u32 __ticketpair_t;
 #endif
 
+#define TICKET_LOCK_INC	((__ticket_t)__TICKET_LOCK_INC)
 
 #define TICKET_SHIFT	(sizeof(__ticket_t) * 8)
 #define TICKET_MASK	((__ticket_t)((1 << TICKET_SHIFT) - 1))
 



Re: performance trouble

2012-03-21 Thread David Cure
hello,

On Tue, Mar 20, 2012 at 02:38:22PM +0200, Gleb Natapov wrote:
 
 Try to add <feature policy='disable' name='hypervisor'/> to the <cpu>
 definition in XML and check the command line.

	ok, I tried this, but I can't use a <cpu> model that maps the host CPU
(my libvirt is 0.9.8), so I use:

  <cpu match='exact'>
    <model>Opteron_G3</model>
    <feature policy='disable' name='hypervisor'/>
  </cpu>

(the physical server uses an Opteron CPU).
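
(If the feature change took effect, the generated qemu command line should
show the flag being masked out, presumably something like
-cpu Opteron_G3,-hypervisor.)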

The log is here :
http://www.roullier.net/Report/report-3.2-vhost-net-1vcpu-cpu.txt.gz

	And now, with only 1 vcpu, the response time is 8.5s, a great
improvement. We keep this configuration for production: we will check the
response time when some other users are connected.

David.


[PATCH RFC V6 5/11] xen/pvticketlock: Xen implementation for PV ticket locks

2012-03-21 Thread Raghavendra K T
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com

Replace the old Xen implementation of PV spinlocks with an implementation
of xen_lock_spinning and xen_unlock_kick.

xen_lock_spinning simply registers the cpu in its entry in lock_waiting,
adds itself to the waiting_cpus set, and blocks on an event channel
until the channel becomes pending.

xen_unlock_kick searches the cpus in waiting_cpus looking for the one
which next wants this lock with the next ticket, if any.  If found,
it kicks it by making its event channel pending, which wakes it up.

We need to make sure interrupts are disabled while we're relying on the
contents of the per-cpu lock_waiting values, otherwise an interrupt
handler could come in, try to take some other lock, block, and overwrite
our values.

Raghu: use function + enum instead of macro, cmpxchg for zero status reset

Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
---
 arch/x86/xen/spinlock.c |  344 +++
 1 files changed, 77 insertions(+), 267 deletions(-)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 1258514..0e4d057 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -16,45 +16,46 @@
 #include "xen-ops.h"
 #include "debugfs.h"
 
+enum xen_contention_stat {
+   TAKEN_SLOW,
+   TAKEN_SLOW_PICKUP,
+   TAKEN_SLOW_SPURIOUS,
+   RELEASED_SLOW,
+   RELEASED_SLOW_KICKED,
+   NR_CONTENTION_STATS
+};
+
+
 #ifdef CONFIG_XEN_DEBUG_FS
 static struct xen_spinlock_stats
 {
-   u64 taken;
-   u32 taken_slow;
-   u32 taken_slow_nested;
-   u32 taken_slow_pickup;
-   u32 taken_slow_spurious;
-   u32 taken_slow_irqenable;
-
-   u64 released;
-   u32 released_slow;
-   u32 released_slow_kicked;
+   u32 contention_stats[NR_CONTENTION_STATS];
 
 #define HISTO_BUCKETS  30
-   u32 histo_spin_total[HISTO_BUCKETS+1];
-   u32 histo_spin_spinning[HISTO_BUCKETS+1];
u32 histo_spin_blocked[HISTO_BUCKETS+1];
 
-   u64 time_total;
-   u64 time_spinning;
u64 time_blocked;
 } spinlock_stats;
 
 static u8 zero_stats;
 
-static unsigned lock_timeout = 1 << 10;
-#define TIMEOUT lock_timeout
-
 static inline void check_zero(void)
 {
-   if (unlikely(zero_stats)) {
-   memset(spinlock_stats, 0, sizeof(spinlock_stats));
-   zero_stats = 0;
+   u8 ret;
+   u8 old = ACCESS_ONCE(zero_stats);
+   if (unlikely(old)) {
+   ret = cmpxchg(&zero_stats, old, 0);
+   /* This ensures only one fellow resets the stat */
+   if (ret == old)
+   memset(spinlock_stats, 0, sizeof(spinlock_stats));
}
 }
 
-#define ADD_STATS(elem, val)   \
-   do { check_zero(); spinlock_stats.elem += (val); } while(0)
+static inline void add_stats(enum xen_contention_stat var, u32 val)
+{
+   check_zero();
+   spinlock_stats.contention_stats[var] += val;
+}
 
 static inline u64 spin_time_start(void)
 {
@@ -73,22 +74,6 @@ static void __spin_time_accum(u64 delta, u32 *array)
array[HISTO_BUCKETS]++;
 }
 
-static inline void spin_time_accum_spinning(u64 start)
-{
-   u32 delta = xen_clocksource_read() - start;
-
-   __spin_time_accum(delta, spinlock_stats.histo_spin_spinning);
-   spinlock_stats.time_spinning += delta;
-}
-
-static inline void spin_time_accum_total(u64 start)
-{
-   u32 delta = xen_clocksource_read() - start;
-
-   __spin_time_accum(delta, spinlock_stats.histo_spin_total);
-   spinlock_stats.time_total += delta;
-}
-
 static inline void spin_time_accum_blocked(u64 start)
 {
u32 delta = xen_clocksource_read() - start;
@@ -98,19 +83,15 @@ static inline void spin_time_accum_blocked(u64 start)
 }
 #else  /* !CONFIG_XEN_DEBUG_FS */
 #define TIMEOUT	(1 << 10)
-#define ADD_STATS(elem, val)   do { (void)(val); } while(0)
+static inline void add_stats(enum xen_contention_stat var, u32 val)
+{
+}
 
 static inline u64 spin_time_start(void)
 {
return 0;
 }
 
-static inline void spin_time_accum_total(u64 start)
-{
-}
-static inline void spin_time_accum_spinning(u64 start)
-{
-}
 static inline void spin_time_accum_blocked(u64 start)
 {
 }
@@ -133,230 +114,83 @@ typedef u16 xen_spinners_t;
 asm(LOCK_PREFIX " decw %0" : "+m" ((xl)->spinners) : : "memory");
 #endif
 
-struct xen_spinlock {
-   unsigned char lock; /* 0 -> free; 1 -> locked */
-   xen_spinners_t spinners;/* count of waiting cpus */
+struct xen_lock_waiting {
+   struct arch_spinlock *lock;
+   __ticket_t want;
 };
 
 static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
+static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
+static cpumask_t waiting_cpus;
 
-#if 0
-static int xen_spin_is_locked(struct arch_spinlock *lock)
+static void 

[PATCH RFC V6 0/11] Paravirtualized ticketlocks

2012-03-21 Thread Raghavendra K T
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com

Changes since last posting: (Raghavendra K T)
[
 - Rebased to linux-3.3-rc6.
 - used function+enum in place of macro (better type checking) 
 - use cmpxchg while resetting zero status for possible race
[suggested by Dave Hansen for KVM patches ]
]

This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism.

Ticket locks have an inherent problem in a virtualized case, because
the vCPUs are scheduled rather than running concurrently (ignoring
gang scheduled vCPUs).  This can result in catastrophic performance
collapses when the vCPU scheduler doesn't schedule the correct next
vCPU, and ends up scheduling a vCPU which burns its entire timeslice
spinning.  (Note that this is not the same problem as lock-holder
preemption, which this series also addresses; that's also a problem,
but not catastrophic).

(See Thomas Friebel's talk "Prevent Guests from Spinning Around",
http://www.xen.org/files/xensummitboston08/LHP.pdf, for more details.)

Currently we deal with this by having PV spinlocks, which adds a layer
of indirection in front of all the spinlock functions, and defining a
completely new implementation for Xen (and for other pvops users, but
there are none at present).

PV ticketlocks keeps the existing ticketlock implementation
(fastpath) as-is, but adds a couple of pvops for the slow paths:

- If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
  iterations, then call out to the __ticket_lock_spinning() pvop,
  which allows a backend to block the vCPU rather than spinning.  This
  pvop can set the lock into slowpath state.

- When releasing a lock, if it is in slowpath state, then call
  __ticket_unlock_kick() to kick the next vCPU in line awake.  If the
  lock is no longer in contention, it also clears the slowpath flag.

The slowpath state is stored in the LSB of the lock tail
ticket.  This has the effect of reducing the max number of CPUs by
half (so, a small ticket can deal with 128 CPUs, and large ticket
32768).

This series provides a Xen implementation, but it should be
straightforward to add a KVM implementation as well.

Overall, it results in a large reduction in code, it makes the native
and virtualized cases closer, and it removes a layer of indirection
around all the spinlock functions.

The fast path (taking an uncontended lock which isn't in slowpath
state) is optimal, identical to the non-paravirtualized case.

The inner part of ticket lock code becomes:
	inc = xadd(&lock->tickets, inc);
	inc.tail &= ~TICKET_SLOWPATH_FLAG;

	if (likely(inc.head == inc.tail))
		goto out;
	for (;;) {
		unsigned count = SPIN_THRESHOLD;
		do {
			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
				goto out;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}
out:	barrier();
which results in:
	push   %rbp
	mov    %rsp,%rbp

	mov    $0x200,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f	# Slowpath if lock in contention

	pop    %rbp
	retq

### SLOWPATH START
1:	and    $-2,%edx
	movzbl %dl,%esi

2:	mov    $0x800,%eax
	jmp    4f

3:	pause
	sub    $0x1,%eax
	je     5f

4:	movzbl (%rdi),%ecx
	cmp    %cl,%dl
	jne    3b

	pop    %rbp
	retq

5:	callq  *__ticket_lock_spinning
	jmp    2b
### SLOWPATH END

with CONFIG_PARAVIRT_SPINLOCKS=n, the code has changed slightly, where
the fastpath case is straight through (taking the lock without
contention), and the spin loop is out of line:

	push   %rbp
	mov    %rsp,%rbp

	mov    $0x100,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f

	pop    %rbp
	retq

### SLOWPATH START
1:	pause
	movzbl (%rdi),%eax
	cmp    %dl,%al
	jne    1b

	pop    %rbp
	retq
### SLOWPATH END

The unlock code is complicated by the need to both add to the lock's
head and fetch the slowpath flag from tail.  This version of the
patch uses a locked add to do this, followed by a test to see if the
slowflag is set.  The lock prefix acts as a full memory barrier, so we
can be sure that other CPUs will have seen the unlock before we read
the flag (without the barrier the read could be fetched from the
store queue before it hits memory, which could result in a deadlock).

This is all unnecessary complication if you're not using PV ticket
locks, so it also uses the jump-label machinery to use the standard
add-based unlock in the non-PV case.

if (TICKET_SLOWPATH_FLAG 

[PATCH RFC V6 4/11] xen: defer spinlock setup until boot CPU setup

2012-03-21 Thread Raghavendra K T
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com

There's no need to do it at very early init, and doing it there
makes it impossible to use the jump_label machinery.

Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/xen/smp.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 501d4e0..e85d3ee 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -201,6 +201,7 @@ static void __init xen_smp_prepare_boot_cpu(void)
 
xen_filter_cpu_maps();
xen_setup_vcpu_info_placement();
+   xen_init_spinlocks();
 }
 
 static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
@@ -530,7 +531,6 @@ void __init xen_smp_init(void)
 {
smp_ops = xen_smp_ops;
xen_fill_possible_map();
-   xen_init_spinlocks();
 }
 
 static void __init xen_hvm_smp_prepare_cpus(unsigned int max_cpus)



[PATCH uq/master] kvm: Drop unused kvm_pit_in_kernel

2012-03-21 Thread Jan Kiszka
This is now implied by kvm_irqchip_in_kernel.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 kvm-all.c  |6 --
 kvm-stub.c |6 --
 kvm.h  |2 --
 3 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 21c7dd2..7f8c188 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -75,7 +75,6 @@ struct KVMState
 #ifdef KVM_CAP_SET_GUEST_DEBUG
 struct kvm_sw_breakpoint_head kvm_sw_breakpoints;
 #endif
-int pit_in_kernel;
 int pit_state2;
 int xsave, xcrs;
 int many_ioeventfds;
@@ -195,11 +194,6 @@ static void kvm_reset_vcpu(void *opaque)
 kvm_arch_reset_vcpu(env);
 }
 
-int kvm_pit_in_kernel(void)
-{
-return kvm_state->pit_in_kernel;
-}
-
 int kvm_init_vcpu(CPUState *env)
 {
 KVMState *s = kvm_state;
diff --git a/kvm-stub.c b/kvm-stub.c
index 1f1c686..fab3ab1 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -16,12 +16,6 @@
 #include gdbstub.h
 #include kvm.h
 
-int kvm_pit_in_kernel(void)
-{
-return 0;
-}
-
-
 int kvm_init_vcpu(CPUState *env)
 {
 return -ENOSYS;
diff --git a/kvm.h b/kvm.h
index 8ef4476..128cc8f 100644
--- a/kvm.h
+++ b/kvm.h
@@ -83,8 +83,6 @@ int kvm_update_guest_debug(CPUState *env, unsigned long reinject_trap);
 int kvm_set_signal_mask(CPUState *env, const sigset_t *sigset);
 #endif
 
-int kvm_pit_in_kernel(void);
-
 int kvm_on_sigbus_vcpu(CPUState *env, int code, void *addr);
 int kvm_on_sigbus(int code, void *addr);
 
-- 
1.7.3.4


Re: [PATCH RFC V6 1/11] x86/spinlock: replace pv spinlocks with pv ticketlocks

2012-03-21 Thread Attilio Rao

On 21/03/12 10:20, Raghavendra K T wrote:

From: Jeremy Fitzhardinge <jeremy.fitzhardi...@citrix.com>

Rather than outright replacing the entire spinlock implementation in
order to paravirtualize it, keep the ticket lock implementation but add
a couple of pvops hooks on the slow patch (long spin on lock, unlocking
a contended lock).

Ticket locks have a number of nice properties, but they also have some
surprising behaviours in virtual environments.  They enforce a strict
FIFO ordering on cpus trying to take a lock; however, if the hypervisor
scheduler does not schedule the cpus in the correct order, the system can
waste a huge amount of time spinning until the next cpu can take the lock.

(See Thomas Friebel's talk Prevent Guests from Spinning Around
http://www.xen.org/files/xensummitboston08/LHP.pdf  for more details.)

To address this, we add two hooks:
  - __ticket_spin_lock which is called after the cpu has been
spinning on the lock for a significant number of iterations but has
failed to take the lock (presumably because the cpu holding the lock
has been descheduled).  The lock_spinning pvop is expected to block
the cpu until it has been kicked by the current lock holder.
  - __ticket_spin_unlock, which on releasing a contended lock
(there are more cpus with tail tickets), it looks to see if the next
cpu is blocked and wakes it if so.

When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub
functions causes all the extra code to go away.
   


I've made some real world benchmarks based on this series of patches 
applied on top of a vanilla Linux-3.3-rc6 (commit 
4704fe65e55fb088fbcb1dc0b15ff7cc8bff3685), with both 
CONFIG_PARAVIRT_SPINLOCK=y and n, which means essentially 4 versions 
compared:

* vanilla - CONFIG_PARAVIRT_SPINLOCK - patch
* vanilla + CONFIG_PARAVIRT_SPINLOCK - patch
* vanilla - CONFIG_PARAVIRT_SPINLOCK + patch
* vanilla + CONFIG_PARAVIRT_SPINLOCK + patch

(you can check out the monolithic kernel configuration I used, and 
verify the sole difference, here):

http://xenbits.xen.org/people/attilio/jeremy-spinlock/kernel-configs/

Tests, information and results are summarized below.

== System used information:
* Machine is a Xeon X3450, 2.6GHz, 8-way system:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/dmesg
* System version, a Debian Squeeze 6.0.4:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/debian-version
* gcc version, 4.4.5:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/gcc-version

== Tests performed
* pgbench based on PostgreSQL 9.2 (development version) as it has a lot 
of scalability improvements in it:

http://www.postgresql.org/docs/devel/static/install-getsource.html

I used a stock installation, with only this simple configuration change:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/postsgresql.conf.patch

For collecting data I used this simple script, which runs the test 10 
times, each time with a different number of threads (from 1 to 64). Please 
note that the first 8 runs cache all the data in memory in order to 
avoid subsequent I/O, thus they are discarded in sampling and calculation:

http://xenbits.xen.org/people/attilio/jeremy-spinlock/pgbench_script

Here is the raw data (please note this is tps, thus the higher the 
better):

http://xenbits.xen.org/people/attilio/jeremy-spinlock/pgbench-crude-datas/

And here are the data charted with the ministat tool, comparing all 4 
kernel configurations for every thread count:

http://xenbits.xen.org/people/attilio/jeremy-spinlock/pgbench-9.2-total.bench

As you can see, the patch doesn't really show a statistically meaningful 
difference for this workload, excluding the single-thread run for the 
patched + CONFIG_PARAVIRT_SPINLOCK=y case, which seems 5% faster.



* pbzip2, a parallel version of bzip2, intended to represent a 
CPU-intensive, multithreaded application.
The file chosen for compression is 1GB in size, generated from /dev/urandom 
(this is not published but I may have it, so if you need it for more 
tests please just ask), and all the I/O is done on a tmpfs volume in 
order to avoid I/O jitter.


For collecting data I used this simple script, which runs the test 10 
times, each time with a different number of threads (from 1 to 64):

http://xenbits.xen.org/people/attilio/jeremy-spinlock/pbzip2bench_script

Here is the raw data (please note this is time(1) output, thus the 
lower the better):

http://xenbits.xen.org/people/attilio/jeremy-spinlock/pbzip2-crude-datas/

And here are the data charted with the ministat tool, comparing all 4 
kernel configurations for every thread count:

http://xenbits.xen.org/people/attilio/jeremy-spinlock/pbzip2-1.1.1-total.bench

As you can see, the patch doesn't really show a statistically meaningful 
difference for this workload.



* kernbench-0.50 run, doing I/O on a 10GB tmpfs volume (thus no actual 
I/O involved), with the following invocation:

./kernbench -n10 -s -c16 -M -f


Re: [PATCH 1/1] Fixed integer constant is too large for ‘long’ type warning

2012-03-21 Thread Avi Kivity
On 03/20/2012 01:49 PM, Sasikantha babu wrote:
 ---
  arch/x86/kvm/pmu.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)



Missing signoff.

-- 
error compiling committee.c: too many arguments to function



Re: PATCH: nVMX: Better MSR_IA32_FEATURE_CONTROL handling

2012-03-21 Thread Avi Kivity
On 03/19/2012 06:53 PM, Nadav Har'El wrote:
 Hi, in a minute I'll send a new version of the MSR_IA32_FEATURE_CONTROL
 patch for nested VMX; I just wanted to reply first to your comments so
 you'll know what to expect:

 On Wed, Mar 07, 2012, Avi Kivity wrote about "Re: PATCH: nVMX: Better 
 MSR_IA32_FEATURE_CONTROL handling":
  On 03/07/2012 05:58 PM, Nadav Har'El wrote:
   + u64 msr_ia32_feature_control;
};
  
  Need to add to the list of exported MSRs so it can be live migrated
  (msrs_to_save).

 Did this.

  The variable itself should live in vcpu-arch, even if
  some bits are vendor specific.

 But not this. I understand what you explained about vmx.c being for
 Intel *hosts*, not about emulating Intel *guests*, but I do think that
 since none of the bits in this MSR are relevant on AMD hosts (which
 don't do nested VMX), it isn't useful to support this MSR outside vmx.c.

 So I left this variable in vmx->nested. As I noted earlier, svm.c
 did exactly the same thing (nested.vm_cr_msr), so at least there's
 symmetry here.

Okay.


   @@ -1999,7 +2000,7 @@ static int vmx_get_vmx_msr(struct kvm_vc

 switch (msr_index) {
 case MSR_IA32_FEATURE_CONTROL:
   - *pdata = 0;
   + *pdata = to_vmx(vcpu)->nested.msr_ia32_feature_control;
 break;
  
  In a separate patch, please move this outside vmx_get_vmx_msr().  It's
  not a vmx msr.

 Done, but not split into two patches: The patch removes the old case in
 vmx_get_vmx_msr() (and also removes vmx_set_vmx_msr() entirely) and
 instead adds the case in vmx_get_msr() and vmx_set_msr().

   +#define VMXON_NEEDED_FEATURES \
   +   (FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX)
  
  Use const u64 instead of #define please, it jars my eyes.

 I would, if Linux coding style allowed to declare variables in the
 middle of blocks. Unfortunately it doesn't, so I left this #define.
 I don't think it's that bad.


Move it to the top of the file, or as a variable at the top of the
function please.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/5] qemu-kvm: Prepare kvm PIT for upstream merge

2012-03-21 Thread Avi Kivity
On 03/19/2012 05:29 PM, Jan Kiszka wrote:
 Some preparation patches to arrange qemu-kvm for merging in latest qemu
 with its own in-kernel PIT support. Later on, we can switch to that
 version without losing features on the way, even just temporarily.


Thanks, applied.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] pci-assign: Fall back to host-side MSI if INTx sharing fails

2012-03-21 Thread Avi Kivity
On 03/19/2012 11:56 AM, Jan Kiszka wrote:
 If the host or the device does not support INTx sharing, retry the IRQ
 assignment with host-side MSI support enabled but warn about potential
 consequences. This allows to preserve the previous behavior where we
 defaulted to MSI and did not support INTx sharing at all.



Applied, thanks.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] qemu-kvm: Drop installation of self-built optionroms

2012-03-21 Thread Avi Kivity
On 03/19/2012 06:58 PM, Jan Kiszka wrote:
 All corresponding binaries are now in pc-bios, so we can remove this
 diff to upstream.


Thanks, applied.

-- 
error compiling committee.c: too many arguments to function



[PATCH] Fixed integer constant is too large for 'long' type warning

2012-03-21 Thread Sasikantha babu

Signed-off-by: Sasikantha babu sasikanth@gmail.com
---
 arch/x86/kvm/pmu.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 7aad544..9395230 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -365,7 +365,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
case MSR_CORE_PERF_FIXED_CTR_CTRL:
if (pmu->fixed_ctr_ctrl == data)
return 0;
-   if (!(data & 0xf444)) {
+   if (!(data & 0xf444ull)) {
reprogram_fixed_counters(pmu, data);
return 0;
}
-- 
1.7.3.4



Re: [PATCH RFC V6 1/11] x86/spinlock: replace pv spinlocks with pv ticketlocks

2012-03-21 Thread Stephan Diestelhorst
On Wednesday 21 March 2012, 13:04:25 Attilio Rao wrote:
 On 21/03/12 10:20, Raghavendra K T wrote:
  From: Jeremy Fitzhardinge <jeremy.fitzhardi...@citrix.com>
 
  [...]
 
 
 I've made some real world benchmarks based on this series of patches 
 applied on top of a vanilla Linux-3.3-rc6 (commit 
 4704fe65e55fb088fbcb1dc0b15ff7cc8bff3685), with both 
 CONFIG_PARAVIRT_SPINLOCK=y and n, which means essentially 4 versions 
 compared:
 * vanilla - CONFIG_PARAVIRT_SPINLOCK - patch
 * vanilla + CONFIG_PARAVIRT_SPINLOCK - patch
 * vanilla - CONFIG_PARAVIRT_SPINLOCK + patch
 * vanilla + CONFIG_PARAVIRT_SPINLOCK + patch
 
[...]
 == Results
 This test points in the direction that Jeremy's rebased patches don't 
 introduce a performance penalty at all, but also that we could likely 
 consider CONFIG_PARAVIRT_SPINLOCK option removal, or turn it on by 
 default and suggest disabling just on very old CPUs (assuming a 
 performance regression can be proven there).

Very interesting results, in particular knowing that in the one guest
case things do not get (significantly) slower due to the added logic
and LOCKed RMW in the unlock path.

AFAICR, the problem really became apparent when running multiple guests
time sharing the physical CPUs, i.e., two guests with eight vCPUs each
on an eight core machine.  Did you look at this setup with your tests?

Thanks,
  Stephan
-- 
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelho...@amd.com, Tel. +49 (0)351 448 356 719  (AMD: 29-719)




Re: User question: balloon memory, currentMemory and Linux/FreeBSD guests

2012-03-21 Thread Avi Kivity
On 03/14/2012 01:51 PM, Andy Smith wrote:
 Hi,

   I'm a KVM user, using it for the first time and so far very happy
 with it.
 I wonder if someone can help me get my head round the issue of balloon
 memory and overallocation.

 Basically I have a Debian wheezy/sid KVM server, on which I have a
 couple of Linux guests and several FreeBSD guests and I see different
 behaviour between the two guest types. On FreeBSD I have installed the
 virtio and balloon drivers (disk and network working great!), there is
 definitely a balloon process running.
 The behaviour I see is that both Linux and FreeBSD guests always show
 the currentMemory number when queried with qemu-monitor-command
 --hmp ClientName --cmd 'info balloon'. But from top in the Debian
 KVM host the RES size of the Linux guest processes varies
 (presumably depending on the demand of the guest OS) but the FreeBSD
 kvm processes always sit at exactly the currentMemory size.

 Basically I'm interested to know what is going on, whether what I see
 is normal and whether balloon currently dynamically manages memory
 usage or if its still the case that it requires administrator
 intervention to re-release memory after the OS has previously been
 allocated it.

The RES column indicates how much memory guests are actively using. 
Linux touches memory when it first uses it, so if your guests are idle,
RES will remain low.  It seems FreeBSD touches memory immediately (or
perhaps those guests are not idle), so it's high.

RES (or rather, RSS - resident set size) can decrease when the host
swaps guest pages, or if the guest gives up memory via ballooning.
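
The first-touch behaviour is easy to reproduce outside a guest, too. A toy
example (unrelated to the KVM code, just for illustration): run it and watch
RES in top before and after the memset.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	size_t len = 512UL << 20;	/* 512 MB */
	char *p = malloc(len);

	if (!p)
		return 1;
	puts("allocated");		/* RES barely moves: pages untouched */
	sleep(10);
	memset(p, 1, len);		/* every page faulted in: RES jumps */
	puts("touched");
	sleep(10);
	free(p);
	return 0;
}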

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH uq/master] kvm: Drop unused kvm_pit_in_kernel

2012-03-21 Thread Avi Kivity
On 03/21/2012 02:36 PM, Jan Kiszka wrote:
 This is now implied by kvm_irqchip_in_kernel.



So we can't have -no-kvm-pit?

No huge loss, but unexpected.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] Fixed integer constant is too large for 'long' type warning

2012-03-21 Thread Avi Kivity
On 03/21/2012 03:19 PM, Sasikantha babu wrote:
 Signed-off-by: Sasikantha babu sasikanth@gmail.com


Applied, thanks.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH uq/master] kvm: Drop unused kvm_pit_in_kernel

2012-03-21 Thread Jan Kiszka
On 2012-03-21 14:36, Avi Kivity wrote:
 On 03/21/2012 02:36 PM, Jan Kiszka wrote:
 This is now implied by kvm_irqchip_in_kernel.


 
 So we can't have -no-kvm-pit?
 
 No huge loss, but unexpected.

See e81dda195556e72f8cd294998296c1051aab30a8.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: Can VMX provide real mode support?

2012-03-21 Thread Jan Kiszka
On 2012-03-21 13:38, GaoYi wrote:
 Hi Jan,
 
 Since the newest Intel-VT supports the guest OS under the real mode, 
 which was already supported in AMD-V, can the VMX in the latest KVM support 
 that case?

Yes, both with or without that unrestricted guest support (as Intel
called it), real mode will generally work. Without that CPU feature, I
seem to recall that there were some limitations for big real mode, not
sure.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH] kvm-intel: Auto-load on CPUs with VMX.

2012-03-21 Thread Avi Kivity
On 03/21/2012 08:33 AM, Josh Triplett wrote:
 Enable x86 feature-based autoloading for the kvm-intel module on CPUs
 with X86_FEATURE_VMX.

 Signed-off-by: Josh Triplett j...@joshtriplett.org
 ---

 I've tested this and it works correctly.  I can also supply the obvious
 equivalent patch for kvm-amd on AMD CPUs with SVM, but I don't have any
 such CPUs to test that on.


On the one hand this makes sense and is consistent with how other
modules are loaded.  On the other hand this will waste memory for
non-virt users running on distro kernels (but distro kernels can
override this new behaviour anyway).

I'm inclined to apply, but will wait for more comments.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH uq/master] kvm: Drop unused kvm_pit_in_kernel

2012-03-21 Thread Gleb Natapov
On Wed, Mar 21, 2012 at 02:39:47PM +0100, Jan Kiszka wrote:
 On 2012-03-21 14:36, Avi Kivity wrote:
  On 03/21/2012 02:36 PM, Jan Kiszka wrote:
  This is now implied by kvm_irqchip_in_kernel.
 
 
  
  So we can't have -no-kvm-pit?
  
  No huge loss, but unexpected.
 
 See e81dda195556e72f8cd294998296c1051aab30a8.
 
I am curious: what is the reason for upstream not supporting disabling the
in-kernel PIT separately?

--
Gleb.


Re: Can VMX provide real mode support?

2012-03-21 Thread Avi Kivity
On 03/21/2012 03:40 PM, Jan Kiszka wrote:
 On 2012-03-21 13:38, GaoYi wrote:
  Hi Jan,
  
  Since the newest Intel-VT supports the guest OS under the real mode, 
  which was already supported in AMD-V, can the VMX in the latest KVM support 
  that case?

 Yes, both with or without that unrestricted guest support (as Intel
 called it), real mode will generally work. Without that CPU feature, I
 seem to recall that there were some limitations for big real mode, not
 sure.


Yes, big real mode will not work without unrestricted guest.  There
was some work to emulate it (module option emulate_invalid_guest_state),
but it is not complete.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH uq/master] kvm: Drop unused kvm_pit_in_kernel

2012-03-21 Thread Jan Kiszka
On 2012-03-21 14:41, Gleb Natapov wrote:
 On Wed, Mar 21, 2012 at 02:39:47PM +0100, Jan Kiszka wrote:
 On 2012-03-21 14:36, Avi Kivity wrote:
 On 03/21/2012 02:36 PM, Jan Kiszka wrote:
 This is now implied by kvm_irqchip_in_kernel.



 So we can't have -no-kvm-pit?

 No huge loss, but unexpected.

 See e81dda195556e72f8cd294998296c1051aab30a8.

 I am curious what is the reason for upstream to not supporting disabling the
 in-kernel PIT separately?

It was considered no longer relevant:
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/85393

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH RFC V6 1/11] x86/spinlock: replace pv spinlocks with pv ticketlocks

2012-03-21 Thread Attilio Rao

On 21/03/12 13:22, Stephan Diestelhorst wrote:

On Wednesday 21 March 2012, 13:04:25 Attilio Rao wrote:

  On 21/03/12 10:20, Raghavendra K T wrote:
   [...]

  [...]

  == Results
  This test points in the direction that Jeremy's rebased patches don't
  introduce a performance penalty at all, but also that we could likely
  consider CONFIG_PARAVIRT_SPINLOCK option removal, or turn it on by
  default and suggest disabling just on very old CPUs (assuming a
  performance regression can be proven there).

Very interesting results, in particular knowing that in the one guest
case things do not get (significantly) slower due to the added logic
and LOCKed RMW in the unlock path.

AFAICR, the problem really became apparent when running multiple guests
time sharing the physical CPUs, i.e., two guests with eight vCPUs each
on an eight core machine.  Did you look at this setup with your tests?

   


Please note that my tests are made on native Linux, without XEN involvement.

Maybe you meant that the spinlock paravirtualization becomes generally 
useful in the case you mentioned? (2 guests, 8 vCPUs + 8 vCPUs)?


Attilio


Re: [PATCH RFC V6 1/11] x86/spinlock: replace pv spinlocks with pv ticketlocks

2012-03-21 Thread Stephan Diestelhorst
On Wednesday 21 March 2012, 13:49:28 Attilio Rao wrote:
 On 21/03/12 13:22, Stephan Diestelhorst wrote:
  On Wednesday 21 March 2012, 13:04:25 Attilio Rao wrote:
 
   On 21/03/12 10:20, Raghavendra K T wrote:
    [...]
 
   [...]
 
  == Results
  This test points in the direction that Jeremy's rebased patches don't
  introduce a performance penalty at all, but also that we could likely 
  consider CONFIG_PARAVIRT_SPINLOCK option removal, or turn it on by
  default and suggest disabling just on very old CPUs (assuming a
  performance regression can be proven there).
   
  Very interesting results, in particular knowing that in the one guest
  case things do not get (significantly) slower due to the added logic
  and LOCKed RMW in the unlock path.
 
  AFAICR, the problem really became apparent when running multiple guests
  time sharing the physical CPUs, i.e., two guests with eight vCPUs each
  on an eight core machine.  Did you look at this setup with your tests?
 
 
 
 Please note that my tests are made on native Linux, without XEN involvement.
 
 Maybe you meant that the spinlock paravirtualization becomes generally 
 useful in the case you mentioned? (2 guests, 8 vCPUs + 8 vCPUs)?

Yes, that is what I meant.  Just to clarify why you do not see any
speed-ups, and were wondering why.  If the whole point of the exercise
 was to see that there are no performance regressions, fine.  In that
case I misunderstood.

Stephan
-- 
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelho...@amd.com
Tel. +49 (0)351 448 356 719

Advanced Micro Devices GmbH
Einsteinring 24
85609 Aschheim
Germany

Geschaeftsfuehrer: Alberto Bozzo
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632, WEEE-Reg-Nr: DE 12919551




Re: [PATCH RFC V6 1/11] x86/spinlock: replace pv spinlocks with pv ticketlocks

2012-03-21 Thread Attilio Rao

On 21/03/12 14:25, Stephan Diestelhorst wrote:

On Wednesday 21 March 2012, 13:49:28 Attilio Rao wrote:

 On 21/03/12 13:22, Stephan Diestelhorst wrote:

  On Wednesday 21 March 2012, 13:04:25 Attilio Rao wrote:
   [...]

Very interesting results, in particular knowing that in the one guest
case things do not get (significantly) slower due to the added logic
and LOCKed RMW in the unlock path.

AFAICR, the problem really became apparent when running multiple guests
time sharing the physical CPUs, i.e., two guests with eight vCPUs each
on an eight core machine.  Did you look at this setup with your tests?


   

Please note that my tests are made on native Linux, without XEN involvement.

Maybe you meant that the spinlock paravirtualization becomes generally
useful in the case you mentioned? (2 guests, 8 vCPUs + 8 vCPUs)?

Yes, that is what I meant.  Just to clarify why you do not see any
speed-ups, and were wondering why.  If the whole point of the exercise
was to see that there are no performance regressions, fine.  In that
case I misunderstood.
   


Yes, that's right, I just wanted to measure (possible) overhead in 
native Linux and the cost of leaving CONFIG_PARAVIRT_SPINLOCK on.


Thanks,
Attilio


[PATCH 0/2] Improve iteration through sptes from rmap

2012-03-21 Thread Takuya Yoshikawa
By removing sptep from rmap_iterator, I could achieve 15% performance
improvement without inlining.

Takuya Yoshikawa (2):
  KVM: MMU: Make pte_list_desc fit cache lines well
  KVM: MMU: Improve iteration through sptes from rmap

-- 
1.7.5.4



[PATCH 1/2] KVM: MMU: Make pte_list_desc fit cache lines well

2012-03-21 Thread Takuya Yoshikawa
From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

We have PTE_LIST_EXT + 1 pointers in this structure and these 40/20
bytes do not fit cache lines well.  Furthermore, some allocators may
use 64/32-byte objects for the pte_list_desc cache.

This patch solves this problem by changing PTE_LIST_EXT from 4 to 3.

For shadow paging, the new size is still large enough to hold both the
kernel and process mappings for usual anonymous pages.  For file
mappings, there may be a slight change in the cache usage.

Note: with EPT/NPT we almost always have a single spte in each reverse
mapping and we will not see any change by this.
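
For illustration (not part of the patch itself), the arithmetic can be
checked with a stand-alone program; with PTE_LIST_EXT == 3 the descriptor
is exactly four pointers:

#include <stdio.h>

#define PTE_LIST_EXT 3

struct pte_list_desc {
	unsigned long *sptes[PTE_LIST_EXT];	/* u64 * in the kernel */
	struct pte_list_desc *more;
};

int main(void)
{
	/* Prints 32 on 64-bit and 16 on 32-bit; with PTE_LIST_EXT == 4
	 * it would be the 40/20 bytes mentioned above. */
	printf("sizeof(struct pte_list_desc) = %zu\n",
	       sizeof(struct pte_list_desc));
	return 0;
}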

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/mmu.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index dc5f245..3213348 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -135,8 +135,6 @@ module_param(dbg, bool, 0644);
 #define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | PT_USER_MASK \
| PT64_NX_MASK)
 
-#define PTE_LIST_EXT 4
-
 #define ACC_EXEC_MASK1
 #define ACC_WRITE_MASK   PT_WRITABLE_MASK
 #define ACC_USER_MASKPT_USER_MASK
@@ -151,6 +149,9 @@ module_param(dbg, bool, 0644);
 
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
+/* make pte_list_desc fit well in cache line */
+#define PTE_LIST_EXT 3
+
 struct pte_list_desc {
u64 *sptes[PTE_LIST_EXT];
struct pte_list_desc *more;
-- 
1.7.5.4



Re: [PATCH RFC V6 1/11] x86/spinlock: replace pv spinlocks with pv ticketlocks

2012-03-21 Thread Raghavendra K T

On 03/21/2012 08:03 PM, Attilio Rao wrote:

On 21/03/12 14:25, Stephan Diestelhorst wrote:

 [...]

Yes, that is what I meant. Just to clarify why you do not see any
speed-ups, and were wondering why. If the whole point of the exercise
was to see that there are no performance regressions, fine. In that
case I misunderstood.


Yes, that's right, I just wanted to measure (possible) overhead in
native Linux and the cost of leaving CONFIG_PARAVIRT_SPINLOCK on.


True.  Even my results revolved only around native overhead.

Till now the main concern in the community was native overhead. So this time 
we have results that prove CONFIG_PARAVIRT_SPINLOCK is now on par with the 
corresponding vanilla kernel because of the ticketlock improvements.


Coming to the guest scenario, I intend to post the KVM counterpart of the 
patches with results, where we see a huge improvement (around 90%) in the 
contention scenario and almost zero overhead in the normal case.










[PATCH 2/2] KVM: MMU: Improve iteration through sptes from rmap

2012-03-21 Thread Takuya Yoshikawa
From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

Iteration using rmap_next(), whose actual body is pte_list_next(), is
inefficient: every time we call it we start by checking whether the rmap
holds a single spte or points to a descriptor which links more sptes.

In the case of shadow paging, this quadratic total iteration cost is a
problem.  Even for two dimensional paging, with EPT/NPT on, in which we
almost always have a single mapping, the extra checks at the end of the
iteration should be eliminated.

This patch fixes this by introducing rmap_iterator, which keeps the
iteration context for the next search.  Furthermore the implementation
of rmap_next() is split into two functions, rmap_get_first() and
rmap_get_next(), to avoid repeatedly checking whether the rmap being
iterated on has only one spte.

Although there seemed to be only a slight change for EPT/NPT, the actual
improvement was significant: we observed that GET_DIRTY_LOG for 1GB
dirty memory became 15% faster than before.  This is probably because
the new code is easier for branch prediction.

Note: we just remove pte_list_next() because we can think of parent_ptes
as a reverse mapping.
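
For illustration, a caller would then iterate roughly like this (a
hypothetical loop in the style of the existing rmap walkers, not taken
from the patch; rmapp is the usual unsigned long pointer to the rmap head):

	u64 *sptep;
	struct rmap_iterator iter;

	for (sptep = rmap_get_first(*rmapp, &iter); sptep;
	     sptep = rmap_get_next(&iter)) {
		/* work on *sptep; if the body removes sptes from this
		 * rmap, restart the walk with rmap_get_first() */
	}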

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/mmu.c   |  196 --
 arch/x86/kvm/mmu_audit.c |   10 +-
 2 files changed, 124 insertions(+), 82 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3213348..29ad6f9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -842,32 +842,6 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
return count;
 }
 
-static u64 *pte_list_next(unsigned long *pte_list, u64 *spte)
-{
-   struct pte_list_desc *desc;
-   u64 *prev_spte;
-   int i;
-
-   if (!*pte_list)
-   return NULL;
-   else if (!(*pte_list & 1)) {
-   if (!spte)
-   return (u64 *)*pte_list;
-   return NULL;
-   }
-   desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-   prev_spte = NULL;
-   while (desc) {
-   for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) {
-   if (prev_spte == spte)
-   return desc->sptes[i];
-   prev_spte = desc->sptes[i];
-   }
-   desc = desc->more;
-   }
-   return NULL;
-}
-
 static void
 pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
   int i, struct pte_list_desc *prev_desc)
@@ -988,11 +962,6 @@ static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
return pte_list_add(vcpu, spte, rmapp);
 }
 
-static u64 *rmap_next(unsigned long *rmapp, u64 *spte)
-{
-   return pte_list_next(rmapp, spte);
-}
-
 static void rmap_remove(struct kvm *kvm, u64 *spte)
 {
struct kvm_mmu_page *sp;
@@ -1005,6 +974,67 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
pte_list_remove(spte, rmapp);
 }
 
+/*
+ * Used by the following functions to iterate through the sptes linked by a
+ * rmap.  All fields are private and not assumed to be used outside.
+ */
+struct rmap_iterator {
+   /* private fields */
+   struct pte_list_desc *desc; /* holds the sptep if not NULL */
+   int pos;/* index of the sptep */
+};
+
+/*
+ * Iteration must be started by this function.  This should also be used after
+ * removing/dropping sptes from the rmap link because in such cases the
+ * information in the iterator may not be valid.
+ *
+ * Returns sptep if found, NULL otherwise.
+ */
+static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
+{
+   if (!rmap)
+   return NULL;
+
+   if (!(rmap & 1)) {
+   iter->desc = NULL;
+   return (u64 *)rmap;
+   }
+
+   iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
+   iter->pos = 0;
+   return iter->desc->sptes[iter->pos];
+}
+
+/*
+ * Must be used with a valid iterator: e.g. after rmap_get_first().
+ *
+ * Returns sptep if found, NULL otherwise.
+ */
+static u64 *rmap_get_next(struct rmap_iterator *iter)
+{
+   if (iter->desc) {
+   if (iter->pos < PTE_LIST_EXT - 1) {
+   u64 *sptep;
+
+   ++iter->pos;
+   sptep = iter->desc->sptes[iter->pos];
+   if (sptep)
+   return sptep;
+   }
+
+   iter->desc = iter->desc->more;
+
+   if (iter->desc) {
+   iter->pos = 0;
+   /* desc->sptes[0] cannot be NULL */
+   return iter->desc->sptes[iter->pos];
+   }
+   }
+
+   return NULL;
+}
+
 static void drop_spte(struct kvm *kvm, u64 *sptep)
 {
if (mmu_spte_clear_track_bits(sptep))
@@ -1013,23 +1043,27 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 
 static int 

Re: [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-03-21 Thread Corey Minyard



Look at drivers/char/ipmi/ipmi_msghandler.c. It has code to send a panic
event over IPMI. The code is pretty complex. Of course if we are going to
implement something more complex than a simple hypercall for panic
notification we had better do something more interesting with it than just
saying "panic happened", like sending stack traces on all cpus for
instance.


I doubt that's the best example, unfortunately.  The IPMI event log has 
limited space and it has to be sent a little piece at a time since each 
log entry is 14 bytes.  It just prints the panic string, nothing else.  
Not that it isn't useful, it has saved my butt before.


You have lots of interesting options with paravirtualization.  You 
could, for instance, create a console driver that delivered all console 
output efficiently through a hypercall.  That would be really easy.  Or, 
as you mention, a custom way to deliver panic information.  Collecting 
information like stack traces would be harder to accomplish, as I don't 
think there is currently a way to get it except by sending it to printk.
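
To give a sense of how small the guest side of such a console hypercall
could be, a sketch (KVM_HC_CONSOLE is a made-up, unallocated hypercall
number, and one character per hypercall is deliberately naive):

#include <asm/kvm_para.h>

#define KVM_HC_CONSOLE	42	/* hypothetical, not an allocated number */

static void hypercall_console_write(const char *s, unsigned int n)
{
	while (n--)
		kvm_hypercall1(KVM_HC_CONSOLE, (unsigned char)*s++);
}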


-corey


Re: [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-03-21 Thread Gleb Natapov
On Wed, Mar 21, 2012 at 11:18:16AM -0500, Corey Minyard wrote:
 
 Look at drivers/char/ipmi/ipmi_msghandler.c. It has code to send panic
 event over IPMI. The code is pretty complex. Of course if we are going to
 implement something more complex than simple hypercall for panic
 notification we better do something more interesting with it than just
 saying panic happened, like sending stack traces on all cpus for
 instance.
 
 I doubt that's the best example, unfortunately.  The IPMI event log
 has limited space and it has to be send a little piece at a time
 since each log entry is 14 bytes.  It just prints the panic string,
 nothing else.  Not that it isn't useful, it has saved my butt
 before.
 
I gave the IPMI example just to show that others do complex things on panic,
not as an example of what we should do on a guest panic.

 You have lots of interesting options with paravirtualization.  You
 could, for instance, create a console driver that delivered all
 console output efficiently through a hypercall.  That would be
 really easy.  Or, as you mention, a custom way to deliver panic
 information.  Collecting information like stack traces would be
 harder to accomplish, as I don't think there is currently a way to
 get it except by sending it to printk.
 
Why use a hypercall for that, though? You can do that with
virtio-console. Make it zero config: virtio-console detected -> send
console output there.

--
Gleb.


Re: [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-03-21 Thread Avi Kivity
On 03/21/2012 06:18 PM, Corey Minyard wrote:

 Look at drivers/char/ipmi/ipmi_msghandler.c. It has code to send panic
 event over IPMI. The code is pretty complex. Of course if we are going to
 implement something more complex than simple hypercall for panic
 notification we better do something more interesting with it than just
 saying panic happened, like sending stack traces on all cpus for
 instance.

 I doubt that's the best example, unfortunately.  The IPMI event log
 has limited space and it has to be send a little piece at a time since
 each log entry is 14 bytes.  It just prints the panic string, nothing
 else.  Not that it isn't useful, it has saved my butt before.

 You have lots of interesting options with paravirtualization.  You
 could, for instance, create a console driver that delivered all
 console output efficiently through a hypercall.  That would be really
 easy.  Or, as you mention, a custom way to deliver panic information. 
 Collecting information like stack traces would be harder to
 accomplish, as I don't think there is currently a way to get it except
 by sending it to printk.

That already exists; virtio-console (or serial console emulation) can do
the job.

In fact the feature can be implemented 100% host side by searching for a
panic string signature in the console logs.
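
A toy sketch of that host-side approach, assuming the guest console is
logged to a file ("Kernel panic - not syncing" is the usual signature
string, but treat the details as assumptions):

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char line[1024];
	FILE *f = argc > 1 ? fopen(argv[1], "r") : NULL;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "Kernel panic - not syncing"))
			printf("guest panicked: %s", line);
	fclose(f);
	return 0;
}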

-- 
error compiling committee.c: too many arguments to function



RE: [Qemu-devel] [PATCH v4 4/7] RTC: Set internal millisecond register to 500ms when reset divider

2012-03-21 Thread Stefano Stabellini
On Wed, 21 Mar 2012, Zhang, Yang Z wrote:
  struct tm *tm = s->current_tm;
   -int64_t host_usec, guest_sec, guest_usec;
    +int64_t host_usec, guest_sec, guest_usec, offset_usec, old_guest_usec;
  
host_usec = qemu_get_clock_ns(host_clock) / NS_PER_USEC;
    +offset_usec = s->offset_sec * USEC_PER_SEC + s->offset_usec;
   +old_guest_usec = (host_usec + offset_usec) % USEC_PER_SEC;
  
guest_sec = mktimegm(tm);
   -guest_usec = guest_sec * USEC_PER_SEC;
  
   +/* start_usec equal 0 means rtc internal millisecond is
   + * same with before */
   +if (start_usec == 0) {
   +guest_usec = guest_sec * USEC_PER_SEC + old_guest_usec;
   +} else {
   +guest_usec = guest_sec * USEC_PER_SEC + start_usec;
   +}
  
  This looks like a mistake to me: before guest_usec was derived
  exclusively from mktimegm (take or leave USEC_PER_SEC), while now
  guest_usec is the sum of the value returned by mktimegm and
  old_guest_usec, even when start_usec is 0.
  Are you sure that this is correct?
 The logic is right. Before this patch, we assumed the rtc internal millisecond 
 register is the same as the host time, so we didn't need to consider it and 
 using mktime was enough. Now, the rtc internal millisecond can be changed, so 
 I use the old_guest_usec to record the current internal millisecond. When 
 start_usec is 0, it means we don't need to change the internal millisecond 
 register and the offset_sec is the same as before.
 
I am starting to understand the intention of this code, but I am still
unconvinced that the start_usec == 0 case is correct.
If I am not mistaken you are calculating:

offset = guest + old_guest - host

while it should be:

offset = guest + old_start - host

where old_start is start_usec as it was set last time, correct? Am I
missing something? In any case please add a comment to explain what the
change is trying to accomplish.
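
To make the comparison concrete, the two candidate computations can be
written out as follows (a sketch using the thread's variable names, not
code from the patch):

#include <stdint.h>

#define USEC_PER_SEC 1000000LL

/* The patch, for start_usec == 0: offset = guest + old_guest - host */
static int64_t offset_as_patched(int64_t guest_sec, int64_t host_usec,
                                 int64_t offset_usec)
{
	int64_t old_guest_usec = (host_usec + offset_usec) % USEC_PER_SEC;

	return guest_sec * USEC_PER_SEC + old_guest_usec - host_usec;
}

/* The suggestion: offset = guest + old_start - host, where
 * old_start_usec is start_usec as it was set on the previous call. */
static int64_t offset_as_suggested(int64_t guest_sec, int64_t host_usec,
                                   int64_t old_start_usec)
{
	return guest_sec * USEC_PER_SEC + old_start_usec - host_usec;
}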


Re: [Qemu-devel] [PATCH v4 5/7] RTC:Add RTC update-ended interrupt support

2012-03-21 Thread Stefano Stabellini
On Wed, 21 Mar 2012, Paolo Bonzini wrote:
 Il 20/03/2012 19:35, Stefano Stabellini ha scritto:
  This is the function that is used to figure out whether we need the
  timers or not, the condition seems to be:
  
  (Not (REG_C_UF | REG_C_AF)) And (Not (REG_B_SET))
  
  Shouldn't it actually check for UIE being enabled?
 
 No, you need to set UF in case the code observes it without actually
 enabling interrupt delivery on the ISA bus.
 
Well, if it is just about updating UF, can we do it only when the user
reads REG_C?
I am just trying to limit the need for the update_timers...


Re: [PATCH RFC V6 2/11] x86/ticketlock: don't inline _spin_unlock when using paravirt spinlocks

2012-03-21 Thread Linus Torvalds
On Wed, Mar 21, 2012 at 3:21 AM, Raghavendra K T
raghavendra...@linux.vnet.ibm.com wrote:
 From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com

 The code size expands somewhat, and it's probably better to just call
 a function rather than inline it.

 Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 ---
  arch/x86/Kconfig     |    3 +++
  kernel/Kconfig.locks |    2 +-
  2 files changed, 4 insertions(+), 1 deletions(-)
 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
 index 5bed94e..10c28ec 100644
 --- a/arch/x86/Kconfig
 +++ b/arch/x86/Kconfig
 @@ -623,6 +623,9 @@ config PARAVIRT_SPINLOCKS

          If you are unsure how to answer this question, answer N.

 +config ARCH_NOINLINE_SPIN_UNLOCK
 +       def_bool PARAVIRT_SPINLOCKS
 +
  config PARAVIRT_CLOCK
        bool

 diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
 index 5068e2a..584637b 100644
 --- a/kernel/Kconfig.locks
 +++ b/kernel/Kconfig.locks
 @@ -125,7 +125,7 @@ config INLINE_SPIN_LOCK_IRQSAVE
                 ARCH_INLINE_SPIN_LOCK_IRQSAVE

  config INLINE_SPIN_UNLOCK
 -       def_bool !DEBUG_SPINLOCK && (!PREEMPT || ARCH_INLINE_SPIN_UNLOCK)
 +       def_bool !DEBUG_SPINLOCK && (!PREEMPT || ARCH_INLINE_SPIN_UNLOCK) &&
 !ARCH_NOINLINE_SPIN_UNLOCK

  config INLINE_SPIN_UNLOCK_BH
        def_bool !DEBUG_SPINLOCK && ARCH_INLINE_SPIN_UNLOCK_BH

Ugh. This is getting really ugly.

Can we just fix it by
 - getting rid of INLINE_SPIN_UNLOCK entirely

 - replacing it with UNINLINE_SPIN_UNLOCK instead with the reverse
meaning, and no def_bool at all, just a simple

config UNINLINE_SPIN_UNLOCK
bool

 - make the various people who want to uninline the spinlocks (like
spinlock debugging, paravirt etc) all just do

select UNINLINE_SPIN_UNLOCK

because quite frankly, the whole spinunlock inlining logic is
*already* unreadable, and you just made it worse.

  Linus


Re: [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-03-21 Thread Daniel P. Berrange
On Wed, Mar 21, 2012 at 06:25:16PM +0200, Avi Kivity wrote:
 On 03/21/2012 06:18 PM, Corey Minyard wrote:
 
  Look at drivers/char/ipmi/ipmi_msghandler.c. It has code to send panic
  event over IPMI. The code is pretty complex. Of course if we are going to
  implement something more complex than simple hypercall for panic
  notification we better do something more interesting with it than just
  saying panic happened, like sending stack traces on all cpus for
  instance.
 
  I doubt that's the best example, unfortunately.  The IPMI event log
  has limited space and it has to be send a little piece at a time since
  each log entry is 14 bytes.  It just prints the panic string, nothing
  else.  Not that it isn't useful, it has saved my butt before.
 
  You have lots of interesting options with paravirtualization.  You
  could, for instance, create a console driver that delivered all
  console output efficiently through a hypercall.  That would be really
  easy.  Or, as you mention, a custom way to deliver panic information. 
  Collecting information like stack traces would be harder to
  accomplish, as I don't think there is currently a way to get it except
  by sending it to printk.
 
 That already exists; virtio-console (or serial console emulation) can do
 the job.
 
 In fact the feature can be implemented 100% host side by searching for a
 panic string signature in the console logs.

You can even go one better and search for the panic string in the
guest memory directly, which is what virt-dmesg does :-)

  http://people.redhat.com/~rjones/virt-dmesg/

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: performance trouble

2012-03-21 Thread Peter Lieven

On 21.03.2012 12:10, David Cure wrote:

hello,

Le Tue, Mar 20, 2012 at 02:38:22PM +0200, Gleb Natapov ecrivait :

Try to add <feature policy='disable' name='hypervisor'/> to the cpu
definition in the XML and check the command line.

ok I try this but I can't use <cpu model> to map the host cpu
(my libvirt is 0.9.8) so I use:

   <cpu match='exact'>
     <model>Opteron_G3</model>
     <feature policy='disable' name='hypervisor'/>
   </cpu>

(the physical server uses an Opteron CPU).

The log is here :
http://www.roullier.net/Report/report-3.2-vhost-net-1vcpu-cpu.txt.gz

And now with only 1 vcpu, the response time is 8.5s, a great
improvement. We keep this configuration for production: we will check the
response time when some other users are connected.
Please keep in mind that setting -hypervisor, disabling hpet, and using only
one vcpu makes Windows use the TSC as its clocksource. You have to make sure
that your VM does not switch between physical sockets on your system and that
you have the constant_tsc feature, so that the TSC is stable between the cores
in the same socket. It is also likely that the VM will crash when live migrated.

@gleb: do you know what's the state of the in-kernel hyper-v timers?

peter


David.


Re: [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-03-21 Thread Avi Kivity
On 03/21/2012 07:04 PM, Daniel P. Berrange wrote:
  
  In fact the feature can be implemented 100% host side by searching for a
  panic string signature in the console logs.

 You can even go one better and search for the panic string in the
 guest memory directly, which is what virt-dmesg does :-)

   http://people.redhat.com/~rjones/virt-dmesg/


-ETOOHACKY

Any guest change will break this, no?

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-03-21 Thread Jan Kiszka
On 2012-03-21 18:34, Avi Kivity wrote:
 On 03/21/2012 07:04 PM, Daniel P. Berrange wrote:

 In fact the feature can be implemented 100% host side by searching for a
 panic string signature in the console logs.

 You can even go one better and search for the panic string in the
 guest memory directly, which is what virt-dmesg does :-)

   http://people.redhat.com/~rjones/virt-dmesg/

 
 -ETOOHACKY
 
 Any guest change will break this, no?

/me has a simple python script (a few dozen lines) to do this during
runtime (kgdb, qemu gdbstub) or post-mortem (gdb -c vmcore). Ideally,
that will one day come with the Linux sources where it can be kept in sync
with the kernel data structures.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH 12/38] powerpc/booke: Provide exception macros with interrupt name

2012-03-21 Thread Scott Wood
On 03/21/2012 01:04 PM, Kumar Gala wrote:
 
 On Feb 28, 2012, at 6:09 PM, Alexander Graf wrote:
 
 From: Scott Wood scottw...@freescale.com

 DO_KVM will need to identify the particular exception type.

 There is an existing set of arbitrary numbers that Linux passes,
 but it's an undocumented mess that sort of corresponds to server/classic
 exception vectors but not really.
 
 So what do the new names correspond to?

The names are Linux-defined.  The values are IVOR numbers.

 What header is defining MACHINE_CHECK, BOOKE_INTERRUPT_EXTERNAL, etc
 (asm/kvm_asm.h)?

Yes, it's asm/kvm_asm.h at the moment.

It's actually BOOKE_INTERRUPT_MACHINE_CHECK, etc.  The exception macros
paste on BOOKE_INTERRUPT_ when they use it, to keep the macro users from
having to wrap lines even more often.

 If so we really should move these out of
 asm/kvm_asm.h and into something a bit more appropriate.

Yes, that was one of the things I had been planning to fix post-RFC.
Any preference what header it should be in?  asm/reg_booke.h seems to be
the least bad option of the existing headers -- or perhaps move
head_booke.h to arch/powerpc/include/asm.

Long-term it would also be nice for this to replace, rather than
supplement, the current numbers, though as Ben pointed out there are a
number of places throughout the code that will have to be fixed for that
to happen.

-Scott



[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel

2012-03-21 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42779





--- Comment #20 from Avi Kivity a...@redhat.com  2012-03-21 18:21:33 ---
Please try

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git emulator-movq

this emulates the two movq instructions used.  There may be other unemulated
instructions though.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.


[PATCH 0/2] Emulate MMX MOVQ

2012-03-21 Thread Avi Kivity
This patchset emulates two instructions:

  MOVQ mm, mm/m64
  MOVQ mm/m64, mm

Needed by https://bugzilla.kernel.org/show_bug.cgi?id=42779. However it seems
to fail sometimes in my unit test, generating a math fault at the fwait
instruction (which is duly trapped).  Hints appreciated.

Avi Kivity (2):
  KVM: x86 emulator: MMX support
  KVM: x86 emulator: implement MMX MOVQ (opcodes 0f 6f, 0f 7f)

 arch/x86/include/asm/kvm_emulate.h |4 +-
 arch/x86/kvm/emulate.c |   98 ++--
 2 files changed, 96 insertions(+), 6 deletions(-)

-- 
1.7.9



[PATCH 1/2] KVM: x86 emulator: MMX support

2012-03-21 Thread Avi Kivity
General support for the MMX instruction set.  Special care is taken
to trap pending x87 exceptions so that they are properly reflected
to the guest.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/include/asm/kvm_emulate.h |4 +-
 arch/x86/kvm/emulate.c |   90 ++--
 2 files changed, 89 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index c222e1a..1ac46c22 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -200,7 +200,7 @@ typedef u32 __attribute__((vector_size(16))) sse128_t;
 
 /* Type, address-of, and value of an instruction's operand. */
 struct operand {
-   enum { OP_REG, OP_MEM, OP_IMM, OP_XMM, OP_NONE } type;
+   enum { OP_REG, OP_MEM, OP_IMM, OP_XMM, OP_MM, OP_NONE } type;
unsigned int bytes;
union {
unsigned long orig_val;
@@ -213,12 +213,14 @@ struct operand {
unsigned seg;
} mem;
unsigned xmm;
+   unsigned mm;
} addr;
union {
unsigned long val;
u64 val64;
char valptr[sizeof(unsigned long) + 2];
sse128_t vec_val;
+   u64 mm_val;
};
 };
 
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8375622..19aec76 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -142,6 +142,7 @@
 #define Src2FS  (OpFS << Src2Shift)
 #define Src2GS  (OpGS << Src2Shift)
 #define Src2Mask    (OpMask << Src2Shift)
+#define Mmx (1ull << 40) /* MMX Vector instruction */
 
 #define X2(x...) x, x
 #define X3(x...) X2(x), x
@@ -859,6 +860,40 @@ static void write_sse_reg(struct x86_emulate_ctxt *ctxt, sse128_t *data,
	ctxt->ops->put_fpu(ctxt);
 }
 
+static void read_mmx_reg(struct x86_emulate_ctxt *ctxt, u64 *data, int reg)
+{
+	ctxt->ops->get_fpu(ctxt);
+	switch (reg) {
+	case 0: asm("movq %%mm0, %0" : "=m"(*data)); break;
+	case 1: asm("movq %%mm1, %0" : "=m"(*data)); break;
+	case 2: asm("movq %%mm2, %0" : "=m"(*data)); break;
+	case 3: asm("movq %%mm3, %0" : "=m"(*data)); break;
+	case 4: asm("movq %%mm4, %0" : "=m"(*data)); break;
+	case 5: asm("movq %%mm5, %0" : "=m"(*data)); break;
+	case 6: asm("movq %%mm6, %0" : "=m"(*data)); break;
+	case 7: asm("movq %%mm7, %0" : "=m"(*data)); break;
+	default: BUG();
+	}
+	ctxt->ops->put_fpu(ctxt);
+}
+
+static void write_mmx_reg(struct x86_emulate_ctxt *ctxt, u64 *data, int reg)
+{
+	ctxt->ops->get_fpu(ctxt);
+	switch (reg) {
+	case 0: asm("movq %0, %%mm0" : : "m"(*data)); break;
+	case 1: asm("movq %0, %%mm1" : : "m"(*data)); break;
+	case 2: asm("movq %0, %%mm2" : : "m"(*data)); break;
+	case 3: asm("movq %0, %%mm3" : : "m"(*data)); break;
+	case 4: asm("movq %0, %%mm4" : : "m"(*data)); break;
+	case 5: asm("movq %0, %%mm5" : : "m"(*data)); break;
+	case 6: asm("movq %0, %%mm6" : : "m"(*data)); break;
+	case 7: asm("movq %0, %%mm7" : : "m"(*data)); break;
+	default: BUG();
+	}
+	ctxt->ops->put_fpu(ctxt);
+}
+
 static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
struct operand *op)
 {
@@ -875,6 +910,14 @@ static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
		read_sse_reg(ctxt, &op->vec_val, reg);
		return;
	}
+	if (ctxt->d & Mmx) {
+		reg &= 7;
+		op->type = OP_MM;
+		op->bytes = 8;
+		op->addr.mm = reg;
+		read_mmx_reg(ctxt, &op->mm_val, reg);
+		return;
+	}
 
	op->type = OP_REG;
	if (ctxt->d & ByteOp) {
@@ -920,6 +963,13 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt,
		read_sse_reg(ctxt, &op->vec_val, ctxt->modrm_rm);
		return rc;
	}
+	if (ctxt->d & Mmx) {
+		op->type = OP_MM;
+		op->bytes = 8;
+		op->addr.mm = ctxt->modrm_rm & 7;
+		read_mmx_reg(ctxt, &op->mm_val, ctxt->modrm_rm & 7);
+		return rc;
+	}
fetch_register_operand(op);
return rc;
}
@@ -1387,6 +1437,9 @@ static int writeback(struct x86_emulate_ctxt *ctxt)
case OP_XMM:
		write_sse_reg(ctxt, &ctxt->dst.vec_val, ctxt->dst.addr.xmm);
		break;
+	case OP_MM:
+		write_mmx_reg(ctxt, &ctxt->dst.mm_val, ctxt->dst.addr.mm);
+		break;
case OP_NONE:
/* no writeback */
break;
@@ -3960,6 +4013,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 
	if (ctxt->d & Sse)
		ctxt->op_bytes = 16;
+	else if (ctxt->d & Mmx)
+		ctxt->op_bytes = 8;
 
/* ModRM and 

[PATCH 2/2] KVM: x86 emulator: implement MMX MOVQ (opcodes 0f 6f, 0f 7f)

2012-03-21 Thread Avi Kivity
Needed by some framebuffer drivers.  See

https://bugzilla.kernel.org/show_bug.cgi?id=42779

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/emulate.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 19aec76..5997513 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2929,6 +2929,12 @@ static int em_movdqu(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
+static int em_movq(struct x86_emulate_ctxt *ctxt)
+{
+	ctxt->dst.mm_val = ctxt->src.mm_val;
+   return X86EMUL_CONTINUE;
+}
+
 static int em_invlpg(struct x86_emulate_ctxt *ctxt)
 {
int rc;
@@ -3468,7 +3474,7 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 };
 
 static struct gprefix pfx_0f_6f_0f_7f = {
-   N, N, N, I(Sse, em_movdqu),
+   I(Mmx, em_movq), N, N, I(Sse, em_movdqu),
 };
 
 static struct opcode opcode_table[256] = {
-- 
1.7.9



Re: [Qemu-devel] [RESEND][PATCH 2/2 v3] deal with guest panicked event

2012-03-21 Thread Anthony Liguori

On 03/09/2012 04:22 PM, Marcelo Tosatti wrote:

On Thu, Mar 08, 2012 at 11:56:56AM +, Daniel P. Berrange wrote:

On Thu, Mar 08, 2012 at 01:52:45PM +0200, Avi Kivity wrote:

On 03/08/2012 01:36 PM, Daniel P. Berrange wrote:

On Thu, Mar 08, 2012 at 01:28:56PM +0200, Avi Kivity wrote:

On 03/08/2012 12:15 PM, Wen Congyang wrote:

When the host knows the guest has panicked, it will set
exit_reason to KVM_EXIT_GUEST_PANICKED. So if qemu receives
this exit_reason, we can send an event to tell the management
application that the guest has panicked and set the guest
status to RUN_STATE_PANICKED.

Signed-off-by: Wen Congyangwe...@cn.fujitsu.com
---
  kvm-all.c|5 +
  monitor.c|3 +++
  monitor.h|1 +
  qapi-schema.json |2 +-
  qmp.c|3 ++-
  vl.c |1 +
  6 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 77eadf6..b3c9a83 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1290,6 +1290,11 @@ int kvm_cpu_exec(CPUState *env)
  (uint64_t)run->hw.hardware_exit_reason);
  ret = -1;
  break;
+case KVM_EXIT_GUEST_PANICKED:
+monitor_protocol_event(QEVENT_GUEST_PANICKED, NULL);
+vm_stop(RUN_STATE_PANICKED);
+ret = -1;
+break;



If the management application is not aware of this event, then it will
never resume the guest, so it will appear hung.


Even if the mgmt app doesn't know about the QEVENT_GUEST_PANICKED, it should
still see a QEVENT_STOP event emitted by vm_stop() surely ? So it will
know the guest CPUs have been stopped, even if it isn't aware of the
reason why, which seems fine to me.


No.  The guest is stopped, and there's no reason to suppose that the
management app will restart it.  Behaviour has changed.

Suppose the guest has reboot_on_panic set; now the behaviour change is
even more visible - service will stop completely instead of being
interrupted for a bit while the guest reboots.


Hmm, so this calls for a new command line argument to control behaviour,
similar to what we do for disk werror, eg something like

   --onpanic report|pause|stop|...

where

  report - emit QEVENT_GUEST_PANICKED only


Should be the default.


Should we just have a mechanism to stop the guest on certain types of QMP 
events?  For instance:


-stop-on guest-panicked,block-ioerror

Likewise, we could have a:

-quit-on guest-panicked.

At the very least, we should make what we use for rerror/werror an enumeration 
that's shared here.


Regards,

Anthony Liguori




  pause  - emit QEVENT_GUEST_PANICKED and pause VM
  stop   - emit QEVENT_GUEST_PANICKED and quit VM


quit is a better name than stop.


This would map fairly well into libvirt, where we already have config
parameters for controlling what todo with a guest when it panics.

Regards,
Daniel







Re: [Qemu-devel] [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-03-21 Thread Anthony Liguori

On 03/13/2012 05:47 AM, Avi Kivity wrote:

On 03/13/2012 11:18 AM, Daniel P. Berrange wrote:

On Mon, Mar 12, 2012 at 12:33:33PM +0200, Avi Kivity wrote:

On 03/12/2012 11:04 AM, Wen Congyang wrote:

Do you have any other comments about this patch?



Not really, but I'm not 100% convinced the patch is worthwhile.  It's
likely to only be used by Linux, which has kexec facilities, and you can
talk to management via virtio-serial and describe the crash in more
detail than a simple hypercall.


As mentioned before, I don't think virtio-serial is a good fit for this.
We want something that is simple  guaranteed always available. Using
virtio-serial requires significant setup work on both the host and guest.


So what?  It needs to be done anyway for the guest agent.


Many management applications won't know to make a vioserial device available
to all guests they create.


Then they won't know to deal with the panic event either.


Most administrators won't even configure kexec,
let alone virtio serial on top of it.


It should be done by the OS vendor, not the individual admin.


The hypercall requires zero host
side config, and zero guest side config, which IMHO is what we need for
this feature.


If it was this one feature, yes.  But we keep getting more and more
features like that and we bloat the hypervisor.  There's a reason we
have a host-to-guest channel, we should use it.


The problem is that virtio-serial sucks for something like this.

We have two options I think:

1) We could reserve a portion of the hypercall space to be deferred to userspace 
for stuff like this.


2) We could invent a new hypercall-like facility that was less bloated than 
virtio-serial for stuff like this using MMIO/PIO.


Regards,

Anthony Liguori







[PATCH 1/3][Autotest][virt] autotest.base_utils: Move virt.utils.Thread to common_lib

2012-03-21 Thread Jiří Župka
Move the class virt.utils.Thread to base_utils.InterruptedThread so that it can
be used in general utils.

Signed-off-by: Jiří Župka jzu...@redhat.com
---
 client/common_lib/base_utils.py|   65 ++
 .../kvm/tests/migration_with_file_transfer.py  |6 +-
 client/tests/kvm/tests/migration_with_reboot.py|4 +-
 client/tests/kvm/tests/nic_bonding.py  |9 ++-
 client/tests/kvm/tests/vmstop.py   |6 +-
 client/virt/tests/nic_promisc.py   |5 +-
 client/virt/tests/nicdriver_unload.py  |4 +-
 client/virt/tests/ntttcp.py|2 +-
 client/virt/virt_test_utils.py |2 +-
 client/virt/virt_utils.py  |   69 +---
 10 files changed, 87 insertions(+), 85 deletions(-)

diff --git a/client/common_lib/base_utils.py b/client/common_lib/base_utils.py
index 972d18a..c40e5dc 100644
--- a/client/common_lib/base_utils.py
+++ b/client/common_lib/base_utils.py
@@ -817,6 +817,71 @@ def run_parallel(commands, timeout=None, ignore_status=False,
 return [bg_job.result for bg_job in bg_jobs]
 
 
+class InterruptedThread(Thread):
+    """
+    Run a function in a background thread.
+    """
+    def __init__(self, target, args=(), kwargs={}):
+        """
+        Initialize the instance.
+
+        @param target: Function to run in the thread.
+        @param args: Arguments to pass to target.
+        @param kwargs: Keyword arguments to pass to target.
+        """
+        Thread.__init__(self)
+        self._target = target
+        self._args = args
+        self._kwargs = kwargs
+
+
+    def run(self):
+        """
+        Run target (passed to the constructor).  No point in calling this
+        function directly.  Call start() to make this function run in a new
+        thread.
+        """
+        self._e = None
+        self._retval = None
+        try:
+            try:
+                self._retval = self._target(*self._args, **self._kwargs)
+            except Exception:
+                self._e = sys.exc_info()
+                raise
+        finally:
+            # Avoid circular references (start() may be called only once so
+            # it's OK to delete these)
+            del self._target, self._args, self._kwargs
+
+
+    def join(self, timeout=None, suppress_exception=False):
+        """
+        Join the thread.  If target raised an exception, re-raise it.
+        Otherwise, return the value returned by target.
+
+        @param timeout: Timeout value to pass to threading.Thread.join().
+        @param suppress_exception: If True, don't re-raise the exception.
+        """
+        Thread.join(self, timeout)
+        try:
+            if self._e:
+                if not suppress_exception:
+                    # Because the exception was raised in another thread, we
+                    # need to explicitly insert the current context into it
+                    s = error.exception_context(self._e[1])
+                    s = error.join_contexts(error.get_context(), s)
+                    error.set_exception_context(self._e[1], s)
+                raise self._e[0], self._e[1], self._e[2]
+            else:
+                return self._retval
+        finally:
+            # Avoid circular references (join() may be called multiple times
+            # so we can't delete these)
+            self._e = None
+            self._retval = None
+
+
 @deprecated
 def run_bg(command):
     """Function deprecated. Please use BgJob class instead."""
diff --git a/client/tests/kvm/tests/migration_with_file_transfer.py 
b/client/tests/kvm/tests/migration_with_file_transfer.py
index 075148d..073b87e 100644
--- a/client/tests/kvm/tests/migration_with_file_transfer.py
+++ b/client/tests/kvm/tests/migration_with_file_transfer.py
@@ -56,13 +56,13 @@ def run_migration_with_file_transfer(test, params, env):
 
 error.context(transferring file to guest while migrating,
   logging.info)
-bg = virt_utils.Thread(vm.copy_files_to, (host_path, guest_path),
-  dict(verbose=True, timeout=transfer_timeout))
+bg = utils.InterruptedThread(vm.copy_files_to, (host_path, guest_path),
+ dict(verbose=True, timeout=transfer_timeout))
 run_and_migrate(bg)
 
 error.context(transferring file back to host while migrating,
   logging.info)
-bg = virt_utils.Thread(vm.copy_files_from,
+bg = utils.InterruptedThread(vm.copy_files_from,
   (guest_path, host_path_returned),
   dict(verbose=True, timeout=transfer_timeout))
 run_and_migrate(bg)
diff --git a/client/tests/kvm/tests/migration_with_reboot.py 
b/client/tests/kvm/tests/migration_with_reboot.py
index b291a83..6ee2da5 100644
--- a/client/tests/kvm/tests/migration_with_reboot.py
+++ 

Re: [Qemu-devel] [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-03-21 Thread Anthony Liguori

On 03/15/2012 06:46 AM, Avi Kivity wrote:

On 03/15/2012 01:25 PM, Jan Kiszka wrote:



There was such vm exit (KVM_EXIT_HYPERCALL), but it was deemed to be a
bad idea.


BTW, this would help a lot in emulating hypercalls of other hypervisors
(or of KVM's VAPIC in the absence of in-kernel irqchip - I had to jump
through hoops for that) in user space. Not all those hypercall handlers
actually have to reside in the KVM module.



That is true.  On the other hand the hypercall ABI might go to pieces if
there was no central implementation.


Just declare that outl 0x505 is a megaultracall and s/vmcall/outb/g and call it 
a day.


The performance difference between vmcall and outl is so tiny compared to the 
cost of dropping to userspace that it really doesn't matter which instruction is 
used.


Regards,

Anthony Liguori







Re: [Qemu-devel] [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-03-21 Thread Anthony Liguori

On 03/21/2012 11:25 AM, Avi Kivity wrote:

On 03/21/2012 06:18 PM, Corey Minyard wrote:



Look at drivers/char/ipmi/ipmi_msghandler.c. It has code to send panic
event over IPMI. The code is pretty complex. Of course if we are going to
implement something more complex than simple hypercall for panic
notification we better do something more interesting with it than just
saying panic happened, like sending stack traces on all cpus for
instance.


I doubt that's the best example, unfortunately.  The IPMI event log
has limited space and it has to be sent a little piece at a time since
each log entry is 14 bytes.  It just prints the panic string, nothing
else.  Not that it isn't useful, it has saved my butt before.

You have lots of interesting options with paravirtualization.  You
could, for instance, create a console driver that delivered all
console output efficiently through a hypercall.  That would be really
easy.  Or, as you mention, a custom way to deliver panic information.
Collecting information like stack traces would be harder to
accomplish, as I don't think there is currently a way to get it except
by sending it to printk.


That already exists; virtio-console (or serial console emulation) can do
the job.


I think the use case here is pretty straightforward: if the guest finds itself 
in a bad place, it wants to indicate that to the host.


We shouldn't rely on any device drivers or complex code.  It should be as close 
to a single instruction as possible that can run even if interrupts are disabled.


An out instruction fits this very well.  I think a simple protocol like:

inl PORT    -> returns a magic number indicating the presence of qemucalls
inl PORT+1  -> returns a bitmap of supported features

outl PORT+1 -> data reg1
outl PORT+2 -> data reg2
outl PORT+N -> data regN

outl PORT   -> qemucall of index value with arguments 1..N
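
A guest-side caller could then be nearly trivial, something like this
sketch (the port number, magic value and call index below are made-up
placeholders, not an actual ABI):

#include <asm/io.h>

#define QEMUCALL_PORT   0x505           /* hypothetical base port */
#define QEMUCALL_MAGIC  0x51454d55      /* hypothetical magic value */
#define QEMUCALL_PANIC  1               /* hypothetical call index */

static void qemucall_panic(void)
{
	/* probe: does the host implement qemucalls at all? */
	if (inl(QEMUCALL_PORT) != QEMUCALL_MAGIC)
		return;
	/* load argument register 1, then issue the call */
	outl(0 /* arg1: reserved */, QEMUCALL_PORT + 1);
	outl(QEMUCALL_PANIC, QEMUCALL_PORT);
}

No interrupts, no locks, no driver infrastructure needed.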

Regards,

Anthony Liguori



In fact the feature can be implemented 100% host side by searching for a
panic string signature in the console logs.





Re: [PATCH 12/38] powerpc/booke: Provide exception macros with interrupt name

2012-03-21 Thread Kumar Gala

On Mar 21, 2012, at 1:19 PM, Scott Wood wrote:

 On 03/21/2012 01:04 PM, Kumar Gala wrote:
 
 On Feb 28, 2012, at 6:09 PM, Alexander Graf wrote:
 
 From: Scott Wood scottw...@freescale.com
 
 DO_KVM will need to identify the particular exception type.
 
 There is an existing set of arbitrary numbers that Linux passes,
 but it's an undocumented mess that sort of corresponds to server/classic
 exception vectors but not really.
 
 So what do the new names correspond to?
 
 The names are Linux-defined.  The values are IVOR numbers.
 
 What header is defining MACHINE_CHECK, BOOKE_INTERRUPT_EXTERNAL, etc
 (asm/kvm_asm.h)?
 
 Yes, it's asm/kvm_asm.h at the moment.
 
 It's actually BOOKE_INTERRUPT_MACHINE_CHECK, etc.  The exception macros
 paste on BOOKE_INTERRUPT_ when they use it, to keep the macro users from
 having to wrap lines even more often.
 
 If so we really should move these out of
 asm/kvm_asm.h and into something a bit more appropriate.
 
 Yes, that was one of the things I had been planning to fix post-RFC.
 Any preference what header it should be in?  asm/reg_booke.h seems to be
 the least bad option of the existing headers -- or perhaps move
 head_booke.h to arch/powerpc/include/asm.

asm/reg_booke.h seems the least painful right now.  head_booke.h is only used 
on 32-bit so that's not the best choice at this point.  We could create an 
asm/head_booke.h in addition to the one in arch/powerpc/kernel/

 Long-term it would also be nice for this to replace, rather than
 supplement, the current numbers, though as Ben pointed out there are a
 number of places throughout the code that will have to be fixed for that
 to happen.

Yeah, not too worried about that.  More just wanting it to be clear what one 
has to 'update' if adding support for a new exception.

- k


Re: [PATCH 1/2] Isolation groups

2012-03-21 Thread Alex Williamson
On Sat, 2012-03-17 at 15:57 +1100, David Gibson wrote:
 On Fri, Mar 16, 2012 at 01:31:18PM -0600, Alex Williamson wrote:
  On Fri, 2012-03-16 at 14:45 +1100, David Gibson wrote:
   On Thu, Mar 15, 2012 at 02:15:01PM -0600, Alex Williamson wrote:
On Wed, 2012-03-14 at 20:58 +1100, David Gibson wrote:
 On Tue, Mar 13, 2012 at 10:49:47AM -0600, Alex Williamson wrote:
  On Wed, 2012-03-14 at 01:33 +1100, David Gibson wrote:
   On Mon, Mar 12, 2012 at 04:32:54PM -0600, Alex Williamson wrote:
+/*
+ * Add a device to an isolation group.  Isolation groups start empty and
+ * must be told about the devices they contain.  Expect this to be called
+ * from isolation group providers via notifier.
+ */
   
   Doesn't necessarily have to be from a notifier, particularly if 
   the
   provider is integrated into host bridge code.
  
  Sure, a provider could do this on it's own if it wants.  This just
  provides some infrastructure for a common path.  Also note that this
  helps to eliminate all the #ifdef CONFIG_ISOLATION in the provider. 
   Yet
  to be seen whether that can reasonably be the case once isolation 
  groups
  are added to streaming DMA paths.
 
 Right, but other than the #ifdef safety, which could be achieved more
 simply, I'm not seeing what benefit the infrastructure provides over
 directly calling the bus notifier function.  The infrastructure groups
 the notifiers by bus type internally, but AFAICT exactly one bus
 notifier call would become exactly one isolation notifier call, and
 the notifier callback itself would be almost identical.

I guess I don't see this as a fundamental design point of the proposal,
it's just a convenient way to initialize groups as a side-band addition
until isolation groups become a more fundamental part of the iommu
infrastructure.  If you want to do that level of integration in your
provider, do it and make the callbacks w/o the notifier.  If nobody ends
up using them, we'll remove them.  Maybe it will just end up being a
bootstrap.  In the typical case, yes, one bus notifier is one isolation
notifier.  It does however also allow one bus notifier to become
multiple isolation notifiers, and includes some filtering that would
just be duplicated if every provider decided to implement their own bus
notifier.
   
   Uh.. I didn't notice any filtering?  That's why I'm asking.
  
  Not much, but a little:
  
  +   switch (action) {
  +   case BUS_NOTIFY_ADD_DEVICE:
  +   if (!dev->isolation_group)
  +   blocking_notifier_call_chain(&notifier->notifier,
  +   ISOLATION_NOTIFY_ADD_DEVICE, dev);
  +   break;
  +   case BUS_NOTIFY_DEL_DEVICE:
  +   if (dev->isolation_group)
  +   blocking_notifier_call_chain(&notifier->notifier,
  +   ISOLATION_NOTIFY_DEL_DEVICE, dev);
  +   break;
  +   }
 
 
 Ah, I see, fair enough.
 
 A couple of tangential observations.  First, I suspect using
 BUS_NOTIFY_DEL_DEVICE is a very roundabout way of handling hot-unplug,
 it might be better to have an unplug callback in the group instead.

There's really one already.  Assuming the device is attached to a group
driver, the .remove entry point on the driver will get called first.  It
doesn't get to return error, but it can block.  The DEL_DEVICE will only
happen after the group driver has relinquished the device, or at least
returned from remove().  For a device with no driver, the group would
only learn about it through the notifier, but it probably doesn't need
anything more direct.

 Second, I don't think aborting the call chain early for hot-plug is
 actually a good idea.  I can't see a clear guarantee on the order, so
 individual providers couldn't rely on that short-cut behaviour.  Which
 means that if two providers would have attempted to claim the same
 device, something is seriously wrong and we should probably report
 that.

Yeah, that seems like a reasonable safety check.

  ...
   So, somewhere, I think we need a fallback path, but I'm not sure
   exactly where.  If an isolation provider doesn't explicitly put a
   device into a group, the device should go into the group of its 
   parent
   bridge.  This covers the case of a bus with IOMMU which has below 
   it a
   bridge to a different type of DMA capable bus (which the IOMMU 
   isn't
   explicitly aware of).  DMAs from devices on the subordinate bus 
   can be
   translated by the top-level IOMMU (assuming it sees them as coming
   from the bridge), but they all need to be treated as one group.
  
  Why would the top level IOMMU provider not set the isolation group in
  this case?
 
 Because it knows 

Re: [Qemu-devel] [PATCH v4 5/7] RTC:Add RTC update-ended interrupt support

2012-03-21 Thread Paolo Bonzini
On 21/03/2012 17:54, Stefano Stabellini wrote:
  
  No, you need to set UF in case the code observes it without actually
  enabling interrupt delivery on the ISA bus.
  
 Well, if it is just about updating UF, can we do it only when the user
 reads REG_C?

Perhaps, but for AF it is much harder.  Note that if the guest does not
care, this will run just once.

Paolo


Re: [PATCH 1/2] KVM: x86 emulator: MMX support

2012-03-21 Thread Avi Kivity
On 03/21/2012 08:40 PM, Avi Kivity wrote:
  
 +static int flush_pending_x87_faults(struct x86_emulate_ctxt *ctxt)
 +{
 +	bool fault = false;
 +
 +	ctxt->ops->get_fpu(ctxt);
 +	asm volatile("1: fwait \n\t"
 +		     "2: \n\t"
 +		     ".pushsection .fixup,\"ax\" \n\t"
 +		     "3: \n\t"
 +		     "movb $1, %[fault] \n\t"
 +		     "jmp 2b \n\t"
 +		     ".popsection \n\t"
 +		     _ASM_EXTABLE(1b, 3b)
 +		     : [fault]"=rm"(fault));


"=" tells gcc that the asm unconditionally writes 'fault', so it
skips the initialization.  Use "+" instead, i.e. [fault]"+rm"(fault).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



[PATCH v2 uq/master] kvm: Drop unused kvm_pit_in_kernel

2012-03-21 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

This is now implied by kvm_irqchip_in_kernel.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---

Rebased over latest uq/master.

 kvm-all.c  |6 --
 kvm-stub.c |6 --
 kvm.h  |2 --
 3 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index ba2cee1..3a1fa40 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -75,7 +75,6 @@ struct KVMState
 #ifdef KVM_CAP_SET_GUEST_DEBUG
 struct kvm_sw_breakpoint_head kvm_sw_breakpoints;
 #endif
-int pit_in_kernel;
 int pit_state2;
 int xsave, xcrs;
 int many_ioeventfds;
@@ -198,11 +197,6 @@ static void kvm_reset_vcpu(void *opaque)
 kvm_arch_reset_vcpu(env);
 }
 
-int kvm_pit_in_kernel(void)
-{
-return kvm_state->pit_in_kernel;
-}
-
 int kvm_init_vcpu(CPUArchState *env)
 {
 KVMState *s = kvm_state;
diff --git a/kvm-stub.c b/kvm-stub.c
index 69a1228..ec8fe74 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -16,12 +16,6 @@
 #include gdbstub.h
 #include kvm.h
 
-int kvm_pit_in_kernel(void)
-{
-return 0;
-}
-
-
 int kvm_init_vcpu(CPUArchState *env)
 {
 return -ENOSYS;
diff --git a/kvm.h b/kvm.h
index 330f17b..55f107d 100644
--- a/kvm.h
+++ b/kvm.h
@@ -83,8 +83,6 @@ int kvm_update_guest_debug(CPUArchState *env, unsigned long reinject_trap);
 int kvm_set_signal_mask(CPUArchState *env, const sigset_t *sigset);
 #endif
 
-int kvm_pit_in_kernel(void);
-
 int kvm_on_sigbus_vcpu(CPUArchState *env, int code, void *addr);
 int kvm_on_sigbus(int code, void *addr);
 


[RFC][PATCH 0/2] uq/master: Basic MSI support for in-kernel irqchip mode

2012-03-21 Thread Jan Kiszka
About half a year ago, when I posted my first attempt to refactor MSI
for KVM support, we came to the conclusion that it might suffice to do
transparent dynamic routing for user-space injected MSI messages. These
two patches now implement such an approach for upstream.

As QEMU does not yet include irqfd support (for vhost) or pci device
assignment, this is already enough to enable MSI over the in-kernel
irqchip. Still, this is only RFC as it is just lightly tested and should
primarily collect feedback regarding the direction. If it's fine, I'd
like to base further qemu-kvm refactorings and upstream preparations on
top of such a series.

Also, I'd like to reanimate my KVM patch to provide direct MSI injection
in future kernels so that we do not need to take this long path here
forever.

Jan Kiszka (2):
  kvm: Introduce basic MSI support in-kernel irqchips
  KVM: x86: Wire up MSI support for in-kernel irqchip

 hw/apic.c |3 +
 hw/kvm/apic.c |   33 ++-
 hw/pc.c   |5 --
 kvm-all.c |  171 -
 kvm.h |1 +
 5 files changed, 205 insertions(+), 8 deletions(-)

-- 
1.7.3.4



[RFC][PATCH 2/2] KVM: x86: Wire up MSI support for in-kernel irqchip

2012-03-21 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Catch writes to the MSI MMIO region in the KVM APIC and forward them to
the kernel. Provided the kernel supports GSI routing, this allows MSI
support to be enabled also in in-kernel irqchip mode.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/apic.c |3 +++
 hw/kvm/apic.c |   33 +++--
 hw/pc.c   |5 -
 3 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index 4eeaf88..5fbf01c 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -19,6 +19,7 @@
 #include "apic_internal.h"
 #include "apic.h"
 #include "ioapic.h"
+#include "msi.h"
 #include "host-utils.h"
 #include "trace.h"
 #include "pc.h"
@@ -862,6 +863,8 @@ static void apic_init(APICCommonState *s)
 
 s->timer = qemu_new_timer_ns(vm_clock, apic_timer, s);
 local_apics[s->idx] = s;
+
+msi_supported = true;
 }
 
 static void apic_class_init(ObjectClass *klass, void *data)
diff --git a/hw/kvm/apic.c b/hw/kvm/apic.c
index ffe7a52..7d83b1a 100644
--- a/hw/kvm/apic.c
+++ b/hw/kvm/apic.c
@@ -10,6 +10,7 @@
  * See the COPYING file in the top-level directory.
  */
 #include "hw/apic_internal.h"
+#include "hw/msi.h"
 #include "kvm.h"
 
 static inline void kvm_apic_set_reg(struct kvm_lapic_state *kapic,
@@ -145,10 +146,38 @@ static void kvm_apic_external_nmi(APICCommonState *s)
 run_on_cpu(s->cpu_env, do_inject_external_nmi, s);
 }
 
+static uint64_t kvm_apic_mem_read(void *opaque, target_phys_addr_t addr,
+  unsigned size)
+{
+return -1U;
+}
+
+static void kvm_apic_mem_write(void *opaque, target_phys_addr_t addr,
+   uint64_t data, unsigned size)
+{
+int ret;
+
+ret = kvm_irqchip_send_msi(kvm_state, addr, data);
+if (ret < 0) {
+fprintf(stderr, "KVM: injection failed, MSI lost (%s)\n",
+strerror(-ret));
+}
+}
+
+static const MemoryRegionOps kvm_apic_io_ops = {
+.read = kvm_apic_mem_read,
+.write = kvm_apic_mem_write,
+.endianness = DEVICE_NATIVE_ENDIAN,
+};
+
 static void kvm_apic_init(APICCommonState *s)
 {
-memory_region_init_reservation(&s->io_memory, "kvm-apic-msi",
-   MSI_SPACE_SIZE);
+memory_region_init_io(&s->io_memory, &kvm_apic_io_ops, s, "kvm-apic-msi",
+  MSI_SPACE_SIZE);
+
+if (kvm_has_gsi_routing()) {
+msi_supported = true;
+}
 }
 
 static void kvm_apic_class_init(ObjectClass *klass, void *data)
diff --git a/hw/pc.c b/hw/pc.c
index 83a1b5b..fab620a 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -907,11 +907,6 @@ static DeviceState *apic_init(void *env, uint8_t apic_id)
 apic_mapped = 1;
 }
 
-/* KVM does not support MSI yet. */
-if (!kvm_irqchip_in_kernel()) {
-msi_supported = true;
-}
-
 return dev;
 }
 
-- 
1.7.3.4



[RFC][PATCH 1/2] kvm: Introduce basic MSI support in-kernel irqchips

2012-03-21 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

This patch basically adds kvm_irqchip_send_msi, a service for sending
arbitrary MSI messages to KVM's in-kernel irqchip models.

As the current KVM API requires us to establish a static route from a
pseudo GSI to the target MSI message and inject the MSI via toggling
that GSI, we need to play some tricks to make this unfortunate
interface transparent. We create those routes on demand and keep them
in a hash table. Succeeding messages can then search for an existing
route in the table first and reuse it whenever possible. If we should
run out of limited GSIs, we simply flush the table and rebuild it as
messages are sent.

This approach is rather simple and could be optimized further. However,
it would be more efficient to enhance the KVM API so that we do not need
this clumsy dynamic routing in future kernels.
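
Roughly, the send path described above boils down to the following
(sketch only: pseudo-GSI allocation, error handling and the address_hi
word are elided, and kvm_set_irq() stands for whatever helper toggles
the GSI; see the patch for the real code):

static int kvm_send_msi_sketch(KVMState *s, uint32_t addr, uint32_t data)
{
    unsigned int hash = kvm_hash_msi(data);
    KVMMSIRoute *route;

    QTAILQ_FOREACH(route, &s->msi_hashtab[hash], entry) {
        if (route->kroute.u.msi.address_lo == addr &&
            route->kroute.u.msi.data == data) {
            break;      /* reuse the cached pseudo-GSI route */
        }
    }
    if (!route) {
        route = g_malloc0(sizeof(*route));
        route->kroute.gsi = kvm_get_pseudo_gsi(s); /* may flush the table */
        route->kroute.type = KVM_IRQ_ROUTING_MSI;
        route->kroute.u.msi.address_lo = addr;
        route->kroute.u.msi.data = data;
        kvm_add_routing_entry(s, &route->kroute);
        kvm_irqchip_commit_routes(s);
        QTAILQ_INSERT_TAIL(&s->msi_hashtab[hash], route, entry);
    }
    /* toggle the pseudo GSI to deliver the message */
    return kvm_set_irq(s, route->kroute.gsi, 1);
}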

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 kvm-all.c |  171 -
 kvm.h |1 +
 2 files changed, 171 insertions(+), 1 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 3a1fa40..11ed19c 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -48,6 +48,8 @@
 do { } while (0)
 #endif
 
+#define KVM_MSI_HASHTAB_SIZE    256
+
 typedef struct KVMSlot
 {
 target_phys_addr_t start_addr;
@@ -59,6 +61,11 @@ typedef struct KVMSlot
 
 typedef struct kvm_dirty_log KVMDirtyLog;
 
+typedef struct KVMMSIRoute {
+struct kvm_irq_routing_entry kroute;
+QTAILQ_ENTRY(KVMMSIRoute) entry;
+} KVMMSIRoute;
+
 struct KVMState
 {
 KVMSlot slots[32];
@@ -87,6 +94,7 @@ struct KVMState
 int nr_allocated_irq_routes;
 uint32_t *used_gsi_bitmap;
 unsigned int max_gsi;
+QTAILQ_HEAD(msi_hashtab, KVMMSIRoute) msi_hashtab[KVM_MSI_HASHTAB_SIZE];
 #endif
 };
 
@@ -862,9 +870,16 @@ static void set_gsi(KVMState *s, unsigned int gsi)
 s->used_gsi_bitmap[gsi / 32] |= 1U << (gsi % 32);
 }
 
+static void clear_gsi(KVMState *s, unsigned int gsi)
+{
+assert(gsi < s->max_gsi);
+
+s->used_gsi_bitmap[gsi / 32] &= ~(1U << (gsi % 32));
+}
+
 static void kvm_init_irq_routing(KVMState *s)
 {
-int gsi_count;
+int gsi_count, i;
 
 gsi_count = kvm_check_extension(s, KVM_CAP_IRQ_ROUTING);
 if (gsi_count > 0) {
@@ -884,6 +899,10 @@ static void kvm_init_irq_routing(KVMState *s)
 s->irq_routes = g_malloc0(sizeof(*s->irq_routes));
 s->nr_allocated_irq_routes = 0;
 
+for (i = 0; i < KVM_MSI_HASHTAB_SIZE; i++) {
+QTAILQ_INIT(&s->msi_hashtab[i]);
+}
+
 kvm_arch_init_irq_routing(s);
 }
 
@@ -914,6 +933,54 @@ static void kvm_add_routing_entry(KVMState *s,
 set_gsi(s, entry->gsi);
 }
 
+static void kvm_remove_routing_entry(KVMState *s,
+                                     struct kvm_irq_routing_entry *entry)
+{
+    struct kvm_irq_routing_entry *e;
+    int gsi = entry->gsi;
+    int i;
+
+    for (i = 0; i < s->irq_routes->nr; ++i) {
+        e = &s->irq_routes->entries[i];
+        if (e->type == entry->type && e->gsi == gsi) {
+            switch (e->type) {
+            case KVM_IRQ_ROUTING_IRQCHIP:
+                if (e->u.irqchip.irqchip == entry->u.irqchip.irqchip &&
+                    e->u.irqchip.pin == entry->u.irqchip.pin) {
+                    goto found;
+                }
+                break;
+            case KVM_IRQ_ROUTING_MSI:
+                if (e->u.msi.address_lo == entry->u.msi.address_lo &&
+                    e->u.msi.address_hi == entry->u.msi.address_hi &&
+                    e->u.msi.data == entry->u.msi.data) {
+                    goto found;
+                }
+                break;
+            default:
+                break;
+            }
+        }
+    }
+    /* route not found */
+    return;
+
+found:
+    s->irq_routes->nr--;
+    *e = s->irq_routes->entries[s->irq_routes->nr];
+
+    /* If there are no other users of this GSI, release it. */
+    for (i = 0; i < s->irq_routes->nr; i++) {
+        e = &s->irq_routes->entries[i];
+        if (e->gsi == gsi) {
+            break;
+        }
+    }
+    if (i == s->irq_routes->nr) {
+        clear_gsi(s, gsi);
+    }
+}
+
 void kvm_irqchip_add_route(KVMState *s, int irq, int irqchip, int pin)
 {
 struct kvm_irq_routing_entry e;
@@ -932,11 +999,113 @@ int kvm_irqchip_commit_routes(KVMState *s)
 return kvm_vm_ioctl(s, KVM_SET_GSI_ROUTING, s->irq_routes);
 }
 
+static unsigned int kvm_hash_msi(uint32_t data)
+{
+/* This is optimized for IA32 MSI layout. However, no other arch shall
+ * repeat the mistake of not providing a direct MSI injection API. */
+return data & 0xff;
+}
+
+static void kvm_flush_dynamic_msi_routes(KVMState *s)
+{
+KVMMSIRoute *route, *next;
+unsigned int hash;
+
+for (hash = 0; hash < KVM_MSI_HASHTAB_SIZE; hash++) {
+QTAILQ_FOREACH_SAFE(route, &s->msi_hashtab[hash], entry, next) {
+kvm_remove_routing_entry(s, &route->kroute);
+QTAILQ_REMOVE(&s->msi_hashtab[hash], route, entry);
+}
+}
+}
+
+static int kvm_get_pseudo_gsi(KVMState *s)
+{
+uint32_t 

Re: [PATCH v4 1/2] qemu-socket: change inet_connect() to to support nonblock socket

2012-03-21 Thread Michael Roth
On Mon, Mar 19, 2012 at 10:11:54PM +0800, Amos Kong wrote:
 Change inet_connect(const char *str, int socktype) to
 inet_connect(const char *str, bool block, int *sock_err);
 socktype is unused, block selects whether the socket should be
 blocking or nonblocking, and sock_err is used to store the socket error.
 
 Connecting is considered successful for a nonblocking socket when the
 following errors are returned:
   -EINPROGRESS or -EWOULDBLOCK
   -WSAEALREADY or -WSAEINVAL (win32)
 
 Also change the wrap function inet_connect_opts(QemuOpts *opts)
 to inet_connect_opts(QemuOpts *opts, int *sock_err).
 
 Add a bool entry (block) to dummy_opts to tag the blocking mode.
 Change nbd and vnc to use the new interface.
 
 Signed-off-by: Amos Kong ak...@redhat.com

Looks good, and non-blocking support for qemu-socket interfaces is pretty
useful in and of itself. One general suggestion I would make is to use Error to
pass back errors to the callers rather than an int*.
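
e.g., signatures along these lines (just a sketch of the idea, not a
definitive interface):

int inet_connect_opts(QemuOpts *opts, Error **errp);
int inet_connect(const char *str, bool block, Error **errp);

That way callers that don't care can pass NULL, and the ones that do get
a proper error object instead of a raw errno value.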

Some additional comments below.

 ---
  nbd.c  |2 +-
  qemu-char.c|2 +-
  qemu-sockets.c |   73 
 
  qemu_socket.h  |4 ++-
  ui/vnc.c   |2 +-
  5 files changed, 62 insertions(+), 21 deletions(-)
 
 diff --git a/nbd.c b/nbd.c
 index 567e94e..ad4de06 100644
 --- a/nbd.c
 +++ b/nbd.c
 @@ -146,7 +146,7 @@ int tcp_socket_outgoing(const char *address, uint16_t 
 port)
 
  int tcp_socket_outgoing_spec(const char *address_and_port)
  {
 -return inet_connect(address_and_port, SOCK_STREAM);
 +return inet_connect(address_and_port, true, NULL);
  }
 
  int tcp_socket_incoming(const char *address, uint16_t port)
 diff --git a/qemu-char.c b/qemu-char.c
 index bb9e3f5..d3543ea 100644
 --- a/qemu-char.c
 +++ b/qemu-char.c
 @@ -2443,7 +2443,7 @@ static CharDriverState *qemu_chr_open_socket(QemuOpts 
 *opts)
  if (is_listen) {
  fd = inet_listen_opts(opts, 0);
  } else {
 -fd = inet_connect_opts(opts);
 +fd = inet_connect_opts(opts, NULL);
  }
  }
  if (fd  0) {
 diff --git a/qemu-sockets.c b/qemu-sockets.c
 index 6bcb8e3..8ed45f8 100644
 --- a/qemu-sockets.c
 +++ b/qemu-sockets.c
 @@ -51,6 +51,9 @@ static QemuOptsList dummy_opts = {
  },{
 .name = "ipv6",
 .type = QEMU_OPT_BOOL,
+},{
+.name = "block",
 +.type = QEMU_OPT_BOOL,
  },
  { /* end if list */ }
  },
 @@ -194,14 +197,15 @@ listen:
  return slisten;
  }
 
 -int inet_connect_opts(QemuOpts *opts)
 +int inet_connect_opts(QemuOpts *opts, int *sock_err)
  {
  struct addrinfo ai,*res,*e;
  const char *addr;
  const char *port;
  char uaddr[INET6_ADDRSTRLEN+1];
  char uport[33];
 -int sock,rc;
 +int sock, rc, err;
 +bool block;
 
 memset(&ai, 0, sizeof(ai));
  ai.ai_flags = AI_CANONNAME | AI_ADDRCONFIG;
 @@ -210,9 +214,11 @@ int inet_connect_opts(QemuOpts *opts)
 
 addr = qemu_opt_get(opts, "host");
 port = qemu_opt_get(opts, "port");
+block = qemu_opt_get_bool(opts, "block", 0);
  if (addr == NULL || port == NULL) {
 fprintf(stderr, "inet_connect: host and/or port not specified\n");
 -return -1;
 +err = -EINVAL;
 +goto err;
  }
 
 if (qemu_opt_get_bool(opts, "ipv4", 0))
 @@ -224,7 +230,8 @@ int inet_connect_opts(QemuOpts *opts)
 if (0 != (rc = getaddrinfo(addr, port, &ai, &res))) {
 fprintf(stderr, "getaddrinfo(%s,%s): %s\n", addr, port,
 gai_strerror(rc));
 - return -1;
 +err = -EINVAL;
 +goto err;
  }
 
 for (e = res; e != NULL; e = e->ai_next) {
 @@ -241,21 +248,52 @@ int inet_connect_opts(QemuOpts *opts)
  continue;
  }
 setsockopt(sock,SOL_SOCKET,SO_REUSEADDR,(void*)&on,sizeof(on));
 -
 +if (!block) {
 +socket_set_nonblock(sock);
 +}
  /* connect to peer */
 -if (connect(sock,e->ai_addr,e->ai_addrlen) < 0) {
 -if (NULL == e->ai_next)
 -fprintf(stderr, "%s: connect(%s,%s,%s,%s): %s\n", __FUNCTION__,
 -inet_strfamily(e->ai_family),
 -e->ai_canonname, uaddr, uport, strerror(errno));
 -closesocket(sock);
 -continue;
 +do {
 +err = 0;
 +if (connect(sock, e->ai_addr, e->ai_addrlen) < 0) {
 +err = -socket_error();
 +if (block) {
 +break;
 +}

Can still get an EINTR even in blocking mode, is breaking in that case
intentional?

 +}
 +} while (err == -EINTR);
 +
 +if (err >= 0) {
 +goto success;
 +} else if (!block && (err == -EINPROGRESS || err == -EWOULDBLOCK)) {
 +goto success;

EWOULDBLOCK should actually be an error:

EWOULDBLOCK == EAGAIN:
   No  more  free  local  ports  or insufficient entries in the routing
   cache.  For AF_INET see the  

RE: [Qemu-devel] [PATCH v4 5/7] RTC:Add RTC update-ended interrupt support

2012-03-21 Thread Zhang, Yang Z
 -Original Message-
 From: Stefano Stabellini [mailto:stefano.stabell...@eu.citrix.com]
 Sent: Wednesday, March 21, 2012 2:04 AM
 
 On Mon, 19 Mar 2012, Zhang, Yang Z wrote:
  Use a timer to emulate the update cycle. When the update cycle ends and
 UIE is set, raise an interrupt. The timer runs only when UF or AF is
 cleared.
 
 The idea is that if the user requests the update-ended interrupt (UIE)
 we set up a timer to inject it at the right time into the guest and
 another timer to update the UIP bit, correct?
No, the timer runs whenever UF or AF is cleared, not only when UIE or AIE
is set.

 But do we actually need the second timer? Why can't we just update the
 UIP bit whenever the user tries to read reg A, as we do when UIE is not set?
The purpose of using two timers is to keep UF, AF and UIP synchronized. The
user can poll UIP to check the UF and AF bits. If we used a timer to track
the UF/AF bits but checked UIP some other way, then, since the timer fires
with a delay, we would hit the following problem: UIP is already cleared,
but due to the timer delay the UF/AF bit is not yet set. So we need to check
them at the same point. Although we could update UF/AF when reading them,
the logic is too complicated, especially for the AF bit.
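
For illustration only, deriving both bits from one shared deadline closes
that window (the field name here is made up, not from the patch):

/* sketch: UIP computed against the same deadline that will set UF */
static bool rtc_update_in_progress(RTCState *s, int64_t now)
{
    /* UIP rises 244us before the update cycle that sets UF */
    return now >= s->next_update_time - 244000 && now < s->next_update_time;
}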

 
 Also I am not sure it is a great idea to rename rtc_timer_update into
 periodic_timer_update in the same patch where you introduce this new
 logic.
Ok, it make sense.

best regards
yang


RE: [PATCH v4 0/7] RTC: New logic to emulate RTC

2012-03-21 Thread Zhang, Yang Z
 -Original Message-
 From: Paolo Bonzini [mailto:pbonz...@redhat.com]
 I attach a patch that fixes some problems with divider reset and in
 general simplifies the logic.  Even with the patch, however, I still see
 failures in my test case unfortunately.  Probably there are rounding
 errors somewhere.
Actually, I also see some failures during testing. Most of them fail the
244us update-cycle check. Since we are in an emulated environment, we cannot
ensure it is always greater than 244us, so I think that should not count as
a bug. There may also be some other bugs; I will do more testing.

 Also (minor change) you need to use rtc_clock instead of host_clock.
 
 I'm afraid that we're hitting a wall. :(  The problem I have is that the
 logic in your patch is very complex and, without understanding it well,
 I'm afraid we'll never be able to fix it completely (while the old
 inefficient one is buggy but it _can_ be fixed).
 
 By the way in general the SET bit and the divider should be handled in
 the same way, because both stop the updates.  That is, in general there
 should be a function like
 
 static inline bool rtc_running(RTCState *s)
 {
     return (!(s->cmos_data[RTC_REG_B] & REG_B_SET) &&
             (s->cmos_data[RTC_REG_A] & 0x70) == 0x20);
 }
 
 that is used instead of testing REG_B_SET.  I pushed some work along
 these lines at git://github.com/bonzini/qemu.git (branch rtc-intel), but
 it does not really improve the situation with respect to rounding errors
 so the bug must be elsewhere.
You are right, I missed this logic. Will add it in next version.

Best regrads
yang


Re: [PATCH v4 0/7] RTC: New logic to emulate RTC

2012-03-21 Thread Paolo Bonzini
On 22/03/2012 01:23, Zhang, Yang Z wrote:
 Actually, I also see some failures during testing. And most of them
 are fail to pass the 244us update cycle checking. Since we are in
 emulation environment, we cannot ensure it always greater than 244us.
 So I think it should not be a bug.

Yes, that's fine.  But I also see some cases in which the first update
cycle after divider reset is not reported.  The second, 1.5 seconds
after divider reset, is fine.

Paolo


Re: [Qemu-devel] [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-03-21 Thread Wen Congyang
At 03/22/2012 03:19 AM, Anthony Liguori Wrote:
 On 03/21/2012 11:25 AM, Avi Kivity wrote:
 On 03/21/2012 06:18 PM, Corey Minyard wrote:

 Look at drivers/char/ipmi/ipmi_msghandler.c. It has code to send panic
 event over IPMI. The code is pretty complex. Of course if we are going to
 implement something more complex than simple hypercall for panic
 notification we better do something more interesting with it than just
 saying panic happened, like sending stack traces on all cpus for
 instance.

 I doubt that's the best example, unfortunately.  The IPMI event log
 has limited space and it has to be sent a little piece at a time since
 each log entry is 14 bytes.  It just prints the panic string, nothing
 else.  Not that it isn't useful, it has saved my butt before.

 You have lots of interesting options with paravirtualization.  You
 could, for instance, create a console driver that delivered all
 console output efficiently through a hypercall.  That would be really
 easy.  Or, as you mention, a custom way to deliver panic information.
 Collecting information like stack traces would be harder to
 accomplish, as I don't think there is currently a way to get it except
 by sending it to printk.

 That already exists; virtio-console (or serial console emulation) can do
 the job.
 
 I think the use case here is pretty straight forward: if the guest finds
 itself in bad place, it wants to indicate that to the host.
 
 We shouldn't rely on any device drivers or complex code.  It should be
 as close to a single instruction as possible that can run even if
 interrupts are disabled.
 
 An out instruction fits this very well.  I think a simple protocol like:

This solution is simpler than using virtio-serial.

 
 inl PORT - returns a magic number indicating the presence of qemucalls

I do not understand this instruction's purpose.

 inl PORT+1 - returns a bitmap of supported features

Hmm, we can execute this instruction when the guest starts. If the userspace
does not process the panicked event, there is no need to notify it.

 
 outl PORT+1 - data reg1
 outl PORT+2 - data reg2
 outl PORT+N - data regN

We can get the register values from the VMCS. So there is no need to tell
the register values to the host.

If we decide to avoid touching hypervisor, I agree with this solution.

Thanks
Wen Congyang
 
 outl PORT - qemucall of index value with arguments 1..N
 
 Regards,
 
 Anthony Liguori
 

 In fact the feature can be implemented 100% host side by searching for a
 panic string signature in the console logs.

 
 -- 
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 



Re: vhost question

2012-03-21 Thread Steve Glass



Just some further information relating to my earlier question
concerning vhost and virtio.

I'm using virtio to implement an emulated mac80211 device in the
guest. A simple network simulation will be used to control delivery of
frames between guests and for this I am using the vhost approach.

A simple first-cut attempt at the tx and rx kick handlers is given
below. When the guest transmits frames, the vhost's TX kick handler is
executed and copies the buffers onto a queue for the intended
recipient(s). When the vhost's RX kick handler is run it copies the
buffer from the queue and notifies the client that the buffers have
been used.

The problem is that if there are no frames in the queue when the guest
rx kick handler runs then it has to exit and I have to arrange that it
runs again. That's done in the current prototype by having the guests
poll using a timer - which is ugly and inefficient. Can I get the
vhost tx kick handler to wake the appropriate vhost rx kick handler?
How can I achieve this?

Many thanks,

Steve

static void
handle_rx(struct vhost_work *work)
{
	int n;
	unsigned out, in, frames = 0;	/* frames must start at zero */
	struct transmission *t;
	struct vhost_poll *p = container_of(work, struct vhost_poll, work);
	struct vhost_virtqueue *vq =
		container_of(p, struct vhost_virtqueue, poll);
	struct vhost_node *node =
		container_of(vq, struct vhost_node, vqs[WLAN_VQ_RX]);
	struct vhost_dev *dev = node->vdev;

	mutex_lock(&vq->mutex);
	vhost_disable_notify(dev, vq);
	while (!queue_empty(node->rxq)) {
		n = vhost_get_vq_desc(dev,
				      vq,
				      vq->iov,
				      ARRAY_SIZE(vq->iov),
				      &out,
				      &in,
				      NULL,
				      NULL);
		if (0 > n || n == vq->num)
			break;
		t = queue_pop(node->rxq);
		BUG_ON(copy_to_user(vq->iov[0].iov_base, t->buf,
				    t->buf_sz));
		vq->iov[0].iov_len = t->buf_sz;
		vhost_add_used(vq, n, out);
		transmission_free(t);
		++frames;
	}
	if (frames)
		vhost_signal(dev, vq);
	vhost_enable_notify(dev, vq);
	mutex_unlock(&vq->mutex);
}

static void
handle_tx(struct vhost_work *work)
{
    int n;
    unsigned out, in;
    struct transmission *t;
    struct vhost_node *receiver;

    struct vhost_poll *p =
        container_of(work, struct vhost_poll, work);
    struct vhost_virtqueue *vq =
        container_of(p, struct vhost_virtqueue, poll);
    struct vhost_node *w =
        container_of(vq, struct vhost_node, vqs[WLAN_VQ_TX]);
    struct vhost_dev *dev = w->vdev;

    mutex_lock(&vq->mutex);
    do {
        vhost_disable_notify(dev, vq);
        n = vhost_get_vq_desc(dev,
                              vq,
                              vq->iov,
                              ARRAY_SIZE(vq->iov),
                              &out,
                              &in,
                              NULL,
                              NULL);
        while (n >= 0 && n != vq->num) {
            receiver = net_get_receiver(w);
            if ((receiver) && (t = transmission_alloc())) {
                BUG_ON(copy_from_user(t->buf,
                                      vq->iov[1].iov_base,
                                      vq->iov[1].iov_len));
                t->buf_sz = vq->iov[1].iov_len;
                queue_push(&receiver->rxq, t);
                // ToDo: kick receiver's handle_rx
            }
            vhost_add_used(vq, n, out);
            n = vhost_get_vq_desc(dev,
                                  vq,
                                  vq->iov,
                                  ARRAY_SIZE(vq->iov),
                                  &out,
                                  &in,
                                  NULL,
                                  NULL);
        }
        vhost_signal(dev, vq);
    } while (vhost_enable_notify(dev, vq));
    mutex_unlock(&vq->mutex);
}
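
A possible answer to the ToDo above, sketched under the assumption that
each virtqueue's kick handler sits behind a vhost_poll initialized with
vhost_poll_init() against that handler: the TX path can queue the
receiver's RX poll work with vhost_poll_queue(), and the receiver's
vhost worker thread will then run its handle_rx() without any timer:

/* Sketch only: wake the receiver's RX kick handler from the TX path.
 * Assumes receiver->vqs[WLAN_VQ_RX].poll was initialized with
 * vhost_poll_init(..., handle_rx, ...), so queueing the poll work
 * makes the receiver's vhost worker thread run handle_rx(). */
static void
kick_receiver_rx(struct vhost_node *receiver)
{
    vhost_poll_queue(&receiver->vqs[WLAN_VQ_RX].poll);
}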


RE: [PATCH v4 0/7] RTC: New logic to emulate RTC

2012-03-21 Thread Zhang, Yang Z
 -Original Message-
 From: Paolo Bonzini [mailto:pbonz...@redhat.com]
 
 On 22/03/2012 01:23, Zhang, Yang Z wrote:
  Actually, I also see some failures during testing, and most of them
  fail to pass the 244us update cycle check. Since we are in an
  emulation environment, we cannot ensure it is always greater than
  244us, so I think it should not be a bug.
 
 Yes, that's fine.  But I also see some cases in which the first update
 cycle after divider reset is not reported.  The second, 1.5 seconds
 after divider reset, is fine.
I think your patch can fix it. Right?

Best regards
yang


Re: [PATCH v4 0/7] RTC: New logic to emulate RTC

2012-03-21 Thread Paolo Bonzini
On 22/03/2012 04:03, Zhang, Yang Z wrote:
  Yes, that's fine.  But I also see some cases in which the first update
  cycle after divider reset is not reported.  The second, 1.5 seconds
  after divider reset, is fine.
 I think your patch can fix it. Right?

No. :(  It makes it a lot less likely, but it still occurs every 2-3 runs.

Paolo


RE: [PATCH v4 0/7] RTC: New logic to emulate RTC

2012-03-21 Thread Zhang, Yang Z
 -Original Message-
 From: Paolo Bonzini [mailto:pbonz...@redhat.com]
 Sent: Thursday, March 22, 2012 11:05 AM
 To: Zhang, Yang Z
 Cc: qemu-de...@nongnu.org; aligu...@us.ibm.com; kvm@vger.kernel.org
 Subject: Re: [PATCH v4 0/7] RTC: New logic to emulate RTC
 
 On 22/03/2012 04:03, Zhang, Yang Z wrote:
   Yes, that's fine.  But I also see some cases in which the first update
   cycle after divider reset is not reported.  The second, 1.5 seconds
   after divider reset, is fine.
  I think your patch can fix it. Right?
 
 No. :(  It makes it a lot less likely, but it still occurs every 2-3 runs.
Ok, I will look into it.

best regards
yang


Re: [PATCH v4 1/2] qemu-socket: change inet_connect() to support nonblock socket

2012-03-21 Thread Amos Kong
- Original Message -
 On Mon, Mar 19, 2012 at 10:11:54PM +0800, Amos Kong wrote:
  Change inet_connect(const char *str, int socktype) to
  inet_connect(const char *str, bool block, int *sock_err);
  socktype is unused, block selects whether the socket is set to
  blocking or non-blocking mode, and sock_err is used to return the
  socket error.
  
  The connect is considered successful for a nonblock socket when the
  following errors are returned:
    -EINPROGRESS or -EWOULDBLOCK
    -WSAEALREADY or -WSAEINVAL (win32)
  
  Also change the wrapper function inet_connect_opts(QemuOpts *opts)
  to inet_connect_opts(QemuOpts *opts, int *sock_err).
  
  Add a bool entry (block) to dummy_opts to tag the blocking mode.
  Change nbd and vnc to use the new interface.
  
  Signed-off-by: Amos Kong ak...@redhat.com
 
 Looks good, and non-blocking support for qemu-socket interfaces is
 pretty
 useful in and of itself. One general suggestion I would make is to
 use Error to
 pass back errors to the callers rather than an int*.

I have implemented it, thanks

 Some additional comments below.
 
  ---
   nbd.c  |2 +-
   qemu-char.c|2 +-
   qemu-sockets.c |   73
   
   qemu_socket.h  |4 ++-
   ui/vnc.c   |2 +-
   5 files changed, 62 insertions(+), 21 deletions(-)
  
  diff --git a/nbd.c b/nbd.c
  index 567e94e..ad4de06 100644
  --- a/nbd.c
  +++ b/nbd.c
  @@ -146,7 +146,7 @@ int tcp_socket_outgoing(const char *address,
  uint16_t port)
  
   int tcp_socket_outgoing_spec(const char *address_and_port)
   {
  -return inet_connect(address_and_port, SOCK_STREAM);
  +return inet_connect(address_and_port, true, NULL);
   }
  
   int tcp_socket_incoming(const char *address, uint16_t port)
  diff --git a/qemu-char.c b/qemu-char.c
  index bb9e3f5..d3543ea 100644
  --- a/qemu-char.c
  +++ b/qemu-char.c
  @@ -2443,7 +2443,7 @@ static CharDriverState
  *qemu_chr_open_socket(QemuOpts *opts)
   if (is_listen) {
   fd = inet_listen_opts(opts, 0);
   } else {
  -fd = inet_connect_opts(opts);
  +fd = inet_connect_opts(opts, NULL);
   }
   }
   if (fd < 0) {
  diff --git a/qemu-sockets.c b/qemu-sockets.c
  index 6bcb8e3..8ed45f8 100644
  --- a/qemu-sockets.c
  +++ b/qemu-sockets.c
  @@ -51,6 +51,9 @@ static QemuOptsList dummy_opts = {
   },{
   .name = ipv6,
   .type = QEMU_OPT_BOOL,
  +},{
  +.name = block,
  +.type = QEMU_OPT_BOOL,
   },
   { /* end if list */ }
   },
  @@ -194,14 +197,15 @@ listen:
   return slisten;
   }
  
  -int inet_connect_opts(QemuOpts *opts)
  +int inet_connect_opts(QemuOpts *opts, int *sock_err)
   {
   struct addrinfo ai,*res,*e;
   const char *addr;
   const char *port;
   char uaddr[INET6_ADDRSTRLEN+1];
   char uport[33];
  -int sock,rc;
  +int sock, rc, err;
  +bool block;
  
   memset(&ai,0, sizeof(ai));
   ai.ai_flags = AI_CANONNAME | AI_ADDRCONFIG;
  @@ -210,9 +214,11 @@ int inet_connect_opts(QemuOpts *opts)
  
   addr = qemu_opt_get(opts, host);
   port = qemu_opt_get(opts, port);
  +block = qemu_opt_get_bool(opts, block, 0);
   if (addr == NULL || port == NULL) {
   fprintf(stderr, inet_connect: host and/or port not
   specified\n);
  -return -1;
  +err = -EINVAL;
  +goto err;
   }
  
   if (qemu_opt_get_bool(opts, ipv4, 0))
  @@ -224,7 +230,8 @@ int inet_connect_opts(QemuOpts *opts)
   if (0 != (rc = getaddrinfo(addr, port, &ai, &res))) {
   fprintf(stderr,getaddrinfo(%s,%s): %s\n, addr, port,
   gai_strerror(rc));
  -   return -1;
  +err = -EINVAL;
  +goto err;
   }
  
   for (e = res; e != NULL; e = e->ai_next) {
  @@ -241,21 +248,52 @@ int inet_connect_opts(QemuOpts *opts)
   continue;
   }
   setsockopt(sock,SOL_SOCKET,SO_REUSEADDR,(void*)&on,sizeof(on));
  -
  +if (!block) {
  +socket_set_nonblock(sock);
  +}
   /* connect to peer */
  -if (connect(sock,e->ai_addr,e->ai_addrlen) < 0) {
  -if (NULL == e->ai_next)
  -fprintf(stderr, %s: connect(%s,%s,%s,%s): %s\n,
  __FUNCTION__,
  -inet_strfamily(e->ai_family),
  -e->ai_canonname, uaddr, uport,
  strerror(errno));
  -closesocket(sock);
  -continue;
  +do {
  +err = 0;
  +if (connect(sock, e->ai_addr, e->ai_addrlen) < 0) {
  +err = -socket_error();


  +if (block) {
  +break;

The code above is used to keep the behavior the same as the original:
return this error (the connect will fail).

An analysis of handling EINTR during connect():
http://www.madore.org/~david/computers/connect-intr.html
 If connect() is interrupted by a signal that is caught while blocked 
waiting to establish a connection, 

[PATCH v5 0/2] support to migrate with IPv6 address

2012-03-21 Thread Amos Kong
These patches make tcp migration use the helper functions in
qemu-sockets.c to support IPv6 migration.

Changes from v1:
- split different changes into small patches, so they are easier to review
- fixed some problems according to Kevin's comments

Changes from v2:
- fixed the issue of returning the real error
- set s->fd to -1 when parsing fails, so migrate_fd_error() is not called

Changes from v3:
- try to use the helper functions in qemu-sockets.c

Changes from v4:
- introduce set_socket_error() to restore the real errno
- fix the connect error processing

---

Amos Kong (4):
  sockets: introduce set_socket_error()
  qemu-socket: change inet_connect() to to support nonblock socket
  sockets: pass back errors in inet_listen()
  use inet_listen()/inet_connect() to support ipv6 migration


 migration-tcp.c |   74 +--
 nbd.c   |2 +
 qemu-sockets.c  |   87 ++-
 qemu_socket.h   |4 ++-
 ui/vnc.c|2 +
 5 files changed, 92 insertions(+), 77 deletions(-)

-- 
Amos Kong


[PATCH v5 1/4] sockets: introduce set_socket_error()

2012-03-21 Thread Amos Kong
Introduce set_socket_error() to set errno, using
WSASetLastError() for win32.
Sometimes cleanup work rewrites errno in the error path;
we can use this function to restore the real errno.

Signed-off-by: Amos Kong ak...@redhat.com
---
 qemu_socket.h |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/qemu_socket.h b/qemu_socket.h
index fe4cf6c..a4c5170 100644
--- a/qemu_socket.h
+++ b/qemu_socket.h
@@ -8,6 +8,7 @@
 #include ws2tcpip.h
 
 #define socket_error() WSAGetLastError()
+#define set_socket_error(e) WSASetLastError(e)
 #undef EINTR
 #define EWOULDBLOCK WSAEWOULDBLOCK
 #define EINTR   WSAEINTR
@@ -26,6 +27,7 @@ int inet_aton(const char *cp, struct in_addr *ia);
 #include sys/un.h
 
 #define socket_error() errno
+#define set_socket_error(e) errno = e
 #define closesocket(s) close(s)
 
 #endif /* !_WIN32 */
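
As a hedged illustration of the intended save/restore pattern (the
helper below is made up for the example; the later patches in this
series use the same shape inline):

/* Illustration only: cleanup such as closesocket() can clobber
 * errno (or the WSA error on win32), so the real error is saved
 * first and restored with set_socket_error() before returning. */
static int fail_with_real_error(int sock)
{
    int err = -socket_error();  /* save the real error, negated */

    closesocket(sock);          /* may overwrite the error */
    set_socket_error(-err);     /* restore it for the caller */
    return -1;
}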



[PATCH v5 2/4] qemu-socket: change inet_connect() to support nonblock socket

2012-03-21 Thread Amos Kong
Change inet_connect(const char *str, int socktype) to
inet_connect(const char *str, bool block);
socktype is unused, and block selects whether the socket is
set to blocking or non-blocking mode.

Retry the connect when -EINTR/-EWOULDBLOCK is returned.
The connect is considered successful for a nonblock socket
when the following errors are returned:
  -EINPROGRESS
  -WSAEALREADY/-WSAEINVAL (win32)

Add a bool entry (block) to dummy_opts to tag the blocking mode.

Use set_socket_error() to set the real errno; set errno to
EINVAL on parse errors.

Change nbd and vnc to use the new interface.

Signed-off-by: Amos Kong ak...@redhat.com
---
 nbd.c  |2 +-
 qemu-sockets.c |   66 +++-
 qemu_socket.h  |2 +-
 ui/vnc.c   |2 +-
 4 files changed, 54 insertions(+), 18 deletions(-)

diff --git a/nbd.c b/nbd.c
index 567e94e..3618344 100644
--- a/nbd.c
+++ b/nbd.c
@@ -146,7 +146,7 @@ int tcp_socket_outgoing(const char *address, uint16_t port)
 
 int tcp_socket_outgoing_spec(const char *address_and_port)
 {
-return inet_connect(address_and_port, SOCK_STREAM);
+return inet_connect(address_and_port, true);
 }
 
 int tcp_socket_incoming(const char *address, uint16_t port)
diff --git a/qemu-sockets.c b/qemu-sockets.c
index 6bcb8e3..908479e 100644
--- a/qemu-sockets.c
+++ b/qemu-sockets.c
@@ -51,6 +51,9 @@ static QemuOptsList dummy_opts = {
 },{
 .name = ipv6,
 .type = QEMU_OPT_BOOL,
+},{
+.name = block,
+.type = QEMU_OPT_BOOL,
 },
 { /* end if list */ }
 },
@@ -201,7 +204,8 @@ int inet_connect_opts(QemuOpts *opts)
 const char *port;
 char uaddr[INET6_ADDRSTRLEN+1];
 char uport[33];
-int sock,rc;
+int sock, rc, err;
+bool block;
 
 memset(&ai,0, sizeof(ai));
 ai.ai_flags = AI_CANONNAME | AI_ADDRCONFIG;
@@ -210,9 +214,11 @@ int inet_connect_opts(QemuOpts *opts)
 
 addr = qemu_opt_get(opts, host);
 port = qemu_opt_get(opts, port);
+block = qemu_opt_get_bool(opts, block, 0);
 if (addr == NULL || port == NULL) {
 fprintf(stderr, inet_connect: host and/or port not specified\n);
-return -1;
+err = -EINVAL;
+goto err;
 }
 
 if (qemu_opt_get_bool(opts, ipv4, 0))
@@ -224,7 +230,8 @@ int inet_connect_opts(QemuOpts *opts)
 if (0 != (rc = getaddrinfo(addr, port, &ai, &res))) {
 fprintf(stderr,getaddrinfo(%s,%s): %s\n, addr, port,
 gai_strerror(rc));
-   return -1;
+err = -EINVAL;
+goto err;
 }
 
 for (e = res; e != NULL; e = e->ai_next) {
@@ -241,21 +248,44 @@ int inet_connect_opts(QemuOpts *opts)
 continue;
 }
 setsockopt(sock,SOL_SOCKET,SO_REUSEADDR,(void*)&on,sizeof(on));
-
+if (!block) {
+socket_set_nonblock(sock);
+}
 /* connect to peer */
-if (connect(sock,e->ai_addr,e->ai_addrlen) < 0) {
-if (NULL == e->ai_next)
-fprintf(stderr, %s: connect(%s,%s,%s,%s): %s\n, __FUNCTION__,
-inet_strfamily(e->ai_family),
-e->ai_canonname, uaddr, uport, strerror(errno));
-closesocket(sock);
-continue;
+do {
+err = 0;
+if (connect(sock, e->ai_addr, e->ai_addrlen) < 0) {
+err = -socket_error();
+}
+} while (err == -EINTR || err == -EWOULDBLOCK);
+
+if (err >= 0) {
+goto success;
+} else if (!block && err == -EINPROGRESS) {
+goto success;
+#ifdef _WIN32
+} else if (!block && (err == -WSAEALREADY || err == -WSAEINVAL)) {
+goto success;
+#endif
 }
-freeaddrinfo(res);
-return sock;
+
+if (NULL == e->ai_next) {
+fprintf(stderr, %s: connect(%s,%s,%s,%s): %s\n, __func__,
+inet_strfamily(e->ai_family),
+e->ai_canonname, uaddr, uport, strerror(errno));
+}
+closesocket(sock);
 }
 freeaddrinfo(res);
+
+err:
+set_socket_error(-err);
 return -1;
+
+success:
+freeaddrinfo(res);
+set_socket_error(-err);
+return sock;
 }
 
 int inet_dgram_opts(QemuOpts *opts)
@@ -449,14 +479,20 @@ int inet_listen(const char *str, char *ostr, int olen,
 return sock;
 }
 
-int inet_connect(const char *str, int socktype)
+int inet_connect(const char *str, bool block)
 {
 QemuOpts *opts;
 int sock = -1;
 
 opts = qemu_opts_create(dummy_opts, NULL, 0);
-if (inet_parse(opts, str) == 0)
+if (inet_parse(opts, str) == 0) {
+if (block) {
+qemu_opt_set(opts, block, on);
+}
 sock = inet_connect_opts(opts);
+} else {
+set_socket_error(EINVAL);
+}
 qemu_opts_del(opts);
 return sock;
 }
diff --git a/qemu_socket.h b/qemu_socket.h
index a4c5170..f86cd3f 100644
--- a/qemu_socket.h
+++ b/qemu_socket.h
@@ -47,7 +47,7 @@ int inet_listen_opts(QemuOpts *opts, int port_offset);
 int 
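
For illustration, a hedged sketch of how a caller might use the new
non-blocking mode (the address is made up; per the patch, an
in-progress connect still returns a valid fd and leaves EINPROGRESS
in errno via set_socket_error()):

int fd = inet_connect("127.0.0.1:4444", false);
if (fd < 0) {
    fprintf(stderr, "connect failed: %s\n", strerror(socket_error()));
} else if (socket_error() == EINPROGRESS) {
    /* wait for the socket to become writable, then check SO_ERROR */
}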

[PATCH v5 3/4] sockets: pass back errors in inet_listen()

2012-03-21 Thread Amos Kong
Use set_socket_error() to restore the real errno;
set errno to EINVAL on parse errors.

Signed-off-by: Amos Kong ak...@redhat.com
---
 qemu-sockets.c |   21 -
 1 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/qemu-sockets.c b/qemu-sockets.c
index 908479e..f1c6524 100644
--- a/qemu-sockets.c
+++ b/qemu-sockets.c
@@ -110,7 +110,7 @@ int inet_listen_opts(QemuOpts *opts, int port_offset)
 char port[33];
 char uaddr[INET6_ADDRSTRLEN+1];
 char uport[33];
-int slisten, rc, to, port_min, port_max, p;
+int slisten, rc, to, port_min, port_max, p, err;
 
 memset(&ai,0, sizeof(ai));
 ai.ai_flags = AI_PASSIVE | AI_ADDRCONFIG;
@@ -120,7 +120,8 @@ int inet_listen_opts(QemuOpts *opts, int port_offset)
 if ((qemu_opt_get(opts, host) == NULL) ||
 (qemu_opt_get(opts, port) == NULL)) {
 fprintf(stderr, %s: host and/or port not specified\n, __FUNCTION__);
-return -1;
+err = -EINVAL;
+goto err;
 }
 pstrcpy(port, sizeof(port), qemu_opt_get(opts, port));
 addr = qemu_opt_get(opts, host);
@@ -138,7 +139,8 @@ int inet_listen_opts(QemuOpts *opts, int port_offset)
 if (rc != 0) {
 fprintf(stderr,getaddrinfo(%s,%s): %s\n, addr, port,
 gai_strerror(rc));
-return -1;
+err = -EINVAL;
+goto err;
 }
 
 /* create socket + bind */
@@ -150,6 +152,7 @@ int inet_listen_opts(QemuOpts *opts, int port_offset)
 if (slisten < 0) {
 fprintf(stderr,%s: socket(%s): %s\n, __FUNCTION__,
 inet_strfamily(e->ai_family), strerror(errno));
+err = -socket_error();
 continue;
 }
 
@@ -169,6 +172,7 @@ int inet_listen_opts(QemuOpts *opts, int port_offset)
 if (bind(slisten, e->ai_addr, e->ai_addrlen) == 0) {
 goto listen;
 }
+err = -socket_error();
 if (p == port_max) {
 fprintf(stderr,%s: bind(%s,%s,%d): %s\n, __FUNCTION__,
 inet_strfamily(e->ai_family), uaddr, inet_getport(e),
@@ -179,14 +183,15 @@ int inet_listen_opts(QemuOpts *opts, int port_offset)
 }
 fprintf(stderr, %s: FAILED\n, __FUNCTION__);
 freeaddrinfo(res);
-return -1;
+goto err;
 
 listen:
 if (listen(slisten,1) != 0) {
+err = -socket_error();
 perror(listen);
 closesocket(slisten);
 freeaddrinfo(res);
-return -1;
+goto err;
 }
 snprintf(uport, sizeof(uport), %d, inet_getport(e) - port_offset);
 qemu_opt_set(opts, host, uaddr);
@@ -195,6 +200,10 @@ listen:
 qemu_opt_set(opts, ipv4, (e->ai_family != PF_INET6) ? on : off);
 freeaddrinfo(res);
 return slisten;
+
+err:
+set_socket_error(-err);
+return -1;
 }
 
 int inet_connect_opts(QemuOpts *opts)
@@ -474,6 +483,8 @@ int inet_listen(const char *str, char *ostr, int olen,
  optstr ? optstr : );
 }
 }
+} else {
+set_socket_error(EINVAL);
 }
 qemu_opts_del(opts);
 return sock;
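
A hedged usage sketch (hypothetical caller; with this patch the real
error survives in errno, matching how patch 4/4 uses it):

int s = inet_listen("[::]:4444", NULL, 256, SOCK_STREAM, 0);
if (s < 0) {
    fprintf(stderr, "listen failed: %s\n", strerror(socket_error()));
}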



[PATCH v5 4/4] use inet_listen()/inet_connect() to support ipv6 migration

2012-03-21 Thread Amos Kong
Use the helper functions in qemu-sockets.c for tcp migration;
they already support ipv6 addresses.

For IPv6, brackets are mandatory if you specify a port.
Referring to RFC 5952, the recommended format is:

 [2312::8274]:5200

Test status: succeeded.
listen side: qemu-kvm  -incoming tcp:[2312::8274]:5200
client side: qemu-kvm ...
 (qemu) migrate -d tcp:[2312::8274]:5200

Signed-off-by: Amos Kong ak...@redhat.com
---
 migration-tcp.c |   74 +++
 1 files changed, 20 insertions(+), 54 deletions(-)

diff --git a/migration-tcp.c b/migration-tcp.c
index 35a5781..251d955 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -81,43 +81,32 @@ static void tcp_wait_for_connect(void *opaque)
 
 int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
 {
-struct sockaddr_in addr;
-int ret;
-
-ret = parse_host_port(&addr, host_port);
-if (ret < 0) {
-return ret;
-}
+int err;
 
 s->get_error = socket_errno;
 s->write = socket_write;
 s->close = tcp_close;
 
-s->fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
-if (s->fd == -1) {
-DPRINTF(Unable to open socket);
-return -socket_error();
-}
-
-socket_set_nonblock(s->fd);
+s->fd = inet_connect(host_port, false);
+err = -socket_error();
 
-do {
-ret = connect(s->fd, (struct sockaddr *)&addr, sizeof(addr));
-if (ret == -1) {
-ret = -socket_error();
-}
-if (ret == -EINPROGRESS || ret == -EWOULDBLOCK) {
-qemu_set_fd_handler2(s->fd, NULL, NULL, tcp_wait_for_connect, s);
-return 0;
+if (err == -EINPROGRESS) {
+DPRINTF(connect in progress);
+qemu_set_fd_handler2(s->fd, NULL, NULL, tcp_wait_for_connect, s);
+#ifdef _WIN32
+} else if (err == -WSAEALREADY || err == -WSAEINVAL) {
+DPRINTF(connect in progress);
+qemu_set_fd_handler2(s->fd, NULL, NULL, tcp_wait_for_connect, s);
+#endif
+} else if (err < 0) {
+DPRINTF(connect failed: %s\n, strerror(-err));
+if (s->fd != -1) {
+migrate_fd_error(s);
 }
-} while (ret == -EINTR);
-
-if (ret < 0) {
-DPRINTF(connect failed\n);
-migrate_fd_error(s);
-return ret;
+return err;
+} else {
+migrate_fd_connect(s);
 }
-migrate_fd_connect(s);
 return 0;
 }
 
@@ -157,38 +146,15 @@ out2:
 
 int tcp_start_incoming_migration(const char *host_port)
 {
-struct sockaddr_in addr;
-int val;
 int s;
 
-DPRINTF(Attempting to start an incoming migration\n);
-
-if (parse_host_port(&addr, host_port) < 0) {
-fprintf(stderr, invalid host/port combination: %s\n, host_port);
-return -EINVAL;
-}
-
-s = qemu_socket(PF_INET, SOCK_STREAM, 0);
-if (s == -1) {
+s = inet_listen(host_port, NULL, 256, SOCK_STREAM, 0);
+if (s < 0) {
 return -socket_error();
 }
 
-val = 1;
-setsockopt(s, SOL_SOCKET, SO_REUSEADDR, (const char *)&val, sizeof(val));
-
-if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
-goto err;
-}
-if (listen(s, 1) == -1) {
-goto err;
-}
-
 qemu_set_fd_handler2(s, NULL, tcp_accept_incoming_migration, NULL,
  (void *)(intptr_t)s);
 
 return 0;
-
-err:
-close(s);
-return -socket_error();
 }



Re: [PATCH 12/38] powerpc/booke: Provide exception macros with interrupt name

2012-03-21 Thread Scott Wood
On 03/21/2012 01:04 PM, Kumar Gala wrote:
 
 On Feb 28, 2012, at 6:09 PM, Alexander Graf wrote:
 
 From: Scott Wood scottw...@freescale.com

 DO_KVM will need to identify the particular exception type.

 There is an existing set of arbitrary numbers that Linux passes,
 but it's an undocumented mess that sort of corresponds to server/classic
 exception vectors but not really.
 
 So what do the new names correspond to?

The names are Linux-defined.  The values are IVOR numbers.

 What header is defining MACHINE_CHECK, BOOKE_INTERRUPT_EXTERNAL, etc
 (asm/kvm_asm.h)?

Yes, it's asm/kvm_asm.h at the moment.

It's actually BOOKE_INTERRUPT_MACHINE_CHECK, etc.  The exception macros
paste on BOOKE_INTERRUPT_ when they use it, to keep the macro users from
having to wrap lines even more often.
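
For illustration, a minimal sketch of that pasting (EXC_INTNO is a
made-up name, not the macro from the patch):

/* Illustration only: the user writes the short interrupt name and
 * the macro pastes on the BOOKE_INTERRUPT_ prefix. */
#define EXC_INTNO(intname)  BOOKE_INTERRUPT_##intname
/* EXC_INTNO(MACHINE_CHECK) expands to BOOKE_INTERRUPT_MACHINE_CHECK */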

 If so we really should move these out of
 asm/kvm_asm.h and into something a bit more appropriate.

Yes, that was one of the things I had been planning to fix post-RFC.
Any preference what header it should be in?  asm/reg_booke.h seems to be
the least bad option of the existing headers -- or perhaps move
head_booke.h to arch/powerpc/include/asm.

Long-term it would also be nice for this to replace, rather than
supplement, the current numbers, though as Ben pointed out there are a
number of places throughout the code that will have to be fixed for that
to happen.

-Scott



Re: [PATCH 12/38] powerpc/booke: Provide exception macros with interrupt name

2012-03-21 Thread Kumar Gala

On Mar 21, 2012, at 1:19 PM, Scott Wood wrote:

 On 03/21/2012 01:04 PM, Kumar Gala wrote:
 
 On Feb 28, 2012, at 6:09 PM, Alexander Graf wrote:
 
 From: Scott Wood scottw...@freescale.com
 
 DO_KVM will need to identify the particular exception type.
 
 There is an existing set of arbitrary numbers that Linux passes,
 but it's an undocumented mess that sort of corresponds to server/classic
 exception vectors but not really.
 
 So what do the new names correspond to?
 
 The names are Linux-defined.  The values are IVOR numbers.
 
 What header is defining MACHINE_CHECK, BOOKE_INTERRUPT_EXTERNAL, etc
 (asm/kvm_asm.h)?
 
 Yes, it's asm/kvm_asm.h at the moment.
 
 It's actually BOOKE_INTERRUPT_MACHINE_CHECK, etc.  The exception macros
 paste on BOOKE_INTERRUPT_ when they use it, to keep the macro users from
 having to wrap lines even more often.
 
 If so we really should move these out of
 asm/kvm_asm.h and into something a bit more appropriate.
 
 Yes, that was one of the things I had been planning to fix post-RFC.
 Any preference what header it should be in?  asm/reg_booke.h seems to be
 the least bad option of the existing headers -- or perhaps move
 head_booke.h to arch/powerpc/include/asm.

asm/reg_booke.h seems the least painful right now.  head_booke.h is only used 
on 32-bit so that's not the best choice at this point.  We could create an 
asm/head_booke.h in addition to the one in arch/powerpc/kernel/

 Long-term it would also be nice for this to replace, rather than
 supplement, the current numbers, though as Ben pointed out there are a
 number of places throughout the code that will have to be fixed for that
 to happen.

Yeah, not too worried about that.  More just wanting it to be clear what one 
has to 'update' if adding support for a new exception.

- k