Re: [PATCH 07/18] kvm: Add MCE signal support for !CONFIG_IOTHREAD

2011-01-26 Thread Jan Kiszka
On 2011-01-24 13:36, Jan Kiszka wrote:
 On 2011-01-24 12:17, Marcelo Tosatti wrote:
 On Mon, Jan 10, 2011 at 09:32:00AM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com

 Currently, we only configure and process MCE-related SIGBUS events if
 CONFIG_IOTHREAD is enabled. Fix this by factoring out the required
 handler registration and system configuration. Make sure that events
 happening over a VCPU context in non-threaded mode get dispatched as
 VCPU MCEs.

 We also need to call qemu_kvm_eat_signals in non-threaded mode now, so
 move it (unmodified) and add the required Windows stub.

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 CC: Huang Ying ying.hu...@intel.com
 ---
 cpus.c |  200 +++
  1 files changed, 124 insertions(+), 76 deletions(-)

 diff --git a/cpus.c b/cpus.c
 index 6da0f8f..b6f1cfb 100644
 --- a/cpus.c
 +++ b/cpus.c
 @@ -34,9 +34,6 @@
  
  #include "cpus.h"
  #include "compatfd.h"
 -#ifdef CONFIG_LINUX
 -#include <sys/prctl.h>
 -#endif
  
  #ifdef SIGRTMIN
  #define SIG_IPI (SIGRTMIN+4)
 @@ -44,10 +41,24 @@
  #define SIG_IPI SIGUSR1
  #endif
  

 @@ -912,6 +954,8 @@ static int qemu_cpu_exec(CPUState *env)
  
  bool cpu_exec_all(void)
  {
 +int r;
 +
  if (next_cpu == NULL)
  next_cpu = first_cpu;
 for (; next_cpu != NULL && !exit_request; next_cpu = next_cpu->next_cpu) {
 @@ -923,7 +967,11 @@ bool cpu_exec_all(void)
  if (qemu_alarm_pending())
  break;
  if (cpu_can_run(env)) {
 -if (qemu_cpu_exec(env) == EXCP_DEBUG) {
 +r = qemu_cpu_exec(env);
 +if (kvm_enabled()) {
 +qemu_kvm_eat_signals(env);
 +}
 +if (r == EXCP_DEBUG) {
  break;
  }

 SIGBUS should be processed outside of vcpu execution context, think of a
 non-MCE SIGBUS while the vm is stopped. Could use signalfd for that.
 
 signalfd - that's the missing bit. I was thinking of how to handle
 SIGBUS events raised outside the vcpu context. We need to handle them
 synchronously, and signalfd should allow this.

This was straightforward. But now I wonder what actually makes this
pattern work. Doesn't the kernel force-inject SIGBUS, i.e. ignore any
blocking? Or does this only apply to BUS_MCEERR_AR?

 

 But the SIGBUS handler for !IOTHREAD case should not ignore Action
 Required, since it might have been generated in vcpu context.

 
 Yes, the sigbus handler will require some rework when we actually start
 using it for !IOTHREAD.

And this no longer makes sense to me. The current version simply uses
the same sigbus handler for both modes, and that handler forwards the
error code properly. What did you mean?

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: Errors on MMIO read access on VM suspend / resume operations

2011-01-26 Thread Jan Kiszka
On 2011-01-25 17:49, Stefan Berger wrote:
 On 01/25/2011 02:26 AM, Jan Kiszka wrote:

 Do you see a chance to look closer at the issue yourself? E.g.
 instrument the kernel's irqchip models and dump their states once your
 guest is stuck?
 The device runs on IRQ 3, so I applied this patch here.
 
 diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
 index 3cece05..8f4f94c 100644
 --- a/arch/x86/kvm/i8259.c
 +++ b/arch/x86/kvm/i8259.c
 @@ -106,7 +106,7 @@ static inline int pic_set_irq1(struct kvm_kpic_state *s, int irq, int level)
  {
      int mask, ret = 1;
      mask = 1 << irq;
 -    if (s->elcr & mask)    /* level triggered */
 +    if (s->elcr & mask)    /* level triggered */ {
          if (level) {
              ret = !(s->irr & mask);
              s->irr |= mask;
 @@ -115,7 +115,10 @@ static inline int pic_set_irq1(struct kvm_kpic_state *s, int irq, int level)
              s->irr &= ~mask;
              s->last_irr &= ~mask;
          }
 -    else    /* edge triggered */
 +    if (irq == 3)
 +        printk("%s %d: level=%d, irr = %x\n", __FUNCTION__, __LINE__, level, s->irr);
 +    }
 +    else    /* edge triggered */ {
          if (level) {
              if ((s->last_irr & mask) == 0) {
                  ret = !(s->irr & mask);
 @@ -124,7 +127,9 @@ static inline int pic_set_irq1(struct kvm_kpic_state *s, int irq, int level)
              s->last_irr |= mask;
          } else
              s->last_irr &= ~mask;
 -
 +    if (irq == 3)
 +        printk("%s %d: level=%d, irr = %x\n", __FUNCTION__, __LINE__, level, s->irr);
 +    }
      return (s->imr & mask) ? -1 : ret;
  }
 
 @@ -206,6 +211,8 @@ int kvm_pic_set_irq(void *opaque, int irq, int level)
 
      pic_lock(s);
      if (irq >= 0 && irq < PIC_NUM_PINS) {
 +        if (irq == 3)
 +            printk("%s\n", __FUNCTION__);
          ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
          pic_update_irq(s);
          trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr,
 
 
 
 While it's still working I see this here, with the level changing 0-1-0.
 Then it stops and the level stays at '1'.
 
 [ 1773.833824] kvm_pic_set_irq
 [ 1773.833827] pic_set_irq1 131: level=0, irr = 5b
 [ 1773.834161] kvm_pic_set_irq
 [ 1773.834163] pic_set_irq1 131: level=1, irr = 5b
 [ 1773.834193] kvm_pic_set_irq
 [ 1773.834195] pic_set_irq1 131: level=0, irr = 5b
 [ 1773.835028] kvm_pic_set_irq
 [ 1773.835031] pic_set_irq1 131: level=1, irr = 5b
 [ 1773.835542] kvm_pic_set_irq
 [ 1773.835545] pic_set_irq1 131: level=1, irr = 5b
 [ 1773.889892] kvm_pic_set_irq
 [ 1773.889894] pic_set_irq1 131: level=1, irr = 5b
 [ 1791.258793] pic_set_irq1 119: level=1, irr = d9
 [ 1791.258824] pic_set_irq1 119: level=0, irr = d1
 [ 1791.402476] pic_set_irq1 119: level=1, irr = d9
 [ 1791.402534] pic_set_irq1 119: level=0, irr = d1
 [ 1791.402538] pic_set_irq1 119: level=1, irr = d9
 [...]
 
 
 I believe the last 5 shown calls can be ignored. After that the
 interrupts don't go through anymore.
 
 In the device model I see interrupts being raised and cleared. After the
 last one was cleared in 'my' device model, only interrupts are raised.
 This looks as if the interrupt handler in the guest Linux was never
 run, thus the IRQ is never cleared and we're stuck.
 

User space is responsible for both setting and clearing that line. IRQ3
means you are using some serial device model? Then you should check what
its state is.

Moreover, a complete picture of the kernel/user space interaction should
be obtainable by using ftrace to capture kvm events.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts

2011-01-26 Thread Jan Kiszka
On 2011-01-25 20:13, Alex Williamson wrote:
 On Tue, 2011-01-25 at 17:35 +0100, Jan Kiszka wrote:
 On 2011-01-25 15:53, Avi Kivity wrote:
 On 01/25/2011 04:41 PM, Alex Williamson wrote:
  
  
kvm: Allow memory slot array to grow on demand
  
Remove fixed KVM_MEMORY_SLOTS limit, allowing the slot array
to grow on demand.  Private slots are now allocated at the
front instead of the end.  Only x86 seems to use private slots,

  Hmm, doesn't current user space expect slots 8..11 to be the private
  ones, and wouldn't it cause trouble if slots 0..3 are suddenly
 reserved?

  The private slots aren't currently visible to userspace, they're
  actually slots 32..35.  The patch automatically increments user passed
  slot ids so userspace has its own zero-based view of the array.
  Frankly, I don't understand why userspace reserves slots 8..11, is this
  compatibility with older kernel implementations?

 I think so.  I believe these kernel versions are too old now to matter,
 but of course I can't be sure.

 Will check and enable those 4 additional slots for user space if we no
 longer need to exclude them.
 
 Appears that the history goes something like this...
 
   * Once upon a time we had 8 memory slots
 
   * v2.6.24-4949-ge0d62c7 added 4 reserved slots above those 8
 
   * v2.6.24-4950-gcbc9402 made use of slot 8 for tss
 
   * v2.6.24-4962-gf78e0e2 made use of slot 9 for tpr
 
   * v2.6.25-5341-gef2979b bumped the 8 slots to 32
 
   * v2.6.26-rc1-13-gb7ebfb0 made use of slot 10 for ept
 
   * v2.6.28-4952-g6fe6397 moved the previously hard coded slots
 8,9,10 up to KVM_MEMORY_SLOTS + {0,1,2}
 

Nice overview!

 So we haven't needed to reserve 8..11 for a while and we've never made
 use of all 4 private slots.  I should reduce that to 3 in my patch.
 Maybe there's a test we could do in userspace to figure out we don't
 have to skip those anymore?

We can simply remove the test, 2.6.29 is our entry barrier for both
upstream and qemu-kvm now.

Jan



signature.asc
Description: OpenPGP digital signature


Re: Graphics pass-through

2011-01-26 Thread Gerd Hoffmann

  Hi,


The changes we made are minimal, mostly disabling the default QEMU VGA.

But there are a few problems that we are still working on:
1. The display on the monitor probably only appears after KMS is
enabled. It does not show the grub menu or the boot log.


Sounds like you're not enabling legacy VGA MMIO and I/O port space to
the physical device.  Without card specific drivers, seabios and grub
have no way to write to the screen.


There is an initialization order issue with the legacy vga memory @ 
0xa0 and the piix3 chipset.  Booting a guest with '-vga none -device 
VGA' has the same effect (no boot messages).


cheers,
  Gerd
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: Add documents for MSI-X MMIO API

2011-01-26 Thread Sheng Yang
On Tuesday 25 January 2011 20:47:38 Avi Kivity wrote:
 On 01/19/2011 10:21 AM, Sheng Yang wrote:
  We already got a guest MMIO address for that in the exit
  information. I've created a chain of handlers in qemu to handle it.

But we already decoded the table and entry...
  
  But the handler is still wrapped by vcpu_mmio_write(), as a part of MMIO.
  So it's not quite handy to get the table and entry out.
 
 The kernel handler can create a new kvm_run exit description.
 
Also the updater in the userspace
  
   can share most of the logic with the ordinary userspace MMIO handler,
   which takes an address as parameter. So I think we don't need to pass
   the decoded table_id and entry to userspace.
 
 It's mixing layers, which always leads to trouble.  For one, the user
 handler shouldn't do anything with the write since the kernel already
 wrote it into the table.  For another, if two vcpus write to the same
 entry simultaneously, you could see different ordering in the kernel and
 userspace, and get inconsistent results.

The shared logic is not about writing, but about interpreting what's
written. The old MMIO handler would write the data, then interpret it;
our new MMIO handler would only share the interpretation logic. I think
that's fair enough?

--
regards
Yang, Sheng
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts

2011-01-26 Thread Avi Kivity

On 01/25/2011 07:58 PM, Michael S. Tsirkin wrote:

On Tue, Jan 25, 2011 at 07:33:40PM +0200, Avi Kivity wrote:
  On 01/25/2011 04:59 PM, Michael S. Tsirkin wrote:
  On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote:
 For the other lookups, which we
 believe will succeed, we can assume the probability of a match is
 related to the slot size, and sort the slots by page count.
  
  Unlikely to be true for assigned device BARs.

  We'll have a large slot at 4G+ - EOM, a medium slot at 1M-3G, and
  lots of small slots for BARs and such.

  The vast majority of faults will either touch one of the two largest
  slots, or will miss all slots.  Relatively few will hit the small
  slots.

Not if you are using one of the assigned devices and don't
do any faults on memory :)


It's impossible not to fault on memory.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts

2011-01-26 Thread Michael S. Tsirkin
On Wed, Jan 26, 2011 at 11:17:11AM +0200, Avi Kivity wrote:
 On 01/25/2011 07:58 PM, Michael S. Tsirkin wrote:
 On Tue, Jan 25, 2011 at 07:33:40PM +0200, Avi Kivity wrote:
   On 01/25/2011 04:59 PM, Michael S. Tsirkin wrote:
   On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote:
  For the other lookups, which we
  believe will succeed, we can assume the probability of a match is
  related to the slot size, and sort the slots by page count.
   
   Unlikely to be true for assigned device BARs.
 
   We'll have a large slot at 4G+ - EOM, a medium slot at 1M-3G, and
   lots of small slots for BARs and such.
 
   The vast majority of faults will either touch one of the two largest
   slots, or will miss all slots.  Relatively few will hit the small
   slots.
 
 Not if you are using one of the assigned devices and don't
 do any faults on memory :)
 
 It's impossible not to fault on memory.

No I mean the RAM.

 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts

2011-01-26 Thread Avi Kivity

On 01/25/2011 07:43 PM, Alex Williamson wrote:

On Tue, 2011-01-25 at 19:11 +0200, Avi Kivity wrote:
  On 01/25/2011 04:57 PM, Alex Williamson wrote:
On Tue, 2011-01-25 at 12:23 +0200, Avi Kivity wrote:
   On 01/25/2011 07:37 AM, Alex Williamson wrote:
  On Mon, 2011-01-24 at 08:44 -0700, Alex Williamson wrote:
  I'll look at how we might be
  able to allocate slots on demand.  Thanks,
   
  Here's a first cut just to see if this looks agreeable.  This 
allows the
  slot array to grow on demand.  This works with current userspace, 
as
  well as userspace trivially modified to double KVMState.slots and
  hotplugging enough pci-assign devices to exceed the previous limit 
(w/
  w/o ept).  Hopefully I got the rcu bits correct.  Does this look 
like
  the right path?

   This can be trivially exhausted to pin all RAM.
  
What's a reasonable upper limit?  A PCI device can have at most 6 MMIO
BARs, each taking one slot.

  A BAR can take no slots, or several slots.  For example a BAR might have
  a framebuffer, followed by an off-screen framebuffer, followed by an
  mmio register area in one BAR.  You'd want the framebuffer to be dirty
  logged while the offscreen framebuffer is not tracked (so one slot for
  each) while the mmio range cannot be used as a slot.

  That only holds for emulated devices.

Sure, emulated devices can do lots of specialty mappings, but I also
expect that more typically, mmio access to emulated devices will get
bounced through qemu and not use any slots.


Right; the example I gave (a framebuffer) is the exception rather than 
the rule; and I believe modern framebuffers are usually accessed through 
dma rather than BARs.



  It might also support MSI-X in a way that
splits a BAR, so an absolute max of 7 slots per PCI device.  Assuming
only 1 PCI bus for the moment and also assuming assigned devices are
typically single function, 32 devices * 7 slots/device = 224 slots, so
maybe a 256 limit?  Maybe we even bump it up to a limit of 64 devices
with a slot limit of 512.  It would be easier in device assignment code
to keep track of a number of devices limit than trying to guess whether
slots will be available when we need them.

  Sounds reasonable.  But we'd need a faster search for 512 slots.

Ok, I'll add a limit.  I'm not convinced the typical use case is going
to increase slots at all, so I'm still a little resistant to optimizing
the search at this point.


I don't want there to be two kvm implementations out there, both 
supporting 512 slots, while one is slow and one is fast.  It means that 
you have no idea what performance to expect.



BTW, simply by the ordering of when the
platform registers memory vs when other devices are mapped, we seem to
end up placing the actual memory regions at the front of the list.  Not
sure if that's by design or luck.  Thanks,


It was done by design, though as qemu evolves, it becomes more and 
more a matter of luck that the order doesn't change.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts

2011-01-26 Thread Avi Kivity

On 01/26/2011 11:20 AM, Michael S. Tsirkin wrote:

On Wed, Jan 26, 2011 at 11:17:11AM +0200, Avi Kivity wrote:
  On 01/25/2011 07:58 PM, Michael S. Tsirkin wrote:
  On Tue, Jan 25, 2011 at 07:33:40PM +0200, Avi Kivity wrote:
 On 01/25/2011 04:59 PM, Michael S. Tsirkin wrote:
 On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote:
 For the other lookups, which we
 believe will succeed, we can assume the probability of a match is
 related to the slot size, and sort the slots by page count.
 
 Unlikely to be true for assigned device BARs.
  
 We'll have a large slot at 4G+ - EOM, a medium slot at 1M-3G, and
 lots of small slots for BARs and such.
  
 The vast majority of faults will either touch one of the two largest
 slots, or will miss all slots.  Relatively few will hit the small
 slots.
  
  Not if you are using one of the assigned devices and don't
  do any faults on memory :)

  It's impossible not to fault on memory.

No I mean the RAM.


No idea what you mean.  It's impossible not to fault on RAM, either 
(unless you don't use it at all).


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts

2011-01-26 Thread Avi Kivity

On 01/25/2011 08:00 PM, Michael S. Tsirkin wrote:

  
  I see now. Yes, a flag in spte would help.  changes in slots would then
  have to update all these flags.

  That's easy, we drop all sptes.

Ah, right. Hmm, doesn't the cpu already have some flag to distinguish mmio sptes?


No.  Allocated-but-not-mapped and mmio sptes are identical.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts

2011-01-26 Thread Michael S. Tsirkin
On Wed, Jan 26, 2011 at 11:23:21AM +0200, Avi Kivity wrote:
 On 01/26/2011 11:20 AM, Michael S. Tsirkin wrote:
 On Wed, Jan 26, 2011 at 11:17:11AM +0200, Avi Kivity wrote:
   On 01/25/2011 07:58 PM, Michael S. Tsirkin wrote:
   On Tue, Jan 25, 2011 at 07:33:40PM +0200, Avi Kivity wrote:
  On 01/25/2011 04:59 PM, Michael S. Tsirkin wrote:
  On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote:
  For the other lookups, which we
  believe will succeed, we can assume the probability of a match is
  related to the slot size, and sort the slots by page count.
  
  Unlikely to be true for assigned device BARs.
   
  We'll have a large slot at 4G+ - EOM, a medium slot at 1M-3G, and
  lots of small slots for BARs and such.
   
  The vast majority of faults will either touch one of the two largest
  slots, or will miss all slots.  Relatively few will hit the small
  slots.
   
   Not if you are using one of the assigned devices and don't
   do any faults on memory :)
 
   It's impossible not to fault on memory.
 
 No I mean the RAM.
 
 No idea what you mean.  It's impossible not to fault on RAM, either
 (unless you don't use it at all).

I just mean that once you fault, you map sptes and can then use them
without exits.  mmio will cause an exit each time, right?

 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 12/19] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c.

2011-01-26 Thread Yoshiaki Tamura
Record mmio write event to replay it upon failover.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 exec.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index e950df2..c81fd09 100644
--- a/exec.c
+++ b/exec.c
@@ -33,6 +33,7 @@
  #include "osdep.h"
  #include "kvm.h"
  #include "qemu-timer.h"
 +#include "event-tap.h"
  #if defined(CONFIG_USER_ONLY)
  #include "qemu.h"
  #include <signal.h>
 @@ -3632,6 +3633,9 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
  io_index = (pd >> IO_MEM_SHIFT) & (IO_MEM_NB_ENTRIES - 1);
  if (p)
  addr1 = (addr & ~TARGET_PAGE_MASK) + p->region_offset;
+
+event_tap_mmio(addr, buf, len);
+
 /* XXX: could force cpu_single_env to NULL to avoid
potential bugs */
  if (l >= 4 && ((addr1 & 3) == 0)) {
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 13/19] net: insert event-tap to qemu_send_packet() and qemu_sendv_packet_async().

2011-01-26 Thread Yoshiaki Tamura
The event-tap function is called only when event-tap is on.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 net.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/net.c b/net.c
index 9ba5be2..1176124 100644
--- a/net.c
+++ b/net.c
@@ -36,6 +36,7 @@
  #include "qemu-common.h"
  #include "qemu_socket.h"
  #include "hw/qdev.h"
 +#include "event-tap.h"
 
 static QTAILQ_HEAD(, VLANState) vlans;
 static QTAILQ_HEAD(, VLANClientState) non_vlan_clients;
@@ -559,6 +560,10 @@ ssize_t qemu_send_packet_async(VLANClientState *sender,
 
 void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size)
 {
+if (event_tap_is_on()) {
+return event_tap_send_packet(vc, buf, size);
+}
+
 qemu_send_packet_async(vc, buf, size, NULL);
 }
 
@@ -657,6 +662,10 @@ ssize_t qemu_sendv_packet_async(VLANClientState *sender,
 {
 NetQueue *queue;
 
+if (event_tap_is_on()) {
+return event_tap_sendv_packet_async(sender, iov, iovcnt, sent_cb);
+}
+
  if (sender->link_down || (!sender->peer && !sender->vlan)) {
 return calc_iov_length(iov, iovcnt);
 }
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/19] Kemari for KVM v0.2.7

2011-01-26 Thread Yoshiaki Tamura
Hi,

This patch series is a revised version of Kemari for KVM, which
incorporates the review comments on the previous post.  The current
code is based on qemu.git 0bfe006c5380c5f8a485a55ded3329fbbc224396.

The changes from v0.2.6 -> v0.2.7 are:

- add AIOCB, AIOPool and cancel functions (Kevin)
- insert event-tap for bdrv_flush (Kevin)
- add error handing when calling bdrv functions (Kevin)
- fix usage of qemu_aio_flush and bdrv_flush (Kevin)
- use bs in AIOCB on the primary (Kevin)
- reorder event-tap functions to gather with block/net (Kevin)
- fix checking bs-device_name (Kevin)

The changes from v0.2.5 -> v0.2.6 are:

- use qemu_{put,get}_be32() to save/load niov in event-tap

The changes from v0.2.4 -> v0.2.5 are:

- fixed braces and trailing spaces by using Blue's checkpatch.pl (Blue)
- event-tap: don't try to send blk_req if it's a bdrv_aio_flush event

The changes from v0.2.3 -> v0.2.4 are:

- call vm_start() before event_tap_flush_one() to avoid failure in
  virtio-net assertion
- add vm_change_state_handler to turn off ft_mode
- use qemu_iovec functions in event-tap
- remove duplicated code in migration
- remove unnecessary new line for error_report in ft_trans_file

The changes from v0.2.2 -> v0.2.3 are:

- queue async net requests without copying (MST)
-- if not async, contents of the packets are sent to the secondary
- better description for option -k (MST)
- fix memory transfer failure
- fix ft transaction initiation failure

The changes from v0.2.1 -> v0.2.2 are:

- decrement last_avail_idx with inuse before saving (MST)
- remove qemu_aio_flush() and bdrv_flush_all() in migrate_ft_trans_commit()

The changes from v0.2 -> v0.2.1 are:

- Move event-tap to net/block layer and use stubs (Blue, Paul, MST, Kevin)
- Tap bdrv_aio_flush (Marcelo)
- Remove multiwrite interface in event-tap (Stefan)
- Fix event-tap to use pio/mmio to replay both net/block (Stefan)
- Improve error handling in event-tap (Stefan)
- Fix leak in event-tap (Stefan)
- Revise virtio last_avail_idx manipulation (MST)
- Clean up migration.c hook (Marcelo)
- Make deleting change state handler robust (Isaku, Anthony)

The changes from v0.1.1 -> v0.2 are:

- Introduce a queue in event-tap to make VM sync live.
- Change transaction receiver to a state machine for async receiving.
- Replace net/block layer functions with event-tap proxy functions.
- Remove dirty bitmap optimization for now.
- convert DPRINTF() in ft_trans_file to trace functions.
- convert fprintf() in ft_trans_file to error_report().
- improved error handling in ft_trans_file.
- add a tmp pointer to qemu_del_vm_change_state_handler.

The changes from v0.1 -> v0.1.1 are:

- events are tapped in net/block layer instead of device emulation layer.
- Introduce a new option for -incoming to accept FT transaction.

- Removed writev() support to QEMUFile and FdMigrationState for now.
  I would post this work in a different series.

- Modified virtio-blk save/load handler to send inuse variable to
  correctly replay.

- Removed configure --enable-ft-mode.
- Removed unnecessary check for qemu_realloc().

The first 6 patches modify several functions of qemu to prepare
introducing Kemari specific components.

The next 6 patches are the components of Kemari.  They introduce
event-tap and the FT transaction protocol file based on buffered file.
The design document of FT transaction protocol can be found at,
http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf

Then the following 2 patches modify net/block layer functions to call
event-tap functions.  Please note that if Kemari is off, event-tap
will just pass through, and there is almost no intrusion into existing
functions, including normal live migration.

Finally, the migration layer is modified to support Kemari in the
last 5 patches.  Again, there shouldn't be any effect if a user
doesn't specify Kemari specific options.  The transaction is now async
on both sender and receiver side.  The sender side respects
max_downtime to decide when to switch from async to sync mode.

The repository contains all patches I'm sending with this message.
For those who want to try, please pull the following repository.  It
also includes a dirty bitmap optimization which isn't ready for posting
yet.  To remove the dirty bitmap optimization, please look at HEAD~5
of the tree.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari next

Thanks,

Yoshi

Yoshiaki Tamura (19):
  Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and
qemu_clear_buffer().
  Introduce read() to FdMigrationState.
  Introduce skip_header parameter to qemu_loadvm_state().
  qemu-char: export socket_set_nodelay().
  vl.c: add deleted flag for deleting the handler.
  virtio: decrement last_avail_idx with inuse before saving.
  Introduce fault tolerant VM transaction QEMUFile and ft_mode.
  savevm: introduce util functions to control ft_trans_file from savevm
layer.
  Introduce event-tap.
  Call init handler of event-tap at main() in vl.c.
  ioport: insert event_tap_ioport() to ioport_write().

[PATCH 08/19] savevm: introduce util functions to control ft_trans_file from savevm layer.

2011-01-26 Thread Yoshiaki Tamura
To utilize the ft_trans_file functions, savevm needs to export some
interfaces.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 hw/hw.h  |5 ++
 savevm.c |  149 ++
 2 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 7f05830..52e807c 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -51,6 +51,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc 
*put_buffer,
 QEMUFile *qemu_fopen(const char *filename, const char *mode);
 QEMUFile *qemu_fdopen(int fd, const char *mode);
 QEMUFile *qemu_fopen_socket(int fd);
+QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd);
 QEMUFile *qemu_popen(FILE *popen_file, const char *mode);
 QEMUFile *qemu_popen_cmd(const char *command, const char *mode);
 int qemu_stdio_fd(QEMUFile *f);
@@ -60,6 +61,9 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int 
size);
 void qemu_put_byte(QEMUFile *f, int v);
 void *qemu_realloc_buffer(QEMUFile *f, int size);
 void qemu_clear_buffer(QEMUFile *f);
+int qemu_ft_trans_begin(QEMUFile *f);
+int qemu_ft_trans_commit(QEMUFile *f);
+int qemu_ft_trans_cancel(QEMUFile *f);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
@@ -94,6 +98,7 @@ void qemu_file_set_error(QEMUFile *f);
  * halted due to rate limiting or EAGAIN errors occur as it can be used to
  * resume output. */
 void qemu_file_put_notify(QEMUFile *f);
+void qemu_file_get_notify(void *opaque);
 
 static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
 {
diff --git a/savevm.c b/savevm.c
index dc15c03..5418280 100644
--- a/savevm.c
+++ b/savevm.c
@@ -83,6 +83,7 @@
  #include "migration.h"
  #include "qemu_socket.h"
  #include "qemu-queue.h"
 +#include "ft_trans_file.h"
 
 #define SELF_ANNOUNCE_ROUNDS 5
 
@@ -190,6 +191,13 @@ typedef struct QEMUFileSocket
 QEMUFile *file;
 } QEMUFileSocket;
 
+typedef struct QEMUFileSocketTrans
+{
+int fd;
+QEMUFileSocket *s;
+VMChangeStateEntry *e;
+} QEMUFileSocketTrans;
+
 static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
 {
 QEMUFileSocket *s = opaque;
@@ -205,6 +213,22 @@ static int socket_get_buffer(void *opaque, uint8_t *buf, 
int64_t pos, int size)
 return len;
 }
 
+static ssize_t socket_put_buffer(void *opaque, const void *buf, size_t size)
+{
+QEMUFileSocket *s = opaque;
+ssize_t len;
+
+do {
+len = send(s->fd, (void *)buf, size, 0);
+} while (len == -1 && socket_error() == EINTR);
+
+if (len == -1) {
+len = -socket_error();
+}
+
+return len;
+}
+
 static int socket_close(void *opaque)
 {
 QEMUFileSocket *s = opaque;
@@ -212,6 +236,70 @@ static int socket_close(void *opaque)
 return 0;
 }
 
+static int socket_trans_get_buffer(void *opaque, uint8_t *buf, int64_t pos, 
size_t size)
+{
+QEMUFileSocketTrans *t = opaque;
+QEMUFileSocket *s = t->s;
+ssize_t len;
+
+len = socket_get_buffer(s, buf, pos, size);
+
+return len;
+}
+
+static ssize_t socket_trans_put_buffer(void *opaque, const void *buf, size_t 
size)
+{
+QEMUFileSocketTrans *t = opaque;
+
+return socket_put_buffer(t->s, buf, size);
+}
+
+
+static int socket_trans_get_ready(void *opaque)
+{
+QEMUFileSocketTrans *t = opaque;
+QEMUFileSocket *s = t-s;
+QEMUFile *f = s-file;
+int ret = 0;
+
+ret = qemu_loadvm_state(f, 1);
+if (ret < 0) {
+fprintf(stderr,
+"socket_trans_get_ready: error while loading vmstate\n");
+}
+
+return ret;
+}
+
+static int socket_trans_close(void *opaque)
+{
+QEMUFileSocketTrans *t = opaque;
+QEMUFileSocket *s = t->s;
+
+qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
+qemu_set_fd_handler2(t->fd, NULL, NULL, NULL, NULL);
+qemu_del_vm_change_state_handler(t->e);
+close(s->fd);
+close(t->fd);
+qemu_free(s);
+qemu_free(t);
+
+return 0;
+}
+
+static void socket_trans_resume(void *opaque, int running, int reason)
+{
+QEMUFileSocketTrans *t = opaque;
+QEMUFileSocket *s = t->s;
+
+if (!running) {
+return;
+}
+
+qemu_announce_self();
+qemu_fclose(s->file);
+}
+
 static int stdio_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int 
size)
 {
 QEMUFileStdio *s = opaque;
@@ -334,6 +422,26 @@ QEMUFile *qemu_fopen_socket(int fd)
 return s-file;
 }
 
+QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd)
+{
+    QEMUFileSocketTrans *t = qemu_mallocz(sizeof(QEMUFileSocketTrans));
+    QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket));
+
+    t->s = s;
+    t->fd = s_fd;
+    t->e = qemu_add_vm_change_state_handler(socket_trans_resume, t);
+
+    s->fd = c_fd;
+    s->file = qemu_fopen_ops_ft_trans(t, socket_trans_put_buffer,
+                                      socket_trans_get_buffer, NULL,
+                                      socket_trans_get_ready,
+                                      migrate_fd_wait_for_unfreeze,
+                                      socket_trans_close, 0);

[PATCH 19/19] migration: add a parser to accept FT migration incoming mode.

2011-01-26 Thread Yoshiaki Tamura
The option looks like: -incoming protocol:address:port,ft_mode

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 migration.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/migration.c b/migration.c
index 1752cf4..29d4fb1 100644
--- a/migration.c
+++ b/migration.c
@@ -45,6 +45,12 @@ int qemu_start_incoming_migration(const char *uri)
 const char *p;
 int ret;
 
+    /* check ft_mode option */
+    p = strstr(uri, "ft_mode");
+    if (p && !strcmp(p, "ft_mode")) {
+        ft_mode = FT_INIT;
+    }
+
     if (strstart(uri, "tcp:", &p))
         ret = tcp_start_incoming_migration(p);
 #if !defined(WIN32)
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 11/19] ioport: insert event_tap_ioport() to ioport_write().

2011-01-26 Thread Yoshiaki Tamura
Record ioport events to replay them upon failover.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 ioport.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/ioport.c b/ioport.c
index aa4188a..74aebf5 100644
--- a/ioport.c
+++ b/ioport.c
@@ -27,6 +27,7 @@
 
 #include "ioport.h"
 #include "trace.h"
+#include "event-tap.h"
 
 /***/
 /* IO Port */
@@ -76,6 +77,7 @@ static void ioport_write(int index, uint32_t address, 
uint32_t data)
 default_ioport_writel
 };
     IOPortWriteFunc *func = ioport_write_table[index][address];
+    event_tap_ioport(index, address, data);
     if (!func)
         func = default_func[index];
     func(ioport_opaque[address], address, data);
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 10/19] Call init handler of event-tap at main() in vl.c.

2011-01-26 Thread Yoshiaki Tamura
Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 vl.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/vl.c b/vl.c
index 8bbb785..9faeb27 100644
--- a/vl.c
+++ b/vl.c
@@ -162,6 +162,7 @@ int main(int argc, char **argv)
 #include qemu-queue.h
 #include cpus.h
 #include arch_init.h
+#include event-tap.h
 
 #include ui/qemu-spice.h
 
@@ -2895,6 +2896,8 @@ int main(int argc, char **argv, char **envp)
 
 blk_mig_init();
 
+event_tap_init();
+
 if (default_cdrom) {
 /* we always create the cdrom drive, even if no disk is there */
 drive_add(NULL, CDROM_ALIAS);
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 14/19] block: insert event-tap to bdrv_aio_writev() and bdrv_aio_flush().

2011-01-26 Thread Yoshiaki Tamura
The event-tap functions are called only when event-tap is on and the
requests are sent from device emulators.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 block.c |   15 +++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index ff2795b..e4df9b6 100644
--- a/block.c
+++ b/block.c
@@ -28,6 +28,7 @@
 #include block_int.h
 #include module.h
 #include qemu-objects.h
+#include event-tap.h
 
 #ifdef CONFIG_BSD
 #include sys/types.h
@@ -1476,6 +1477,10 @@ int bdrv_flush(BlockDriverState *bs)
 }
 
     if (bs->drv && bs->drv->bdrv_flush) {
+        if (*bs->device_name && event_tap_is_on()) {
+            event_tap_bdrv_flush();
+        }
+
         return bs->drv->bdrv_flush(bs);
     }
 
@@ -2111,6 +2116,11 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, 
int64_t sector_num,
     if (bdrv_check_request(bs, sector_num, nb_sectors))
         return NULL;
 
+    if (*bs->device_name && event_tap_is_on()) {
+        return event_tap_bdrv_aio_writev(bs, sector_num, qiov, nb_sectors,
+                                         cb, opaque);
+    }
+
     if (bs->dirty_bitmap) {
         blk_cb_data = blk_dirty_cb_alloc(bs, sector_num, nb_sectors, cb,
                                          opaque);
@@ -2374,6 +2384,11 @@ BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs,
 
 if (!drv)
 return NULL;
+
+    if (*bs->device_name && event_tap_is_on()) {
+        return event_tap_bdrv_aio_flush(bs, cb, opaque);
+    }
+
     return drv->bdrv_aio_flush(bs, cb, opaque);
 }
 
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 02/19] Introduce read() to FdMigrationState.

2011-01-26 Thread Yoshiaki Tamura
Currently FdMigrationState doesn't support read(), and this patch
introduces it to get a response from the other side.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 migration-tcp.c |   15 +++
 migration.c |   13 +
 migration.h |3 +++
 3 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/migration-tcp.c b/migration-tcp.c
index b55f419..55777c8 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -39,6 +39,20 @@ static int socket_write(FdMigrationState *s, const void * 
buf, size_t size)
     return send(s->fd, buf, size, 0);
 }
 
+static int socket_read(FdMigrationState *s, const void *buf, size_t size)
+{
+    ssize_t len;
+
+    do {
+        len = recv(s->fd, (void *)buf, size, 0);
+    } while (len == -1 && socket_error() == EINTR);
+    if (len == -1) {
+        len = -socket_error();
+    }
+
+    return len;
+}
+
 static int tcp_close(FdMigrationState *s)
 {
     DPRINTF("tcp_close\n");
@@ -94,6 +108,7 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
 
     s->get_error = socket_errno;
     s->write = socket_write;
+    s->read = socket_read;
     s->close = tcp_close;
     s->mig_state.cancel = migrate_fd_cancel;
     s->mig_state.get_status = migrate_fd_get_status;
diff --git a/migration.c b/migration.c
index d593b1d..bee20f0 100644
--- a/migration.c
+++ b/migration.c
@@ -334,6 +334,19 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void 
*data, size_t size)
 return ret;
 }
 
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t 
size)
+{
+    FdMigrationState *s = opaque;
+    int ret;
+
+    ret = s->read(s, data, size);
+    if (ret == -1) {
+        ret = -(s->get_error(s));
+    }
+
+    return ret;
+}
+
 void migrate_fd_connect(FdMigrationState *s)
 {
 int ret;
diff --git a/migration.h b/migration.h
index d13ed4f..7bf6747 100644
--- a/migration.h
+++ b/migration.h
@@ -47,6 +47,7 @@ struct FdMigrationState
 int (*get_error)(struct FdMigrationState*);
 int (*close)(struct FdMigrationState*);
 int (*write)(struct FdMigrationState*, const void *, size_t);
+int (*read)(struct FdMigrationState *, const void *, size_t);
 void *opaque;
 };
 
@@ -115,6 +116,8 @@ void migrate_fd_put_notify(void *opaque);
 
 ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size);
 
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t 
size);
+
 void migrate_fd_connect(FdMigrationState *s);
 
 void migrate_fd_put_ready(void *opaque);
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 15/19] savevm: introduce qemu_savevm_trans_{begin,commit}.

2011-01-26 Thread Yoshiaki Tamura
Introduce qemu_savevm_trans_{begin,complete} to send the memory and
device info together, while avoiding cancelling memory state tracking.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 savevm.c |   93 ++
 sysemu.h |2 +
 2 files changed, 95 insertions(+), 0 deletions(-)

diff --git a/savevm.c b/savevm.c
index 5418280..73465ed 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1726,6 +1726,99 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
 return 0;
 }
 
+int qemu_savevm_trans_begin(Monitor *mon, QEMUFile *f, int init)
+{
+SaveStateEntry *se;
+int skipped = 0;
+
+    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+        int len, stage, ret;
+
+        if (se->save_live_state == NULL) {
+            continue;
+        }
+
+        /* Section type */
+        qemu_put_byte(f, QEMU_VM_SECTION_START);
+        qemu_put_be32(f, se->section_id);
+
+        /* ID string */
+        len = strlen(se->idstr);
+        qemu_put_byte(f, len);
+        qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+        qemu_put_be32(f, se->instance_id);
+        qemu_put_be32(f, se->version_id);
+
+        stage = init ? QEMU_VM_SECTION_START : QEMU_VM_SECTION_PART;
+        ret = se->save_live_state(mon, f, stage, se->opaque);
+        if (!ret) {
+            skipped++;
+        }
+    }
+
+    if (qemu_file_has_error(f)) {
+        return -EIO;
+    }
+
+    return skipped;
+}
+
+int qemu_savevm_trans_complete(Monitor *mon, QEMUFile *f)
+{
+    SaveStateEntry *se;
+
+    cpu_synchronize_all_states();
+
+    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+        int ret;
+
+        if (se->save_live_state == NULL) {
+            continue;
+        }
+
+        /* Section type */
+        qemu_put_byte(f, QEMU_VM_SECTION_PART);
+        qemu_put_be32(f, se->section_id);
+
+        ret = se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
+        if (!ret) {
+            /* do not proceed to the next vmstate. */
+            return 1;
+        }
+    }
+
+    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+        int len;
+
+        if (se->save_state == NULL && se->vmsd == NULL) {
+            continue;
+        }
+
+        /* Section type */
+        qemu_put_byte(f, QEMU_VM_SECTION_FULL);
+        qemu_put_be32(f, se->section_id);
+
+        /* ID string */
+        len = strlen(se->idstr);
+        qemu_put_byte(f, len);
+        qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+        qemu_put_be32(f, se->instance_id);
+        qemu_put_be32(f, se->version_id);
+
+        vmstate_save(f, se);
+    }
+
+    qemu_put_byte(f, QEMU_VM_EOF);
+
+    if (qemu_file_has_error(f)) {
+        return -EIO;
+    }
+
+    return 0;
+}
+
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f)
 {
 SaveStateEntry *se;
diff --git a/sysemu.h b/sysemu.h
index 329415f..ee2c382 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -81,6 +81,8 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int 
blk_enable,
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
+int qemu_savevm_trans_begin(Monitor *mon, QEMUFile *f, int init);
+int qemu_savevm_trans_complete(Monitor *mon, QEMUFile *f);
 int qemu_loadvm_state(QEMUFile *f, int skip_header);
 
 /* SLIRP */
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 17/19] migration-tcp: modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled.

2011-01-26 Thread Yoshiaki Tamura
When ft_mode is set in the header, tcp_accept_incoming_migration()
sets ft_trans_incoming() as a callback, and calls
qemu_file_get_notify() to receive FT transactions iteratively.  We
also need a hack not to close the fd before moving to ft_transaction
mode, so that we can reuse the fd for it.  vm_change_state_handler is
added to turn off ft_mode when cont is pressed.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 migration-tcp.c |   67 ++-
 1 files changed, 66 insertions(+), 1 deletions(-)

diff --git a/migration-tcp.c b/migration-tcp.c
index 55777c8..84076d6 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -18,6 +18,8 @@
 #include "sysemu.h"
 #include "buffered_file.h"
 #include "block.h"
+#include "ft_trans_file.h"
+#include "event-tap.h"
 
 //#define DEBUG_MIGRATION_TCP
 
@@ -29,6 +31,8 @@
 do { } while (0)
 #endif
 
+static VMChangeStateEntry *vmstate;
+
 static int socket_errno(FdMigrationState *s)
 {
 return socket_error();
@@ -56,7 +60,8 @@ static int socket_read(FdMigrationState *s, const void * buf, 
size_t size)
 static int tcp_close(FdMigrationState *s)
 {
     DPRINTF("tcp_close\n");
-    if (s->fd != -1) {
+    /* FIX ME: accessing ft_mode here isn't clean */
+    if (s->fd != -1 && ft_mode != FT_INIT) {
         close(s->fd);
         s->fd = -1;
     }
@@ -150,6 +155,36 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
 return s-mig_state;
 }
 
+static void ft_trans_incoming(void *opaque)
+{
+    QEMUFile *f = opaque;
+
+    qemu_file_get_notify(f);
+    if (qemu_file_has_error(f)) {
+        ft_mode = FT_ERROR;
+        qemu_fclose(f);
+    }
+}
+
+static void ft_trans_reset(void *opaque, int running, int reason)
+{
+    QEMUFile *f = opaque;
+
+    if (running) {
+        if (ft_mode != FT_ERROR) {
+            qemu_fclose(f);
+        }
+        ft_mode = FT_OFF;
+        qemu_del_vm_change_state_handler(vmstate);
+    }
+}
+
+static void ft_trans_schedule_replay(QEMUFile *f)
+{
+    event_tap_schedule_replay();
+    vmstate = qemu_add_vm_change_state_handler(ft_trans_reset, f);
+}
+
 static void tcp_accept_incoming_migration(void *opaque)
 {
 struct sockaddr_in addr;
@@ -175,8 +210,38 @@ static void tcp_accept_incoming_migration(void *opaque)
 goto out;
 }
 
+    if (ft_mode == FT_INIT) {
+        autostart = 0;
+    }
+
     process_incoming_migration(f);
+
+    if (ft_mode == FT_INIT) {
+        int ret;
+
+        socket_set_nodelay(c);
+
+        f = qemu_fopen_ft_trans(s, c);
+        if (f == NULL) {
+            fprintf(stderr, "could not qemu_fopen_ft_trans\n");
+            goto out;
+        }
+
+        /* need to wait sender to setup */
+        ret = qemu_ft_trans_begin(f);
+        if (ret < 0) {
+            goto out;
+        }
+
+        qemu_set_fd_handler2(c, NULL, ft_trans_incoming, NULL, f);
+        ft_trans_schedule_replay(f);
+        ft_mode = FT_TRANSACTION_RECV;
+
+        return;
+    }
+
     qemu_fclose(f);
+
 out:
     close(c);
 out2:
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 09/19] Introduce event-tap.

2011-01-26 Thread Yoshiaki Tamura
event-tap controls when to start an FT transaction, and provides proxy
functions to be called from net/block devices.  During an FT
transaction, it queues up net/block requests, and flushes them when
the transaction gets completed.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 Makefile.target |1 +
 event-tap.c |  925 +++
 event-tap.h |   44 +++
 qemu-tool.c |   28 ++
 trace-events|9 +
 5 files changed, 1007 insertions(+), 0 deletions(-)
 create mode 100644 event-tap.c
 create mode 100644 event-tap.h

diff --git a/Makefile.target b/Makefile.target
index cd2abde..20f02d5 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -199,6 +199,7 @@ obj-y += rwhandler.o
 obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 obj-$(CONFIG_NO_KVM) += kvm-stub.o
 LIBS+=-lz
+obj-y += event-tap.o
 
 QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
 QEMU_CFLAGS += $(VNC_SASL_CFLAGS)
diff --git a/event-tap.c b/event-tap.c
new file mode 100644
index 000..ca2a204
--- /dev/null
+++ b/event-tap.c
@@ -0,0 +1,925 @@
+/*
+ * Event Tap functions for QEMU
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "block.h"
+#include "block_int.h"
+#include "ioport.h"
+#include "osdep.h"
+#include "sysemu.h"
+#include "hw/hw.h"
+#include "net.h"
+#include "event-tap.h"
+#include "trace.h"
+
+enum EVENT_TAP_STATE {
+EVENT_TAP_OFF,
+EVENT_TAP_ON,
+EVENT_TAP_SUSPEND,
+EVENT_TAP_FLUSH,
+EVENT_TAP_LOAD,
+EVENT_TAP_REPLAY,
+};
+
+static enum EVENT_TAP_STATE event_tap_state = EVENT_TAP_OFF;
+
+typedef struct EventTapIOport {
+uint32_t address;
+uint32_t data;
+int  index;
+} EventTapIOport;
+
+#define MMIO_BUF_SIZE 8
+
+typedef struct EventTapMMIO {
+uint64_t address;
+uint8_t  buf[MMIO_BUF_SIZE];
+int  len;
+} EventTapMMIO;
+
+typedef struct EventTapNetReq {
+char *device_name;
+int iovcnt;
+int vlan_id;
+bool vlan_needed;
+bool async;
+struct iovec *iov;
+NetPacketSent *sent_cb;
+} EventTapNetReq;
+
+#define MAX_BLOCK_REQUEST 32
+
+typedef struct EventTapAIOCB EventTapAIOCB;
+
+typedef struct EventTapBlkReq {
+char *device_name;
+int num_reqs;
+int num_cbs;
+bool is_flush;
+BlockRequest reqs[MAX_BLOCK_REQUEST];
+EventTapAIOCB *acb[MAX_BLOCK_REQUEST];
+} EventTapBlkReq;
+
+#define EVENT_TAP_IOPORT (1 << 0)
+#define EVENT_TAP_MMIO   (1 << 1)
+#define EVENT_TAP_NET    (1 << 2)
+#define EVENT_TAP_BLK    (1 << 3)
+
+#define EVENT_TAP_TYPE_MASK (EVENT_TAP_NET - 1)
+
+typedef struct EventTapLog {
+int mode;
+union {
+EventTapIOport ioport;
+EventTapMMIO mmio;
+};
+union {
+EventTapNetReq net_req;
+EventTapBlkReq blk_req;
+};
+QTAILQ_ENTRY(EventTapLog) node;
+} EventTapLog;
+
+struct EventTapAIOCB {
+BlockDriverAIOCB common;
+BlockDriverAIOCB *acb;
+bool is_canceled;
+};
+
+static EventTapLog *last_event_tap;
+
+static QTAILQ_HEAD(, EventTapLog) event_list;
+static QTAILQ_HEAD(, EventTapLog) event_pool;
+
+static int (*event_tap_cb)(void);
+static QEMUBH *event_tap_bh;
+static VMChangeStateEntry *vmstate;
+
+static void event_tap_bh_cb(void *p)
+{
+    if (event_tap_cb) {
+        event_tap_cb();
+    }
+
+    qemu_bh_delete(event_tap_bh);
+    event_tap_bh = NULL;
+}
+
+static void event_tap_schedule_bh(void)
+{
+    trace_event_tap_ignore_bh(!!event_tap_bh);
+
+    /* if bh is already set, we ignore it for now */
+    if (event_tap_bh) {
+        return;
+    }
+
+    event_tap_bh = qemu_bh_new(event_tap_bh_cb, NULL);
+    qemu_bh_schedule(event_tap_bh);
+
+    return;
+}
+
+static void *event_tap_alloc_log(void)
+{
+    EventTapLog *log;
+
+    if (QTAILQ_EMPTY(&event_pool)) {
+        log = qemu_mallocz(sizeof(EventTapLog));
+    } else {
+        log = QTAILQ_FIRST(&event_pool);
+        QTAILQ_REMOVE(&event_pool, log, node);
+    }
+
+    return log;
+}
+
+static void event_tap_free_log(EventTapLog *log)
+{
+    int i, mode = log->mode & ~EVENT_TAP_TYPE_MASK;
+
+    if (mode == EVENT_TAP_NET) {
+        EventTapNetReq *net_req = &log->net_req;
+
+        if (!net_req->async) {
+            for (i = 0; i < net_req->iovcnt; i++) {
+                qemu_free(net_req->iov[i].iov_base);
+            }
+            qemu_free(net_req->iov);
+        } else if (event_tap_state >= EVENT_TAP_LOAD) {
+            qemu_free(net_req->iov);
+        }
+
+        qemu_free(net_req->device_name);
+    } else if (mode == EVENT_TAP_BLK) {
+        EventTapBlkReq *blk_req = &log->blk_req;
+
+        if (event_tap_state >= EVENT_TAP_LOAD && !blk_req->is_flush) {
+            for (i = 0; i < blk_req->num_reqs; i++) {
+                qemu_iovec_destroy(blk_req->reqs[i].qiov);
+                qemu_free(blk_req->reqs[i].qiov);
+  

[PATCH 16/19] migration: introduce migrate_ft_trans_{put,get}_ready(), and modify migrate_fd_put_ready() when ft_mode is on.

2011-01-26 Thread Yoshiaki Tamura
Introduce migrate_ft_trans_put_ready() which kicks the FT transaction
cycle.  When ft_mode is on, migrate_fd_put_ready() opens
ft_trans_file and turns on event_tap.  To end or cancel an FT
transaction, ft_mode and event_tap are turned off.
migrate_ft_trans_get_ready() is called to receive the ack from the
receiver.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 migration.c |  267 ++-
 1 files changed, 266 insertions(+), 1 deletions(-)

diff --git a/migration.c b/migration.c
index cd02b7e..aa30ecd 100644
--- a/migration.c
+++ b/migration.c
@@ -21,6 +21,7 @@
 #include "qemu_socket.h"
 #include "block-migration.h"
 #include "qemu-objects.h"
+#include "event-tap.h"
 
 //#define DEBUG_MIGRATION
 
@@ -278,6 +279,14 @@ void migrate_fd_error(FdMigrationState *s)
 migrate_fd_cleanup(s);
 }
 
+static void migrate_ft_trans_error(FdMigrationState *s)
+{
+    ft_mode = FT_ERROR;
+    qemu_savevm_state_cancel(s->mon, s->file);
+    migrate_fd_error(s);
+    event_tap_unregister();
+}
+
 int migrate_fd_cleanup(FdMigrationState *s)
 {
 int ret = 0;
@@ -313,6 +322,17 @@ void migrate_fd_put_notify(void *opaque)
 qemu_file_put_notify(s-file);
 }
 
+static void migrate_fd_get_notify(void *opaque)
+{
+    FdMigrationState *s = opaque;
+
+    qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
+    qemu_file_get_notify(s->file);
+    if (qemu_file_has_error(s->file)) {
+        migrate_ft_trans_error(s);
+    }
+}
+
 ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
 {
 FdMigrationState *s = opaque;
@@ -347,6 +367,10 @@ int migrate_fd_get_buffer(void *opaque, uint8_t *data, 
int64_t pos, size_t size)
         ret = -(s->get_error(s));
     }
 
+    if (ret == -EAGAIN) {
+        qemu_set_fd_handler2(s->fd, NULL, migrate_fd_get_notify, NULL, s);
+    }
+
+
 return ret;
 }
 
@@ -373,6 +397,236 @@ void migrate_fd_connect(FdMigrationState *s)
 migrate_fd_put_ready(s);
 }
 
+static int migrate_ft_trans_commit(void *opaque)
+{
+    FdMigrationState *s = opaque;
+    int ret = -1;
+
+    if (ft_mode != FT_TRANSACTION_COMMIT && ft_mode != FT_TRANSACTION_ATOMIC) {
+        fprintf(stderr,
+                "migrate_ft_trans_commit: invalid ft_mode %d\n", ft_mode);
+        goto out;
+    }
+
+    do {
+        if (ft_mode == FT_TRANSACTION_ATOMIC) {
+            if (qemu_ft_trans_begin(s->file) < 0) {
+                fprintf(stderr, "qemu_ft_trans_begin failed\n");
+                goto out;
+            }
+
+            ret = qemu_savevm_trans_begin(s->mon, s->file, 0);
+            if (ret < 0) {
+                fprintf(stderr, "qemu_savevm_trans_begin failed\n");
+                goto out;
+            }
+
+            ft_mode = FT_TRANSACTION_COMMIT;
+            if (ret) {
+                /* don't proceed if the fd isn't ready */
+                goto out;
+            }
+        }
+
+        /* make the VM state consistent by flushing outstanding events */
+        vm_stop(0);
+
+        /* send at full speed */
+        qemu_file_set_rate_limit(s->file, 0);
+
+        ret = qemu_savevm_trans_complete(s->mon, s->file);
+        if (ret < 0) {
+            fprintf(stderr, "qemu_savevm_trans_complete failed\n");
+            goto out;
+        }
+
+        if (ret) {
+            /* don't proceed if the fd isn't ready */
+            ret = 1;
+            goto out;
+        }
+
+        ret = qemu_ft_trans_commit(s->file);
+        if (ret < 0) {
+            fprintf(stderr, "qemu_ft_trans_commit failed\n");
+            goto out;
+        }
+
+        if (ret) {
+            ft_mode = FT_TRANSACTION_RECV;
+            ret = 1;
+            goto out;
+        }
+
+        /* flush and check if events are remaining */
+        vm_start();
+        ret = event_tap_flush_one();
+        if (ret < 0) {
+            fprintf(stderr, "event_tap_flush_one failed\n");
+            goto out;
+        }
+
+        ft_mode = ret ? FT_TRANSACTION_BEGIN : FT_TRANSACTION_ATOMIC;
+    } while (ft_mode != FT_TRANSACTION_BEGIN);
+
+    vm_start();
+    ret = 0;
+
+out:
+    return ret;
+}
+
+static int migrate_ft_trans_get_ready(void *opaque)
+{
+    FdMigrationState *s = opaque;
+    int ret = -1;
+
+    if (ft_mode != FT_TRANSACTION_RECV) {
+        fprintf(stderr,
+                "migrate_ft_trans_get_ready: invalid ft_mode %d\n", ft_mode);
+        goto error_out;
+    }
+
+    /* flush and check if events are remaining */
+    vm_start();
+    ret = event_tap_flush_one();
+    if (ret < 0) {
+        fprintf(stderr, "event_tap_flush_one failed\n");
+        goto error_out;
+    }
+
+    if (ret) {
+        ft_mode = FT_TRANSACTION_BEGIN;
+    } else {
+        ft_mode = FT_TRANSACTION_ATOMIC;
+
+        ret = migrate_ft_trans_commit(s);
+        if (ret < 0) {
+            goto error_out;
+        }
+        if (ret) {
+            goto out;
+        }
+    }
+
+    vm_start();
+    ret = 0;
+    goto out;
+
+error_out:
+    migrate_ft_trans_error(s);
+
+out:
+

[PATCH 03/19] Introduce skip_header parameter to qemu_loadvm_state().

2011-01-26 Thread Yoshiaki Tamura
Introduce skip_header parameter to qemu_loadvm_state() so that it can
be called iteratively without reading the header.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 migration.c |2 +-
 savevm.c|   24 +---
 sysemu.h|2 +-
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/migration.c b/migration.c
index bee20f0..11eff51 100644
--- a/migration.c
+++ b/migration.c
@@ -60,7 +60,7 @@ int qemu_start_incoming_migration(const char *uri)
 
 void process_incoming_migration(QEMUFile *f)
 {
-    if (qemu_loadvm_state(f) < 0) {
+    if (qemu_loadvm_state(f, 0) < 0) {
         fprintf(stderr, "load of migration failed\n");
         exit(0);
     }
diff --git a/savevm.c b/savevm.c
index d1efdd3..dc15c03 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1709,7 +1709,7 @@ typedef struct LoadStateEntry {
 int version_id;
 } LoadStateEntry;
 
-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, int skip_header)
 {
 QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
 QLIST_HEAD_INITIALIZER(loadvm_handlers);
@@ -1722,17 +1722,19 @@ int qemu_loadvm_state(QEMUFile *f)
 return -EINVAL;
 }
 
-    v = qemu_get_be32(f);
-    if (v != QEMU_VM_FILE_MAGIC)
-        return -EINVAL;
+    if (!skip_header) {
+        v = qemu_get_be32(f);
+        if (v != QEMU_VM_FILE_MAGIC)
+            return -EINVAL;
 
-    v = qemu_get_be32(f);
-    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
-        fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
-        return -ENOTSUP;
+        v = qemu_get_be32(f);
+        if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+            fprintf(stderr,
+                    "SaveVM v2 format is obsolete and don't work anymore\n");
+            return -ENOTSUP;
+        }
+        if (v != QEMU_VM_FILE_VERSION)
+            return -ENOTSUP;
     }
-    if (v != QEMU_VM_FILE_VERSION)
-        return -ENOTSUP;
 
 while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
 uint32_t instance_id, version_id, section_id;
@@ -2055,7 +2057,7 @@ int load_vmstate(const char *name)
 return -EINVAL;
 }
 
-ret = qemu_loadvm_state(f);
+ret = qemu_loadvm_state(f, 0);
 
 qemu_fclose(f);
     if (ret < 0) {
diff --git a/sysemu.h b/sysemu.h
index 0c969f2..329415f 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -81,7 +81,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int 
blk_enable,
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, int skip_header);
 
 /* SLIRP */
 void do_info_slirp(Monitor *mon);
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/19] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().

2011-01-26 Thread Yoshiaki Tamura
Currently buf size is fixed at 32KB.  It would be useful if it could
be flexible.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 hw/hw.h  |2 ++
 savevm.c |   20 +++-
 2 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index dd993de..7f05830 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -58,6 +58,8 @@ void qemu_fflush(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
+void *qemu_realloc_buffer(QEMUFile *f, int size);
+void qemu_clear_buffer(QEMUFile *f);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
diff --git a/savevm.c b/savevm.c
index fcd8db4..d1efdd3 100644
--- a/savevm.c
+++ b/savevm.c
@@ -172,7 +172,8 @@ struct QEMUFile {
when reading */
 int buf_index;
 int buf_size; /* 0 when writing */
-uint8_t buf[IO_BUF_SIZE];
+int buf_max_size;
+uint8_t *buf;
 
 int has_error;
 };
@@ -423,6 +424,9 @@ QEMUFile *qemu_fopen_ops(void *opaque, 
QEMUFilePutBufferFunc *put_buffer,
 f-get_rate_limit = get_rate_limit;
 f-is_write = 0;
 
+    f->buf_max_size = IO_BUF_SIZE;
+    f->buf = qemu_malloc(sizeof(uint8_t) * f->buf_max_size);
+
 return f;
 }
 
@@ -453,6 +457,19 @@ void qemu_fflush(QEMUFile *f)
 }
 }
 
+void *qemu_realloc_buffer(QEMUFile *f, int size)
+{
+    f->buf_max_size = size;
+    f->buf = qemu_realloc(f->buf, f->buf_max_size);
+
+    return f->buf;
+}
+
+void qemu_clear_buffer(QEMUFile *f)
+{
+    f->buf_size = f->buf_index = f->buf_offset = 0;
+}
+
 static void qemu_fill_buffer(QEMUFile *f)
 {
 int len;
@@ -478,6 +495,7 @@ int qemu_fclose(QEMUFile *f)
     qemu_fflush(f);
     if (f->close)
         ret = f->close(f->opaque);
+    qemu_free(f->buf);
     qemu_free(f);
     return ret;
 }
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 06/19] virtio: decrement last_avail_idx with inuse before saving.

2011-01-26 Thread Yoshiaki Tamura
For regular migration inuse == 0 always as requests are flushed before
save. However, event-tap log when enabled introduces an extra queue
for requests which is not being flushed, thus the last inuse requests
are left in the event-tap queue.  Move the last_avail_idx value sent
to the remote back to make it repeat the last inuse requests.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 hw/virtio.c |   10 +-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/hw/virtio.c b/hw/virtio.c
index 31bd9e3..f05d1b6 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -673,12 +673,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
 qemu_put_be32(f, i);
 
 for (i = 0; i  VIRTIO_PCI_QUEUE_MAX; i++) {
+/* For regular migration inuse == 0 always as
+ * requests are flushed before save. However,
+ * event-tap log when enabled introduces an extra
+ * queue for requests which is not being flushed,
+ * thus the last inuse requests are left in the event-tap queue.
+ * Move the last_avail_idx value sent to the remote back
+ * to make it repeat the last inuse requests. */
+uint16_t last_avail = vdev-vq[i].last_avail_idx - vdev-vq[i].inuse;
 if (vdev-vq[i].vring.num == 0)
 break;
 
 qemu_put_be32(f, vdev-vq[i].vring.num);
 qemu_put_be64(f, vdev-vq[i].pa);
-qemu_put_be16s(f, vdev-vq[i].last_avail_idx);
+qemu_put_be16s(f, last_avail);
 if (vdev-binding-save_queue)
 vdev-binding-save_queue(vdev-binding_opaque, i, f);
 }
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 05/19] vl.c: add deleted flag for deleting the handler.

2011-01-26 Thread Yoshiaki Tamura
Make handler deletion robust against removal of elements while the
list is being iterated, by using a deleted flag as is done for file
descriptor handlers.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 vl.c |   13 +
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/vl.c b/vl.c
index 0292184..8bbb785 100644
--- a/vl.c
+++ b/vl.c
@@ -1140,6 +1140,7 @@ static void nographic_update(void *opaque)
 struct vm_change_state_entry {
 VMChangeStateHandler *cb;
 void *opaque;
+int deleted;
 QLIST_ENTRY (vm_change_state_entry) entries;
 };
 
@@ -1160,8 +1161,7 @@ VMChangeStateEntry 
*qemu_add_vm_change_state_handler(VMChangeStateHandler *cb,
 
 void qemu_del_vm_change_state_handler(VMChangeStateEntry *e)
 {
-    QLIST_REMOVE (e, entries);
-    qemu_free (e);
+    e->deleted = 1;
 }
 
 void vm_state_notify(int running, int reason)
@@ -1170,8 +1170,13 @@ void vm_state_notify(int running, int reason)
 
 trace_vm_state_notify(running, reason);
 
-    for (e = vm_change_state_head.lh_first; e; e = e->entries.le_next) {
-        e->cb(e->opaque, running, reason);
+    QLIST_FOREACH(e, &vm_change_state_head, entries) {
+        if (e->deleted) {
+            QLIST_REMOVE(e, entries);
+            qemu_free(e);
+        } else {
+            e->cb(e->opaque, running, reason);
+        }
     }
 }
 }
 
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/19] qemu-char: export socket_set_nodelay().

2011-01-26 Thread Yoshiaki Tamura
Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 qemu-char.c   |2 +-
 qemu_socket.h |1 +
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/qemu-char.c b/qemu-char.c
index edc9ad6..737d347 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2116,7 +2116,7 @@ static void tcp_chr_telnet_init(int fd)
 send(fd, (char *)buf, 3, 0);
 }
 
-static void socket_set_nodelay(int fd)
+void socket_set_nodelay(int fd)
 {
 int val = 1;
     setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (char *)&val, sizeof(val));
diff --git a/qemu_socket.h b/qemu_socket.h
index 897a8ae..b7f8465 100644
--- a/qemu_socket.h
+++ b/qemu_socket.h
@@ -36,6 +36,7 @@ int inet_aton(const char *cp, struct in_addr *ia);
 int qemu_socket(int domain, int type, int protocol);
 int qemu_accept(int s, struct sockaddr *addr, socklen_t *addrlen);
 void socket_set_nonblock(int fd);
+void socket_set_nodelay(int fd);
 int send_all(int fd, const void *buf, int len1);
 
 /* New, ipv6-ready socket helper functions, see qemu-sockets.c */
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 07/19] Introduce fault tolerant VM transaction QEMUFile and ft_mode.

2011-01-26 Thread Yoshiaki Tamura
This code implements VM transaction protocol.  Like buffered_file, it
sits between savevm and migration layer.  With this architecture, VM
transaction protocol is implemented mostly independent from other
existing code.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 Makefile.objs   |1 +
 ft_trans_file.c |  624 +++
 ft_trans_file.h |   72 +++
 migration.c |3 +
 trace-events|   16 ++
 5 files changed, 716 insertions(+), 0 deletions(-)
 create mode 100644 ft_trans_file.c
 create mode 100644 ft_trans_file.h

diff --git a/Makefile.objs b/Makefile.objs
index fda366d..1f10fbc 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -100,6 +100,7 @@ common-obj-y += msmouse.o ps2.o
 common-obj-y += qdev.o qdev-properties.o
 common-obj-y += block-migration.o
 common-obj-y += pflib.o
+common-obj-y += ft_trans_file.o
 
 common-obj-$(CONFIG_BRLAPI) += baum.o
 common-obj-$(CONFIG_POSIX) += migration-exec.o migration-unix.o migration-fd.o
diff --git a/ft_trans_file.c b/ft_trans_file.c
new file mode 100644
index 000..2b42b95
--- /dev/null
+++ b/ft_trans_file.c
@@ -0,0 +1,624 @@
+/*
+ * Fault tolerant VM transaction QEMUFile
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * This source code is based on buffered_file.c.
+ * Copyright IBM, Corp. 2008
+ * Authors:
+ *  Anthony Liguori   aligu...@us.ibm.com
+ */
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "hw/hw.h"
+#include "qemu-timer.h"
+#include "sysemu.h"
+#include "qemu-char.h"
+#include "trace.h"
+#include "ft_trans_file.h"
+
+typedef struct FtTransHdr
+{
+uint16_t cmd;
+uint16_t id;
+uint32_t seq;
+uint32_t payload_len;
+} FtTransHdr;
+
+typedef struct QEMUFileFtTrans
+{
+FtTransPutBufferFunc *put_buffer;
+FtTransGetBufferFunc *get_buffer;
+FtTransPutReadyFunc *put_ready;
+FtTransGetReadyFunc *get_ready;
+FtTransWaitForUnfreezeFunc *wait_for_unfreeze;
+FtTransCloseFunc *close;
+void *opaque;
+QEMUFile *file;
+
+enum QEMU_VM_TRANSACTION_STATE state;
+uint32_t seq;
+uint16_t id;
+
+int has_error;
+
+bool freeze_output;
+bool freeze_input;
+bool rate_limit;
+bool is_sender;
+bool is_payload;
+
+uint8_t *buf;
+size_t buf_max_size;
+size_t put_offset;
+size_t get_offset;
+
+FtTransHdr header;
+size_t header_offset;
+} QEMUFileFtTrans;
+
+#define IO_BUF_SIZE 32768
+
+static void ft_trans_append(QEMUFileFtTrans *s,
+                            const uint8_t *buf, size_t size)
+{
+    if (size > (s->buf_max_size - s->put_offset)) {
+        trace_ft_trans_realloc(s->buf_max_size, size + 1024);
+        s->buf_max_size += size + 1024;
+        s->buf = qemu_realloc(s->buf, s->buf_max_size);
+    }
+
+    trace_ft_trans_append(size);
+    memcpy(s->buf + s->put_offset, buf, size);
+    s->put_offset += size;
+}
+
+static void ft_trans_flush(QEMUFileFtTrans *s)
+{
+    size_t offset = 0;
+
+    if (s->has_error) {
+        error_report("flush when error %d, bailing", s->has_error);
+        return;
+    }
+
+    while (offset < s->put_offset) {
+        ssize_t ret;
+
+        ret = s->put_buffer(s->opaque, s->buf + offset, s->put_offset - offset);
+        if (ret == -EAGAIN) {
+            break;
+        }
+
+        if (ret <= 0) {
+            error_report("error flushing data, %s", strerror(errno));
+            s->has_error = FT_TRANS_ERR_FLUSH;
+            break;
+        } else {
+            offset += ret;
+        }
+    }
+
+    trace_ft_trans_flush(offset, s->put_offset);
+    memmove(s->buf, s->buf + offset, s->put_offset - offset);
+    s->put_offset -= offset;
+    s->freeze_output = !!s->put_offset;
+}
+
+static ssize_t ft_trans_put(void *opaque, void *buf, int size)
+{
+    QEMUFileFtTrans *s = opaque;
+    size_t offset = 0;
+    ssize_t len;
+
+    /* flush buffered data before putting next */
+    if (s->put_offset) {
+        ft_trans_flush(s);
+    }
+
+    while (!s->freeze_output && offset < size) {
+        len = s->put_buffer(s->opaque, (uint8_t *)buf + offset, size - offset);
+
+        if (len == -EAGAIN) {
+            trace_ft_trans_freeze_output();
+            s->freeze_output = 1;
+            break;
+        }
+
+        if (len <= 0) {
+            error_report("putting data failed, %s", strerror(errno));
+            s->has_error = 1;
+            offset = -EINVAL;
+            break;
+        }
+
+        offset += len;
+    }
+
+    if (s->freeze_output) {
+        ft_trans_append(s, buf + offset, size - offset);
+        offset = size;
+    }
+
+    return offset;
+}
+
+static int ft_trans_send_header(QEMUFileFtTrans *s,
+enum QEMU_VM_TRANSACTION_STATE state,
+uint32_t payload_len)
+{
+int ret;
+FtTransHdr 

Re: EPT: Misconfiguration

2011-01-26 Thread Avi Kivity

On 01/25/2011 08:29 PM, Ruben Kerkhof wrote:

  When you say suddenly, this was with no changes to software and hardware?

The host software and hardware hasn't changed in the two months since
the machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13.

We host customer vms on it though, so virtual machines come and go.
Various operating systems, a mixture of Linux, FreeBSD and Windows
2008 R2. We have other machines with the same config without these
problems though.


Are those other machines running a similar workload?

The traces look awfully like bad hardware, though that can also be 
explained by random memory corruption due to a bug.



This time I have a few different messages though:

2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection fault:  
[#1] SMP

RSI:  RDI: 1603a07305001568

2011-01-25T11:58:50.001486+01:00 phy005 kernel: Code: ff ff 41 8b 46
08 41 29 06 4c 89 e7 57 9d 0f 1f 44 00 00 48 83 c4 18 5b 41 5c 41 5d
41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 <f0> ff 4f 08 0f 94 c0 84
c0 74 10 85 f6 75 07 e8 63 fe ff ff eb


lock decl 0x8(%rdi)

%rdi is completely crap, looks like corruption again.  Strangely, it is 
similar to the bad spte from the previous trace: 0x1603a0730500d277.  
The upper 48 bits are identical, the lower 16 bits are different:

2011-01-25T12:06:32.673937+01:00 phy005 kernel: qemu-kvm: Corrupted
page table at address 7f37b37ff000
2011-01-25T12:06:32.673959+01:00 phy005 kernel: PGD c201d1067 PUD
94e538067 PMD 61e5bf067 PTE 1603a0730500e067


Here are those magic 48 bits again, in the PTE entry.

2011-01-25T12:38:49.416943+01:00 phy005 kernel: EPT: Misconfiguration.
2011-01-25T12:38:49.417518+01:00 phy005 kernel: EPT: GPA: 0x2abff038
2011-01-25T12:38:49.417526+01:00 phy005 kernel:
ept_misconfig_inspect_spte: spte 0x5f49e9007 level 4
2011-01-25T12:38:49.417532+01:00 phy005 kernel:
ept_misconfig_inspect_spte: spte 0x5db595007 level 3
2011-01-25T12:38:49.417553+01:00 phy005 kernel:
ept_misconfig_inspect_spte: spte 0x5d5da7007 level 2
2011-01-25T12:38:49.417558+01:00 phy005 kernel:
ept_misconfig_inspect_spte: spte 0x1603a07305006277 level 1


Again.


2011-01-25T13:16:58.192440+01:00 phy005 kernel: BUG: Bad page map in
process qemu-kvm  pte:1603a0730500d067 pmd:61059f067


Again.

However, these all came from a single boot, yes?  If so they can be the 
same corruption.  Please collect more traces, with reboots in between.


--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts

2011-01-26 Thread Avi Kivity

On 01/26/2011 11:39 AM, Michael S. Tsirkin wrote:

On Wed, Jan 26, 2011 at 11:23:21AM +0200, Avi Kivity wrote:
  On 01/26/2011 11:20 AM, Michael S. Tsirkin wrote:
  On Wed, Jan 26, 2011 at 11:17:11AM +0200, Avi Kivity wrote:
 On 01/25/2011 07:58 PM, Michael S. Tsirkin wrote:
 On Tue, Jan 25, 2011 at 07:33:40PM +0200, Avi Kivity wrote:
 On 01/25/2011 04:59 PM, Michael S. Tsirkin wrote:
 On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote:
  For the other lookups, which we
  believe will succeed, we can assume the probability of a match is
  related to the slot size, and sort the slots by page count.
 
 Unlikely to be true for assigned device BARs.
 
 We'll have a large slot at 4G+ - EOM, a medium slot at 1M-3G, and
 lots of small slots for BARs and such.
 
 The vast majority of faults will either touch one of the two 
largest
 slots, or will miss all slots.  Relatively few will hit the small
 slots.
 
 Not if you are using one of the assigned devices and don't
 do any faults on memory :)
  
 It's impossible not to fault on memory.
  
  No I mean the RAM.

  No idea what you mean.  It's impossible not to fault on RAM, either
  (unless you don't use it at all).

I just mean that once you fault you map sptes and then you can use them
without exits.  mmio will cause exits each time. Right?


The swapper scanning sptes, ksmd, khugepaged, and swapping can all cause 
a page to be unmapped.  Though it should certainly happen with a much 
lower frequency than mmio.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power

2011-01-26 Thread Peter Zijlstra
On Tue, 2011-01-25 at 19:27 -0200, Glauber Costa wrote:
 On Tue, 2011-01-25 at 22:07 +0100, Peter Zijlstra wrote:
  On Tue, 2011-01-25 at 18:47 -0200, Glauber Costa wrote:
   On Tue, 2011-01-25 at 21:13 +0100, Peter Zijlstra wrote:
On Tue, 2011-01-25 at 18:02 -0200, Glauber Costa wrote:

 I fail to see how clock_task influences cpu power.
 If we also have to touch clock_task for better accounting of other
 stuff, it is a separate story.
 But for cpu_power, I really fail. Please enlighten me.

static void update_rq_clock_task(struct rq *rq, s64 delta)
{
    s64 irq_delta;

    irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;

    if (irq_delta > delta)
        irq_delta = delta;

    rq->prev_irq_time += irq_delta;
    delta -= irq_delta;
    rq->clock_task += delta;

    if (irq_delta && sched_feat(NONIRQ_POWER))
        sched_rt_avg_update(rq, irq_delta);
}

it's done through that sched_rt_avg_update() (should probably rename
that); it computes a floating average of time not spent on fair tasks.

   It creates a dependency on CONFIG_IRQ_TIME_ACCOUNTING, though.
   This piece of code is simply compiled out if this option is disabled.
  
  We can pull this bit out and make the common bit also available for
  paravirt.
 
 scale_rt_power() seems to do the right thing, but all the path leading
 to it seem to work on rq->clock, rather than rq->clock_task.

Not quite, see how rq->clock_task is irq_delta less than the increment
to rq->clock? You want it to be your steal-time delta less too.

 Although I do can experiment with that as well, could you please
 elaborate on what are your reasons to prefer this over than variations
 of the method I proposed?

Because I want rq->clock_task to not include steal-time.


Re: KVM call agenda for Jan 25

2011-01-26 Thread Avi Kivity

On 01/25/2011 04:35 PM, Stefan Hajnoczi wrote:

On Tue, Jan 25, 2011 at 2:26 PM, Avi Kivitya...@redhat.com  wrote:
  On 01/25/2011 12:06 AM, Anthony Liguori wrote:

  On 01/24/2011 07:25 AM, Chris Wright wrote:

  Please send in any agenda items you are interested in covering.

  - coroutines for the block layer

  I have a perpetually in progress branch for this, and would very much like
  to see this done.

Seen this?
http://repo.or.cz/w/qemu/stefanha.git/commit/8179e8ff20bb3f14f361109afe5b3bf2bac24f0d
http://repo.or.cz/w/qemu/stefanha.git/shortlog/8179e8ff20bb3f14f361109afe5b3bf2bac24f0d

And the qemu-devel thread:
http://www.mail-archive.com/qemu-devel@nongnu.org/msg52522.html


Thanks.  Will follow up on the thread.

--
error compiling committee.c: too many arguments to function



compilation problem

2011-01-26 Thread antoni artigues
Hello

I have RHEL4 with kernel 2.6.9

I'm trying to compile kvm-kmod-2.6.30

After make command returns:

-
[root@maeko kvm-kmod-2.6.30]# make
make -C /lib/modules/2.6.9-67.ELsmp/build M=`pwd` \
LINUXINCLUDE=-I`pwd`/include -Iinclude \
 \
-Iarch/x86/include -I`pwd`/include-compat \
-include include/linux/autoconf.h \
-include `pwd`/x86/external-module-compat.h  \
$@
make[1]: Entering directory `/usr/src/kernels/2.6.9-67.EL-smp-x86_64'
  Building modules, stage 2.
  MODPOST
make[1]: Leaving directory `/usr/src/kernels/2.6.9-67.EL-smp-x86_64'
-

It seems all has gone well. But the kvm.ko is not created.

The make install returns:

[root@maeko kvm-kmod-2.6.30]# make install
mkdir -p //lib/modules/2.6.9-67.ELsmp/extra
cp x86/*.ko //lib/modules/2.6.9-67.ELsmp/extra
cp: cannot stat `x86/*.ko': No such file or directory
make: *** [install] Error 1


I tried with kvm-kmod-2.6.31.5 but it returns the same error.

With the kvm-kmod-2.6.31.6 the configure command says: Error: kernel is
too old for this kvm-kmod release.

Thanks in advance.

Regards

Antoni Artigues



Re: [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure

2011-01-26 Thread Avi Kivity

On 01/24/2011 08:06 PM, Glauber Costa wrote:

KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual 
Mojito),
is a piece of shared memory that is visible to both the hypervisor and the guest
kernel - but not the guest userspace.

The basic idea is that the guest can tell the hypervisor about a specific
piece of memory, and what it expects to find in there. This is a generic
abstraction, that goes to userspace (qemu) if KVM (the hypervisor) can't
handle a specific request, thus giving us flexibility in some features
in the future.

KVM (The hypervisor) can change the contents of this piece of memory at
will. This works well with paravirtual information, and hopefully
normal guest memory - like last update time for the watchdog, for
instance.

This is basic KVM registration headers. I am keeping headers
separate to facilitate backports to people who wants to backport
the kernel part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costaglom...@redhat.com
CC: Avi Kivitya...@redhat.com
---
  arch/x86/include/asm/kvm_para.h |   11 +++
  1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index a427bf7..b0b0ee0 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,7 @@
   */
  #define KVM_FEATURE_CLOCKSOURCE2   3
  #define KVM_FEATURE_ASYNC_PF  4
+#define KVM_FEATURE_MEMORY_AREA5

  /* The last 8 bits are used to indicate how to interpret the flags field
   * in pvclock structure. If no bits are set, all flags are ignored.
@@ -35,6 +36,16 @@
  #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
  #define MSR_KVM_ASYNC_PF_EN 0x4b564d02

+#define MSR_KVM_REGISTER_MEM_AREA 0x4b564d03
+
+struct kvm_memory_area {
+   __u64 base;
+   __u32 size;
+   __u32 type;
+   __u8  result;
+   __u8  pad[3];
+};
+
  #define KVM_MAX_MMU_OP_BATCH   32


I'm guessing the protocol here is:

- guest fills in ->base/size/type
- issues wrmsr
- host registers the memory and updates ->result
- guest examines ->result

there are two issues with this approach:

- it doesn't lend itself will to live migration.  Extra state must be 
maintained in the hypervisor.

- it isn't how normal hardware operates

what's wrong with extending the normal approach of one msr per feature?

--
error compiling committee.c: too many arguments to function



Re: [PATCH 03/16] KVM-HDR: KVM Userspace registering ioctl

2011-01-26 Thread Avi Kivity

On 01/24/2011 08:06 PM, Glauber Costa wrote:

KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual 
Mojito),
is a piece of shared memory that is visible to both the hypervisor and the guest
kernel - but not the guest userspace.

The basic idea is that the guest can tell the hypervisor about a specific
piece of memory, and what it expects to find in there. This is a generic
abstraction, that goes to userspace (qemu) if KVM (the hypervisor) can't
handle a specific request, thus giving us flexibility in some features
in the future.

KVM (The hypervisor) can change the contents of this piece of memory at
will. This works well with paravirtual information, and hopefully
normal guest memory - like last update time for the watchdog, for
instance.

This patch contains the header part of the userspace communication 
implementation.
Userspace can query the presence/absence of this feature in the normal way.
It also tells the hypervisor that it is capable of handling - in whatever
way it chooses, registrations that the hypervisor does not know how to.
In x86, only user so far, this mechanism is implemented as generic userspace
msr exit, that could theoretically be used to implement msr-handling in
userspace.

I am keeping the headers separate to facilitate backports to people
who wants to backport the kernel part but not the hypervisor, or the other way 
around.



Again the protocol is not specified.  How about starting from 
Documentation/kvm/api.txt so we don't have to guess?



diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ea2dc1a..5cc4fe8 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -161,6 +161,7 @@ struct kvm_pit_config {
  #define KVM_EXIT_NMI  16
  #define KVM_EXIT_INTERNAL_ERROR   17
  #define KVM_EXIT_OSI  18
+#define KVM_EXIT_X86_MSR_OP  19




  /*
@@ -541,6 +551,7 @@ struct kvm_ppc_pvinfo {
  #define KVM_CAP_PPC_GET_PVINFO 57
  #define KVM_CAP_PPC_IRQ_LEVEL 58
  #define KVM_CAP_ASYNC_PF 59
+#define KVM_CAP_REGISTER_MEM_AREA 60


These two seem completely unrelated.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory

2011-01-26 Thread Avi Kivity

On 01/24/2011 08:06 PM, Glauber Costa wrote:

As a proof of concept to KVM - Kernel Virtual Memory, this patch
implements wallclock grabbing on top of it. At first, it may seem
as a waste of work to just redo it, since it is working well. But over the
time, other MSRs were added - think ASYNC_PF - and more will probably come.
After this patch, we won't need to ever add another virtual MSR to KVM.



So instead of adding MSRs, we're adding area identifiers.  What did we gain?

--
error compiling committee.c: too many arguments to function



Re: compilation problem

2011-01-26 Thread Avi Kivity

On 01/26/2011 12:48 PM, antoni artigues wrote:

Hello

I have RHEL4 with kernel 2.6.9

I'm trying to compile kvm-kmod-2.6.30


2.6.9 is way too old for kvm-kmod.  Any particular reason why you want 
kvm over such an old host?


--
error compiling committee.c: too many arguments to function



Re: ms_sysenter_eip zero value

2011-01-26 Thread Avi Kivity

On 01/23/2011 01:25 PM, Matteo Signorini wrote:

Hi,

I'm having some problems understanding the sysenter instruction.
As far as I know, in order to successfully call the sysenter instruction,
MSR_IA32_SYSENTER_CS and MSR_IA32_SYSENTER_EIP registers have to be
correctly set.

So I printed the value of such registers while the VM was running but
the output is 0 for both.

now:

1) I'm having this problem ONLY with the Intel CPU (vmx.c source code).
When I run the same code on an AMD CPU (svm.c source code)
MSR_IA32_SYSENTER_EIP and MSR_IA32_SYSENTER_CS contain nonzero values.

2) I am 100% sure the guest is not executing an int80 but a sysenter

so there is something here I can't understand...
please help me solving this problem.


How are you printing out the values? Maybe the problem is there?

--
error compiling committee.c: too many arguments to function



Re: compilation problem

2011-01-26 Thread antoni artigues
Hello

Well, we have a 32 nodes cluster. With RHEL4 and 2.6.9 kernel in all the
nodes. The cluster is in production. Now, is not possible to update each
of the 32 nodes.

We want to use opennebula with KVM in the cluster. So, we are trying to
compile the kvm-kmod.

Is it not possible? Do we need to update the kernel on each node?

Thanks

Antoni Artigues

On Wed, 2011-01-26 at 13:17 +0200, Avi Kivity wrote:
 On 01/26/2011 12:48 PM, antoni artigues wrote:
  Hello
 
  I have RHEL4 with kernel 2.6.9
 
  I'm trying to compile kvm-kmod-2.6.30
 
 2.6.9 is way too old for kvm-kmod.  Any particular reason why you want 
 kvm over such an old host?
 




Re: [PATCH 07/18] kvm: Add MCE signal support for !CONFIG_IOTHREAD

2011-01-26 Thread Marcelo Tosatti
On Wed, Jan 26, 2011 at 09:09:25AM +0100, Jan Kiszka wrote:
 On 2011-01-24 13:36, Jan Kiszka wrote:
  On 2011-01-24 12:17, Marcelo Tosatti wrote:
  On Mon, Jan 10, 2011 at 09:32:00AM +0100, Jan Kiszka wrote:
  From: Jan Kiszka jan.kis...@siemens.com
 
  Currently, we only configure and process MCE-related SIGBUS events if
  CONFIG_IOTHREAD is enabled. Fix this by factoring out the required
  handler registration and system configuration. Make sure that events
  happening over a VCPU context in non-threaded mode get dispatched as
  VCPU MCEs.
 
  We also need to call qemu_kvm_eat_signals in non-threaded mode now, so
  move it (unmodified) and add the required Windows stub.
 
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
  CC: Huang Ying ying.hu...@intel.com
  ---
   cpus.c |  200 
  +++
   1 files changed, 124 insertions(+), 76 deletions(-)
 
  diff --git a/cpus.c b/cpus.c
  index 6da0f8f..b6f1cfb 100644
  --- a/cpus.c
  +++ b/cpus.c
  @@ -34,9 +34,6 @@
   
    #include "cpus.h"
    #include "compatfd.h"
   -#ifdef CONFIG_LINUX
   -#include <sys/prctl.h>
   -#endif
   
   #ifdef SIGRTMIN
   #define SIG_IPI (SIGRTMIN+4)
  @@ -44,10 +41,24 @@
   #define SIG_IPI SIGUSR1
   #endif
   
 
  @@ -912,6 +954,8 @@ static int qemu_cpu_exec(CPUState *env)
   
   bool cpu_exec_all(void)
   {
  +int r;
  +
   if (next_cpu == NULL)
   next_cpu = first_cpu;
    for (; next_cpu != NULL && !exit_request; next_cpu = next_cpu->next_cpu) {
  @@ -923,7 +967,11 @@ bool cpu_exec_all(void)
   if (qemu_alarm_pending())
   break;
   if (cpu_can_run(env)) {
  -if (qemu_cpu_exec(env) == EXCP_DEBUG) {
  +r = qemu_cpu_exec(env);
  +if (kvm_enabled()) {
  +qemu_kvm_eat_signals(env);
  +}
  +if (r == EXCP_DEBUG) {
   break;
   }
 
  SIGBUS should be processed outside of vcpu execution context, think of a
  non MCE SIGBUS while vm is stopped. Could use signalfd for that.
  
  signalfd - that's the missing bit. I was thinking of how to handle
  SIGBUS events raised outside the vcpu context. We need to handle them
  synchronously, and signalfd should allow this.
 
 This was straightforward. But now I wonder what actually makes this
 pattern work. Doesn't the kernel force-inject SIGBUS, i.e. ignores any
 blocking? Or does this only apply to BUS_MCEERR_AR?

SIGBUS is only forced if BUS_MCEERR_AR and the poisoned memory was not accessed 
on behalf of the guest (say directly by qemu).

  But the SIGBUS handler for !IOTHREAD case should not ignore Action
  Required, since it might have been generated in vcpu context.
 
  
  Yes, the sigbus handler will require some rework when we actually start
  using it for !IOTHREAD.
 
 And this no longer makes sense to me. The current version simply uses
 the same sigbus handler for both modes, and that forwards the error code
 properly. What did you mean?

There are two handlers, kvm_on_sigbus and kvm_on_sigbus_vcpu.
kvm_on_sigbus, the handler for iothread, dies on BUS_MCEERR_AR (which
will be generated if poisoned memory is accessed on behalf of vcpu). It
should be handled with !CONFIG_IOTHREAD.



Re: [Qemu-devel] Re: Errors on MMIO read access on VM suspend / resume operations

2011-01-26 Thread Stefan Berger

On 01/26/2011 03:14 AM, Jan Kiszka wrote:

On 2011-01-25 17:49, Stefan Berger wrote:

On 01/25/2011 02:26 AM, Jan Kiszka wrote:

Do you see a chance to look closer at the issue yourself? E.g.
instrument the kernel's irqchip models and dump their states once your
guest is stuck?

The device runs on iRQ 3. So I applied this patch here.

diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 3cece05..8f4f94c 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -106,7 +106,7 @@ static inline int pic_set_irq1(struct kvm_kpic_state *s, int irq, int level)
 {
     int mask, ret = 1;
     mask = 1 << irq;
-    if (s->elcr & mask)    /* level triggered */
+    if (s->elcr & mask)    /* level triggered */ {
     if (level) {
         ret = !(s->irr & mask);
         s->irr |= mask;
@@ -115,7 +115,10 @@ static inline int pic_set_irq1(struct kvm_kpic_state *s, int irq, int level)
         s->irr &= ~mask;
         s->last_irr &= ~mask;
     }
-    else    /* edge triggered */
+    if (irq == 3)
+        printk("%s %d: level=%d, irr = %x\n", __FUNCTION__, __LINE__, level, s->irr);
+    }
+    else    /* edge triggered */ {
     if (level) {
         if ((s->last_irr & mask) == 0) {
             ret = !(s->irr & mask);
@@ -124,7 +127,9 @@ static inline int pic_set_irq1(struct kvm_kpic_state *s, int irq, int level)
         s->last_irr |= mask;
     } else
         s->last_irr &= ~mask;
-
+    if (irq == 3)
+        printk("%s %d: level=%d, irr = %x\n", __FUNCTION__, __LINE__, level, s->irr);
+    }
     return (s->imr & mask) ? -1 : ret;
 }

@@ -206,6 +211,8 @@ int kvm_pic_set_irq(void *opaque, int irq, int level)

     pic_lock(s);
     if (irq >= 0 && irq < PIC_NUM_PINS) {
+        if (irq == 3)
+            printk("%s\n", __FUNCTION__);
         ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
         pic_update_irq(s);
         trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr,



While it's still working I see this here with the levels changing 0-1-0.
Though then it stops and levels are only at '1'.

[ 1773.833824] kvm_pic_set_irq
[ 1773.833827] pic_set_irq1 131: level=0, irr = 5b
[ 1773.834161] kvm_pic_set_irq
[ 1773.834163] pic_set_irq1 131: level=1, irr = 5b
[ 1773.834193] kvm_pic_set_irq
[ 1773.834195] pic_set_irq1 131: level=0, irr = 5b
[ 1773.835028] kvm_pic_set_irq
[ 1773.835031] pic_set_irq1 131: level=1, irr = 5b
[ 1773.835542] kvm_pic_set_irq
[ 1773.835545] pic_set_irq1 131: level=1, irr = 5b
[ 1773.889892] kvm_pic_set_irq
[ 1773.889894] pic_set_irq1 131: level=1, irr = 5b
[ 1791.258793] pic_set_irq1 119: level=1, irr = d9
[ 1791.258824] pic_set_irq1 119: level=0, irr = d1
[ 1791.402476] pic_set_irq1 119: level=1, irr = d9
[ 1791.402534] pic_set_irq1 119: level=0, irr = d1
[ 1791.402538] pic_set_irq1 119: level=1, irr = d9
[...]


I believe the last 5 shown calls can be ignored. After that the
interrupts don't go through anymore.

In the device model I see interrupts being raised and cleared. After the
last one was cleared in 'my' device model, only interrupts are raised.
This looks like as if the interrupt handler in the guest Linux was never
run, thus the IRQ is never cleared and we're stuck.


User space is responsible for both setting and clearing that line. IRQ3
means you are using some serial device model? Then you should check what
its state is.
Good hint. I moved it now to IRQ11 and it works fine now (with kvm-git) 
from what I can see. There was no UART on IRQ3 before, though, but 
certainly it was the wrong IRQ for it.

Moreover, a complete picture of the kernel/user space interaction should
be obtainable by using ftrace for capturing kvm events.


Should it be working on IRQ3? If so, I'd look into it when I get a chance...
   Stefan



Re: [PATCH 07/18] kvm: Add MCE signal support for !CONFIG_IOTHREAD

2011-01-26 Thread Jan Kiszka
On 2011-01-26 13:01, Marcelo Tosatti wrote:
 On Wed, Jan 26, 2011 at 09:09:25AM +0100, Jan Kiszka wrote:
 On 2011-01-24 13:36, Jan Kiszka wrote:
 On 2011-01-24 12:17, Marcelo Tosatti wrote:
 On Mon, Jan 10, 2011 at 09:32:00AM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com

 Currently, we only configure and process MCE-related SIGBUS events if
 CONFIG_IOTHREAD is enabled. Fix this by factoring out the required
 handler registration and system configuration. Make sure that events
 happening over a VCPU context in non-threaded mode get dispatched as
 VCPU MCEs.

 We also need to call qemu_kvm_eat_signals in non-threaded mode now, so
 move it (unmodified) and add the required Windows stub.

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 CC: Huang Ying ying.hu...@intel.com
 ---
  cpus.c |  200 
 +++
  1 files changed, 124 insertions(+), 76 deletions(-)

 diff --git a/cpus.c b/cpus.c
 index 6da0f8f..b6f1cfb 100644
 --- a/cpus.c
 +++ b/cpus.c
 @@ -34,9 +34,6 @@
  
   #include "cpus.h"
   #include "compatfd.h"
  -#ifdef CONFIG_LINUX
  -#include <sys/prctl.h>
  -#endif
  
  #ifdef SIGRTMIN
  #define SIG_IPI (SIGRTMIN+4)
 @@ -44,10 +41,24 @@
  #define SIG_IPI SIGUSR1
  #endif
  

 @@ -912,6 +954,8 @@ static int qemu_cpu_exec(CPUState *env)
  
  bool cpu_exec_all(void)
  {
 +int r;
 +
  if (next_cpu == NULL)
  next_cpu = first_cpu;
   for (; next_cpu != NULL && !exit_request; next_cpu = next_cpu->next_cpu) {
 @@ -923,7 +967,11 @@ bool cpu_exec_all(void)
  if (qemu_alarm_pending())
  break;
  if (cpu_can_run(env)) {
 -if (qemu_cpu_exec(env) == EXCP_DEBUG) {
 +r = qemu_cpu_exec(env);
 +if (kvm_enabled()) {
 +qemu_kvm_eat_signals(env);
 +}
 +if (r == EXCP_DEBUG) {
  break;
  }

 SIGBUS should be processed outside of vcpu execution context, think of a
 non MCE SIGBUS while vm is stopped. Could use signalfd for that.

 signalfd - that's the missing bit. I was thinking of how to handle
 SIGBUS events raised outside the vcpu context. We need to handle them
 synchronously, and signalfd should allow this.

 This was straightforward. But now I wonder what actually makes this
 pattern work. Doesn't the kernel force-inject SIGBUS, i.e. ignores any
 blocking? Or does this only apply to BUS_MCEERR_AR?
 
 SIGBUS is only forced if BUS_MCEERR_AR and the poisoned memory was not 
 accessed 
 on behalf of the guest (say directly by qemu).

OK. I didn't find this detail in the kernel code on first glance, only
force_sig(SIGBUS, current).

 
 But the SIGBUS handler for !IOTHREAD case should not ignore Action
 Required, since it might have been generated in vcpu context.


 Yes, the sigbus handler will require some rework when we actually start
 using it for !IOTHREAD.

 And this no longer makes sense to me. The current version simply uses
 the same sigbus handler for both modes, and that forwards the error code
 properly. What did you mean?
 
 There are two handlers, kvm_on_sigbus and kvm_on_sigbus_vcpu.
 kvm_on_sigbus, the handler for iothread, dies on BUS_MCEERR_AR (which
 will be generated if poisoned memory is accessed on behalf of vcpu). It
 should be handled with !CONFIG_IOTHREAD.

Just as with iothread, vcpu-triggered SIGBUS events are processed by
qemu_kvm_eat_signals, not the signal handler.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts

2011-01-26 Thread Michael S. Tsirkin
On Wed, Jan 26, 2011 at 11:54:21AM +0200, Avi Kivity wrote:
 On 01/26/2011 11:39 AM, Michael S. Tsirkin wrote:
 On Wed, Jan 26, 2011 at 11:23:21AM +0200, Avi Kivity wrote:
   On 01/26/2011 11:20 AM, Michael S. Tsirkin wrote:
   On Wed, Jan 26, 2011 at 11:17:11AM +0200, Avi Kivity wrote:
  On 01/25/2011 07:58 PM, Michael S. Tsirkin wrote:
  On Tue, Jan 25, 2011 at 07:33:40PM +0200, Avi Kivity wrote:
  On 01/25/2011 04:59 PM, Michael S. Tsirkin wrote:
  On Tue, Jan 25, 2011 at 04:53:44PM +0200, Avi Kivity wrote:
   For the other lookups, which we
    believe will succeed, we can assume the probability of a 
   match is
   related to the slot size, and sort the slots by page 
  count.
  
  Unlikely to be true for assigned device BARs.
  
  We'll have a large slot at 4G+ - EOM, a medium slot at 1M-3G, 
  and
  lots of small slots for BARs and such.
  
  The vast majority of faults will either touch one of the two 
  largest
  slots, or will miss all slots.  Relatively few will hit the 
  small
  slots.
  
  Not if you are using one of the assigned devices and don't
  do any faults on memory :)
   
  It's impossible not to fault on memory.
   
   No I mean the RAM.
 
   No idea what you mean.  It's impossible not to fault on RAM, either
   (unless you don't use it at all).
 
 I just mean that once you fault you map sptes and then you can use them
 without exits.  mmio will cause exits each time. Right?
 
 The swapper scanning sptes, ksmd, khugepaged, and swapping can all
 cause a page to be unmapped.  Though it should certainly happen with
 a much lower frequency than mmio.

Right. That's why I say that sorting by size might not be optimal.
Maybe a cache ...
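As a rough illustration of the sorting idea under discussion (struct and function names hypothetical, not the actual kvm memslot code): keep the slot array ordered by descending page count, so a linear scan checks the one or two large RAM slots before the many small BAR slots, while a miss (mmio) still walks the whole list.

```c
#include <stdint.h>
#include <stdlib.h>

struct memslot {
    uint64_t base_gfn;   /* first guest frame number covered */
    uint64_t npages;     /* slot size in pages */
};

/* qsort comparator: largest slots first. */
static int cmp_npages_desc(const void *a, const void *b)
{
    const struct memslot *sa = a, *sb = b;

    if (sa->npages == sb->npages)
        return 0;
    return sa->npages < sb->npages ? 1 : -1;
}

static void sort_slots(struct memslot *slots, size_t n)
{
    qsort(slots, n, sizeof(*slots), cmp_npages_desc);
}

/* Linear lookup; with the sort above, hits on the big RAM slots are
 * found early, and only misses pay for scanning the small BAR slots. */
static struct memslot *gfn_to_slot(struct memslot *slots, size_t n,
                                   uint64_t gfn)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (gfn >= slots[i].base_gfn &&
            gfn < slots[i].base_gfn + slots[i].npages)
            return &slots[i];
    return NULL;   /* no slot: mmio or a hole */
}
```

This only helps when the fault-frequency distribution actually follows slot size, which is exactly the point being debated above.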

 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: Errors on MMIO read access on VM suspend / resume operations

2011-01-26 Thread Jan Kiszka
On 2011-01-26 13:05, Stefan Berger wrote:
 On 01/26/2011 03:14 AM, Jan Kiszka wrote:
 On 2011-01-25 17:49, Stefan Berger wrote:
 On 01/25/2011 02:26 AM, Jan Kiszka wrote:
 Do you see a chance to look closer at the issue yourself? E.g.
 instrument the kernel's irqchip models and dump their states once your
 guest is stuck?
 The device runs on IRQ 3. So I applied this patch here.

 diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
 index 3cece05..8f4f94c 100644
 --- a/arch/x86/kvm/i8259.c
 +++ b/arch/x86/kvm/i8259.c
 @@ -106,7 +106,7 @@ static inline int pic_set_irq1(struct kvm_kpic_state
 *s, int irq, int level)
   {
   int mask, ret = 1;
   mask = 1 << irq;
 -if (s->elcr & mask)	/* level triggered */
 +if (s->elcr & mask)	/* level triggered */ {
   if (level) {
   ret = !(s->irr & mask);
   s->irr |= mask;
 @@ -115,7 +115,10 @@ static inline int pic_set_irq1(struct
 kvm_kpic_state *s, int irq, int level)
   s->irr &= ~mask;
   s->last_irr &= ~mask;
   }
 -else	/* edge triggered */
 +if (irq == 3)
 +printk("%s %d: level=%d, irr = %x\n", __FUNCTION__, __LINE__, level,
 s->irr);
 +}
 +else	/* edge triggered */ {
   if (level) {
   if ((s->last_irr & mask) == 0) {
   ret = !(s->irr & mask);
 @@ -124,7 +127,9 @@ static inline int pic_set_irq1(struct kvm_kpic_state
 *s, int irq, int level)
   s->last_irr |= mask;
   } else
   s->last_irr &= ~mask;
 -
 +if (irq == 3)
 +printk("%s %d: level=%d, irr = %x\n", __FUNCTION__, __LINE__, level,
 s->irr);
 +}
   return (s->imr & mask) ? -1 : ret;
   }

 @@ -206,6 +211,8 @@ int kvm_pic_set_irq(void *opaque, int irq, int
 level)

   pic_lock(s);
   if (irq >= 0 && irq < PIC_NUM_PINS) {
 +if (irq == 3)
 +printk("%s\n", __FUNCTION__);
   ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
   pic_update_irq(s);
   trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >>
 3].elcr,



 While it's still working I see this here with the levels changing 0-1-0.
 Though then it stops and levels are only at '1'.

 [ 1773.833824] kvm_pic_set_irq
 [ 1773.833827] pic_set_irq1 131: level=0, irr = 5b
 [ 1773.834161] kvm_pic_set_irq
 [ 1773.834163] pic_set_irq1 131: level=1, irr = 5b
 [ 1773.834193] kvm_pic_set_irq
 [ 1773.834195] pic_set_irq1 131: level=0, irr = 5b
 [ 1773.835028] kvm_pic_set_irq
 [ 1773.835031] pic_set_irq1 131: level=1, irr = 5b
 [ 1773.835542] kvm_pic_set_irq
 [ 1773.835545] pic_set_irq1 131: level=1, irr = 5b
 [ 1773.889892] kvm_pic_set_irq
 [ 1773.889894] pic_set_irq1 131: level=1, irr = 5b
 [ 1791.258793] pic_set_irq1 119: level=1, irr = d9
 [ 1791.258824] pic_set_irq1 119: level=0, irr = d1
 [ 1791.402476] pic_set_irq1 119: level=1, irr = d9
 [ 1791.402534] pic_set_irq1 119: level=0, irr = d1
 [ 1791.402538] pic_set_irq1 119: level=1, irr = d9
 [...]


 I believe the last 5 shown calls can be ignored. After that the
 interrupts don't go through anymore.

 In the device model I see interrupts being raised and cleared. After the
 last one was cleared in 'my' device model, only interrupts are raised.
 This looks as if the interrupt handler in the guest Linux was never
 run, thus the IRQ is never cleared and we're stuck.

 User space is responsible for both setting and clearing that line. IRQ3
 means you are using some serial device model? Then you should check what
 its state is.
 Good hint. I moved it now to IRQ11 and it works fine now (with kvm-git)
 from what I can see. There was no UART on IRQ3 before, though, but
 certainly it was the wrong IRQ for it.
 Moreover, a complete picture of the kernel/user space interaction should
 be obtainable by using ftrace for capturing kvm events.

 Should it be working on IRQ3? If so, I'd look into it when I get a
 chance...

I don't know your customizations, so it's hard to tell if that should
work or not. IRQ3 is intended to be used by ISA devices on the PC
machine. Are you adding an ISA model, or what is your use case?

Jan



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure

2011-01-26 Thread Glauber Costa
On Wed, 2011-01-26 at 13:06 +0200, Avi Kivity wrote:
 On 01/24/2011 08:06 PM, Glauber Costa wrote:
  KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual 
  Mojito),
  is a piece of shared memory that is visible to both the hypervisor and the 
  guest
  kernel - but not the guest userspace.
 
  The basic idea is that the guest can tell the hypervisor about a specific
  piece of memory, and what it expects to find in there. This is a generic
  abstraction, that goes to userspace (qemu) if KVM (the hypervisor) can't
  handle a specific request, thus giving us flexibility in some features
  in the future.
 
  KVM (The hypervisor) can change the contents of this piece of memory at
  will. This works well with paravirtual information, and hopefully
  normal guest memory - like last update time for the watchdog, for
  instance.
 
  This is basic KVM registration headers. I am keeping headers
  separate to facilitate backports to people who wants to backport
  the kernel part but not the hypervisor, or the other way around.
 
  Signed-off-by: Glauber Costaglom...@redhat.com
  CC: Avi Kivitya...@redhat.com
  ---
arch/x86/include/asm/kvm_para.h |   11 +++
1 files changed, 11 insertions(+), 0 deletions(-)
 
  diff --git a/arch/x86/include/asm/kvm_para.h 
  b/arch/x86/include/asm/kvm_para.h
  index a427bf7..b0b0ee0 100644
  --- a/arch/x86/include/asm/kvm_para.h
  +++ b/arch/x86/include/asm/kvm_para.h
  @@ -21,6 +21,7 @@
 */
 #define KVM_FEATURE_CLOCKSOURCE2 3
#define KVM_FEATURE_ASYNC_PF  4
  +#define KVM_FEATURE_MEMORY_AREA5
 
/* The last 8 bits are used to indicate how to interpret the flags field
 * in pvclock structure. If no bits are set, all flags are ignored.
  @@ -35,6 +36,16 @@
#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 
  +#define MSR_KVM_REGISTER_MEM_AREA 0x4b564d03
  +
  +struct kvm_memory_area {
  +   __u64 base;
  +   __u32 size;
  +   __u32 type;
  +   __u8  result;
  +   __u8  pad[3];
  +};
  +
#define KVM_MAX_MMU_OP_BATCH   32
 
 I'm guessing the protocol here is:
 
 - guest fills in ->base/size/type
 - issues wrmsr
 - host registers the memory and updates ->result
 - guest examines ->result
 
 there are two issues with this approach:
 
 - it doesn't lend itself well to live migration.  Extra state must be 
 maintained in the hypervisor.
Yes, but can be queried at any time as well. I don't do it in this
patch, but this is explicitly mentioned in my TODO.

 - it isn't how normal hardware operates
Since we're trying to go for guest cooperation here, I don't really see
a need to stay close to hardware here.

 
 what's wrong with extending the normal approach of one msr per feature?

* It's harder to do discovery with MSRs. You can't just rely on getting
an error before the IDTs are properly set up. The way I am proposing
allows us to just try to register a memory area, and get a failure if we
can't handle it, at any time.
* To overcome the above, we had usually relied on cpuids. This requires
qemu/userspace cooperation for feature enablement
* This mechanism just bumps us out to userspace if we can't handle a
request. As such, it allows for pure guest kernel - userspace
communication, that can be used, for instance, to emulate new features
in older hypervisors one does not want to change. BTW, maybe there is
value in exiting to userspace even if we stick to the
one-msr-per-feature approach?
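To make the fill-wrmsr-check protocol concrete, here is a user-space mock of the flow (the struct mirrors the one in the patch; the MSR write is simulated by a plain function call, and the type/result constants are invented for illustration):

```c
#include <stdint.h>
#include <string.h>

/* Guest-visible registration block, mirroring the struct in the patch. */
struct kvm_memory_area {
    uint64_t base;
    uint32_t size;
    uint32_t type;
    uint8_t  result;
    uint8_t  pad[3];
};

#define AREA_TYPE_WALLCLOCK 1   /* hypothetical type id */
#define AREA_OK             0   /* hypothetical result codes */
#define AREA_EINVAL         1

/* Stand-in for the hypervisor's MSR handler: in a real guest this work
 * would happen on the wrmsr(MSR_KVM_REGISTER_MEM_AREA, ...) exit. */
static void mock_wrmsr_register(struct kvm_memory_area *area)
{
    if (area->size == 0 || area->type != AREA_TYPE_WALLCLOCK)
        area->result = AREA_EINVAL;   /* host rejects the request */
    else
        area->result = AREA_OK;       /* host mapped and accepted it */
}

/* Guest side: fill base/size/type, issue the (mock) wrmsr, read result. */
static int register_area(uint64_t base, uint32_t size, uint32_t type)
{
    struct kvm_memory_area area;

    memset(&area, 0, sizeof(area));
    area.base = base;
    area.size = size;
    area.type = type;
    mock_wrmsr_register(&area);
    return area.result;
}
```

The point of the in-band result byte is exactly the discovery argument above: the guest learns about an unsupported type by a failed registration, not by a cpuid bit.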



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/16] KVM-HDR: KVM Userspace registering ioctl

2011-01-26 Thread Glauber Costa
On Wed, 2011-01-26 at 13:12 +0200, Avi Kivity wrote:
 On 01/24/2011 08:06 PM, Glauber Costa wrote:
  KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual 
  Mojito),
  is a piece of shared memory that is visible to both the hypervisor and the 
  guest
  kernel - but not the guest userspace.
 
  The basic idea is that the guest can tell the hypervisor about a specific
  piece of memory, and what it expects to find in there. This is a generic
  abstraction, that goes to userspace (qemu) if KVM (the hypervisor) can't
  handle a specific request, thus giving us flexibility in some features
  in the future.
 
  KVM (The hypervisor) can change the contents of this piece of memory at
  will. This works well with paravirtual information, and hopefully
  normal guest memory - like last update time for the watchdog, for
  instance.
 
  This patch contains the header part of the userspace communication 
  implementation.
  Userspace can query the presence/absence of this feature in the normal way.
  It also tells the hypervisor that it is capable of handling - in whatever
  way it chooses, registrations that the hypervisor does not know how to.
  On x86, the only user so far, this mechanism is implemented as a generic
  userspace msr exit that could theoretically be used to implement
  msr-handling in userspace.
 
  I am keeping the headers separate to facilitate backports to people
  who wants to backport the kernel part but not the hypervisor, or the other 
  way around.
 
 
 Again the protocol is not specified.  How about starting from 
 Documentation/kvm/api.txt so we don't have to guess?
I will do that in the next version, if the idea is not shot down
completely.

 
  diff --git a/include/linux/kvm.h b/include/linux/kvm.h
  index ea2dc1a..5cc4fe8 100644
  --- a/include/linux/kvm.h
  +++ b/include/linux/kvm.h
  @@ -161,6 +161,7 @@ struct kvm_pit_config {
#define KVM_EXIT_NMI  16
#define KVM_EXIT_INTERNAL_ERROR   17
#define KVM_EXIT_OSI  18
  +#define KVM_EXIT_X86_MSR_OP  19
 
 
/*
  @@ -541,6 +551,7 @@ struct kvm_ppc_pvinfo {
#define KVM_CAP_PPC_GET_PVINFO 57
#define KVM_CAP_PPC_IRQ_LEVEL 58
#define KVM_CAP_ASYNC_PF 59
  +#define KVM_CAP_REGISTER_MEM_AREA 60
 
 These two seem completely unrelated.

Thanks, will put them in the right place.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory

2011-01-26 Thread Glauber Costa
On Wed, 2011-01-26 at 13:13 +0200, Avi Kivity wrote:
 On 01/24/2011 08:06 PM, Glauber Costa wrote:
  As a proof of concept to KVM - Kernel Virtual Memory, this patch
  implements wallclock grabbing on top of it. At first, it may seem
  as a waste of work to just redo it, since it is working well. But over the
  time, other MSRs were added - think ASYNC_PF - and more will probably come.
  After this patch, we won't need to ever add another virtual MSR to KVM.
 
 
 So instead of adding MSRs, we're adding area identifiers.  What did we gain?

* No risk of namespace clashes of any kind,
* less need for userspace coordination for feature enablement,
* more robust mechanism that can do discovery even on early boot,
* more informative result values can be passed on to guest kernel, 
* more flexibility, since we go back to userspace if we can't handle
some request. Also, some requests are better handled by userspace
anyway. But again, maybe this is a separate issue here...
* size information goes together with base, allowing for extending
structures (well, maybe I should add versioning explicitly?)



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: compilation problem

2011-01-26 Thread Avi Kivity

On 01/26/2011 01:37 PM, antoni artigues wrote:

Hello

Well, we have a 32-node cluster, with RHEL4 and a 2.6.9 kernel on all the
nodes. The cluster is in production, so it is not possible to update each
of the 32 nodes now.

We want to use opennebula with KVM in the cluster. So, we are trying to
compile the kvm-kmod.

Is that not possible? Do we need to update the kernel of each node?


A new kernel is needed.  Note the RHEL 6 kernel includes kvm, so you 
don't need to compile anything.  Of course, if the cluster is in 
production, you'll need to plan this carefully.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC -v6 PATCH 7/8] kvm: keep track of which task is running a KVM vcpu

2011-01-26 Thread Avi Kivity

On 01/20/2011 11:36 PM, Rik van Riel wrote:

Keep track of which task is running a KVM vcpu.  This helps us
figure out later what task to wake up if we want to boost a
vcpu that got preempted.

Unfortunately there are no guarantees that the same task
always keeps the same vcpu, so we can only track the task
across a single run of the vcpu.

Signed-off-by: Rik van Rielr...@redhat.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a055742..9d56ed5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -81,6 +81,7 @@ struct kvm_vcpu {
  #endif
int vcpu_id;
struct mutex mutex;
+   struct pid *pid;
int   cpu;
atomic_t guest_mode;
struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5225052..86c4905 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -185,6 +185,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, 
unsigned id)
	vcpu->cpu = -1;
	vcpu->kvm = kvm;
	vcpu->vcpu_id = id;
+	vcpu->pid = NULL;
	init_waitqueue_head(&vcpu->wq);

page = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -208,6 +209,8 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_init);

  void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
  {
+	if (vcpu->pid)
+		put_pid(vcpu->pid);


Unconditional put_pid() suffices.


kvm_arch_vcpu_uninit(vcpu);
free_page((unsigned long)vcpu-run);
  }
@@ -1456,6 +1459,14 @@ static long kvm_vcpu_ioctl(struct file *filp,
r = -EINVAL;
if (arg)
goto out;
+	if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
+		/* The thread running this VCPU changed. */
+		struct pid *oldpid = vcpu->pid;
+		struct pid *newpid = get_task_pid(current, PIDTYPE_PID);
+		rcu_assign_pointer(vcpu->pid, newpid);
+		synchronize_rcu();
+		put_pid(oldpid);
+	}


This is executed without any lock held, so two concurrent KVM_RUNs can 
race and cause a double put_pid().


Suggest moving the code to vcpu_load(), where it can execute under the 
protection of vcpu->mutex.
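The locking being suggested can be illustrated with a user-space analogue (plain pthreads and a refcounted stand-in for struct pid — a sketch, not kernel code): the pointer swap and the put of the old reference happen entirely under the mutex, and the put is unconditional because putting NULL is a no-op.

```c
#include <pthread.h>
#include <stdlib.h>

/* Refcounted stand-in for the kernel's struct pid. */
struct pid {
    int count;
};

static pthread_mutex_t vcpu_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct pid *vcpu_pid;          /* analogous to vcpu->pid */

static struct pid *get_current_pid(void)
{
    struct pid *p = calloc(1, sizeof(*p));

    p->count = 1;
    return p;
}

/* Unconditional put: NULL is simply ignored. */
static void put_pid(struct pid *p)
{
    if (p && --p->count == 0)
        free(p);
}

/* Swap in the current thread's pid under the vcpu mutex, so two
 * concurrent callers cannot both read the same old pointer and then
 * double-put it. */
static void vcpu_update_pid(void)
{
    pthread_mutex_lock(&vcpu_mutex);
    struct pid *oldpid = vcpu_pid;

    vcpu_pid = get_current_pid();
    put_pid(oldpid);
    pthread_mutex_unlock(&vcpu_mutex);
}
```

In the kernel patch, doing this in vcpu_load() gives the same effect for free, since vcpu->mutex is already held there.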


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: Errors on MMIO read access on VM suspend / resume operations

2011-01-26 Thread Stefan Berger

On 01/26/2011 07:09 AM, Jan Kiszka wrote:

On 2011-01-26 13:05, Stefan Berger wrote:

On 01/26/2011 03:14 AM, Jan Kiszka wrote:

On 2011-01-25 17:49, Stefan Berger wrote:

On 01/25/2011 02:26 AM, Jan Kiszka wrote:

Do you see a chance to look closer at the issue yourself? E.g.
instrument the kernel's irqchip models and dump their states once your
guest is stuck?

[...]

I don't know your customizations, so it's hard to tell if that should
work or not. IRQ3 is intended to be used by ISA devices on the PC
machine. Are you adding an ISA model, or what is your use case?


The use case is to add a TPM device interface.

http://xenbits.xensource.com/xen-unstable.hg?file/1e56ac73b9b9/tools/ioemu/hw/tpm_tis.c

This one typically is connected to the LPC bus.

   Stefan


Jan



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: Errors on MMIO read access on VM suspend / resume operations

2011-01-26 Thread Jan Kiszka
On 2011-01-26 14:08, Stefan Berger wrote:
 On 01/26/2011 07:09 AM, Jan Kiszka wrote:
 On 2011-01-26 13:05, Stefan Berger wrote:
 On 01/26/2011 03:14 AM, Jan Kiszka wrote:
 On 2011-01-25 17:49, Stefan Berger wrote:
 On 01/25/2011 02:26 AM, Jan Kiszka wrote:
 Do you see a chance to look closer at the issue yourself? E.g.
 instrument the kernel's irqchip models and dump their states once
 your
 guest is stuck?
 [...]
 I don't know your customizations, so it's hard to tell if that should
 work or not. IRQ3 is intended to be used by ISA devices on the PC
 machine. Are you adding an ISA model, or what is your use case?

 The use case is to add a TPM device interface.
 
 http://xenbits.xensource.com/xen-unstable.hg?file/1e56ac73b9b9/tools/ioemu/hw/tpm_tis.c
 
 
 This one typically is connected to the LPC bus.

I see. Do you also have the xen-free version of it? Maybe there are
still issues with proper qdev integration etc.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: Errors on MMIO read access on VM suspend / resume operations

2011-01-26 Thread Jan Kiszka
On 2011-01-26 14:15, Jan Kiszka wrote:
 On 2011-01-26 14:08, Stefan Berger wrote:
 On 01/26/2011 07:09 AM, Jan Kiszka wrote:
 On 2011-01-26 13:05, Stefan Berger wrote:
 On 01/26/2011 03:14 AM, Jan Kiszka wrote:
 On 2011-01-25 17:49, Stefan Berger wrote:
 On 01/25/2011 02:26 AM, Jan Kiszka wrote:
 Do you see a chance to look closer at the issue yourself? E.g.
 instrument the kernel's irqchip models and dump their states once
 your
 guest is stuck?
 [...]

 Should it be working on IRQ3? If so, I'd look into it when I get a
 chance...
 I don't know your customizations, so it's hard to tell if that should
 work or not. IRQ3 is intended to be used by ISA devices on the PC
 machine. Are you adding an ISA model, or what is your use case?

 The use case is to add a TPM device interface.

 http://xenbits.xensource.com/xen-unstable.hg?file/1e56ac73b9b9/tools/ioemu/hw/tpm_tis.c


 This one typically is connected to the LPC bus.
 
 I see. Do you also have the xen-free version of it? Maybe there are
 still issues with proper qdev integration etc.
 

Without knowing the hardware spec or what is actually behind set_irq,
this looks at least suspicious:

[...]
if (off == TPM_REG_INT_STATUS) {
    /* clearing of interrupt flags */
    if ((val & INTERRUPTS_SUPPORTED) &&
        (s->loc[locty].ints & INTERRUPTS_SUPPORTED)) {
        s->set_irq(s->irq_opaque, s->irq, 0);
        s->irq_pending = 0;
    }
    s->loc[locty].ints &= ~(val & INTERRUPTS_SUPPORTED);
} else
[...]

The code does not check 

Re: [Qemu-devel] Re: Errors on MMIO read access on VM suspend / resume operations

2011-01-26 Thread Stefan Berger

On 01/26/2011 08:31 AM, Jan Kiszka wrote:

On 2011-01-26 14:15, Jan Kiszka wrote:

On 2011-01-26 14:08, Stefan Berger wrote:

On 01/26/2011 07:09 AM, Jan Kiszka wrote:

On 2011-01-26 13:05, Stefan Berger wrote:

On 01/26/2011 03:14 AM, Jan Kiszka wrote:

On 2011-01-25 17:49, Stefan Berger wrote:

On 01/25/2011 02:26 AM, Jan Kiszka wrote:

Do you see a chance to look closer at the issue yourself? E.g.
instrument the kernel's irqchip models and dump their states once
your
guest is stuck?

The device runs on iRQ 3. So I applied this patch here.

diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 3cece05..8f4f94c 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -106,7 +106,7 @@ static inline int pic_set_irq1(struct kvm_kpic_state *s, int irq, int level)
 {
 	int mask, ret = 1;
 	mask = 1 << irq;
-	if (s->elcr & mask)	/* level triggered */
+	if (s->elcr & mask)	/* level triggered */ {
 		if (level) {
 			ret = !(s->irr & mask);
 			s->irr |= mask;
@@ -115,7 +115,10 @@ static inline int pic_set_irq1(struct kvm_kpic_state *s, int irq, int level)
 			s->irr &= ~mask;
 			s->last_irr &= ~mask;
 		}
-	else	/* edge triggered */
+		if (irq == 3)
+			printk("%s %d: level=%d, irr = %x\n", __FUNCTION__, __LINE__, level, s->irr);
+	}
+	else	/* edge triggered */ {
 		if (level) {
 			if ((s->last_irr & mask) == 0) {
 				ret = !(s->irr & mask);
@@ -124,7 +127,9 @@ static inline int pic_set_irq1(struct kvm_kpic_state *s, int irq, int level)
 			s->last_irr |= mask;
 		} else
 			s->last_irr &= ~mask;
-
+		if (irq == 3)
+			printk("%s %d: level=%d, irr = %x\n", __FUNCTION__, __LINE__, level, s->irr);
+	}
 	return (s->imr & mask) ? -1 : ret;
 }

@@ -206,6 +211,8 @@ int kvm_pic_set_irq(void *opaque, int irq, int level)

 	pic_lock(s);
 	if (irq >= 0 && irq < PIC_NUM_PINS) {
+		if (irq == 3)
+			printk("%s\n", __FUNCTION__);
 		ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
 		pic_update_irq(s);
 		trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr,


While it's still working I see this here with the levels changing
0->1->0. Though then it stops and the level stays at '1'.

[ 1773.833824] kvm_pic_set_irq
[ 1773.833827] pic_set_irq1 131: level=0, irr = 5b
[ 1773.834161] kvm_pic_set_irq
[ 1773.834163] pic_set_irq1 131: level=1, irr = 5b
[ 1773.834193] kvm_pic_set_irq
[ 1773.834195] pic_set_irq1 131: level=0, irr = 5b
[ 1773.835028] kvm_pic_set_irq
[ 1773.835031] pic_set_irq1 131: level=1, irr = 5b
[ 1773.835542] kvm_pic_set_irq
[ 1773.835545] pic_set_irq1 131: level=1, irr = 5b
[ 1773.889892] kvm_pic_set_irq
[ 1773.889894] pic_set_irq1 131: level=1, irr = 5b
[ 1791.258793] pic_set_irq1 119: level=1, irr = d9
[ 1791.258824] pic_set_irq1 119: level=0, irr = d1
[ 1791.402476] pic_set_irq1 119: level=1, irr = d9
[ 1791.402534] pic_set_irq1 119: level=0, irr = d1
[ 1791.402538] pic_set_irq1 119: level=1, irr = d9
[...]


I believe the last 5 shown calls can be ignored. After that the
interrupts don't go through anymore.

In the device model I see interrupts being raised and cleared. After the
last one was cleared in 'my' device model, interrupts are only raised.
This looks as if the interrupt handler in the guest Linux was never
run, thus the IRQ is never cleared and we're stuck.


User space is responsible for both setting and clearing that line. IRQ3
means you are using some serial device model? Then you should check
what
its state is.

Good hint. I moved it now to IRQ11 and it works fine now (with kvm-git)
from what I can see. There was no UART on IRQ3 before, though, but
certainly it was the wrong IRQ for it.

Moreover, a complete picture of the kernel/user space interaction should
be obtainable by using ftrace for capturing kvm events.


Should it be working on IRQ3? If so, I'd look into it when I get a
chance...

I don't know your customizations, so it's hard to tell if that should
work or not. IRQ3 is intended to be used by ISA devices on the PC
machine. Are you adding an ISA model, or what is your use case?


The use case is to add a TPM device interface.

http://xenbits.xensource.com/xen-unstable.hg?file/1e56ac73b9b9/tools/ioemu/hw/tpm_tis.c


This one typically is connected to the LPC bus.

I see. Do you also have the xen-free version of it? Maybe there are
still issues with proper qdev integration etc.


Without knowing the hardware spec or what is actually behind set_irq,
this looks at least suspicious:

[...]
if (off == TPM_REG_INT_STATUS) {
    /* clearing of interrupt flags */
    if ((val & INTERRUPTS_SUPPORTED) &&
        (s->loc[locty].ints & INTERRUPTS_SUPPORTED)) {
        s->set_irq(s->irq_opaque, s->irq, 0);
        s->irq_pending = 0;
    }
    s->loc[locty].ints &= ~(val & INTERRUPTS_SUPPORTED);
} else
[...]

The code does not check if there are ints 

Re: EPT: Misconfiguration

2011-01-26 Thread Ruben Kerkhof
On Wed, Jan 26, 2011 at 10:52, Avi Kivity a...@redhat.com wrote:
 On 01/25/2011 08:29 PM, Ruben Kerkhof wrote:

   When you say suddenly, this was with no changes to software and
  hardware?

 The host software and hardware haven't changed in the two months the
 machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13.

 We host customer vms on it though, so virtual machines come and go.
 Various operating systems, a mixture of Linux, FreeBSD and Windows
 2008 R2. We have other machines with the same config without these
 problems though.

 Are those other machines running a similar workload?

Yes, similar, or they're more heavily loaded.

On this machine, about half of the 48GB memory was used for virtual machines.

 The traces look awfully like bad hardware, though that can also be explained
 by random memory corruption due to a bug.

Yeah, that's what I'm expecting. We already replaced the memory, next
step is to move the disks over to another server to make sure it's not
the board or cpu's.

 This time I have a few different messages though:

 2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection fault:
  [#1] SMP

 RSI:  RDI: 1603a07305001568

 2011-01-25T11:58:50.001486+01:00 phy005 kernel: Code: ff ff 41 8b 46
 08 41 29 06 4c 89 e7 57 9d 0f 1f 44 00 00 48 83 c4 18 5b 41 5c 41 5d
 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 <f0> ff 4f 08 0f 94 c0 84
 c0 74 10 85 f6 75 07 e8 63 fe ff ff eb

 lock decl 0x8(%rdi)

 %rdi is completely crap, looks like corruption again.  Strangely, it is
 similar to the bad spte from the previous trace: 0x1603a0730500d277.  The
 upper 48 bits are identical, the lower 16 bits are different:

 2011-01-25T12:06:32.673937+01:00 phy005 kernel: qemu-kvm: Corrupted
 page table at address 7f37b37ff000
 2011-01-25T12:06:32.673959+01:00 phy005 kernel: PGD c201d1067 PUD
 94e538067 PMD 61e5bf067 PTE 1603a0730500e067

 Here are those magic 48 bits again, in the PTE entry.

 2011-01-25T12:38:49.416943+01:00 phy005 kernel: EPT: Misconfiguration.
 2011-01-25T12:38:49.417518+01:00 phy005 kernel: EPT: GPA: 0x2abff038
 2011-01-25T12:38:49.417526+01:00 phy005 kernel:
 ept_misconfig_inspect_spte: spte 0x5f49e9007 level 4
 2011-01-25T12:38:49.417532+01:00 phy005 kernel:
 ept_misconfig_inspect_spte: spte 0x5db595007 level 3
 2011-01-25T12:38:49.417553+01:00 phy005 kernel:
 ept_misconfig_inspect_spte: spte 0x5d5da7007 level 2
 2011-01-25T12:38:49.417558+01:00 phy005 kernel:
 ept_misconfig_inspect_spte: spte 0x1603a07305006277 level 1

 Again.

 2011-01-25T13:16:58.192440+01:00 phy005 kernel: BUG: Bad page map in
 process qemu-kvm  pte:1603a0730500d067 pmd:61059f067

 Again.

 However, these all came from a single boot, yes?

Correct.

 If so they can be the same
 corruption.  Please collect more traces, with reboots in between.

Ok, thanks, will do.

Kind regards,

Ruben
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure

2011-01-26 Thread Avi Kivity

On 01/26/2011 02:13 PM, Glauber Costa wrote:


  - it doesn't lend itself well to live migration.  Extra state must be
  maintained in the hypervisor.
Yes, but can be queried at any time as well. I don't do it in this
patch, but this is explicitly mentioned in my TODO.


Using the existing method (MSRs) takes care of this, which reduces churn.


  - it isn't how normal hardware operates
Since we're trying to go for guest cooperation here, I don't really see
a need to stay close to hardware here.


For Linux there is not much difference, since we can easily adapt it.  
But we don't know the impact on other guests, and we can't refactor 
them.  Staying close to precedent means it will be easier for other 
guests to work with a kvm host, if they choose.




  what's wrong with extending the normal approach of one msr per feature?

* It's harder to do discovery with MSRs. You can't just rely on getting
an error before the IDTs are properly set up. The way I am proposing
allows us to just try to register a memory area, and get a failure if we
can't handle it, at any time.


Use cpuid to ensure that you won't get a #GP.


* To overcome the above, we had usually relied on cpuids. This requires
qemu/userspace cooperation for feature enablement


We need that anyway.  The kernel cannot enable features on its own since 
that breaks live migration.



* This mechanism just bumps us out to userspace if we can't handle a
request. As such, it allows for pure guest kernel -> userspace
communication, that can be used, for instance, to emulate new features
in older hypervisors one does not want to change. BTW, maybe there is
value in exiting to userspace even if we stick to the
one-msr-per-feature approach?


Yes.

I'm not 100% happy with emulating MSRs in userspace, but we can think 
about a mechanism that allows userspace to designate certain MSRs as 
handled by userspace.


Before we do that I'd like to see what fraction of MSRs can be usefully 
emulated in userspace (beyond those that just store a value and ignore it).


--
error compiling committee.c: too many arguments to function



Re: [PATCH 03/16] KVM-HDR: KVM Userspace registering ioctl

2011-01-26 Thread Avi Kivity

On 01/26/2011 02:14 PM, Glauber Costa wrote:

On Wed, 2011-01-26 at 13:12 +0200, Avi Kivity wrote:
  On 01/24/2011 08:06 PM, Glauber Costa wrote:
KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual 
Mojito),
is a piece of shared memory that is visible to both the hypervisor and 
the guest
kernel - but not the guest userspace.
  
The basic idea is that the guest can tell the hypervisor about a specific
piece of memory, and what it expects to find in there. This is a generic
abstraction, that goes to userspace (qemu) if KVM (the hypervisor) can't
handle a specific request, thus giving us flexibility in some features
in the future.
  
KVM (The hypervisor) can change the contents of this piece of memory at
will. This works well with paravirtual information, and hopefully
normal guest memory - like last update time for the watchdog, for
instance.
  
This patch contains the header part of the userspace communication 
implementation.
Userspace can query the presence/absence of this feature in the normal 
way.
It also tells the hypervisor that it is capable of handling - in whatever
way it chooses, registrations that the hypervisor does not know how to.
In x86, only user so far, this mechanism is implemented as generic 
userspace
msr exit, that could theoretically be used to implement msr-handling in
userspace.
  
I am keeping the headers separate to facilitate backports to people
who wants to backport the kernel part but not the hypervisor, or the 
other way around.
  

  Again the protocol is not specified.  How about starting from
  Documentation/kvm/api.txt so we don't have to guess?
I will do that in the next version, if the idea is not shot down
completely.


For some reason people write the code first and the documentation as an 
afterthought.  If you switch the order, you can get a high level review 
first, followed by a low-level code review once the high level details 
are agreed.  Of course it's hard to do major changes this way, since the 
API often evolves while writing the code.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory

2011-01-26 Thread Avi Kivity

On 01/26/2011 02:20 PM, Glauber Costa wrote:

On Wed, 2011-01-26 at 13:13 +0200, Avi Kivity wrote:
  On 01/24/2011 08:06 PM, Glauber Costa wrote:
As a proof of concept to KVM - Kernel Virtual Memory, this patch
implements wallclock grabbing on top of it. At first, it may seem
as a waste of work to just redo it, since it is working well. But over the
time, other MSRs were added - think ASYNC_PF - and more will probably 
come.
After this patch, we won't need to ever add another virtual MSR to KVM.
  

  So instead of adding MSRs, we're adding area identifiers.  What did we gain?

* No risk of namespace clashes of any kind,
* less need for userspace coordination for feature enablement,


That's a bug, not a feature.


* more robust mechanism that can do discovery even on early boot,


cpuid/wrmsr should be robust enough.


* more informative result values can be passed on to guest kernel,


True.


* more flexibility, since we go back to userspace if we can't handle
some request. Also, some requests are better handled by userspace
anyway. But again, maybe this is a separate issue here...


Yes.


* size information goes together with base, allowing for extending
structures (well, maybe I should add versioning explicitly?)



We could do that as well with wrmsr, by having the size as the first 
field of the structure.  Usually the size isn't really interesting, 
though, since you need to discover/enable the new features independently.


--
error compiling committee.c: too many arguments to function



Re: Network performance with small packets

2011-01-26 Thread Michael S. Tsirkin
On Tue, Jan 25, 2011 at 03:09:34PM -0600, Steve Dobbelstein wrote:
 
 I am working on a KVM network performance issue found in our lab running
 the DayTrader benchmark.  The benchmark throughput takes a significant hit
 when running the application server in a KVM guest verses on bare metal.
 We have dug into the problem and found that DayTrader's use of small
 packets exposes KVM's overhead of handling network packets.  I have been
 able to reproduce the performance hit with a simpler setup using the
 netperf benchmark with the TCP_RR test and the request and response sizes
 set to 256 bytes.  I run the benchmark between two physical systems, each
 using a 1GB link.  In order to get the maximum throughput for the system I
 have to run 100 instances of netperf.  When I run the netserver processes
 in a guest, I see a maximum throughput that is 51% of what I get if I run
 the netserver processes directly on the host.  The CPU utilization in the
 guest is only 85% at maximum throughput, whereas it is 100% on bare metal.
 
 The KVM host has 16 CPUs.  The KVM guest is configured with 2 VCPUs.  When
 I run netperf on the host I boot the host with maxcpus=2 on the kernel
 command line.  The host is running the current KVM upstream kernel along
 with the current upstream qemu.  Here is the qemu command used to launch
 the guest:
 /build/qemu-kvm/x86_64-softmmu/qemu-system-x86_64 -name glasgow-RH60 -m 32768 
 -drive file=/build/guest-data/glasgow-RH60.img,if=virtio,index=0,boot=on
  -drive file=/dev/virt/WAS,if=virtio,index=1 -net 
 nic,model=virtio,vlan=3,macaddr=00:1A:64:E5:00:63,netdev=nic0 -netdev 
 tap,id=nic0,vhost=on -smp 2
 -vnc :1 -monitor telnet::4499,server,nowait -serial 
 telnet::8899,server,nowait --mem-path /libhugetlbfs -daemonize
 
 We have tried various proposed fixes, each with varying amounts of success.
 One such fix was to add code to the vhost thread such that when it found
 the work queue empty it wouldn't just exit the thread but rather would
 delay for 50 microseconds and then recheck the queue.  If there was work on
 the queue it would loop back and process it, else it would exit the thread.
 The change got us a 13% improvement in the DayTrader throughput.
 
 Running the same netperf configuration on the same hardware but using a
 different hypervisor gets us significantly better throughput numbers.   The
 guest on that hypervisor runs at 100% CPU utilization.  The various fixes
 we have tried have not gotten us close to the throughput seen on the other
 hypervisor.  I'm looking for ideas/input from the KVM experts on how to
 make KVM perform better when handling small packets.
 
 Thanks,
 Steve

I am seeing a similar problem, and am trying to fix that.
My current theory is that this is a variant of a receive livelock:
if the application isn't fast enough to process
incoming data, the guest net stack switches
from prequeue to backlog handling.

One thing I noticed is that locking the vhost thread
and the vcpu to the same physical CPU almost doubles the
bandwidth.  Can you confirm that in your setup?

My current guess is that when we lock both to
a single CPU, netperf in the guest gets scheduled,
slowing down the vhost thread in the host.

I also noticed that this specific workload
performs better with vhost off: presumably
we are loading the guest less.

-- 
MST


Re: [RFC -v6 PATCH 7/8] kvm: keep track of which task is running a KVM vcpu

2011-01-26 Thread Rik van Riel

On 01/26/2011 08:01 AM, Avi Kivity wrote:


Suggest moving the code to vcpu_load(), where it can execute under the
protection of vcpu->mutex.


I've made the suggested changes by you and Peter, and
will re-post the patch series in a bit...

--
All rights reversed


Re: [PATCH 03/16] KVM-HDR: KVM Userspace registering ioctl

2011-01-26 Thread Glauber Costa
On Wed, 2011-01-26 at 17:14 +0200, Avi Kivity wrote:
 On 01/26/2011 02:14 PM, Glauber Costa wrote:
  On Wed, 2011-01-26 at 13:12 +0200, Avi Kivity wrote:
On 01/24/2011 08:06 PM, Glauber Costa wrote:
  KVM, which stands for KVM Virtual Memory (I wanted to call it KVM 
   Virtual Mojito),
  is a piece of shared memory that is visible to both the hypervisor 
   and the guest
  kernel - but not the guest userspace.

  The basic idea is that the guest can tell the hypervisor about a 
   specific
  piece of memory, and what it expects to find in there. This is a 
   generic
  abstraction, that goes to userspace (qemu) if KVM (the hypervisor) 
   can't
  handle a specific request, thus giving us flexibility in some features
  in the future.

  KVM (The hypervisor) can change the contents of this piece of memory 
   at
  will. This works well with paravirtual information, and hopefully
  normal guest memory - like last update time for the watchdog, for
  instance.

  This patch contains the header part of the userspace communication 
   implementation.
  Userspace can query the presence/absence of this feature in the 
   normal way.
  It also tells the hypervisor that it is capable of handling - in 
   whatever
  way it chooses, registrations that the hypervisor does not know how 
   to.
  In x86, only user so far, this mechanism is implemented as generic 
   userspace
  msr exit, that could theoretically be used to implement msr-handling 
   in
  userspace.

  I am keeping the headers separate to facilitate backports to people
  who wants to backport the kernel part but not the hypervisor, or the 
   other way around.

  
Again the protocol is not specified.  How about starting from
Documentation/kvm/api.txt so we don't have to guess?
  I will do that in the next version, if the idea is not shot down
  completely.
 
 For some reason people write the code first and the documentation as an 
 afterthought.  If you switch the order, you can get a high level review 
 first, followed by a low-level code review once the high level details 
 are agreed.  Of course it's hard to do major changes this way, since the 
 API often evolves while writing the code.

Thing is, the processor runs code, not English. So I guess people write
code first, at least to know what works versus what doesn't, before
writing a potentially beautiful-but-only-theoretical novel.

Rest assured, I don't plan to push for merges of any feature before
providing docs.




Re: [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure

2011-01-26 Thread Glauber Costa
On Wed, 2011-01-26 at 17:12 +0200, Avi Kivity wrote:
 On 01/26/2011 02:13 PM, Glauber Costa wrote:
  
- it doesn't lend itself well to live migration.  Extra state must be
maintained in the hypervisor.
  Yes, but can be queried at any time as well. I don't do it in this
  patch, but this is explicitly mentioned in my TODO.
 
 Using the existing method (MSRs) takes care of this, which reduces churn.

No, it doesn't.

First, we have to explicitly list some msrs for save/restore in
userspace anyway. But also, the MSRs only hold values. For the case I'm
trying to address here (MSRs being used to register something, like
kvmclock), there is usually accompanying code as well.


- it isn't how normal hardware operates
  Since we're trying to go for guest cooperation here, I don't really see
  a need to stay close to hardware here.
 
 For Linux there is not much difference, since we can easily adapt it.
 But we don't know the impact on other guests, and we can't refactor 
 them.  Staying close to precedent means it will be easier for other 
 guests to work with a kvm host, if they choose.

I honestly don't see the difference. I am not proposing anything
terribly different; in the end, for this specific point of
guest supportability, it's all 1 msr+cpuid vs. n msr+cpuid.

 
  
what's wrong with extending the normal approach of one msr per feature?
 
  * It's harder to do discovery with MSRs. You can't just rely on getting
  an error before the IDTs are properly set up. The way I am proposing
  allows us to just try to register a memory area, and get a failure if we
  can't handle it, at any time.
 
 Use cpuid to ensure that you won't get a #GP.
Again, that increases confusion, IMHO. Your hypervisor may have a
feature, userspace may lack it, and then you end up figuring out why
something does not work.

 
  * To overcome the above, we had usually relied on cpuids. This requires
  qemu/userspace cooperation for feature enablement
 
 We need that anyway.  The kernel cannot enable features on its own since 
 that breaks live migration.

That is true. But easy to overcome as well.

  * This mechanism just bumps us out to userspace if we can't handle a
  request. As such, it allows for pure guest kernel -> userspace
  communication, that can be used, for instance, to emulate new features
  in older hypervisors one does not want to change. BTW, maybe there is
  value in exiting to userspace even if we stick to the
  one-msr-per-feature approach?
 
 Yes.
 
 I'm not 100% happy with emulating MSRs in userspace, but we can think 
 about a mechanism that allows userspace to designate certain MSRs as 
 handled by userspace.
 
 Before we do that I'd like to see what fraction of MSRs can be usefully 
 emulated in userspace (beyond those that just store a value and ignore it).

None of the existing. But for instance, I was discussing this issue with
Anthony a while ago, and he thinks that in order to completely avoid
bogus softlockups, qemu/userspace, which is the entity here that knows
when it has stopped (think ctrl+Z or stop + cont, save/restore, etc),
could notify the guest kernel directly through a shared variable
like this.

See, this is not about new features, but rather about sharing pieces
of memory. So what I'm doing in the end is just generalizing an MSR for
shared memory, instead of one new MSR for each piece of data.

Maybe I was unfortunate to mention async_pf in the description to begin
with.



Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power

2011-01-26 Thread Glauber Costa
On Wed, 2011-01-26 at 10:57 +0100, Peter Zijlstra wrote:
 On Tue, 2011-01-25 at 19:27 -0200, Glauber Costa wrote:
  On Tue, 2011-01-25 at 22:07 +0100, Peter Zijlstra wrote:
   On Tue, 2011-01-25 at 18:47 -0200, Glauber Costa wrote:
On Tue, 2011-01-25 at 21:13 +0100, Peter Zijlstra wrote:
 On Tue, 2011-01-25 at 18:02 -0200, Glauber Costa wrote:
 
   I fail to see how clock_task influences cpu power.
  If we also have to touch clock_task for better accounting of other
  stuff, it is a separate story.
  But for cpu_power, I really fail. Please enlighten me.
 
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
 s64 irq_delta;
 
 irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
 
 if (irq_delta > delta)
 irq_delta = delta;
 
 rq->prev_irq_time += irq_delta;
 delta -= irq_delta;
 rq->clock_task += delta;
 
 if (irq_delta && sched_feat(NONIRQ_POWER))
 sched_rt_avg_update(rq, irq_delta);
 }
 
 it's done through that sched_rt_avg_update() (should probably rename
 that); it computes a floating average of time not spent on fair tasks.
 
It creates a dependency on CONFIG_IRQ_TIME_ACCOUNTING, though.
This piece of code is simply compiled out if this option is disabled.
   
   We can pull this bit out and make the common bit also available for
   paravirt.
  
  scale_rt_power() seems to do the right thing, but all the paths leading
  to it seem to work on rq->clock, rather than rq->clock_task.
 
 Not quite, see how rq->clock_task is irq_delta less than the increment
 to rq->clock? You want it to be your steal-time delta less too.
yes, but once this delta is subtracted from rq->clock_task, this value is not
used to dictate power, unless I am mistaken.

power is adjusted according to scale_rt_power(), which does it using the
values of rq->rt_avg, rq->age_stamp, and rq->clock.

So whatever I store into rq->clock_task, but not rq->clock (which,
correct me if I'm wrong, is expected to be walltime), will not be used
to adjust cpu power, which is what I'm trying to achieve.

  Although I can experiment with that as well, could you please
  elaborate on your reasons to prefer this over variations
  of the method I proposed?
 
 Because I want rq->clock_task to not include steal-time.
Sure, fair deal. But at this point, those demands seem orthogonal to me.





Re: [PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory

2011-01-26 Thread Glauber Costa
On Wed, 2011-01-26 at 17:17 +0200, Avi Kivity wrote:
 On 01/26/2011 02:20 PM, Glauber Costa wrote:
  On Wed, 2011-01-26 at 13:13 +0200, Avi Kivity wrote:
On 01/24/2011 08:06 PM, Glauber Costa wrote:
  As a proof of concept to KVM - Kernel Virtual Memory, this patch
  implements wallclock grabbing on top of it. At first, it may seem
  as a waste of work to just redo it, since it is working well. But 
   over the
  time, other MSRs were added - think ASYNC_PF - and more will probably 
   come.
  After this patch, we won't need to ever add another virtual MSR to 
   KVM.

  
So instead of adding MSRs, we're adding area identifiers.  What did we 
   gain?
 
  * No risk of namespace clashes of any kind,
  * less need for userspace coordination for feature enablement,
 
 That's a bug, not a feature.

I don't see why.
It's about feature enablement, not feature discovery. 

  * more robust mechanism that can do discovery even on early boot,
 
 cpuid/wrmsr should be robust enough.
 
  * more informative result values can be passed on to guest kernel,
 
 True.
 
  * more flexibility, since we go back to userspace if we can't handle
  some request. Also, some requests are better handled by userspace
  anyway. But again, maybe this is a separate issue here...
 
 Yes.
 
  * size information goes together with base, allowing for extending
  structures (well, maybe I should add versioning explicitly?)
 
 
 We could do that as well with wrmsr, by having the size as the first 
 field of the structure.  Usually the size isn't really interesting, 
 though, since you need to discover/enable the new features independently.

Which structure? For msrs, we're usually going for just a u64, but of
course we could change that when needed.





Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power

2011-01-26 Thread Peter Zijlstra
On Wed, 2011-01-26 at 13:43 -0200, Glauber Costa wrote:

 yes, but once this delta is subtracted from rq->clock_task, this value is not
 used to dictate power, unless I am mistaken.
 
 power is adjusted according to scale_rt_power(), which does it using the
 values of rq->rt_avg, rq->age_stamp, and rq->clock.
 
 So whatever I store into rq->clock_task, but not rq->clock (which
 correct me if I'm wrong, is expected to be walltime), will not be used
 to adjust cpu power, which is what I'm trying to achieve.

No, see the below, it uses a per-cpu virt_steal_time() clock which is
expected to return steal-time in ns.

All time not accounted to ->clock_task is accumulated in lost, and
passed into sched_rt_avg_update() and thus affects the cpu_power.

If it finds that 50% of the (recent) time is steal time, its cpu_power
will be 50%.

---
 kernel/sched.c  |   44 
 kernel/sched_features.h |2 +-
 2 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 18d38e4..c71384c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -523,6 +523,9 @@ struct rq {
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	u64 prev_irq_time;
 #endif
+#ifdef CONFIG_SCHED_PARAVIRT
+	u64 prev_steal_time;
+#endif
 
 	/* calc_load related fields */
 	unsigned long calc_load_update;
@@ -1888,11 +1891,15 @@ void account_system_vtime(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(account_system_vtime);
 
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
-	s64 irq_delta;
+	s64 lost_delta __maybe_unused;
+	s64 lost = 0;
 
-	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	lost_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
 
 	/*
 	 * Since irq_time is only updated on {soft,}irq_exit, we might run into
@@ -1909,26 +1916,31 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 	 * the current rq->clock timestamp, except that would require using
 	 * atomic ops.
 	 */
-	if (irq_delta > delta)
-		irq_delta = delta;
+	if (lost_delta > delta)
+		lost_delta = delta;
 
-	rq->prev_irq_time += irq_delta;
-	delta -= irq_delta;
-	rq->clock_task += delta;
+	rq->prev_irq_time += lost_delta;
+	lost += lost_delta;
+#endif
+#ifdef CONFIG_SCHED_PARAVIRT
+	lost_delta = virt_steal_time(cpu_of(rq)) - rq->prev_steal_time;
+
+	/*
+	 * unlikely, unless steal_time accounting is iffy
+	 */
+	if (lost + lost_delta > delta)
+		lost_delta = delta - lost;
 
-	if (irq_delta && sched_feat(NONIRQ_POWER))
-		sched_rt_avg_update(rq, irq_delta);
-}
+	rq->prev_steal_time += lost_delta;
+	lost += lost_delta;
+#endif
 
-#else /* CONFIG_IRQ_TIME_ACCOUNTING */
+	rq->clock_task += delta - lost;
 
-static void update_rq_clock_task(struct rq *rq, s64 delta)
-{
-	rq->clock_task += delta;
+	if (lost && sched_feat(NONTASK_POWER))
+		sched_rt_avg_update(rq, lost);
 }
 
-#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
-
 #include "sched_idletask.c"
 #include "sched_fair.c"
 #include "sched_rt.c"
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 68e69ac..b334a2d 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -63,4 +63,4 @@ SCHED_FEAT(OWNER_SPIN, 1)
 /*
  * Decrement CPU power based on irq activity
  */
-SCHED_FEAT(NONIRQ_POWER, 1)
+SCHED_FEAT(NONTASK_POWER, 1)



Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power

2011-01-26 Thread Peter Zijlstra
On Wed, 2011-01-26 at 17:46 +0100, Peter Zijlstra wrote:
 it uses a per-cpu virt_steal_time() clock which is
 expected to return steal-time in ns. 

This clock should return u64 and wrap on u64 and be provided when
CONFIG_SCHED_PARAVIRT.


KVM on old kernels pre-2.6.28

2011-01-26 Thread Asdo
Some time ago in this list it was mentioned that old kernels pre-2.6.28
don't work well with KVM.
(in particular we have a machine with 2.6.24)

Unfortunately, the type of problem was not mentioned in the posts I could
find. Is it a performance problem, a stability problem, or a data
corruption problem?

And I would also like to know whether the problem can be worked around
by installing a newer kvm-kmod, such as 2.6.36.2, or whether we really
need to upgrade the whole kernel to get rid of it.

(I was thinking of installing a newer qemu-kvm, such as 0.12.5, on that
machine.)

Thank you in advance


Re: [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v4)

2011-01-26 Thread Christoph Lameter

Reviewed-by: Christoph Lameter c...@linux.com



Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power

2011-01-26 Thread Glauber Costa
On Wed, 2011-01-26 at 17:46 +0100, Peter Zijlstra wrote:
 On Wed, 2011-01-26 at 13:43 -0200, Glauber Costa wrote:
 
  yes, but once this delta is subtracted from rq->clock_task, this value
  is not used to dictate power, unless I am mistaken.
  
  power is adjusted according to scale_rt_power(), which does it using the
  values of rq->rt_avg, rq->age_stamp, and rq->clock.
  
  So whatever I store into rq->clock_task, but not rq->clock (which,
  correct me if I'm wrong, is expected to be walltime) will not be used
  to adjust cpu power, which is what I'm trying to achieve.
 
 No, see the below, it uses a per-cpu virt_steal_time() clock which is
 expected to return steal-time in ns.
 
 All time not accounted to ->clock_task is accumulated in lost, and
 passed into sched_rt_avg_update() and thus affects the cpu_power.
 
 If it finds that 50% of the (recent) time is steal time, its cpu_power
 will be 50%.
ok, now I get your proposal. It does make sense.



[RFC -v7 PATCH 6/7] kvm: keep track of which task is running a KVM vcpu

2011-01-26 Thread Rik van Riel
Keep track of which task is running a KVM vcpu.  This helps us
figure out later what task to wake up if we want to boost a
vcpu that got preempted.

Unfortunately there are no guarantees that the same task
always keeps the same vcpu, so we can only track the task
across a single run of the vcpu.

Signed-off-by: Rik van Riel r...@redhat.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a055742..9d56ed5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -81,6 +81,7 @@ struct kvm_vcpu {
 #endif
int vcpu_id;
struct mutex mutex;
+   struct pid *pid;
int   cpu;
atomic_t guest_mode;
struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5225052..0fa9a48 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -117,6 +117,14 @@ void vcpu_load(struct kvm_vcpu *vcpu)
int cpu;
 
    mutex_lock(&vcpu->mutex);
+   if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
+       /* The thread running this VCPU changed. */
+       struct pid *oldpid = vcpu->pid;
+       struct pid *newpid = get_task_pid(current, PIDTYPE_PID);
+       rcu_assign_pointer(vcpu->pid, newpid);
+       synchronize_rcu();
+       put_pid(oldpid);
+   }
    cpu = get_cpu();
    preempt_notifier_register(&vcpu->preempt_notifier);
kvm_arch_vcpu_load(vcpu, cpu);
@@ -185,6 +193,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
    vcpu->cpu = -1;
    vcpu->kvm = kvm;
    vcpu->vcpu_id = id;
+   vcpu->pid = NULL;
    init_waitqueue_head(&vcpu->wq);
 
page = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -208,6 +217,7 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_init);
 
 void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
+   put_pid(vcpu->pid);
    kvm_arch_vcpu_uninit(vcpu);
    free_page((unsigned long)vcpu->run);
 }



[RFC -v7 PATCH 3/7] sched: use a buddy to implement yield_task_fair

2011-01-26 Thread Rik van Riel
Use the buddy mechanism to implement yield_task_fair.  This
allows us to skip onto the next highest priority se at every
level in the CFS tree, unless doing so would introduce gross
unfairness in CPU time distribution.

We order the buddy selection in pick_next_entity to check
yield first, then last, then next.  We need next to be able
to override yield, because it is possible for the next and
yield task to be different processes in the same sub-tree
of the CFS tree.  When they are, we need to go into that
sub-tree regardless of the yield hint, and pick the correct
entity once we get to the right level.

Signed-off-by: Rik van Riel r...@redhat.com

diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..7ff53e2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -327,7 +327,7 @@ struct cfs_rq {
 * 'curr' points to currently running entity on this cfs_rq.
 * It is set to NULL otherwise (i.e when none are currently running).
 */
-   struct sched_entity *curr, *next, *last;
+   struct sched_entity *curr, *next, *last, *skip;
 
unsigned int nr_spread_over;
 
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ad946fd..a56d410 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -384,6 +384,22 @@ static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
return rb_entry(left, struct sched_entity, run_node);
 }
 
+static struct sched_entity *__pick_second_entity(struct cfs_rq *cfs_rq)
+{
+   struct rb_node *left = cfs_rq->rb_leftmost;
+   struct rb_node *second;
+
+   if (!left)
+   return NULL;
+
+   second = rb_next(left);
+
+   if (!second)
+   second = left;
+
+   return rb_entry(second, struct sched_entity, run_node);
+}
+
 static struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {
    struct rb_node *last = rb_last(&cfs_rq->tasks_timeline);
@@ -806,6 +822,17 @@ static void __clear_buddies_next(struct sched_entity *se)
}
 }
 
+static void __clear_buddies_skip(struct sched_entity *se)
+{
+   for_each_sched_entity(se) {
+       struct cfs_rq *cfs_rq = cfs_rq_of(se);
+       if (cfs_rq->skip == se)
+           cfs_rq->skip = NULL;
+       else
+           break;
+   }
+}
+
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
    if (cfs_rq->last == se)
@@ -813,6 +840,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
    if (cfs_rq->next == se)
        __clear_buddies_next(se);
+
+   if (cfs_rq->skip == se)
+       __clear_buddies_skip(se);
 }
 
 static void
@@ -926,13 +956,27 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static int
 wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
 
+/*
+ * Pick the next process, keeping these things in mind, in this order:
+ * 1) keep things fair between processes/task groups
+ * 2) pick the next process, since someone really wants that to run
+ * 3) pick the last process, for cache locality
+ * 4) do not run the skip process, if something else is available
+ */
 static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 {
struct sched_entity *se = __pick_next_entity(cfs_rq);
struct sched_entity *left = se;
 
-   if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
-       se = cfs_rq->next;
+   /*
+    * Avoid running the skip buddy, if running something else can
+    * be done without getting too unfair.
+    */
+   if (cfs_rq->skip == se) {
+       struct sched_entity *second = __pick_second_entity(cfs_rq);
+       if (wakeup_preempt_entity(second, left) < 1)
+           se = second;
+   }
 
/*
 * Prefer last buddy, try to return the CPU to a preempted task.
@@ -940,6 +984,12 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
        se = cfs_rq->last;
 
+   /*
+* Someone really wants this to run. If it's not unfair, run it.
+*/
+   if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+       se = cfs_rq->next;
+
clear_buddies(cfs_rq, se);
 
return se;
@@ -1096,52 +1146,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
 }
 
-/*
- * sched_yield() support is very simple - we dequeue and enqueue.
- *
- * If compat_yield is turned on then we requeue to the end of the tree.
- */
-static void yield_task_fair(struct rq *rq)
-{
-   struct task_struct *curr = rq->curr;
-   struct cfs_rq *cfs_rq = task_cfs_rq(curr);
-   struct sched_entity *rightmost, *se = &curr->se;
-
-   /*
-    * Are we the only task in the tree?
-    */
-   if (unlikely(rq->nr_running == 1))
-   return;
-
-   

[RFC -v7 PATCH 7/7] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin

2011-01-26 Thread Rik van Riel
Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic
slowdowns of certain workloads, we instead use yield_to to get
another VCPU in the same KVM guest to run sooner.

This seems to give a 10-15% speedup in certain workloads.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d56ed5..fab2250 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -187,6 +187,7 @@ struct kvm {
 #endif
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
atomic_t online_vcpus;
+   int last_boosted_vcpu;
struct list_head vm_list;
struct mutex lock;
struct kvm_io_bus *buses[KVM_NR_BUSES];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 86c4905..8b761ba 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1292,18 +1292,55 @@ void kvm_resched(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_resched);
 
-void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
+void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
-   ktime_t expires;
-   DEFINE_WAIT(wait);
-
-   prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
-
-   /* Sleep for 100 us, and hope lock-holder got scheduled */
-   expires = ktime_add_ns(ktime_get(), 100000UL);
-   schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+   struct kvm *kvm = me->kvm;
+   struct kvm_vcpu *vcpu;
+   int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+   int yielded = 0;
+   int pass;
+   int i;
 
-   finish_wait(&vcpu->wq, &wait);
+   /*
+* We boost the priority of a VCPU that is runnable but not
+* currently running, because it got preempted by something
+* else and called schedule in __vcpu_run.  Hopefully that
+* VCPU is holding the lock that we need and will release it.
+* We approximate round-robin by starting at the last boosted VCPU.
+*/
+   for (pass = 0; pass < 2 && !yielded; pass++) {
+       kvm_for_each_vcpu(i, vcpu, kvm) {
+           struct task_struct *task = NULL;
+           struct pid *pid;
+           if (!pass && i <= last_boosted_vcpu) {
+               i = last_boosted_vcpu;
+               continue;
+           } else if (pass && i > last_boosted_vcpu)
+               break;
+           if (vcpu == me)
+               continue;
+           if (waitqueue_active(&vcpu->wq))
+               continue;
+           rcu_read_lock();
+           pid = rcu_dereference(vcpu->pid);
+           if (pid)
+               task = get_pid_task(vcpu->pid, PIDTYPE_PID);
+           rcu_read_unlock();
+           if (!task)
+               continue;
+           if (task->flags & PF_VCPU) {
+               put_task_struct(task);
+               continue;
+           }
+           if (yield_to(task, 1)) {
+               put_task_struct(task);
+               kvm->last_boosted_vcpu = i;
+               yielded = 1;
+               break;
+           }
+           put_task_struct(task);
+       }
+   }
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 



[RFC -v7 PATCH 0/7] directed yield for Pause Loop Exiting

2011-01-26 Thread Rik van Riel
When running SMP virtual machines, it is possible for one VCPU to be
spinning on a spinlock, while the VCPU that holds the spinlock is not
currently running, because the host scheduler preempted it to run
something else.

Both Intel and AMD CPUs have a feature that detects when a virtual
CPU is spinning on a lock and will trap to the host.

The current KVM code sleeps for a bit whenever that happens, which
results in eg. a 64 VCPU Windows guest taking forever and a bit to
boot up.  This is because the VCPU holding the lock is actually
running and not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with
spinlock detection resulting in one guest giving up its CPU time
to the others.  Instead of spinning, it ends up simply not running
much at all.

This patch series aims to fix that, by having a VCPU that spins
give the remainder of its timeslice to another VCPU in the same
guest before yielding the CPU - one that is runnable but got 
preempted, hopefully the lock holder.

v7:
- move the vcpu to pid mapping to inside the vcpu-mutex
- rename -yield to -skip
- merge patch 5 into patch 4
v6:
- implement yield_task_fair in a way that works with task groups,
  this allows me to actually get a performance improvement!
- fix another race Avi pointed out, the code should be good now
v5:
- fix the race condition Avi pointed out, by tracking vcpu-pid
- also allows us to yield to vcpu tasks that got preempted while in qemu
  userspace
v4:
- change to newer version of Mike Galbraith's yield_to implementation
- chainsaw out some code from Mike that looked like a great idea, but
  turned out to give weird interactions in practice
v3:
- more cleanups
- change to Mike Galbraith's yield_to implementation
- yield to spinning VCPUs, this seems to work better in some
  situations and has little downside potential
v2:
- make lots of cleanups and improvements suggested
- do not implement timeslice scheduling or fairness stuff
  yet, since it is not entirely clear how to do that right
  (suggestions welcome)


Benchmark results:

Two 4-CPU KVM guests are pinned to the same 4 physical CPUs.

One guest runs the AMQP performance test, the other guest runs
0, 2 or 4 infinite loops, for CPU overcommit factors of 1, 1.5
and 2.

The AMQP perftest is run 30 times, with message payloads of 8 and 16 bytes.

size 8    no overcommit   1.5x overcommit   2x overcommit

no PLE    223801          135137            104951
PLE       224135          141105            118744

size 16   no overcommit   1.5x overcommit   2x overcommit

no PLE    222424          126175            105299
PLE       222534          138082            132945

Note: this is with the KVM guests NOT running inside cgroups.  There
seems to be a CPU load balancing issue with cgroup fair group scheduling,
which often results in one guest getting only 80% CPU time and the other
guest 320%.  That will have to be fixed to get meaningful results with
cgroups.

CPU time division between the AMQP guest and the infinite loop guest
were not exactly fair, but the guests got close to the same amount
of CPU time in each test run.

There is a substantial amount of randomness in CPU time division between
guests, but the performance improvement is consistent between multiple
runs.

-- 
All rights reversed.


[RFC -v7 PATCH 1/7] sched: check the right -nr_running in yield_task_fair

2011-01-26 Thread Rik van Riel
With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq.
Yielding to a task from another cfs_rq may be worthwhile, since
a process calling yield typically cannot use the CPU right now.

Therefore, we want to check the per-cpu nr_running, not the
cgroup local one.

Signed-off-by: Rik van Riel r...@redhat.com
---
 kernel/sched_fair.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c62ebae..7b338ac 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1304,7 +1304,7 @@ static void yield_task_fair(struct rq *rq)
/*
 * Are we the only task in the tree?
 */
-   if (unlikely(cfs_rq->nr_running == 1))
+   if (unlikely(rq->nr_running == 1))
return;
 
clear_buddies(cfs_rq, se);


[RFC -v7 PATCH 2/7] sched: limit the scope of clear_buddies

2011-01-26 Thread Rik van Riel
The clear_buddies function does not seem to play well with the concept
of hierarchical runqueues.  In the following tree, task groups are
represented by 'G', tasks by 'T', next by 'n' and last by 'l'.

           (nl)
          /    \
     G(nl)      G
     /    \      \
   T(l)   T(n)    T

This situation can arise when a task T(n) is woken up, and the previously
running task T(l) is marked last.

When clear_buddies is called from either T(l) or T(n), the next and last
buddies of the group G(nl) will be cleared.  This is not the desired
result, since we would like to be able to find the other type of buddy
in many cases.

This is especially a worry when implementing yield_task_fair through the
buddy system.

The fix is simple: only clear the buddy type that the task itself
is indicated to be.  As an added bonus, we stop walking up the tree
when the buddy has already been cleared or pointed elsewhere.

Signed-off-by: Rik van Riel r...@redhat.com
---
 kernel/sched_fair.c |   30 +++---
 1 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f4ee445..0321473 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -784,19 +784,35 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
__enqueue_entity(cfs_rq, se);
 }
 
-static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void __clear_buddies_last(struct sched_entity *se)
 {
-   if (!se || cfs_rq->last == se)
-       cfs_rq->last = NULL;
+   for_each_sched_entity(se) {
+       struct cfs_rq *cfs_rq = cfs_rq_of(se);
+       if (cfs_rq->last == se)
+           cfs_rq->last = NULL;
+       else
+           break;
+   }
+}
 
-   if (!se || cfs_rq->next == se)
-       cfs_rq->next = NULL;
+static void __clear_buddies_next(struct sched_entity *se)
+{
+   for_each_sched_entity(se) {
+       struct cfs_rq *cfs_rq = cfs_rq_of(se);
+       if (cfs_rq->next == se)
+           cfs_rq->next = NULL;
+       else
+           break;
+   }
 }
 
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-   for_each_sched_entity(se)
-       __clear_buddies(cfs_rq_of(se), se);
+   if (cfs_rq->last == se)
+       __clear_buddies_last(se);
+
+   if (cfs_rq->next == se)
+       __clear_buddies_next(se);
 }
 
 static void


[RFC -v7 PATCH 4/7] Add yield_to(task, preempt) functionality.

2011-01-26 Thread Rik van Riel
From: Mike Galbraith efa...@gmx.de

Currently only implemented for fair class tasks.

Add a yield_to_task() method to the fair scheduling class, allowing the
caller of yield_to() to accelerate another thread in its thread group or
task group.

Implemented via a scheduler hint, using cfs_rq->next to encourage the
target being selected.  We can rely on pick_next_entity to keep things
fair, so no one can accelerate a thread that has already used its fair
share of CPU time.

This also means callers should only call yield_to when they really
mean it.  Calling it too often can result in the scheduler just
ignoring the hint.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Mike Galbraith efa...@gmx.de

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c79e92..6c43fc4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1047,6 +1047,7 @@ struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
+   bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);
 
    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
@@ -1943,6 +1944,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
 # define rt_mutex_adjust_pi(p) do { } while (0)
 #endif
 
+extern bool yield_to(struct task_struct *p, bool preempt);
 extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index 7ff53e2..5e70156 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5270,6 +5270,56 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ *
+ * Returns true if we indeed boosted the target task.
+ */
+bool __sched yield_to(struct task_struct *p, bool preempt)
+{
+   struct task_struct *curr = current;
+   struct rq *rq, *p_rq;
+   unsigned long flags;
+   bool yielded = 0;
+
+   local_irq_save(flags);
+   rq = this_rq();
+
+again:
+   p_rq = task_rq(p);
+   double_rq_lock(rq, p_rq);
+   while (task_rq(p) != p_rq) {
+   double_rq_unlock(rq, p_rq);
+   goto again;
+   }
+
+   if (!curr->sched_class->yield_to_task)
+       goto out;
+
+   if (curr->sched_class != p->sched_class)
+       goto out;
+
+   if (task_running(p_rq, p) || p->state)
+       goto out;
+
+   yielded = curr->sched_class->yield_to_task(rq, p, preempt);
+
+out:
+   double_rq_unlock(rq, p_rq);
+   local_irq_restore(flags);
+
+   if (yielded)
+   yield();
+
+   return yielded;
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /*
 * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a56d410..be729d7 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1800,6 +1800,23 @@ static void yield_task_fair(struct rq *rq)
set_skip_buddy(se);
 }
 
+static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preempt)
+{
+   struct sched_entity *se = &p->se;
+
+   if (!se->on_rq)
+       return false;
+
+   /* Tell the scheduler that we'd really like pse to run next. */
+   set_next_buddy(se);
+
+   /* Make p's CPU reschedule; pick_next_entity takes care of fairness. */
+   if (preempt)
+       resched_task(rq->curr);
+
+   return true;
+}
+
 #ifdef CONFIG_SMP
 /**
  * Fair scheduling class load-balancing methods:
@@ -3993,6 +4010,7 @@ static const struct sched_class fair_sched_class = {
.enqueue_task   = enqueue_task_fair,
.dequeue_task   = dequeue_task_fair,
.yield_task = yield_task_fair,
+   .yield_to_task  = yield_to_task_fair,
 
.check_preempt_curr = check_preempt_wakeup,
 


Re: [PATCH 3/3] Provide control over unmapped pages (v4)

2011-01-26 Thread Minchan Kim
Hi Balbir,

On Tue, Jan 25, 2011 at 2:10 PM, Balbir Singh bal...@linux.vnet.ibm.com wrote:
 Changelog v4
 1. Add max_unmapped_ratio and use that as the upper limit
 to check when to shrink the unmapped page cache (Christoph
 Lameter)

 Changelog v2
 1. Use a config option to enable the code (Andrew Morton)
 2. Explain the magic tunables in the code or at-least attempt
   to explain them (General comment)
 3. Hint uses of the boot parameter with unlikely (Andrew Morton)
 4. Use better names (balanced is not a good naming convention)

 Provide control using zone_reclaim() and a boot parameter. The
 code reuses functionality from zone_reclaim() to isolate unmapped
 pages and reclaim them as a priority, ahead of other mapped pages.
 A new sysctl for max_unmapped_ratio is provided and set to 16,
 meaning that once 16% of the total zone pages are unmapped, we
 start shrinking unmapped page cache.

 Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
 ---
  Documentation/kernel-parameters.txt |    8 +++
  include/linux/mmzone.h              |    5 ++
  include/linux/swap.h                |   23 -
  init/Kconfig                        |   12 +
  kernel/sysctl.c                     |   11 
  mm/page_alloc.c                     |   25 ++
  mm/vmscan.c                         |   87 +++
  7 files changed, 166 insertions(+), 5 deletions(-)

 diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
 index fee5f57..65a4ee6 100644
 --- a/Documentation/kernel-parameters.txt
 +++ b/Documentation/kernel-parameters.txt
 @@ -2500,6 +2500,14 @@ and is between 256 and 4096 characters. It is defined in the file
                        [X86]
                        Set unknown_nmi_panic=1 early on boot.

 +       unmapped_page_control
 +                       [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
 +                       is enabled. It controls the amount of unmapped memory
 +                       that is present in the system. This boot option plus
 +                       vm.min_unmapped_ratio (sysctl) provide granular control
 +                       over how much unmapped page cache can exist in the system
 +                       before kswapd starts reclaiming unmapped page cache pages.
        usbcore.autosuspend=
                        [USB] The autosuspend time delay (in seconds) used
                        for newly-detected USB devices (default 2).  This
 diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
 index 2485acc..18f0f09 100644
 --- a/include/linux/mmzone.h
 +++ b/include/linux/mmzone.h
 @@ -306,7 +306,10 @@ struct zone {
        /*
         * zone reclaim becomes active if more unmapped pages exist.
         */
 +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
        unsigned long           min_unmapped_pages;
 +       unsigned long           max_unmapped_pages;
 +#endif
  #ifdef CONFIG_NUMA
        int node;
        unsigned long           min_slab_pages;
 @@ -773,6 +776,8 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
                                        void __user *, size_t *, loff_t *);
  int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
                        void __user *, size_t *, loff_t *);
 +int sysctl_max_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 +                       void __user *, size_t *, loff_t *);
  int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
                        void __user *, size_t *, loff_t *);

 diff --git a/include/linux/swap.h b/include/linux/swap.h
 index 7b75626..ae62a03 100644
 --- a/include/linux/swap.h
 +++ b/include/linux/swap.h
 @@ -255,19 +255,34 @@ extern int vm_swappiness;
  extern int remove_mapping(struct address_space *mapping, struct page *page);
  extern long vm_total_pages;

 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
  extern int sysctl_min_unmapped_ratio;
 +extern int sysctl_max_unmapped_ratio;
 +
  extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 -#ifdef CONFIG_NUMA
 -extern int zone_reclaim_mode;
 -extern int sysctl_min_slab_ratio;
  #else
 -#define zone_reclaim_mode 0
  static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
  {
        return 0;
  }
  #endif

 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
 +extern bool should_reclaim_unmapped_pages(struct zone *zone);
 +#else
 +static inline bool should_reclaim_unmapped_pages(struct zone *zone)
 +{
 +       return false;
 +}
 +#endif
 +
 +#ifdef CONFIG_NUMA
 +extern int zone_reclaim_mode;
 +extern int sysctl_min_slab_ratio;
 +#else
 +#define zone_reclaim_mode 0
 +#endif
 +
  extern int page_evictable(struct page *page, struct vm_area_struct *vma);
  extern void scan_mapping_unevictable_pages(struct address_space *);

 diff --git a/init/Kconfig b/init/Kconfig
 index 4f6cdbf..2dfbc09 100644
 --- a/init/Kconfig
 +++ 

Re: [PATCH v3 1/3] KVM: fix rcu usage warning in kvm_arch_vcpu_ioctl_set_sregs()

2011-01-26 Thread Xiao Guangrong
On 01/20/2011 11:27 PM, Marcelo Tosatti wrote:

 Sure :-), here is the simple test result of kernbench:

 Before patch:

 real 5m6.493s
 user 3m57.847s
 sys  9m7.115s

 real 5m1.750s
 user 4m0.109s
 sys  9m10.192s

 After patch:

 real 5m0.140s
 user 3m57.956s
 sys  8m58.339s

 real 4m56.314s
 user 4m0.303s
 sys  8m55.774s
 
 Nice. One disadvantageous side effect for the kvm_vcpu_kick path 
 is that it can race with make_all_cpus_request, which is possibly
 doing unrelated, slower work (IPI'ing other vcpus, waiting for
 response).
 
 Looks ok, but lets wait for more careful reviews before applying.
 

Hi Avi, Marcelo,

I shall be on vacation from 1.25 to 2.9; if you have comments on it I'll
reply after coming back. :-)



Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2011-01-26 Thread Huang Ying
Hi, Andrew,

On Thu, 2011-01-20 at 23:50 +0800, Marcelo Tosatti wrote:
 On Mon, Jan 17, 2011 at 08:47:39AM +0800, Huang Ying wrote:
  Hi, Andrew,
  
  On Sun, 2011-01-16 at 23:35 +0800, Avi Kivity wrote:
   On 01/14/2011 03:37 AM, Huang Ying wrote:
On Thu, 2011-01-13 at 18:43 +0800, Avi Kivity wrote:
  On 01/13/2011 10:42 AM, Huang Ying wrote:
Make __get_user_pages return -EHWPOISON for HWPOISON page only if
FOLL_HWPOISON is specified.  With this patch, the interested 
 callers
can distinguish HWPOISON page from general FAULT page, while other
callers will still get -EFAULT for pages, so the user space 
 interface
need not to be changed.
  
get_user_pages_hwpoison is added as a variant of get_user_pages 
 that
can return -EHWPOISON for HWPOISON page.
  
This feature is needed by KVM, where UCR MCE should be relayed to
guest for HWPOISON page, while instruction emulation and MMIO 
 will be
tried for general FAULT page.
  
The idea comes from Andrew Morton.
  
Signed-off-by: Huang Yingying.hu...@intel.com
Cc: Andrew Mortona...@linux-foundation.org
  
+#ifdef CONFIG_MEMORY_FAILURE
 +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm,
 +			unsigned long start, int nr_pages, int write,
 +			int force, struct page **pages,
 +			struct vm_area_struct **vmas);
 +#else

  Since we'd also like to add get_user_pages_noio(), perhaps adding a
  flags field to get_user_pages() is better.
   
Or export __get_user_pages()?
   
   That's better, yes.
  
  Do you think it is a good idea to export __get_user_pages() instead of
  adding get_user_pages_noio() and get_user_pages_hwpoison()?
 
 Better Andrew and/or Linus should decide it.

We really need your comments about this.

Best Regards,
Huang Ying

