from:"Glauber Costa"

Re: [PATCH v1 0/2] Add timeout mechanism to qmp actions

2020-12-08 Thread Glauber Costa

On Tue, Dec 8, 2020 at 8:11 AM Stefan Hajnoczi  wrote:
>
> On Thu, Oct 22, 2020 at 05:29:16PM +0100, Fam Zheng wrote:
> > On Tue, 2020-10-20 at 09:34 +0800, Zhenyu Ye wrote:
> > > On 2020/10/19 21:25, Paolo Bonzini wrote:
> > > > On 19/10/20 14:40, Zhenyu Ye wrote:
> > > > > The kernel backtrace for io_submit in GUEST is:
> > > > >
> > > > > guest# ./offcputime -K -p `pgrep -nx fio`
> > > > > b'finish_task_switch'
> > > > > b'__schedule'
> > > > > b'schedule'
> > > > > b'io_schedule'
> > > > > b'blk_mq_get_tag'
> > > > > b'blk_mq_get_request'
> > > > > b'blk_mq_make_request'
> > > > > b'generic_make_request'
> > > > > b'submit_bio'
> > > > > b'blkdev_direct_IO'
> > > > > b'generic_file_read_iter'
> > > > > b'aio_read'
> > > > > b'io_submit_one'
> > > > > b'__x64_sys_io_submit'
> > > > > b'do_syscall_64'
> > > > > b'entry_SYSCALL_64_after_hwframe'
> > > > > -fio (1464)
> > > > > 40031912
> > > > >
> > > > > And Linux io_uring can avoid the latency problem.
> >
> > Thanks for the info. What this tells us is basically the inflight
> > requests are high. It's sad that the linux-aio is in practice
> > implemented as a blocking API.

it is.

> >
> > Host side backtrace will be of more help. Can you get that too?
>
> I guess Linux AIO didn't set the BLK_MQ_REQ_NOWAIT flag so the task went
> to sleep when it ran out of blk-mq tags. The easiest solution is to move
> to io_uring. Linux AIO is broken - it's not AIO :).

Agree!
>
> If we know that no other process is writing to the host block device
> then maybe we can determine the blk-mq tags limit (the queue depth) and
> avoid sending more requests. That way QEMU doesn't block, but I don't
> think this approach works when other processes are submitting I/O to the
> same host block device :(.
>
> Fam's original suggestion of invoking io_submit(2) from a worker thread
> is an option, but I'm afraid it will slow down the uncontended case.
>
> I'm CCing Glauber in case he battled this in the past in ScyllaDB.

We have, and a lot. I don't recall seeing this particular lock, but
XFS would block us all the time
if it had to update metadata to submit the operation, lock inodes, etc.

The work we did at the time was in fixing those things in the kernel
as much as we could.
But the API is just like that...

>
> Stefan

Re: [Qemu-devel] [PATCH v3] Add an isa device for SGA

2011-06-07 Thread Glauber Costa


On 06/07/2011 04:17 PM, Anthony Liguori wrote:

On 05/16/2011 01:45 PM, Glauber Costa wrote:

This patch adds a dummy legacy ISA device whose responsibility is to
deploy sgabios, an option rom for a serial graphics adapter.
The proposal is that this device is always-on when -nographics,
but can otherwise be enable in any setup when -device sga is used.

[v2: suggestions on qdev by Markus ]
[v3: cleanups and documentation, per list suggestions ]

Signed-off-by: Glauber Costaglom...@redhat.com


Applied. But I'd like to figure out what to do about sgabios.bin. I
think we should ship a copy.


Agree.


Regards,

Anthony Liguori


---
Makefile.target | 2 +-
hw/pc.c | 9 
hw/sga.c | 56 +++
3 files changed, 66 insertions(+), 1 deletions(-)
create mode 100644 hw/sga.c

diff --git a/Makefile.target b/Makefile.target
index fdbdc6c..004ea7e 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -224,7 +224,7 @@ obj-$(CONFIG_KVM) += ivshmem.o
# Hardware support
obj-i386-y += vga.o
obj-i386-y += mc146818rtc.o i8259.o pc.o
-obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o
+obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o
obj-i386-y += vmport.o
obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
obj-i386-y += extboot.o
diff --git a/hw/pc.c b/hw/pc.c
index 8d351ba..5a8e00a 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1096,6 +1096,15 @@ void pc_vga_init(PCIBus *pci_bus)
isa_vga_init();
}
}
+
+ /*
+ * sga does not suppress normal vga output. So a machine can have both a
+ * vga card and sga manually enabled. Output will be seen on both.
+ * For nographic case, sga is enabled at all times
+ */
+ if (display_type == DT_NOGRAPHIC) {
+ isa_create_simple(sga);
+ }
}

static void cpu_request_exit(void *opaque, int irq, int level)
diff --git a/hw/sga.c b/hw/sga.c
new file mode 100644
index 000..7ef750a
--- /dev/null
+++ b/hw/sga.c
@@ -0,0 +1,56 @@
+/*
+ * QEMU dummy ISA device for loading sgabios option rom.
+ *
+ * Copyright (c) 2011 Glauber Costa, Red Hat Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person
obtaining a copy
+ * of this software and associated documentation files (the
Software), to deal
+ * in the Software without restriction, including without limitation
the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be
included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES
OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN
+ * THE SOFTWARE.
+ *
+ * sgabios code originally available at code.google.com/p/sgabios
+ *
+ */
+#include pci.h
+#include pc.h
+#include loader.h
+#include sysemu.h
+
+#define SGABIOS_FILENAME sgabios.bin
+
+typedef struct ISAGAState {
+ ISADevice dev;
+} ISASGAState;
+
+static int isa_cirrus_vga_initfn(ISADevice *dev)
+{
+ rom_add_vga(SGABIOS_FILENAME);
+ return 0;
+}
+
+static ISADeviceInfo sga_info = {
+ .qdev.name = sga,
+ .qdev.desc = Serial Graphics Adapter,
+ .qdev.size = sizeof(ISASGAState),
+ .init = isa_cirrus_vga_initfn,
+};
+
+static void sga_register(void)
+{
+ isa_qdev_register(sga_info);
+}
+
+device_init(sga_register);

[Qemu-devel] [PATCH v3] Add an isa device for SGA

2011-05-16 Thread Glauber Costa

This patch adds a dummy legacy ISA device whose responsibility is to
deploy sgabios, an option rom for a serial graphics adapter.
The proposal is that this device is always-on when -nographics,
but can otherwise be enable in any setup when -device sga is used.

[v2: suggestions on qdev by Markus ]
[v3: cleanups and documentation, per list suggestions ]

Signed-off-by: Glauber Costa glom...@redhat.com
---
 Makefile.target |2 +-
 hw/pc.c |9 
 hw/sga.c|   56 +++
 3 files changed, 66 insertions(+), 1 deletions(-)
 create mode 100644 hw/sga.c

diff --git a/Makefile.target b/Makefile.target
index fdbdc6c..004ea7e 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -224,7 +224,7 @@ obj-$(CONFIG_KVM) += ivshmem.o
 # Hardware support
 obj-i386-y += vga.o
 obj-i386-y += mc146818rtc.o i8259.o pc.o
-obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o
+obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o
 obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += extboot.o
diff --git a/hw/pc.c b/hw/pc.c
index 8d351ba..5a8e00a 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1096,6 +1096,15 @@ void pc_vga_init(PCIBus *pci_bus)
 isa_vga_init();
 }
 }
+
+/*
+ * sga does not suppress normal vga output. So a machine can have both a
+ * vga card and sga manually enabled. Output will be seen on both.
+ * For nographic case, sga is enabled at all times
+ */
+if (display_type == DT_NOGRAPHIC) {
+isa_create_simple(sga);
+}
 }
 
 static void cpu_request_exit(void *opaque, int irq, int level)
diff --git a/hw/sga.c b/hw/sga.c
new file mode 100644
index 000..7ef750a
--- /dev/null
+++ b/hw/sga.c
@@ -0,0 +1,56 @@
+/*
+ * QEMU dummy ISA device for loading sgabios option rom.
+ *
+ * Copyright (c) 2011 Glauber Costa, Red Hat Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the Software), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ *
+ * sgabios code originally available at code.google.com/p/sgabios
+ *
+ */
+#include pci.h
+#include pc.h
+#include loader.h
+#include sysemu.h
+
+#define SGABIOS_FILENAME sgabios.bin
+
+typedef struct ISAGAState {
+ISADevice dev;
+} ISASGAState;
+
+static int isa_cirrus_vga_initfn(ISADevice *dev)
+{
+rom_add_vga(SGABIOS_FILENAME);
+return 0;
+}
+
+static ISADeviceInfo sga_info = {
+.qdev.name= sga,
+.qdev.desc= Serial Graphics Adapter,
+.qdev.size= sizeof(ISASGAState),
+.init = isa_cirrus_vga_initfn,
+};
+
+static void sga_register(void)
+{
+  isa_qdev_register(sga_info);
+}
+
+device_init(sga_register);
-- 
1.7.4.2

Re: [Qemu-devel] [PATCH v2] Add an isa device for SGA

2011-05-13 Thread Glauber Costa

On Fri, May 13, 2011 at 10:43:33AM +0200, Markus Armbruster wrote:
 Glauber Costa glom...@redhat.com writes:
 
  This patch adds a dummy legacy ISA device whose responsibility is to
  deploy sgabios, an option rom for a serial graphics adapter.
  The proposal is that this device is always-on when -nographics,
  but can otherwise be enable in any setup when -device sga is used.
 
  [v2: suggestions on qdev by Markus ]
 
 One more question: should -device sga suppress the default VGA?  The
 other three VGA devices do; check out default_list[] in vl.c.
No, it should not. This rom is specially designed not to:
It will chain all int10h requests to whatever it was there before.

 
 [...]
  diff --git a/hw/pc.h b/hw/pc.h
  index 1291e2d..a00e054 100644
  --- a/hw/pc.h
  +++ b/hw/pc.h
  @@ -219,6 +219,9 @@ int isa_vga_mm_init(target_phys_addr_t vram_base,
   void pci_cirrus_vga_init(PCIBus *bus);
   void isa_cirrus_vga_init(void);
   
  +/* serial graphics */
  +void isa_sga_init(void);
  +
 
 No longer exists, please drop.
You're right, my bad.

Anthony, is it needed to send a v3 to take this away, or will you remove 
manually ?

[Qemu-devel] [PATCH] Avoid segmentation fault for qdev device not found

2011-05-12 Thread Glauber Costa

qdev_try_create will cope well with a NULL bus, since it will assume
the main system bus by default. qdev_create, however, wants to print
a message, in which it instantiates the bus name. That simple and at
first inoffensive message will generate a segmentation found if the
reason for failure is a NULL bus.

I propose we avoid that - thus generating the normal hw_error by
always passing a valid bus to qdev_try_create - if none, be it the
main system bus.

The code for testing a NULL bus is not remove from qdev_try_create
because it is a externally visible function, and we want it to continue
working fine.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 hw/qdev.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/hw/qdev.c b/hw/qdev.c
index 1aa1ea0..21ef075 100644
--- a/hw/qdev.c
+++ b/hw/qdev.c
@@ -108,6 +108,10 @@ DeviceState *qdev_create(BusState *bus, const char *name)
 {
 DeviceState *dev;
 
+if (!bus) {
+bus = sysbus_get_default();
+}
+
 dev = qdev_try_create(bus, name);
 if (!dev) {
 hw_error(Unknown device '%s' for bus '%s'\n, name, bus-info-name);
-- 
1.7.4.2

[Qemu-devel] [PATCH v2] Add an isa device for SGA

2011-05-12 Thread Glauber Costa

This patch adds a dummy legacy ISA device whose responsibility is to
deploy sgabios, an option rom for a serial graphics adapter.
The proposal is that this device is always-on when -nographics,
but can otherwise be enable in any setup when -device sga is used.

[v2: suggestions on qdev by Markus ]

Signed-off-by: Glauber Costa glom...@redhat.com
---
 Makefile.target |2 +-
 hw/pc.c |4 
 hw/pc.h |3 +++
 hw/sga.c|   31 +++
 4 files changed, 39 insertions(+), 1 deletions(-)
 create mode 100644 hw/sga.c

diff --git a/Makefile.target b/Makefile.target
index fdbdc6c..004ea7e 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -224,7 +224,7 @@ obj-$(CONFIG_KVM) += ivshmem.o
 # Hardware support
 obj-i386-y += vga.o
 obj-i386-y += mc146818rtc.o i8259.o pc.o
-obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o
+obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o
 obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += extboot.o
diff --git a/hw/pc.c b/hw/pc.c
index 8d351ba..78bf7de 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1096,6 +1096,10 @@ void pc_vga_init(PCIBus *pci_bus)
 isa_vga_init();
 }
 }
+
+if (display_type == DT_NOGRAPHIC) {
+isa_create_simple(sga);
+}
 }
 
 static void cpu_request_exit(void *opaque, int irq, int level)
diff --git a/hw/pc.h b/hw/pc.h
index 1291e2d..a00e054 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -219,6 +219,9 @@ int isa_vga_mm_init(target_phys_addr_t vram_base,
 void pci_cirrus_vga_init(PCIBus *bus);
 void isa_cirrus_vga_init(void);
 
+/* serial graphics */
+void isa_sga_init(void);
+
 /* ne2000.c */
 static inline bool isa_ne2000_init(int base, int irq, NICInfo *nd)
 {
diff --git a/hw/sga.c b/hw/sga.c
new file mode 100644
index 000..08cef56
--- /dev/null
+++ b/hw/sga.c
@@ -0,0 +1,31 @@
+#include pci.h
+#include pc.h
+#include loader.h
+#include sysemu.h
+
+#define SGABIOS_FILENAME sgabios.bin
+
+typedef struct ISAGAState {
+ISADevice dev;
+} ISASGAState;
+
+
+static int isa_cirrus_vga_initfn(ISADevice *dev)
+{
+rom_add_vga(SGABIOS_FILENAME);
+return 0;
+}
+
+static ISADeviceInfo sga_info = {
+.qdev.name= sga,
+.qdev.desc= Serial Graphics Adapter,
+.qdev.size= sizeof(ISASGAState),
+.init = isa_cirrus_vga_initfn,
+};
+
+static void sga_register(void)
+{
+  isa_qdev_register(sga_info);
+}
+
+device_init(sga_register);
-- 
1.7.4.2

[Qemu-devel] [PATCH] Add an isa device for SGA

2011-05-11 Thread Glauber Costa

This patch adds a dummy legacy ISA device whose responsibility is to
deploy sgabios, an option rom for a serial graphics adapter.
The proposal is that this device is always-on when -nographics,
but can otherwise be enable in any setup when -device sga is used.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 Makefile.target |2 +-
 hw/pc.c |2 ++
 hw/pc.h |3 +++
 hw/sga.c|   49 +
 4 files changed, 55 insertions(+), 1 deletions(-)
 create mode 100644 hw/sga.c

diff --git a/Makefile.target b/Makefile.target
index fdbdc6c..004ea7e 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -224,7 +224,7 @@ obj-$(CONFIG_KVM) += ivshmem.o
 # Hardware support
 obj-i386-y += vga.o
 obj-i386-y += mc146818rtc.o i8259.o pc.o
-obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o
+obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o
 obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += extboot.o
diff --git a/hw/pc.c b/hw/pc.c
index 8d351ba..56c3887 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1096,6 +1096,8 @@ void pc_vga_init(PCIBus *pci_bus)
 isa_vga_init();
 }
 }
+
+isa_sga_init();
 }
 
 static void cpu_request_exit(void *opaque, int irq, int level)
diff --git a/hw/pc.h b/hw/pc.h
index 1291e2d..a00e054 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -219,6 +219,9 @@ int isa_vga_mm_init(target_phys_addr_t vram_base,
 void pci_cirrus_vga_init(PCIBus *bus);
 void isa_cirrus_vga_init(void);
 
+/* serial graphics */
+void isa_sga_init(void);
+
 /* ne2000.c */
 static inline bool isa_ne2000_init(int base, int irq, NICInfo *nd)
 {
diff --git a/hw/sga.c b/hw/sga.c
new file mode 100644
index 000..411191b
--- /dev/null
+++ b/hw/sga.c
@@ -0,0 +1,49 @@
+#include pci.h
+#include pc.h
+#include loader.h
+#include sysemu.h
+
+#define SGABIOS_FILENAME sgabios.bin
+
+typedef struct ISAGAState {
+ISADevice dev;
+} ISASGAState;
+
+/* We can have both -device, and the initfn called, so better
+ * avoid to have the rom loaded twice */
+static void deploy_rom(void)
+{
+static int rom_deployed = 0;
+
+if (!rom_deployed++) {
+rom_add_vga(SGABIOS_FILENAME);
+}
+}
+
+static int isa_cirrus_vga_initfn(ISADevice *dev)
+{
+deploy_rom();
+
+return 0;
+}
+
+void isa_sga_init(void)
+{
+if (display_type == DT_NOGRAPHIC) {
+deploy_rom();
+}
+}
+
+static ISADeviceInfo sga_info = {
+.qdev.name= sga,
+.qdev.desc= Serial Graphics Adapter,
+.qdev.size= sizeof(ISASGAState),
+.init = isa_cirrus_vga_initfn,
+};
+
+static void sga_register(void)
+{
+  isa_qdev_register(sga_info);
+}
+
+device_init(sga_register);
-- 
1.7.4.2

Re: [Qemu-devel] [PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-05-04 Thread Glauber Costa

On Wed, 2011-05-04 at 06:09 -0300, Marcelo Tosatti wrote:
 On Wed, May 04, 2011 at 04:06:59AM -0400, Ulrich Obergfell wrote:
  
  Hi Marcelo,
   
   Whats prev_period for, since in practice the period will not change
   between interrupts (OS programs comparator once, or perhaps twice
   during bootup) ?
  
  'prev_period' is needed if a guest o/s changes the comparator period
  'on the fly' (without stopping and restarting the timer).
  
  
   guest o/s changes period
 |
ti(n-1)  |ti(n)  ti(n+1)
  |  v  |  |
  +-+--+
  
   --- prev_period --- -- period --
  
  
  The idea is that each timer interrupt represents a certain quantum
  of time (the comparator period). If a guest o/s changes the period
  between timer interrupt 'n-1' and timer interrupt 'n', I think the
  new value should not take effect before timer interrupt 'n'. Timer
  interrupt 'n' still represents the old/previous quantum, and timer
  interrupt 'n+1' represents the new quantum.
  
  Hence, the patch decrements 'ticks_not_accounted' by 'prev_period'
  and sets 'prev_period' to 'period' when an interrupt was delivered
  to the guest o/s.
  
  +irq_delivered = update_irq(t, 1);
  +if (irq_delivered) {
  +t-ticks_not_accounted -= t-prev_period;
  +t-prev_period = t-period;
  +} else {
  
  Most of the time 'prev_period' is equal to 'period'. It should only
  be different in the scenario shown above.
 
 OK, makes sense. You should probably reset ticks_not_accounted to zero
 on HPET initialization (for example, to avoid miscalibration when
 kexec'ing a new kernel).

Everybody resetting the machine in anyway is expected to force devices
to be reinitialized, right ?
I may be wrong, but I was under the impression that kexec would do this
as well. In this case, the reset function should be enough.

Re: [Qemu-devel] [PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-05-04 Thread Glauber Costa

On Wed, 2011-05-04 at 16:46 +0300, Gleb Natapov wrote:
 On Wed, May 04, 2011 at 10:36:12AM -0300, Glauber Costa wrote:
  On Wed, 2011-05-04 at 06:09 -0300, Marcelo Tosatti wrote:
   On Wed, May 04, 2011 at 04:06:59AM -0400, Ulrich Obergfell wrote:

Hi Marcelo,
 
 Whats prev_period for, since in practice the period will not change
 between interrupts (OS programs comparator once, or perhaps twice
 during bootup) ?

'prev_period' is needed if a guest o/s changes the comparator period
'on the fly' (without stopping and restarting the timer).


 guest o/s changes period
   |
  ti(n-1)  |ti(n)  ti(n+1)
|  v  |  |
+-+--+

 --- prev_period --- -- period --


The idea is that each timer interrupt represents a certain quantum
of time (the comparator period). If a guest o/s changes the period
between timer interrupt 'n-1' and timer interrupt 'n', I think the
new value should not take effect before timer interrupt 'n'. Timer
interrupt 'n' still represents the old/previous quantum, and timer
interrupt 'n+1' represents the new quantum.

Hence, the patch decrements 'ticks_not_accounted' by 'prev_period'
and sets 'prev_period' to 'period' when an interrupt was delivered
to the guest o/s.

+irq_delivered = update_irq(t, 1);
+if (irq_delivered) {
+t-ticks_not_accounted -= t-prev_period;
+t-prev_period = t-period;
+} else {

Most of the time 'prev_period' is equal to 'period'. It should only
be different in the scenario shown above.
   
   OK, makes sense. You should probably reset ticks_not_accounted to zero
   on HPET initialization (for example, to avoid miscalibration when
   kexec'ing a new kernel).
  
  Everybody resetting the machine in anyway is expected to force devices
  to be reinitialized, right ?
  I may be wrong, but I was under the impression that kexec would do this
  as well. In this case, the reset function should be enough.
  
 kexec does not reset a machine. That's the whole point of kexec in
 fact.
Sure thing, but doesn't it force the initialization routine of the devices 
themselves, without
going through the bios ?

 --
   Gleb.

Re: [Qemu-devel] [PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-05-03 Thread Glauber Costa

On Tue, 2011-05-03 at 16:03 -0300, Marcelo Tosatti wrote:
 On Thu, Apr 28, 2011 at 04:25:00PM +0200, Ulrich Obergfell wrote:
  Loss of periodic timer interrupts caused by delayed callbacks and by
  interrupt coalescing is compensated by gradually injecting additional
  interrupts during subsequent timer intervals, starting at a rate of
  one additional interrupt per interval. The injection of additional
  interrupts is based on a backlog of unaccounted HPET clock periods
  (new HPETTimer field 'ticks_not_accounted'). The backlog increases
  due to delayed callbacks and coalesced interrupts, and it decreases
  if an interrupt was injected successfully. If the backlog increases
  while compensation is still in progress, the rate at which additional
  interrupts are injected is increased too. A limit is imposed on the
  backlog and on the rate.
  
  Injecting additional timer interrupts to compensate lost interrupts
  can alleviate long term time drift. However, on a short time scale,
  this method can have the side effect of making virtual machine time
  intermittently pass slower and faster than real time (depending on
  the guest's time keeping algorithm). Compensation is disabled by
  default and can be enabled for guests where this behaviour may be
  acceptable.
  
  Signed-off-by: Ulrich Obergfell uober...@redhat.com
  ---
   hw/hpet.c |   63 
  +++-
   1 files changed, 61 insertions(+), 2 deletions(-)
  
  diff --git a/hw/hpet.c b/hw/hpet.c
  index 35466ae..92d5f58 100644
  --- a/hw/hpet.c
  +++ b/hw/hpet.c
  @@ -32,6 +32,7 @@
   #include sysbus.h
   #include mc146818rtc.h
   #include sysemu.h
  +#include assert.h
   
   //#define HPET_DEBUG
   #ifdef HPET_DEBUG
  @@ -42,6 +43,9 @@
   
   #define HPET_MSI_SUPPORT0
   
  +#define MAX_TICKS_NOT_ACCOUNTED (uint64_t)5 /* 5 sec */
  +#define MAX_IRQ_RATE(uint32_t)10
  +
   struct HPETState;
   typedef struct HPETTimer {  /* timers */
   uint8_t tn; /*timer number*/
  @@ -326,28 +330,63 @@ static const VMStateDescription vmstate_hpet = {
   }
   };
   
  +static bool hpet_timer_has_tick_backlog(HPETTimer *t)
  +{
  +uint64_t backlog = t-ticks_not_accounted - (t-period + 
  t-prev_period);
  +return (backlog = t-period);
  +}
  +
   /*
* timer expiration callback
*/
   static void hpet_timer(void *opaque)
   {
   HPETTimer *t = opaque;
  +HPETState *s = t-state;
   uint64_t diff;
  -
  +int irq_delivered = 0;
  +uint32_t irq_count = 0;
   uint64_t period = t-period;
   uint64_t cur_tick = hpet_get_ticks(t-state);
   
  +if (s-driftfix  !t-ticks_not_accounted) {
  +t-ticks_not_accounted = t-prev_period = t-period;
  +}
   if (timer_is_periodic(t)  period != 0) {
   if (t-config  HPET_TN_32BIT) {
   while (hpet_time_after(cur_tick, t-cmp)) {
   t-cmp = (uint32_t)(t-cmp + t-period);
  +t-ticks_not_accounted += t-period;
  +irq_count++;
   }
   } else {
   while (hpet_time_after64(cur_tick, t-cmp)) {
   t-cmp += period;
  +t-ticks_not_accounted += period;
  +irq_count++;
   }
   }
   diff = hpet_calculate_diff(t, cur_tick);
  +if (s-driftfix) {
  +if (t-ticks_not_accounted  MAX_TICKS_NOT_ACCOUNTED) {
  +t-ticks_not_accounted = t-period + t-prev_period;
  +}
  +if (hpet_timer_has_tick_backlog(t)) {
  +if (t-irq_rate == 1 || irq_count  1) {
  +t-irq_rate++;
  +t-irq_rate = MIN(t-irq_rate, MAX_IRQ_RATE);
  +}
  +if (t-divisor == 0) {
  +assert(irq_count);
  +}
  +if (irq_count) {
  +t-divisor = t-irq_rate;
  +}
  +diff /= t-divisor--;
  +} else {
  +t-irq_rate = 1;
  +}
  +}
   qemu_mod_timer(t-qemu_timer,
  qemu_get_clock_ns(vm_clock) + 
  (int64_t)ticks_to_ns(diff));
   } else if (t-config  HPET_TN_32BIT  !timer_is_periodic(t)) {
  @@ -358,7 +397,22 @@ static void hpet_timer(void *opaque)
   t-wrap_flag = 0;
   }
   }
  -update_irq(t, 1);
  +if (s-driftfix  timer_is_periodic(t)  period != 0) {
  +if (t-ticks_not_accounted = t-period + t-prev_period) {
  +irq_delivered = update_irq(t, 1);
  +if (irq_delivered) {
  +t-ticks_not_accounted -= t-prev_period;
  +t-prev_period = t-period;
  +} else {
  +if (irq_count) {
  +t-irq_rate++;
  +t-irq_rate = MIN(t-irq_rate, MAX_IRQ_RATE);
  +}
  +}
  +}

Re: [Qemu-devel] [PATCH v2 2a/6] x86: Allow multiple cpu feature matches of lookup_feature

2011-04-27 Thread Glauber Costa

On Tue, 2011-04-19 at 13:06 +0200, Jan Kiszka wrote:
 kvmclock is represented by two feature bits. Therefore, lookup_feature
 needs to continue its search even after the first match. Enhance it
 accordingly and switch to a bool return type at this chance.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  target-i386/cpuid.c |   14 --
  1 files changed, 8 insertions(+), 6 deletions(-)
 
 Glauber, could you check/ack this? Marcelo, please respin the series
 afterward. I'd like to see all bits upstream and merged back into
 qemu-kvm to proceed with switching the latter to upstream's kvm
 infrastructure.

Yes, this patch is okay.

Actually, I did sent out something like this, maybe Marcelo applied only
part of the series?

Anyway, Jan's version is handy, please apply it.

 diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
 index 814d13e..0ac592f 100644
 --- a/target-i386/cpuid.c
 +++ b/target-i386/cpuid.c
 @@ -182,20 +182,22 @@ static int altcmp(const char *s, const char *e, const 
 char *altstr)
  }
  
  /* search featureset for flag *[s..e), if found set corresponding bit in
 - * *pval and return success, otherwise return zero
 + * *pval and return true, otherwise return false
   */
 -static int lookup_feature(uint32_t *pval, const char *s, const char *e,
 -const char **featureset)
 +static bool lookup_feature(uint32_t *pval, const char *s, const char *e,
 +   const char **featureset)
  {
  uint32_t mask;
  const char **ppc;
 +bool found = false;
  
 -for (mask = 1, ppc = featureset; mask; mask = 1, ++ppc)
 +for (mask = 1, ppc = featureset; mask; mask = 1, ++ppc) {
  if (*ppc  !altcmp(s, e, *ppc)) {
  *pval |= mask;
 -break;
 +found = true;
  }
 -return (mask ? 1 : 0);
 +}
 +return found;
  }
  
  static void add_flagname_to_bitmaps(const char *flagname, uint32_t *features,

[Qemu-devel] [PATCH 0/2] allow fw_dir to be specified in the option-rom switch

2011-04-12 Thread Glauber Costa

Some roms should not live in genroms/, the default place for all roms passed
through -option-rom switch.
Rather, they'd like to be placed in vgaroms. This patch allows it to happen.

Glauber Costa (2):
  document bootindex option
  add fw_dir option to option-rom switch

 hw/pc.c |7 ++-
 qemu-config.c   |3 +++
 qemu-options.hx |   14 --
 sysemu.h|1 +
 vl.c|1 +
 5 files changed, 23 insertions(+), 3 deletions(-)

-- 
1.7.4

[Qemu-devel] [PATCH 1/2] document bootindex option

2011-04-12 Thread Glauber Costa

bootindex option was added to -option-rom switch, but never documented.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 qemu-options.hx |   11 +--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/qemu-options.hx b/qemu-options.hx
index 18f54d2..96927cc 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2079,13 +2079,20 @@ to cope with initialization race conditions.
 ETEXI
 
 DEF(option-rom, HAS_ARG, QEMU_OPTION_option_rom, \
--option-rom rom load a file, rom, into the option ROM space\n,
+-option-rom rom[,bootindex=idx] load a file, rom, into the option ROM 
space\n,
 QEMU_ARCH_ALL)
 STEXI
-@item -option-rom @var{file}
+@item -option-rom @var{file}[,bootindex=@var{bootindex}]
 @findex -option-rom
 Load the contents of @var{file} as an option ROM.
 This option is useful to load things like EtherBoot.
+
+@table @option
+@item bootindex=@var{bootindex}
+Defines which boot index the option rom will be given in boot entry vectors,
+allowing fine-grained selection of devices boot order.
+@end table
+
 ETEXI
 
 DEF(clock, HAS_ARG, QEMU_OPTION_clock, \
-- 
1.7.4

[Qemu-devel] [PATCH 2/2] add fw_dir option to option-rom switch

2011-04-12 Thread Glauber Costa

The option-rom puts all roms passed by this switch in the genroms directory,
through the fw_dir option. But as it turns out, not all roms should be placed 
there.

VGA roms are a of a different kind. They live in a different segment, and are 
scanned
first. This patch allows qemu to use external vgaroms such as sgabios, a rom 
for a serial
graphics adapter.

Without this patch, we lose all the initial BIOS output: until the option rom 
is initialized,
the only available vga rom is the default cirrus/vesa one.

We're also vulnerable to option rom enumeration order: if a vga oprom is 
initialized
before, say, gpxe, we are able to see its output in the adapter. If it is 
initialized
after, we miss it.

So I believe the proper solution is to allow users to specify that a rom 
belongs in vgaroms
directory instead.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Gleb Natapov g...@redhat.com
---
 hw/pc.c |7 ++-
 qemu-config.c   |3 +++
 qemu-options.hx |7 +--
 sysemu.h|1 +
 vl.c|1 +
 5 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 8d351ba..736efbb 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1055,7 +1055,12 @@ void pc_memory_init(ram_addr_t ram_size,
 }
 
 for (i = 0; i  nb_option_roms; i++) {
-rom_add_option(option_rom[i].name, option_rom[i].bootindex);
+/* do it here, instead of in vl.c, to avoid cluttering that file with 
x86 material */
+if (!option_rom[i].fw_dir) {
+option_rom[i].fw_dir = genroms;
+}
+rom_add_file(option_rom[i].name, option_rom[i].fw_dir, 0,
+ option_rom[i].bootindex);
 }
 }
 
diff --git a/qemu-config.c b/qemu-config.c
index 6d9c238..97b8515 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -450,6 +450,9 @@ QemuOptsList qemu_option_rom_opts = {
 }, {
 .name = romfile,
 .type = QEMU_OPT_STRING,
+}, {
+.name = fw_dir,
+.type = QEMU_OPT_STRING,
 },
 { /* end if list */ }
 },
diff --git a/qemu-options.hx b/qemu-options.hx
index 96927cc..d9eec62 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2079,10 +2079,10 @@ to cope with initialization race conditions.
 ETEXI
 
 DEF(option-rom, HAS_ARG, QEMU_OPTION_option_rom, \
--option-rom rom[,bootindex=idx] load a file, rom, into the option ROM 
space\n,
+-option-rom rom[,bootindex=idx][,fw_dir=dir] load a file, rom, into the 
option ROM space\n,
 QEMU_ARCH_ALL)
 STEXI
-@item -option-rom @var{file}[,bootindex=@var{bootindex}]
+@item -option-rom @var{file}[,bootindex=@var{bootindex}][,fw_dir=@var{dir}]
 @findex -option-rom
 Load the contents of @var{file} as an option ROM.
 This option is useful to load things like EtherBoot.
@@ -2091,6 +2091,9 @@ This option is useful to load things like EtherBoot.
 @item bootindex=@var{bootindex}
 Defines which boot index the option rom will be given in boot entry vectors,
 allowing fine-grained selection of devices boot order.
+@item fw_dir=@var{dir} (x86 only)
+Specify under which firmware directory entry this rom should live. Current
+allowed values are vgaroms and genroms (default).
 @end table
 
 ETEXI
diff --git a/sysemu.h b/sysemu.h
index 3f7e17e..2f8be32 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -160,6 +160,7 @@ extern uint64_t node_cpumask[MAX_NODES];
 typedef struct QEMUOptionRom {
 const char *name;
 int32_t bootindex;
+const char *fw_dir;
 } QEMUOptionRom;
 extern QEMUOptionRom option_rom[MAX_OPTION_ROMS];
 extern int nb_option_roms;
diff --git a/vl.c b/vl.c
index 8bcf2ae..e1d7868 100644
--- a/vl.c
+++ b/vl.c
@@ -2675,6 +2675,7 @@ int main(int argc, char **argv, char **envp)
 fprintf(stderr, Option ROM file is not specified\n);
 exit(1);
 }
+option_rom[nb_option_roms].fw_dir = qemu_opt_get(opts, 
fw_dir);
nb_option_roms++;
break;
 case QEMU_OPTION_semihosting:
-- 
1.7.4

Re: [Qemu-devel] [PATCH 2/2] add fw_dir option to option-rom switch

2011-04-12 Thread Glauber Costa

On Tue, 2011-04-12 at 12:40 -0500, Anthony Liguori wrote:
 On 04/12/2011 12:23 PM, Glauber Costa wrote:
  The option-rom puts all roms passed by this switch in the genroms directory,
  through the fw_dir option. But as it turns out, not all roms should be 
  placed there.
 
  VGA roms are a of a different kind. They live in a different segment, and 
  are scanned
  first. This patch allows qemu to use external vgaroms such as sgabios, a 
  rom for a serial
  graphics adapter.
 
 Wouldn't it make more sense to have a new PCI device that had sgabios 
 that could play the role of the VGA device.
 
 Then you could say -vga sga
 
 Regards,

This would be good, if we included sga along the roms we ship. We're
still, however, left with the problem that in the future, people may
want to use their own roms. And the fact that not all of them should
live in genroms persists.

 Anthony Liguori
 
  Without this patch, we lose all the initial BIOS output: until the option 
  rom is initialized,
  the only available vga rom is the default cirrus/vesa one.
 
  We're also vulnerable to option rom enumeration order: if a vga oprom is 
  initialized
  before, say, gpxe, we are able to see its output in the adapter. If it is 
  initialized
  after, we miss it.
 
  So I believe the proper solution is to allow users to specify that a rom 
  belongs in vgaroms
  directory instead.
 
  Signed-off-by: Glauber Costaglom...@redhat.com
  CC: Gleb Natapovg...@redhat.com
  ---
hw/pc.c |7 ++-
qemu-config.c   |3 +++
qemu-options.hx |7 +--
sysemu.h|1 +
vl.c|1 +
5 files changed, 16 insertions(+), 3 deletions(-)
 
  diff --git a/hw/pc.c b/hw/pc.c
  index 8d351ba..736efbb 100644
  --- a/hw/pc.c
  +++ b/hw/pc.c
  @@ -1055,7 +1055,12 @@ void pc_memory_init(ram_addr_t ram_size,
}
 
for (i = 0; i  nb_option_roms; i++) {
  -rom_add_option(option_rom[i].name, option_rom[i].bootindex);
  +/* do it here, instead of in vl.c, to avoid cluttering that file 
  with x86 material */
  +if (!option_rom[i].fw_dir) {
  +option_rom[i].fw_dir = genroms;
  +}
  +rom_add_file(option_rom[i].name, option_rom[i].fw_dir, 0,
  + option_rom[i].bootindex);
}
}
 
  diff --git a/qemu-config.c b/qemu-config.c
  index 6d9c238..97b8515 100644
  --- a/qemu-config.c
  +++ b/qemu-config.c
  @@ -450,6 +450,9 @@ QemuOptsList qemu_option_rom_opts = {
}, {
.name = romfile,
.type = QEMU_OPT_STRING,
  +}, {
  +.name = fw_dir,
  +.type = QEMU_OPT_STRING,
},
{ /* end if list */ }
},
  diff --git a/qemu-options.hx b/qemu-options.hx
  index 96927cc..d9eec62 100644
  --- a/qemu-options.hx
  +++ b/qemu-options.hx
  @@ -2079,10 +2079,10 @@ to cope with initialization race conditions.
ETEXI
 
DEF(option-rom, HAS_ARG, QEMU_OPTION_option_rom, \
  --option-rom rom[,bootindex=idx] load a file, rom, into the option ROM 
  space\n,
  +-option-rom rom[,bootindex=idx][,fw_dir=dir] load a file, rom, into 
  the option ROM space\n,
QEMU_ARCH_ALL)
STEXI
  -@item -option-rom @var{file}[,bootindex=@var{bootindex}]
  +@item -option-rom @var{file}[,bootindex=@var{bootindex}][,fw_dir=@var{dir}]
@findex -option-rom
Load the contents of @var{file} as an option ROM.
This option is useful to load things like EtherBoot.
  @@ -2091,6 +2091,9 @@ This option is useful to load things like EtherBoot.
@item bootindex=@var{bootindex}
Defines which boot index the option rom will be given in boot entry 
  vectors,
allowing fine-grained selection of devices boot order.
  +@item fw_dir=@var{dir} (x86 only)
  +Specify under which firmware directory entry this rom should live. Current
  +allowed values are vgaroms and genroms (default).
@end table
 
ETEXI
  diff --git a/sysemu.h b/sysemu.h
  index 3f7e17e..2f8be32 100644
  --- a/sysemu.h
  +++ b/sysemu.h
  @@ -160,6 +160,7 @@ extern uint64_t node_cpumask[MAX_NODES];
typedef struct QEMUOptionRom {
const char *name;
int32_t bootindex;
  +const char *fw_dir;
} QEMUOptionRom;
extern QEMUOptionRom option_rom[MAX_OPTION_ROMS];
extern int nb_option_roms;
  diff --git a/vl.c b/vl.c
  index 8bcf2ae..e1d7868 100644
  --- a/vl.c
  +++ b/vl.c
  @@ -2675,6 +2675,7 @@ int main(int argc, char **argv, char **envp)
fprintf(stderr, Option ROM file is not specified\n);
exit(1);
}
  +option_rom[nb_option_roms].fw_dir = qemu_opt_get(opts, 
  fw_dir);
  nb_option_roms++;
  break;
case QEMU_OPTION_semihosting:

Re: [Qemu-devel] [PATCH 2/2] add fw_dir option to option-rom switch

2011-04-12 Thread Glauber Costa

On Tue, 2011-04-12 at 13:31 -0500, Anthony Liguori wrote:
 On 04/12/2011 01:13 PM, Glauber Costa wrote:
  On Tue, 2011-04-12 at 12:40 -0500, Anthony Liguori wrote:
  On 04/12/2011 12:23 PM, Glauber Costa wrote:
  The option-rom puts all roms passed by this switch in the genroms 
  directory,
  through the fw_dir option. But as it turns out, not all roms should be 
  placed there.
 
  VGA roms are a of a different kind. They live in a different segment, and 
  are scanned
  first. This patch allows qemu to use external vgaroms such as sgabios, a 
  rom for a serial
  graphics adapter.
  Wouldn't it make more sense to have a new PCI device that had sgabios
  that could play the role of the VGA device.
 
  Then you could say -vga sga
 
  Regards,
  This would be good, if we included sga along the roms we ship. We're
  still, however, left with the problem that in the future, people may
  want to use their own roms.
 
 Fortunately, I don't think that there are myriads of people writing 
 16-bit option roms so I don't think this is a pressing problem :-)
For sure, but if we had this discussion a while ago, 
sgabios wouldn't exist back then, and now it does =p

And the fact that not all of them should
  live in genroms persists.
 
 Actually genroms should disappear.  We should make -option-rom work by 
 using a dummy PCI device or something like that.

In a side track, I think sgabios is a pretty useful addition to qemu.
Would you consider including it together with the other roms ?
This would definitely increase the out-of-the box experience.

I think we don't even need to have a -vga sga switch then. Since sgabios
does not prevent output from going to the normal vga, we could always
install it, or install when there is a -serial parameter present.

Re: [Qemu-devel] [PATCH 2/2] add fw_dir option to option-rom switch

2011-04-12 Thread Glauber Costa

On Tue, 2011-04-12 at 14:19 -0500, Anthony Liguori wrote:
 On 04/12/2011 01:47 PM, Glauber Costa wrote:
  For sure, but if we had this discussion a while ago,
  sgabios wouldn't exist back then, and now it does =p
 
 Actually, it's been around for ages :-)
 
 And the fact that not all of them should
  live in genroms persists.
  Actually genroms should disappear.  We should make -option-rom work by
  using a dummy PCI device or something like that.
  In a side track, I think sgabios is a pretty useful addition to qemu.
  Would you consider including it together with the other roms ?
 
 Definitely.
 
  This would definitely increase the out-of-the box experience.
 
  I think we don't even need to have a -vga sga switch then. Since sgabios
  does not prevent output from going to the normal vga, we could always
  install it, or install when there is a -serial parameter present.
 
 Is that true?  It needs to load during video bios init, no?  So doesn't 
 that preclude having a normal video bios installed?

No.

sgabios chains all int10h calls.

Re: [Qemu-devel] [PATCH v2 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription

2011-04-11 Thread Glauber Costa

On Mon, 2011-04-11 at 08:10 -0500, Anthony Liguori wrote:
 On 04/11/2011 04:08 AM, Avi Kivity wrote:
  On 04/11/2011 12:06 PM, Ulrich Obergfell wrote:
vmstate_hpet_timer = {
  VMSTATE_UINT64(fsb, HPETTimer),
  VMSTATE_UINT64(period, HPETTimer),
  VMSTATE_UINT8(wrap_flag, HPETTimer),
+ VMSTATE_UINT64_V(saved_period, HPETTimer, 3),
+ VMSTATE_UINT64_V(ticks_not_accounted, HPETTimer, 3),
+ VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3),
+ VMSTATE_UINT32_V(irq_rate, HPETTimer, 3),
+ VMSTATE_UINT32_V(divisor, HPETTimer, 3),
  
We ought to be able to use a subsection keyed off of whether any 
  ticks
are currently accumulated, no?
 
 
  Anthony,
 
  I'm not sure if I understand your question correctly. Are you suggesting
  to migrate the driftfix-related state conditionally / only if there are
  any ticks accumulated in 'ticks_not_accounted' and 'irqs_to_inject' ?
 
  The size of the driftfix-related state is 28 bytes per timer and we have
  32 timers per HPETState, i.e. 896 additional bytes per HPETState. With a
  maximum number of 8 HPET blocks (HPETState), this amounts to 7168 bytes.
  Hence, unconditional migration of the driftfix-related state should not
  cause significant additional overhead.
 
 
  It's not about overhead.
 
  Maybe I missed something. Could you please explain which benefit you see
  in using a subsection ?
 
  In the common case of there being no drift, you can migrate from a 
  qemu that supports driftfix to a qemu that doesn't.
 
 
 Right, subsections are a trick.  The idea is that when you introduce new 
 state for a device model that is not always going to be set, when you do 
 the migration, you detect whether the state is set or not and if it's 
 not set, instead of sending empty versions of that state (i.e. 
 missed_ticks=0) you just don't send the new state at all.
 
 This means that you can migrate to an older version of QEMU provided the 
 migration would work correctly.

Using subsections and testing for hpet option being disabled vs enabled,
is fine. But checking for the existence of drift, like you suggested (or
at least how I understood you), is very tricky. It is expected to change
many times during guest lifetime, and would make our migration
predictability something Heisenberg would be proud of.

Re: [Qemu-devel] [PATCH v2 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription

2011-04-11 Thread Glauber Costa

On Mon, 2011-04-11 at 08:47 -0500, Anthony Liguori wrote:
 On 04/11/2011 08:39 AM, Glauber Costa wrote:
  On Mon, 2011-04-11 at 08:10 -0500, Anthony Liguori wrote:
  On 04/11/2011 04:08 AM, Avi Kivity wrote:
  On 04/11/2011 12:06 PM, Ulrich Obergfell wrote:
vmstate_hpet_timer = {
  VMSTATE_UINT64(fsb, HPETTimer),
  VMSTATE_UINT64(period, HPETTimer),
  VMSTATE_UINT8(wrap_flag, HPETTimer),
+ VMSTATE_UINT64_V(saved_period, HPETTimer, 3),
+ VMSTATE_UINT64_V(ticks_not_accounted, HPETTimer, 3),
+ VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3),
+ VMSTATE_UINT32_V(irq_rate, HPETTimer, 3),
+ VMSTATE_UINT32_V(divisor, HPETTimer, 3),
We ought to be able to use a subsection keyed off of whether any
  ticks
are currently accumulated, no?
 
  Anthony,
 
  I'm not sure if I understand your question correctly. Are you suggesting
  to migrate the driftfix-related state conditionally / only if there are
  any ticks accumulated in 'ticks_not_accounted' and 'irqs_to_inject' ?
 
  The size of the driftfix-related state is 28 bytes per timer and we have
  32 timers per HPETState, i.e. 896 additional bytes per HPETState. With a
  maximum number of 8 HPET blocks (HPETState), this amounts to 7168 bytes.
  Hence, unconditional migration of the driftfix-related state should not
  cause significant additional overhead.
 
  It's not about overhead.
 
  Maybe I missed something. Could you please explain which benefit you see
  in using a subsection ?
  In the common case of there being no drift, you can migrate from a
  qemu that supports driftfix to a qemu that doesn't.
 
  Right, subsections are a trick.  The idea is that when you introduce new
  state for a device model that is not always going to be set, when you do
  the migration, you detect whether the state is set or not and if it's
  not set, instead of sending empty versions of that state (i.e.
  missed_ticks=0) you just don't send the new state at all.
 
  This means that you can migrate to an older version of QEMU provided the
  migration would work correctly.
  Using subsections and testing for hpet option being disabled vs enabled,
  is fine. But checking for the existence of drift, like you suggested (or
  at least how I understood you), is very tricky. It is expected to change
  many times during guest lifetime, and would make our migration
  predictability something Heisenberg would be proud of.
 
 Is this true?  I would expect it to be very tied to workloads.  For idle 
 workloads, you should never have accumulated missed ticks whereas with 
 heavy workloads, you always will have accumulated ticks.
 
 Is that not correct?
Yes, it is , but we lose a lot of reliability by tying migration to the 
workload. Given that
we still have to start qemu the same way both sides, we end up with a
situation in which at time t, migration is possible, and at time t+1
migration is not.

I'd rather have subsections enabled at all times when the option to
allow driftfix is enabled.

[Qemu-devel] Re: [PATCH v2 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-04-08 Thread Glauber Costa

On Fri, 2011-04-08 at 17:54 +0200, Jan Kiszka wrote:
  +}
 
 Did I miss some change in the plan? I thought we were heading for a
 generic, reusable driftfix tool box (or periodic timer service)? Or is
 this intentionally an intermediate step? 

Which is a medium to long way in the future. There is benefit of having
this early, since it can deliver real functionality - a reliable hpet,
and converting to whatever the api may look like in the future.

[Qemu-devel] Re: [PATCH 3/3] don't create kvmclock when one of the flags are present.

2011-03-18 Thread Glauber Costa

On Fri, 2011-03-18 at 11:24 +0100, Jan Kiszka wrote:
 On 2011-03-17 23:42, Glauber Costa wrote:
  kvmclock presence can be signalled by two different flags. So for
  device creation, we have to test for both.

 Patch is OK, but the subject's logic is inverted.
Indeed, should have said something like to test for either of them

Dear maintainers, is it okay to commit with a minor edit to the
changelogs?

 Jan
 
  
  Signed-off-by: Glauber Costa glom...@redhat.com
  ---
   hw/kvmclock.c |6 +-
   1 files changed, 5 insertions(+), 1 deletions(-)
  
  diff --git a/hw/kvmclock.c b/hw/kvmclock.c
  index b6ceddf..004c4ad 100644
  --- a/hw/kvmclock.c
  +++ b/hw/kvmclock.c
  @@ -103,7 +103,11 @@ static SysBusDeviceInfo kvmclock_info = {
   void kvmclock_create(void)
   {
   if (kvm_enabled() 
  -first_cpu-cpuid_kvm_features  (1ULL  KVM_FEATURE_CLOCKSOURCE)) 
  {
  +first_cpu-cpuid_kvm_features  ((1ULL  KVM_FEATURE_CLOCKSOURCE)
  +#ifdef KVM_FEATURE_CLOCKSOURCE2
  +|| (1ULL  KVM_FEATURE_CLOCKSOURCE2)
  +#endif
  +)) {
   sysbus_create_simple(kvmclock, -1, NULL);
   }
   }

[Qemu-devel] [PATCH 1/3] use kernel-provided para_features instead of statically coming up with new capabilities

2011-03-17 Thread Glauber Costa

According to Avi's comments over my last submission, I decided to take a
different, and more correct direction - we hope.

This patch is now using the features provided by KVM_GET_SUPPORTED_CPUID 
directly to
mask out features from guest-visible cpuid.

The old get_para_features() mechanism is kept for older kernels that do not 
implement it.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 target-i386/kvm.c |   76 +++-
 1 files changed, 45 insertions(+), 31 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index da757fa..dc1e547 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -95,6 +95,35 @@ static struct kvm_cpuid2 *try_get_cpuid(KVMState *s, int max)
 return cpuid;
 }
 
+#ifdef CONFIG_KVM_PARA
+struct kvm_para_features {
+int cap;
+int feature;
+} para_features[] = {
+{ KVM_CAP_CLOCKSOURCE, KVM_FEATURE_CLOCKSOURCE },
+{ KVM_CAP_NOP_IO_DELAY, KVM_FEATURE_NOP_IO_DELAY },
+{ KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP },
+#ifdef KVM_CAP_ASYNC_PF
+{ KVM_CAP_ASYNC_PF, KVM_FEATURE_ASYNC_PF },
+#endif
+{ -1, -1 }
+};
+
+static int get_para_features(CPUState *env)
+{
+int i, features = 0;
+
+for (i = 0; i  ARRAY_SIZE(para_features) - 1; i++) {
+if (kvm_check_extension(env-kvm_state, para_features[i].cap)) {
+features |= (1  para_features[i].feature);
+}
+}
+
+return features;
+}
+#endif
+
+
 uint32_t kvm_arch_get_supported_cpuid(CPUState *env, uint32_t function,
   uint32_t index, int reg)
 {
@@ -102,6 +131,7 @@ uint32_t kvm_arch_get_supported_cpuid(CPUState *env, 
uint32_t function,
 int i, max;
 uint32_t ret = 0;
 uint32_t cpuid_1_edx;
+int has_kvm_features = 0;
 
 max = 1;
 while ((cpuid = try_get_cpuid(env-kvm_state, max)) == NULL) {
@@ -111,6 +141,9 @@ uint32_t kvm_arch_get_supported_cpuid(CPUState *env, 
uint32_t function,
 for (i = 0; i  cpuid-nent; ++i) {
 if (cpuid-entries[i].function == function 
 cpuid-entries[i].index == index) {
+if (cpuid-entries[i].function == KVM_CPUID_FEATURES) {
+has_kvm_features = 1;
+}
 switch (reg) {
 case R_EAX:
 ret = cpuid-entries[i].eax;
@@ -141,41 +174,16 @@ uint32_t kvm_arch_get_supported_cpuid(CPUState *env, 
uint32_t function,
 }
 }
 
+/* fallback for older kernels */
+if (!has_kvm_features  (function == KVM_CPUID_FEATURES)) {
+ret = get_para_features(env);
+}
+
 qemu_free(cpuid);
 
 return ret;
 }
 
-#ifdef CONFIG_KVM_PARA
-struct kvm_para_features {
-int cap;
-int feature;
-} para_features[] = {
-{ KVM_CAP_CLOCKSOURCE, KVM_FEATURE_CLOCKSOURCE },
-{ KVM_CAP_NOP_IO_DELAY, KVM_FEATURE_NOP_IO_DELAY },
-{ KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP },
-#ifdef KVM_CAP_ASYNC_PF
-{ KVM_CAP_ASYNC_PF, KVM_FEATURE_ASYNC_PF },
-#endif
-{ -1, -1 }
-};
-
-static int get_para_features(CPUState *env)
-{
-int i, features = 0;
-
-for (i = 0; i  ARRAY_SIZE(para_features) - 1; i++) {
-if (kvm_check_extension(env-kvm_state, para_features[i].cap)) {
-features |= (1  para_features[i].feature);
-}
-}
-#ifdef KVM_CAP_ASYNC_PF
-has_msr_async_pf_en = features  (1  KVM_FEATURE_ASYNC_PF);
-#endif
-return features;
-}
-#endif
-
 #ifdef KVM_CAP_MCE
 static int kvm_get_mce_cap_supported(KVMState *s, uint64_t *mce_cap,
  int *max_banks)
@@ -363,7 +371,13 @@ int kvm_arch_init_vcpu(CPUState *env)
 c = cpuid_data.entries[cpuid_i++];
 memset(c, 0, sizeof(*c));
 c-function = KVM_CPUID_FEATURES;
-c-eax = env-cpuid_kvm_features  get_para_features(env);
+c-eax = env-cpuid_kvm_features  kvm_arch_get_supported_cpuid(env,
+KVM_CPUID_FEATURES, 0, R_EAX);
+
+#ifdef KVM_CAP_ASYNC_PF
+has_msr_async_pf_en = c-eax  (1  KVM_FEATURE_ASYNC_PF);
+#endif
+
 #endif
 
 cpu_x86_cpuid(env, 0, 0, limit, unused, unused, unused);
-- 
1.7.2.3

[Qemu-devel] [PATCH 0/3] enable newer msr set for kvm

2011-03-17 Thread Glauber Costa

This patch is a follow up to an earlier one that aims to enable
kvmclock newer msr set. This time I'm doing it through a more sane
mechanism of consulting the kernel about the supported msr set.

Glauber Costa (3):
  use kernel-provided para_features instead of statically coming up
with new capabilities
  add kvmclock to its second bit
  don't create kvmclock when one of the flags are present.

 hw/kvmclock.c   |6 +++-
 target-i386/cpuid.c |3 +-
 target-i386/kvm.c   |   76 ++-
 3 files changed, 51 insertions(+), 34 deletions(-)

-- 
1.7.2.3

[Qemu-devel] [PATCH 3/3] don't create kvmclock when one of the flags are present.

2011-03-17 Thread Glauber Costa

kvmclock presence can be signalled by two different flags. So for
device creation, we have to test for both.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 hw/kvmclock.c |6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/hw/kvmclock.c b/hw/kvmclock.c
index b6ceddf..004c4ad 100644
--- a/hw/kvmclock.c
+++ b/hw/kvmclock.c
@@ -103,7 +103,11 @@ static SysBusDeviceInfo kvmclock_info = {
 void kvmclock_create(void)
 {
 if (kvm_enabled() 
-first_cpu-cpuid_kvm_features  (1ULL  KVM_FEATURE_CLOCKSOURCE)) {
+first_cpu-cpuid_kvm_features  ((1ULL  KVM_FEATURE_CLOCKSOURCE)
+#ifdef KVM_FEATURE_CLOCKSOURCE2
+|| (1ULL  KVM_FEATURE_CLOCKSOURCE2)
+#endif
+)) {
 sysbus_create_simple(kvmclock, -1, NULL);
 }
 }
-- 
1.7.2.3

[Qemu-devel] [PATCH 2/3] add kvmclock to its second bit

2011-03-17 Thread Glauber Costa

We have two bits that can represent kvmclock in cpuid.
They signal the guest which msr set to use. When we tweak flags
involving this value - specially when we use -, we have to act on both.

Besides adding it to the kvm features list, we also have to break the
assumption represented by the break in lookup_feature.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 target-i386/cpuid.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
index d28de20..48f9bbd 100644
--- a/target-i386/cpuid.c
+++ b/target-i386/cpuid.c
@@ -73,7 +73,7 @@ static const char *ext3_feature_name[] = {
 };
 
 static const char *kvm_feature_name[] = {
-kvmclock, kvm_nopiodelay, kvm_mmu, NULL, kvm_asyncpf, NULL, NULL, 
NULL,
+kvmclock, kvm_nopiodelay, kvm_mmu, kvmclock, kvm_asyncpf, NULL, 
NULL, NULL,
 NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
@@ -193,7 +193,6 @@ static int lookup_feature(uint32_t *pval, const char *s, 
const char *e,
 for (mask = 1, ppc = featureset; mask; mask = 1, ++ppc)
 if (*ppc  !altcmp(s, e, *ppc)) {
 *pval |= mask;
-break;
 }
 return (mask ? 1 : 0);
 }
-- 
1.7.2.3

[Qemu-devel] Re: [PATCH 15/15] kvm: x86: Introduce kvmclock device to save/restore its state

2011-02-07 Thread Glauber Costa

On Mon, 2011-02-07 at 13:36 +0100, Jan Kiszka wrote:
 On 2011-02-07 13:27, Glauber Costa wrote:
  On Mon, 2011-02-07 at 12:19 +0100, Jan Kiszka wrote:
  If kvmclock is used, which implies the kernel supports it, register a
  kvmclock device with the sysbus. Its main purpose is to save and restore
  the kernel state on migration, but this will also allow to visualize it
  one day.
 
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
  CC: Glauber Costa glom...@redhat.com
  ---
   Makefile.target |4 +-
   hw/kvmclock.c   |  125 
  +++
   hw/kvmclock.h   |   14 ++
   hw/pc_piix.c|   31 +++---
   4 files changed, 165 insertions(+), 9 deletions(-)
   create mode 100644 hw/kvmclock.c
   create mode 100644 hw/kvmclock.h
 
  diff --git a/Makefile.target b/Makefile.target
  index b0ba95f..30232fa 100644
  --- a/Makefile.target
  +++ b/Makefile.target
  @@ -37,7 +37,7 @@ ifndef CONFIG_HAIKU
   LIBS+=-lm
   endif
   
  -kvm.o kvm-all.o vhost.o vhost_net.o: QEMU_CFLAGS+=$(KVM_CFLAGS)
  +kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o: QEMU_CFLAGS+=$(KVM_CFLAGS)
   
   config-target.h: config-target.h-timestamp
   config-target.h-timestamp: config-target.mak
  @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o
   obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o
   obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
   obj-i386-y += debugcon.o multiboot.o
  -obj-i386-y += pc_piix.o
  +obj-i386-y += pc_piix.o kvmclock.o
   obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
   
   # shared objects
  diff --git a/hw/kvmclock.c b/hw/kvmclock.c
  new file mode 100644
  index 000..b6ceddf
  --- /dev/null
  +++ b/hw/kvmclock.c
  @@ -0,0 +1,125 @@
  +/*
  + * QEMU KVM support, paravirtual clock device
  + *
  + * Copyright (C) 2011 Siemens AG
  + *
  + * Authors:
  + *  Jan Kiszkajan.kis...@siemens.com
  + *
  + * This work is licensed under the terms of the GNU GPL version 2.
  + * See the COPYING file in the top-level directory.
  + *
  + */
  +
  +#include qemu-common.h
  +#include sysemu.h
  +#include sysbus.h
  +#include kvm.h
  +#include kvmclock.h
  +
  +#if defined(CONFIG_KVM_PARA)  defined(KVM_CAP_ADJUST_CLOCK)
  +
  +#include linux/kvm.h
  +#include linux/kvm_para.h
  +
  +typedef struct KVMClockState {
  +SysBusDevice busdev;
  +uint64_t clock;
  +bool clock_valid;
  +} KVMClockState;
  +
  +static void kvmclock_pre_save(void *opaque)
  +{
  +KVMClockState *s = opaque;
  +struct kvm_clock_data data;
  +int ret;
  +
  +if (s-clock_valid) {
  +return;
  +}
  +ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, data);
  +if (ret  0) {
  +fprintf(stderr, KVM_GET_CLOCK failed: %s\n, strerror(ret));
  +data.clock = 0;
  +}
  +s-clock = data.clock;
  +/*
  + * If the VM is stopped, declare the clock state valid to avoid 
  re-reading
  + * it on next vmsave (which would return a different value). Will be 
  reset
  + * when the VM is continued.
  + */
  +s-clock_valid = !vm_running;
  +}
  +
  +static int kvmclock_post_load(void *opaque, int version_id)
  +{
  +KVMClockState *s = opaque;
  +struct kvm_clock_data data;
  +
  +data.clock = s-clock;
  +data.flags = 0;
  +return kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, data);
  +}
  +
  +static void kvmclock_vm_state_change(void *opaque, int running, int 
  reason)
  +{
  +KVMClockState *s = opaque;
  +
  +if (running) {
  +s-clock_valid = false;
  +}
  +}
  +
  +static int kvmclock_init(SysBusDevice *dev)
  +{
  +KVMClockState *s = FROM_SYSBUS(KVMClockState, dev);
  +
  +qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s);
  +return 0;
  +}
  +
  +static const VMStateDescription kvmclock_vmsd = {
  +.name = kvmclock,
  +.version_id = 1,
  +.minimum_version_id = 1,
  +.minimum_version_id_old = 1,
  +.pre_save = kvmclock_pre_save,
  +.post_load = kvmclock_post_load,
  +.fields = (VMStateField[]) {
  +VMSTATE_UINT64(clock, KVMClockState),
  +VMSTATE_END_OF_LIST()
  +}
  +};
  +
  +static SysBusDeviceInfo kvmclock_info = {
  +.qdev.name = kvmclock,
  +.qdev.size = sizeof(KVMClockState),
  +.qdev.vmsd = kvmclock_vmsd,
  +.qdev.no_user = 1,
  +.init = kvmclock_init,
  +};
  +
  +/* Note: Must be called after VCPU initialization. */
  +void kvmclock_create(void)
  +{
  +if (kvm_enabled() 
  +first_cpu-cpuid_kvm_features  (1ULL  
  KVM_FEATURE_CLOCKSOURCE)) {
  +sysbus_create_simple(kvmclock, -1, NULL);
  +}
  +}
  +
  +static void kvmclock_register_device(void)
  +{
  +if (kvm_enabled()) {
  +sysbus_register_withprop(kvmclock_info);
  +}
  +}
  +
  +device_init(kvmclock_register_device);
  +
  +#else /* !(CONFIG_KVM_PARA  KVM_CAP_ADJUST_CLOCK) */
  +
  +void kvmclock_create(void)
  +{
  +}
  +#endif

[Qemu-devel] Re: [PATCH 15/15] kvm: x86: Introduce kvmclock device to save/restore its state

2011-02-07 Thread Glauber Costa

On Mon, 2011-02-07 at 15:03 +0100, Jan Kiszka wrote:
 On 2011-02-07 14:40, Glauber Costa wrote:
  On Mon, 2011-02-07 at 13:36 +0100, Jan Kiszka wrote:
  On 2011-02-07 13:27, Glauber Costa wrote:
  On Mon, 2011-02-07 at 12:19 +0100, Jan Kiszka wrote:
  If kvmclock is used, which implies the kernel supports it, register a
  kvmclock device with the sysbus. Its main purpose is to save and restore
  the kernel state on migration, but this will also allow to visualize it
  one day.
 
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
  CC: Glauber Costa glom...@redhat.com
  ---
   Makefile.target |4 +-
   hw/kvmclock.c   |  125 
  +++
   hw/kvmclock.h   |   14 ++
   hw/pc_piix.c|   31 +++---
   4 files changed, 165 insertions(+), 9 deletions(-)
   create mode 100644 hw/kvmclock.c
   create mode 100644 hw/kvmclock.h
 
  diff --git a/Makefile.target b/Makefile.target
  index b0ba95f..30232fa 100644
  --- a/Makefile.target
  +++ b/Makefile.target
  @@ -37,7 +37,7 @@ ifndef CONFIG_HAIKU
   LIBS+=-lm
   endif
   
  -kvm.o kvm-all.o vhost.o vhost_net.o: QEMU_CFLAGS+=$(KVM_CFLAGS)
  +kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o: 
  QEMU_CFLAGS+=$(KVM_CFLAGS)
   
   config-target.h: config-target.h-timestamp
   config-target.h-timestamp: config-target.mak
  @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o
   obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o
   obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
   obj-i386-y += debugcon.o multiboot.o
  -obj-i386-y += pc_piix.o
  +obj-i386-y += pc_piix.o kvmclock.o
   obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
   
   # shared objects
  diff --git a/hw/kvmclock.c b/hw/kvmclock.c
  new file mode 100644
  index 000..b6ceddf
  --- /dev/null
  +++ b/hw/kvmclock.c
  @@ -0,0 +1,125 @@
  +/*
  + * QEMU KVM support, paravirtual clock device
  + *
  + * Copyright (C) 2011 Siemens AG
  + *
  + * Authors:
  + *  Jan Kiszkajan.kis...@siemens.com
  + *
  + * This work is licensed under the terms of the GNU GPL version 2.
  + * See the COPYING file in the top-level directory.
  + *
  + */
  +
  +#include qemu-common.h
  +#include sysemu.h
  +#include sysbus.h
  +#include kvm.h
  +#include kvmclock.h
  +
  +#if defined(CONFIG_KVM_PARA)  defined(KVM_CAP_ADJUST_CLOCK)
  +
  +#include linux/kvm.h
  +#include linux/kvm_para.h
  +
  +typedef struct KVMClockState {
  +SysBusDevice busdev;
  +uint64_t clock;
  +bool clock_valid;
  +} KVMClockState;
  +
  +static void kvmclock_pre_save(void *opaque)
  +{
  +KVMClockState *s = opaque;
  +struct kvm_clock_data data;
  +int ret;
  +
  +if (s-clock_valid) {
  +return;
  +}
  +ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, data);
  +if (ret  0) {
  +fprintf(stderr, KVM_GET_CLOCK failed: %s\n, strerror(ret));
  +data.clock = 0;
  +}
  +s-clock = data.clock;
  +/*
  + * If the VM is stopped, declare the clock state valid to avoid 
  re-reading
  + * it on next vmsave (which would return a different value). Will 
  be reset
  + * when the VM is continued.
  + */
  +s-clock_valid = !vm_running;
  +}
  +
  +static int kvmclock_post_load(void *opaque, int version_id)
  +{
  +KVMClockState *s = opaque;
  +struct kvm_clock_data data;
  +
  +data.clock = s-clock;
  +data.flags = 0;
  +return kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, data);
  +}
  +
  +static void kvmclock_vm_state_change(void *opaque, int running, int 
  reason)
  +{
  +KVMClockState *s = opaque;
  +
  +if (running) {
  +s-clock_valid = false;
  +}
  +}
  +
  +static int kvmclock_init(SysBusDevice *dev)
  +{
  +KVMClockState *s = FROM_SYSBUS(KVMClockState, dev);
  +
  +qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s);
  +return 0;
  +}
  +
  +static const VMStateDescription kvmclock_vmsd = {
  +.name = kvmclock,
  +.version_id = 1,
  +.minimum_version_id = 1,
  +.minimum_version_id_old = 1,
  +.pre_save = kvmclock_pre_save,
  +.post_load = kvmclock_post_load,
  +.fields = (VMStateField[]) {
  +VMSTATE_UINT64(clock, KVMClockState),
  +VMSTATE_END_OF_LIST()
  +}
  +};
  +
  +static SysBusDeviceInfo kvmclock_info = {
  +.qdev.name = kvmclock,
  +.qdev.size = sizeof(KVMClockState),
  +.qdev.vmsd = kvmclock_vmsd,
  +.qdev.no_user = 1,
  +.init = kvmclock_init,
  +};
  +
  +/* Note: Must be called after VCPU initialization. */
  +void kvmclock_create(void)
  +{
  +if (kvm_enabled() 
  +first_cpu-cpuid_kvm_features  (1ULL  
  KVM_FEATURE_CLOCKSOURCE)) {
  +sysbus_create_simple(kvmclock, -1, NULL);
  +}
  +}
  +
  +static void kvmclock_register_device(void)
  +{
  +if (kvm_enabled()) {
  +sysbus_register_withprop(kvmclock_info);
  +}
  +}
  +
  +device_init(kvmclock_register_device);
  +
  +#else

[Qemu-devel] Re: [PATCH 15/15] kvm: x86: Introduce kvmclock device to save/restore its state

2011-02-07 Thread Glauber Costa

On Mon, 2011-02-07 at 19:12 +0100, Jan Kiszka wrote:
 On 2011-02-07 19:04, Glauber Costa wrote:
  On Mon, 2011-02-07 at 15:03 +0100, Jan Kiszka wrote:
  On 2011-02-07 14:40, Glauber Costa wrote:
  On Mon, 2011-02-07 at 13:36 +0100, Jan Kiszka wrote:
  On 2011-02-07 13:27, Glauber Costa wrote:
  On Mon, 2011-02-07 at 12:19 +0100, Jan Kiszka wrote:
  If kvmclock is used, which implies the kernel supports it, register a
  kvmclock device with the sysbus. Its main purpose is to save and 
  restore
  the kernel state on migration, but this will also allow to visualize it
  one day.
 
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
  CC: Glauber Costa glom...@redhat.com
  ---
   Makefile.target |4 +-
   hw/kvmclock.c   |  125 
  +++
   hw/kvmclock.h   |   14 ++
   hw/pc_piix.c|   31 +++---
   4 files changed, 165 insertions(+), 9 deletions(-)
   create mode 100644 hw/kvmclock.c
   create mode 100644 hw/kvmclock.h
 
  diff --git a/Makefile.target b/Makefile.target
  index b0ba95f..30232fa 100644
  --- a/Makefile.target
  +++ b/Makefile.target
  @@ -37,7 +37,7 @@ ifndef CONFIG_HAIKU
   LIBS+=-lm
   endif
   
  -kvm.o kvm-all.o vhost.o vhost_net.o: QEMU_CFLAGS+=$(KVM_CFLAGS)
  +kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o: 
  QEMU_CFLAGS+=$(KVM_CFLAGS)
   
   config-target.h: config-target.h-timestamp
   config-target.h-timestamp: config-target.mak
  @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o 
  piix_pci.o
   obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o
   obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
   obj-i386-y += debugcon.o multiboot.o
  -obj-i386-y += pc_piix.o
  +obj-i386-y += pc_piix.o kvmclock.o
   obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
   
   # shared objects
  diff --git a/hw/kvmclock.c b/hw/kvmclock.c
  new file mode 100644
  index 000..b6ceddf
  --- /dev/null
  +++ b/hw/kvmclock.c
  @@ -0,0 +1,125 @@
  +/*
  + * QEMU KVM support, paravirtual clock device
  + *
  + * Copyright (C) 2011 Siemens AG
  + *
  + * Authors:
  + *  Jan Kiszkajan.kis...@siemens.com
  + *
  + * This work is licensed under the terms of the GNU GPL version 2.
  + * See the COPYING file in the top-level directory.
  + *
  + */
  +
  +#include qemu-common.h
  +#include sysemu.h
  +#include sysbus.h
  +#include kvm.h
  +#include kvmclock.h
  +
  +#if defined(CONFIG_KVM_PARA)  defined(KVM_CAP_ADJUST_CLOCK)
  +
  +#include linux/kvm.h
  +#include linux/kvm_para.h
  +
  +typedef struct KVMClockState {
  +SysBusDevice busdev;
  +uint64_t clock;
  +bool clock_valid;
  +} KVMClockState;
  +
  +static void kvmclock_pre_save(void *opaque)
  +{
  +KVMClockState *s = opaque;
  +struct kvm_clock_data data;
  +int ret;
  +
  +if (s-clock_valid) {
  +return;
  +}
  +ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, data);
  +if (ret  0) {
  +fprintf(stderr, KVM_GET_CLOCK failed: %s\n, strerror(ret));
  +data.clock = 0;
  +}
  +s-clock = data.clock;
  +/*
  + * If the VM is stopped, declare the clock state valid to avoid 
  re-reading
  + * it on next vmsave (which would return a different value). Will 
  be reset
  + * when the VM is continued.
  + */
  +s-clock_valid = !vm_running;
  +}
  +
  +static int kvmclock_post_load(void *opaque, int version_id)
  +{
  +KVMClockState *s = opaque;
  +struct kvm_clock_data data;
  +
  +data.clock = s-clock;
  +data.flags = 0;
  +return kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, data);
  +}
  +
  +static void kvmclock_vm_state_change(void *opaque, int running, int 
  reason)
  +{
  +KVMClockState *s = opaque;
  +
  +if (running) {
  +s-clock_valid = false;
  +}
  +}
  +
  +static int kvmclock_init(SysBusDevice *dev)
  +{
  +KVMClockState *s = FROM_SYSBUS(KVMClockState, dev);
  +
  +qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s);
  +return 0;
  +}
  +
  +static const VMStateDescription kvmclock_vmsd = {
  +.name = kvmclock,
  +.version_id = 1,
  +.minimum_version_id = 1,
  +.minimum_version_id_old = 1,
  +.pre_save = kvmclock_pre_save,
  +.post_load = kvmclock_post_load,
  +.fields = (VMStateField[]) {
  +VMSTATE_UINT64(clock, KVMClockState),
  +VMSTATE_END_OF_LIST()
  +}
  +};
  +
  +static SysBusDeviceInfo kvmclock_info = {
  +.qdev.name = kvmclock,
  +.qdev.size = sizeof(KVMClockState),
  +.qdev.vmsd = kvmclock_vmsd,
  +.qdev.no_user = 1,
  +.init = kvmclock_init,
  +};
  +
  +/* Note: Must be called after VCPU initialization. */
  +void kvmclock_create(void)
  +{
  +if (kvm_enabled() 
  +first_cpu-cpuid_kvm_features  (1ULL  
  KVM_FEATURE_CLOCKSOURCE)) {
  +sysbus_create_simple(kvmclock, -1, NULL);
  +}
  +}
  +
  +static void kvmclock_register_device(void)
  +{
  +if (kvm_enabled

[Qemu-devel] Re: [PATCH 15/15] kvm: x86: Introduce kvmclock device to save/restore its state

2011-02-07 Thread Glauber Costa

On Mon, 2011-02-07 at 12:19 +0100, Jan Kiszka wrote:
 If kvmclock is used, which implies the kernel supports it, register a
 kvmclock device with the sysbus. Its main purpose is to save and restore
 the kernel state on migration, but this will also allow to visualize it
 one day.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 CC: Glauber Costa glom...@redhat.com
 ---
  Makefile.target |4 +-
  hw/kvmclock.c   |  125 
 +++
  hw/kvmclock.h   |   14 ++
  hw/pc_piix.c|   31 +++---
  4 files changed, 165 insertions(+), 9 deletions(-)
  create mode 100644 hw/kvmclock.c
  create mode 100644 hw/kvmclock.h
 
 diff --git a/Makefile.target b/Makefile.target
 index b0ba95f..30232fa 100644
 --- a/Makefile.target
 +++ b/Makefile.target
 @@ -37,7 +37,7 @@ ifndef CONFIG_HAIKU
  LIBS+=-lm
  endif
  
 -kvm.o kvm-all.o vhost.o vhost_net.o: QEMU_CFLAGS+=$(KVM_CFLAGS)
 +kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o: QEMU_CFLAGS+=$(KVM_CFLAGS)
  
  config-target.h: config-target.h-timestamp
  config-target.h-timestamp: config-target.mak
 @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o
  obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o
  obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
  obj-i386-y += debugcon.o multiboot.o
 -obj-i386-y += pc_piix.o
 +obj-i386-y += pc_piix.o kvmclock.o
  obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
  
  # shared objects
 diff --git a/hw/kvmclock.c b/hw/kvmclock.c
 new file mode 100644
 index 000..b6ceddf
 --- /dev/null
 +++ b/hw/kvmclock.c
 @@ -0,0 +1,125 @@
 +/*
 + * QEMU KVM support, paravirtual clock device
 + *
 + * Copyright (C) 2011 Siemens AG
 + *
 + * Authors:
 + *  Jan Kiszkajan.kis...@siemens.com
 + *
 + * This work is licensed under the terms of the GNU GPL version 2.
 + * See the COPYING file in the top-level directory.
 + *
 + */
 +
 +#include qemu-common.h
 +#include sysemu.h
 +#include sysbus.h
 +#include kvm.h
 +#include kvmclock.h
 +
 +#if defined(CONFIG_KVM_PARA)  defined(KVM_CAP_ADJUST_CLOCK)
 +
 +#include linux/kvm.h
 +#include linux/kvm_para.h
 +
 +typedef struct KVMClockState {
 +SysBusDevice busdev;
 +uint64_t clock;
 +bool clock_valid;
 +} KVMClockState;
 +
 +static void kvmclock_pre_save(void *opaque)
 +{
 +KVMClockState *s = opaque;
 +struct kvm_clock_data data;
 +int ret;
 +
 +if (s-clock_valid) {
 +return;
 +}
 +ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, data);
 +if (ret  0) {
 +fprintf(stderr, KVM_GET_CLOCK failed: %s\n, strerror(ret));
 +data.clock = 0;
 +}
 +s-clock = data.clock;
 +/*
 + * If the VM is stopped, declare the clock state valid to avoid 
 re-reading
 + * it on next vmsave (which would return a different value). Will be 
 reset
 + * when the VM is continued.
 + */
 +s-clock_valid = !vm_running;
 +}
 +
 +static int kvmclock_post_load(void *opaque, int version_id)
 +{
 +KVMClockState *s = opaque;
 +struct kvm_clock_data data;
 +
 +data.clock = s-clock;
 +data.flags = 0;
 +return kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, data);
 +}
 +
 +static void kvmclock_vm_state_change(void *opaque, int running, int reason)
 +{
 +KVMClockState *s = opaque;
 +
 +if (running) {
 +s-clock_valid = false;
 +}
 +}
 +
 +static int kvmclock_init(SysBusDevice *dev)
 +{
 +KVMClockState *s = FROM_SYSBUS(KVMClockState, dev);
 +
 +qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s);
 +return 0;
 +}
 +
 +static const VMStateDescription kvmclock_vmsd = {
 +.name = kvmclock,
 +.version_id = 1,
 +.minimum_version_id = 1,
 +.minimum_version_id_old = 1,
 +.pre_save = kvmclock_pre_save,
 +.post_load = kvmclock_post_load,
 +.fields = (VMStateField[]) {
 +VMSTATE_UINT64(clock, KVMClockState),
 +VMSTATE_END_OF_LIST()
 +}
 +};
 +
 +static SysBusDeviceInfo kvmclock_info = {
 +.qdev.name = kvmclock,
 +.qdev.size = sizeof(KVMClockState),
 +.qdev.vmsd = kvmclock_vmsd,
 +.qdev.no_user = 1,
 +.init = kvmclock_init,
 +};
 +
 +/* Note: Must be called after VCPU initialization. */
 +void kvmclock_create(void)
 +{
 +if (kvm_enabled() 
 +first_cpu-cpuid_kvm_features  (1ULL  KVM_FEATURE_CLOCKSOURCE)) {
 +sysbus_create_simple(kvmclock, -1, NULL);
 +}
 +}
 +
 +static void kvmclock_register_device(void)
 +{
 +if (kvm_enabled()) {
 +sysbus_register_withprop(kvmclock_info);
 +}
 +}
 +
 +device_init(kvmclock_register_device);
 +
 +#else /* !(CONFIG_KVM_PARA  KVM_CAP_ADJUST_CLOCK) */
 +
 +void kvmclock_create(void)
 +{
 +}
 +#endif /* !(CONFIG_KVM_PARA  KVM_CAP_ADJUST_CLOCK) */
 diff --git a/hw/kvmclock.h b/hw/kvmclock.h
 new file mode 100644
 index 000..7a83cbe
 --- /dev/null
 +++ b/hw/kvmclock.h
 @@ -0,0 +1,14 @@
 +/*
 + * QEMU KVM support, paravirtual clock device
 + *
 + * Copyright (C) 2011

[Qemu-devel] [PATCH v3] make tsc stable over migration and machine start

2011-02-03 Thread Glauber Costa

If the machine is stopped, we should not record two different tsc values
upon a save operation. The same problem happens with kvmclock.

But kvmclock is taking a different diretion, being now seen as a separate
device. Since this is unlikely to happen with the tsc, I am taking the
approach here of simply registering a handler for state change, and
using a per-CPUState variable that prevents double updates for the TSC.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Jan Kiszka jan.kis...@web.de

---
v2: updated tsc validation logic, as asked by Jan
v3: regenerated against uq/master
---
 target-i386/cpu.h |1 +
 target-i386/kvm.c |   18 +-
 2 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index af701a4..5f1df8b 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -734,6 +734,7 @@ typedef struct CPUX86State {
 uint32_t sipi_vector;
 uint32_t cpuid_kvm_features;
 uint32_t cpuid_svm_features;
+bool tsc_valid;
 
 /* in order to simplify APIC support, we leave this pointer to the
user */
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 05010bb..ed9557e 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -301,6 +301,15 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t 
status,
 #endif
 }
 
+static void cpu_update_state(void *opaque, int running, int reason)
+{
+CPUState *env = opaque;
+
+if (running) {
+env-tsc_valid = false;
+}
+}
+
 int kvm_arch_init_vcpu(CPUState *env)
 {
 struct {
@@ -434,6 +443,8 @@ int kvm_arch_init_vcpu(CPUState *env)
 }
 #endif
 
+qemu_add_vm_change_state_handler(cpu_update_state, env);
+
 return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, cpuid_data);
 }
 
@@ -1061,7 +1072,12 @@ static int kvm_get_msrs(CPUState *env)
 if (has_msr_hsave_pa) {
 msrs[n++].index = MSR_VM_HSAVE_PA;
 }
-msrs[n++].index = MSR_IA32_TSC;
+
+if (!env-tsc_valid) {
+msrs[n++].index = MSR_IA32_TSC;
+env-tsc_valid = !vm_running;
+}
+
 #ifdef TARGET_X86_64
 if (lm_capable_kernel) {
 msrs[n++].index = MSR_CSTAR;
-- 
1.7.2.3

[Qemu-devel] [PATCH v2] make tsc stable over migration and machine start

2011-02-02 Thread Glauber Costa

If the machine is stopped, we should not record two different tsc values
upon a save operation. The same problem happens with kvmclock.

But kvmclock is taking a different diretion, being now seen as a separate
device. Since this is unlikely to happen with the tsc, I am taking the
approach here of simply registering a handler for state change, and
using a per-CPUState variable that prevents double updates for the TSC.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Jan Kiszka jan.kis...@web.de

---
v2: updated tsc validation logic, as asked by Jan
---
 target-i386/cpu.h |1 +
 target-i386/kvm.c |   18 +-
 2 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 6d619e8..6bb2e87 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -732,6 +732,7 @@ typedef struct CPUX86State {
 uint32_t sipi_vector;
 uint32_t cpuid_kvm_features;
 uint32_t cpuid_svm_features;
+bool tsc_valid;
 
 /* in order to simplify APIC support, we leave this pointer to the
user */
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index ecb8405..9cc198a 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -302,6 +302,15 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t 
status,
 
 static int _kvm_arch_init_vcpu(CPUState *env);
 
+static void cpu_update_state(void *opaque, int running, int reason)
+{
+CPUState *env = opaque;
+
+if (running) {
+env-tsc_valid = false;
+}
+}
+
 int kvm_arch_init_vcpu(CPUState *env)
 {
 int r;
@@ -444,6 +453,8 @@ int kvm_arch_init_vcpu(CPUState *env)
 }
 #endif
 
+qemu_add_vm_change_state_handler(cpu_update_state, env);
+
 return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, cpuid_data);
 }
 
@@ -1093,7 +1104,12 @@ static int kvm_get_msrs(CPUState *env)
msrs[n++].index = MSR_STAR;
 if (kvm_has_msr_hsave_pa(env))
 msrs[n++].index = MSR_VM_HSAVE_PA;
-msrs[n++].index = MSR_IA32_TSC;
+
+if (!env-tsc_valid) {
+msrs[n++].index = MSR_IA32_TSC;
+env-tsc_valid = !vm_running;
+}
+
 #ifdef TARGET_X86_64
 if (lm_capable_kernel) {
 msrs[n++].index = MSR_CSTAR;
-- 
1.7.2.3

[Qemu-devel] Re: [PATCH] make tsc stable over migration and machine start

2011-02-02 Thread Glauber Costa

On Tue, 2011-02-01 at 21:26 +0100, Jan Kiszka wrote:
 On 2011-02-01 20:17, Glauber Costa wrote:
  If the machine is stopped, we should not record two different tsc values
  upon a save operation. The same problem happens with kvmclock.
  
  But kvmclock is taking a different diretion, being now seen as a separate
  device. Since this is unlikely to happen with the tsc, I am taking the
  approach here of simply registering a handler for state change, and
  using a per-CPUState variable that prevents double updates for the TSC.
  
  Signed-off-by: Glauber Costa glom...@redhat.com
  ---
   target-i386/cpu.h |1 +
   target-i386/kvm.c |   19 ++-
   2 files changed, 19 insertions(+), 1 deletions(-)
  
  diff --git a/target-i386/cpu.h b/target-i386/cpu.h
  index 6d619e8..7f1c4f8 100644
  --- a/target-i386/cpu.h
  +++ b/target-i386/cpu.h
  @@ -732,6 +732,7 @@ typedef struct CPUX86State {
   uint32_t sipi_vector;
   uint32_t cpuid_kvm_features;
   uint32_t cpuid_svm_features;
  +uint8_t  update_tsc;
 
 bool please.
 
   
   /* in order to simplify APIC support, we leave this pointer to the
  user */
  diff --git a/target-i386/kvm.c b/target-i386/kvm.c
  index ecb8405..c3925be 100644
  --- a/target-i386/kvm.c
  +++ b/target-i386/kvm.c
  @@ -302,6 +302,16 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, 
  uint64_t status,
   
   static int _kvm_arch_init_vcpu(CPUState *env);
   
  +static void cpu_update_state(void *opaque, int running, int reason)
  +{
  +CPUState *env = opaque;
  +
  +if (!running) {
  +env-update_tsc = 1;
  +}
  +}
  +
  +
 
 Additional blank line.
 
   int kvm_arch_init_vcpu(CPUState *env)
   {
   int r;
  @@ -444,6 +454,8 @@ int kvm_arch_init_vcpu(CPUState *env)
   }
   #endif
   
  +qemu_add_vm_change_state_handler(cpu_update_state, env);
  +
   return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, cpuid_data);
   }
   
  @@ -1093,7 +1105,12 @@ static int kvm_get_msrs(CPUState *env)
  msrs[n++].index = MSR_STAR;
   if (kvm_has_msr_hsave_pa(env))
   msrs[n++].index = MSR_VM_HSAVE_PA;
  -msrs[n++].index = MSR_IA32_TSC;
  +
  +if (env-update_tsc) {
  +msrs[n++].index = MSR_IA32_TSC;
  +env-update_tsc = 0;
  +}
  +
   #ifdef TARGET_X86_64
   if (lm_capable_kernel) {
   msrs[n++].index = MSR_CSTAR;
 
 Not quite the logic I'm using for kvmclock:

Ok. I have all the interest in keeping the same logic.
I will respin.

 cpu_update_state()
   if running
   tsc_valid = false;
 
 kvm_get_msrs()
   ...
   if (!tsc_valid)
   read_tsc
   tsc_valid = !vm_running;
 
 That ensure we always read the tsc while the VM is running, and not only
 after it was stopped (might otherwise be surprising when once
 visualizing the MSRs).
 
 Jan

[Qemu-devel] [PATCH] make tsc stable over migration and machine start

2011-02-01 Thread Glauber Costa

If the machine is stopped, we should not record two different tsc values
upon a save operation. The same problem happens with kvmclock.

But kvmclock is taking a different diretion, being now seen as a separate
device. Since this is unlikely to happen with the tsc, I am taking the
approach here of simply registering a handler for state change, and
using a per-CPUState variable that prevents double updates for the TSC.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 target-i386/cpu.h |1 +
 target-i386/kvm.c |   19 ++-
 2 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 6d619e8..7f1c4f8 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -732,6 +732,7 @@ typedef struct CPUX86State {
 uint32_t sipi_vector;
 uint32_t cpuid_kvm_features;
 uint32_t cpuid_svm_features;
+uint8_t  update_tsc;
 
 /* in order to simplify APIC support, we leave this pointer to the
user */
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index ecb8405..c3925be 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -302,6 +302,16 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t 
status,
 
 static int _kvm_arch_init_vcpu(CPUState *env);
 
+static void cpu_update_state(void *opaque, int running, int reason)
+{
+CPUState *env = opaque;
+
+if (!running) {
+env-update_tsc = 1;
+}
+}
+
+
 int kvm_arch_init_vcpu(CPUState *env)
 {
 int r;
@@ -444,6 +454,8 @@ int kvm_arch_init_vcpu(CPUState *env)
 }
 #endif
 
+qemu_add_vm_change_state_handler(cpu_update_state, env);
+
 return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, cpuid_data);
 }
 
@@ -1093,7 +1105,12 @@ static int kvm_get_msrs(CPUState *env)
msrs[n++].index = MSR_STAR;
 if (kvm_has_msr_hsave_pa(env))
 msrs[n++].index = MSR_VM_HSAVE_PA;
-msrs[n++].index = MSR_IA32_TSC;
+
+if (env-update_tsc) {
+msrs[n++].index = MSR_IA32_TSC;
+env-update_tsc = 0;
+}
+
 #ifdef TARGET_X86_64
 if (lm_capable_kernel) {
 msrs[n++].index = MSR_CSTAR;
-- 
1.7.2.3

[Qemu-devel] Re: [PATCH v3 11/21] kvm: x86: Reset paravirtual MSRs

2011-01-04 Thread Glauber Costa

On Tue, 2011-01-04 at 09:32 +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 Make sure to write the cleared MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
 and MSR_KVM_ASYNC_PF_EN to the kernel state so that a freshly booted
 guest cannot be disturbed by old values.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 CC: Glauber Costa glom...@redhat.com

looks good.

  target-i386/kvm.c |7 +++
  1 files changed, 7 insertions(+), 0 deletions(-)
 
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index d8f26bf..8267655 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -845,6 +845,13 @@ static int kvm_put_msrs(CPUState *env, int level)
  if (smp_cpus == 1 || env-tsc != 0) {
  kvm_msr_entry_set(msrs[n++], MSR_IA32_TSC, env-tsc);
  }
 +}
 +/*
 + * The following paravirtual MSRs have side effects on the guest or are
 + * too heavy for normal writeback. Limit them to reset or full state
 + * updates.
 + */
 +if (level = KVM_PUT_RESET_STATE) {
  kvm_msr_entry_set(msrs[n++], MSR_KVM_SYSTEM_TIME,
env-system_time_msr);
  kvm_msr_entry_set(msrs[n++], MSR_KVM_WALL_CLOCK, 
 env-wall_clock_msr);

[Qemu-devel] Re: [PATCH v3 15/21] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-04 Thread Glauber Costa

On Tue, 2011-01-04 at 09:32 +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 If kvmclock is used, which implies the kernel supports it, register a
 kvmclock device with the sysbus. Its main purpose is to save and restore
 the kernel state on migration, but this will also allow to visualize it
 one day.

I would prefer having no pre-save, and doing the ioctl in the state
change handler. But I won't nitpick about that, if the maintainers think
this is okay, all the rest of the patch looks fine as well.

[Qemu-devel] Re: [PATCH v3 15/21] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-04 Thread Glauber Costa

On Tue, 2011-01-04 at 12:34 +0100, Jan Kiszka wrote:
 Am 04.01.2011 12:08, Glauber Costa wrote:
  On Tue, 2011-01-04 at 09:32 +0100, Jan Kiszka wrote:
  From: Jan Kiszka jan.kis...@siemens.com
 
  If kvmclock is used, which implies the kernel supports it, register a
  kvmclock device with the sysbus. Its main purpose is to save and restore
  the kernel state on migration, but this will also allow to visualize it
  one day.
  
  I would prefer having no pre-save, and doing the ioctl in the state
  change handler. But I won't nitpick about that, if the maintainers think
  this is okay, all the rest of the patch looks fine as well.
 
 I did this for a reason: to be able to obtain the current clock state
 even while the vm is running. It's cleaner IMHO.

but if we're on pre-save, we are certain that the vm is stopped at this
moment, no ?

[Qemu-devel] Re: [PATCH v3 15/21] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-04 Thread Glauber Costa

On Tue, 2011-01-04 at 12:45 +0100, Jan Kiszka wrote:
 Am 04.01.2011 12:43, Glauber Costa wrote:
  On Tue, 2011-01-04 at 12:34 +0100, Jan Kiszka wrote:
  Am 04.01.2011 12:08, Glauber Costa wrote:
  On Tue, 2011-01-04 at 09:32 +0100, Jan Kiszka wrote:
  From: Jan Kiszka jan.kis...@siemens.com
 
  If kvmclock is used, which implies the kernel supports it, register a
  kvmclock device with the sysbus. Its main purpose is to save and restore
  the kernel state on migration, but this will also allow to visualize it
  one day.
 
  I would prefer having no pre-save, and doing the ioctl in the state
  change handler. But I won't nitpick about that, if the maintainers think
  this is okay, all the rest of the patch looks fine as well.
 
  I did this for a reason: to be able to obtain the current clock state
  even while the vm is running. It's cleaner IMHO.
  
  but if we're on pre-save, we are certain that the vm is stopped at this
  moment, no ?
 
 Maybe at the moment, but not for device state dumping or maybe (dunno)
 for Kemari's continuous sync'ing. I simply want to avoid this implicit
 dependency.

that's fine.

[Qemu-devel] Re: [PATCH v2 14/17] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-03 Thread Glauber Costa

On Mon, 2011-01-03 at 18:04 +0200, Avi Kivity wrote:
 On 01/03/2011 10:33 AM, Jan Kiszka wrote:
  From: Jan Kiszkajan.kis...@siemens.com
 
  If kvmclock is used, which implies the kernel supports it, register a
  kvmclock device with the sysbus. Its main purpose is to save and restore
  the kernel state on migration, but this will also allow to visualize it
  one day.
 
 
 kvmclock is a per-cpu affair.
 
 
  @@ -534,6 +599,10 @@ int kvm_arch_init(int smp_cpus)
int ret;
struct utsname utsname;
 
  +#ifdef KVM_CAP_ADJUST_CLOCK
  +sysbus_register_withprop(kvmclock_info);
  +#endif
  +
 
 So this doesn't look right.  I think we're fine with just migrating the 
 MSRs, like we migrate anything else that has to do with the cpu.

The ioctl jan is handling here is a global one, that adjusts the base
offset for the clock over migration. It is okay.

[Qemu-devel] Re: [PATCH v2 11/17] kvm: x86: Reset paravirtual MSRs

2011-01-03 Thread Glauber Costa

On Mon, 2011-01-03 at 09:33 +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 Make sure to clear MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, and
 MSR_KVM_ASYNC_PF_EN so that a freshly booted guest cannot be disturbed
 by old values.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 CC: Glauber Costa glom...@redhat.com
 ---
  target-i386/kvm.c |   10 ++
  1 files changed, 10 insertions(+), 0 deletions(-)
 
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index d8f26bf..664a4a0 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -453,6 +453,9 @@ void kvm_arch_reset_vcpu(CPUState *env)
  env-nmi_injected = 0;
  env-nmi_pending = 0;
  env-xcr0 = 1;
 +env-system_time_msr = 0;
 +env-wall_clock_msr = 0;
 +env-async_pf_en_msr = 0;

Have you seen this happening? I'd expect CPUState to be zeroed out over
init. And if it is not, I guess we should...

[Qemu-devel] Re: [PATCH v2 14/17] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-03 Thread Glauber Costa

On Mon, 2011-01-03 at 09:33 +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 If kvmclock is used, which implies the kernel supports it, register a
 kvmclock device with the sysbus. Its main purpose is to save and restore
 the kernel state on migration, but this will also allow to visualize it
 one day.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 CC: Glauber Costa glom...@redhat.com

Hi Jan.

I've just recently posted a patch (not sure what was made from it), that
fixes a bug that you reintroduce here.

The bug is: if we call KVM_GET_CLOCK ioctl in pre_save, this means that
this value will change every time we issue savevm, even if the machine
is not run in between.

Ideally, we'd like to have two consecutive savevms returning the exact
same thing in that situation.

I like the general approach of moving it to sysbus, but please move the
ioctl to change state notifiers.

Cheers!

[Qemu-devel] Re: [PATCH v2 14/17] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-03 Thread Glauber Costa

On Mon, 2011-01-03 at 17:30 +0100, Jan Kiszka wrote:
 Am 03.01.2011 17:04, Avi Kivity wrote:
  On 01/03/2011 10:33 AM, Jan Kiszka wrote:
  From: Jan Kiszkajan.kis...@siemens.com
 
  If kvmclock is used, which implies the kernel supports it, register a
  kvmclock device with the sysbus. Its main purpose is to save and restore
  the kernel state on migration, but this will also allow to visualize it
  one day.
 
  
  kvmclock is a per-cpu affair.
 
 Nope, it's state (the one save/restored here) is per VM.
 
  
 
  @@ -534,6 +599,10 @@ int kvm_arch_init(int smp_cpus)
int ret;
struct utsname utsname;
 
  +#ifdef KVM_CAP_ADJUST_CLOCK
  +sysbus_register_withprop(kvmclock_info);
  +#endif
  +
  
  So this doesn't look right.  I think we're fine with just migrating the
  MSRs, like we migrate anything else that has to do with the cpu.
  
 
 The kvmclock state is not contained in any MSR. It's an independent
 machine state that can be indirectly obtained via MSR access. Therefore,
 qemu-kvm currently registers only one vmstate entry per machine, and
 this patch just turns this into a clean device - because that's what
 kvmclock is in the end, something like an HPET.

Jan is right, the per-cpu information only cares about which piece of
memory will be written to.

[Qemu-devel] Re: [PATCH v2 11/17] kvm: x86: Reset paravirtual MSRs

2011-01-03 Thread Glauber Costa

On Mon, 2011-01-03 at 17:46 +0100, Jan Kiszka wrote:
 Am 03.01.2011 17:40, Glauber Costa wrote:
  On Mon, 2011-01-03 at 09:33 +0100, Jan Kiszka wrote:
  From: Jan Kiszka jan.kis...@siemens.com
 
  Make sure to clear MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, and
  MSR_KVM_ASYNC_PF_EN so that a freshly booted guest cannot be disturbed
  by old values.
 
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
  CC: Glauber Costa glom...@redhat.com
  ---
   target-i386/kvm.c |   10 ++
   1 files changed, 10 insertions(+), 0 deletions(-)
 
  diff --git a/target-i386/kvm.c b/target-i386/kvm.c
  index d8f26bf..664a4a0 100644
  --- a/target-i386/kvm.c
  +++ b/target-i386/kvm.c
  @@ -453,6 +453,9 @@ void kvm_arch_reset_vcpu(CPUState *env)
   env-nmi_injected = 0;
   env-nmi_pending = 0;
   env-xcr0 = 1;
  +env-system_time_msr = 0;
  +env-wall_clock_msr = 0;
  +env-async_pf_en_msr = 0;
  
  Have you seen this happening? I'd expect CPUState to be zeroed out over
  init. And if it is not, I guess we should...
 
 Ah, true, those three are part of the section that is zeroed. Will drop
 that hunk on repost.
 
 Guess we should rather move some other variables in that region too and
 avoid clearing them manually like above...
 
 Jan
 
Agreed.

[Qemu-devel] Re: [PATCH v2 1/2] Do not register kvmclock savevm section if kvmclock is disabled.

2010-12-13 Thread Glauber Costa

On Wed, 2010-12-08 at 17:31 -0200, Marcelo Tosatti wrote:
 On Tue, Dec 07, 2010 at 03:12:36PM -0200, Glauber Costa wrote:
  On Mon, 2010-12-06 at 19:04 -0200, Marcelo Tosatti wrote:
   On Mon, Dec 06, 2010 at 09:03:46AM -0500, Glauber Costa wrote:
Usually nobody usually thinks about that scenario (me included and 
specially),
but kvmclock can be actually disabled in the host.

It happens in two scenarios:
 1. host too old.
 2. we passed -kvmclock to our -cpu parameter.

In both cases, we should not register kvmclock savevm section. This 
patch
achives that by registering this section only if kvmclock is actually
currently enabled in cpuid.

The only caveat is that we have to register the savevm section a little 
bit
later, since we won't know the final kvmclock state before cpuid gets 
parsed.
   
   What is the problem of registering the section? Restoring the value if
   the host does not support it returns an error?
   
   Can't you ignore the error if kvmclock is not reported in cpuid, in the
   restore handler?
  
  We can change the restore handler, but not the restore handler of
  binaries that are already out there. The motivation here is precisely to
  address migration to hosts without kvmclock, so it's better to have
  a way to disable, than to count on the fact that the other side will be
  able to ignore it.
 
 OK. Can't you register conditionally on kvmclock cpuid bit at the end of
 kvm_arch_init_vcpu, in target-i386/kvm.c?

Haven't looked at it, but will today. Actually, tsc has (obviously) the
same problem and I plan to respin the patch today including a fix for it
as well.

Thanks!

[Qemu-devel] Re: [PATCH v2 1/2] Do not register kvmclock savevm section if kvmclock is disabled.

2010-12-07 Thread Glauber Costa

On Mon, 2010-12-06 at 19:04 -0200, Marcelo Tosatti wrote:
 On Mon, Dec 06, 2010 at 09:03:46AM -0500, Glauber Costa wrote:
  Usually nobody usually thinks about that scenario (me included and 
  specially),
  but kvmclock can be actually disabled in the host.
  
  It happens in two scenarios:
   1. host too old.
   2. we passed -kvmclock to our -cpu parameter.
  
  In both cases, we should not register kvmclock savevm section. This patch
  achives that by registering this section only if kvmclock is actually
  currently enabled in cpuid.
  
  The only caveat is that we have to register the savevm section a little bit
  later, since we won't know the final kvmclock state before cpuid gets 
  parsed.
 
 What is the problem of registering the section? Restoring the value if
 the host does not support it returns an error?
 
 Can't you ignore the error if kvmclock is not reported in cpuid, in the
 restore handler?

We can change the restore handler, but not the restore handler of
binaries that are already out there. The motivation here is precisely to
address migration to hosts without kvmclock, so it's better to have
a way to disable, than to count on the fact that the other side will be
able to ignore it.

[Qemu-devel] [PATCH v2 2/2] make kvmclock value idempotent for stopped machine

2010-12-06 Thread Glauber Costa

Although we never made such commitment clear (well, to the best
of my knowledge), some people expect that two savevm issued in sequence
in a stopped machine will yield the same results. This is not a crazy
requirement, since we don't expect a stopped machine to be updating its state,
for any device.

With kvmclock, this is not the case, since the .pre_save hook will issue an
ioctl to the host to acquire a timestamp, which is always changing.

This patch moves the value acquisition to vm state change handlers, conditional
on not being run. This could mean mean our get clock ioctl is issued more times,
but this should be fine since vm_stop is not a hot path.

When we do migrate, we'll transfer that value along.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Paolo Bonzini pbonz...@redhat.com

---
 qemu-kvm-x86.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 14a52ce..0e357ac 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -500,11 +500,11 @@ static int kvm_enable_tpr_access_reporting(CPUState *env)
 #ifdef KVM_CAP_ADJUST_CLOCK
 static struct kvm_clock_data kvmclock_data;
 
-static void kvmclock_pre_save(void *opaque)
+static void kvmclock_update_clock(void *opaque, int running, int reason)
 {
 struct kvm_clock_data *cl = opaque;
 
-if (!kvmclock_enabled)
+if ((!kvmclock_enabled) || running)
 return;
 
 kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, cl);
@@ -522,7 +522,6 @@ static const VMStateDescription vmstate_kvmclock= {
 .version_id = 1,
 .minimum_version_id = 1,
 .minimum_version_id_old = 1,
-.pre_save = kvmclock_pre_save,
 .post_load = kvmclock_post_load,
 .fields  = (VMStateField []) {
 VMSTATE_U64(clock, struct kvm_clock_data),
@@ -537,6 +536,7 @@ void kvmclock_register_savevm(void)
 #ifdef KVM_CAP_ADJUST_CLOCK
 if (kvmclock_enabled  kvm_check_extension(kvm_state, 
KVM_CAP_ADJUST_CLOCK)) {
 vmstate_register(NULL, 0, vmstate_kvmclock, kvmclock_data);
+qemu_add_vm_change_state_handler(kvmclock_update_clock, 
kvmclock_data);
 }
 #endif
 }
-- 
1.7.2.3

[Qemu-devel] [PATCH v2 0/2] savevm odness related to kvmclock

2010-12-06 Thread Glauber Costa

Some users told me that savevm path is behaving oddly wrt kvmclock.
The first oddness is that a guarantee we never made (AFAIK) is being broken:
two consecutive savevm operations, with the machine stopped in between
produces different results, due to the call to KVM_GET_CLOCK ioctl.
I believe the assumption that if the vm does not run, its saveable
state won't change is fairly reasonable. Maybe we should formally
guarantee that?

Also, this patch deals with the fact that this happens even if
kvmclock is disabled in cpuid: its savevm section is registered
nevertheless. Here, I try to register it only if it's enabled at
machine start.

v2: improvements suggested by Paolo, and patch reordering.

Glauber Costa (2):
  Do not register kvmclock savevm section if kvmclock is disabled.
  make kvmclock value idempotent for stopped machine

 cpus.c|3 +++
 qemu-kvm-x86.c|   23 +++
 qemu-kvm.h|3 +++
 target-i386/kvm.c |7 +++
 4 files changed, 28 insertions(+), 8 deletions(-)

-- 
1.7.2.3

[Qemu-devel] [PATCH v2 1/2] Do not register kvmclock savevm section if kvmclock is disabled.

2010-12-06 Thread Glauber Costa

Usually nobody usually thinks about that scenario (me included and specially),
but kvmclock can be actually disabled in the host.

It happens in two scenarios:
 1. host too old.
 2. we passed -kvmclock to our -cpu parameter.

In both cases, we should not register kvmclock savevm section. This patch
achives that by registering this section only if kvmclock is actually
currently enabled in cpuid.

The only caveat is that we have to register the savevm section a little bit
later, since we won't know the final kvmclock state before cpuid gets parsed.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 cpus.c|3 +++
 qemu-kvm-x86.c|   19 +--
 qemu-kvm.h|3 +++
 target-i386/kvm.c |7 +++
 4 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/cpus.c b/cpus.c
index a55c330..a24098e 100644
--- a/cpus.c
+++ b/cpus.c
@@ -97,6 +97,9 @@ void cpu_synchronize_all_post_init(void)
 for (cpu = first_cpu; cpu; cpu = cpu-next_cpu) {
 cpu_synchronize_post_init(cpu);
 }
+if (kvm_enabled()) {
+kvmclock_register_savevm();
+}
 }
 
 int cpu_is_stopped(CPUState *env)
diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 20b7d6d..14a52ce 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -504,6 +504,9 @@ static void kvmclock_pre_save(void *opaque)
 {
 struct kvm_clock_data *cl = opaque;
 
+if (!kvmclock_enabled)
+return;
+
 kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, cl);
 }
 
@@ -528,6 +531,16 @@ static const VMStateDescription vmstate_kvmclock= {
 };
 #endif
 
+/* This has to happen after vcpu setup*/
+void kvmclock_register_savevm(void)
+{
+#ifdef KVM_CAP_ADJUST_CLOCK
+if (kvmclock_enabled  kvm_check_extension(kvm_state, 
KVM_CAP_ADJUST_CLOCK)) {
+vmstate_register(NULL, 0, vmstate_kvmclock, kvmclock_data);
+}
+#endif
+}
+
 int kvm_arch_qemu_create_context(void)
 {
 int r;
@@ -545,12 +558,6 @@ int kvm_arch_qemu_create_context(void)
 return -1;
 }
 
-#ifdef KVM_CAP_ADJUST_CLOCK
-if (kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK)) {
-vmstate_register(NULL, 0, vmstate_kvmclock, kvmclock_data);
-}
-#endif
-
 r = kvm_set_boot_cpu_id(0);
 if (r  0  r != -ENOSYS) {
 return r;
diff --git a/qemu-kvm.h b/qemu-kvm.h
index 0f3fb50..0a104ef 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -752,6 +752,9 @@ int handle_tpr_access(void *opaque, CPUState *env, uint64_t 
rip,
 #define qemu_kvm_cpu_stop(env) do {} while(0)
 #endif
 
+extern int kvmclock_enabled;
+void kvmclock_register_savevm(void);
+
 #ifdef CONFIG_KVM
 
 typedef struct KVMSlot {
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 95e5d02..5443765 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -293,6 +293,7 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t 
status,
 }
 
 static int _kvm_arch_init_vcpu(CPUState *env);
+int kvmclock_enabled = 1;
 
 int kvm_arch_init_vcpu(CPUState *env)
 {
@@ -350,6 +351,12 @@ int kvm_arch_init_vcpu(CPUState *env)
 memset(c, 0, sizeof(*c));
 c-function = KVM_CPUID_FEATURES;
 c-eax = env-cpuid_kvm_features  get_para_features(env);
+
+if (!(c-eax  (1  KVM_FEATURE_CLOCKSOURCE))) {
+/* In theory cpuid is per-cpu, and this is a global variable,
+ * but we don't expect kvmclock enabled in some cpus only */
+kvmclock_enabled = 0;
+}
 #endif
 
 cpu_x86_cpuid(env, 0, 0, limit, unused, unused, unused);
-- 
1.7.2.3

[Qemu-devel] [PATCH 0/2] Fix savevm odness related to kvmclock

2010-12-03 Thread Glauber Costa

Some users told me that savevm path is behaving oddly wrt kvmclock.
The first oddness is that a guarantee we never made (AFAIK) is being broken:
two consecutive savevm operations, with the machine stopped in between
produces different results, due to the call to KVM_GET_CLOCK ioctl.
I believe the assumption that if the vm does not run, its saveable
state won't change is fairly reasonable. Maybe we should formally
guarantee that?

The second patch deals with the fact that this happens even if
kvmclock is disabled in cpuid: its savevm section is registered
nevertheless. Here, I try to register it only if it's enabled at
machine start.

Thanks

Glauber Costa (2):
  make kvmclock value idempotent for stopped machine
  Do not register kvmclock savevm section if kvmclock is disabled.

 cpus.c|7 +++
 qemu-kvm-x86.c|   25 +++--
 qemu-kvm.h|4 
 target-i386/kvm.c |7 +++
 4 files changed, 33 insertions(+), 10 deletions(-)

-- 
1.7.2.3

[Qemu-devel] [PATCH 2/2] Do not register kvmclock savevm section if kvmclock is disabled.

2010-12-03 Thread Glauber Costa

Usually nobody usually thinks about that scenario (me included and specially),
but kvmclock can be actually disabled in the host.

It happens in two scenarios:
 1. host too old.
 2. we passed -kvmclock to our -cpu parameter.

In both cases, we should not register kvmclock savevm section. This patch
achives that by registering this section only if kvmclock is actually
currently enabled in cpuid.

The only caveat is that we have to register the savevm section a little bit
later, since we won't know the final kvmclock state before cpuid gets parsed.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 cpus.c|3 +++
 qemu-kvm-x86.c|   20 ++--
 qemu-kvm.h|2 ++
 target-i386/kvm.c |7 +++
 4 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/cpus.c b/cpus.c
index 879a03a..eef716c 100644
--- a/cpus.c
+++ b/cpus.c
@@ -97,6 +97,9 @@ void cpu_synchronize_all_post_init(void)
 for (cpu = first_cpu; cpu; cpu = cpu-next_cpu) {
 cpu_synchronize_post_init(cpu);
 }
+if (kvm_enabled()) {
+kvmclock_register_savevm();
+}
 }
 
 int cpu_is_stopped(CPUState *env)
diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index d099d3d..668c8cf 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -502,6 +502,9 @@ static struct kvm_clock_data kvmclock_data;
 
 void kvmclock_update_clock(void)
 {
+if (!kvmclock_enabled)
+return;
+
 kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, kvmclock_data);
 }
 
@@ -525,6 +528,17 @@ static const VMStateDescription vmstate_kvmclock= {
 };
 #endif
 
+/* This has to happen after vcpu setup*/
+void kvmclock_register_savevm(void)
+{
+#ifdef KVM_CAP_ADJUST_CLOCK
+if (kvmclock_enabled  kvm_check_extension(kvm_state, 
KVM_CAP_ADJUST_CLOCK)) {
+printf(registering kvmclock savevm section\n);
+vmstate_register(NULL, 0, vmstate_kvmclock, kvmclock_data);
+}
+#endif
+}
+
 int kvm_arch_qemu_create_context(void)
 {
 int r;
@@ -542,12 +556,6 @@ int kvm_arch_qemu_create_context(void)
 return -1;
 }
 
-#ifdef KVM_CAP_ADJUST_CLOCK
-if (kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK)) {
-vmstate_register(NULL, 0, vmstate_kvmclock, kvmclock_data);
-}
-#endif
-
 r = kvm_set_boot_cpu_id(0);
 if (r  0  r != -ENOSYS) {
 return r;
diff --git a/qemu-kvm.h b/qemu-kvm.h
index b0b7ab3..f51a2d6 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -753,6 +753,8 @@ int handle_tpr_access(void *opaque, CPUState *env, uint64_t 
rip,
 #endif
 
 void kvmclock_update_clock(void);
+extern int kvmclock_enabled;
+void kvmclock_register_savevm(void);
 
 #ifdef CONFIG_KVM
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 95e5d02..5443765 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -293,6 +293,7 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t 
status,
 }
 
 static int _kvm_arch_init_vcpu(CPUState *env);
+int kvmclock_enabled = 1;
 
 int kvm_arch_init_vcpu(CPUState *env)
 {
@@ -350,6 +351,12 @@ int kvm_arch_init_vcpu(CPUState *env)
 memset(c, 0, sizeof(*c));
 c-function = KVM_CPUID_FEATURES;
 c-eax = env-cpuid_kvm_features  get_para_features(env);
+
+if (!(c-eax  (1  KVM_FEATURE_CLOCKSOURCE))) {
+/* In theory cpuid is per-cpu, and this is a global variable,
+ * but we don't expect kvmclock enabled in some cpus only */
+kvmclock_enabled = 0;
+}
 #endif
 
 cpu_x86_cpuid(env, 0, 0, limit, unused, unused, unused);
-- 
1.7.2.3

[Qemu-devel] [PATCH 1/2] make kvmclock value idempotent for stopped machine

2010-12-03 Thread Glauber Costa

Although we never made such commitment clear (well, to the best
of my knowledge), some people expect that two savevm issued in sequence
in a stopped machine will yield the same results. This is not a crazy
requirement, since we don't expect a stopped machine to be updating its state,
for any device.

With kvmclock, this is not the case, since the .pre_save hook will issue an
ioctl to the host to acquire a timestamp, which is always changing.

This patch moves the value acquisition to vm_stop. This should mean our
get clock ioctl is issued more times, but this should be fine since vm_stop
is not a hot path.

When we do migrate, we'll transfer that value along.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 cpus.c |4 
 qemu-kvm-x86.c |7 ++-
 qemu-kvm.h |2 ++
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/cpus.c b/cpus.c
index a55c330..879a03a 100644
--- a/cpus.c
+++ b/cpus.c
@@ -112,6 +112,10 @@ static void do_vm_stop(int reason)
 pause_all_vcpus();
 vm_state_notify(0, reason);
 monitor_protocol_event(QEVENT_STOP, NULL);
+if (kvm_enabled()) {
+kvmclock_update_clock();
+}
+
 }
 }
 
diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 20b7d6d..d099d3d 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -500,11 +500,9 @@ static int kvm_enable_tpr_access_reporting(CPUState *env)
 #ifdef KVM_CAP_ADJUST_CLOCK
 static struct kvm_clock_data kvmclock_data;
 
-static void kvmclock_pre_save(void *opaque)
+void kvmclock_update_clock(void)
 {
-struct kvm_clock_data *cl = opaque;
-
-kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, cl);
+kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, kvmclock_data);
 }
 
 static int kvmclock_post_load(void *opaque, int version_id)
@@ -519,7 +517,6 @@ static const VMStateDescription vmstate_kvmclock= {
 .version_id = 1,
 .minimum_version_id = 1,
 .minimum_version_id_old = 1,
-.pre_save = kvmclock_pre_save,
 .post_load = kvmclock_post_load,
 .fields  = (VMStateField []) {
 VMSTATE_U64(clock, struct kvm_clock_data),
diff --git a/qemu-kvm.h b/qemu-kvm.h
index 0f3fb50..b0b7ab3 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -752,6 +752,8 @@ int handle_tpr_access(void *opaque, CPUState *env, uint64_t 
rip,
 #define qemu_kvm_cpu_stop(env) do {} while(0)
 #endif
 
+void kvmclock_update_clock(void);
+
 #ifdef CONFIG_KVM
 
 typedef struct KVMSlot {
-- 
1.7.2.3

[Qemu-devel] Re: [PATCH 0/5] RFC: distinguish warm reset from cold reset.

2010-08-30 Thread Glauber Costa

On Mon, Aug 30, 2010 at 05:35:20PM +0900, Isaku Yamahata wrote:
 On Mon, Aug 30, 2010 at 10:59:19AM +0300, Avi Kivity wrote:
   On 08/30/2010 10:49 AM, Isaku Yamahata wrote:
  This patch set distinguish warm reset from cold reset by
  introducing warm reset callback handler.
  The first 4 patches are trivial clean up patches. The last patch of 5/5
  is RFC patch.
 
  The following thread arose cold reset vs warm reset issues.
  http://lists.nongnu.org/archive/html/qemu-devel/2010-08/msg00186.html
  The summary is
  - warm reset is wanted in qemu
 - Pressing the reset button is a warm reset on real machines
 - Sparc64 CPU uses different reset vector for warm and cold reset,
   so system_reset acts like a reset button
 - Bus reset can be implemented utilizing qdev frame work instead of
   implemeting it each bus layer independently.
  - The modification should be incremental.
 Anthony would like to see that as an incremental addition to what we 
  have
 today (like introducing a propagating warm reset callback) and thinking
 through what the actual behavior should and shouldn't be.
 
 
  If the direction is okay, The next step would be a patch(set) for qdev 
  which
  would introduce qdev_cold_reset(), qdev_warm_reset(),
  DeviceInfo::cold_reset and DeviceInfo::warm_reset
  and would obsolete qdev_reset() and DeviceInfo::reset.
 
 
  What would be the difference between warm and cold reset?  Former called  
  on any reset, while the latter called on power up only?
 
 What I have in mind at the moment is,
 warm reset callback is called on warm reset, not called on power up.
 cold reset callback is called only on power up (and power cycle).
 
 Hmm, should warm reset handler be called on power up?
 Each cold reset callbacks can call corresponding warm reset handler
 if necessary. So it would be redundant to call qdev_warm_reset() on power up.
 Or would it be more robust to call warm reset in addition to cold reset
 on power on?
I guess a good way to do that would be making reset just mean warm reset,
and keep a single list. Those would be called on every kind of reset.
Alternatively, cards that want to have a different power-on code could
then be added to an amendment list, interfaced by a cold-reset API.

[Qemu-devel] Re: [PATCH] e1000: Fix hotplug

2010-08-03 Thread Glauber Costa

On Mon, Aug 02, 2010 at 03:15:17PM -0600, Alex Williamson wrote:
 When we hotplug the device,
 we don't go through a reset cycle, which means a hot added e1000 is
 useless until the VM reboots.  

I do guess, however, that this is true for any device, right?

Wouldn't it be better to just call the newly added reset function at
hotplug? One way to do that, would be to store a value indicated qemu
has already started. If you add a reset handler after that, the function
is called before being placed on the list.

[Qemu-devel] Re: [PATCH] qdev: Reset hotplugged devices

2010-08-03 Thread Glauber Costa

On Tue, Aug 03, 2010 at 10:19:47AM -0600, Alex Williamson wrote:
 Several devices rely on their reset() function being called to
 initialize device state, e1000 and rtl8139 in particular.  When
 the device is hot added, the reset doesn't occur, often leaving
 the device in an unusable state.  Adding a call to reset() after
 init() for hotplugged devices puts the device in the expected
 state for the guest.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
I like this one better.

Re: [Qemu-devel] [PATCH] stop cpus before forking.

2010-06-16 Thread Glauber Costa

On Mon, Jun 14, 2010 at 02:58:47PM -0500, Anthony Liguori wrote:
 On 06/14/2010 02:42 PM, Glauber Costa wrote:
 On Mon, Jun 14, 2010 at 02:33:00PM -0500, Anthony Liguori wrote:
 On 06/14/2010 02:27 PM, Glauber Costa wrote:
 This patch fixes a bug that happens with kvm, irqchip-in-kernel,
 while adding a netdev. Despite the situations of reproduction being
 specific to kvm, I believe this fix is pretty generic, and fits here.
 Specially if we ever want to have our own irqchip in kernel too.
 
 The problem happens after the fork system call, and although it is not
 100 % reproduceable, happens pretty often. After fork, the memory where
 the apic is mapped is present in both processes. It ends up confusing
 the vcpus somewhere in the irq-   ack path, and qemu hangs, with no
 irqs being delivered at all from that point on.
 
 Making sure the vcpus are stopped before forking makes the problem go
 away. Besides, this is a pretty unfrequent operation, which already hangs
 the io-thread for a while. So it should not hurt performance.
 
 Signed-off-by: Glauber Costaglom...@redhat.com
 This doesn't make very much sense to me but smells like a kernel bug to me.
 My interpretation is that by doing that, we make sure no in-flight
 requests are happening. Actually, a sleep(x), with x sufficiently big
 is enough to make this problem go away, but that is too hacky.
 
 vm_stop() is probably just acting a glorified sleep() since it has
 to wait for each thread to stop.
No, it is not. It also makes sure no vcpus are running, and thus, not generating
new requests (or waiting for any replies, for that matter).

(Note: I am not advocating the inclusion of this, just trying to build my own
awareness)

[Qemu-devel] [PATCH] stop cpus before forking.

2010-06-14 Thread Glauber Costa

This patch fixes a bug that happens with kvm, irqchip-in-kernel,
while adding a netdev. Despite the situations of reproduction being
specific to kvm, I believe this fix is pretty generic, and fits here.
Specially if we ever want to have our own irqchip in kernel too.

The problem happens after the fork system call, and although it is not
100 % reproduceable, happens pretty often. After fork, the memory where
the apic is mapped is present in both processes. It ends up confusing
the vcpus somewhere in the irq - ack path, and qemu hangs, with no
irqs being delivered at all from that point on.

Making sure the vcpus are stopped before forking makes the problem go
away. Besides, this is a pretty unfrequent operation, which already hangs
the io-thread for a while. So it should not hurt performance.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 net/tap.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/tap.c b/net/tap.c
index 0147dab..f34dd9c 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -330,6 +330,9 @@ static int launch_script(const char *setup_script, const 
char *ifname, int fd)
 sigaddset(mask, SIGCHLD);
 sigprocmask(SIG_BLOCK, mask, oldmask);
 
+/* make sure no cpus are running, so the apic does not
+ * get confused */
+vm_stop(0);
 /* try to launch network script */
 pid = fork();
 if (pid == 0) {
@@ -350,6 +353,7 @@ static int launch_script(const char *setup_script, const 
char *ifname, int fd)
 execv(setup_script, args);
 _exit(1);
 } else if (pid  0) {
+vm_start();
 while (waitpid(pid, status, 0) != pid) {
 /* loop */
 }
-- 
1.7.0.1

Re: [Qemu-devel] [PATCH] stop cpus before forking.

2010-06-14 Thread Glauber Costa

On Mon, Jun 14, 2010 at 02:33:00PM -0500, Anthony Liguori wrote:
 On 06/14/2010 02:27 PM, Glauber Costa wrote:
 This patch fixes a bug that happens with kvm, irqchip-in-kernel,
 while adding a netdev. Despite the situations of reproduction being
 specific to kvm, I believe this fix is pretty generic, and fits here.
 Specially if we ever want to have our own irqchip in kernel too.
 
 The problem happens after the fork system call, and although it is not
 100 % reproduceable, happens pretty often. After fork, the memory where
 the apic is mapped is present in both processes. It ends up confusing
 the vcpus somewhere in the irq-  ack path, and qemu hangs, with no
 irqs being delivered at all from that point on.
 
 Making sure the vcpus are stopped before forking makes the problem go
 away. Besides, this is a pretty unfrequent operation, which already hangs
 the io-thread for a while. So it should not hurt performance.
 
 Signed-off-by: Glauber Costaglom...@redhat.com
 
 This doesn't make very much sense to me but smells like a kernel bug to me.
My interpretation is that by doing that, we make sure no in-flight
requests are happening. Actually, a sleep(x), with x sufficiently big
is enough to make this problem go away, but that is too hacky.

I do agree that this is most likely a kernel bug. But as with any other
kernel bugs, I believe this is a easy workaround to have things working
even in older kernels until we fix it.

 
 Even if it isn't, I can't rationalize why stopping the vm like this
 is enough to fix such a problem.  Is the problem that the KVM VCPU
 threads get duplicated while potentially running or something like
 that?

I doubt fork is duplicating the vcpu threads. More than that, this
bug does not happen with userspace irqchip.
So I believe that either irq request or the ack itself is reaching the
wrong process, forever stalling the apic.

Re: [Qemu-devel] [PATCH] stop cpus before forking.

2010-06-14 Thread Glauber Costa

On Mon, Jun 14, 2010 at 02:58:47PM -0500, Anthony Liguori wrote:
 On 06/14/2010 02:42 PM, Glauber Costa wrote:
 On Mon, Jun 14, 2010 at 02:33:00PM -0500, Anthony Liguori wrote:
 On 06/14/2010 02:27 PM, Glauber Costa wrote:
 This patch fixes a bug that happens with kvm, irqchip-in-kernel,
 while adding a netdev. Despite the situations of reproduction being
 specific to kvm, I believe this fix is pretty generic, and fits here.
 Specially if we ever want to have our own irqchip in kernel too.
 
 The problem happens after the fork system call, and although it is not
 100 % reproduceable, happens pretty often. After fork, the memory where
 the apic is mapped is present in both processes. It ends up confusing
 the vcpus somewhere in the irq-   ack path, and qemu hangs, with no
 irqs being delivered at all from that point on.
 
 Making sure the vcpus are stopped before forking makes the problem go
 away. Besides, this is a pretty unfrequent operation, which already hangs
 the io-thread for a while. So it should not hurt performance.
 
 Signed-off-by: Glauber Costaglom...@redhat.com
 This doesn't make very much sense to me but smells like a kernel bug to me.
 My interpretation is that by doing that, we make sure no in-flight
 requests are happening. Actually, a sleep(x), with x sufficiently big
 is enough to make this problem go away, but that is too hacky.
 
 vm_stop() is probably just acting a glorified sleep() since it has
 to wait for each thread to stop.
 
 I do agree that this is most likely a kernel bug. But as with any other
 kernel bugs, I believe this is a easy workaround to have things working
 even in older kernels until we fix it.
 
 If we don't know what the bug is, then we do not know whether this
 is a work around.  Rather, this change happens to make the bug more
 difficult to reproduce with your test case.
 
 Even if it isn't, I can't rationalize why stopping the vm like this
 is enough to fix such a problem.  Is the problem that the KVM VCPU
 threads get duplicated while potentially running or something like
 that?
 I doubt fork is duplicating the vcpu threads. More than that, this
 bug does not happen with userspace irqchip.
 So I believe that either irq request or the ack itself is reaching the
 wrong process, forever stalling the apic.
 
 That sounds more like a signal delivery issue.  It's not obvious to
 me that we're doing the wrong thing with signal mask though.
 
 If it's a signal mask related issue, then vm_stop isn't a proper fix
 as there would be still be a race.
I do can investigate it further, but I doubt it is signal-delivery related.
I spent the first days believing it was, but now, I believe it is much
more likely to be apic-related. We don't need to wait for the child to exit for
this bug to happen, so SIGCHLD is never raised. And with in-kernel irqchip,
we don't deliver signals during normal vcpu execution.

[Qemu-devel] [PATCH] introduce -machine switch

2010-06-04 Thread Glauber Costa

This patch adds initial support for the -machine option, that allows
command line specification of machine attributes.
Besides its value per-se, it is the saner way we found to
allow for enabling/disabling of kvm's in-kernel irqchip.

machine-related options like kernel, initrd, etc, are now
accepted under this switch.

Note: This is against anthony's staging.

---
 hw/boards.h |4 +++
 hw/pc_piix.c|3 ++
 qemu-options.hx |   14 ++
 vl.c|   72 +++
 4 files changed, 67 insertions(+), 26 deletions(-)

diff --git a/hw/boards.h b/hw/boards.h
index 18b6b8f..bac8583 100644
--- a/hw/boards.h
+++ b/hw/boards.h
@@ -35,6 +35,10 @@ extern QEMUMachine *current_machine;
 
 #define COMMON_MACHINE_OPTS()  \
 {   \
+.name = machine,  \
+.type = QEMU_OPT_STRING,\
+},  \
+{   \
 .name = ram_size, \
 .type = QEMU_OPT_NUMBER,\
 },  \
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index f01194c..3ddb695 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -67,6 +67,9 @@ static void pc_init1(QemuOpts *opts, int pci_enabled)
 
 vmport_init();
 
+if (!kernel_cmdline)
+kernel_cmdline = ;
+
 /* allocate ram and load rom/bios */
 pc_memory_init(ram_size, kernel_filename, kernel_cmdline, initrd_filename,
below_4g_mem_size, above_4g_mem_size);
diff --git a/qemu-options.hx b/qemu-options.hx
index a6928b7..76ca866 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -35,6 +35,20 @@ STEXI
 Select the emulated @var{machine} (@code{-M ?} for list)
 ETEXI
 
+DEF(machine, HAS_ARG, QEMU_OPTION_machine,
+-machine [machine=m][,ram_size=ram][,boot_device=dev]\n
+ [,kernel=vmlinux][,cmdline=kernel_cmdline][,initrd=initrd]\n
+ [,cpu=cpu_type]\n
+pc-specific options: [,acpi=on|off]\n
+kvm-x86 specific options: [,apic_in_kernel=on|off]\n
+select emulated machine (-machine ? for list)\n,
+QEMU_ARCH_ALL)
+STEXI
+...@item -machine @var{machine}[,@var{option}]
+...@findex -machine
+Select the emulated @var{machine} (@code{-machine ?} for list)
+ETEXI
+
 DEF(cpu, HAS_ARG, QEMU_OPTION_cpu,
 -cpu cpuselect CPU (-cpu ? for list)\n, QEMU_ARCH_ALL)
 STEXI
diff --git a/vl.c b/vl.c
index 96b8d35..177ffe2 100644
--- a/vl.c
+++ b/vl.c
@@ -1605,6 +1605,16 @@ static QEMUMachine *find_machine(const char *name)
 if (m-alias  !strcmp(m-alias, name))
 return m;
 }
+
+printf(Supported machines are:\n);
+for(m = first_machine; m != NULL; m = m-next) {
+   if (m-alias)
+   printf(%-10s %s (alias of %s)\n,
+  m-alias, m-desc, m-name);
+   printf(%-10s %s%s\n,
+  m-name, m-desc,
+  m-is_default ?  (default) : );
+}
 return NULL;
 }
 
@@ -2567,7 +2577,7 @@ int main(int argc, char **argv, char **envp)
 DisplayState *ds;
 DisplayChangeListener *dcl;
 int cyls, heads, secs, translation;
-QemuOpts *hda_opts = NULL, *opts;
+QemuOpts *hda_opts = NULL, *machine_opts = NULL, *opts = NULL;
 int optind;
 const char *optarg;
 const char *loadvm = NULL;
@@ -2697,21 +2707,29 @@ int main(int argc, char **argv, char **envp)
 exit(1);
 }
 switch(popt-index) {
+case QEMU_OPTION_machine: {
+   const char *mach;
+
+machine_opts = qemu_opts_parse(qemu_machine_opts, optarg, 0);
+if (!machine_opts) {
+fprintf(stderr, parse error: %s\n, optarg);
+exit(1);
+}
+mach = qemu_opt_get(machine_opts, machine);
+
+   if (!mach)
+   break;
+
+machine = find_machine(mach);
+
+if (!machine)
+exit(*mach != '?');
+   break;
+   }
 case QEMU_OPTION_M:
 machine = find_machine(optarg);
-if (!machine) {
-QEMUMachine *m;
-printf(Supported machines are:\n);
-for(m = first_machine; m != NULL; m = m-next) {
-if (m-alias)
-printf(%-10s %s (alias of %s)\n,
-   m-alias, m-desc, m-name);
-printf(%-10s %s%s\n,
-   m-name, m-desc,
-   m-is_default ?  (default) : );
-}
-exit(*optarg != '?');
-}
+if (!machine) 
+   exit(*optarg != '?');
 break;
 case QEMU_OPTION_cpu:
 /* hw initialization will

[Qemu-devel] Re: [PATCH v2 2/2] basic machine opts framework

2010-06-02 Thread Glauber Costa

On Wed, Jun 02, 2010 at 09:15:10AM +0200, Jan Kiszka wrote:
   
  +QemuOptsList qemu_machine_opts = {
  +.name = M,
  +.head = QTAILQ_HEAD_INITIALIZER(qemu_machine_opts.head),
  +.desc = {
  +{
  +.name = mach,
  +.type = QEMU_OPT_STRING,
  +},{
  +.name = irqchip,
  +.type = QEMU_OPT_STRING,
  +},
 
 Can't we make the concrete machine define what options it needs? Pushing
 this into the generic code may soon end up in a bunch of very special
 switches that are unused on most machines or even have no meaning for them.
 
 Also, I would suggest to introduce the generic part first, and then add
 first users like the x86 irqchip.
Yeah, in general, I do agree with you.

Me and anthony talked about it for a while some time ago, and more or less
concluded that it could be possible to avoid that, putting a little think
which options to add.

the irqchip option, if you note, is not x86-specific, in any case.
Any machine has an irqchip. The first idea was to use something like
apic=in_kernel|userspace which would be, that, very x86-centric.

So, since letting machines define their own options adds complexity,
my take would be to add those common options, and add infrastructure
for machine-specific options when we see something that makes it
unavoidable.

What do you think?

[Qemu-devel] [PATCH v2 1/2] early set current_machine

2010-06-01 Thread Glauber Costa

this way, the machine_init function itself can know which machine is current
in use, not only the late init code.
Signed-off-by: Glauber Costa glom...@redhat.com
---
 vl.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/vl.c b/vl.c
index 96838f8..7a8b20b 100644
--- a/vl.c
+++ b/vl.c
@@ -5824,6 +5824,9 @@ int main(int argc, char **argv, char **envp)
 if (machine-compat_props) {
 qdev_prop_register_compat(machine-compat_props);
 }
+
+current_machine = machine;
+
 machine-init(ram_size, boot_devices,
   kernel_filename, kernel_cmdline, initrd_filename, cpu_model);
 
@@ -5841,8 +5844,6 @@ int main(int argc, char **argv, char **envp)
 }
 }
 
-current_machine = machine;
-
 /* init USB devices */
 if (usb_enabled) {
 if (foreach_device_config(DEV_USB, usb_parse)  0)
-- 
1.6.2.2

[Qemu-devel] [PATCH v2 2/2] basic machine opts framework

2010-06-01 Thread Glauber Costa

This patch adds initial support for the -machine option, that allows
command line specification of machine attributes (always relying on safe
defaults). Besides its value per-se, it is the saner way we found to
allow for enabling/disabling of kvm's in-kernel irqchip.

A machine with in-kernel-irqchip could be specified as:
-machine irqchip=apic-kvm
And one without it:
-machine irqchip=apic

To demonstrate how it'd work, this patch introduces a choice between
pic and apic, pic being the old-style isa thing.
---
 hw/boards.h |   10 ++
 hw/pc.c |   45 +++--
 qemu-config.c   |   16 
 qemu-config.h   |1 +
 qemu-options.hx |9 +
 vl.c|   54 ++
 6 files changed, 129 insertions(+), 6 deletions(-)

diff --git a/hw/boards.h b/hw/boards.h
index d889341..187794e 100644
--- a/hw/boards.h
+++ b/hw/boards.h
@@ -12,6 +12,15 @@ typedef void QEMUMachineInitFunc(ram_addr_t ram_size,
  const char *initrd_filename,
  const char *cpu_model);
 
+typedef void (QEMUIrqchipFunc)(void *opaque);
+
+typedef struct QEMUIrqchip {
+const char *name;
+QEMUIrqchipFunc *init; 
+int used;
+int is_default;
+} QEMUIrqchip;
+
 typedef struct QEMUMachine {
 const char *name;
 const char *alias;
@@ -21,6 +30,7 @@ typedef struct QEMUMachine {
 int max_cpus;
 int is_default;
 CompatProperty *compat_props;
+QEMUIrqchip *irqchip;
 struct QEMUMachine *next;
 } QEMUMachine;
 
diff --git a/hw/pc.c b/hw/pc.c
index 408d6d6..b3de30a 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1007,21 +1007,43 @@ int cpu_is_bsp(CPUState *env)
 return env-cpuid_apic_id == 0;
 }
 
+static void qemu_apic_init(void *opaque)
+{
+CPUState *env = opaque;
+if (!(env-cpuid_features  CPUID_APIC)) {
+fprintf(stderr, CPU lacks APIC cpuid flag\n);
+exit(1);
+}
+env-cpuid_apic_id = env-cpu_index;
+/* APIC reset callback resets cpu */
+apic_init(env);
+}
+
+static void qemu_pic_init(void *opaque)
+{
+CPUState *env = opaque;
+
+if (smp_cpus  1) {
+fprintf(stderr, PIC can't support smp systems\n);
+exit(1);
+}
+qemu_register_reset((QEMUResetHandler*)cpu_reset, env);
+}
+
 static CPUState *pc_new_cpu(const char *cpu_model)
 {
 CPUState *env;
+QEMUIrqchip *ic;
 
 env = cpu_init(cpu_model);
 if (!env) {
 fprintf(stderr, Unable to find x86 CPU definition\n);
 exit(1);
 }
-if ((env-cpuid_features  CPUID_APIC) || smp_cpus  1) {
-env-cpuid_apic_id = env-cpu_index;
-/* APIC reset callback resets cpu */
-apic_init(env);
-} else {
-qemu_register_reset((QEMUResetHandler*)cpu_reset, env);
+
+for (ic = current_machine-irqchip; ic-name != NULL; ic++) {
+if (ic-used)
+ic-init(env);
 }
 return env;
 }
@@ -1370,6 +1392,17 @@ static QEMUMachine pc_machine = {
 .desc = Standard PC,
 .init = pc_init_pci,
 .max_cpus = 255,
+.irqchip = (QEMUIrqchip[]){
+{
+.name = apic,
+.init = qemu_apic_init,
+.is_default = 1,
+},{
+.name = pic,
+.init = qemu_pic_init,
+},
+{ /* end of list */ },
+},
 .is_default = 1,
 };
 
diff --git a/qemu-config.c b/qemu-config.c
index cae92f7..e83b301 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -196,6 +196,21 @@ QemuOptsList qemu_rtc_opts = {
 },
 };
 
+QemuOptsList qemu_machine_opts = {
+.name = M,
+.head = QTAILQ_HEAD_INITIALIZER(qemu_machine_opts.head),
+.desc = {
+{
+.name = mach,
+.type = QEMU_OPT_STRING,
+},{
+.name = irqchip,
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+
 static QemuOptsList *lists[] = {
 qemu_drive_opts,
 qemu_chardev_opts,
@@ -203,6 +218,7 @@ static QemuOptsList *lists[] = {
 qemu_netdev_opts,
 qemu_net_opts,
 qemu_rtc_opts,
+qemu_machine_opts,
 NULL,
 };
 
diff --git a/qemu-config.h b/qemu-config.h
index 3cc8864..9b02ee0 100644
--- a/qemu-config.h
+++ b/qemu-config.h
@@ -7,6 +7,7 @@ extern QemuOptsList qemu_device_opts;
 extern QemuOptsList qemu_netdev_opts;
 extern QemuOptsList qemu_net_opts;
 extern QemuOptsList qemu_rtc_opts;
+extern QemuOptsList qemu_machine_opts;
 
 int qemu_set_option(const char *str);
 
diff --git a/qemu-options.hx b/qemu-options.hx
index 20aa242..4dfc55a 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -31,6 +31,15 @@ STEXI
 Select the emulated @var{machine} (@code{-M ?} for list)
 ETEXI
 
+DEF(machine, HAS_ARG, QEMU_OPTION_machine,
+-machine mach=m[,irqchip=chip]\n
+select emulated machine (-machine ? for list)\n)
+STEXI
+...@item -machine @var{machine}[,@var{option}]
+...@findex -machine
+Select the emulated

[Qemu-devel] [PATCH v2 0/2] basic machine opts framework

2010-06-01 Thread Glauber Costa

Hello,

This is a resent (rebased) of an old patch set of mine, I sent
some time ago. With that, we should have all the needed infrastructure
to select the in-kernel irqchip for KVM.


Glauber Costa (2):
  early set current_machine
  basic machine opts framework

 hw/boards.h |   10 +
 hw/pc.c |   45 -
 qemu-config.c   |   16 ++
 qemu-config.h   |1 +
 qemu-options.hx |9 
 vl.c|   59 +-
 6 files changed, 132 insertions(+), 8 deletions(-)

[Qemu-devel] [PATCH 2/2] machine opts framework

2010-03-24 Thread Glauber Costa

This patch adds initial support for the -machine option, that allows
command line specification of machine attributes (always relying on safe
defaults). Besides its value per-se, it is the saner way we found to
allow for enabling/disabling of kvm's in-kernel irqchip.

A machine with in-kernel-irqchip could be specified as:
-machine irqchip=apic-kvm
And one without it:
-machine irqchip=apic

To demonstrate how it'd work, this patch introduces a choice between
pic and apic, pic being the old-style isa thing. 

---
 hw/boards.h |   10 ++
 hw/pc.c |   45 +++--
 qemu-config.c   |   16 
 qemu-config.h   |1 +
 qemu-options.hx |9 +
 vl.c|   54 ++
 6 files changed, 129 insertions(+), 6 deletions(-)

diff --git a/hw/boards.h b/hw/boards.h
index 6f0f0d7..831728c 100644
--- a/hw/boards.h
+++ b/hw/boards.h
@@ -12,6 +12,15 @@ typedef void QEMUMachineInitFunc(ram_addr_t ram_size,
  const char *initrd_filename,
  const char *cpu_model);
 
+typedef void (QEMUIrqchipFunc)(void *opaque);
+
+typedef struct QEMUIrqchip {
+const char *name;
+QEMUIrqchipFunc *init; 
+int used;
+int is_default;
+} QEMUIrqchip;
+
 typedef struct QEMUMachine {
 const char *name;
 const char *alias;
@@ -28,6 +37,7 @@ typedef struct QEMUMachine {
 no_sdcard:1;
 int is_default;
 GlobalProperty *compat_props;
+QEMUIrqchip *irqchip;
 struct QEMUMachine *next;
 } QEMUMachine;
 
diff --git a/hw/pc.c b/hw/pc.c
index ba14df0..43ec022 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -752,21 +752,43 @@ int cpu_is_bsp(CPUState *env)
 return env-cpu_index == 0;
 }
 
+static void qemu_apic_init(void *opaque)
+{
+CPUState *env = opaque;
+if (!(env-cpuid_features  CPUID_APIC)) {
+fprintf(stderr, CPU lacks APIC cpuid flag\n);
+exit(1);
+}
+env-cpuid_apic_id = env-cpu_index;
+/* APIC reset callback resets cpu */
+apic_init(env);
+}
+
+static void qemu_pic_init(void *opaque)
+{
+CPUState *env = opaque;
+
+if (smp_cpus  1) {
+fprintf(stderr, PIC can't support smp systems\n);
+exit(1);
+}
+qemu_register_reset((QEMUResetHandler*)cpu_reset, env);
+}
+
 static CPUState *pc_new_cpu(const char *cpu_model)
 {
 CPUState *env;
+QEMUIrqchip *ic;
 
 env = cpu_init(cpu_model);
 if (!env) {
 fprintf(stderr, Unable to find x86 CPU definition\n);
 exit(1);
 }
-if ((env-cpuid_features  CPUID_APIC) || smp_cpus  1) {
-env-cpuid_apic_id = env-cpu_index;
-/* APIC reset callback resets cpu */
-apic_init(env);
-} else {
-qemu_register_reset((QEMUResetHandler*)cpu_reset, env);
+
+for (ic = current_machine-irqchip; ic-name != NULL; ic++) {
+if (ic-used)
+ic-init(env);
 }
 return env;
 }
@@ -1074,6 +1096,17 @@ static QEMUMachine pc_machine = {
 .desc = Standard PC,
 .init = pc_init_pci,
 .max_cpus = 255,
+.irqchip = (QEMUIrqchip[]){
+{
+.name = apic,
+.init = qemu_apic_init,
+.is_default = 1,
+},{
+.name = pic,
+.init = qemu_pic_init,
+},
+{ /* end of list */ },
+},
 .is_default = 1,
 };
 
diff --git a/qemu-config.c b/qemu-config.c
index 150157c..2b985a9 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -296,6 +296,21 @@ QemuOptsList qemu_cpudef_opts = {
 },
 };
 
+QemuOptsList qemu_machine_opts = {
+.name = M,
+.head = QTAILQ_HEAD_INITIALIZER(qemu_machine_opts.head),
+.desc = {
+{
+.name = mach,
+.type = QEMU_OPT_STRING,
+},{
+.name = irqchip,
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+
 static QemuOptsList *lists[] = {
 qemu_drive_opts,
 qemu_chardev_opts,
@@ -306,6 +321,7 @@ static QemuOptsList *lists[] = {
 qemu_global_opts,
 qemu_mon_opts,
 qemu_cpudef_opts,
+qemu_machine_opts,
 NULL,
 };
 
diff --git a/qemu-config.h b/qemu-config.h
index f217c58..ea302f0 100644
--- a/qemu-config.h
+++ b/qemu-config.h
@@ -10,6 +10,7 @@ extern QemuOptsList qemu_rtc_opts;
 extern QemuOptsList qemu_global_opts;
 extern QemuOptsList qemu_mon_opts;
 extern QemuOptsList qemu_cpudef_opts;
+extern QemuOptsList qemu_machine_opts;
 
 QemuOptsList *qemu_find_opts(const char *group);
 int qemu_set_option(const char *str);
diff --git a/qemu-options.hx b/qemu-options.hx
index 8450b45..585ecd2 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -34,6 +34,15 @@ STEXI
 Select the emulated @var{machine} (@code{-M ?} for list)
 ETEXI
 
+DEF(machine, HAS_ARG, QEMU_OPTION_machine,
+-machine mach=m[,irqchip=chip]\n
+select emulated machine (-machine ? for list)\n)
+STEXI
+...@item -machine

[Qemu-devel] [PATCH 1/2] early set current_machine

2010-03-24 Thread Glauber Costa

this way, the machine_init function itself can know which machine is current
in use, not only the late init code.
---
 vl.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/vl.c b/vl.c
index d69250c..ceddeac 100644
--- a/vl.c
+++ b/vl.c
@@ -4841,6 +4841,9 @@ int main(int argc, char **argv, char **envp)
 }
 qemu_add_globals();
 
+
+current_machine = machine;
+
 machine-init(ram_size, boot_devices,
   kernel_filename, kernel_cmdline, initrd_filename, cpu_model);
 
@@ -4859,8 +4862,6 @@ int main(int argc, char **argv, char **envp)
 }
 }
 
-current_machine = machine;
-
 /* init USB devices */
 if (usb_enabled) {
 if (foreach_device_config(DEV_USB, usb_parse)  0)
-- 
1.6.6

Re: [Qemu-devel] [PATCH 2/2] machine opts framework

2010-03-24 Thread Glauber Costa

On Wed, Mar 24, 2010 at 02:43:35PM -0500, Anthony Liguori wrote:
 On 03/24/2010 02:26 PM, Glauber Costa wrote:
 This patch adds initial support for the -machine option, that allows
 command line specification of machine attributes (always relying on safe
 defaults). Besides its value per-se, it is the saner way we found to
 allow for enabling/disabling of kvm's in-kernel irqchip.
 
 A machine with in-kernel-irqchip could be specified as:
  -machine irqchip=apic-kvm
 And one without it:
  -machine irqchip=apic
 
 To demonstrate how it'd work, this patch introduces a choice between
 pic and apic, pic being the old-style isa thing.
 
 I started from a different place.  See machine-qemuopts in my staging tree.
 
 I think we should combine efforts.
 
 Regards,
 

Absolutely. I see little overlap between what we did. Just a comment on yours:

-static void an5206_init(ram_addr_t ram_size,
- const char *boot_device,
- const char *kernel_filename, const char *kernel_cmdline,
- const char *initrd_filename, const char *cpu_model)
+static void an5206_init(QemuOpts *opts)
 {

Since we're changing init functions anyway, I believe we should also pass
a pointer to the machine structure. With that, we can avoing using the global
current_machine.

Re: [Qemu-devel] Re: qdev property bug?

2009-12-14 Thread Glauber Costa

On Mon, Dec 14, 2009 at 04:01:43PM +0200, Michael S. Tsirkin wrote:
 On Mon, Dec 14, 2009 at 02:35:31PM +0100, Alexander Graf wrote:
  Michael S. Tsirkin wrote:
   On Mon, Dec 14, 2009 at 12:55:28PM +0100, Alexander Graf wrote:
 
   Am 14.12.2009 um 11:59 schrieb Michael S. Tsirkin m...@redhat.com:
  
   
   On Mon, Dec 14, 2009 at 11:16:34AM +0100, Gerd Hoffmann wrote:
 
   On 12/14/09 10:44, Michael S. Tsirkin wrote:
   
   No, it did not even start booting the kernel. Just gave me blank  
   screen.
 
   [ testing ]
  
   Oh.  That is something completely different.  A bug in the rom  
   loader.
   It fails to fit both e1000 (default nic) and virtio-net boot roms  
   into
   the option rom area and bails out (before loading seabios).  vl.c
   doesn't check the return value and happily continues (without bios).
   Which doesn't work out very well ...
  
   With two identical nics the (single) rom fits and qemu boots.
  
   Hmm.  Of course vl.c must be fixed to check the return value.
   
   Yes.
  
 
   Not sure how to deal with the rom size issue.  The gPXE roms look  
   quit
   big compared to the older roms we had.
   
   Hmm, it's a regression then ...
 
   How does real hw handle this? I'm pretty sure most servers these days  
   use more option rom space than this. They usually have some onboard raid 
   bios, external storage, on-board nic, pci nic, ...
   
  
   Real hardware might do several things I know about
   - option rom is typically small.
   - option rom is not loaded always (BIOS option), or not for all cards.
   There are might be other tricks.
 
  
  There are probably other tricks. I was booting up a machine that had
  like 5 options roms going through their initialization that all weren't
  exactly small.
 
 This might boil down to better ROMs. E.g.
 maybe they request memory with PMM and then give it up
 after initialization?
That would be my guess.

However, gpxe rom are also used in real hardware. Maybe there
is some config trick we're missing?

Re: [Qemu-devel] Re: qdev property bug?

2009-12-14 Thread Glauber Costa

On Mon, Dec 14, 2009 at 04:11:43PM +0200, Michael S. Tsirkin wrote:
 On Mon, Dec 14, 2009 at 08:11:59AM -0600, Anthony Liguori wrote:
  Alexander Graf wrote:
  Michael S. Tsirkin wrote:

  On Mon, Dec 14, 2009 at 12:55:28PM +0100, Alexander Graf wrote:

  Am 14.12.2009 um 11:59 schrieb Michael S. Tsirkin m...@redhat.com:
 

  On Mon, Dec 14, 2009 at 11:16:34AM +0100, Gerd Hoffmann wrote:

  On 12/14/09 10:44, Michael S. Tsirkin wrote:

  No, it did not even start booting the kernel. Just gave me 
  blank  screen.

  [ testing ]
 
  Oh.  That is something completely different.  A bug in the rom  
  loader.
  It fails to fit both e1000 (default nic) and virtio-net boot 
  roms  into
  the option rom area and bails out (before loading seabios).  vl.c
  doesn't check the return value and happily continues (without bios).
  Which doesn't work out very well ...
 
  With two identical nics the (single) rom fits and qemu boots.
 
  Hmm.  Of course vl.c must be fixed to check the return value.

  Yes.
 

  Not sure how to deal with the rom size issue.  The gPXE roms 
  look  quit
  big compared to the older roms we had.

  Hmm, it's a regression then ...

  How does real hw handle this? I'm pretty sure most servers these 
  days  use more option rom space than this. They usually have some 
  onboard raid bios, external storage, on-board nic, pci nic, ...

  Real hardware might do several things I know about
  - option rom is typically small.
  - option rom is not loaded always (BIOS option), or not for all cards.
  There are might be other tricks.

 
  There are probably other tricks. I was booting up a machine that had
  like 5 options roms going through their initialization that all weren't
  exactly small.
 

  So there must be some way to just have more option rom space.   

  What do you mean?

 
  Well, what's keeping us from having 5 MB of option roms?

 
  For starters, option roms run in real mode when you only have 1MB of  
  addressable memory :-)
 
  Implementing anything else would just be a waste of time. It'd 
  break  again when ppl do device assignment.
 
  Alex

  We need some solution for 0.12 though IMO.
  This does not need to address device assignment,
  but it must be simple.

 
  Agreed. If there is a solution that gives us the chance to support an
  arbitrary number of option roms that wouldn't take forever to implement,
  I'd rather take that one though.

 
  For 0.12, we just need to fail gracefully (meaning stop loading option  
  roms when we run out of room).  It's not a regression compared to 0.11.
 
  Regards,
 
  Anthony Liguori
 
 
 Well I am pretty sure that I used virtio + e1000 with 0.11
 and apparently I can't now.
 So it does look like a regression to me ...

e1000 is the problem here, since it is by far the largest roms
of them all.

Re: [Qemu-devel] Re: qdev property bug?

2009-12-14 Thread Glauber Costa

On Mon, Dec 14, 2009 at 08:22:12AM -0600, Anthony Liguori wrote:
 Michael S. Tsirkin wrote:
 Well I am pretty sure that I used virtio + e1000 with 0.11
 and apparently I can't now.
 So it does look like a regression to me ...
 
 That's what I said, we should make sure that we stop loading roms
 when we run out of room as opposed to trampling over the bios space.

Can't we first load the bios?

Then we're pretty sure oproms will never trample it.

Re: [Qemu-devel] Spice project is now open

2009-12-11 Thread Glauber Costa


 Spice is a library, it is library for remote display, it handle by
 itself all the connection between the spice client to the host that
 run the guest, it include:
 sound, display, keyboard, usb, network tunneling (for printers) and so
 on...


 I want to add that qemu is not the sole user of spice, Spice will be
 used as a protocol to connect into physical windows/linux machines

 So how can we change the library just for qemu?

I don't fully understand spice yet, but what's the difficulty here?
libraries changes every single day to try to acomodate for the needs
of specific users, be it through generalizations, shims, or whatever.

This is just another day in the OSS world.


-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] Spice project is now open

2009-12-11 Thread Glauber Costa

 I think we should allow freedom of choice to the users to decide what
 protcol they want to use, Spice and VNC are all diffrent and were born
 to meet diffrent goals.

 I would happy to answer more questions if anyone have

 I think the simple point is that, AFAICS, the spice folks are expecting
 the qemu team to integrate their big ugly tarball, instead of doing what
 everyone else does, which is forward port everything to current head
 and then provide a current set of patches against GIT head.

 Anything else is just a waste of time. The paths both projects are
 at are too far apart.


More important than forward porting, is respecting the design decisions
a huge community has agreed upon. Of course, when you become part
of that community, you can try to shift these designs towards your goals,
but trying to force them is just ridiculous.

That said, I do believe spice can play a essential role in making us go
forward, but the attitude towards it has to change quite a bit.

-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] Spice project is now open

2009-12-11 Thread Glauber Costa

On Fri, Dec 11, 2009 at 5:22 PM, Izik Eidus iei...@redhat.com wrote:
 On Fri, 11 Dec 2009 13:06:47 -0600
 Anthony Liguori anth...@codemonkey.ws wrote:

 Izik Eidus wrote:
  I want to add that qemu is not the sole user of spice, Spice will be
  used as a protocol to connect into physical windows/linux
  machines
 
  So how can we change the library just for qemu?
 
 A library is not necessarily a problem.

 What would be a probably is if the library maintains guest visible
 state.  There are a lot of advantages to keeping qemu as the sole
 maintainer of guest visible state as it simplifies things like live
 migration.  More importantly, it allows us to do things like Avi's
 suggested security sandboxing using seccomp().  For that to work, we
 need to make sure that we can isolate any code that interacts
 directly with the guest.

 Spice guest visible state inside qemu is just its PCI QXL device.
 This part is qemu specificed.


But this part can work together with vnc with no problems, right?
If this is so, why don't we just start by merging it, while trying
to make the case for the rendering protocol in parallel ?

-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] [PATCH 00/17] pci: switch a ton of drivers to symbolic names

2009-12-10 Thread Glauber Costa

On Thu, Dec 10, 2009 at 4:09 PM, Michael S. Tsirkin m...@redhat.com wrote:
 The recent e1000 bug made the important of using
 symbolic macros for pci config access clear for me.
 So I started going over drivers and converting
 to symbolic constants instead of hard-coded ones.
 I did a large part until I run out of steam.
 Maybe some brave soul will take up converting
 the rest of them, or maybe I will: note that
 when converting bridges one should be careful
 to use bridge macros where appropriate.

 Instead of testing a huge number of configurations,
 I compared binaries before and after conversion.
 Almost all of them generate exact same stripped binary
 before and after the change.
 The only object changed was eepro100, objdump showed
 that the change was because gcc for some reason
 decides to use a bit more stack for init function
 after comments are added there.

 This methodology was the reason that I added TODOs where I saw
 deviations from spec or other code ugliness, will have to be fixed
 separately.


IMHO, this is a huge enhancement.

I myself was found expending huge amounts of time trying do figure out the
meaning of some specific constants in the past.

+1

-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] Using qemu-thread for synchronization

2009-12-10 Thread Glauber Costa

On Thu, Dec 10, 2009 at 12:46 PM, Liran Schour lir...@il.ibm.com wrote:

 I want to be able to synchronize between the code that is running the live
 migration with the code that call fro the completion callback of async IO.
 For that I am using qemu-thread.c (i.e QemuCond). I see that I have
 problems while linking if I do not use --enable-io-thread.
 Can someone explain me this. And if I want to do the above what is the
 right way to do that?

qemu_thread is only available when you use --enable-io-thread.

But beware, it is not 100 % stable yet. I had a queue of patches
that makes it slightly better, tough...


-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] [PATCH] Permit zero-sized qemu_malloc() friends

2009-12-07 Thread Glauber Costa

 So the only correct way would be to write:

 array = malloc(size);
 if (array == NULL  size != 0) {
   return -ENOMEM;
 }


Of course we can differentiate. A failed malloc will abort, a successful one
will not.

 If you were writing portable code.  But that's not what people write.  You
 can argue that qemu_malloc() can have any semantics we want and while that's
 true, it doesn't change my above statement.  I think the main argument for
 this behavior in qemu is that people are used to using this idiom with
 malloc() but it's a non-portable practice.

 If qemu_malloc() didn't carry the name malloc() then semantics with size=0
 would be a different discussion.  But so far, all qemu_* functions tend to
 behave almost exactly like their C counterparts.  Relying on the result of
 size=0 with malloc() is broken.

We can change qemu_malloc to qemu_alloc_memory(), or whatever.
But from the moment we do things like abort on failing, we are already
deviating from its C counterpart.

-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] Re: [PATCH 0/9] in-kernel irqchip, new spin

2009-12-03 Thread Glauber Costa


 Does msix work with this patchset when in-kernel irqchip
 is enabled?

Haven't tested. But since I see that msix need some special code in qemu-kvm,
it probably won't. But I assume we can just add a patch ontop of this to add
that code and make it work, right?

-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] Re: [PATCH 0/9] in-kernel irqchip, new spin

2009-12-03 Thread Glauber Costa

 Haven't tested. But since I see that msix need some special code in qemu-kvm,
 it probably won't. But I assume we can just add a patch ontop of this to add
 that code and make it work, right?

 Sure. However - are you making in-kernel irqchip the default?
 If so, I think the best way to do it would be:
 1. add in-kernel irqchip, off by default
 2. fix msix with in-kernel
 3. make in-kernel on by default

Right now there are no knobs to disable it, since last time I checked,
people were inclined to
solve that by adding a machine type that does not do irqchip in
kernel, if wanted.

However, since it is probably not going to reach 0.12 anyway, we can
come up with a patch
that fixes msix, and bundle in this series before I actually enable
the irqchip (which is one
of the last patches)

-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] Re: [PATCH 04/11] qemu_flush_work for remote vcpu execution

2009-12-03 Thread Glauber Costa

On Thu, Dec 3, 2009 at 10:20 AM, Avi Kivity a...@redhat.com wrote:
 On 12/02/2009 03:48 PM, Glauber Costa wrote:

 This function is similar to qemu-kvm's on_vcpu mechanism.

 Is similar?  You're replacing on_vcpu().
Yeah, it began similar, now it is pretty much the same thing, but
using qemu-specific
data structures


 Totally synchronous,
 and guarantees that a given function will be executed at the specified
 vcpu.

 This patch also convert usage within the breakpoints system

 +void qemu_queue_work(CPUState *env, void (*func)(void *data), void
 *data);


 The name suggests that it is asynchronous.

 Why is this patch necessary?

to keep gdbstub working.


 --
 error compiling committee.c: too many arguments to function







-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] Re: [PATCH 11/11] remove smp restriction from kvm

2009-12-03 Thread Glauber Costa

On Thu, Dec 3, 2009 at 10:23 AM, Avi Kivity a...@redhat.com wrote:
 On 12/02/2009 03:48 PM, Glauber Costa wrote:

 We don't support smp without irqchip in kernel, so only abort in
 that situation



 What's the reason for this restriction?

It is temporary. But as far as my testing goes, we don't come even close to
working without in-kernel irqchip.


-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] Re: [PATCH 04/11] qemu_flush_work for remote vcpu execution

2009-12-03 Thread Glauber Costa


 Keep the name then.  The new name is misleading.

ok.

 Totally synchronous,
 and guarantees that a given function will be executed at the specified
 vcpu.

 This patch also convert usage within the breakpoints system

 +void qemu_queue_work(CPUState *env, void (*func)(void *data), void
 *data);



 The name suggests that it is asynchronous.

 Why is this patch necessary?


 to keep gdbstub working.


 Because it fixes things.

 Please elaborate.


gdbstub is called from the i/o thread , and call vcpu ioctls. So it
has to use the on_vcpu
mechanism to guarantee its execution in the right thread.

What I meant is that currently, gdbstub is the only user of it, at
least in qemu.git

-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] Re: [PATCH 11/11] remove smp restriction from kvm

2009-12-03 Thread Glauber Costa

On Thu, Dec 3, 2009 at 10:37 AM, Avi Kivity a...@redhat.com wrote:
 On 12/03/2009 02:36 PM, Glauber Costa wrote:

 On Thu, Dec 3, 2009 at 10:23 AM, Avi Kivitya...@redhat.com  wrote:


 On 12/02/2009 03:48 PM, Glauber Costa wrote:


 We don't support smp without irqchip in kernel, so only abort in
 that situation



 What's the reason for this restriction?


 It is temporary. But as far as my testing goes, we don't come even close
 to
 working without in-kernel irqchip.


 This works well in qemu-kvm.  What's the reason?  it may indicate a bug in
 up.

The reason is that we do not handle SIPI in qemu.git yet. Now that the basics
of smp are working, it should not be too difficult to get it working.

-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] Re: [PATCH 0/9] in-kernel irqchip, new spin

2009-12-03 Thread Glauber Costa

On Thu, Dec 3, 2009 at 10:45 AM, Avi Kivity a...@redhat.com wrote:
 On 12/03/2009 02:39 PM, Michael S. Tsirkin wrote:

 Right now there are no knobs to disable it, since last time I checked,
 people were inclined to
 solve that by adding a machine type that does not do irqchip in
 kernel, if wanted.


 Can't everything doable by machine type also doable from
 command line switch? Disabling irqchip used to be
 a valuable debugging tool.



 Even worse, if we throw everything into the the machine type, we end up with
 2^n machine types in order to control everything.

I am totally fine re-including the cmdline patch in the series.



-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] Re: [PATCH 04/11] qemu_flush_work for remote vcpu execution

2009-12-03 Thread Glauber Costa

On Thu, Dec 3, 2009 at 12:29 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 12/02/2009 02:48 PM, Glauber Costa wrote:

 +    if (env == qemu_get_current_env()) {

 Will always be false for TCG + iothread.  What's wrong with
 qemu_cpu_self(env)?  It appears to do the same, and it would also make the
 whole thread-local storage stuff redundant.

Indeed. I'll turn to qemu_cpu_self. thanks


-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] [PATCH v2 05/11] tell kernel about all registers instead of just mp_state

2009-12-02 Thread Glauber Costa

 +{
 +    if (kvm_enabled()) {
 Is this ever called or intended to be called when kvm is disabled?

 +        kvm_cpu_flush_state(env);
 +    }

I don't think so. But this is here for consistency with its synchronize brother.


-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] [PATCH v2 0/11]

2009-12-02 Thread Glauber Costa

On Wed, Dec 2, 2009 at 8:59 AM, Gleb Natapov g...@redhat.com wrote:
 On Tue, Dec 01, 2009 at 10:51:26AM -0200, Glauber Costa wrote:
 This is a repost of the -smp series. Note that it depends on 
 irqchip-in-kernel,
 that is already in staging. Also, you'll have to enable the io-thread, for 
 the time
 being.

 From the last version, main change is that I am not calling queue_work 
 automatically
 from vcpu ioctls. queue_work is only used currently for the gdb stub.

 All other uses were by-passed by the new qemu_register_vcpu_reset(), since 
 most
 of it uses (all racy) came from the reset handlers.

 Looks good to me except one thing. I don't see how you are addressing
 the problem fixed by commit b8a7857071b477b28d3055e33ff0298fc91f329a
 in qemu-kvm. The problem is that mp_state can change in kernel between
 call to kvm_cpu_synchronize_state() and vcpu re-entry. In this case old
 mp_state will overwrite new one.

I plan to do it in a patch that already lives on my tree. It basically flushes
register state in early post_load


-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.

Re: [Qemu-devel] [PATCH v2 04/11] qemu_flush_work for remote vcpu execution

2009-12-02 Thread Glauber Costa

On Wed, Dec 02, 2009 at 11:27:53AM -0200, Marcelo Tosatti wrote:
 On Tue, Dec 01, 2009 at 10:51:30AM -0200, Glauber Costa wrote:
  This function is similar to qemu-kvm's on_vcpu mechanism. Totally 
  synchronous,
  and guarantees that a given function will be executed at the specified vcpu.
  
  This patch also convert usage within the breakpoints system
  
  Signed-off-by: Glauber Costa glom...@redhat.com
  ---
 
  @@ -3436,8 +3441,7 @@ static int tcg_has_work(void);
   
   static pthread_key_t current_env;
   
  -CPUState *qemu_get_current_env(void);
  -CPUState *qemu_get_current_env(void)
  +static CPUState *qemu_get_current_env(void)
   {
   return pthread_getspecific(current_env);
   }
  @@ -3474,8 +3478,10 @@ static int qemu_init_main_loop(void)
   
   static void qemu_wait_io_event(CPUState *env)
   {
  -while (!tcg_has_work())
  +while (!tcg_has_work()) {
 
 This checks all cpus, while for KVM it should check only 
 the current cpu.
 
  +qemu_flush_work(env);
   qemu_cond_timedwait(env-halt_cond, qemu_global_mutex, 1000);
  +}
 
 KVM vcpu threads should block SIGUSR1, set the in-kernel signal mask 
 with KVM_SET_SIGNAL_MASK ioctl, and eat the signal in
 qemu_wait_io_event (qemu_flush_work should run after eating
 the signal). Similarly to qemu-kvm's kvm_main_loop_wait.
 
 Otherwise a vcpu thread can lose a signal (say handle SIGUSR1 when not
 holding qemu_global_mutex before kernel entry).
 
 I think this the source of the problems patch 8 attempts to fix.

I remember you had patches for both theses fixes.

Can you resend them, ontop of this series?

[Qemu-devel] [PATCH 0/9] in-kernel irqchip, new spin

2009-12-02 Thread Glauber Costa

Avi/Marcelo

Please include this in a branch in qemu-kvm, for future inclusion
in qemu.git

This is basically the same thing that was sitting in staging for weeks,
just with a build bug fix on non-x86 hosts.

Glauber Costa (9):
  introduce VMSTATE_U64
  Provide ioapic-kvm
  provide apic_set_irq_delivered
  provide i8259-kvm
  Don't call apic functions directly from kvm code
  export kvm_put_mp_state
  provide apic-kvm
  Initialize in-kernel irqchip
  Do GSI routing

 Makefile.target   |2 +
 hw/apic-kvm.c |  157 
 hw/apic.c |5 ++
 hw/hw.h   |   24 +++
 hw/i8259-kvm.c|  112 +++
 hw/ioapic-kvm.c   |   89 +
 hw/pc.c   |   21 +-
 hw/pc.h   |6 ++
 kvm-all.c |   47 +-
 kvm.h |   16 +
 savevm.c  |   23 +++
 target-i386/cpu.h |9 +++
 target-i386/kvm.c |  188 +++--
 target-ppc/kvm.c  |5 ++
 14 files changed, 693 insertions(+), 11 deletions(-)
 create mode 100644 hw/apic-kvm.c
 create mode 100644 hw/i8259-kvm.c
 create mode 100644 hw/ioapic-kvm.c

[Qemu-devel] [PATCH 1/9] introduce VMSTATE_U64

2009-12-02 Thread Glauber Costa

Slightly modified version of a patch already included in qemu-kvm:

This is a patch actually written by Juan, which, according to him,
he plans on posting to qemu.git. Problem is that linux defines
u64 in a way that is type-uncompatible with uint64_t.

I am including it here, because it is a dependency to my patch series
that follows.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 hw/hw.h  |   24 
 savevm.c |   23 +++
 2 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 7889aa3..e8a53f9 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -337,6 +337,11 @@ extern const VMStateInfo vmstate_info_uint16;
 extern const VMStateInfo vmstate_info_uint32;
 extern const VMStateInfo vmstate_info_uint64;
 
+#ifdef __linux__
+#include linux/types.h
+extern const VMStateInfo vmstate_info_u64;
+#endif
+
 extern const VMStateInfo vmstate_info_timer;
 extern const VMStateInfo vmstate_info_ptimer;
 extern const VMStateInfo vmstate_info_buffer;
@@ -393,6 +398,16 @@ extern const VMStateInfo vmstate_info_unused_buffer;
 .offset = vmstate_offset_array(_state, _field, _type, _num), \
 }
 
+#define VMSTATE_ARRAY_UNSAFE(_field, _state, _num, _version, _info, _type) {\
+.name   = (stringify(_field)),   \
+.version_id = (_version),\
+.num= (_num),\
+.info   = (_info),  \
+.size   = sizeof(_type), \
+.flags  = VMS_ARRAY, \
+.offset = offsetof(_state, _field)   \
+}
+
 #define VMSTATE_ARRAY_TEST(_field, _state, _num, _test, _info, _type) {\
 .name = (stringify(_field)),  \
 .field_exists = (_test),  \
@@ -596,6 +611,15 @@ extern const VMStateDescription vmstate_i2c_slave;
 #define VMSTATE_UINT64(_f, _s)\
 VMSTATE_UINT64_V(_f, _s, 0)
 
+/* This is needed because on linux __u64 is unsigned long long
+   and on glibc uint64_t is unsigned long on 64 bits */
+#ifdef __linux__
+#define VMSTATE_U64_V(_f, _s, _v) \
+VMSTATE_SINGLE(_f, _s, _v, vmstate_info_u64, __u64)
+#define VMSTATE_U64(_f, _s)   \
+VMSTATE_U64_V(_f, _s, 0)
+#endif
+
 #define VMSTATE_UINT8_EQUAL(_f, _s)   \
 VMSTATE_SINGLE(_f, _s, 0, vmstate_info_uint8_equal, uint8_t)
 
diff --git a/savevm.c b/savevm.c
index 18c2e54..7d68ec0 100644
--- a/savevm.c
+++ b/savevm.c
@@ -872,6 +872,29 @@ const VMStateInfo vmstate_info_uint64 = {
 .put  = put_uint64,
 };
 
+/* 64 bit linux kernel unsigned int */
+
+#ifdef __linux__
+static int get_u64(QEMUFile *f, void *pv, size_t size)
+{
+__u64 *v = pv;
+qemu_get_be64s(f, (uint64_t *)v);
+return 0;
+}
+
+static void put_u64(QEMUFile *f, void *pv, size_t size)
+{
+__u64 *v = pv;
+qemu_put_be64s(f, (uint64_t *)v);
+}
+
+const VMStateInfo vmstate_info_u64 = {
+.name = __u64,
+.get  = get_u64,
+.put  = put_u64,
+};
+#endif /* __linux__ */
+
 /* 8 bit int. See that the received value is the same than the one
in the field */
 
-- 
1.6.5.2

[Qemu-devel] [PATCH 2/9] Provide ioapic-kvm

2009-12-02 Thread Glauber Costa

This patch provides the file ioapic-kvm.c, which implements a schim over
the kvm in-kernel IO APIC.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 Makefile.target |2 +
 hw/ioapic-kvm.c |   89 +++
 hw/pc.c |7 -
 hw/pc.h |2 +
 kvm-all.c   |   20 
 kvm.h   |4 ++
 6 files changed, 123 insertions(+), 1 deletions(-)
 create mode 100644 hw/ioapic-kvm.c

diff --git a/Makefile.target b/Makefile.target
index 2985658..844ba46 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -199,6 +199,8 @@ obj-i386-y += usb-uhci.o vmmouse.o vmport.o vmware_vga.o 
hpet.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += ne2000-isa.o
 
+obj-i386-$(CONFIG_KVM) += ioapic-kvm.o
+
 # shared objects
 obj-ppc-y = ppc.o ide/core.o ide/qdev.o ide/isa.o ide/pci.o ide/macio.o
 obj-ppc-y += ide/cmd646.o
diff --git a/hw/ioapic-kvm.c b/hw/ioapic-kvm.c
new file mode 100644
index 000..78e0984
--- /dev/null
+++ b/hw/ioapic-kvm.c
@@ -0,0 +1,89 @@
+#include hw.h
+#include pc.h
+#include qemu-timer.h
+#include host-utils.h
+#include kvm.h
+
+#define IOAPIC_NUM_PINS0x18
+#define IOAPIC_DEFAULT_BASE_ADDRESS  0xfec0
+
+static void ioapic_reset(void *opaque)
+{
+struct kvm_ioapic_state *s = opaque;
+struct kvm_irqchip *chip;
+int i;
+
+chip = container_of(s, struct kvm_irqchip, chip.ioapic);
+
+chip-chip_id = KVM_IRQCHIP_IOAPIC;
+
+memset(s, 0, sizeof(*s));
+s-base_address = IOAPIC_DEFAULT_BASE_ADDRESS;
+for(i = 0; i  IOAPIC_NUM_PINS; i++)
+s-redirtbl[i].bits = 1  16; /* mask LVT */
+
+kvm_set_irqchip(chip);
+}
+
+static void ioapic_pre_save(void *opaque)
+{
+struct kvm_ioapic_state *s = opaque;
+struct kvm_irqchip *chip;
+
+chip = container_of(s, struct kvm_irqchip, chip.ioapic);
+
+kvm_get_irqchip(chip);
+}
+
+static int ioapic_post_load(void *opaque, int version_id)
+{
+struct kvm_ioapic_state *s = opaque;
+struct kvm_irqchip *chip;
+
+chip = container_of(s, struct kvm_irqchip, chip.ioapic);
+
+return kvm_set_irqchip(chip);
+}
+
+static const VMStateDescription vmstate_kvm_ioapic = {
+.name = ioapic-kvm,
+.version_id = 1,
+.minimum_version_id = 1,
+.post_load = ioapic_post_load,
+.pre_save = ioapic_pre_save,
+.fields  = (VMStateField []) {
+/* Linux does not define __u64 the same as uint64_t */
+VMSTATE_U64(base_address, struct kvm_ioapic_state),
+VMSTATE_UINT32(id, struct kvm_ioapic_state),
+VMSTATE_UINT32(ioregsel, struct kvm_ioapic_state),
+VMSTATE_UINT32(irr, struct kvm_ioapic_state),
+VMSTATE_ARRAY_UNSAFE(redirtbl, struct kvm_ioapic_state, 
IOAPIC_NUM_PINS, 0, vmstate_info_u64, __u64),
+VMSTATE_END_OF_LIST()
+}
+};
+
+
+static void kvm_ioapic_set_irq(void *opaque, int vector, int level)
+{
+/*
+int pic_ret;
+
+if (kvm_set_irq(vector, level, pic_ret)) {
+if (pic_ret != 0)
+apic_set_irq_delivered();
+return;
+}
+*/
+}
+
+qemu_irq *kvm_ioapic_init(void)
+{
+struct kvm_irqchip *s;
+
+s = qemu_mallocz(sizeof(*s));
+
+vmstate_register(0, vmstate_kvm_ioapic, s-chip.ioapic);
+qemu_register_reset(ioapic_reset, s-chip.ioapic);
+
+return qemu_allocate_irqs(kvm_ioapic_set_irq, s-chip.ioapic, 
IOAPIC_NUM_PINS);
+}
diff --git a/hw/pc.c b/hw/pc.c
index 97964b2..7bff913 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -44,6 +44,7 @@
 #include ide.h
 #include loader.h
 #include elf.h
+#include kvm.h
 
 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
@@ -1144,7 +1145,11 @@ static void pc_init1(ram_addr_t ram_size,
 register_ioport_write(0x92, 1, 1, ioport92_write, NULL);
 
 if (pci_enabled) {
-isa_irq_state-ioapic = ioapic_init();
+if (kvm_enabled()  kvm_irqchip_in_kernel()) {
+isa_irq_state-ioapic = kvm_ioapic_init();
+} else {
+isa_irq_state-ioapic = ioapic_init();
+}
 }
 pit = pit_init(0x40, isa_reserve_irq(0));
 pcspk_init(pit);
diff --git a/hw/pc.h b/hw/pc.h
index 03ffc91..a3ad931 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -49,6 +49,8 @@ void ioapic_set_irq(void *opaque, int vector, int level);
 void apic_reset_irq_delivered(void);
 int apic_get_irq_delivered(void);
 
+qemu_irq *kvm_ioapic_init(void);
+
 /* i8254.c */
 
 #define PIT_FREQ 1193182
diff --git a/kvm-all.c b/kvm-all.c
index b605caa..35230c1 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -393,6 +393,26 @@ int kvm_check_extension(KVMState *s, unsigned int 
extension)
 return ret;
 }
 
+#ifdef KVM_CAP_IRQCHIP
+int kvm_set_irqchip(struct kvm_irqchip *chip)
+{
+if (!kvm_state-irqchip_in_kernel) {
+return 0;
+}
+
+return kvm_vm_ioctl(kvm_state, KVM_SET_IRQCHIP, chip);
+}
+
+int kvm_get_irqchip(struct kvm_irqchip *chip)
+{
+if (!kvm_state-irqchip_in_kernel) {
+return 0;
+}
+
+return

[Qemu-devel] [PATCH 3/9] provide apic_set_irq_delivered

2009-12-02 Thread Glauber Costa

i8259 chip will use it, so provide it, and export it through pc.h

Signed-off-by: Glauber Costa glom...@redhat.com
---
 hw/apic.c |5 +
 hw/pc.h   |1 +
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index 87e7dc0..482bb1e 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -388,6 +388,11 @@ void apic_reset_irq_delivered(void)
 apic_irq_delivered = 0;
 }
 
+void apic_set_irq_delivered(void)
+{
+apic_irq_delivered = 1;
+}
+
 int apic_get_irq_delivered(void)
 {
 return apic_irq_delivered;
diff --git a/hw/pc.h b/hw/pc.h
index a3ad931..4c9b4c3 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -47,6 +47,7 @@ int apic_get_interrupt(CPUState *env);
 qemu_irq *ioapic_init(void);
 void ioapic_set_irq(void *opaque, int vector, int level);
 void apic_reset_irq_delivered(void);
+void apic_set_irq_delivered(void);
 int apic_get_irq_delivered(void);
 
 qemu_irq *kvm_ioapic_init(void);
-- 
1.6.5.2

[Qemu-devel] [PATCH 4/9] provide i8259-kvm

2009-12-02 Thread Glauber Costa

This patch provides the file i8259-kvm.c, which implements a schim over
the kvm in-kernel PIC.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 Makefile.target   |2 +-
 hw/i8259-kvm.c|  112 +
 hw/pc.c   |8 +++-
 hw/pc.h   |1 +
 kvm-all.c |7 ++-
 kvm.h |4 ++
 target-i386/kvm.c |   26 
 target-ppc/kvm.c  |5 ++
 8 files changed, 161 insertions(+), 4 deletions(-)
 create mode 100644 hw/i8259-kvm.c

diff --git a/Makefile.target b/Makefile.target
index 844ba46..8fb68a4 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -199,7 +199,7 @@ obj-i386-y += usb-uhci.o vmmouse.o vmport.o vmware_vga.o 
hpet.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += ne2000-isa.o
 
-obj-i386-$(CONFIG_KVM) += ioapic-kvm.o
+obj-i386-$(CONFIG_KVM) += ioapic-kvm.o i8259-kvm.o
 
 # shared objects
 obj-ppc-y = ppc.o ide/core.o ide/qdev.o ide/isa.o ide/pci.o ide/macio.o
diff --git a/hw/i8259-kvm.c b/hw/i8259-kvm.c
new file mode 100644
index 000..bd6387b
--- /dev/null
+++ b/hw/i8259-kvm.c
@@ -0,0 +1,112 @@
+#include hw.h
+#include pc.h
+#include isa.h
+#include monitor.h
+#include qemu-timer.h
+#include kvm.h
+
+static void kvm_i8259_set_irq(void *opaque, int irq, int level)
+{
+int pic_ret;
+
+if (kvm_set_irq(irq, level, pic_ret)) {
+if (pic_ret != 0)
+/* In theory, we should not be using any apic state, but we need
+ * to warn devices such as the rtc about state of delivery. Since 
this
+ * one is just a marker, it is no big deal */
+apic_set_irq_delivered();
+return;
+}
+}
+
+static void kvm_pic_reset(void *opaque)
+{
+struct kvm_pic_state *s = opaque;
+struct kvm_irqchip *chip;
+
+s-last_irr = 0;
+s-irr = 0;
+s-imr = 0;
+s-isr = 0;
+s-priority_add = 0;
+s-irq_base = 0;
+s-read_reg_select = 0;
+s-poll = 0;
+s-special_mask = 0;
+s-init_state = 0;
+s-auto_eoi = 0;
+s-rotate_on_auto_eoi = 0;
+s-special_fully_nested_mode = 0;
+s-init4 = 0;
+
+chip = container_of(s, struct kvm_irqchip, chip.pic);
+kvm_set_irqchip(chip);
+}
+
+static void pic_pre_save(void *opaque)
+{
+struct kvm_pic_state *s = opaque;
+struct kvm_irqchip *chip;
+
+chip = container_of(s, struct kvm_irqchip, chip.pic);
+
+kvm_get_irqchip(chip);
+}
+
+static int pic_post_load(void *opaque, int version_id)
+{
+struct kvm_pic_state *s = opaque;
+struct kvm_irqchip *chip;
+
+chip = container_of(s, struct kvm_irqchip, chip.pic);
+
+return kvm_set_irqchip(chip);
+}
+
+static const VMStateDescription vmstate_kvm_pic = {
+.name = i8259-kvm,
+.version_id = 1,
+.pre_save = pic_pre_save,
+.post_load = pic_post_load,
+.minimum_version_id = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT8(last_irr, struct kvm_pic_state),
+VMSTATE_UINT8(irr, struct kvm_pic_state),
+VMSTATE_UINT8(imr, struct kvm_pic_state),
+VMSTATE_UINT8(isr, struct kvm_pic_state),
+VMSTATE_UINT8(priority_add, struct kvm_pic_state),
+VMSTATE_UINT8(irq_base, struct kvm_pic_state),
+VMSTATE_UINT8(read_reg_select, struct kvm_pic_state),
+VMSTATE_UINT8(poll, struct kvm_pic_state),
+VMSTATE_UINT8(special_mask, struct kvm_pic_state),
+VMSTATE_UINT8(init_state, struct kvm_pic_state),
+VMSTATE_UINT8(auto_eoi, struct kvm_pic_state),
+VMSTATE_UINT8(rotate_on_auto_eoi, struct kvm_pic_state),
+VMSTATE_UINT8(special_fully_nested_mode, struct kvm_pic_state),
+VMSTATE_UINT8(init4, struct kvm_pic_state),
+VMSTATE_UINT8(elcr, struct kvm_pic_state),
+VMSTATE_END_OF_LIST()
+}
+};
+
+static void kvm_pic_init1(int io_addr, struct kvm_pic_state *s)
+{
+vmstate_register(io_addr, vmstate_kvm_pic, s);
+qemu_register_reset(kvm_pic_reset, s);
+}
+
+qemu_irq *kvm_i8259_init(qemu_irq parent_irq)
+{
+struct kvm_irqchip *master, *slave;
+
+master = qemu_mallocz(sizeof(*master));
+slave = qemu_mallocz(sizeof(*slave));
+
+master-chip_id = KVM_IRQCHIP_PIC_MASTER;
+slave-chip_id = KVM_IRQCHIP_PIC_SLAVE;
+
+kvm_pic_init1(0x20, master-chip.pic);
+kvm_pic_init1(0xa0, slave-chip.pic);
+
+return qemu_allocate_irqs(kvm_i8259_set_irq, master, 16);
+}
diff --git a/hw/pc.c b/hw/pc.c
index 7bff913..a58562a 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1098,8 +1098,14 @@ static void pc_init1(ram_addr_t ram_size,
 }
 
 cpu_irq = qemu_allocate_irqs(pic_irq_request, NULL, 1);
-i8259 = i8259_init(cpu_irq[0]);
 isa_irq_state = qemu_mallocz(sizeof(*isa_irq_state));
+
+if (kvm_enabled()  kvm_irqchip_in_kernel()) {
+i8259 = kvm_i8259_init(cpu_irq[0]);
+} else {
+i8259 = i8259_init(cpu_irq[0]);
+}
+
 isa_irq_state-i8259 = i8259;
 isa_irq = qemu_allocate_irqs(isa_irq_handler, isa_irq_state

[Qemu-devel] [PATCH 5/9] Don't call apic functions directly from kvm code

2009-12-02 Thread Glauber Costa

It is actually not necessary to call a tpr function to save and load cr8,
as cr8 is part of the processor state, and thus, it is much easier
to just add it to CPUState.

As for apic base, wrap kvm usages, so we can call either the qemu device,
or the in kernel version.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 target-i386/cpu.h |1 +
 target-i386/kvm.c |   25 +++--
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index eb9532a..7c4fa47 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -606,6 +606,7 @@ typedef struct CPUX86State {
 SegmentCache idt; /* only base and limit are used */
 
 target_ulong cr[5]; /* NOTE: cr1 is unused */
+target_ulong cr8;
 int32_t a20_mask;
 
 /* FPU state */
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index cf42f85..78fc941 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -344,6 +344,19 @@ static void get_seg(SegmentCache *lhs, const struct 
kvm_segment *rhs)
| (rhs-avl * DESC_AVL_MASK);
 }
 
+static void kvm_set_apic_base(CPUState *env, uint64_t val)
+{
+if (!kvm_irqchip_in_kernel())
+cpu_set_apic_base(env, val);
+}
+
+static uint64_t kvm_get_apic_base(CPUState *env)
+{
+if (!kvm_irqchip_in_kernel())
+return cpu_get_apic_base(env);
+return 0;
+}
+
 static void kvm_getput_reg(__u64 *kvm_reg, target_ulong *qemu_reg, int set)
 {
 if (set)
@@ -455,8 +468,8 @@ static int kvm_put_sregs(CPUState *env)
 sregs.cr3 = env-cr[3];
 sregs.cr4 = env-cr[4];
 
-sregs.cr8 = cpu_get_apic_tpr(env);
-sregs.apic_base = cpu_get_apic_base(env);
+sregs.cr8 = env-cr8;
+sregs.apic_base = kvm_get_apic_base(env);
 
 sregs.efer = env-efer;
 
@@ -561,7 +574,7 @@ static int kvm_get_sregs(CPUState *env)
 env-cr[3] = sregs.cr3;
 env-cr[4] = sregs.cr4;
 
-cpu_set_apic_base(env, sregs.apic_base);
+kvm_set_apic_base(env, sregs.apic_base);
 
 env-efer = sregs.efer;
 //cpu_set_apic_tpr(env, sregs.cr8);
@@ -777,7 +790,7 @@ int kvm_arch_pre_run(CPUState *env, struct kvm_run *run)
 run-request_interrupt_window = 0;
 
 dprintf(setting tpr\n);
-run-cr8 = cpu_get_apic_tpr(env);
+run-cr8 = env-cr8;
 
 return 0;
 }
@@ -789,8 +802,8 @@ int kvm_arch_post_run(CPUState *env, struct kvm_run *run)
 else
 env-eflags = ~IF_MASK;
 
-cpu_set_apic_tpr(env, run-cr8);
-cpu_set_apic_base(env, run-apic_base);
+env-cr8 = run-cr8;
+kvm_set_apic_base(env, run-apic_base);
 
 return 0;
 }
-- 
1.6.5.2

[Qemu-devel] [PATCH 7/9] provide apic-kvm

2009-12-02 Thread Glauber Costa

This patch provides the file apic-kvm.c, which implements a schim over
the kvm in-kernel APIC.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 Makefile.target   |2 +-
 hw/apic-kvm.c |  157 +
 hw/pc.c   |6 ++-
 hw/pc.h   |2 +
 kvm.h |5 ++
 target-i386/cpu.h |4 ++
 target-i386/kvm.c |   19 ++-
 7 files changed, 192 insertions(+), 3 deletions(-)
 create mode 100644 hw/apic-kvm.c

diff --git a/Makefile.target b/Makefile.target
index 8fb68a4..c23aeb3 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -199,7 +199,7 @@ obj-i386-y += usb-uhci.o vmmouse.o vmport.o vmware_vga.o 
hpet.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += ne2000-isa.o
 
-obj-i386-$(CONFIG_KVM) += ioapic-kvm.o i8259-kvm.o
+obj-i386-$(CONFIG_KVM) += ioapic-kvm.o i8259-kvm.o apic-kvm.o
 
 # shared objects
 obj-ppc-y = ppc.o ide/core.o ide/qdev.o ide/isa.o ide/pci.o ide/macio.o
diff --git a/hw/apic-kvm.c b/hw/apic-kvm.c
new file mode 100644
index 000..089fa45
--- /dev/null
+++ b/hw/apic-kvm.c
@@ -0,0 +1,157 @@
+#include hw.h
+#include pc.h
+#include pci.h
+#include msix.h
+#include qemu-timer.h
+#include host-utils.h
+#include kvm.h
+
+#define APIC_LVT_NB  6
+#define APIC_LVT_LINT0   3
+
+struct qemu_lapic_state {
+uint32_t apicbase;
+uint8_t id;
+uint8_t arb_id;
+uint8_t tpr;
+uint32_t spurious_vec;
+uint8_t log_dest;
+uint8_t dest_mode;
+uint32_t isr[8];  /* in service register */
+uint32_t tmr[8];  /* trigger mode register */
+uint32_t irr[8]; /* interrupt request register */
+uint32_t lvt[APIC_LVT_NB];
+uint32_t esr; /* error register */
+uint32_t icr[2];
+
+uint32_t divide_conf;
+int count_shift;
+uint32_t initial_count;
+int64_t initial_count_load_time, next_time;
+uint32_t idx;
+int sipi_vector;
+int wait_for_sipi;
+};
+
+typedef struct APICState {
+CPUState *cpu_env;
+
+/* KVM lapic structure is just a big array of regs. But it is what kvm
+ * functions expect. So have both the fields separated, for easy access,
+ * and the kvm stucture, for ioctls communications */
+union {
+struct qemu_lapic_state dev;
+struct kvm_lapic_state kvm_lapic_state;
+};
+} APICState;
+
+void kvm_apic_set_base(CPUState *env, uint64_t val)
+{
+APICState *s = env-apic_state;
+
+if (!s)
+return;
+
+s-dev.apicbase = val;
+}
+
+uint64_t kvm_apic_get_base(CPUState *env)
+{
+APICState *s = env-apic_state;
+
+return s ? s-dev.apicbase : 0;
+}
+
+static void apic_pre_save(void *opaque)
+{
+APICState *s = opaque;
+
+kvm_get_lapic(s-cpu_env, s-kvm_lapic_state);
+}
+
+static int apic_post_load(void *opaque, int version_id)
+{
+APICState *s = opaque;
+
+return kvm_set_lapic(s-cpu_env, s-kvm_lapic_state);
+}
+
+static const VMStateDescription vmstate_apic = {
+.name = apic-kvm,
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_BUFFER_UNSAFE(kvm_lapic_state.regs, APICState, 1, 
KVM_APIC_REG_SIZE),
+VMSTATE_END_OF_LIST()
+},
+.pre_save = apic_pre_save,
+.post_load = apic_post_load,
+};
+
+static void kvm_apic_reset(void *opaque)
+{
+APICState *s = opaque;
+int bsp;
+int i;
+
+cpu_synchronize_state(s-cpu_env);
+
+bsp = cpu_is_bsp(s-cpu_env);
+
+s-dev.apicbase = 0xfee0 |
+(bsp ? MSR_IA32_APICBASE_BSP : 0) | MSR_IA32_APICBASE_ENABLE;
+
+cpu_reset(s-cpu_env);
+
+s-dev.tpr = 0;
+s-dev.spurious_vec = 0xff;
+s-dev.log_dest = 0;
+s-dev.dest_mode = 0xf;
+memset(s-dev.isr, 0, sizeof(s-dev.isr));
+memset(s-dev.tmr, 0, sizeof(s-dev.tmr));
+memset(s-dev.irr, 0, sizeof(s-dev.irr));
+for(i = 0; i  APIC_LVT_NB; i++)
+s-dev.lvt[i] = 1  16; /* mask LVT */
+s-dev.esr = 0;
+memset(s-dev.icr, 0, sizeof(s-dev.icr));
+s-dev.divide_conf = 0;
+s-dev.count_shift = 0;
+s-dev.initial_count = 0;
+s-dev.initial_count_load_time = 0;
+s-dev.next_time = 0;
+s-dev.wait_for_sipi = 1;
+
+s-cpu_env-halted = !(s-dev.apicbase  MSR_IA32_APICBASE_BSP);
+
+s-cpu_env-mp_state
+= s-cpu_env-halted ? KVM_MP_STATE_UNINITIALIZED : 
KVM_MP_STATE_RUNNABLE;
+
+kvm_put_mp_state(s-cpu_env);
+
+if (bsp) {
+/*
+ * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization
+ * time typically by BIOS, so PIC interrupt can be delivered to the
+ * processor when local APIC is enabled.
+ */
+s-dev.lvt[APIC_LVT_LINT0] = 0x700;
+}
+kvm_set_lapic(s-cpu_env, s-kvm_lapic_state);
+}
+
+int kvm_apic_init(CPUState *env)
+{
+APICState *s;
+
+s = qemu_mallocz(sizeof(*s));
+env-apic_state = s;
+s-cpu_env = env;
+
+msix_supported = 1;
+
+vmstate_register(s-dev.id, vmstate_apic, s

[Qemu-devel] [PATCH 6/9] export kvm_put_mp_state

2009-12-02 Thread Glauber Costa

We'll use it from inside the in-kernel apic chip, so get it into a header
file.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 target-i386/cpu.h |4 
 target-i386/kvm.c |2 +-
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 7c4fa47..1ebf80e 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -921,4 +921,8 @@ void apic_init_reset(CPUState *env);
 void apic_sipi(CPUState *env);
 void do_cpu_init(CPUState *env);
 void do_cpu_sipi(CPUState *env);
+
+/* KVM hooks */
+int kvm_put_mp_state(CPUState *env);
+
 #endif /* CPU_I386_H */
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 78fc941..4eb61f1 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -687,7 +687,7 @@ static int kvm_get_msrs(CPUState *env)
 return 0;
 }
 
-static int kvm_put_mp_state(CPUState *env)
+int kvm_put_mp_state(CPUState *env)
 {
 struct kvm_mp_state mp_state = { .mp_state = env-mp_state };
 
-- 
1.6.5.2

[Qemu-devel] [PATCH 8/9] Initialize in-kernel irqchip

2009-12-02 Thread Glauber Costa

Now that we have all devices set up, this patch initializes the irqchip.
This is dependant on the io-thread, since we need someone to pull ourselves
out of the halted state.

I believe this should be the default when we are running over the io-thread.
Later on, I plan to post a patch that makes it optional.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 kvm-all.c |   20 
 1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index e46093f..0c6aba7 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -416,6 +416,22 @@ int kvm_set_irq(int irq, int level, int *status)
 return kvm_arch_set_irq(kvm_state, irq, level, status);
 }
 
+static int kvm_create_irqchip(KVMState *s)
+{
+int ret = 0;
+#if defined(CONFIG_IOTHREAD)
+if (!kvm_check_extension(s, KVM_CAP_IRQCHIP))
+return -1;
+
+ret = kvm_vm_ioctl(s, KVM_CREATE_IRQCHIP);
+if (ret  0)
+return ret;
+
+s-irqchip_in_kernel = 1;
+#endif
+return ret;
+}
+
 int kvm_init(int smp_cpus)
 {
 static const char upgrade_note[] =
@@ -502,6 +518,10 @@ int kvm_init(int smp_cpus)
 }
 #endif
 
+ret = kvm_create_irqchip(s);
+if (ret  0)
+goto err;
+
 ret = kvm_arch_init(s, smp_cpus);
 if (ret  0)
 goto err;
-- 
1.6.5.2

[Qemu-devel] [PATCH 9/9] Do GSI routing

2009-12-02 Thread Glauber Costa

To follow correctly what our bios ACPI tables say, we have to be able to program
our irqchips with GSI routing mappings. This support is already in qemu-kvm

Signed-off-by: Glauber Costa glom...@redhat.com
---
 kvm-all.c |6 ++-
 kvm.h |3 +
 target-i386/kvm.c |  118 +
 3 files changed, 125 insertions(+), 2 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 0c6aba7..a5739c4 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -66,6 +66,8 @@ struct KVMState
 #endif
 int irqchip_in_kernel;
 int pit_in_kernel;
+
+KVMArchState *arch_state;
 };
 
 static KVMState *kvm_state;
@@ -522,12 +524,12 @@ int kvm_init(int smp_cpus)
 if (ret  0)
 goto err;
 
+kvm_state = s;
+
 ret = kvm_arch_init(s, smp_cpus);
 if (ret  0)
 goto err;
 
-kvm_state = s;
-
 return 0;
 
 err:
diff --git a/kvm.h b/kvm.h
index cb896e3..15fb34a 100644
--- a/kvm.h
+++ b/kvm.h
@@ -72,7 +72,10 @@ int kvm_set_irq(int irq, int level, int *status);
 /* internal API */
 
 struct KVMState;
+struct KVMArchState;
+
 typedef struct KVMState KVMState;
+typedef struct KVMArchState KVMArchState;
 
 int kvm_ioctl(KVMState *s, int type, ...);
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index a908819..7e1ce15 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -37,6 +37,16 @@
 
 #ifdef KVM_CAP_EXT_CPUID
 
+struct KVMArchState
+{
+struct kvm_irq_routing *irq_routes;
+int nr_allocated_irq_routes;
+void *used_gsi_bitmap;
+int max_gsi;
+};
+
+static KVMArchState *kvm_arch_state;
+
 static struct kvm_cpuid2 *try_get_cpuid(KVMState *s, int max)
 {
 struct kvm_cpuid2 *cpuid;
@@ -272,10 +282,118 @@ static int kvm_has_msr_star(CPUState *env)
 return 0;
 }
 
+/*
+ * Setup x86 specific IRQ routing
+ */
+static inline void set_gsi(KVMArchState *s, unsigned int gsi)
+{
+uint32_t *bitmap = s-used_gsi_bitmap;
+
+if (gsi  s-max_gsi)
+bitmap[gsi / 32] |= 1U  (gsi % 32);
+else
+fprintf(stderr, Invalid GSI %d\n, gsi);
+}
+
+static int kvm_add_routing_entry(KVMArchState *s, struct kvm_irq_routing_entry 
*entry)
+{
+struct kvm_irq_routing *z;
+struct kvm_irq_routing_entry *new;
+int n, size;
+
+if (s-irq_routes-nr == s-nr_allocated_irq_routes) {
+n = s-nr_allocated_irq_routes * 2;
+if (n  64)
+n = 64;
+size = sizeof(struct kvm_irq_routing);
+size += n * sizeof(*new);
+z = realloc(s-irq_routes, size);
+if (!z)
+return -ENOMEM;
+s-nr_allocated_irq_routes = n;
+s-irq_routes = z;
+}
+n = s-irq_routes-nr++;
+new = s-irq_routes-entries[n];
+memset(new, 0, sizeof(*new));
+new-gsi = entry-gsi;
+new-type = entry-type;
+new-flags = entry-flags;
+new-u = entry-u;
+
+set_gsi(s, entry-gsi);
+
+return 0;
+}
+
+static int kvm_add_irq_route(KVMArchState *s, int gsi, int irqchip, int pin)
+{
+struct kvm_irq_routing_entry e;
+
+e.gsi = gsi;
+e.type = KVM_IRQ_ROUTING_IRQCHIP;
+e.flags = 0;
+e.u.irqchip.irqchip = irqchip;
+e.u.irqchip.pin = pin;
+return kvm_add_routing_entry(s, e);
+}
+
+static int kvm_init_irq_routing(KVMState *s)
+{
+int i, r;
+int gsi_count, gsi_bits;
+
+gsi_count = kvm_check_extension(s, KVM_CAP_IRQ_ROUTING);
+if (!kvm_irqchip_in_kernel()  (gsi_count  0)) {
+return 0;
+}
+
+/* Round up so we can search ints using ffs */
+gsi_bits = ((gsi_count - 31)  ~31);
+kvm_arch_state-used_gsi_bitmap = qemu_mallocz(gsi_bits / 8);
+kvm_arch_state-max_gsi = gsi_bits;
+
+/* Mark any over-allocated bits as already in use */
+for (i = gsi_count; i  gsi_bits; i++) {
+set_gsi(kvm_arch_state, i);
+}
+
+kvm_arch_state-irq_routes-nr = 0;
+
+for (i = 0; i  8; ++i) {
+if (i == 2)
+continue;
+r = kvm_add_irq_route(kvm_arch_state, i, KVM_IRQCHIP_PIC_MASTER, i);
+if (r  0)
+return r;
+}
+for (i = 8; i  16; ++i) {
+r = kvm_add_irq_route(kvm_arch_state, i, KVM_IRQCHIP_PIC_SLAVE, i - 8);
+if (r  0)
+return r;
+}
+for (i = 0; i  24; ++i) {
+if (i == 0) {
+r = kvm_add_irq_route(kvm_arch_state, i, KVM_IRQCHIP_IOAPIC, 2);
+} else if (i != 2) {
+r = kvm_add_irq_route(kvm_arch_state, i, KVM_IRQCHIP_IOAPIC, i);
+}
+if (r  0)
+return r;
+}
+
+kvm_arch_state-irq_routes-flags = 0;
+return kvm_vm_ioctl(s, KVM_SET_GSI_ROUTING, kvm_arch_state-irq_routes);
+}
 int kvm_arch_init(KVMState *s, int smp_cpus)
 {
 int ret;
 
+kvm_arch_state = qemu_mallocz(sizeof(*kvm_arch_state));
+kvm_arch_state-irq_routes = 
qemu_mallocz(sizeof(*kvm_arch_state-irq_routes));
+
+kvm_init_irq_routing(s);
+
 /* create vm86 tss.  KVM uses vm86 mode to emulate 16-bit code
  * directly.  In order to use vm86 mode, a TSS is needed

[Qemu-devel] [PATCH 00/11] SMP support in qemu.git

2009-12-02 Thread Glauber Costa

Avi/Marcelo

Please include this in a branch in qemu-kvm for future inclusion
in qemu.git. It is the same material people have already commented
on in the ML.

Glauber Costa (11):
  Don't mess with halted state.
  store thread-specific env information
  update halted state on mp_state sync
  qemu_flush_work for remote vcpu execution
  tell kernel about all registers instead of just mp_state
  flush state in migration post_load
  Don't call kvm cpu reset on initialization
  use cpu_kick instead of direct signalling.
  Use per-cpu reset handlers.
  Use __thread where available.
  remove smp restriction from kvm

 configure |   17 
 cpu-all.h |3 +
 cpu-defs.h|   16 
 exec-all.h|   12 ++
 exec.c|   32 
 hw/apic-kvm.c |   26 +++-
 kvm-all.c |   49 +---
 kvm.h |   10 +
 target-i386/kvm.c |   12 ++
 target-ppc/kvm.c  |5 ++
 vl.c  |  108 ++--
 11 files changed, 244 insertions(+), 46 deletions(-)

[Qemu-devel] [PATCH 02/11] store thread-specific env information

2009-12-02 Thread Glauber Costa

Since we'll have multiple cpu threads, at least for kvm, we need a way to store
and retrieve the CPUState associated with the current execution thread.
For the I/O thread, this will be NULL.

I am using pthread functions for that, for portability, but we could as well
use __thread keyword.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 vl.c |   21 +
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/vl.c b/vl.c
index 44763af..321b18d 100644
--- a/vl.c
+++ b/vl.c
@@ -3434,6 +3434,24 @@ static void block_io_signals(void);
 static void unblock_io_signals(void);
 static int tcg_has_work(void);
 
+static pthread_key_t current_env;
+
+CPUState *qemu_get_current_env(void);
+CPUState *qemu_get_current_env(void)
+{
+return pthread_getspecific(current_env);
+}
+
+static void qemu_set_current_env(CPUState *env)
+{
+pthread_setspecific(current_env, env);
+}
+
+static void qemu_init_current_env(void)
+{
+pthread_key_create(current_env, NULL);
+}
+
 static int qemu_init_main_loop(void)
 {
 int ret;
@@ -3446,6 +3464,7 @@ static int qemu_init_main_loop(void)
 qemu_mutex_init(qemu_fair_mutex);
 qemu_mutex_init(qemu_global_mutex);
 qemu_mutex_lock(qemu_global_mutex);
+qemu_init_current_env();
 
 unblock_io_signals();
 qemu_thread_self(io_thread);
@@ -3484,6 +3503,8 @@ static void *kvm_cpu_thread_fn(void *arg)
 
 block_io_signals();
 qemu_thread_self(env-thread);
+qemu_set_current_env(env);
+
 if (kvm_enabled())
 kvm_init_vcpu(env);
 
-- 
1.6.5.2

[Qemu-devel] [PATCH 01/11] Don't mess with halted state.

2009-12-02 Thread Glauber Costa

When we have irqchip in kernel, halted state is kernel
business. So don't initialize it in our code.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 hw/apic-kvm.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/hw/apic-kvm.c b/hw/apic-kvm.c
index 089fa45..e5a0bfc 100644
--- a/hw/apic-kvm.c
+++ b/hw/apic-kvm.c
@@ -122,10 +122,9 @@ static void kvm_apic_reset(void *opaque)
 s-dev.next_time = 0;
 s-dev.wait_for_sipi = 1;
 
-s-cpu_env-halted = !(s-dev.apicbase  MSR_IA32_APICBASE_BSP);
 
 s-cpu_env-mp_state
-= s-cpu_env-halted ? KVM_MP_STATE_UNINITIALIZED : 
KVM_MP_STATE_RUNNABLE;
+= bsp ? KVM_MP_STATE_RUNNABLE : KVM_MP_STATE_UNINITIALIZED;
 
 kvm_put_mp_state(s-cpu_env);
 
-- 
1.6.5.2

[Qemu-devel] [PATCH 04/11] qemu_flush_work for remote vcpu execution

2009-12-02 Thread Glauber Costa

This function is similar to qemu-kvm's on_vcpu mechanism. Totally synchronous,
and guarantees that a given function will be executed at the specified vcpu.

This patch also convert usage within the breakpoints system

Signed-off-by: Glauber Costa glom...@redhat.com
---
 cpu-all.h  |3 ++
 cpu-defs.h |   14 +
 kvm-all.c  |   17 +--
 vl.c   |   61 +++
 4 files changed, 75 insertions(+), 20 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index e214374..8270d43 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -763,6 +763,9 @@ extern CPUState *cpu_single_env;
 extern int64_t qemu_icount;
 extern int use_icount;
 
+void qemu_queue_work(CPUState *env, void (*func)(void *data), void *data);
+void qemu_flush_work(CPUState *env);
+
 #define CPU_INTERRUPT_HARD   0x02 /* hardware interrupt pending */
 #define CPU_INTERRUPT_EXITTB 0x04 /* exit the current TB (use for x86 a20 
case) */
 #define CPU_INTERRUPT_TIMER  0x08 /* internal timer exception pending */
diff --git a/cpu-defs.h b/cpu-defs.h
index 95068b5..18792fc 100644
--- a/cpu-defs.h
+++ b/cpu-defs.h
@@ -31,6 +31,8 @@
 #include qemu-queue.h
 #include targphys.h
 
+#include qemu-thread.h
+
 #ifndef TARGET_LONG_BITS
 #error TARGET_LONG_BITS must be defined before including this header
 #endif
@@ -134,6 +136,13 @@ typedef struct CPUWatchpoint {
 QTAILQ_ENTRY(CPUWatchpoint) entry;
 } CPUWatchpoint;
 
+typedef struct QemuWorkItem {
+void (*func)(void *data);
+void *data;
+int done;
+} QemuWorkItem;
+
+
 #define CPU_TEMP_BUF_NLONGS 128
 #define CPU_COMMON  \
 struct TranslationBlock *current_tb; /* currently executing TB  */  \
@@ -175,6 +184,10 @@ typedef struct CPUWatchpoint {
 QTAILQ_HEAD(watchpoints_head, CPUWatchpoint) watchpoints;\
 CPUWatchpoint *watchpoint_hit;  \
 \
+QemuWorkItem queued_work;   \
+uint64_t queued_local, queued_total;\
+struct QemuMutex queue_lock;\
+\
 struct GDBRegisterState *gdb_regs;  \
 \
 /* Core interrupt code */   \
@@ -194,6 +207,7 @@ typedef struct CPUWatchpoint {
 uint32_t created;   \
 struct QemuThread *thread;  \
 struct QemuCond *halt_cond; \
+struct QemuCond work_cond;  \
 const char *cpu_model_str;  \
 struct KVMState *kvm_state; \
 struct kvm_run *kvm_run;\
diff --git a/kvm-all.c b/kvm-all.c
index a5739c4..f7d89c6 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -617,7 +617,7 @@ int kvm_cpu_exec(CPUState *env)
 struct kvm_run *run = env-kvm_run;
 int ret;
 
-dprintf(kvm_cpu_exec()\n);
+dprintf(kvm_cpu_exec() %d\n, env-cpu_index);
 
 do {
 if (env-exit_request) {
@@ -932,19 +932,6 @@ void kvm_setup_guest_memory(void *start, size_t size)
 }
 
 #ifdef KVM_CAP_SET_GUEST_DEBUG
-static void on_vcpu(CPUState *env, void (*func)(void *data), void *data)
-{
-#ifdef CONFIG_IOTHREAD
-if (env == cpu_single_env) {
-func(data);
-return;
-}
-abort();
-#else
-func(data);
-#endif
-}
-
 struct kvm_sw_breakpoint *kvm_find_sw_breakpoint(CPUState *env,
  target_ulong pc)
 {
@@ -992,7 +979,7 @@ int kvm_update_guest_debug(CPUState *env, unsigned long 
reinject_trap)
 data.dbg.control |= reinject_trap;
 data.env = env;
 
-on_vcpu(env, kvm_invoke_set_guest_debug, data);
+qemu_queue_work(env, kvm_invoke_set_guest_debug, data);
 return data.err;
 }
 
diff --git a/vl.c b/vl.c
index 321b18d..c7b46a9 100644
--- a/vl.c
+++ b/vl.c
@@ -3403,6 +3403,11 @@ void qemu_notify_event(void)
 }
 }
 
+void qemu_queue_work(CPUState *env, void (*func)(void *data), void *data)
+{
+func(data);
+}
+
 void qemu_mutex_lock_iothread(void) {}
 void qemu_mutex_unlock_iothread(void) {}
 
@@ -3436,8 +3441,7 @@ static int tcg_has_work(void);
 
 static pthread_key_t current_env;
 
-CPUState *qemu_get_current_env(void);
-CPUState *qemu_get_current_env(void)
+static CPUState *qemu_get_current_env(void)
 {
 return pthread_getspecific(current_env);
 }
@@ -3474,8 +3478,10 @@ static int qemu_init_main_loop(void)
 
 static void qemu_wait_io_event(CPUState *env)
 {
-while (!tcg_has_work())
+while (!tcg_has_work

[Qemu-devel] [PATCH 03/11] update halted state on mp_state sync

2009-12-02 Thread Glauber Costa

If we are using in-kernel irqchip, halted state belongs in the kernel.
So everytime we grab kernel's idea of mpstate, we also need to propagate
halted state to userspace.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 target-i386/kvm.c |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 7e1ce15..65ad2a1 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -825,6 +825,11 @@ static int kvm_get_mp_state(CPUState *env)
 return ret;
 }
 env-mp_state = mp_state.mp_state;
+
+if (kvm_irqchip_in_kernel()) {
+env-halted = (env-mp_state == KVM_MP_STATE_HALTED);
+}
+
 return 0;
 }
 
-- 
1.6.5.2

1 2 >

1 - 100 of 189 matches

Mail list logo