Re: [PATCH v1 0/2] Add timeout mechanism to qmp actions
On Tue, Dec 8, 2020 at 8:11 AM Stefan Hajnoczi wrote: > > On Thu, Oct 22, 2020 at 05:29:16PM +0100, Fam Zheng wrote: > > On Tue, 2020-10-20 at 09:34 +0800, Zhenyu Ye wrote: > > > On 2020/10/19 21:25, Paolo Bonzini wrote: > > > > On 19/10/20 14:40, Zhenyu Ye wrote: > > > > > The kernel backtrace for io_submit in GUEST is: > > > > > > > > > > guest# ./offcputime -K -p `pgrep -nx fio` > > > > > b'finish_task_switch' > > > > > b'__schedule' > > > > > b'schedule' > > > > > b'io_schedule' > > > > > b'blk_mq_get_tag' > > > > > b'blk_mq_get_request' > > > > > b'blk_mq_make_request' > > > > > b'generic_make_request' > > > > > b'submit_bio' > > > > > b'blkdev_direct_IO' > > > > > b'generic_file_read_iter' > > > > > b'aio_read' > > > > > b'io_submit_one' > > > > > b'__x64_sys_io_submit' > > > > > b'do_syscall_64' > > > > > b'entry_SYSCALL_64_after_hwframe' > > > > > -fio (1464) > > > > > 40031912 > > > > > > > > > > And Linux io_uring can avoid the latency problem. > > > > Thanks for the info. What this tells us is basically the inflight > > requests are high. It's sad that the linux-aio is in practice > > implemented as a blocking API. it is. > > > > Host side backtrace will be of more help. Can you get that too? > > I guess Linux AIO didn't set the BLK_MQ_REQ_NOWAIT flag so the task went > to sleep when it ran out of blk-mq tags. The easiest solution is to move > to io_uring. Linux AIO is broken - it's not AIO :). Agree! > > If we know that no other process is writing to the host block device > then maybe we can determine the blk-mq tags limit (the queue depth) and > avoid sending more requests. That way QEMU doesn't block, but I don't > think this approach works when other processes are submitting I/O to the > same host block device :(. > > Fam's original suggestion of invoking io_submit(2) from a worker thread > is an option, but I'm afraid it will slow down the uncontended case. > > I'm CCing Glauber in case he battled this in the past in ScyllaDB. We have, and a lot. I don't recall seeing this particular lock, but XFS would block us all the time if it had to update metadata to submit the operation, lock inodes, etc. The work we did at the time was in fixing those things in the kernel as much as we could. But the API is just like that... > > Stefan
Re: [Qemu-devel] [PATCH v3] Add an isa device for SGA
On 06/07/2011 04:17 PM, Anthony Liguori wrote: On 05/16/2011 01:45 PM, Glauber Costa wrote: This patch adds a dummy legacy ISA device whose responsibility is to deploy sgabios, an option rom for a serial graphics adapter. The proposal is that this device is always-on when -nographics, but can otherwise be enable in any setup when -device sga is used. [v2: suggestions on qdev by Markus ] [v3: cleanups and documentation, per list suggestions ] Signed-off-by: Glauber Costaglom...@redhat.com Applied. But I'd like to figure out what to do about sgabios.bin. I think we should ship a copy. Agree. Regards, Anthony Liguori --- Makefile.target | 2 +- hw/pc.c | 9 hw/sga.c | 56 +++ 3 files changed, 66 insertions(+), 1 deletions(-) create mode 100644 hw/sga.c diff --git a/Makefile.target b/Makefile.target index fdbdc6c..004ea7e 100644 --- a/Makefile.target +++ b/Makefile.target @@ -224,7 +224,7 @@ obj-$(CONFIG_KVM) += ivshmem.o # Hardware support obj-i386-y += vga.o obj-i386-y += mc146818rtc.o i8259.o pc.o -obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o +obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o obj-i386-y += vmport.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += extboot.o diff --git a/hw/pc.c b/hw/pc.c index 8d351ba..5a8e00a 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -1096,6 +1096,15 @@ void pc_vga_init(PCIBus *pci_bus) isa_vga_init(); } } + + /* + * sga does not suppress normal vga output. So a machine can have both a + * vga card and sga manually enabled. Output will be seen on both. + * For nographic case, sga is enabled at all times + */ + if (display_type == DT_NOGRAPHIC) { + isa_create_simple(sga); + } } static void cpu_request_exit(void *opaque, int irq, int level) diff --git a/hw/sga.c b/hw/sga.c new file mode 100644 index 000..7ef750a --- /dev/null +++ b/hw/sga.c @@ -0,0 +1,56 @@ +/* + * QEMU dummy ISA device for loading sgabios option rom. + * + * Copyright (c) 2011 Glauber Costa, Red Hat Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + * + * sgabios code originally available at code.google.com/p/sgabios + * + */ +#include pci.h +#include pc.h +#include loader.h +#include sysemu.h + +#define SGABIOS_FILENAME sgabios.bin + +typedef struct ISAGAState { + ISADevice dev; +} ISASGAState; + +static int isa_cirrus_vga_initfn(ISADevice *dev) +{ + rom_add_vga(SGABIOS_FILENAME); + return 0; +} + +static ISADeviceInfo sga_info = { + .qdev.name = sga, + .qdev.desc = Serial Graphics Adapter, + .qdev.size = sizeof(ISASGAState), + .init = isa_cirrus_vga_initfn, +}; + +static void sga_register(void) +{ + isa_qdev_register(sga_info); +} + +device_init(sga_register);
[Qemu-devel] [PATCH v3] Add an isa device for SGA
This patch adds a dummy legacy ISA device whose responsibility is to deploy sgabios, an option rom for a serial graphics adapter. The proposal is that this device is always-on when -nographics, but can otherwise be enable in any setup when -device sga is used. [v2: suggestions on qdev by Markus ] [v3: cleanups and documentation, per list suggestions ] Signed-off-by: Glauber Costa glom...@redhat.com --- Makefile.target |2 +- hw/pc.c |9 hw/sga.c| 56 +++ 3 files changed, 66 insertions(+), 1 deletions(-) create mode 100644 hw/sga.c diff --git a/Makefile.target b/Makefile.target index fdbdc6c..004ea7e 100644 --- a/Makefile.target +++ b/Makefile.target @@ -224,7 +224,7 @@ obj-$(CONFIG_KVM) += ivshmem.o # Hardware support obj-i386-y += vga.o obj-i386-y += mc146818rtc.o i8259.o pc.o -obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o +obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o obj-i386-y += vmport.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += extboot.o diff --git a/hw/pc.c b/hw/pc.c index 8d351ba..5a8e00a 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -1096,6 +1096,15 @@ void pc_vga_init(PCIBus *pci_bus) isa_vga_init(); } } + +/* + * sga does not suppress normal vga output. So a machine can have both a + * vga card and sga manually enabled. Output will be seen on both. + * For nographic case, sga is enabled at all times + */ +if (display_type == DT_NOGRAPHIC) { +isa_create_simple(sga); +} } static void cpu_request_exit(void *opaque, int irq, int level) diff --git a/hw/sga.c b/hw/sga.c new file mode 100644 index 000..7ef750a --- /dev/null +++ b/hw/sga.c @@ -0,0 +1,56 @@ +/* + * QEMU dummy ISA device for loading sgabios option rom. + * + * Copyright (c) 2011 Glauber Costa, Red Hat Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + * + * sgabios code originally available at code.google.com/p/sgabios + * + */ +#include pci.h +#include pc.h +#include loader.h +#include sysemu.h + +#define SGABIOS_FILENAME sgabios.bin + +typedef struct ISAGAState { +ISADevice dev; +} ISASGAState; + +static int isa_cirrus_vga_initfn(ISADevice *dev) +{ +rom_add_vga(SGABIOS_FILENAME); +return 0; +} + +static ISADeviceInfo sga_info = { +.qdev.name= sga, +.qdev.desc= Serial Graphics Adapter, +.qdev.size= sizeof(ISASGAState), +.init = isa_cirrus_vga_initfn, +}; + +static void sga_register(void) +{ + isa_qdev_register(sga_info); +} + +device_init(sga_register); -- 1.7.4.2
Re: [Qemu-devel] [PATCH v2] Add an isa device for SGA
On Fri, May 13, 2011 at 10:43:33AM +0200, Markus Armbruster wrote: Glauber Costa glom...@redhat.com writes: This patch adds a dummy legacy ISA device whose responsibility is to deploy sgabios, an option rom for a serial graphics adapter. The proposal is that this device is always-on when -nographics, but can otherwise be enable in any setup when -device sga is used. [v2: suggestions on qdev by Markus ] One more question: should -device sga suppress the default VGA? The other three VGA devices do; check out default_list[] in vl.c. No, it should not. This rom is specially designed not to: It will chain all int10h requests to whatever it was there before. [...] diff --git a/hw/pc.h b/hw/pc.h index 1291e2d..a00e054 100644 --- a/hw/pc.h +++ b/hw/pc.h @@ -219,6 +219,9 @@ int isa_vga_mm_init(target_phys_addr_t vram_base, void pci_cirrus_vga_init(PCIBus *bus); void isa_cirrus_vga_init(void); +/* serial graphics */ +void isa_sga_init(void); + No longer exists, please drop. You're right, my bad. Anthony, is it needed to send a v3 to take this away, or will you remove manually ?
[Qemu-devel] [PATCH] Avoid segmentation fault for qdev device not found
qdev_try_create will cope well with a NULL bus, since it will assume the main system bus by default. qdev_create, however, wants to print a message, in which it instantiates the bus name. That simple and at first inoffensive message will generate a segmentation found if the reason for failure is a NULL bus. I propose we avoid that - thus generating the normal hw_error by always passing a valid bus to qdev_try_create - if none, be it the main system bus. The code for testing a NULL bus is not remove from qdev_try_create because it is a externally visible function, and we want it to continue working fine. Signed-off-by: Glauber Costa glom...@redhat.com --- hw/qdev.c |4 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/hw/qdev.c b/hw/qdev.c index 1aa1ea0..21ef075 100644 --- a/hw/qdev.c +++ b/hw/qdev.c @@ -108,6 +108,10 @@ DeviceState *qdev_create(BusState *bus, const char *name) { DeviceState *dev; +if (!bus) { +bus = sysbus_get_default(); +} + dev = qdev_try_create(bus, name); if (!dev) { hw_error(Unknown device '%s' for bus '%s'\n, name, bus-info-name); -- 1.7.4.2
[Qemu-devel] [PATCH v2] Add an isa device for SGA
This patch adds a dummy legacy ISA device whose responsibility is to deploy sgabios, an option rom for a serial graphics adapter. The proposal is that this device is always-on when -nographics, but can otherwise be enable in any setup when -device sga is used. [v2: suggestions on qdev by Markus ] Signed-off-by: Glauber Costa glom...@redhat.com --- Makefile.target |2 +- hw/pc.c |4 hw/pc.h |3 +++ hw/sga.c| 31 +++ 4 files changed, 39 insertions(+), 1 deletions(-) create mode 100644 hw/sga.c diff --git a/Makefile.target b/Makefile.target index fdbdc6c..004ea7e 100644 --- a/Makefile.target +++ b/Makefile.target @@ -224,7 +224,7 @@ obj-$(CONFIG_KVM) += ivshmem.o # Hardware support obj-i386-y += vga.o obj-i386-y += mc146818rtc.o i8259.o pc.o -obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o +obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o obj-i386-y += vmport.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += extboot.o diff --git a/hw/pc.c b/hw/pc.c index 8d351ba..78bf7de 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -1096,6 +1096,10 @@ void pc_vga_init(PCIBus *pci_bus) isa_vga_init(); } } + +if (display_type == DT_NOGRAPHIC) { +isa_create_simple(sga); +} } static void cpu_request_exit(void *opaque, int irq, int level) diff --git a/hw/pc.h b/hw/pc.h index 1291e2d..a00e054 100644 --- a/hw/pc.h +++ b/hw/pc.h @@ -219,6 +219,9 @@ int isa_vga_mm_init(target_phys_addr_t vram_base, void pci_cirrus_vga_init(PCIBus *bus); void isa_cirrus_vga_init(void); +/* serial graphics */ +void isa_sga_init(void); + /* ne2000.c */ static inline bool isa_ne2000_init(int base, int irq, NICInfo *nd) { diff --git a/hw/sga.c b/hw/sga.c new file mode 100644 index 000..08cef56 --- /dev/null +++ b/hw/sga.c @@ -0,0 +1,31 @@ +#include pci.h +#include pc.h +#include loader.h +#include sysemu.h + +#define SGABIOS_FILENAME sgabios.bin + +typedef struct ISAGAState { +ISADevice dev; +} ISASGAState; + + +static int isa_cirrus_vga_initfn(ISADevice *dev) +{ +rom_add_vga(SGABIOS_FILENAME); +return 0; +} + +static ISADeviceInfo sga_info = { +.qdev.name= sga, +.qdev.desc= Serial Graphics Adapter, +.qdev.size= sizeof(ISASGAState), +.init = isa_cirrus_vga_initfn, +}; + +static void sga_register(void) +{ + isa_qdev_register(sga_info); +} + +device_init(sga_register); -- 1.7.4.2
[Qemu-devel] [PATCH] Add an isa device for SGA
This patch adds a dummy legacy ISA device whose responsibility is to deploy sgabios, an option rom for a serial graphics adapter. The proposal is that this device is always-on when -nographics, but can otherwise be enable in any setup when -device sga is used. Signed-off-by: Glauber Costa glom...@redhat.com --- Makefile.target |2 +- hw/pc.c |2 ++ hw/pc.h |3 +++ hw/sga.c| 49 + 4 files changed, 55 insertions(+), 1 deletions(-) create mode 100644 hw/sga.c diff --git a/Makefile.target b/Makefile.target index fdbdc6c..004ea7e 100644 --- a/Makefile.target +++ b/Makefile.target @@ -224,7 +224,7 @@ obj-$(CONFIG_KVM) += ivshmem.o # Hardware support obj-i386-y += vga.o obj-i386-y += mc146818rtc.o i8259.o pc.o -obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o +obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o obj-i386-y += vmport.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += extboot.o diff --git a/hw/pc.c b/hw/pc.c index 8d351ba..56c3887 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -1096,6 +1096,8 @@ void pc_vga_init(PCIBus *pci_bus) isa_vga_init(); } } + +isa_sga_init(); } static void cpu_request_exit(void *opaque, int irq, int level) diff --git a/hw/pc.h b/hw/pc.h index 1291e2d..a00e054 100644 --- a/hw/pc.h +++ b/hw/pc.h @@ -219,6 +219,9 @@ int isa_vga_mm_init(target_phys_addr_t vram_base, void pci_cirrus_vga_init(PCIBus *bus); void isa_cirrus_vga_init(void); +/* serial graphics */ +void isa_sga_init(void); + /* ne2000.c */ static inline bool isa_ne2000_init(int base, int irq, NICInfo *nd) { diff --git a/hw/sga.c b/hw/sga.c new file mode 100644 index 000..411191b --- /dev/null +++ b/hw/sga.c @@ -0,0 +1,49 @@ +#include pci.h +#include pc.h +#include loader.h +#include sysemu.h + +#define SGABIOS_FILENAME sgabios.bin + +typedef struct ISAGAState { +ISADevice dev; +} ISASGAState; + +/* We can have both -device, and the initfn called, so better + * avoid to have the rom loaded twice */ +static void deploy_rom(void) +{ +static int rom_deployed = 0; + +if (!rom_deployed++) { +rom_add_vga(SGABIOS_FILENAME); +} +} + +static int isa_cirrus_vga_initfn(ISADevice *dev) +{ +deploy_rom(); + +return 0; +} + +void isa_sga_init(void) +{ +if (display_type == DT_NOGRAPHIC) { +deploy_rom(); +} +} + +static ISADeviceInfo sga_info = { +.qdev.name= sga, +.qdev.desc= Serial Graphics Adapter, +.qdev.size= sizeof(ISASGAState), +.init = isa_cirrus_vga_initfn, +}; + +static void sga_register(void) +{ + isa_qdev_register(sga_info); +} + +device_init(sga_register); -- 1.7.4.2
Re: [Qemu-devel] [PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts
On Wed, 2011-05-04 at 06:09 -0300, Marcelo Tosatti wrote: On Wed, May 04, 2011 at 04:06:59AM -0400, Ulrich Obergfell wrote: Hi Marcelo, Whats prev_period for, since in practice the period will not change between interrupts (OS programs comparator once, or perhaps twice during bootup) ? 'prev_period' is needed if a guest o/s changes the comparator period 'on the fly' (without stopping and restarting the timer). guest o/s changes period | ti(n-1) |ti(n) ti(n+1) | v | | +-+--+ --- prev_period --- -- period -- The idea is that each timer interrupt represents a certain quantum of time (the comparator period). If a guest o/s changes the period between timer interrupt 'n-1' and timer interrupt 'n', I think the new value should not take effect before timer interrupt 'n'. Timer interrupt 'n' still represents the old/previous quantum, and timer interrupt 'n+1' represents the new quantum. Hence, the patch decrements 'ticks_not_accounted' by 'prev_period' and sets 'prev_period' to 'period' when an interrupt was delivered to the guest o/s. +irq_delivered = update_irq(t, 1); +if (irq_delivered) { +t-ticks_not_accounted -= t-prev_period; +t-prev_period = t-period; +} else { Most of the time 'prev_period' is equal to 'period'. It should only be different in the scenario shown above. OK, makes sense. You should probably reset ticks_not_accounted to zero on HPET initialization (for example, to avoid miscalibration when kexec'ing a new kernel). Everybody resetting the machine in anyway is expected to force devices to be reinitialized, right ? I may be wrong, but I was under the impression that kexec would do this as well. In this case, the reset function should be enough.
Re: [Qemu-devel] [PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts
On Wed, 2011-05-04 at 16:46 +0300, Gleb Natapov wrote: On Wed, May 04, 2011 at 10:36:12AM -0300, Glauber Costa wrote: On Wed, 2011-05-04 at 06:09 -0300, Marcelo Tosatti wrote: On Wed, May 04, 2011 at 04:06:59AM -0400, Ulrich Obergfell wrote: Hi Marcelo, Whats prev_period for, since in practice the period will not change between interrupts (OS programs comparator once, or perhaps twice during bootup) ? 'prev_period' is needed if a guest o/s changes the comparator period 'on the fly' (without stopping and restarting the timer). guest o/s changes period | ti(n-1) |ti(n) ti(n+1) | v | | +-+--+ --- prev_period --- -- period -- The idea is that each timer interrupt represents a certain quantum of time (the comparator period). If a guest o/s changes the period between timer interrupt 'n-1' and timer interrupt 'n', I think the new value should not take effect before timer interrupt 'n'. Timer interrupt 'n' still represents the old/previous quantum, and timer interrupt 'n+1' represents the new quantum. Hence, the patch decrements 'ticks_not_accounted' by 'prev_period' and sets 'prev_period' to 'period' when an interrupt was delivered to the guest o/s. +irq_delivered = update_irq(t, 1); +if (irq_delivered) { +t-ticks_not_accounted -= t-prev_period; +t-prev_period = t-period; +} else { Most of the time 'prev_period' is equal to 'period'. It should only be different in the scenario shown above. OK, makes sense. You should probably reset ticks_not_accounted to zero on HPET initialization (for example, to avoid miscalibration when kexec'ing a new kernel). Everybody resetting the machine in anyway is expected to force devices to be reinitialized, right ? I may be wrong, but I was under the impression that kexec would do this as well. In this case, the reset function should be enough. kexec does not reset a machine. That's the whole point of kexec in fact. Sure thing, but doesn't it force the initialization routine of the devices themselves, without going through the bios ? -- Gleb.
Re: [Qemu-devel] [PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts
On Tue, 2011-05-03 at 16:03 -0300, Marcelo Tosatti wrote: On Thu, Apr 28, 2011 at 04:25:00PM +0200, Ulrich Obergfell wrote: Loss of periodic timer interrupts caused by delayed callbacks and by interrupt coalescing is compensated by gradually injecting additional interrupts during subsequent timer intervals, starting at a rate of one additional interrupt per interval. The injection of additional interrupts is based on a backlog of unaccounted HPET clock periods (new HPETTimer field 'ticks_not_accounted'). The backlog increases due to delayed callbacks and coalesced interrupts, and it decreases if an interrupt was injected successfully. If the backlog increases while compensation is still in progress, the rate at which additional interrupts are injected is increased too. A limit is imposed on the backlog and on the rate. Injecting additional timer interrupts to compensate lost interrupts can alleviate long term time drift. However, on a short time scale, this method can have the side effect of making virtual machine time intermittently pass slower and faster than real time (depending on the guest's time keeping algorithm). Compensation is disabled by default and can be enabled for guests where this behaviour may be acceptable. Signed-off-by: Ulrich Obergfell uober...@redhat.com --- hw/hpet.c | 63 +++- 1 files changed, 61 insertions(+), 2 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index 35466ae..92d5f58 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -32,6 +32,7 @@ #include sysbus.h #include mc146818rtc.h #include sysemu.h +#include assert.h //#define HPET_DEBUG #ifdef HPET_DEBUG @@ -42,6 +43,9 @@ #define HPET_MSI_SUPPORT0 +#define MAX_TICKS_NOT_ACCOUNTED (uint64_t)5 /* 5 sec */ +#define MAX_IRQ_RATE(uint32_t)10 + struct HPETState; typedef struct HPETTimer { /* timers */ uint8_t tn; /*timer number*/ @@ -326,28 +330,63 @@ static const VMStateDescription vmstate_hpet = { } }; +static bool hpet_timer_has_tick_backlog(HPETTimer *t) +{ +uint64_t backlog = t-ticks_not_accounted - (t-period + t-prev_period); +return (backlog = t-period); +} + /* * timer expiration callback */ static void hpet_timer(void *opaque) { HPETTimer *t = opaque; +HPETState *s = t-state; uint64_t diff; - +int irq_delivered = 0; +uint32_t irq_count = 0; uint64_t period = t-period; uint64_t cur_tick = hpet_get_ticks(t-state); +if (s-driftfix !t-ticks_not_accounted) { +t-ticks_not_accounted = t-prev_period = t-period; +} if (timer_is_periodic(t) period != 0) { if (t-config HPET_TN_32BIT) { while (hpet_time_after(cur_tick, t-cmp)) { t-cmp = (uint32_t)(t-cmp + t-period); +t-ticks_not_accounted += t-period; +irq_count++; } } else { while (hpet_time_after64(cur_tick, t-cmp)) { t-cmp += period; +t-ticks_not_accounted += period; +irq_count++; } } diff = hpet_calculate_diff(t, cur_tick); +if (s-driftfix) { +if (t-ticks_not_accounted MAX_TICKS_NOT_ACCOUNTED) { +t-ticks_not_accounted = t-period + t-prev_period; +} +if (hpet_timer_has_tick_backlog(t)) { +if (t-irq_rate == 1 || irq_count 1) { +t-irq_rate++; +t-irq_rate = MIN(t-irq_rate, MAX_IRQ_RATE); +} +if (t-divisor == 0) { +assert(irq_count); +} +if (irq_count) { +t-divisor = t-irq_rate; +} +diff /= t-divisor--; +} else { +t-irq_rate = 1; +} +} qemu_mod_timer(t-qemu_timer, qemu_get_clock_ns(vm_clock) + (int64_t)ticks_to_ns(diff)); } else if (t-config HPET_TN_32BIT !timer_is_periodic(t)) { @@ -358,7 +397,22 @@ static void hpet_timer(void *opaque) t-wrap_flag = 0; } } -update_irq(t, 1); +if (s-driftfix timer_is_periodic(t) period != 0) { +if (t-ticks_not_accounted = t-period + t-prev_period) { +irq_delivered = update_irq(t, 1); +if (irq_delivered) { +t-ticks_not_accounted -= t-prev_period; +t-prev_period = t-period; +} else { +if (irq_count) { +t-irq_rate++; +t-irq_rate = MIN(t-irq_rate, MAX_IRQ_RATE); +} +} +}
Re: [Qemu-devel] [PATCH v2 2a/6] x86: Allow multiple cpu feature matches of lookup_feature
On Tue, 2011-04-19 at 13:06 +0200, Jan Kiszka wrote: kvmclock is represented by two feature bits. Therefore, lookup_feature needs to continue its search even after the first match. Enhance it accordingly and switch to a bool return type at this chance. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- target-i386/cpuid.c | 14 -- 1 files changed, 8 insertions(+), 6 deletions(-) Glauber, could you check/ack this? Marcelo, please respin the series afterward. I'd like to see all bits upstream and merged back into qemu-kvm to proceed with switching the latter to upstream's kvm infrastructure. Yes, this patch is okay. Actually, I did sent out something like this, maybe Marcelo applied only part of the series? Anyway, Jan's version is handy, please apply it. diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c index 814d13e..0ac592f 100644 --- a/target-i386/cpuid.c +++ b/target-i386/cpuid.c @@ -182,20 +182,22 @@ static int altcmp(const char *s, const char *e, const char *altstr) } /* search featureset for flag *[s..e), if found set corresponding bit in - * *pval and return success, otherwise return zero + * *pval and return true, otherwise return false */ -static int lookup_feature(uint32_t *pval, const char *s, const char *e, -const char **featureset) +static bool lookup_feature(uint32_t *pval, const char *s, const char *e, + const char **featureset) { uint32_t mask; const char **ppc; +bool found = false; -for (mask = 1, ppc = featureset; mask; mask = 1, ++ppc) +for (mask = 1, ppc = featureset; mask; mask = 1, ++ppc) { if (*ppc !altcmp(s, e, *ppc)) { *pval |= mask; -break; +found = true; } -return (mask ? 1 : 0); +} +return found; } static void add_flagname_to_bitmaps(const char *flagname, uint32_t *features,
[Qemu-devel] [PATCH 0/2] allow fw_dir to be specified in the option-rom switch
Some roms should not live in genroms/, the default place for all roms passed through -option-rom switch. Rather, they'd like to be placed in vgaroms. This patch allows it to happen. Glauber Costa (2): document bootindex option add fw_dir option to option-rom switch hw/pc.c |7 ++- qemu-config.c |3 +++ qemu-options.hx | 14 -- sysemu.h|1 + vl.c|1 + 5 files changed, 23 insertions(+), 3 deletions(-) -- 1.7.4
[Qemu-devel] [PATCH 1/2] document bootindex option
bootindex option was added to -option-rom switch, but never documented. Signed-off-by: Glauber Costa glom...@redhat.com --- qemu-options.hx | 11 +-- 1 files changed, 9 insertions(+), 2 deletions(-) diff --git a/qemu-options.hx b/qemu-options.hx index 18f54d2..96927cc 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -2079,13 +2079,20 @@ to cope with initialization race conditions. ETEXI DEF(option-rom, HAS_ARG, QEMU_OPTION_option_rom, \ --option-rom rom load a file, rom, into the option ROM space\n, +-option-rom rom[,bootindex=idx] load a file, rom, into the option ROM space\n, QEMU_ARCH_ALL) STEXI -@item -option-rom @var{file} +@item -option-rom @var{file}[,bootindex=@var{bootindex}] @findex -option-rom Load the contents of @var{file} as an option ROM. This option is useful to load things like EtherBoot. + +@table @option +@item bootindex=@var{bootindex} +Defines which boot index the option rom will be given in boot entry vectors, +allowing fine-grained selection of devices boot order. +@end table + ETEXI DEF(clock, HAS_ARG, QEMU_OPTION_clock, \ -- 1.7.4
[Qemu-devel] [PATCH 2/2] add fw_dir option to option-rom switch
The option-rom puts all roms passed by this switch in the genroms directory, through the fw_dir option. But as it turns out, not all roms should be placed there. VGA roms are a of a different kind. They live in a different segment, and are scanned first. This patch allows qemu to use external vgaroms such as sgabios, a rom for a serial graphics adapter. Without this patch, we lose all the initial BIOS output: until the option rom is initialized, the only available vga rom is the default cirrus/vesa one. We're also vulnerable to option rom enumeration order: if a vga oprom is initialized before, say, gpxe, we are able to see its output in the adapter. If it is initialized after, we miss it. So I believe the proper solution is to allow users to specify that a rom belongs in vgaroms directory instead. Signed-off-by: Glauber Costa glom...@redhat.com CC: Gleb Natapov g...@redhat.com --- hw/pc.c |7 ++- qemu-config.c |3 +++ qemu-options.hx |7 +-- sysemu.h|1 + vl.c|1 + 5 files changed, 16 insertions(+), 3 deletions(-) diff --git a/hw/pc.c b/hw/pc.c index 8d351ba..736efbb 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -1055,7 +1055,12 @@ void pc_memory_init(ram_addr_t ram_size, } for (i = 0; i nb_option_roms; i++) { -rom_add_option(option_rom[i].name, option_rom[i].bootindex); +/* do it here, instead of in vl.c, to avoid cluttering that file with x86 material */ +if (!option_rom[i].fw_dir) { +option_rom[i].fw_dir = genroms; +} +rom_add_file(option_rom[i].name, option_rom[i].fw_dir, 0, + option_rom[i].bootindex); } } diff --git a/qemu-config.c b/qemu-config.c index 6d9c238..97b8515 100644 --- a/qemu-config.c +++ b/qemu-config.c @@ -450,6 +450,9 @@ QemuOptsList qemu_option_rom_opts = { }, { .name = romfile, .type = QEMU_OPT_STRING, +}, { +.name = fw_dir, +.type = QEMU_OPT_STRING, }, { /* end if list */ } }, diff --git a/qemu-options.hx b/qemu-options.hx index 96927cc..d9eec62 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -2079,10 +2079,10 @@ to cope with initialization race conditions. ETEXI DEF(option-rom, HAS_ARG, QEMU_OPTION_option_rom, \ --option-rom rom[,bootindex=idx] load a file, rom, into the option ROM space\n, +-option-rom rom[,bootindex=idx][,fw_dir=dir] load a file, rom, into the option ROM space\n, QEMU_ARCH_ALL) STEXI -@item -option-rom @var{file}[,bootindex=@var{bootindex}] +@item -option-rom @var{file}[,bootindex=@var{bootindex}][,fw_dir=@var{dir}] @findex -option-rom Load the contents of @var{file} as an option ROM. This option is useful to load things like EtherBoot. @@ -2091,6 +2091,9 @@ This option is useful to load things like EtherBoot. @item bootindex=@var{bootindex} Defines which boot index the option rom will be given in boot entry vectors, allowing fine-grained selection of devices boot order. +@item fw_dir=@var{dir} (x86 only) +Specify under which firmware directory entry this rom should live. Current +allowed values are vgaroms and genroms (default). @end table ETEXI diff --git a/sysemu.h b/sysemu.h index 3f7e17e..2f8be32 100644 --- a/sysemu.h +++ b/sysemu.h @@ -160,6 +160,7 @@ extern uint64_t node_cpumask[MAX_NODES]; typedef struct QEMUOptionRom { const char *name; int32_t bootindex; +const char *fw_dir; } QEMUOptionRom; extern QEMUOptionRom option_rom[MAX_OPTION_ROMS]; extern int nb_option_roms; diff --git a/vl.c b/vl.c index 8bcf2ae..e1d7868 100644 --- a/vl.c +++ b/vl.c @@ -2675,6 +2675,7 @@ int main(int argc, char **argv, char **envp) fprintf(stderr, Option ROM file is not specified\n); exit(1); } +option_rom[nb_option_roms].fw_dir = qemu_opt_get(opts, fw_dir); nb_option_roms++; break; case QEMU_OPTION_semihosting: -- 1.7.4
Re: [Qemu-devel] [PATCH 2/2] add fw_dir option to option-rom switch
On Tue, 2011-04-12 at 12:40 -0500, Anthony Liguori wrote: On 04/12/2011 12:23 PM, Glauber Costa wrote: The option-rom puts all roms passed by this switch in the genroms directory, through the fw_dir option. But as it turns out, not all roms should be placed there. VGA roms are a of a different kind. They live in a different segment, and are scanned first. This patch allows qemu to use external vgaroms such as sgabios, a rom for a serial graphics adapter. Wouldn't it make more sense to have a new PCI device that had sgabios that could play the role of the VGA device. Then you could say -vga sga Regards, This would be good, if we included sga along the roms we ship. We're still, however, left with the problem that in the future, people may want to use their own roms. And the fact that not all of them should live in genroms persists. Anthony Liguori Without this patch, we lose all the initial BIOS output: until the option rom is initialized, the only available vga rom is the default cirrus/vesa one. We're also vulnerable to option rom enumeration order: if a vga oprom is initialized before, say, gpxe, we are able to see its output in the adapter. If it is initialized after, we miss it. So I believe the proper solution is to allow users to specify that a rom belongs in vgaroms directory instead. Signed-off-by: Glauber Costaglom...@redhat.com CC: Gleb Natapovg...@redhat.com --- hw/pc.c |7 ++- qemu-config.c |3 +++ qemu-options.hx |7 +-- sysemu.h|1 + vl.c|1 + 5 files changed, 16 insertions(+), 3 deletions(-) diff --git a/hw/pc.c b/hw/pc.c index 8d351ba..736efbb 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -1055,7 +1055,12 @@ void pc_memory_init(ram_addr_t ram_size, } for (i = 0; i nb_option_roms; i++) { -rom_add_option(option_rom[i].name, option_rom[i].bootindex); +/* do it here, instead of in vl.c, to avoid cluttering that file with x86 material */ +if (!option_rom[i].fw_dir) { +option_rom[i].fw_dir = genroms; +} +rom_add_file(option_rom[i].name, option_rom[i].fw_dir, 0, + option_rom[i].bootindex); } } diff --git a/qemu-config.c b/qemu-config.c index 6d9c238..97b8515 100644 --- a/qemu-config.c +++ b/qemu-config.c @@ -450,6 +450,9 @@ QemuOptsList qemu_option_rom_opts = { }, { .name = romfile, .type = QEMU_OPT_STRING, +}, { +.name = fw_dir, +.type = QEMU_OPT_STRING, }, { /* end if list */ } }, diff --git a/qemu-options.hx b/qemu-options.hx index 96927cc..d9eec62 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -2079,10 +2079,10 @@ to cope with initialization race conditions. ETEXI DEF(option-rom, HAS_ARG, QEMU_OPTION_option_rom, \ --option-rom rom[,bootindex=idx] load a file, rom, into the option ROM space\n, +-option-rom rom[,bootindex=idx][,fw_dir=dir] load a file, rom, into the option ROM space\n, QEMU_ARCH_ALL) STEXI -@item -option-rom @var{file}[,bootindex=@var{bootindex}] +@item -option-rom @var{file}[,bootindex=@var{bootindex}][,fw_dir=@var{dir}] @findex -option-rom Load the contents of @var{file} as an option ROM. This option is useful to load things like EtherBoot. @@ -2091,6 +2091,9 @@ This option is useful to load things like EtherBoot. @item bootindex=@var{bootindex} Defines which boot index the option rom will be given in boot entry vectors, allowing fine-grained selection of devices boot order. +@item fw_dir=@var{dir} (x86 only) +Specify under which firmware directory entry this rom should live. Current +allowed values are vgaroms and genroms (default). @end table ETEXI diff --git a/sysemu.h b/sysemu.h index 3f7e17e..2f8be32 100644 --- a/sysemu.h +++ b/sysemu.h @@ -160,6 +160,7 @@ extern uint64_t node_cpumask[MAX_NODES]; typedef struct QEMUOptionRom { const char *name; int32_t bootindex; +const char *fw_dir; } QEMUOptionRom; extern QEMUOptionRom option_rom[MAX_OPTION_ROMS]; extern int nb_option_roms; diff --git a/vl.c b/vl.c index 8bcf2ae..e1d7868 100644 --- a/vl.c +++ b/vl.c @@ -2675,6 +2675,7 @@ int main(int argc, char **argv, char **envp) fprintf(stderr, Option ROM file is not specified\n); exit(1); } +option_rom[nb_option_roms].fw_dir = qemu_opt_get(opts, fw_dir); nb_option_roms++; break; case QEMU_OPTION_semihosting:
Re: [Qemu-devel] [PATCH 2/2] add fw_dir option to option-rom switch
On Tue, 2011-04-12 at 13:31 -0500, Anthony Liguori wrote: On 04/12/2011 01:13 PM, Glauber Costa wrote: On Tue, 2011-04-12 at 12:40 -0500, Anthony Liguori wrote: On 04/12/2011 12:23 PM, Glauber Costa wrote: The option-rom puts all roms passed by this switch in the genroms directory, through the fw_dir option. But as it turns out, not all roms should be placed there. VGA roms are a of a different kind. They live in a different segment, and are scanned first. This patch allows qemu to use external vgaroms such as sgabios, a rom for a serial graphics adapter. Wouldn't it make more sense to have a new PCI device that had sgabios that could play the role of the VGA device. Then you could say -vga sga Regards, This would be good, if we included sga along the roms we ship. We're still, however, left with the problem that in the future, people may want to use their own roms. Fortunately, I don't think that there are myriads of people writing 16-bit option roms so I don't think this is a pressing problem :-) For sure, but if we had this discussion a while ago, sgabios wouldn't exist back then, and now it does =p And the fact that not all of them should live in genroms persists. Actually genroms should disappear. We should make -option-rom work by using a dummy PCI device or something like that. In a side track, I think sgabios is a pretty useful addition to qemu. Would you consider including it together with the other roms ? This would definitely increase the out-of-the box experience. I think we don't even need to have a -vga sga switch then. Since sgabios does not prevent output from going to the normal vga, we could always install it, or install when there is a -serial parameter present.
Re: [Qemu-devel] [PATCH 2/2] add fw_dir option to option-rom switch
On Tue, 2011-04-12 at 14:19 -0500, Anthony Liguori wrote: On 04/12/2011 01:47 PM, Glauber Costa wrote: For sure, but if we had this discussion a while ago, sgabios wouldn't exist back then, and now it does =p Actually, it's been around for ages :-) And the fact that not all of them should live in genroms persists. Actually genroms should disappear. We should make -option-rom work by using a dummy PCI device or something like that. In a side track, I think sgabios is a pretty useful addition to qemu. Would you consider including it together with the other roms ? Definitely. This would definitely increase the out-of-the box experience. I think we don't even need to have a -vga sga switch then. Since sgabios does not prevent output from going to the normal vga, we could always install it, or install when there is a -serial parameter present. Is that true? It needs to load during video bios init, no? So doesn't that preclude having a normal video bios installed? No. sgabios chains all int10h calls.
Re: [Qemu-devel] [PATCH v2 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription
On Mon, 2011-04-11 at 08:10 -0500, Anthony Liguori wrote: On 04/11/2011 04:08 AM, Avi Kivity wrote: On 04/11/2011 12:06 PM, Ulrich Obergfell wrote: vmstate_hpet_timer = { VMSTATE_UINT64(fsb, HPETTimer), VMSTATE_UINT64(period, HPETTimer), VMSTATE_UINT8(wrap_flag, HPETTimer), + VMSTATE_UINT64_V(saved_period, HPETTimer, 3), + VMSTATE_UINT64_V(ticks_not_accounted, HPETTimer, 3), + VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3), + VMSTATE_UINT32_V(irq_rate, HPETTimer, 3), + VMSTATE_UINT32_V(divisor, HPETTimer, 3), We ought to be able to use a subsection keyed off of whether any ticks are currently accumulated, no? Anthony, I'm not sure if I understand your question correctly. Are you suggesting to migrate the driftfix-related state conditionally / only if there are any ticks accumulated in 'ticks_not_accounted' and 'irqs_to_inject' ? The size of the driftfix-related state is 28 bytes per timer and we have 32 timers per HPETState, i.e. 896 additional bytes per HPETState. With a maximum number of 8 HPET blocks (HPETState), this amounts to 7168 bytes. Hence, unconditional migration of the driftfix-related state should not cause significant additional overhead. It's not about overhead. Maybe I missed something. Could you please explain which benefit you see in using a subsection ? In the common case of there being no drift, you can migrate from a qemu that supports driftfix to a qemu that doesn't. Right, subsections are a trick. The idea is that when you introduce new state for a device model that is not always going to be set, when you do the migration, you detect whether the state is set or not and if it's not set, instead of sending empty versions of that state (i.e. missed_ticks=0) you just don't send the new state at all. This means that you can migrate to an older version of QEMU provided the migration would work correctly. Using subsections and testing for hpet option being disabled vs enabled, is fine. But checking for the existence of drift, like you suggested (or at least how I understood you), is very tricky. It is expected to change many times during guest lifetime, and would make our migration predictability something Heisenberg would be proud of.
Re: [Qemu-devel] [PATCH v2 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription
On Mon, 2011-04-11 at 08:47 -0500, Anthony Liguori wrote: On 04/11/2011 08:39 AM, Glauber Costa wrote: On Mon, 2011-04-11 at 08:10 -0500, Anthony Liguori wrote: On 04/11/2011 04:08 AM, Avi Kivity wrote: On 04/11/2011 12:06 PM, Ulrich Obergfell wrote: vmstate_hpet_timer = { VMSTATE_UINT64(fsb, HPETTimer), VMSTATE_UINT64(period, HPETTimer), VMSTATE_UINT8(wrap_flag, HPETTimer), + VMSTATE_UINT64_V(saved_period, HPETTimer, 3), + VMSTATE_UINT64_V(ticks_not_accounted, HPETTimer, 3), + VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3), + VMSTATE_UINT32_V(irq_rate, HPETTimer, 3), + VMSTATE_UINT32_V(divisor, HPETTimer, 3), We ought to be able to use a subsection keyed off of whether any ticks are currently accumulated, no? Anthony, I'm not sure if I understand your question correctly. Are you suggesting to migrate the driftfix-related state conditionally / only if there are any ticks accumulated in 'ticks_not_accounted' and 'irqs_to_inject' ? The size of the driftfix-related state is 28 bytes per timer and we have 32 timers per HPETState, i.e. 896 additional bytes per HPETState. With a maximum number of 8 HPET blocks (HPETState), this amounts to 7168 bytes. Hence, unconditional migration of the driftfix-related state should not cause significant additional overhead. It's not about overhead. Maybe I missed something. Could you please explain which benefit you see in using a subsection ? In the common case of there being no drift, you can migrate from a qemu that supports driftfix to a qemu that doesn't. Right, subsections are a trick. The idea is that when you introduce new state for a device model that is not always going to be set, when you do the migration, you detect whether the state is set or not and if it's not set, instead of sending empty versions of that state (i.e. missed_ticks=0) you just don't send the new state at all. This means that you can migrate to an older version of QEMU provided the migration would work correctly. Using subsections and testing for hpet option being disabled vs enabled, is fine. But checking for the existence of drift, like you suggested (or at least how I understood you), is very tricky. It is expected to change many times during guest lifetime, and would make our migration predictability something Heisenberg would be proud of. Is this true? I would expect it to be very tied to workloads. For idle workloads, you should never have accumulated missed ticks whereas with heavy workloads, you always will have accumulated ticks. Is that not correct? Yes, it is , but we lose a lot of reliability by tying migration to the workload. Given that we still have to start qemu the same way both sides, we end up with a situation in which at time t, migration is possible, and at time t+1 migration is not. I'd rather have subsections enabled at all times when the option to allow driftfix is enabled.
[Qemu-devel] Re: [PATCH v2 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts
On Fri, 2011-04-08 at 17:54 +0200, Jan Kiszka wrote: +} Did I miss some change in the plan? I thought we were heading for a generic, reusable driftfix tool box (or periodic timer service)? Or is this intentionally an intermediate step? Which is a medium to long way in the future. There is benefit of having this early, since it can deliver real functionality - a reliable hpet, and converting to whatever the api may look like in the future.
[Qemu-devel] Re: [PATCH 3/3] don't create kvmclock when one of the flags are present.
On Fri, 2011-03-18 at 11:24 +0100, Jan Kiszka wrote: On 2011-03-17 23:42, Glauber Costa wrote: kvmclock presence can be signalled by two different flags. So for device creation, we have to test for both. Patch is OK, but the subject's logic is inverted. Indeed, should have said something like to test for either of them Dear maintainers, is it okay to commit with a minor edit to the changelogs? Jan Signed-off-by: Glauber Costa glom...@redhat.com --- hw/kvmclock.c |6 +- 1 files changed, 5 insertions(+), 1 deletions(-) diff --git a/hw/kvmclock.c b/hw/kvmclock.c index b6ceddf..004c4ad 100644 --- a/hw/kvmclock.c +++ b/hw/kvmclock.c @@ -103,7 +103,11 @@ static SysBusDeviceInfo kvmclock_info = { void kvmclock_create(void) { if (kvm_enabled() -first_cpu-cpuid_kvm_features (1ULL KVM_FEATURE_CLOCKSOURCE)) { +first_cpu-cpuid_kvm_features ((1ULL KVM_FEATURE_CLOCKSOURCE) +#ifdef KVM_FEATURE_CLOCKSOURCE2 +|| (1ULL KVM_FEATURE_CLOCKSOURCE2) +#endif +)) { sysbus_create_simple(kvmclock, -1, NULL); } }
[Qemu-devel] [PATCH 1/3] use kernel-provided para_features instead of statically coming up with new capabilities
According to Avi's comments over my last submission, I decided to take a different, and more correct direction - we hope. This patch is now using the features provided by KVM_GET_SUPPORTED_CPUID directly to mask out features from guest-visible cpuid. The old get_para_features() mechanism is kept for older kernels that do not implement it. Signed-off-by: Glauber Costa glom...@redhat.com --- target-i386/kvm.c | 76 +++- 1 files changed, 45 insertions(+), 31 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index da757fa..dc1e547 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -95,6 +95,35 @@ static struct kvm_cpuid2 *try_get_cpuid(KVMState *s, int max) return cpuid; } +#ifdef CONFIG_KVM_PARA +struct kvm_para_features { +int cap; +int feature; +} para_features[] = { +{ KVM_CAP_CLOCKSOURCE, KVM_FEATURE_CLOCKSOURCE }, +{ KVM_CAP_NOP_IO_DELAY, KVM_FEATURE_NOP_IO_DELAY }, +{ KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP }, +#ifdef KVM_CAP_ASYNC_PF +{ KVM_CAP_ASYNC_PF, KVM_FEATURE_ASYNC_PF }, +#endif +{ -1, -1 } +}; + +static int get_para_features(CPUState *env) +{ +int i, features = 0; + +for (i = 0; i ARRAY_SIZE(para_features) - 1; i++) { +if (kvm_check_extension(env-kvm_state, para_features[i].cap)) { +features |= (1 para_features[i].feature); +} +} + +return features; +} +#endif + + uint32_t kvm_arch_get_supported_cpuid(CPUState *env, uint32_t function, uint32_t index, int reg) { @@ -102,6 +131,7 @@ uint32_t kvm_arch_get_supported_cpuid(CPUState *env, uint32_t function, int i, max; uint32_t ret = 0; uint32_t cpuid_1_edx; +int has_kvm_features = 0; max = 1; while ((cpuid = try_get_cpuid(env-kvm_state, max)) == NULL) { @@ -111,6 +141,9 @@ uint32_t kvm_arch_get_supported_cpuid(CPUState *env, uint32_t function, for (i = 0; i cpuid-nent; ++i) { if (cpuid-entries[i].function == function cpuid-entries[i].index == index) { +if (cpuid-entries[i].function == KVM_CPUID_FEATURES) { +has_kvm_features = 1; +} switch (reg) { case R_EAX: ret = cpuid-entries[i].eax; @@ -141,41 +174,16 @@ uint32_t kvm_arch_get_supported_cpuid(CPUState *env, uint32_t function, } } +/* fallback for older kernels */ +if (!has_kvm_features (function == KVM_CPUID_FEATURES)) { +ret = get_para_features(env); +} + qemu_free(cpuid); return ret; } -#ifdef CONFIG_KVM_PARA -struct kvm_para_features { -int cap; -int feature; -} para_features[] = { -{ KVM_CAP_CLOCKSOURCE, KVM_FEATURE_CLOCKSOURCE }, -{ KVM_CAP_NOP_IO_DELAY, KVM_FEATURE_NOP_IO_DELAY }, -{ KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP }, -#ifdef KVM_CAP_ASYNC_PF -{ KVM_CAP_ASYNC_PF, KVM_FEATURE_ASYNC_PF }, -#endif -{ -1, -1 } -}; - -static int get_para_features(CPUState *env) -{ -int i, features = 0; - -for (i = 0; i ARRAY_SIZE(para_features) - 1; i++) { -if (kvm_check_extension(env-kvm_state, para_features[i].cap)) { -features |= (1 para_features[i].feature); -} -} -#ifdef KVM_CAP_ASYNC_PF -has_msr_async_pf_en = features (1 KVM_FEATURE_ASYNC_PF); -#endif -return features; -} -#endif - #ifdef KVM_CAP_MCE static int kvm_get_mce_cap_supported(KVMState *s, uint64_t *mce_cap, int *max_banks) @@ -363,7 +371,13 @@ int kvm_arch_init_vcpu(CPUState *env) c = cpuid_data.entries[cpuid_i++]; memset(c, 0, sizeof(*c)); c-function = KVM_CPUID_FEATURES; -c-eax = env-cpuid_kvm_features get_para_features(env); +c-eax = env-cpuid_kvm_features kvm_arch_get_supported_cpuid(env, +KVM_CPUID_FEATURES, 0, R_EAX); + +#ifdef KVM_CAP_ASYNC_PF +has_msr_async_pf_en = c-eax (1 KVM_FEATURE_ASYNC_PF); +#endif + #endif cpu_x86_cpuid(env, 0, 0, limit, unused, unused, unused); -- 1.7.2.3
[Qemu-devel] [PATCH 0/3] enable newer msr set for kvm
This patch is a follow up to an earlier one that aims to enable kvmclock newer msr set. This time I'm doing it through a more sane mechanism of consulting the kernel about the supported msr set. Glauber Costa (3): use kernel-provided para_features instead of statically coming up with new capabilities add kvmclock to its second bit don't create kvmclock when one of the flags are present. hw/kvmclock.c |6 +++- target-i386/cpuid.c |3 +- target-i386/kvm.c | 76 ++- 3 files changed, 51 insertions(+), 34 deletions(-) -- 1.7.2.3
[Qemu-devel] [PATCH 3/3] don't create kvmclock when one of the flags are present.
kvmclock presence can be signalled by two different flags. So for device creation, we have to test for both. Signed-off-by: Glauber Costa glom...@redhat.com --- hw/kvmclock.c |6 +- 1 files changed, 5 insertions(+), 1 deletions(-) diff --git a/hw/kvmclock.c b/hw/kvmclock.c index b6ceddf..004c4ad 100644 --- a/hw/kvmclock.c +++ b/hw/kvmclock.c @@ -103,7 +103,11 @@ static SysBusDeviceInfo kvmclock_info = { void kvmclock_create(void) { if (kvm_enabled() -first_cpu-cpuid_kvm_features (1ULL KVM_FEATURE_CLOCKSOURCE)) { +first_cpu-cpuid_kvm_features ((1ULL KVM_FEATURE_CLOCKSOURCE) +#ifdef KVM_FEATURE_CLOCKSOURCE2 +|| (1ULL KVM_FEATURE_CLOCKSOURCE2) +#endif +)) { sysbus_create_simple(kvmclock, -1, NULL); } } -- 1.7.2.3
[Qemu-devel] [PATCH 2/3] add kvmclock to its second bit
We have two bits that can represent kvmclock in cpuid. They signal the guest which msr set to use. When we tweak flags involving this value - specially when we use -, we have to act on both. Besides adding it to the kvm features list, we also have to break the assumption represented by the break in lookup_feature. Signed-off-by: Glauber Costa glom...@redhat.com --- target-i386/cpuid.c |3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c index d28de20..48f9bbd 100644 --- a/target-i386/cpuid.c +++ b/target-i386/cpuid.c @@ -73,7 +73,7 @@ static const char *ext3_feature_name[] = { }; static const char *kvm_feature_name[] = { -kvmclock, kvm_nopiodelay, kvm_mmu, NULL, kvm_asyncpf, NULL, NULL, NULL, +kvmclock, kvm_nopiodelay, kvm_mmu, kvmclock, kvm_asyncpf, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, @@ -193,7 +193,6 @@ static int lookup_feature(uint32_t *pval, const char *s, const char *e, for (mask = 1, ppc = featureset; mask; mask = 1, ++ppc) if (*ppc !altcmp(s, e, *ppc)) { *pval |= mask; -break; } return (mask ? 1 : 0); } -- 1.7.2.3
[Qemu-devel] Re: [PATCH 15/15] kvm: x86: Introduce kvmclock device to save/restore its state
On Mon, 2011-02-07 at 13:36 +0100, Jan Kiszka wrote: On 2011-02-07 13:27, Glauber Costa wrote: On Mon, 2011-02-07 at 12:19 +0100, Jan Kiszka wrote: If kvmclock is used, which implies the kernel supports it, register a kvmclock device with the sysbus. Its main purpose is to save and restore the kernel state on migration, but this will also allow to visualize it one day. Signed-off-by: Jan Kiszka jan.kis...@siemens.com CC: Glauber Costa glom...@redhat.com --- Makefile.target |4 +- hw/kvmclock.c | 125 +++ hw/kvmclock.h | 14 ++ hw/pc_piix.c| 31 +++--- 4 files changed, 165 insertions(+), 9 deletions(-) create mode 100644 hw/kvmclock.c create mode 100644 hw/kvmclock.h diff --git a/Makefile.target b/Makefile.target index b0ba95f..30232fa 100644 --- a/Makefile.target +++ b/Makefile.target @@ -37,7 +37,7 @@ ifndef CONFIG_HAIKU LIBS+=-lm endif -kvm.o kvm-all.o vhost.o vhost_net.o: QEMU_CFLAGS+=$(KVM_CFLAGS) +kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o: QEMU_CFLAGS+=$(KVM_CFLAGS) config-target.h: config-target.h-timestamp config-target.h-timestamp: config-target.mak @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += debugcon.o multiboot.o -obj-i386-y += pc_piix.o +obj-i386-y += pc_piix.o kvmclock.o obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o # shared objects diff --git a/hw/kvmclock.c b/hw/kvmclock.c new file mode 100644 index 000..b6ceddf --- /dev/null +++ b/hw/kvmclock.c @@ -0,0 +1,125 @@ +/* + * QEMU KVM support, paravirtual clock device + * + * Copyright (C) 2011 Siemens AG + * + * Authors: + * Jan Kiszkajan.kis...@siemens.com + * + * This work is licensed under the terms of the GNU GPL version 2. + * See the COPYING file in the top-level directory. + * + */ + +#include qemu-common.h +#include sysemu.h +#include sysbus.h +#include kvm.h +#include kvmclock.h + +#if defined(CONFIG_KVM_PARA) defined(KVM_CAP_ADJUST_CLOCK) + +#include linux/kvm.h +#include linux/kvm_para.h + +typedef struct KVMClockState { +SysBusDevice busdev; +uint64_t clock; +bool clock_valid; +} KVMClockState; + +static void kvmclock_pre_save(void *opaque) +{ +KVMClockState *s = opaque; +struct kvm_clock_data data; +int ret; + +if (s-clock_valid) { +return; +} +ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, data); +if (ret 0) { +fprintf(stderr, KVM_GET_CLOCK failed: %s\n, strerror(ret)); +data.clock = 0; +} +s-clock = data.clock; +/* + * If the VM is stopped, declare the clock state valid to avoid re-reading + * it on next vmsave (which would return a different value). Will be reset + * when the VM is continued. + */ +s-clock_valid = !vm_running; +} + +static int kvmclock_post_load(void *opaque, int version_id) +{ +KVMClockState *s = opaque; +struct kvm_clock_data data; + +data.clock = s-clock; +data.flags = 0; +return kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, data); +} + +static void kvmclock_vm_state_change(void *opaque, int running, int reason) +{ +KVMClockState *s = opaque; + +if (running) { +s-clock_valid = false; +} +} + +static int kvmclock_init(SysBusDevice *dev) +{ +KVMClockState *s = FROM_SYSBUS(KVMClockState, dev); + +qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s); +return 0; +} + +static const VMStateDescription kvmclock_vmsd = { +.name = kvmclock, +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.pre_save = kvmclock_pre_save, +.post_load = kvmclock_post_load, +.fields = (VMStateField[]) { +VMSTATE_UINT64(clock, KVMClockState), +VMSTATE_END_OF_LIST() +} +}; + +static SysBusDeviceInfo kvmclock_info = { +.qdev.name = kvmclock, +.qdev.size = sizeof(KVMClockState), +.qdev.vmsd = kvmclock_vmsd, +.qdev.no_user = 1, +.init = kvmclock_init, +}; + +/* Note: Must be called after VCPU initialization. */ +void kvmclock_create(void) +{ +if (kvm_enabled() +first_cpu-cpuid_kvm_features (1ULL KVM_FEATURE_CLOCKSOURCE)) { +sysbus_create_simple(kvmclock, -1, NULL); +} +} + +static void kvmclock_register_device(void) +{ +if (kvm_enabled()) { +sysbus_register_withprop(kvmclock_info); +} +} + +device_init(kvmclock_register_device); + +#else /* !(CONFIG_KVM_PARA KVM_CAP_ADJUST_CLOCK) */ + +void kvmclock_create(void) +{ +} +#endif
[Qemu-devel] Re: [PATCH 15/15] kvm: x86: Introduce kvmclock device to save/restore its state
On Mon, 2011-02-07 at 15:03 +0100, Jan Kiszka wrote: On 2011-02-07 14:40, Glauber Costa wrote: On Mon, 2011-02-07 at 13:36 +0100, Jan Kiszka wrote: On 2011-02-07 13:27, Glauber Costa wrote: On Mon, 2011-02-07 at 12:19 +0100, Jan Kiszka wrote: If kvmclock is used, which implies the kernel supports it, register a kvmclock device with the sysbus. Its main purpose is to save and restore the kernel state on migration, but this will also allow to visualize it one day. Signed-off-by: Jan Kiszka jan.kis...@siemens.com CC: Glauber Costa glom...@redhat.com --- Makefile.target |4 +- hw/kvmclock.c | 125 +++ hw/kvmclock.h | 14 ++ hw/pc_piix.c| 31 +++--- 4 files changed, 165 insertions(+), 9 deletions(-) create mode 100644 hw/kvmclock.c create mode 100644 hw/kvmclock.h diff --git a/Makefile.target b/Makefile.target index b0ba95f..30232fa 100644 --- a/Makefile.target +++ b/Makefile.target @@ -37,7 +37,7 @@ ifndef CONFIG_HAIKU LIBS+=-lm endif -kvm.o kvm-all.o vhost.o vhost_net.o: QEMU_CFLAGS+=$(KVM_CFLAGS) +kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o: QEMU_CFLAGS+=$(KVM_CFLAGS) config-target.h: config-target.h-timestamp config-target.h-timestamp: config-target.mak @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += debugcon.o multiboot.o -obj-i386-y += pc_piix.o +obj-i386-y += pc_piix.o kvmclock.o obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o # shared objects diff --git a/hw/kvmclock.c b/hw/kvmclock.c new file mode 100644 index 000..b6ceddf --- /dev/null +++ b/hw/kvmclock.c @@ -0,0 +1,125 @@ +/* + * QEMU KVM support, paravirtual clock device + * + * Copyright (C) 2011 Siemens AG + * + * Authors: + * Jan Kiszkajan.kis...@siemens.com + * + * This work is licensed under the terms of the GNU GPL version 2. + * See the COPYING file in the top-level directory. + * + */ + +#include qemu-common.h +#include sysemu.h +#include sysbus.h +#include kvm.h +#include kvmclock.h + +#if defined(CONFIG_KVM_PARA) defined(KVM_CAP_ADJUST_CLOCK) + +#include linux/kvm.h +#include linux/kvm_para.h + +typedef struct KVMClockState { +SysBusDevice busdev; +uint64_t clock; +bool clock_valid; +} KVMClockState; + +static void kvmclock_pre_save(void *opaque) +{ +KVMClockState *s = opaque; +struct kvm_clock_data data; +int ret; + +if (s-clock_valid) { +return; +} +ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, data); +if (ret 0) { +fprintf(stderr, KVM_GET_CLOCK failed: %s\n, strerror(ret)); +data.clock = 0; +} +s-clock = data.clock; +/* + * If the VM is stopped, declare the clock state valid to avoid re-reading + * it on next vmsave (which would return a different value). Will be reset + * when the VM is continued. + */ +s-clock_valid = !vm_running; +} + +static int kvmclock_post_load(void *opaque, int version_id) +{ +KVMClockState *s = opaque; +struct kvm_clock_data data; + +data.clock = s-clock; +data.flags = 0; +return kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, data); +} + +static void kvmclock_vm_state_change(void *opaque, int running, int reason) +{ +KVMClockState *s = opaque; + +if (running) { +s-clock_valid = false; +} +} + +static int kvmclock_init(SysBusDevice *dev) +{ +KVMClockState *s = FROM_SYSBUS(KVMClockState, dev); + +qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s); +return 0; +} + +static const VMStateDescription kvmclock_vmsd = { +.name = kvmclock, +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.pre_save = kvmclock_pre_save, +.post_load = kvmclock_post_load, +.fields = (VMStateField[]) { +VMSTATE_UINT64(clock, KVMClockState), +VMSTATE_END_OF_LIST() +} +}; + +static SysBusDeviceInfo kvmclock_info = { +.qdev.name = kvmclock, +.qdev.size = sizeof(KVMClockState), +.qdev.vmsd = kvmclock_vmsd, +.qdev.no_user = 1, +.init = kvmclock_init, +}; + +/* Note: Must be called after VCPU initialization. */ +void kvmclock_create(void) +{ +if (kvm_enabled() +first_cpu-cpuid_kvm_features (1ULL KVM_FEATURE_CLOCKSOURCE)) { +sysbus_create_simple(kvmclock, -1, NULL); +} +} + +static void kvmclock_register_device(void) +{ +if (kvm_enabled()) { +sysbus_register_withprop(kvmclock_info); +} +} + +device_init(kvmclock_register_device); + +#else
[Qemu-devel] Re: [PATCH 15/15] kvm: x86: Introduce kvmclock device to save/restore its state
On Mon, 2011-02-07 at 19:12 +0100, Jan Kiszka wrote: On 2011-02-07 19:04, Glauber Costa wrote: On Mon, 2011-02-07 at 15:03 +0100, Jan Kiszka wrote: On 2011-02-07 14:40, Glauber Costa wrote: On Mon, 2011-02-07 at 13:36 +0100, Jan Kiszka wrote: On 2011-02-07 13:27, Glauber Costa wrote: On Mon, 2011-02-07 at 12:19 +0100, Jan Kiszka wrote: If kvmclock is used, which implies the kernel supports it, register a kvmclock device with the sysbus. Its main purpose is to save and restore the kernel state on migration, but this will also allow to visualize it one day. Signed-off-by: Jan Kiszka jan.kis...@siemens.com CC: Glauber Costa glom...@redhat.com --- Makefile.target |4 +- hw/kvmclock.c | 125 +++ hw/kvmclock.h | 14 ++ hw/pc_piix.c| 31 +++--- 4 files changed, 165 insertions(+), 9 deletions(-) create mode 100644 hw/kvmclock.c create mode 100644 hw/kvmclock.h diff --git a/Makefile.target b/Makefile.target index b0ba95f..30232fa 100644 --- a/Makefile.target +++ b/Makefile.target @@ -37,7 +37,7 @@ ifndef CONFIG_HAIKU LIBS+=-lm endif -kvm.o kvm-all.o vhost.o vhost_net.o: QEMU_CFLAGS+=$(KVM_CFLAGS) +kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o: QEMU_CFLAGS+=$(KVM_CFLAGS) config-target.h: config-target.h-timestamp config-target.h-timestamp: config-target.mak @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += debugcon.o multiboot.o -obj-i386-y += pc_piix.o +obj-i386-y += pc_piix.o kvmclock.o obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o # shared objects diff --git a/hw/kvmclock.c b/hw/kvmclock.c new file mode 100644 index 000..b6ceddf --- /dev/null +++ b/hw/kvmclock.c @@ -0,0 +1,125 @@ +/* + * QEMU KVM support, paravirtual clock device + * + * Copyright (C) 2011 Siemens AG + * + * Authors: + * Jan Kiszkajan.kis...@siemens.com + * + * This work is licensed under the terms of the GNU GPL version 2. + * See the COPYING file in the top-level directory. + * + */ + +#include qemu-common.h +#include sysemu.h +#include sysbus.h +#include kvm.h +#include kvmclock.h + +#if defined(CONFIG_KVM_PARA) defined(KVM_CAP_ADJUST_CLOCK) + +#include linux/kvm.h +#include linux/kvm_para.h + +typedef struct KVMClockState { +SysBusDevice busdev; +uint64_t clock; +bool clock_valid; +} KVMClockState; + +static void kvmclock_pre_save(void *opaque) +{ +KVMClockState *s = opaque; +struct kvm_clock_data data; +int ret; + +if (s-clock_valid) { +return; +} +ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, data); +if (ret 0) { +fprintf(stderr, KVM_GET_CLOCK failed: %s\n, strerror(ret)); +data.clock = 0; +} +s-clock = data.clock; +/* + * If the VM is stopped, declare the clock state valid to avoid re-reading + * it on next vmsave (which would return a different value). Will be reset + * when the VM is continued. + */ +s-clock_valid = !vm_running; +} + +static int kvmclock_post_load(void *opaque, int version_id) +{ +KVMClockState *s = opaque; +struct kvm_clock_data data; + +data.clock = s-clock; +data.flags = 0; +return kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, data); +} + +static void kvmclock_vm_state_change(void *opaque, int running, int reason) +{ +KVMClockState *s = opaque; + +if (running) { +s-clock_valid = false; +} +} + +static int kvmclock_init(SysBusDevice *dev) +{ +KVMClockState *s = FROM_SYSBUS(KVMClockState, dev); + +qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s); +return 0; +} + +static const VMStateDescription kvmclock_vmsd = { +.name = kvmclock, +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.pre_save = kvmclock_pre_save, +.post_load = kvmclock_post_load, +.fields = (VMStateField[]) { +VMSTATE_UINT64(clock, KVMClockState), +VMSTATE_END_OF_LIST() +} +}; + +static SysBusDeviceInfo kvmclock_info = { +.qdev.name = kvmclock, +.qdev.size = sizeof(KVMClockState), +.qdev.vmsd = kvmclock_vmsd, +.qdev.no_user = 1, +.init = kvmclock_init, +}; + +/* Note: Must be called after VCPU initialization. */ +void kvmclock_create(void) +{ +if (kvm_enabled() +first_cpu-cpuid_kvm_features (1ULL KVM_FEATURE_CLOCKSOURCE)) { +sysbus_create_simple(kvmclock, -1, NULL); +} +} + +static void kvmclock_register_device(void) +{ +if (kvm_enabled
[Qemu-devel] Re: [PATCH 15/15] kvm: x86: Introduce kvmclock device to save/restore its state
On Mon, 2011-02-07 at 12:19 +0100, Jan Kiszka wrote: If kvmclock is used, which implies the kernel supports it, register a kvmclock device with the sysbus. Its main purpose is to save and restore the kernel state on migration, but this will also allow to visualize it one day. Signed-off-by: Jan Kiszka jan.kis...@siemens.com CC: Glauber Costa glom...@redhat.com --- Makefile.target |4 +- hw/kvmclock.c | 125 +++ hw/kvmclock.h | 14 ++ hw/pc_piix.c| 31 +++--- 4 files changed, 165 insertions(+), 9 deletions(-) create mode 100644 hw/kvmclock.c create mode 100644 hw/kvmclock.h diff --git a/Makefile.target b/Makefile.target index b0ba95f..30232fa 100644 --- a/Makefile.target +++ b/Makefile.target @@ -37,7 +37,7 @@ ifndef CONFIG_HAIKU LIBS+=-lm endif -kvm.o kvm-all.o vhost.o vhost_net.o: QEMU_CFLAGS+=$(KVM_CFLAGS) +kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o: QEMU_CFLAGS+=$(KVM_CFLAGS) config-target.h: config-target.h-timestamp config-target.h-timestamp: config-target.mak @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += debugcon.o multiboot.o -obj-i386-y += pc_piix.o +obj-i386-y += pc_piix.o kvmclock.o obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o # shared objects diff --git a/hw/kvmclock.c b/hw/kvmclock.c new file mode 100644 index 000..b6ceddf --- /dev/null +++ b/hw/kvmclock.c @@ -0,0 +1,125 @@ +/* + * QEMU KVM support, paravirtual clock device + * + * Copyright (C) 2011 Siemens AG + * + * Authors: + * Jan Kiszkajan.kis...@siemens.com + * + * This work is licensed under the terms of the GNU GPL version 2. + * See the COPYING file in the top-level directory. + * + */ + +#include qemu-common.h +#include sysemu.h +#include sysbus.h +#include kvm.h +#include kvmclock.h + +#if defined(CONFIG_KVM_PARA) defined(KVM_CAP_ADJUST_CLOCK) + +#include linux/kvm.h +#include linux/kvm_para.h + +typedef struct KVMClockState { +SysBusDevice busdev; +uint64_t clock; +bool clock_valid; +} KVMClockState; + +static void kvmclock_pre_save(void *opaque) +{ +KVMClockState *s = opaque; +struct kvm_clock_data data; +int ret; + +if (s-clock_valid) { +return; +} +ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, data); +if (ret 0) { +fprintf(stderr, KVM_GET_CLOCK failed: %s\n, strerror(ret)); +data.clock = 0; +} +s-clock = data.clock; +/* + * If the VM is stopped, declare the clock state valid to avoid re-reading + * it on next vmsave (which would return a different value). Will be reset + * when the VM is continued. + */ +s-clock_valid = !vm_running; +} + +static int kvmclock_post_load(void *opaque, int version_id) +{ +KVMClockState *s = opaque; +struct kvm_clock_data data; + +data.clock = s-clock; +data.flags = 0; +return kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, data); +} + +static void kvmclock_vm_state_change(void *opaque, int running, int reason) +{ +KVMClockState *s = opaque; + +if (running) { +s-clock_valid = false; +} +} + +static int kvmclock_init(SysBusDevice *dev) +{ +KVMClockState *s = FROM_SYSBUS(KVMClockState, dev); + +qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s); +return 0; +} + +static const VMStateDescription kvmclock_vmsd = { +.name = kvmclock, +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.pre_save = kvmclock_pre_save, +.post_load = kvmclock_post_load, +.fields = (VMStateField[]) { +VMSTATE_UINT64(clock, KVMClockState), +VMSTATE_END_OF_LIST() +} +}; + +static SysBusDeviceInfo kvmclock_info = { +.qdev.name = kvmclock, +.qdev.size = sizeof(KVMClockState), +.qdev.vmsd = kvmclock_vmsd, +.qdev.no_user = 1, +.init = kvmclock_init, +}; + +/* Note: Must be called after VCPU initialization. */ +void kvmclock_create(void) +{ +if (kvm_enabled() +first_cpu-cpuid_kvm_features (1ULL KVM_FEATURE_CLOCKSOURCE)) { +sysbus_create_simple(kvmclock, -1, NULL); +} +} + +static void kvmclock_register_device(void) +{ +if (kvm_enabled()) { +sysbus_register_withprop(kvmclock_info); +} +} + +device_init(kvmclock_register_device); + +#else /* !(CONFIG_KVM_PARA KVM_CAP_ADJUST_CLOCK) */ + +void kvmclock_create(void) +{ +} +#endif /* !(CONFIG_KVM_PARA KVM_CAP_ADJUST_CLOCK) */ diff --git a/hw/kvmclock.h b/hw/kvmclock.h new file mode 100644 index 000..7a83cbe --- /dev/null +++ b/hw/kvmclock.h @@ -0,0 +1,14 @@ +/* + * QEMU KVM support, paravirtual clock device + * + * Copyright (C) 2011
[Qemu-devel] [PATCH v3] make tsc stable over migration and machine start
If the machine is stopped, we should not record two different tsc values upon a save operation. The same problem happens with kvmclock. But kvmclock is taking a different diretion, being now seen as a separate device. Since this is unlikely to happen with the tsc, I am taking the approach here of simply registering a handler for state change, and using a per-CPUState variable that prevents double updates for the TSC. Signed-off-by: Glauber Costa glom...@redhat.com CC: Jan Kiszka jan.kis...@web.de --- v2: updated tsc validation logic, as asked by Jan v3: regenerated against uq/master --- target-i386/cpu.h |1 + target-i386/kvm.c | 18 +- 2 files changed, 18 insertions(+), 1 deletions(-) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index af701a4..5f1df8b 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -734,6 +734,7 @@ typedef struct CPUX86State { uint32_t sipi_vector; uint32_t cpuid_kvm_features; uint32_t cpuid_svm_features; +bool tsc_valid; /* in order to simplify APIC support, we leave this pointer to the user */ diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 05010bb..ed9557e 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -301,6 +301,15 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t status, #endif } +static void cpu_update_state(void *opaque, int running, int reason) +{ +CPUState *env = opaque; + +if (running) { +env-tsc_valid = false; +} +} + int kvm_arch_init_vcpu(CPUState *env) { struct { @@ -434,6 +443,8 @@ int kvm_arch_init_vcpu(CPUState *env) } #endif +qemu_add_vm_change_state_handler(cpu_update_state, env); + return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, cpuid_data); } @@ -1061,7 +1072,12 @@ static int kvm_get_msrs(CPUState *env) if (has_msr_hsave_pa) { msrs[n++].index = MSR_VM_HSAVE_PA; } -msrs[n++].index = MSR_IA32_TSC; + +if (!env-tsc_valid) { +msrs[n++].index = MSR_IA32_TSC; +env-tsc_valid = !vm_running; +} + #ifdef TARGET_X86_64 if (lm_capable_kernel) { msrs[n++].index = MSR_CSTAR; -- 1.7.2.3
[Qemu-devel] [PATCH v2] make tsc stable over migration and machine start
If the machine is stopped, we should not record two different tsc values upon a save operation. The same problem happens with kvmclock. But kvmclock is taking a different diretion, being now seen as a separate device. Since this is unlikely to happen with the tsc, I am taking the approach here of simply registering a handler for state change, and using a per-CPUState variable that prevents double updates for the TSC. Signed-off-by: Glauber Costa glom...@redhat.com CC: Jan Kiszka jan.kis...@web.de --- v2: updated tsc validation logic, as asked by Jan --- target-i386/cpu.h |1 + target-i386/kvm.c | 18 +- 2 files changed, 18 insertions(+), 1 deletions(-) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 6d619e8..6bb2e87 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -732,6 +732,7 @@ typedef struct CPUX86State { uint32_t sipi_vector; uint32_t cpuid_kvm_features; uint32_t cpuid_svm_features; +bool tsc_valid; /* in order to simplify APIC support, we leave this pointer to the user */ diff --git a/target-i386/kvm.c b/target-i386/kvm.c index ecb8405..9cc198a 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -302,6 +302,15 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t status, static int _kvm_arch_init_vcpu(CPUState *env); +static void cpu_update_state(void *opaque, int running, int reason) +{ +CPUState *env = opaque; + +if (running) { +env-tsc_valid = false; +} +} + int kvm_arch_init_vcpu(CPUState *env) { int r; @@ -444,6 +453,8 @@ int kvm_arch_init_vcpu(CPUState *env) } #endif +qemu_add_vm_change_state_handler(cpu_update_state, env); + return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, cpuid_data); } @@ -1093,7 +1104,12 @@ static int kvm_get_msrs(CPUState *env) msrs[n++].index = MSR_STAR; if (kvm_has_msr_hsave_pa(env)) msrs[n++].index = MSR_VM_HSAVE_PA; -msrs[n++].index = MSR_IA32_TSC; + +if (!env-tsc_valid) { +msrs[n++].index = MSR_IA32_TSC; +env-tsc_valid = !vm_running; +} + #ifdef TARGET_X86_64 if (lm_capable_kernel) { msrs[n++].index = MSR_CSTAR; -- 1.7.2.3
[Qemu-devel] Re: [PATCH] make tsc stable over migration and machine start
On Tue, 2011-02-01 at 21:26 +0100, Jan Kiszka wrote: On 2011-02-01 20:17, Glauber Costa wrote: If the machine is stopped, we should not record two different tsc values upon a save operation. The same problem happens with kvmclock. But kvmclock is taking a different diretion, being now seen as a separate device. Since this is unlikely to happen with the tsc, I am taking the approach here of simply registering a handler for state change, and using a per-CPUState variable that prevents double updates for the TSC. Signed-off-by: Glauber Costa glom...@redhat.com --- target-i386/cpu.h |1 + target-i386/kvm.c | 19 ++- 2 files changed, 19 insertions(+), 1 deletions(-) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 6d619e8..7f1c4f8 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -732,6 +732,7 @@ typedef struct CPUX86State { uint32_t sipi_vector; uint32_t cpuid_kvm_features; uint32_t cpuid_svm_features; +uint8_t update_tsc; bool please. /* in order to simplify APIC support, we leave this pointer to the user */ diff --git a/target-i386/kvm.c b/target-i386/kvm.c index ecb8405..c3925be 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -302,6 +302,16 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t status, static int _kvm_arch_init_vcpu(CPUState *env); +static void cpu_update_state(void *opaque, int running, int reason) +{ +CPUState *env = opaque; + +if (!running) { +env-update_tsc = 1; +} +} + + Additional blank line. int kvm_arch_init_vcpu(CPUState *env) { int r; @@ -444,6 +454,8 @@ int kvm_arch_init_vcpu(CPUState *env) } #endif +qemu_add_vm_change_state_handler(cpu_update_state, env); + return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, cpuid_data); } @@ -1093,7 +1105,12 @@ static int kvm_get_msrs(CPUState *env) msrs[n++].index = MSR_STAR; if (kvm_has_msr_hsave_pa(env)) msrs[n++].index = MSR_VM_HSAVE_PA; -msrs[n++].index = MSR_IA32_TSC; + +if (env-update_tsc) { +msrs[n++].index = MSR_IA32_TSC; +env-update_tsc = 0; +} + #ifdef TARGET_X86_64 if (lm_capable_kernel) { msrs[n++].index = MSR_CSTAR; Not quite the logic I'm using for kvmclock: Ok. I have all the interest in keeping the same logic. I will respin. cpu_update_state() if running tsc_valid = false; kvm_get_msrs() ... if (!tsc_valid) read_tsc tsc_valid = !vm_running; That ensure we always read the tsc while the VM is running, and not only after it was stopped (might otherwise be surprising when once visualizing the MSRs). Jan
[Qemu-devel] [PATCH] make tsc stable over migration and machine start
If the machine is stopped, we should not record two different tsc values upon a save operation. The same problem happens with kvmclock. But kvmclock is taking a different diretion, being now seen as a separate device. Since this is unlikely to happen with the tsc, I am taking the approach here of simply registering a handler for state change, and using a per-CPUState variable that prevents double updates for the TSC. Signed-off-by: Glauber Costa glom...@redhat.com --- target-i386/cpu.h |1 + target-i386/kvm.c | 19 ++- 2 files changed, 19 insertions(+), 1 deletions(-) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 6d619e8..7f1c4f8 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -732,6 +732,7 @@ typedef struct CPUX86State { uint32_t sipi_vector; uint32_t cpuid_kvm_features; uint32_t cpuid_svm_features; +uint8_t update_tsc; /* in order to simplify APIC support, we leave this pointer to the user */ diff --git a/target-i386/kvm.c b/target-i386/kvm.c index ecb8405..c3925be 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -302,6 +302,16 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t status, static int _kvm_arch_init_vcpu(CPUState *env); +static void cpu_update_state(void *opaque, int running, int reason) +{ +CPUState *env = opaque; + +if (!running) { +env-update_tsc = 1; +} +} + + int kvm_arch_init_vcpu(CPUState *env) { int r; @@ -444,6 +454,8 @@ int kvm_arch_init_vcpu(CPUState *env) } #endif +qemu_add_vm_change_state_handler(cpu_update_state, env); + return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, cpuid_data); } @@ -1093,7 +1105,12 @@ static int kvm_get_msrs(CPUState *env) msrs[n++].index = MSR_STAR; if (kvm_has_msr_hsave_pa(env)) msrs[n++].index = MSR_VM_HSAVE_PA; -msrs[n++].index = MSR_IA32_TSC; + +if (env-update_tsc) { +msrs[n++].index = MSR_IA32_TSC; +env-update_tsc = 0; +} + #ifdef TARGET_X86_64 if (lm_capable_kernel) { msrs[n++].index = MSR_CSTAR; -- 1.7.2.3
[Qemu-devel] Re: [PATCH v3 11/21] kvm: x86: Reset paravirtual MSRs
On Tue, 2011-01-04 at 09:32 +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com Make sure to write the cleared MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, and MSR_KVM_ASYNC_PF_EN to the kernel state so that a freshly booted guest cannot be disturbed by old values. Signed-off-by: Jan Kiszka jan.kis...@siemens.com CC: Glauber Costa glom...@redhat.com looks good. target-i386/kvm.c |7 +++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index d8f26bf..8267655 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -845,6 +845,13 @@ static int kvm_put_msrs(CPUState *env, int level) if (smp_cpus == 1 || env-tsc != 0) { kvm_msr_entry_set(msrs[n++], MSR_IA32_TSC, env-tsc); } +} +/* + * The following paravirtual MSRs have side effects on the guest or are + * too heavy for normal writeback. Limit them to reset or full state + * updates. + */ +if (level = KVM_PUT_RESET_STATE) { kvm_msr_entry_set(msrs[n++], MSR_KVM_SYSTEM_TIME, env-system_time_msr); kvm_msr_entry_set(msrs[n++], MSR_KVM_WALL_CLOCK, env-wall_clock_msr);
[Qemu-devel] Re: [PATCH v3 15/21] kvm: x86: Introduce kvmclock device to save/restore its state
On Tue, 2011-01-04 at 09:32 +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com If kvmclock is used, which implies the kernel supports it, register a kvmclock device with the sysbus. Its main purpose is to save and restore the kernel state on migration, but this will also allow to visualize it one day. I would prefer having no pre-save, and doing the ioctl in the state change handler. But I won't nitpick about that, if the maintainers think this is okay, all the rest of the patch looks fine as well.
[Qemu-devel] Re: [PATCH v3 15/21] kvm: x86: Introduce kvmclock device to save/restore its state
On Tue, 2011-01-04 at 12:34 +0100, Jan Kiszka wrote: Am 04.01.2011 12:08, Glauber Costa wrote: On Tue, 2011-01-04 at 09:32 +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com If kvmclock is used, which implies the kernel supports it, register a kvmclock device with the sysbus. Its main purpose is to save and restore the kernel state on migration, but this will also allow to visualize it one day. I would prefer having no pre-save, and doing the ioctl in the state change handler. But I won't nitpick about that, if the maintainers think this is okay, all the rest of the patch looks fine as well. I did this for a reason: to be able to obtain the current clock state even while the vm is running. It's cleaner IMHO. but if we're on pre-save, we are certain that the vm is stopped at this moment, no ?
[Qemu-devel] Re: [PATCH v3 15/21] kvm: x86: Introduce kvmclock device to save/restore its state
On Tue, 2011-01-04 at 12:45 +0100, Jan Kiszka wrote: Am 04.01.2011 12:43, Glauber Costa wrote: On Tue, 2011-01-04 at 12:34 +0100, Jan Kiszka wrote: Am 04.01.2011 12:08, Glauber Costa wrote: On Tue, 2011-01-04 at 09:32 +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com If kvmclock is used, which implies the kernel supports it, register a kvmclock device with the sysbus. Its main purpose is to save and restore the kernel state on migration, but this will also allow to visualize it one day. I would prefer having no pre-save, and doing the ioctl in the state change handler. But I won't nitpick about that, if the maintainers think this is okay, all the rest of the patch looks fine as well. I did this for a reason: to be able to obtain the current clock state even while the vm is running. It's cleaner IMHO. but if we're on pre-save, we are certain that the vm is stopped at this moment, no ? Maybe at the moment, but not for device state dumping or maybe (dunno) for Kemari's continuous sync'ing. I simply want to avoid this implicit dependency. that's fine.
[Qemu-devel] Re: [PATCH v2 14/17] kvm: x86: Introduce kvmclock device to save/restore its state
On Mon, 2011-01-03 at 18:04 +0200, Avi Kivity wrote: On 01/03/2011 10:33 AM, Jan Kiszka wrote: From: Jan Kiszkajan.kis...@siemens.com If kvmclock is used, which implies the kernel supports it, register a kvmclock device with the sysbus. Its main purpose is to save and restore the kernel state on migration, but this will also allow to visualize it one day. kvmclock is a per-cpu affair. @@ -534,6 +599,10 @@ int kvm_arch_init(int smp_cpus) int ret; struct utsname utsname; +#ifdef KVM_CAP_ADJUST_CLOCK +sysbus_register_withprop(kvmclock_info); +#endif + So this doesn't look right. I think we're fine with just migrating the MSRs, like we migrate anything else that has to do with the cpu. The ioctl jan is handling here is a global one, that adjusts the base offset for the clock over migration. It is okay.
[Qemu-devel] Re: [PATCH v2 11/17] kvm: x86: Reset paravirtual MSRs
On Mon, 2011-01-03 at 09:33 +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com Make sure to clear MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, and MSR_KVM_ASYNC_PF_EN so that a freshly booted guest cannot be disturbed by old values. Signed-off-by: Jan Kiszka jan.kis...@siemens.com CC: Glauber Costa glom...@redhat.com --- target-i386/kvm.c | 10 ++ 1 files changed, 10 insertions(+), 0 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index d8f26bf..664a4a0 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -453,6 +453,9 @@ void kvm_arch_reset_vcpu(CPUState *env) env-nmi_injected = 0; env-nmi_pending = 0; env-xcr0 = 1; +env-system_time_msr = 0; +env-wall_clock_msr = 0; +env-async_pf_en_msr = 0; Have you seen this happening? I'd expect CPUState to be zeroed out over init. And if it is not, I guess we should...
[Qemu-devel] Re: [PATCH v2 14/17] kvm: x86: Introduce kvmclock device to save/restore its state
On Mon, 2011-01-03 at 09:33 +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com If kvmclock is used, which implies the kernel supports it, register a kvmclock device with the sysbus. Its main purpose is to save and restore the kernel state on migration, but this will also allow to visualize it one day. Signed-off-by: Jan Kiszka jan.kis...@siemens.com CC: Glauber Costa glom...@redhat.com Hi Jan. I've just recently posted a patch (not sure what was made from it), that fixes a bug that you reintroduce here. The bug is: if we call KVM_GET_CLOCK ioctl in pre_save, this means that this value will change every time we issue savevm, even if the machine is not run in between. Ideally, we'd like to have two consecutive savevms returning the exact same thing in that situation. I like the general approach of moving it to sysbus, but please move the ioctl to change state notifiers. Cheers!
[Qemu-devel] Re: [PATCH v2 14/17] kvm: x86: Introduce kvmclock device to save/restore its state
On Mon, 2011-01-03 at 17:30 +0100, Jan Kiszka wrote: Am 03.01.2011 17:04, Avi Kivity wrote: On 01/03/2011 10:33 AM, Jan Kiszka wrote: From: Jan Kiszkajan.kis...@siemens.com If kvmclock is used, which implies the kernel supports it, register a kvmclock device with the sysbus. Its main purpose is to save and restore the kernel state on migration, but this will also allow to visualize it one day. kvmclock is a per-cpu affair. Nope, it's state (the one save/restored here) is per VM. @@ -534,6 +599,10 @@ int kvm_arch_init(int smp_cpus) int ret; struct utsname utsname; +#ifdef KVM_CAP_ADJUST_CLOCK +sysbus_register_withprop(kvmclock_info); +#endif + So this doesn't look right. I think we're fine with just migrating the MSRs, like we migrate anything else that has to do with the cpu. The kvmclock state is not contained in any MSR. It's an independent machine state that can be indirectly obtained via MSR access. Therefore, qemu-kvm currently registers only one vmstate entry per machine, and this patch just turns this into a clean device - because that's what kvmclock is in the end, something like an HPET. Jan is right, the per-cpu information only cares about which piece of memory will be written to.
[Qemu-devel] Re: [PATCH v2 11/17] kvm: x86: Reset paravirtual MSRs
On Mon, 2011-01-03 at 17:46 +0100, Jan Kiszka wrote: Am 03.01.2011 17:40, Glauber Costa wrote: On Mon, 2011-01-03 at 09:33 +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com Make sure to clear MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, and MSR_KVM_ASYNC_PF_EN so that a freshly booted guest cannot be disturbed by old values. Signed-off-by: Jan Kiszka jan.kis...@siemens.com CC: Glauber Costa glom...@redhat.com --- target-i386/kvm.c | 10 ++ 1 files changed, 10 insertions(+), 0 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index d8f26bf..664a4a0 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -453,6 +453,9 @@ void kvm_arch_reset_vcpu(CPUState *env) env-nmi_injected = 0; env-nmi_pending = 0; env-xcr0 = 1; +env-system_time_msr = 0; +env-wall_clock_msr = 0; +env-async_pf_en_msr = 0; Have you seen this happening? I'd expect CPUState to be zeroed out over init. And if it is not, I guess we should... Ah, true, those three are part of the section that is zeroed. Will drop that hunk on repost. Guess we should rather move some other variables in that region too and avoid clearing them manually like above... Jan Agreed.
[Qemu-devel] Re: [PATCH v2 1/2] Do not register kvmclock savevm section if kvmclock is disabled.
On Wed, 2010-12-08 at 17:31 -0200, Marcelo Tosatti wrote: On Tue, Dec 07, 2010 at 03:12:36PM -0200, Glauber Costa wrote: On Mon, 2010-12-06 at 19:04 -0200, Marcelo Tosatti wrote: On Mon, Dec 06, 2010 at 09:03:46AM -0500, Glauber Costa wrote: Usually nobody usually thinks about that scenario (me included and specially), but kvmclock can be actually disabled in the host. It happens in two scenarios: 1. host too old. 2. we passed -kvmclock to our -cpu parameter. In both cases, we should not register kvmclock savevm section. This patch achives that by registering this section only if kvmclock is actually currently enabled in cpuid. The only caveat is that we have to register the savevm section a little bit later, since we won't know the final kvmclock state before cpuid gets parsed. What is the problem of registering the section? Restoring the value if the host does not support it returns an error? Can't you ignore the error if kvmclock is not reported in cpuid, in the restore handler? We can change the restore handler, but not the restore handler of binaries that are already out there. The motivation here is precisely to address migration to hosts without kvmclock, so it's better to have a way to disable, than to count on the fact that the other side will be able to ignore it. OK. Can't you register conditionally on kvmclock cpuid bit at the end of kvm_arch_init_vcpu, in target-i386/kvm.c? Haven't looked at it, but will today. Actually, tsc has (obviously) the same problem and I plan to respin the patch today including a fix for it as well. Thanks!
[Qemu-devel] Re: [PATCH v2 1/2] Do not register kvmclock savevm section if kvmclock is disabled.
On Mon, 2010-12-06 at 19:04 -0200, Marcelo Tosatti wrote: On Mon, Dec 06, 2010 at 09:03:46AM -0500, Glauber Costa wrote: Usually nobody usually thinks about that scenario (me included and specially), but kvmclock can be actually disabled in the host. It happens in two scenarios: 1. host too old. 2. we passed -kvmclock to our -cpu parameter. In both cases, we should not register kvmclock savevm section. This patch achives that by registering this section only if kvmclock is actually currently enabled in cpuid. The only caveat is that we have to register the savevm section a little bit later, since we won't know the final kvmclock state before cpuid gets parsed. What is the problem of registering the section? Restoring the value if the host does not support it returns an error? Can't you ignore the error if kvmclock is not reported in cpuid, in the restore handler? We can change the restore handler, but not the restore handler of binaries that are already out there. The motivation here is precisely to address migration to hosts without kvmclock, so it's better to have a way to disable, than to count on the fact that the other side will be able to ignore it.
[Qemu-devel] [PATCH v2 2/2] make kvmclock value idempotent for stopped machine
Although we never made such commitment clear (well, to the best of my knowledge), some people expect that two savevm issued in sequence in a stopped machine will yield the same results. This is not a crazy requirement, since we don't expect a stopped machine to be updating its state, for any device. With kvmclock, this is not the case, since the .pre_save hook will issue an ioctl to the host to acquire a timestamp, which is always changing. This patch moves the value acquisition to vm state change handlers, conditional on not being run. This could mean mean our get clock ioctl is issued more times, but this should be fine since vm_stop is not a hot path. When we do migrate, we'll transfer that value along. Signed-off-by: Glauber Costa glom...@redhat.com CC: Paolo Bonzini pbonz...@redhat.com --- qemu-kvm-x86.c |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index 14a52ce..0e357ac 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -500,11 +500,11 @@ static int kvm_enable_tpr_access_reporting(CPUState *env) #ifdef KVM_CAP_ADJUST_CLOCK static struct kvm_clock_data kvmclock_data; -static void kvmclock_pre_save(void *opaque) +static void kvmclock_update_clock(void *opaque, int running, int reason) { struct kvm_clock_data *cl = opaque; -if (!kvmclock_enabled) +if ((!kvmclock_enabled) || running) return; kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, cl); @@ -522,7 +522,6 @@ static const VMStateDescription vmstate_kvmclock= { .version_id = 1, .minimum_version_id = 1, .minimum_version_id_old = 1, -.pre_save = kvmclock_pre_save, .post_load = kvmclock_post_load, .fields = (VMStateField []) { VMSTATE_U64(clock, struct kvm_clock_data), @@ -537,6 +536,7 @@ void kvmclock_register_savevm(void) #ifdef KVM_CAP_ADJUST_CLOCK if (kvmclock_enabled kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK)) { vmstate_register(NULL, 0, vmstate_kvmclock, kvmclock_data); +qemu_add_vm_change_state_handler(kvmclock_update_clock, kvmclock_data); } #endif } -- 1.7.2.3
[Qemu-devel] [PATCH v2 0/2] savevm odness related to kvmclock
Some users told me that savevm path is behaving oddly wrt kvmclock. The first oddness is that a guarantee we never made (AFAIK) is being broken: two consecutive savevm operations, with the machine stopped in between produces different results, due to the call to KVM_GET_CLOCK ioctl. I believe the assumption that if the vm does not run, its saveable state won't change is fairly reasonable. Maybe we should formally guarantee that? Also, this patch deals with the fact that this happens even if kvmclock is disabled in cpuid: its savevm section is registered nevertheless. Here, I try to register it only if it's enabled at machine start. v2: improvements suggested by Paolo, and patch reordering. Glauber Costa (2): Do not register kvmclock savevm section if kvmclock is disabled. make kvmclock value idempotent for stopped machine cpus.c|3 +++ qemu-kvm-x86.c| 23 +++ qemu-kvm.h|3 +++ target-i386/kvm.c |7 +++ 4 files changed, 28 insertions(+), 8 deletions(-) -- 1.7.2.3
[Qemu-devel] [PATCH v2 1/2] Do not register kvmclock savevm section if kvmclock is disabled.
Usually nobody usually thinks about that scenario (me included and specially), but kvmclock can be actually disabled in the host. It happens in two scenarios: 1. host too old. 2. we passed -kvmclock to our -cpu parameter. In both cases, we should not register kvmclock savevm section. This patch achives that by registering this section only if kvmclock is actually currently enabled in cpuid. The only caveat is that we have to register the savevm section a little bit later, since we won't know the final kvmclock state before cpuid gets parsed. Signed-off-by: Glauber Costa glom...@redhat.com --- cpus.c|3 +++ qemu-kvm-x86.c| 19 +-- qemu-kvm.h|3 +++ target-i386/kvm.c |7 +++ 4 files changed, 26 insertions(+), 6 deletions(-) diff --git a/cpus.c b/cpus.c index a55c330..a24098e 100644 --- a/cpus.c +++ b/cpus.c @@ -97,6 +97,9 @@ void cpu_synchronize_all_post_init(void) for (cpu = first_cpu; cpu; cpu = cpu-next_cpu) { cpu_synchronize_post_init(cpu); } +if (kvm_enabled()) { +kvmclock_register_savevm(); +} } int cpu_is_stopped(CPUState *env) diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index 20b7d6d..14a52ce 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -504,6 +504,9 @@ static void kvmclock_pre_save(void *opaque) { struct kvm_clock_data *cl = opaque; +if (!kvmclock_enabled) +return; + kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, cl); } @@ -528,6 +531,16 @@ static const VMStateDescription vmstate_kvmclock= { }; #endif +/* This has to happen after vcpu setup*/ +void kvmclock_register_savevm(void) +{ +#ifdef KVM_CAP_ADJUST_CLOCK +if (kvmclock_enabled kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK)) { +vmstate_register(NULL, 0, vmstate_kvmclock, kvmclock_data); +} +#endif +} + int kvm_arch_qemu_create_context(void) { int r; @@ -545,12 +558,6 @@ int kvm_arch_qemu_create_context(void) return -1; } -#ifdef KVM_CAP_ADJUST_CLOCK -if (kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK)) { -vmstate_register(NULL, 0, vmstate_kvmclock, kvmclock_data); -} -#endif - r = kvm_set_boot_cpu_id(0); if (r 0 r != -ENOSYS) { return r; diff --git a/qemu-kvm.h b/qemu-kvm.h index 0f3fb50..0a104ef 100644 --- a/qemu-kvm.h +++ b/qemu-kvm.h @@ -752,6 +752,9 @@ int handle_tpr_access(void *opaque, CPUState *env, uint64_t rip, #define qemu_kvm_cpu_stop(env) do {} while(0) #endif +extern int kvmclock_enabled; +void kvmclock_register_savevm(void); + #ifdef CONFIG_KVM typedef struct KVMSlot { diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 95e5d02..5443765 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -293,6 +293,7 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t status, } static int _kvm_arch_init_vcpu(CPUState *env); +int kvmclock_enabled = 1; int kvm_arch_init_vcpu(CPUState *env) { @@ -350,6 +351,12 @@ int kvm_arch_init_vcpu(CPUState *env) memset(c, 0, sizeof(*c)); c-function = KVM_CPUID_FEATURES; c-eax = env-cpuid_kvm_features get_para_features(env); + +if (!(c-eax (1 KVM_FEATURE_CLOCKSOURCE))) { +/* In theory cpuid is per-cpu, and this is a global variable, + * but we don't expect kvmclock enabled in some cpus only */ +kvmclock_enabled = 0; +} #endif cpu_x86_cpuid(env, 0, 0, limit, unused, unused, unused); -- 1.7.2.3
[Qemu-devel] [PATCH 0/2] Fix savevm odness related to kvmclock
Some users told me that savevm path is behaving oddly wrt kvmclock. The first oddness is that a guarantee we never made (AFAIK) is being broken: two consecutive savevm operations, with the machine stopped in between produces different results, due to the call to KVM_GET_CLOCK ioctl. I believe the assumption that if the vm does not run, its saveable state won't change is fairly reasonable. Maybe we should formally guarantee that? The second patch deals with the fact that this happens even if kvmclock is disabled in cpuid: its savevm section is registered nevertheless. Here, I try to register it only if it's enabled at machine start. Thanks Glauber Costa (2): make kvmclock value idempotent for stopped machine Do not register kvmclock savevm section if kvmclock is disabled. cpus.c|7 +++ qemu-kvm-x86.c| 25 +++-- qemu-kvm.h|4 target-i386/kvm.c |7 +++ 4 files changed, 33 insertions(+), 10 deletions(-) -- 1.7.2.3
[Qemu-devel] [PATCH 2/2] Do not register kvmclock savevm section if kvmclock is disabled.
Usually nobody usually thinks about that scenario (me included and specially), but kvmclock can be actually disabled in the host. It happens in two scenarios: 1. host too old. 2. we passed -kvmclock to our -cpu parameter. In both cases, we should not register kvmclock savevm section. This patch achives that by registering this section only if kvmclock is actually currently enabled in cpuid. The only caveat is that we have to register the savevm section a little bit later, since we won't know the final kvmclock state before cpuid gets parsed. Signed-off-by: Glauber Costa glom...@redhat.com --- cpus.c|3 +++ qemu-kvm-x86.c| 20 ++-- qemu-kvm.h|2 ++ target-i386/kvm.c |7 +++ 4 files changed, 26 insertions(+), 6 deletions(-) diff --git a/cpus.c b/cpus.c index 879a03a..eef716c 100644 --- a/cpus.c +++ b/cpus.c @@ -97,6 +97,9 @@ void cpu_synchronize_all_post_init(void) for (cpu = first_cpu; cpu; cpu = cpu-next_cpu) { cpu_synchronize_post_init(cpu); } +if (kvm_enabled()) { +kvmclock_register_savevm(); +} } int cpu_is_stopped(CPUState *env) diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index d099d3d..668c8cf 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -502,6 +502,9 @@ static struct kvm_clock_data kvmclock_data; void kvmclock_update_clock(void) { +if (!kvmclock_enabled) +return; + kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, kvmclock_data); } @@ -525,6 +528,17 @@ static const VMStateDescription vmstate_kvmclock= { }; #endif +/* This has to happen after vcpu setup*/ +void kvmclock_register_savevm(void) +{ +#ifdef KVM_CAP_ADJUST_CLOCK +if (kvmclock_enabled kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK)) { +printf(registering kvmclock savevm section\n); +vmstate_register(NULL, 0, vmstate_kvmclock, kvmclock_data); +} +#endif +} + int kvm_arch_qemu_create_context(void) { int r; @@ -542,12 +556,6 @@ int kvm_arch_qemu_create_context(void) return -1; } -#ifdef KVM_CAP_ADJUST_CLOCK -if (kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK)) { -vmstate_register(NULL, 0, vmstate_kvmclock, kvmclock_data); -} -#endif - r = kvm_set_boot_cpu_id(0); if (r 0 r != -ENOSYS) { return r; diff --git a/qemu-kvm.h b/qemu-kvm.h index b0b7ab3..f51a2d6 100644 --- a/qemu-kvm.h +++ b/qemu-kvm.h @@ -753,6 +753,8 @@ int handle_tpr_access(void *opaque, CPUState *env, uint64_t rip, #endif void kvmclock_update_clock(void); +extern int kvmclock_enabled; +void kvmclock_register_savevm(void); #ifdef CONFIG_KVM diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 95e5d02..5443765 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -293,6 +293,7 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t status, } static int _kvm_arch_init_vcpu(CPUState *env); +int kvmclock_enabled = 1; int kvm_arch_init_vcpu(CPUState *env) { @@ -350,6 +351,12 @@ int kvm_arch_init_vcpu(CPUState *env) memset(c, 0, sizeof(*c)); c-function = KVM_CPUID_FEATURES; c-eax = env-cpuid_kvm_features get_para_features(env); + +if (!(c-eax (1 KVM_FEATURE_CLOCKSOURCE))) { +/* In theory cpuid is per-cpu, and this is a global variable, + * but we don't expect kvmclock enabled in some cpus only */ +kvmclock_enabled = 0; +} #endif cpu_x86_cpuid(env, 0, 0, limit, unused, unused, unused); -- 1.7.2.3
[Qemu-devel] [PATCH 1/2] make kvmclock value idempotent for stopped machine
Although we never made such commitment clear (well, to the best of my knowledge), some people expect that two savevm issued in sequence in a stopped machine will yield the same results. This is not a crazy requirement, since we don't expect a stopped machine to be updating its state, for any device. With kvmclock, this is not the case, since the .pre_save hook will issue an ioctl to the host to acquire a timestamp, which is always changing. This patch moves the value acquisition to vm_stop. This should mean our get clock ioctl is issued more times, but this should be fine since vm_stop is not a hot path. When we do migrate, we'll transfer that value along. Signed-off-by: Glauber Costa glom...@redhat.com --- cpus.c |4 qemu-kvm-x86.c |7 ++- qemu-kvm.h |2 ++ 3 files changed, 8 insertions(+), 5 deletions(-) diff --git a/cpus.c b/cpus.c index a55c330..879a03a 100644 --- a/cpus.c +++ b/cpus.c @@ -112,6 +112,10 @@ static void do_vm_stop(int reason) pause_all_vcpus(); vm_state_notify(0, reason); monitor_protocol_event(QEVENT_STOP, NULL); +if (kvm_enabled()) { +kvmclock_update_clock(); +} + } } diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index 20b7d6d..d099d3d 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -500,11 +500,9 @@ static int kvm_enable_tpr_access_reporting(CPUState *env) #ifdef KVM_CAP_ADJUST_CLOCK static struct kvm_clock_data kvmclock_data; -static void kvmclock_pre_save(void *opaque) +void kvmclock_update_clock(void) { -struct kvm_clock_data *cl = opaque; - -kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, cl); +kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, kvmclock_data); } static int kvmclock_post_load(void *opaque, int version_id) @@ -519,7 +517,6 @@ static const VMStateDescription vmstate_kvmclock= { .version_id = 1, .minimum_version_id = 1, .minimum_version_id_old = 1, -.pre_save = kvmclock_pre_save, .post_load = kvmclock_post_load, .fields = (VMStateField []) { VMSTATE_U64(clock, struct kvm_clock_data), diff --git a/qemu-kvm.h b/qemu-kvm.h index 0f3fb50..b0b7ab3 100644 --- a/qemu-kvm.h +++ b/qemu-kvm.h @@ -752,6 +752,8 @@ int handle_tpr_access(void *opaque, CPUState *env, uint64_t rip, #define qemu_kvm_cpu_stop(env) do {} while(0) #endif +void kvmclock_update_clock(void); + #ifdef CONFIG_KVM typedef struct KVMSlot { -- 1.7.2.3
[Qemu-devel] Re: [PATCH 0/5] RFC: distinguish warm reset from cold reset.
On Mon, Aug 30, 2010 at 05:35:20PM +0900, Isaku Yamahata wrote: On Mon, Aug 30, 2010 at 10:59:19AM +0300, Avi Kivity wrote: On 08/30/2010 10:49 AM, Isaku Yamahata wrote: This patch set distinguish warm reset from cold reset by introducing warm reset callback handler. The first 4 patches are trivial clean up patches. The last patch of 5/5 is RFC patch. The following thread arose cold reset vs warm reset issues. http://lists.nongnu.org/archive/html/qemu-devel/2010-08/msg00186.html The summary is - warm reset is wanted in qemu - Pressing the reset button is a warm reset on real machines - Sparc64 CPU uses different reset vector for warm and cold reset, so system_reset acts like a reset button - Bus reset can be implemented utilizing qdev frame work instead of implemeting it each bus layer independently. - The modification should be incremental. Anthony would like to see that as an incremental addition to what we have today (like introducing a propagating warm reset callback) and thinking through what the actual behavior should and shouldn't be. If the direction is okay, The next step would be a patch(set) for qdev which would introduce qdev_cold_reset(), qdev_warm_reset(), DeviceInfo::cold_reset and DeviceInfo::warm_reset and would obsolete qdev_reset() and DeviceInfo::reset. What would be the difference between warm and cold reset? Former called on any reset, while the latter called on power up only? What I have in mind at the moment is, warm reset callback is called on warm reset, not called on power up. cold reset callback is called only on power up (and power cycle). Hmm, should warm reset handler be called on power up? Each cold reset callbacks can call corresponding warm reset handler if necessary. So it would be redundant to call qdev_warm_reset() on power up. Or would it be more robust to call warm reset in addition to cold reset on power on? I guess a good way to do that would be making reset just mean warm reset, and keep a single list. Those would be called on every kind of reset. Alternatively, cards that want to have a different power-on code could then be added to an amendment list, interfaced by a cold-reset API.
[Qemu-devel] Re: [PATCH] e1000: Fix hotplug
On Mon, Aug 02, 2010 at 03:15:17PM -0600, Alex Williamson wrote: When we hotplug the device, we don't go through a reset cycle, which means a hot added e1000 is useless until the VM reboots. I do guess, however, that this is true for any device, right? Wouldn't it be better to just call the newly added reset function at hotplug? One way to do that, would be to store a value indicated qemu has already started. If you add a reset handler after that, the function is called before being placed on the list.
[Qemu-devel] Re: [PATCH] qdev: Reset hotplugged devices
On Tue, Aug 03, 2010 at 10:19:47AM -0600, Alex Williamson wrote: Several devices rely on their reset() function being called to initialize device state, e1000 and rtl8139 in particular. When the device is hot added, the reset doesn't occur, often leaving the device in an unusable state. Adding a call to reset() after init() for hotplugged devices puts the device in the expected state for the guest. Signed-off-by: Alex Williamson alex.william...@redhat.com I like this one better.
Re: [Qemu-devel] [PATCH] stop cpus before forking.
On Mon, Jun 14, 2010 at 02:58:47PM -0500, Anthony Liguori wrote: On 06/14/2010 02:42 PM, Glauber Costa wrote: On Mon, Jun 14, 2010 at 02:33:00PM -0500, Anthony Liguori wrote: On 06/14/2010 02:27 PM, Glauber Costa wrote: This patch fixes a bug that happens with kvm, irqchip-in-kernel, while adding a netdev. Despite the situations of reproduction being specific to kvm, I believe this fix is pretty generic, and fits here. Specially if we ever want to have our own irqchip in kernel too. The problem happens after the fork system call, and although it is not 100 % reproduceable, happens pretty often. After fork, the memory where the apic is mapped is present in both processes. It ends up confusing the vcpus somewhere in the irq- ack path, and qemu hangs, with no irqs being delivered at all from that point on. Making sure the vcpus are stopped before forking makes the problem go away. Besides, this is a pretty unfrequent operation, which already hangs the io-thread for a while. So it should not hurt performance. Signed-off-by: Glauber Costaglom...@redhat.com This doesn't make very much sense to me but smells like a kernel bug to me. My interpretation is that by doing that, we make sure no in-flight requests are happening. Actually, a sleep(x), with x sufficiently big is enough to make this problem go away, but that is too hacky. vm_stop() is probably just acting a glorified sleep() since it has to wait for each thread to stop. No, it is not. It also makes sure no vcpus are running, and thus, not generating new requests (or waiting for any replies, for that matter). (Note: I am not advocating the inclusion of this, just trying to build my own awareness)
[Qemu-devel] [PATCH] stop cpus before forking.
This patch fixes a bug that happens with kvm, irqchip-in-kernel, while adding a netdev. Despite the situations of reproduction being specific to kvm, I believe this fix is pretty generic, and fits here. Specially if we ever want to have our own irqchip in kernel too. The problem happens after the fork system call, and although it is not 100 % reproduceable, happens pretty often. After fork, the memory where the apic is mapped is present in both processes. It ends up confusing the vcpus somewhere in the irq - ack path, and qemu hangs, with no irqs being delivered at all from that point on. Making sure the vcpus are stopped before forking makes the problem go away. Besides, this is a pretty unfrequent operation, which already hangs the io-thread for a while. So it should not hurt performance. Signed-off-by: Glauber Costa glom...@redhat.com --- net/tap.c |4 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/net/tap.c b/net/tap.c index 0147dab..f34dd9c 100644 --- a/net/tap.c +++ b/net/tap.c @@ -330,6 +330,9 @@ static int launch_script(const char *setup_script, const char *ifname, int fd) sigaddset(mask, SIGCHLD); sigprocmask(SIG_BLOCK, mask, oldmask); +/* make sure no cpus are running, so the apic does not + * get confused */ +vm_stop(0); /* try to launch network script */ pid = fork(); if (pid == 0) { @@ -350,6 +353,7 @@ static int launch_script(const char *setup_script, const char *ifname, int fd) execv(setup_script, args); _exit(1); } else if (pid 0) { +vm_start(); while (waitpid(pid, status, 0) != pid) { /* loop */ } -- 1.7.0.1
Re: [Qemu-devel] [PATCH] stop cpus before forking.
On Mon, Jun 14, 2010 at 02:33:00PM -0500, Anthony Liguori wrote: On 06/14/2010 02:27 PM, Glauber Costa wrote: This patch fixes a bug that happens with kvm, irqchip-in-kernel, while adding a netdev. Despite the situations of reproduction being specific to kvm, I believe this fix is pretty generic, and fits here. Specially if we ever want to have our own irqchip in kernel too. The problem happens after the fork system call, and although it is not 100 % reproduceable, happens pretty often. After fork, the memory where the apic is mapped is present in both processes. It ends up confusing the vcpus somewhere in the irq- ack path, and qemu hangs, with no irqs being delivered at all from that point on. Making sure the vcpus are stopped before forking makes the problem go away. Besides, this is a pretty unfrequent operation, which already hangs the io-thread for a while. So it should not hurt performance. Signed-off-by: Glauber Costaglom...@redhat.com This doesn't make very much sense to me but smells like a kernel bug to me. My interpretation is that by doing that, we make sure no in-flight requests are happening. Actually, a sleep(x), with x sufficiently big is enough to make this problem go away, but that is too hacky. I do agree that this is most likely a kernel bug. But as with any other kernel bugs, I believe this is a easy workaround to have things working even in older kernels until we fix it. Even if it isn't, I can't rationalize why stopping the vm like this is enough to fix such a problem. Is the problem that the KVM VCPU threads get duplicated while potentially running or something like that? I doubt fork is duplicating the vcpu threads. More than that, this bug does not happen with userspace irqchip. So I believe that either irq request or the ack itself is reaching the wrong process, forever stalling the apic.
Re: [Qemu-devel] [PATCH] stop cpus before forking.
On Mon, Jun 14, 2010 at 02:58:47PM -0500, Anthony Liguori wrote: On 06/14/2010 02:42 PM, Glauber Costa wrote: On Mon, Jun 14, 2010 at 02:33:00PM -0500, Anthony Liguori wrote: On 06/14/2010 02:27 PM, Glauber Costa wrote: This patch fixes a bug that happens with kvm, irqchip-in-kernel, while adding a netdev. Despite the situations of reproduction being specific to kvm, I believe this fix is pretty generic, and fits here. Specially if we ever want to have our own irqchip in kernel too. The problem happens after the fork system call, and although it is not 100 % reproduceable, happens pretty often. After fork, the memory where the apic is mapped is present in both processes. It ends up confusing the vcpus somewhere in the irq- ack path, and qemu hangs, with no irqs being delivered at all from that point on. Making sure the vcpus are stopped before forking makes the problem go away. Besides, this is a pretty unfrequent operation, which already hangs the io-thread for a while. So it should not hurt performance. Signed-off-by: Glauber Costaglom...@redhat.com This doesn't make very much sense to me but smells like a kernel bug to me. My interpretation is that by doing that, we make sure no in-flight requests are happening. Actually, a sleep(x), with x sufficiently big is enough to make this problem go away, but that is too hacky. vm_stop() is probably just acting a glorified sleep() since it has to wait for each thread to stop. I do agree that this is most likely a kernel bug. But as with any other kernel bugs, I believe this is a easy workaround to have things working even in older kernels until we fix it. If we don't know what the bug is, then we do not know whether this is a work around. Rather, this change happens to make the bug more difficult to reproduce with your test case. Even if it isn't, I can't rationalize why stopping the vm like this is enough to fix such a problem. Is the problem that the KVM VCPU threads get duplicated while potentially running or something like that? I doubt fork is duplicating the vcpu threads. More than that, this bug does not happen with userspace irqchip. So I believe that either irq request or the ack itself is reaching the wrong process, forever stalling the apic. That sounds more like a signal delivery issue. It's not obvious to me that we're doing the wrong thing with signal mask though. If it's a signal mask related issue, then vm_stop isn't a proper fix as there would be still be a race. I do can investigate it further, but I doubt it is signal-delivery related. I spent the first days believing it was, but now, I believe it is much more likely to be apic-related. We don't need to wait for the child to exit for this bug to happen, so SIGCHLD is never raised. And with in-kernel irqchip, we don't deliver signals during normal vcpu execution.
[Qemu-devel] [PATCH] introduce -machine switch
This patch adds initial support for the -machine option, that allows command line specification of machine attributes. Besides its value per-se, it is the saner way we found to allow for enabling/disabling of kvm's in-kernel irqchip. machine-related options like kernel, initrd, etc, are now accepted under this switch. Note: This is against anthony's staging. --- hw/boards.h |4 +++ hw/pc_piix.c|3 ++ qemu-options.hx | 14 ++ vl.c| 72 +++ 4 files changed, 67 insertions(+), 26 deletions(-) diff --git a/hw/boards.h b/hw/boards.h index 18b6b8f..bac8583 100644 --- a/hw/boards.h +++ b/hw/boards.h @@ -35,6 +35,10 @@ extern QEMUMachine *current_machine; #define COMMON_MACHINE_OPTS() \ { \ +.name = machine, \ +.type = QEMU_OPT_STRING,\ +}, \ +{ \ .name = ram_size, \ .type = QEMU_OPT_NUMBER,\ }, \ diff --git a/hw/pc_piix.c b/hw/pc_piix.c index f01194c..3ddb695 100644 --- a/hw/pc_piix.c +++ b/hw/pc_piix.c @@ -67,6 +67,9 @@ static void pc_init1(QemuOpts *opts, int pci_enabled) vmport_init(); +if (!kernel_cmdline) +kernel_cmdline = ; + /* allocate ram and load rom/bios */ pc_memory_init(ram_size, kernel_filename, kernel_cmdline, initrd_filename, below_4g_mem_size, above_4g_mem_size); diff --git a/qemu-options.hx b/qemu-options.hx index a6928b7..76ca866 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -35,6 +35,20 @@ STEXI Select the emulated @var{machine} (@code{-M ?} for list) ETEXI +DEF(machine, HAS_ARG, QEMU_OPTION_machine, +-machine [machine=m][,ram_size=ram][,boot_device=dev]\n + [,kernel=vmlinux][,cmdline=kernel_cmdline][,initrd=initrd]\n + [,cpu=cpu_type]\n +pc-specific options: [,acpi=on|off]\n +kvm-x86 specific options: [,apic_in_kernel=on|off]\n +select emulated machine (-machine ? for list)\n, +QEMU_ARCH_ALL) +STEXI +...@item -machine @var{machine}[,@var{option}] +...@findex -machine +Select the emulated @var{machine} (@code{-machine ?} for list) +ETEXI + DEF(cpu, HAS_ARG, QEMU_OPTION_cpu, -cpu cpuselect CPU (-cpu ? for list)\n, QEMU_ARCH_ALL) STEXI diff --git a/vl.c b/vl.c index 96b8d35..177ffe2 100644 --- a/vl.c +++ b/vl.c @@ -1605,6 +1605,16 @@ static QEMUMachine *find_machine(const char *name) if (m-alias !strcmp(m-alias, name)) return m; } + +printf(Supported machines are:\n); +for(m = first_machine; m != NULL; m = m-next) { + if (m-alias) + printf(%-10s %s (alias of %s)\n, + m-alias, m-desc, m-name); + printf(%-10s %s%s\n, + m-name, m-desc, + m-is_default ? (default) : ); +} return NULL; } @@ -2567,7 +2577,7 @@ int main(int argc, char **argv, char **envp) DisplayState *ds; DisplayChangeListener *dcl; int cyls, heads, secs, translation; -QemuOpts *hda_opts = NULL, *opts; +QemuOpts *hda_opts = NULL, *machine_opts = NULL, *opts = NULL; int optind; const char *optarg; const char *loadvm = NULL; @@ -2697,21 +2707,29 @@ int main(int argc, char **argv, char **envp) exit(1); } switch(popt-index) { +case QEMU_OPTION_machine: { + const char *mach; + +machine_opts = qemu_opts_parse(qemu_machine_opts, optarg, 0); +if (!machine_opts) { +fprintf(stderr, parse error: %s\n, optarg); +exit(1); +} +mach = qemu_opt_get(machine_opts, machine); + + if (!mach) + break; + +machine = find_machine(mach); + +if (!machine) +exit(*mach != '?'); + break; + } case QEMU_OPTION_M: machine = find_machine(optarg); -if (!machine) { -QEMUMachine *m; -printf(Supported machines are:\n); -for(m = first_machine; m != NULL; m = m-next) { -if (m-alias) -printf(%-10s %s (alias of %s)\n, - m-alias, m-desc, m-name); -printf(%-10s %s%s\n, - m-name, m-desc, - m-is_default ? (default) : ); -} -exit(*optarg != '?'); -} +if (!machine) + exit(*optarg != '?'); break; case QEMU_OPTION_cpu: /* hw initialization will
[Qemu-devel] Re: [PATCH v2 2/2] basic machine opts framework
On Wed, Jun 02, 2010 at 09:15:10AM +0200, Jan Kiszka wrote: +QemuOptsList qemu_machine_opts = { +.name = M, +.head = QTAILQ_HEAD_INITIALIZER(qemu_machine_opts.head), +.desc = { +{ +.name = mach, +.type = QEMU_OPT_STRING, +},{ +.name = irqchip, +.type = QEMU_OPT_STRING, +}, Can't we make the concrete machine define what options it needs? Pushing this into the generic code may soon end up in a bunch of very special switches that are unused on most machines or even have no meaning for them. Also, I would suggest to introduce the generic part first, and then add first users like the x86 irqchip. Yeah, in general, I do agree with you. Me and anthony talked about it for a while some time ago, and more or less concluded that it could be possible to avoid that, putting a little think which options to add. the irqchip option, if you note, is not x86-specific, in any case. Any machine has an irqchip. The first idea was to use something like apic=in_kernel|userspace which would be, that, very x86-centric. So, since letting machines define their own options adds complexity, my take would be to add those common options, and add infrastructure for machine-specific options when we see something that makes it unavoidable. What do you think?
[Qemu-devel] [PATCH v2 1/2] early set current_machine
this way, the machine_init function itself can know which machine is current in use, not only the late init code. Signed-off-by: Glauber Costa glom...@redhat.com --- vl.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/vl.c b/vl.c index 96838f8..7a8b20b 100644 --- a/vl.c +++ b/vl.c @@ -5824,6 +5824,9 @@ int main(int argc, char **argv, char **envp) if (machine-compat_props) { qdev_prop_register_compat(machine-compat_props); } + +current_machine = machine; + machine-init(ram_size, boot_devices, kernel_filename, kernel_cmdline, initrd_filename, cpu_model); @@ -5841,8 +5844,6 @@ int main(int argc, char **argv, char **envp) } } -current_machine = machine; - /* init USB devices */ if (usb_enabled) { if (foreach_device_config(DEV_USB, usb_parse) 0) -- 1.6.2.2
[Qemu-devel] [PATCH v2 2/2] basic machine opts framework
This patch adds initial support for the -machine option, that allows command line specification of machine attributes (always relying on safe defaults). Besides its value per-se, it is the saner way we found to allow for enabling/disabling of kvm's in-kernel irqchip. A machine with in-kernel-irqchip could be specified as: -machine irqchip=apic-kvm And one without it: -machine irqchip=apic To demonstrate how it'd work, this patch introduces a choice between pic and apic, pic being the old-style isa thing. --- hw/boards.h | 10 ++ hw/pc.c | 45 +++-- qemu-config.c | 16 qemu-config.h |1 + qemu-options.hx |9 + vl.c| 54 ++ 6 files changed, 129 insertions(+), 6 deletions(-) diff --git a/hw/boards.h b/hw/boards.h index d889341..187794e 100644 --- a/hw/boards.h +++ b/hw/boards.h @@ -12,6 +12,15 @@ typedef void QEMUMachineInitFunc(ram_addr_t ram_size, const char *initrd_filename, const char *cpu_model); +typedef void (QEMUIrqchipFunc)(void *opaque); + +typedef struct QEMUIrqchip { +const char *name; +QEMUIrqchipFunc *init; +int used; +int is_default; +} QEMUIrqchip; + typedef struct QEMUMachine { const char *name; const char *alias; @@ -21,6 +30,7 @@ typedef struct QEMUMachine { int max_cpus; int is_default; CompatProperty *compat_props; +QEMUIrqchip *irqchip; struct QEMUMachine *next; } QEMUMachine; diff --git a/hw/pc.c b/hw/pc.c index 408d6d6..b3de30a 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -1007,21 +1007,43 @@ int cpu_is_bsp(CPUState *env) return env-cpuid_apic_id == 0; } +static void qemu_apic_init(void *opaque) +{ +CPUState *env = opaque; +if (!(env-cpuid_features CPUID_APIC)) { +fprintf(stderr, CPU lacks APIC cpuid flag\n); +exit(1); +} +env-cpuid_apic_id = env-cpu_index; +/* APIC reset callback resets cpu */ +apic_init(env); +} + +static void qemu_pic_init(void *opaque) +{ +CPUState *env = opaque; + +if (smp_cpus 1) { +fprintf(stderr, PIC can't support smp systems\n); +exit(1); +} +qemu_register_reset((QEMUResetHandler*)cpu_reset, env); +} + static CPUState *pc_new_cpu(const char *cpu_model) { CPUState *env; +QEMUIrqchip *ic; env = cpu_init(cpu_model); if (!env) { fprintf(stderr, Unable to find x86 CPU definition\n); exit(1); } -if ((env-cpuid_features CPUID_APIC) || smp_cpus 1) { -env-cpuid_apic_id = env-cpu_index; -/* APIC reset callback resets cpu */ -apic_init(env); -} else { -qemu_register_reset((QEMUResetHandler*)cpu_reset, env); + +for (ic = current_machine-irqchip; ic-name != NULL; ic++) { +if (ic-used) +ic-init(env); } return env; } @@ -1370,6 +1392,17 @@ static QEMUMachine pc_machine = { .desc = Standard PC, .init = pc_init_pci, .max_cpus = 255, +.irqchip = (QEMUIrqchip[]){ +{ +.name = apic, +.init = qemu_apic_init, +.is_default = 1, +},{ +.name = pic, +.init = qemu_pic_init, +}, +{ /* end of list */ }, +}, .is_default = 1, }; diff --git a/qemu-config.c b/qemu-config.c index cae92f7..e83b301 100644 --- a/qemu-config.c +++ b/qemu-config.c @@ -196,6 +196,21 @@ QemuOptsList qemu_rtc_opts = { }, }; +QemuOptsList qemu_machine_opts = { +.name = M, +.head = QTAILQ_HEAD_INITIALIZER(qemu_machine_opts.head), +.desc = { +{ +.name = mach, +.type = QEMU_OPT_STRING, +},{ +.name = irqchip, +.type = QEMU_OPT_STRING, +}, +{ /* end of list */ } +}, +}; + static QemuOptsList *lists[] = { qemu_drive_opts, qemu_chardev_opts, @@ -203,6 +218,7 @@ static QemuOptsList *lists[] = { qemu_netdev_opts, qemu_net_opts, qemu_rtc_opts, +qemu_machine_opts, NULL, }; diff --git a/qemu-config.h b/qemu-config.h index 3cc8864..9b02ee0 100644 --- a/qemu-config.h +++ b/qemu-config.h @@ -7,6 +7,7 @@ extern QemuOptsList qemu_device_opts; extern QemuOptsList qemu_netdev_opts; extern QemuOptsList qemu_net_opts; extern QemuOptsList qemu_rtc_opts; +extern QemuOptsList qemu_machine_opts; int qemu_set_option(const char *str); diff --git a/qemu-options.hx b/qemu-options.hx index 20aa242..4dfc55a 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -31,6 +31,15 @@ STEXI Select the emulated @var{machine} (@code{-M ?} for list) ETEXI +DEF(machine, HAS_ARG, QEMU_OPTION_machine, +-machine mach=m[,irqchip=chip]\n +select emulated machine (-machine ? for list)\n) +STEXI +...@item -machine @var{machine}[,@var{option}] +...@findex -machine +Select the emulated
[Qemu-devel] [PATCH v2 0/2] basic machine opts framework
Hello, This is a resent (rebased) of an old patch set of mine, I sent some time ago. With that, we should have all the needed infrastructure to select the in-kernel irqchip for KVM. Glauber Costa (2): early set current_machine basic machine opts framework hw/boards.h | 10 + hw/pc.c | 45 - qemu-config.c | 16 ++ qemu-config.h |1 + qemu-options.hx |9 vl.c| 59 +- 6 files changed, 132 insertions(+), 8 deletions(-)
[Qemu-devel] [PATCH 2/2] machine opts framework
This patch adds initial support for the -machine option, that allows command line specification of machine attributes (always relying on safe defaults). Besides its value per-se, it is the saner way we found to allow for enabling/disabling of kvm's in-kernel irqchip. A machine with in-kernel-irqchip could be specified as: -machine irqchip=apic-kvm And one without it: -machine irqchip=apic To demonstrate how it'd work, this patch introduces a choice between pic and apic, pic being the old-style isa thing. --- hw/boards.h | 10 ++ hw/pc.c | 45 +++-- qemu-config.c | 16 qemu-config.h |1 + qemu-options.hx |9 + vl.c| 54 ++ 6 files changed, 129 insertions(+), 6 deletions(-) diff --git a/hw/boards.h b/hw/boards.h index 6f0f0d7..831728c 100644 --- a/hw/boards.h +++ b/hw/boards.h @@ -12,6 +12,15 @@ typedef void QEMUMachineInitFunc(ram_addr_t ram_size, const char *initrd_filename, const char *cpu_model); +typedef void (QEMUIrqchipFunc)(void *opaque); + +typedef struct QEMUIrqchip { +const char *name; +QEMUIrqchipFunc *init; +int used; +int is_default; +} QEMUIrqchip; + typedef struct QEMUMachine { const char *name; const char *alias; @@ -28,6 +37,7 @@ typedef struct QEMUMachine { no_sdcard:1; int is_default; GlobalProperty *compat_props; +QEMUIrqchip *irqchip; struct QEMUMachine *next; } QEMUMachine; diff --git a/hw/pc.c b/hw/pc.c index ba14df0..43ec022 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -752,21 +752,43 @@ int cpu_is_bsp(CPUState *env) return env-cpu_index == 0; } +static void qemu_apic_init(void *opaque) +{ +CPUState *env = opaque; +if (!(env-cpuid_features CPUID_APIC)) { +fprintf(stderr, CPU lacks APIC cpuid flag\n); +exit(1); +} +env-cpuid_apic_id = env-cpu_index; +/* APIC reset callback resets cpu */ +apic_init(env); +} + +static void qemu_pic_init(void *opaque) +{ +CPUState *env = opaque; + +if (smp_cpus 1) { +fprintf(stderr, PIC can't support smp systems\n); +exit(1); +} +qemu_register_reset((QEMUResetHandler*)cpu_reset, env); +} + static CPUState *pc_new_cpu(const char *cpu_model) { CPUState *env; +QEMUIrqchip *ic; env = cpu_init(cpu_model); if (!env) { fprintf(stderr, Unable to find x86 CPU definition\n); exit(1); } -if ((env-cpuid_features CPUID_APIC) || smp_cpus 1) { -env-cpuid_apic_id = env-cpu_index; -/* APIC reset callback resets cpu */ -apic_init(env); -} else { -qemu_register_reset((QEMUResetHandler*)cpu_reset, env); + +for (ic = current_machine-irqchip; ic-name != NULL; ic++) { +if (ic-used) +ic-init(env); } return env; } @@ -1074,6 +1096,17 @@ static QEMUMachine pc_machine = { .desc = Standard PC, .init = pc_init_pci, .max_cpus = 255, +.irqchip = (QEMUIrqchip[]){ +{ +.name = apic, +.init = qemu_apic_init, +.is_default = 1, +},{ +.name = pic, +.init = qemu_pic_init, +}, +{ /* end of list */ }, +}, .is_default = 1, }; diff --git a/qemu-config.c b/qemu-config.c index 150157c..2b985a9 100644 --- a/qemu-config.c +++ b/qemu-config.c @@ -296,6 +296,21 @@ QemuOptsList qemu_cpudef_opts = { }, }; +QemuOptsList qemu_machine_opts = { +.name = M, +.head = QTAILQ_HEAD_INITIALIZER(qemu_machine_opts.head), +.desc = { +{ +.name = mach, +.type = QEMU_OPT_STRING, +},{ +.name = irqchip, +.type = QEMU_OPT_STRING, +}, +{ /* end of list */ } +}, +}; + static QemuOptsList *lists[] = { qemu_drive_opts, qemu_chardev_opts, @@ -306,6 +321,7 @@ static QemuOptsList *lists[] = { qemu_global_opts, qemu_mon_opts, qemu_cpudef_opts, +qemu_machine_opts, NULL, }; diff --git a/qemu-config.h b/qemu-config.h index f217c58..ea302f0 100644 --- a/qemu-config.h +++ b/qemu-config.h @@ -10,6 +10,7 @@ extern QemuOptsList qemu_rtc_opts; extern QemuOptsList qemu_global_opts; extern QemuOptsList qemu_mon_opts; extern QemuOptsList qemu_cpudef_opts; +extern QemuOptsList qemu_machine_opts; QemuOptsList *qemu_find_opts(const char *group); int qemu_set_option(const char *str); diff --git a/qemu-options.hx b/qemu-options.hx index 8450b45..585ecd2 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -34,6 +34,15 @@ STEXI Select the emulated @var{machine} (@code{-M ?} for list) ETEXI +DEF(machine, HAS_ARG, QEMU_OPTION_machine, +-machine mach=m[,irqchip=chip]\n +select emulated machine (-machine ? for list)\n) +STEXI +...@item -machine
[Qemu-devel] [PATCH 1/2] early set current_machine
this way, the machine_init function itself can know which machine is current in use, not only the late init code. --- vl.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/vl.c b/vl.c index d69250c..ceddeac 100644 --- a/vl.c +++ b/vl.c @@ -4841,6 +4841,9 @@ int main(int argc, char **argv, char **envp) } qemu_add_globals(); + +current_machine = machine; + machine-init(ram_size, boot_devices, kernel_filename, kernel_cmdline, initrd_filename, cpu_model); @@ -4859,8 +4862,6 @@ int main(int argc, char **argv, char **envp) } } -current_machine = machine; - /* init USB devices */ if (usb_enabled) { if (foreach_device_config(DEV_USB, usb_parse) 0) -- 1.6.6
Re: [Qemu-devel] [PATCH 2/2] machine opts framework
On Wed, Mar 24, 2010 at 02:43:35PM -0500, Anthony Liguori wrote: On 03/24/2010 02:26 PM, Glauber Costa wrote: This patch adds initial support for the -machine option, that allows command line specification of machine attributes (always relying on safe defaults). Besides its value per-se, it is the saner way we found to allow for enabling/disabling of kvm's in-kernel irqchip. A machine with in-kernel-irqchip could be specified as: -machine irqchip=apic-kvm And one without it: -machine irqchip=apic To demonstrate how it'd work, this patch introduces a choice between pic and apic, pic being the old-style isa thing. I started from a different place. See machine-qemuopts in my staging tree. I think we should combine efforts. Regards, Absolutely. I see little overlap between what we did. Just a comment on yours: -static void an5206_init(ram_addr_t ram_size, - const char *boot_device, - const char *kernel_filename, const char *kernel_cmdline, - const char *initrd_filename, const char *cpu_model) +static void an5206_init(QemuOpts *opts) { Since we're changing init functions anyway, I believe we should also pass a pointer to the machine structure. With that, we can avoing using the global current_machine.
Re: [Qemu-devel] Re: qdev property bug?
On Mon, Dec 14, 2009 at 04:01:43PM +0200, Michael S. Tsirkin wrote: On Mon, Dec 14, 2009 at 02:35:31PM +0100, Alexander Graf wrote: Michael S. Tsirkin wrote: On Mon, Dec 14, 2009 at 12:55:28PM +0100, Alexander Graf wrote: Am 14.12.2009 um 11:59 schrieb Michael S. Tsirkin m...@redhat.com: On Mon, Dec 14, 2009 at 11:16:34AM +0100, Gerd Hoffmann wrote: On 12/14/09 10:44, Michael S. Tsirkin wrote: No, it did not even start booting the kernel. Just gave me blank screen. [ testing ] Oh. That is something completely different. A bug in the rom loader. It fails to fit both e1000 (default nic) and virtio-net boot roms into the option rom area and bails out (before loading seabios). vl.c doesn't check the return value and happily continues (without bios). Which doesn't work out very well ... With two identical nics the (single) rom fits and qemu boots. Hmm. Of course vl.c must be fixed to check the return value. Yes. Not sure how to deal with the rom size issue. The gPXE roms look quit big compared to the older roms we had. Hmm, it's a regression then ... How does real hw handle this? I'm pretty sure most servers these days use more option rom space than this. They usually have some onboard raid bios, external storage, on-board nic, pci nic, ... Real hardware might do several things I know about - option rom is typically small. - option rom is not loaded always (BIOS option), or not for all cards. There are might be other tricks. There are probably other tricks. I was booting up a machine that had like 5 options roms going through their initialization that all weren't exactly small. This might boil down to better ROMs. E.g. maybe they request memory with PMM and then give it up after initialization? That would be my guess. However, gpxe rom are also used in real hardware. Maybe there is some config trick we're missing?
Re: [Qemu-devel] Re: qdev property bug?
On Mon, Dec 14, 2009 at 04:11:43PM +0200, Michael S. Tsirkin wrote: On Mon, Dec 14, 2009 at 08:11:59AM -0600, Anthony Liguori wrote: Alexander Graf wrote: Michael S. Tsirkin wrote: On Mon, Dec 14, 2009 at 12:55:28PM +0100, Alexander Graf wrote: Am 14.12.2009 um 11:59 schrieb Michael S. Tsirkin m...@redhat.com: On Mon, Dec 14, 2009 at 11:16:34AM +0100, Gerd Hoffmann wrote: On 12/14/09 10:44, Michael S. Tsirkin wrote: No, it did not even start booting the kernel. Just gave me blank screen. [ testing ] Oh. That is something completely different. A bug in the rom loader. It fails to fit both e1000 (default nic) and virtio-net boot roms into the option rom area and bails out (before loading seabios). vl.c doesn't check the return value and happily continues (without bios). Which doesn't work out very well ... With two identical nics the (single) rom fits and qemu boots. Hmm. Of course vl.c must be fixed to check the return value. Yes. Not sure how to deal with the rom size issue. The gPXE roms look quit big compared to the older roms we had. Hmm, it's a regression then ... How does real hw handle this? I'm pretty sure most servers these days use more option rom space than this. They usually have some onboard raid bios, external storage, on-board nic, pci nic, ... Real hardware might do several things I know about - option rom is typically small. - option rom is not loaded always (BIOS option), or not for all cards. There are might be other tricks. There are probably other tricks. I was booting up a machine that had like 5 options roms going through their initialization that all weren't exactly small. So there must be some way to just have more option rom space. What do you mean? Well, what's keeping us from having 5 MB of option roms? For starters, option roms run in real mode when you only have 1MB of addressable memory :-) Implementing anything else would just be a waste of time. It'd break again when ppl do device assignment. Alex We need some solution for 0.12 though IMO. This does not need to address device assignment, but it must be simple. Agreed. If there is a solution that gives us the chance to support an arbitrary number of option roms that wouldn't take forever to implement, I'd rather take that one though. For 0.12, we just need to fail gracefully (meaning stop loading option roms when we run out of room). It's not a regression compared to 0.11. Regards, Anthony Liguori Well I am pretty sure that I used virtio + e1000 with 0.11 and apparently I can't now. So it does look like a regression to me ... e1000 is the problem here, since it is by far the largest roms of them all.
Re: [Qemu-devel] Re: qdev property bug?
On Mon, Dec 14, 2009 at 08:22:12AM -0600, Anthony Liguori wrote: Michael S. Tsirkin wrote: Well I am pretty sure that I used virtio + e1000 with 0.11 and apparently I can't now. So it does look like a regression to me ... That's what I said, we should make sure that we stop loading roms when we run out of room as opposed to trampling over the bios space. Can't we first load the bios? Then we're pretty sure oproms will never trample it.
Re: [Qemu-devel] Spice project is now open
Spice is a library, it is library for remote display, it handle by itself all the connection between the spice client to the host that run the guest, it include: sound, display, keyboard, usb, network tunneling (for printers) and so on... I want to add that qemu is not the sole user of spice, Spice will be used as a protocol to connect into physical windows/linux machines So how can we change the library just for qemu? I don't fully understand spice yet, but what's the difficulty here? libraries changes every single day to try to acomodate for the needs of specific users, be it through generalizations, shims, or whatever. This is just another day in the OSS world. -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] Spice project is now open
I think we should allow freedom of choice to the users to decide what protcol they want to use, Spice and VNC are all diffrent and were born to meet diffrent goals. I would happy to answer more questions if anyone have I think the simple point is that, AFAICS, the spice folks are expecting the qemu team to integrate their big ugly tarball, instead of doing what everyone else does, which is forward port everything to current head and then provide a current set of patches against GIT head. Anything else is just a waste of time. The paths both projects are at are too far apart. More important than forward porting, is respecting the design decisions a huge community has agreed upon. Of course, when you become part of that community, you can try to shift these designs towards your goals, but trying to force them is just ridiculous. That said, I do believe spice can play a essential role in making us go forward, but the attitude towards it has to change quite a bit. -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] Spice project is now open
On Fri, Dec 11, 2009 at 5:22 PM, Izik Eidus iei...@redhat.com wrote: On Fri, 11 Dec 2009 13:06:47 -0600 Anthony Liguori anth...@codemonkey.ws wrote: Izik Eidus wrote: I want to add that qemu is not the sole user of spice, Spice will be used as a protocol to connect into physical windows/linux machines So how can we change the library just for qemu? A library is not necessarily a problem. What would be a probably is if the library maintains guest visible state. There are a lot of advantages to keeping qemu as the sole maintainer of guest visible state as it simplifies things like live migration. More importantly, it allows us to do things like Avi's suggested security sandboxing using seccomp(). For that to work, we need to make sure that we can isolate any code that interacts directly with the guest. Spice guest visible state inside qemu is just its PCI QXL device. This part is qemu specificed. But this part can work together with vnc with no problems, right? If this is so, why don't we just start by merging it, while trying to make the case for the rendering protocol in parallel ? -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] [PATCH 00/17] pci: switch a ton of drivers to symbolic names
On Thu, Dec 10, 2009 at 4:09 PM, Michael S. Tsirkin m...@redhat.com wrote: The recent e1000 bug made the important of using symbolic macros for pci config access clear for me. So I started going over drivers and converting to symbolic constants instead of hard-coded ones. I did a large part until I run out of steam. Maybe some brave soul will take up converting the rest of them, or maybe I will: note that when converting bridges one should be careful to use bridge macros where appropriate. Instead of testing a huge number of configurations, I compared binaries before and after conversion. Almost all of them generate exact same stripped binary before and after the change. The only object changed was eepro100, objdump showed that the change was because gcc for some reason decides to use a bit more stack for init function after comments are added there. This methodology was the reason that I added TODOs where I saw deviations from spec or other code ugliness, will have to be fixed separately. IMHO, this is a huge enhancement. I myself was found expending huge amounts of time trying do figure out the meaning of some specific constants in the past. +1 -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] Using qemu-thread for synchronization
On Thu, Dec 10, 2009 at 12:46 PM, Liran Schour lir...@il.ibm.com wrote: I want to be able to synchronize between the code that is running the live migration with the code that call fro the completion callback of async IO. For that I am using qemu-thread.c (i.e QemuCond). I see that I have problems while linking if I do not use --enable-io-thread. Can someone explain me this. And if I want to do the above what is the right way to do that? qemu_thread is only available when you use --enable-io-thread. But beware, it is not 100 % stable yet. I had a queue of patches that makes it slightly better, tough... -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] [PATCH] Permit zero-sized qemu_malloc() friends
So the only correct way would be to write: array = malloc(size); if (array == NULL size != 0) { return -ENOMEM; } Of course we can differentiate. A failed malloc will abort, a successful one will not. If you were writing portable code. But that's not what people write. You can argue that qemu_malloc() can have any semantics we want and while that's true, it doesn't change my above statement. I think the main argument for this behavior in qemu is that people are used to using this idiom with malloc() but it's a non-portable practice. If qemu_malloc() didn't carry the name malloc() then semantics with size=0 would be a different discussion. But so far, all qemu_* functions tend to behave almost exactly like their C counterparts. Relying on the result of size=0 with malloc() is broken. We can change qemu_malloc to qemu_alloc_memory(), or whatever. But from the moment we do things like abort on failing, we are already deviating from its C counterpart. -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] Re: [PATCH 0/9] in-kernel irqchip, new spin
Does msix work with this patchset when in-kernel irqchip is enabled? Haven't tested. But since I see that msix need some special code in qemu-kvm, it probably won't. But I assume we can just add a patch ontop of this to add that code and make it work, right? -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] Re: [PATCH 0/9] in-kernel irqchip, new spin
Haven't tested. But since I see that msix need some special code in qemu-kvm, it probably won't. But I assume we can just add a patch ontop of this to add that code and make it work, right? Sure. However - are you making in-kernel irqchip the default? If so, I think the best way to do it would be: 1. add in-kernel irqchip, off by default 2. fix msix with in-kernel 3. make in-kernel on by default Right now there are no knobs to disable it, since last time I checked, people were inclined to solve that by adding a machine type that does not do irqchip in kernel, if wanted. However, since it is probably not going to reach 0.12 anyway, we can come up with a patch that fixes msix, and bundle in this series before I actually enable the irqchip (which is one of the last patches) -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] Re: [PATCH 04/11] qemu_flush_work for remote vcpu execution
On Thu, Dec 3, 2009 at 10:20 AM, Avi Kivity a...@redhat.com wrote: On 12/02/2009 03:48 PM, Glauber Costa wrote: This function is similar to qemu-kvm's on_vcpu mechanism. Is similar? You're replacing on_vcpu(). Yeah, it began similar, now it is pretty much the same thing, but using qemu-specific data structures Totally synchronous, and guarantees that a given function will be executed at the specified vcpu. This patch also convert usage within the breakpoints system +void qemu_queue_work(CPUState *env, void (*func)(void *data), void *data); The name suggests that it is asynchronous. Why is this patch necessary? to keep gdbstub working. -- error compiling committee.c: too many arguments to function -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] Re: [PATCH 11/11] remove smp restriction from kvm
On Thu, Dec 3, 2009 at 10:23 AM, Avi Kivity a...@redhat.com wrote: On 12/02/2009 03:48 PM, Glauber Costa wrote: We don't support smp without irqchip in kernel, so only abort in that situation What's the reason for this restriction? It is temporary. But as far as my testing goes, we don't come even close to working without in-kernel irqchip. -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] Re: [PATCH 04/11] qemu_flush_work for remote vcpu execution
Keep the name then. The new name is misleading. ok. Totally synchronous, and guarantees that a given function will be executed at the specified vcpu. This patch also convert usage within the breakpoints system +void qemu_queue_work(CPUState *env, void (*func)(void *data), void *data); The name suggests that it is asynchronous. Why is this patch necessary? to keep gdbstub working. Because it fixes things. Please elaborate. gdbstub is called from the i/o thread , and call vcpu ioctls. So it has to use the on_vcpu mechanism to guarantee its execution in the right thread. What I meant is that currently, gdbstub is the only user of it, at least in qemu.git -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] Re: [PATCH 11/11] remove smp restriction from kvm
On Thu, Dec 3, 2009 at 10:37 AM, Avi Kivity a...@redhat.com wrote: On 12/03/2009 02:36 PM, Glauber Costa wrote: On Thu, Dec 3, 2009 at 10:23 AM, Avi Kivitya...@redhat.com wrote: On 12/02/2009 03:48 PM, Glauber Costa wrote: We don't support smp without irqchip in kernel, so only abort in that situation What's the reason for this restriction? It is temporary. But as far as my testing goes, we don't come even close to working without in-kernel irqchip. This works well in qemu-kvm. What's the reason? it may indicate a bug in up. The reason is that we do not handle SIPI in qemu.git yet. Now that the basics of smp are working, it should not be too difficult to get it working. -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] Re: [PATCH 0/9] in-kernel irqchip, new spin
On Thu, Dec 3, 2009 at 10:45 AM, Avi Kivity a...@redhat.com wrote: On 12/03/2009 02:39 PM, Michael S. Tsirkin wrote: Right now there are no knobs to disable it, since last time I checked, people were inclined to solve that by adding a machine type that does not do irqchip in kernel, if wanted. Can't everything doable by machine type also doable from command line switch? Disabling irqchip used to be a valuable debugging tool. Even worse, if we throw everything into the the machine type, we end up with 2^n machine types in order to control everything. I am totally fine re-including the cmdline patch in the series. -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] Re: [PATCH 04/11] qemu_flush_work for remote vcpu execution
On Thu, Dec 3, 2009 at 12:29 PM, Paolo Bonzini pbonz...@redhat.com wrote: On 12/02/2009 02:48 PM, Glauber Costa wrote: + if (env == qemu_get_current_env()) { Will always be false for TCG + iothread. What's wrong with qemu_cpu_self(env)? It appears to do the same, and it would also make the whole thread-local storage stuff redundant. Indeed. I'll turn to qemu_cpu_self. thanks -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] [PATCH v2 05/11] tell kernel about all registers instead of just mp_state
+{ + if (kvm_enabled()) { Is this ever called or intended to be called when kvm is disabled? + kvm_cpu_flush_state(env); + } I don't think so. But this is here for consistency with its synchronize brother. -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] [PATCH v2 0/11]
On Wed, Dec 2, 2009 at 8:59 AM, Gleb Natapov g...@redhat.com wrote: On Tue, Dec 01, 2009 at 10:51:26AM -0200, Glauber Costa wrote: This is a repost of the -smp series. Note that it depends on irqchip-in-kernel, that is already in staging. Also, you'll have to enable the io-thread, for the time being. From the last version, main change is that I am not calling queue_work automatically from vcpu ioctls. queue_work is only used currently for the gdb stub. All other uses were by-passed by the new qemu_register_vcpu_reset(), since most of it uses (all racy) came from the reset handlers. Looks good to me except one thing. I don't see how you are addressing the problem fixed by commit b8a7857071b477b28d3055e33ff0298fc91f329a in qemu-kvm. The problem is that mp_state can change in kernel between call to kvm_cpu_synchronize_state() and vcpu re-entry. In this case old mp_state will overwrite new one. I plan to do it in a patch that already lives on my tree. It basically flushes register state in early post_load -- Glauber Costa. Free as in Freedom http://glommer.net The less confident you are, the more serious you have to act.
Re: [Qemu-devel] [PATCH v2 04/11] qemu_flush_work for remote vcpu execution
On Wed, Dec 02, 2009 at 11:27:53AM -0200, Marcelo Tosatti wrote: On Tue, Dec 01, 2009 at 10:51:30AM -0200, Glauber Costa wrote: This function is similar to qemu-kvm's on_vcpu mechanism. Totally synchronous, and guarantees that a given function will be executed at the specified vcpu. This patch also convert usage within the breakpoints system Signed-off-by: Glauber Costa glom...@redhat.com --- @@ -3436,8 +3441,7 @@ static int tcg_has_work(void); static pthread_key_t current_env; -CPUState *qemu_get_current_env(void); -CPUState *qemu_get_current_env(void) +static CPUState *qemu_get_current_env(void) { return pthread_getspecific(current_env); } @@ -3474,8 +3478,10 @@ static int qemu_init_main_loop(void) static void qemu_wait_io_event(CPUState *env) { -while (!tcg_has_work()) +while (!tcg_has_work()) { This checks all cpus, while for KVM it should check only the current cpu. +qemu_flush_work(env); qemu_cond_timedwait(env-halt_cond, qemu_global_mutex, 1000); +} KVM vcpu threads should block SIGUSR1, set the in-kernel signal mask with KVM_SET_SIGNAL_MASK ioctl, and eat the signal in qemu_wait_io_event (qemu_flush_work should run after eating the signal). Similarly to qemu-kvm's kvm_main_loop_wait. Otherwise a vcpu thread can lose a signal (say handle SIGUSR1 when not holding qemu_global_mutex before kernel entry). I think this the source of the problems patch 8 attempts to fix. I remember you had patches for both theses fixes. Can you resend them, ontop of this series?
[Qemu-devel] [PATCH 0/9] in-kernel irqchip, new spin
Avi/Marcelo Please include this in a branch in qemu-kvm, for future inclusion in qemu.git This is basically the same thing that was sitting in staging for weeks, just with a build bug fix on non-x86 hosts. Glauber Costa (9): introduce VMSTATE_U64 Provide ioapic-kvm provide apic_set_irq_delivered provide i8259-kvm Don't call apic functions directly from kvm code export kvm_put_mp_state provide apic-kvm Initialize in-kernel irqchip Do GSI routing Makefile.target |2 + hw/apic-kvm.c | 157 hw/apic.c |5 ++ hw/hw.h | 24 +++ hw/i8259-kvm.c| 112 +++ hw/ioapic-kvm.c | 89 + hw/pc.c | 21 +- hw/pc.h |6 ++ kvm-all.c | 47 +- kvm.h | 16 + savevm.c | 23 +++ target-i386/cpu.h |9 +++ target-i386/kvm.c | 188 +++-- target-ppc/kvm.c |5 ++ 14 files changed, 693 insertions(+), 11 deletions(-) create mode 100644 hw/apic-kvm.c create mode 100644 hw/i8259-kvm.c create mode 100644 hw/ioapic-kvm.c
[Qemu-devel] [PATCH 1/9] introduce VMSTATE_U64
Slightly modified version of a patch already included in qemu-kvm: This is a patch actually written by Juan, which, according to him, he plans on posting to qemu.git. Problem is that linux defines u64 in a way that is type-uncompatible with uint64_t. I am including it here, because it is a dependency to my patch series that follows. Signed-off-by: Glauber Costa glom...@redhat.com --- hw/hw.h | 24 savevm.c | 23 +++ 2 files changed, 47 insertions(+), 0 deletions(-) diff --git a/hw/hw.h b/hw/hw.h index 7889aa3..e8a53f9 100644 --- a/hw/hw.h +++ b/hw/hw.h @@ -337,6 +337,11 @@ extern const VMStateInfo vmstate_info_uint16; extern const VMStateInfo vmstate_info_uint32; extern const VMStateInfo vmstate_info_uint64; +#ifdef __linux__ +#include linux/types.h +extern const VMStateInfo vmstate_info_u64; +#endif + extern const VMStateInfo vmstate_info_timer; extern const VMStateInfo vmstate_info_ptimer; extern const VMStateInfo vmstate_info_buffer; @@ -393,6 +398,16 @@ extern const VMStateInfo vmstate_info_unused_buffer; .offset = vmstate_offset_array(_state, _field, _type, _num), \ } +#define VMSTATE_ARRAY_UNSAFE(_field, _state, _num, _version, _info, _type) {\ +.name = (stringify(_field)), \ +.version_id = (_version),\ +.num= (_num),\ +.info = (_info), \ +.size = sizeof(_type), \ +.flags = VMS_ARRAY, \ +.offset = offsetof(_state, _field) \ +} + #define VMSTATE_ARRAY_TEST(_field, _state, _num, _test, _info, _type) {\ .name = (stringify(_field)), \ .field_exists = (_test), \ @@ -596,6 +611,15 @@ extern const VMStateDescription vmstate_i2c_slave; #define VMSTATE_UINT64(_f, _s)\ VMSTATE_UINT64_V(_f, _s, 0) +/* This is needed because on linux __u64 is unsigned long long + and on glibc uint64_t is unsigned long on 64 bits */ +#ifdef __linux__ +#define VMSTATE_U64_V(_f, _s, _v) \ +VMSTATE_SINGLE(_f, _s, _v, vmstate_info_u64, __u64) +#define VMSTATE_U64(_f, _s) \ +VMSTATE_U64_V(_f, _s, 0) +#endif + #define VMSTATE_UINT8_EQUAL(_f, _s) \ VMSTATE_SINGLE(_f, _s, 0, vmstate_info_uint8_equal, uint8_t) diff --git a/savevm.c b/savevm.c index 18c2e54..7d68ec0 100644 --- a/savevm.c +++ b/savevm.c @@ -872,6 +872,29 @@ const VMStateInfo vmstate_info_uint64 = { .put = put_uint64, }; +/* 64 bit linux kernel unsigned int */ + +#ifdef __linux__ +static int get_u64(QEMUFile *f, void *pv, size_t size) +{ +__u64 *v = pv; +qemu_get_be64s(f, (uint64_t *)v); +return 0; +} + +static void put_u64(QEMUFile *f, void *pv, size_t size) +{ +__u64 *v = pv; +qemu_put_be64s(f, (uint64_t *)v); +} + +const VMStateInfo vmstate_info_u64 = { +.name = __u64, +.get = get_u64, +.put = put_u64, +}; +#endif /* __linux__ */ + /* 8 bit int. See that the received value is the same than the one in the field */ -- 1.6.5.2
[Qemu-devel] [PATCH 2/9] Provide ioapic-kvm
This patch provides the file ioapic-kvm.c, which implements a schim over the kvm in-kernel IO APIC. Signed-off-by: Glauber Costa glom...@redhat.com --- Makefile.target |2 + hw/ioapic-kvm.c | 89 +++ hw/pc.c |7 - hw/pc.h |2 + kvm-all.c | 20 kvm.h |4 ++ 6 files changed, 123 insertions(+), 1 deletions(-) create mode 100644 hw/ioapic-kvm.c diff --git a/Makefile.target b/Makefile.target index 2985658..844ba46 100644 --- a/Makefile.target +++ b/Makefile.target @@ -199,6 +199,8 @@ obj-i386-y += usb-uhci.o vmmouse.o vmport.o vmware_vga.o hpet.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += ne2000-isa.o +obj-i386-$(CONFIG_KVM) += ioapic-kvm.o + # shared objects obj-ppc-y = ppc.o ide/core.o ide/qdev.o ide/isa.o ide/pci.o ide/macio.o obj-ppc-y += ide/cmd646.o diff --git a/hw/ioapic-kvm.c b/hw/ioapic-kvm.c new file mode 100644 index 000..78e0984 --- /dev/null +++ b/hw/ioapic-kvm.c @@ -0,0 +1,89 @@ +#include hw.h +#include pc.h +#include qemu-timer.h +#include host-utils.h +#include kvm.h + +#define IOAPIC_NUM_PINS0x18 +#define IOAPIC_DEFAULT_BASE_ADDRESS 0xfec0 + +static void ioapic_reset(void *opaque) +{ +struct kvm_ioapic_state *s = opaque; +struct kvm_irqchip *chip; +int i; + +chip = container_of(s, struct kvm_irqchip, chip.ioapic); + +chip-chip_id = KVM_IRQCHIP_IOAPIC; + +memset(s, 0, sizeof(*s)); +s-base_address = IOAPIC_DEFAULT_BASE_ADDRESS; +for(i = 0; i IOAPIC_NUM_PINS; i++) +s-redirtbl[i].bits = 1 16; /* mask LVT */ + +kvm_set_irqchip(chip); +} + +static void ioapic_pre_save(void *opaque) +{ +struct kvm_ioapic_state *s = opaque; +struct kvm_irqchip *chip; + +chip = container_of(s, struct kvm_irqchip, chip.ioapic); + +kvm_get_irqchip(chip); +} + +static int ioapic_post_load(void *opaque, int version_id) +{ +struct kvm_ioapic_state *s = opaque; +struct kvm_irqchip *chip; + +chip = container_of(s, struct kvm_irqchip, chip.ioapic); + +return kvm_set_irqchip(chip); +} + +static const VMStateDescription vmstate_kvm_ioapic = { +.name = ioapic-kvm, +.version_id = 1, +.minimum_version_id = 1, +.post_load = ioapic_post_load, +.pre_save = ioapic_pre_save, +.fields = (VMStateField []) { +/* Linux does not define __u64 the same as uint64_t */ +VMSTATE_U64(base_address, struct kvm_ioapic_state), +VMSTATE_UINT32(id, struct kvm_ioapic_state), +VMSTATE_UINT32(ioregsel, struct kvm_ioapic_state), +VMSTATE_UINT32(irr, struct kvm_ioapic_state), +VMSTATE_ARRAY_UNSAFE(redirtbl, struct kvm_ioapic_state, IOAPIC_NUM_PINS, 0, vmstate_info_u64, __u64), +VMSTATE_END_OF_LIST() +} +}; + + +static void kvm_ioapic_set_irq(void *opaque, int vector, int level) +{ +/* +int pic_ret; + +if (kvm_set_irq(vector, level, pic_ret)) { +if (pic_ret != 0) +apic_set_irq_delivered(); +return; +} +*/ +} + +qemu_irq *kvm_ioapic_init(void) +{ +struct kvm_irqchip *s; + +s = qemu_mallocz(sizeof(*s)); + +vmstate_register(0, vmstate_kvm_ioapic, s-chip.ioapic); +qemu_register_reset(ioapic_reset, s-chip.ioapic); + +return qemu_allocate_irqs(kvm_ioapic_set_irq, s-chip.ioapic, IOAPIC_NUM_PINS); +} diff --git a/hw/pc.c b/hw/pc.c index 97964b2..7bff913 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -44,6 +44,7 @@ #include ide.h #include loader.h #include elf.h +#include kvm.h /* output Bochs bios info messages */ //#define DEBUG_BIOS @@ -1144,7 +1145,11 @@ static void pc_init1(ram_addr_t ram_size, register_ioport_write(0x92, 1, 1, ioport92_write, NULL); if (pci_enabled) { -isa_irq_state-ioapic = ioapic_init(); +if (kvm_enabled() kvm_irqchip_in_kernel()) { +isa_irq_state-ioapic = kvm_ioapic_init(); +} else { +isa_irq_state-ioapic = ioapic_init(); +} } pit = pit_init(0x40, isa_reserve_irq(0)); pcspk_init(pit); diff --git a/hw/pc.h b/hw/pc.h index 03ffc91..a3ad931 100644 --- a/hw/pc.h +++ b/hw/pc.h @@ -49,6 +49,8 @@ void ioapic_set_irq(void *opaque, int vector, int level); void apic_reset_irq_delivered(void); int apic_get_irq_delivered(void); +qemu_irq *kvm_ioapic_init(void); + /* i8254.c */ #define PIT_FREQ 1193182 diff --git a/kvm-all.c b/kvm-all.c index b605caa..35230c1 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -393,6 +393,26 @@ int kvm_check_extension(KVMState *s, unsigned int extension) return ret; } +#ifdef KVM_CAP_IRQCHIP +int kvm_set_irqchip(struct kvm_irqchip *chip) +{ +if (!kvm_state-irqchip_in_kernel) { +return 0; +} + +return kvm_vm_ioctl(kvm_state, KVM_SET_IRQCHIP, chip); +} + +int kvm_get_irqchip(struct kvm_irqchip *chip) +{ +if (!kvm_state-irqchip_in_kernel) { +return 0; +} + +return
[Qemu-devel] [PATCH 3/9] provide apic_set_irq_delivered
i8259 chip will use it, so provide it, and export it through pc.h Signed-off-by: Glauber Costa glom...@redhat.com --- hw/apic.c |5 + hw/pc.h |1 + 2 files changed, 6 insertions(+), 0 deletions(-) diff --git a/hw/apic.c b/hw/apic.c index 87e7dc0..482bb1e 100644 --- a/hw/apic.c +++ b/hw/apic.c @@ -388,6 +388,11 @@ void apic_reset_irq_delivered(void) apic_irq_delivered = 0; } +void apic_set_irq_delivered(void) +{ +apic_irq_delivered = 1; +} + int apic_get_irq_delivered(void) { return apic_irq_delivered; diff --git a/hw/pc.h b/hw/pc.h index a3ad931..4c9b4c3 100644 --- a/hw/pc.h +++ b/hw/pc.h @@ -47,6 +47,7 @@ int apic_get_interrupt(CPUState *env); qemu_irq *ioapic_init(void); void ioapic_set_irq(void *opaque, int vector, int level); void apic_reset_irq_delivered(void); +void apic_set_irq_delivered(void); int apic_get_irq_delivered(void); qemu_irq *kvm_ioapic_init(void); -- 1.6.5.2
[Qemu-devel] [PATCH 4/9] provide i8259-kvm
This patch provides the file i8259-kvm.c, which implements a schim over the kvm in-kernel PIC. Signed-off-by: Glauber Costa glom...@redhat.com --- Makefile.target |2 +- hw/i8259-kvm.c| 112 + hw/pc.c |8 +++- hw/pc.h |1 + kvm-all.c |7 ++- kvm.h |4 ++ target-i386/kvm.c | 26 target-ppc/kvm.c |5 ++ 8 files changed, 161 insertions(+), 4 deletions(-) create mode 100644 hw/i8259-kvm.c diff --git a/Makefile.target b/Makefile.target index 844ba46..8fb68a4 100644 --- a/Makefile.target +++ b/Makefile.target @@ -199,7 +199,7 @@ obj-i386-y += usb-uhci.o vmmouse.o vmport.o vmware_vga.o hpet.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += ne2000-isa.o -obj-i386-$(CONFIG_KVM) += ioapic-kvm.o +obj-i386-$(CONFIG_KVM) += ioapic-kvm.o i8259-kvm.o # shared objects obj-ppc-y = ppc.o ide/core.o ide/qdev.o ide/isa.o ide/pci.o ide/macio.o diff --git a/hw/i8259-kvm.c b/hw/i8259-kvm.c new file mode 100644 index 000..bd6387b --- /dev/null +++ b/hw/i8259-kvm.c @@ -0,0 +1,112 @@ +#include hw.h +#include pc.h +#include isa.h +#include monitor.h +#include qemu-timer.h +#include kvm.h + +static void kvm_i8259_set_irq(void *opaque, int irq, int level) +{ +int pic_ret; + +if (kvm_set_irq(irq, level, pic_ret)) { +if (pic_ret != 0) +/* In theory, we should not be using any apic state, but we need + * to warn devices such as the rtc about state of delivery. Since this + * one is just a marker, it is no big deal */ +apic_set_irq_delivered(); +return; +} +} + +static void kvm_pic_reset(void *opaque) +{ +struct kvm_pic_state *s = opaque; +struct kvm_irqchip *chip; + +s-last_irr = 0; +s-irr = 0; +s-imr = 0; +s-isr = 0; +s-priority_add = 0; +s-irq_base = 0; +s-read_reg_select = 0; +s-poll = 0; +s-special_mask = 0; +s-init_state = 0; +s-auto_eoi = 0; +s-rotate_on_auto_eoi = 0; +s-special_fully_nested_mode = 0; +s-init4 = 0; + +chip = container_of(s, struct kvm_irqchip, chip.pic); +kvm_set_irqchip(chip); +} + +static void pic_pre_save(void *opaque) +{ +struct kvm_pic_state *s = opaque; +struct kvm_irqchip *chip; + +chip = container_of(s, struct kvm_irqchip, chip.pic); + +kvm_get_irqchip(chip); +} + +static int pic_post_load(void *opaque, int version_id) +{ +struct kvm_pic_state *s = opaque; +struct kvm_irqchip *chip; + +chip = container_of(s, struct kvm_irqchip, chip.pic); + +return kvm_set_irqchip(chip); +} + +static const VMStateDescription vmstate_kvm_pic = { +.name = i8259-kvm, +.version_id = 1, +.pre_save = pic_pre_save, +.post_load = pic_post_load, +.minimum_version_id = 1, +.fields = (VMStateField []) { +VMSTATE_UINT8(last_irr, struct kvm_pic_state), +VMSTATE_UINT8(irr, struct kvm_pic_state), +VMSTATE_UINT8(imr, struct kvm_pic_state), +VMSTATE_UINT8(isr, struct kvm_pic_state), +VMSTATE_UINT8(priority_add, struct kvm_pic_state), +VMSTATE_UINT8(irq_base, struct kvm_pic_state), +VMSTATE_UINT8(read_reg_select, struct kvm_pic_state), +VMSTATE_UINT8(poll, struct kvm_pic_state), +VMSTATE_UINT8(special_mask, struct kvm_pic_state), +VMSTATE_UINT8(init_state, struct kvm_pic_state), +VMSTATE_UINT8(auto_eoi, struct kvm_pic_state), +VMSTATE_UINT8(rotate_on_auto_eoi, struct kvm_pic_state), +VMSTATE_UINT8(special_fully_nested_mode, struct kvm_pic_state), +VMSTATE_UINT8(init4, struct kvm_pic_state), +VMSTATE_UINT8(elcr, struct kvm_pic_state), +VMSTATE_END_OF_LIST() +} +}; + +static void kvm_pic_init1(int io_addr, struct kvm_pic_state *s) +{ +vmstate_register(io_addr, vmstate_kvm_pic, s); +qemu_register_reset(kvm_pic_reset, s); +} + +qemu_irq *kvm_i8259_init(qemu_irq parent_irq) +{ +struct kvm_irqchip *master, *slave; + +master = qemu_mallocz(sizeof(*master)); +slave = qemu_mallocz(sizeof(*slave)); + +master-chip_id = KVM_IRQCHIP_PIC_MASTER; +slave-chip_id = KVM_IRQCHIP_PIC_SLAVE; + +kvm_pic_init1(0x20, master-chip.pic); +kvm_pic_init1(0xa0, slave-chip.pic); + +return qemu_allocate_irqs(kvm_i8259_set_irq, master, 16); +} diff --git a/hw/pc.c b/hw/pc.c index 7bff913..a58562a 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -1098,8 +1098,14 @@ static void pc_init1(ram_addr_t ram_size, } cpu_irq = qemu_allocate_irqs(pic_irq_request, NULL, 1); -i8259 = i8259_init(cpu_irq[0]); isa_irq_state = qemu_mallocz(sizeof(*isa_irq_state)); + +if (kvm_enabled() kvm_irqchip_in_kernel()) { +i8259 = kvm_i8259_init(cpu_irq[0]); +} else { +i8259 = i8259_init(cpu_irq[0]); +} + isa_irq_state-i8259 = i8259; isa_irq = qemu_allocate_irqs(isa_irq_handler, isa_irq_state
[Qemu-devel] [PATCH 5/9] Don't call apic functions directly from kvm code
It is actually not necessary to call a tpr function to save and load cr8, as cr8 is part of the processor state, and thus, it is much easier to just add it to CPUState. As for apic base, wrap kvm usages, so we can call either the qemu device, or the in kernel version. Signed-off-by: Glauber Costa glom...@redhat.com --- target-i386/cpu.h |1 + target-i386/kvm.c | 25 +++-- 2 files changed, 20 insertions(+), 6 deletions(-) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index eb9532a..7c4fa47 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -606,6 +606,7 @@ typedef struct CPUX86State { SegmentCache idt; /* only base and limit are used */ target_ulong cr[5]; /* NOTE: cr1 is unused */ +target_ulong cr8; int32_t a20_mask; /* FPU state */ diff --git a/target-i386/kvm.c b/target-i386/kvm.c index cf42f85..78fc941 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -344,6 +344,19 @@ static void get_seg(SegmentCache *lhs, const struct kvm_segment *rhs) | (rhs-avl * DESC_AVL_MASK); } +static void kvm_set_apic_base(CPUState *env, uint64_t val) +{ +if (!kvm_irqchip_in_kernel()) +cpu_set_apic_base(env, val); +} + +static uint64_t kvm_get_apic_base(CPUState *env) +{ +if (!kvm_irqchip_in_kernel()) +return cpu_get_apic_base(env); +return 0; +} + static void kvm_getput_reg(__u64 *kvm_reg, target_ulong *qemu_reg, int set) { if (set) @@ -455,8 +468,8 @@ static int kvm_put_sregs(CPUState *env) sregs.cr3 = env-cr[3]; sregs.cr4 = env-cr[4]; -sregs.cr8 = cpu_get_apic_tpr(env); -sregs.apic_base = cpu_get_apic_base(env); +sregs.cr8 = env-cr8; +sregs.apic_base = kvm_get_apic_base(env); sregs.efer = env-efer; @@ -561,7 +574,7 @@ static int kvm_get_sregs(CPUState *env) env-cr[3] = sregs.cr3; env-cr[4] = sregs.cr4; -cpu_set_apic_base(env, sregs.apic_base); +kvm_set_apic_base(env, sregs.apic_base); env-efer = sregs.efer; //cpu_set_apic_tpr(env, sregs.cr8); @@ -777,7 +790,7 @@ int kvm_arch_pre_run(CPUState *env, struct kvm_run *run) run-request_interrupt_window = 0; dprintf(setting tpr\n); -run-cr8 = cpu_get_apic_tpr(env); +run-cr8 = env-cr8; return 0; } @@ -789,8 +802,8 @@ int kvm_arch_post_run(CPUState *env, struct kvm_run *run) else env-eflags = ~IF_MASK; -cpu_set_apic_tpr(env, run-cr8); -cpu_set_apic_base(env, run-apic_base); +env-cr8 = run-cr8; +kvm_set_apic_base(env, run-apic_base); return 0; } -- 1.6.5.2
[Qemu-devel] [PATCH 7/9] provide apic-kvm
This patch provides the file apic-kvm.c, which implements a schim over the kvm in-kernel APIC. Signed-off-by: Glauber Costa glom...@redhat.com --- Makefile.target |2 +- hw/apic-kvm.c | 157 + hw/pc.c |6 ++- hw/pc.h |2 + kvm.h |5 ++ target-i386/cpu.h |4 ++ target-i386/kvm.c | 19 ++- 7 files changed, 192 insertions(+), 3 deletions(-) create mode 100644 hw/apic-kvm.c diff --git a/Makefile.target b/Makefile.target index 8fb68a4..c23aeb3 100644 --- a/Makefile.target +++ b/Makefile.target @@ -199,7 +199,7 @@ obj-i386-y += usb-uhci.o vmmouse.o vmport.o vmware_vga.o hpet.o obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o obj-i386-y += ne2000-isa.o -obj-i386-$(CONFIG_KVM) += ioapic-kvm.o i8259-kvm.o +obj-i386-$(CONFIG_KVM) += ioapic-kvm.o i8259-kvm.o apic-kvm.o # shared objects obj-ppc-y = ppc.o ide/core.o ide/qdev.o ide/isa.o ide/pci.o ide/macio.o diff --git a/hw/apic-kvm.c b/hw/apic-kvm.c new file mode 100644 index 000..089fa45 --- /dev/null +++ b/hw/apic-kvm.c @@ -0,0 +1,157 @@ +#include hw.h +#include pc.h +#include pci.h +#include msix.h +#include qemu-timer.h +#include host-utils.h +#include kvm.h + +#define APIC_LVT_NB 6 +#define APIC_LVT_LINT0 3 + +struct qemu_lapic_state { +uint32_t apicbase; +uint8_t id; +uint8_t arb_id; +uint8_t tpr; +uint32_t spurious_vec; +uint8_t log_dest; +uint8_t dest_mode; +uint32_t isr[8]; /* in service register */ +uint32_t tmr[8]; /* trigger mode register */ +uint32_t irr[8]; /* interrupt request register */ +uint32_t lvt[APIC_LVT_NB]; +uint32_t esr; /* error register */ +uint32_t icr[2]; + +uint32_t divide_conf; +int count_shift; +uint32_t initial_count; +int64_t initial_count_load_time, next_time; +uint32_t idx; +int sipi_vector; +int wait_for_sipi; +}; + +typedef struct APICState { +CPUState *cpu_env; + +/* KVM lapic structure is just a big array of regs. But it is what kvm + * functions expect. So have both the fields separated, for easy access, + * and the kvm stucture, for ioctls communications */ +union { +struct qemu_lapic_state dev; +struct kvm_lapic_state kvm_lapic_state; +}; +} APICState; + +void kvm_apic_set_base(CPUState *env, uint64_t val) +{ +APICState *s = env-apic_state; + +if (!s) +return; + +s-dev.apicbase = val; +} + +uint64_t kvm_apic_get_base(CPUState *env) +{ +APICState *s = env-apic_state; + +return s ? s-dev.apicbase : 0; +} + +static void apic_pre_save(void *opaque) +{ +APICState *s = opaque; + +kvm_get_lapic(s-cpu_env, s-kvm_lapic_state); +} + +static int apic_post_load(void *opaque, int version_id) +{ +APICState *s = opaque; + +return kvm_set_lapic(s-cpu_env, s-kvm_lapic_state); +} + +static const VMStateDescription vmstate_apic = { +.name = apic-kvm, +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_BUFFER_UNSAFE(kvm_lapic_state.regs, APICState, 1, KVM_APIC_REG_SIZE), +VMSTATE_END_OF_LIST() +}, +.pre_save = apic_pre_save, +.post_load = apic_post_load, +}; + +static void kvm_apic_reset(void *opaque) +{ +APICState *s = opaque; +int bsp; +int i; + +cpu_synchronize_state(s-cpu_env); + +bsp = cpu_is_bsp(s-cpu_env); + +s-dev.apicbase = 0xfee0 | +(bsp ? MSR_IA32_APICBASE_BSP : 0) | MSR_IA32_APICBASE_ENABLE; + +cpu_reset(s-cpu_env); + +s-dev.tpr = 0; +s-dev.spurious_vec = 0xff; +s-dev.log_dest = 0; +s-dev.dest_mode = 0xf; +memset(s-dev.isr, 0, sizeof(s-dev.isr)); +memset(s-dev.tmr, 0, sizeof(s-dev.tmr)); +memset(s-dev.irr, 0, sizeof(s-dev.irr)); +for(i = 0; i APIC_LVT_NB; i++) +s-dev.lvt[i] = 1 16; /* mask LVT */ +s-dev.esr = 0; +memset(s-dev.icr, 0, sizeof(s-dev.icr)); +s-dev.divide_conf = 0; +s-dev.count_shift = 0; +s-dev.initial_count = 0; +s-dev.initial_count_load_time = 0; +s-dev.next_time = 0; +s-dev.wait_for_sipi = 1; + +s-cpu_env-halted = !(s-dev.apicbase MSR_IA32_APICBASE_BSP); + +s-cpu_env-mp_state += s-cpu_env-halted ? KVM_MP_STATE_UNINITIALIZED : KVM_MP_STATE_RUNNABLE; + +kvm_put_mp_state(s-cpu_env); + +if (bsp) { +/* + * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization + * time typically by BIOS, so PIC interrupt can be delivered to the + * processor when local APIC is enabled. + */ +s-dev.lvt[APIC_LVT_LINT0] = 0x700; +} +kvm_set_lapic(s-cpu_env, s-kvm_lapic_state); +} + +int kvm_apic_init(CPUState *env) +{ +APICState *s; + +s = qemu_mallocz(sizeof(*s)); +env-apic_state = s; +s-cpu_env = env; + +msix_supported = 1; + +vmstate_register(s-dev.id, vmstate_apic, s
[Qemu-devel] [PATCH 6/9] export kvm_put_mp_state
We'll use it from inside the in-kernel apic chip, so get it into a header file. Signed-off-by: Glauber Costa glom...@redhat.com --- target-i386/cpu.h |4 target-i386/kvm.c |2 +- 2 files changed, 5 insertions(+), 1 deletions(-) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 7c4fa47..1ebf80e 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -921,4 +921,8 @@ void apic_init_reset(CPUState *env); void apic_sipi(CPUState *env); void do_cpu_init(CPUState *env); void do_cpu_sipi(CPUState *env); + +/* KVM hooks */ +int kvm_put_mp_state(CPUState *env); + #endif /* CPU_I386_H */ diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 78fc941..4eb61f1 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -687,7 +687,7 @@ static int kvm_get_msrs(CPUState *env) return 0; } -static int kvm_put_mp_state(CPUState *env) +int kvm_put_mp_state(CPUState *env) { struct kvm_mp_state mp_state = { .mp_state = env-mp_state }; -- 1.6.5.2
[Qemu-devel] [PATCH 8/9] Initialize in-kernel irqchip
Now that we have all devices set up, this patch initializes the irqchip. This is dependant on the io-thread, since we need someone to pull ourselves out of the halted state. I believe this should be the default when we are running over the io-thread. Later on, I plan to post a patch that makes it optional. Signed-off-by: Glauber Costa glom...@redhat.com --- kvm-all.c | 20 1 files changed, 20 insertions(+), 0 deletions(-) diff --git a/kvm-all.c b/kvm-all.c index e46093f..0c6aba7 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -416,6 +416,22 @@ int kvm_set_irq(int irq, int level, int *status) return kvm_arch_set_irq(kvm_state, irq, level, status); } +static int kvm_create_irqchip(KVMState *s) +{ +int ret = 0; +#if defined(CONFIG_IOTHREAD) +if (!kvm_check_extension(s, KVM_CAP_IRQCHIP)) +return -1; + +ret = kvm_vm_ioctl(s, KVM_CREATE_IRQCHIP); +if (ret 0) +return ret; + +s-irqchip_in_kernel = 1; +#endif +return ret; +} + int kvm_init(int smp_cpus) { static const char upgrade_note[] = @@ -502,6 +518,10 @@ int kvm_init(int smp_cpus) } #endif +ret = kvm_create_irqchip(s); +if (ret 0) +goto err; + ret = kvm_arch_init(s, smp_cpus); if (ret 0) goto err; -- 1.6.5.2
[Qemu-devel] [PATCH 9/9] Do GSI routing
To follow correctly what our bios ACPI tables say, we have to be able to program our irqchips with GSI routing mappings. This support is already in qemu-kvm Signed-off-by: Glauber Costa glom...@redhat.com --- kvm-all.c |6 ++- kvm.h |3 + target-i386/kvm.c | 118 + 3 files changed, 125 insertions(+), 2 deletions(-) diff --git a/kvm-all.c b/kvm-all.c index 0c6aba7..a5739c4 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -66,6 +66,8 @@ struct KVMState #endif int irqchip_in_kernel; int pit_in_kernel; + +KVMArchState *arch_state; }; static KVMState *kvm_state; @@ -522,12 +524,12 @@ int kvm_init(int smp_cpus) if (ret 0) goto err; +kvm_state = s; + ret = kvm_arch_init(s, smp_cpus); if (ret 0) goto err; -kvm_state = s; - return 0; err: diff --git a/kvm.h b/kvm.h index cb896e3..15fb34a 100644 --- a/kvm.h +++ b/kvm.h @@ -72,7 +72,10 @@ int kvm_set_irq(int irq, int level, int *status); /* internal API */ struct KVMState; +struct KVMArchState; + typedef struct KVMState KVMState; +typedef struct KVMArchState KVMArchState; int kvm_ioctl(KVMState *s, int type, ...); diff --git a/target-i386/kvm.c b/target-i386/kvm.c index a908819..7e1ce15 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -37,6 +37,16 @@ #ifdef KVM_CAP_EXT_CPUID +struct KVMArchState +{ +struct kvm_irq_routing *irq_routes; +int nr_allocated_irq_routes; +void *used_gsi_bitmap; +int max_gsi; +}; + +static KVMArchState *kvm_arch_state; + static struct kvm_cpuid2 *try_get_cpuid(KVMState *s, int max) { struct kvm_cpuid2 *cpuid; @@ -272,10 +282,118 @@ static int kvm_has_msr_star(CPUState *env) return 0; } +/* + * Setup x86 specific IRQ routing + */ +static inline void set_gsi(KVMArchState *s, unsigned int gsi) +{ +uint32_t *bitmap = s-used_gsi_bitmap; + +if (gsi s-max_gsi) +bitmap[gsi / 32] |= 1U (gsi % 32); +else +fprintf(stderr, Invalid GSI %d\n, gsi); +} + +static int kvm_add_routing_entry(KVMArchState *s, struct kvm_irq_routing_entry *entry) +{ +struct kvm_irq_routing *z; +struct kvm_irq_routing_entry *new; +int n, size; + +if (s-irq_routes-nr == s-nr_allocated_irq_routes) { +n = s-nr_allocated_irq_routes * 2; +if (n 64) +n = 64; +size = sizeof(struct kvm_irq_routing); +size += n * sizeof(*new); +z = realloc(s-irq_routes, size); +if (!z) +return -ENOMEM; +s-nr_allocated_irq_routes = n; +s-irq_routes = z; +} +n = s-irq_routes-nr++; +new = s-irq_routes-entries[n]; +memset(new, 0, sizeof(*new)); +new-gsi = entry-gsi; +new-type = entry-type; +new-flags = entry-flags; +new-u = entry-u; + +set_gsi(s, entry-gsi); + +return 0; +} + +static int kvm_add_irq_route(KVMArchState *s, int gsi, int irqchip, int pin) +{ +struct kvm_irq_routing_entry e; + +e.gsi = gsi; +e.type = KVM_IRQ_ROUTING_IRQCHIP; +e.flags = 0; +e.u.irqchip.irqchip = irqchip; +e.u.irqchip.pin = pin; +return kvm_add_routing_entry(s, e); +} + +static int kvm_init_irq_routing(KVMState *s) +{ +int i, r; +int gsi_count, gsi_bits; + +gsi_count = kvm_check_extension(s, KVM_CAP_IRQ_ROUTING); +if (!kvm_irqchip_in_kernel() (gsi_count 0)) { +return 0; +} + +/* Round up so we can search ints using ffs */ +gsi_bits = ((gsi_count - 31) ~31); +kvm_arch_state-used_gsi_bitmap = qemu_mallocz(gsi_bits / 8); +kvm_arch_state-max_gsi = gsi_bits; + +/* Mark any over-allocated bits as already in use */ +for (i = gsi_count; i gsi_bits; i++) { +set_gsi(kvm_arch_state, i); +} + +kvm_arch_state-irq_routes-nr = 0; + +for (i = 0; i 8; ++i) { +if (i == 2) +continue; +r = kvm_add_irq_route(kvm_arch_state, i, KVM_IRQCHIP_PIC_MASTER, i); +if (r 0) +return r; +} +for (i = 8; i 16; ++i) { +r = kvm_add_irq_route(kvm_arch_state, i, KVM_IRQCHIP_PIC_SLAVE, i - 8); +if (r 0) +return r; +} +for (i = 0; i 24; ++i) { +if (i == 0) { +r = kvm_add_irq_route(kvm_arch_state, i, KVM_IRQCHIP_IOAPIC, 2); +} else if (i != 2) { +r = kvm_add_irq_route(kvm_arch_state, i, KVM_IRQCHIP_IOAPIC, i); +} +if (r 0) +return r; +} + +kvm_arch_state-irq_routes-flags = 0; +return kvm_vm_ioctl(s, KVM_SET_GSI_ROUTING, kvm_arch_state-irq_routes); +} int kvm_arch_init(KVMState *s, int smp_cpus) { int ret; +kvm_arch_state = qemu_mallocz(sizeof(*kvm_arch_state)); +kvm_arch_state-irq_routes = qemu_mallocz(sizeof(*kvm_arch_state-irq_routes)); + +kvm_init_irq_routing(s); + /* create vm86 tss. KVM uses vm86 mode to emulate 16-bit code * directly. In order to use vm86 mode, a TSS is needed
[Qemu-devel] [PATCH 00/11] SMP support in qemu.git
Avi/Marcelo Please include this in a branch in qemu-kvm for future inclusion in qemu.git. It is the same material people have already commented on in the ML. Glauber Costa (11): Don't mess with halted state. store thread-specific env information update halted state on mp_state sync qemu_flush_work for remote vcpu execution tell kernel about all registers instead of just mp_state flush state in migration post_load Don't call kvm cpu reset on initialization use cpu_kick instead of direct signalling. Use per-cpu reset handlers. Use __thread where available. remove smp restriction from kvm configure | 17 cpu-all.h |3 + cpu-defs.h| 16 exec-all.h| 12 ++ exec.c| 32 hw/apic-kvm.c | 26 +++- kvm-all.c | 49 +--- kvm.h | 10 + target-i386/kvm.c | 12 ++ target-ppc/kvm.c |5 ++ vl.c | 108 ++-- 11 files changed, 244 insertions(+), 46 deletions(-)
[Qemu-devel] [PATCH 02/11] store thread-specific env information
Since we'll have multiple cpu threads, at least for kvm, we need a way to store and retrieve the CPUState associated with the current execution thread. For the I/O thread, this will be NULL. I am using pthread functions for that, for portability, but we could as well use __thread keyword. Signed-off-by: Glauber Costa glom...@redhat.com --- vl.c | 21 + 1 files changed, 21 insertions(+), 0 deletions(-) diff --git a/vl.c b/vl.c index 44763af..321b18d 100644 --- a/vl.c +++ b/vl.c @@ -3434,6 +3434,24 @@ static void block_io_signals(void); static void unblock_io_signals(void); static int tcg_has_work(void); +static pthread_key_t current_env; + +CPUState *qemu_get_current_env(void); +CPUState *qemu_get_current_env(void) +{ +return pthread_getspecific(current_env); +} + +static void qemu_set_current_env(CPUState *env) +{ +pthread_setspecific(current_env, env); +} + +static void qemu_init_current_env(void) +{ +pthread_key_create(current_env, NULL); +} + static int qemu_init_main_loop(void) { int ret; @@ -3446,6 +3464,7 @@ static int qemu_init_main_loop(void) qemu_mutex_init(qemu_fair_mutex); qemu_mutex_init(qemu_global_mutex); qemu_mutex_lock(qemu_global_mutex); +qemu_init_current_env(); unblock_io_signals(); qemu_thread_self(io_thread); @@ -3484,6 +3503,8 @@ static void *kvm_cpu_thread_fn(void *arg) block_io_signals(); qemu_thread_self(env-thread); +qemu_set_current_env(env); + if (kvm_enabled()) kvm_init_vcpu(env); -- 1.6.5.2
[Qemu-devel] [PATCH 01/11] Don't mess with halted state.
When we have irqchip in kernel, halted state is kernel business. So don't initialize it in our code. Signed-off-by: Glauber Costa glom...@redhat.com --- hw/apic-kvm.c |3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/hw/apic-kvm.c b/hw/apic-kvm.c index 089fa45..e5a0bfc 100644 --- a/hw/apic-kvm.c +++ b/hw/apic-kvm.c @@ -122,10 +122,9 @@ static void kvm_apic_reset(void *opaque) s-dev.next_time = 0; s-dev.wait_for_sipi = 1; -s-cpu_env-halted = !(s-dev.apicbase MSR_IA32_APICBASE_BSP); s-cpu_env-mp_state -= s-cpu_env-halted ? KVM_MP_STATE_UNINITIALIZED : KVM_MP_STATE_RUNNABLE; += bsp ? KVM_MP_STATE_RUNNABLE : KVM_MP_STATE_UNINITIALIZED; kvm_put_mp_state(s-cpu_env); -- 1.6.5.2
[Qemu-devel] [PATCH 04/11] qemu_flush_work for remote vcpu execution
This function is similar to qemu-kvm's on_vcpu mechanism. Totally synchronous, and guarantees that a given function will be executed at the specified vcpu. This patch also convert usage within the breakpoints system Signed-off-by: Glauber Costa glom...@redhat.com --- cpu-all.h |3 ++ cpu-defs.h | 14 + kvm-all.c | 17 +-- vl.c | 61 +++ 4 files changed, 75 insertions(+), 20 deletions(-) diff --git a/cpu-all.h b/cpu-all.h index e214374..8270d43 100644 --- a/cpu-all.h +++ b/cpu-all.h @@ -763,6 +763,9 @@ extern CPUState *cpu_single_env; extern int64_t qemu_icount; extern int use_icount; +void qemu_queue_work(CPUState *env, void (*func)(void *data), void *data); +void qemu_flush_work(CPUState *env); + #define CPU_INTERRUPT_HARD 0x02 /* hardware interrupt pending */ #define CPU_INTERRUPT_EXITTB 0x04 /* exit the current TB (use for x86 a20 case) */ #define CPU_INTERRUPT_TIMER 0x08 /* internal timer exception pending */ diff --git a/cpu-defs.h b/cpu-defs.h index 95068b5..18792fc 100644 --- a/cpu-defs.h +++ b/cpu-defs.h @@ -31,6 +31,8 @@ #include qemu-queue.h #include targphys.h +#include qemu-thread.h + #ifndef TARGET_LONG_BITS #error TARGET_LONG_BITS must be defined before including this header #endif @@ -134,6 +136,13 @@ typedef struct CPUWatchpoint { QTAILQ_ENTRY(CPUWatchpoint) entry; } CPUWatchpoint; +typedef struct QemuWorkItem { +void (*func)(void *data); +void *data; +int done; +} QemuWorkItem; + + #define CPU_TEMP_BUF_NLONGS 128 #define CPU_COMMON \ struct TranslationBlock *current_tb; /* currently executing TB */ \ @@ -175,6 +184,10 @@ typedef struct CPUWatchpoint { QTAILQ_HEAD(watchpoints_head, CPUWatchpoint) watchpoints;\ CPUWatchpoint *watchpoint_hit; \ \ +QemuWorkItem queued_work; \ +uint64_t queued_local, queued_total;\ +struct QemuMutex queue_lock;\ +\ struct GDBRegisterState *gdb_regs; \ \ /* Core interrupt code */ \ @@ -194,6 +207,7 @@ typedef struct CPUWatchpoint { uint32_t created; \ struct QemuThread *thread; \ struct QemuCond *halt_cond; \ +struct QemuCond work_cond; \ const char *cpu_model_str; \ struct KVMState *kvm_state; \ struct kvm_run *kvm_run;\ diff --git a/kvm-all.c b/kvm-all.c index a5739c4..f7d89c6 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -617,7 +617,7 @@ int kvm_cpu_exec(CPUState *env) struct kvm_run *run = env-kvm_run; int ret; -dprintf(kvm_cpu_exec()\n); +dprintf(kvm_cpu_exec() %d\n, env-cpu_index); do { if (env-exit_request) { @@ -932,19 +932,6 @@ void kvm_setup_guest_memory(void *start, size_t size) } #ifdef KVM_CAP_SET_GUEST_DEBUG -static void on_vcpu(CPUState *env, void (*func)(void *data), void *data) -{ -#ifdef CONFIG_IOTHREAD -if (env == cpu_single_env) { -func(data); -return; -} -abort(); -#else -func(data); -#endif -} - struct kvm_sw_breakpoint *kvm_find_sw_breakpoint(CPUState *env, target_ulong pc) { @@ -992,7 +979,7 @@ int kvm_update_guest_debug(CPUState *env, unsigned long reinject_trap) data.dbg.control |= reinject_trap; data.env = env; -on_vcpu(env, kvm_invoke_set_guest_debug, data); +qemu_queue_work(env, kvm_invoke_set_guest_debug, data); return data.err; } diff --git a/vl.c b/vl.c index 321b18d..c7b46a9 100644 --- a/vl.c +++ b/vl.c @@ -3403,6 +3403,11 @@ void qemu_notify_event(void) } } +void qemu_queue_work(CPUState *env, void (*func)(void *data), void *data) +{ +func(data); +} + void qemu_mutex_lock_iothread(void) {} void qemu_mutex_unlock_iothread(void) {} @@ -3436,8 +3441,7 @@ static int tcg_has_work(void); static pthread_key_t current_env; -CPUState *qemu_get_current_env(void); -CPUState *qemu_get_current_env(void) +static CPUState *qemu_get_current_env(void) { return pthread_getspecific(current_env); } @@ -3474,8 +3478,10 @@ static int qemu_init_main_loop(void) static void qemu_wait_io_event(CPUState *env) { -while (!tcg_has_work()) +while (!tcg_has_work
[Qemu-devel] [PATCH 03/11] update halted state on mp_state sync
If we are using in-kernel irqchip, halted state belongs in the kernel. So everytime we grab kernel's idea of mpstate, we also need to propagate halted state to userspace. Signed-off-by: Glauber Costa glom...@redhat.com --- target-i386/kvm.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 7e1ce15..65ad2a1 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -825,6 +825,11 @@ static int kvm_get_mp_state(CPUState *env) return ret; } env-mp_state = mp_state.mp_state; + +if (kvm_irqchip_in_kernel()) { +env-halted = (env-mp_state == KVM_MP_STATE_HALTED); +} + return 0; } -- 1.6.5.2