Re: [PATCH 08/18] virtio_ring: support for used_event idx feature
On Wed, 4 May 2011 23:51:38 +0300, Michael S. Tsirkin m...@redhat.com wrote:

Add support for the used_event idx feature: when enabling interrupts, publish the current avail index value to the host so that we get interrupts on the next update.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/virtio/virtio_ring.c | 14 ++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 507d6eb..3a3ed75 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -320,6 +320,14 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 	ret = vq->data[i];
 	detach_buf(vq, i);
 	vq->last_used_idx++;
+	/* If we expect an interrupt for the next entry, tell host
+	 * by writing event index and flush out the write before
+	 * the read in the next get_buf call. */
+	if (!(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)) {
+		vring_used_event(&vq->vring) = vq->last_used_idx;
+		virtio_mb();
+	}
+

Hmm, so you're still using the avail->flags; it's just if thresholding is enabled the host will ignore it? It's a little subtle, but it keeps this patch small. Perhaps we'll want to make it more explicit later.

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
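For context, the host-side check that pairs with this publish-and-flush sequence is the standard event-index comparison (sketched below after the helper later standardized as vring_need_event(); the wrap-safe unsigned arithmetic is the interesting part):

```c
#include <stdint.h>

/* Device-side test for the used_event mechanism: signal the driver only
 * if the index it published (event_idx) lies in the range of entries
 * consumed since the last signal (old_idx, new_idx].  Casting through
 * uint16_t makes the comparison safe across index wrap-around. */
static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx,
                                   uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

This is why Rusty's observation holds: once thresholding is in effect, the host consults the published event index instead of the avail flags.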
Re: [PATCH 14/18] virtio: add api for delayed callbacks
On Wed, 4 May 2011 23:52:33 +0300, Michael S. Tsirkin m...@redhat.com wrote:

Add an API that tells the other side that callbacks should be delayed until a lot of work has been done. Implement using the new used_event feature.

Since you're going to add a capacity query anyway, why not add the threshold argument here? Then the caller can choose how much space it needs. Maybe net and block will want different things...

Cheers,
Rusty.
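The threshold idea Rusty suggests amounts to publishing a used_event several entries ahead of the current position, so the host only interrupts once a batch has completed. A minimal sketch of that arithmetic (hypothetical helper name, not the API that was merged):

```c
#include <stdint.h>

/* Compute the used index at which the host should next interrupt us:
 * 'bufs' entries past the last index we consumed.  Plain uint16_t
 * addition gives the mod-2^16 wrap behaviour of ring indices for free. */
static uint16_t delayed_used_event(uint16_t last_used_idx, uint16_t bufs)
{
    return (uint16_t)(last_used_idx + bufs);
}
```

With a caller-supplied 'bufs', net and block could indeed pick different batching thresholds, which is the point of the suggestion.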
Re: [PATCH 3/3] virtio_ring: need_event api comment fix
On Thu, 5 May 2011 18:08:17 +0300, Michael S. Tsirkin m...@redhat.com wrote:

fix typo in a comment: size -> side

Reported-by: Stefan Hajnoczi stefa...@gmail.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com

I could smerge these together for you, but I *really* want benchmarks in these commit messages.

Thanks,
Rusty.

PS. Was away last week, hence the delay on this...
[PATCH v4 0/5] hpet 'driftfix': alleviate time drift with HPET periodic timers
Hi,

This is version 4 of a series of patches that I originally posted in:

http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01989.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01992.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01991.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01990.html
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69325
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69326
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69327
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69328

Changes since version 3:

in patch part 1/5 and part 4/5

- Added stub functions for 'target_reset_irq_delivered' and 'target_get_irq_delivered'. Added registration functions that are used by apic code to replace the stubs.
- Removed NULL pointer checks from update_irq().

in patch part 5/5

- A minor modification in hpet_timer_has_tick_backlog().
- Renamed the local variable 'irq_count' in hpet_timer() to 'period_count'.
- Driftfix-related fields in struct 'HPETTimer' are no longer being initialized/reset in hpet_reset(). Added the function hpet_timer_driftfix_reset() which is called when the guest
  - sets the 'CFG_ENABLE' bit (overall enable) in the General Configuration Register.
  - sets the 'TN_ENABLE' bit (timer N interrupt enable) in the Timer N Configuration and Capabilities Register.

Please review and please comment.
Regards,
Uli

Ulrich Obergfell (5):
  hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only)
  hpet 'driftfix': add driftfix property to HPETState and DeviceInfo
  hpet 'driftfix': add fields to HPETTimer and VMStateDescription
  hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only)
  hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

 hw/apic.c |    4 ++
 hw/hpet.c |  119 ++--
 hw/pc.h   |   13 +++
 vl.c      |   13 +++
 4 files changed, 145 insertions(+), 4 deletions(-)
[PATCH v4 1/5] hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only)
'target_get_irq_delivered' and 'target_reset_irq_delivered' point to functions that are called by update_irq() to detect coalesced interrupts. Initially they point to stub functions which pretend successful interrupt injection. apic code calls two registration functions to replace the stubs with apic_get_irq_delivered() and apic_reset_irq_delivered(). This change can be replaced if a generic feedback infrastructure to track coalesced IRQs for periodic, clock providing devices becomes available.

Signed-off-by: Ulrich Obergfell uober...@redhat.com
---
 hw/apic.c |    4 ++
 hw/pc.h   |   13 +++
 vl.c      |   13 +++
 3 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index a45b57f..94b1d15 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -17,6 +17,7 @@
  * License along with this library; if not, see <http://www.gnu.org/licenses/>
  */
 #include "hw.h"
+#include "pc.h"
 #include "apic.h"
 #include "ioapic.h"
 #include "qemu-timer.h"
@@ -1143,6 +1144,9 @@ static SysBusDeviceInfo apic_info = {
 
 static void apic_register_devices(void)
 {
+    register_target_get_irq_delivered(apic_get_irq_delivered);
+    register_target_reset_irq_delivered(apic_reset_irq_delivered);
+
     sysbus_register_withprop(&apic_info);
 }
 
diff --git a/hw/pc.h b/hw/pc.h
index 1291e2d..79885f4 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -7,6 +7,19 @@
 #include "fdc.h"
 #include "net.h"
 
+extern int (*target_get_irq_delivered)(void);
+extern void (*target_reset_irq_delivered)(void);
+
+static inline void register_target_get_irq_delivered(int (*func)(void))
+{
+    target_get_irq_delivered = func;
+}
+
+static inline void register_target_reset_irq_delivered(void (*func)(void))
+{
+    target_reset_irq_delivered = func;
+}
+
 /* PC-style peripherals (also used by other machines).
  */

 /* serial.c */

diff --git a/vl.c b/vl.c
index a143250..a2bbc61 100644
--- a/vl.c
+++ b/vl.c
@@ -233,6 +233,19 @@ const char *prom_envs[MAX_PROM_ENVS];
 const char *nvram = NULL;
 int boot_menu;
 
+static int target_get_irq_delivered_stub(void)
+{
+    return 1;
+}
+
+static void target_reset_irq_delivered_stub(void)
+{
+    return;
+}
+
+int (*target_get_irq_delivered)(void) = target_get_irq_delivered_stub;
+void (*target_reset_irq_delivered)(void) = target_reset_irq_delivered_stub;
+
 typedef struct FWBootEntry FWBootEntry;
 
 struct FWBootEntry {
--
1.6.2.5
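Stripped of the QEMU specifics, the hook pattern in this patch is a global function pointer initialized to a permissive stub, plus a registration call that a backend uses to install the real implementation. A compilable sketch (generic names, not the QEMU symbols):

```c
/* Stub: pretend every injected interrupt was delivered, so code built
 * without an APIC backend keeps its old behaviour. */
static int irq_delivered_stub(void)
{
    return 1;
}

/* Hook, pointing at the stub until a backend registers itself. */
int (*get_irq_delivered)(void) = irq_delivered_stub;

/* Called by the interrupt-controller model to install the real hook. */
void register_get_irq_delivered(int (*func)(void))
{
    get_irq_delivered = func;
}
```

Because the stub answers "delivered", callers such as update_irq() need no NULL checks and behave exactly as before on targets without an APIC, which is why the v4 changelog could drop those checks.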
[PATCH v4 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription
The new fields in HPETTimer are covered by a separate VMStateDescription which is a subsection of 'vmstate_hpet_timer'. They are only migrated if

  -global hpet.driftfix=on

Signed-off-by: Ulrich Obergfell uober...@redhat.com
---
 hw/hpet.c | 33 +
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 7513065..7ab6e62 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -55,6 +55,10 @@ typedef struct HPETTimer { /* timers */
     uint8_t wrap_flag;      /* timer pop will indicate wrap for one-shot 32-bit
                              * mode. Next pop will be actual timer expiration. */
+    uint64_t prev_period;
+    uint64_t ticks_not_accounted;
+    uint32_t irq_rate;
+    uint32_t divisor;
 } HPETTimer;
 
 typedef struct HPETState {
@@ -246,6 +250,27 @@ static int hpet_post_load(void *opaque, int version_id)
     return 0;
 }
 
+static bool hpet_timer_driftfix_vmstate_needed(void *opaque)
+{
+    HPETTimer *t = opaque;
+
+    return (t->state->driftfix != 0);
+}
+
+static const VMStateDescription vmstate_hpet_timer_driftfix = {
+    .name = "hpet_timer_driftfix",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .minimum_version_id_old = 1,
+    .fields = (VMStateField []) {
+        VMSTATE_UINT64(prev_period, HPETTimer),
+        VMSTATE_UINT64(ticks_not_accounted, HPETTimer),
+        VMSTATE_UINT32(irq_rate, HPETTimer),
+        VMSTATE_UINT32(divisor, HPETTimer),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static const VMStateDescription vmstate_hpet_timer = {
     .name = "hpet_timer",
     .version_id = 1,
@@ -260,6 +285,14 @@ static const VMStateDescription vmstate_hpet_timer = {
     VMSTATE_UINT8(wrap_flag, HPETTimer),
     VMSTATE_TIMER(qemu_timer, HPETTimer),
     VMSTATE_END_OF_LIST()
+    },
+    .subsections = (VMStateSubsection []) {
+        {
+            .vmsd = &vmstate_hpet_timer_driftfix,
+            .needed = hpet_timer_driftfix_vmstate_needed,
+        }, {
+            /* empty */
+        }
     }
 };
--
1.6.2.5
[PATCH v4 4/5] hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only)
update_irq() uses a similar method as in 'rtc_td_hack' to detect coalesced interrupts. The function entry addresses are retrieved from 'target_get_irq_delivered' and 'target_reset_irq_delivered'. This change can be replaced if a generic feedback infrastructure to track coalesced IRQs for periodic, clock providing devices becomes available.

Signed-off-by: Ulrich Obergfell uober...@redhat.com
---
 hw/hpet.c | 13 +++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 7ab6e62..e57c654 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -175,11 +175,12 @@ static inline uint64_t hpet_calculate_diff(HPETTimer *t, uint64_t current)
     }
 }
 
-static void update_irq(struct HPETTimer *timer, int set)
+static int update_irq(struct HPETTimer *timer, int set)
 {
     uint64_t mask;
     HPETState *s;
     int route;
+    int irq_delivered = 1;
 
     if (timer->tn <= 1 && hpet_in_legacy_mode(timer->state)) {
         /* if LegacyReplacementRoute bit is set, HPET specification requires
@@ -204,8 +205,16 @@ static void update_irq(struct HPETTimer *timer, int set)
         qemu_irq_raise(s->irqs[route]);
     } else {
         s->isr &= ~mask;
-        qemu_irq_pulse(s->irqs[route]);
+        if (s->driftfix) {
+            target_reset_irq_delivered();
+            qemu_irq_raise(s->irqs[route]);
+            irq_delivered = target_get_irq_delivered();
+            qemu_irq_lower(s->irqs[route]);
+        } else {
+            qemu_irq_pulse(s->irqs[route]);
+        }
     }
+    return irq_delivered;
 }
 
 static void hpet_pre_save(void *opaque)
--
1.6.2.5
[PATCH v4 2/5] hpet 'driftfix': add driftfix property to HPETState and DeviceInfo
driftfix is a 'bit type' property. Compensation of delayed callbacks and coalesced interrupts can be enabled with the command line option

  -global hpet.driftfix=on

driftfix is 'off' (disabled) by default.

Signed-off-by: Ulrich Obergfell uober...@redhat.com
---
 hw/hpet.c | 3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 6ce07bc..7513065 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -72,6 +72,8 @@ typedef struct HPETState {
     uint64_t isr;           /* interrupt status reg */
     uint64_t hpet_counter;  /* main counter */
     uint8_t hpet_id;        /* instance id */
+
+    uint32_t driftfix;
 } HPETState;
 
 static uint32_t hpet_in_legacy_mode(HPETState *s)
@@ -738,6 +740,7 @@ static SysBusDeviceInfo hpet_device_info = {
     .qdev.props = (Property[]) {
         DEFINE_PROP_UINT8("timers", HPETState, num_timers, HPET_MIN_TIMERS),
         DEFINE_PROP_BIT("msi", HPETState, flags, HPET_MSI_SUPPORT, false),
+        DEFINE_PROP_BIT("driftfix", HPETState, driftfix, 0, false),
         DEFINE_PROP_END_OF_LIST(),
     },
 };
--
1.6.2.5
[PATCH v4 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts
Loss of periodic timer interrupts caused by delayed callbacks and by interrupt coalescing is compensated by gradually injecting additional interrupts during subsequent timer intervals, starting at a rate of one additional interrupt per interval.

The injection of additional interrupts is based on a backlog of unaccounted HPET clock periods (new HPETTimer field 'ticks_not_accounted'). The backlog increases due to delayed callbacks and coalesced interrupts, and it decreases if an interrupt was injected successfully. If the backlog increases while compensation is still in progress, the rate at which additional interrupts are injected is increased too. A limit is imposed on the backlog and on the rate.

Injecting additional timer interrupts to compensate lost interrupts can alleviate long term time drift. However, on a short time scale, this method can have the side effect of making virtual machine time intermittently pass slower and faster than real time (depending on the guest's time keeping algorithm). Compensation is disabled by default and can be enabled for guests where this behaviour may be acceptable.
Signed-off-by: Ulrich Obergfell uober...@redhat.com
---
 hw/hpet.c | 70 +++-
 1 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index e57c654..519fc6b 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -31,6 +31,7 @@
 #include "hpet_emul.h"
 #include "sysbus.h"
 #include "mc146818rtc.h"
+#include <assert.h>
 
 //#define HPET_DEBUG
 #ifdef HPET_DEBUG
@@ -41,6 +42,9 @@
 
 #define HPET_MSI_SUPPORT        0
 
+#define MAX_TICKS_NOT_ACCOUNTED (uint64_t)5 /* 5 sec */
+#define MAX_IRQ_RATE            (uint32_t)10
+
 struct HPETState;
 typedef struct HPETTimer { /* timers */
     uint8_t tn;             /*timer number*/
@@ -324,14 +328,35 @@ static const VMStateDescription vmstate_hpet = {
     }
 };
 
+static void hpet_timer_driftfix_reset(HPETTimer *t)
+{
+    if (t->state->driftfix && timer_is_periodic(t)) {
+        t->ticks_not_accounted = t->prev_period = t->period;
+        t->irq_rate = 1;
+        t->divisor = 1;
+    }
+}
+
+static bool hpet_timer_has_tick_backlog(HPETTimer *t)
+{
+    uint64_t backlog = 0;
+
+    if (t->ticks_not_accounted >= t->period + t->prev_period) {
+        backlog = t->ticks_not_accounted - (t->period + t->prev_period);
+    }
+    return (backlog >= t->period);
+}
+
 /*
  * timer expiration callback
  */
 static void hpet_timer(void *opaque)
 {
     HPETTimer *t = opaque;
+    HPETState *s = t->state;
     uint64_t diff;
-
+    int irq_delivered = 0;
+    uint32_t period_count = 0;
     uint64_t period = t->period;
     uint64_t cur_tick = hpet_get_ticks(t->state);
 
@@ -339,13 +364,37 @@ static void hpet_timer(void *opaque)
         if (t->config & HPET_TN_32BIT) {
             while (hpet_time_after(cur_tick, t->cmp)) {
                 t->cmp = (uint32_t)(t->cmp + t->period);
+                t->ticks_not_accounted += t->period;
+                period_count++;
             }
         } else {
             while (hpet_time_after64(cur_tick, t->cmp)) {
                 t->cmp += period;
+                t->ticks_not_accounted += period;
+                period_count++;
             }
         }
         diff = hpet_calculate_diff(t, cur_tick);
+        if (s->driftfix) {
+            if (t->ticks_not_accounted > MAX_TICKS_NOT_ACCOUNTED) {
+                t->ticks_not_accounted = t->period + t->prev_period;
+            }
+            if (hpet_timer_has_tick_backlog(t)) {
+                if (t->irq_rate == 1 || period_count > 1) {
+                    t->irq_rate++;
+                    t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE);
+                }
+                if (t->divisor == 0) {
+                    assert(period_count);
+                }
+                if (period_count) {
+                    t->divisor = t->irq_rate;
+                }
+                diff /= t->divisor--;
+            } else {
+                t->irq_rate = 1;
+            }
+        }
         qemu_mod_timer(t->qemu_timer,
                        qemu_get_clock_ns(vm_clock) + (int64_t)ticks_to_ns(diff));
     } else if (t->config & HPET_TN_32BIT && !timer_is_periodic(t)) {
@@ -356,7 +405,22 @@ static void hpet_timer(void *opaque)
             t->wrap_flag = 0;
         }
     }
-    update_irq(t, 1);
+    if (s->driftfix && timer_is_periodic(t) && period != 0) {
+        if (t->ticks_not_accounted >= t->period + t->prev_period) {
+            irq_delivered = update_irq(t, 1);
+            if (irq_delivered) {
+                t->ticks_not_accounted -= t->prev_period;
+                t->prev_period = t->period;
+            } else {
+                if (period_count) {
+                    t->irq_rate++;
+                    t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE);
+                }
+            }
+        }
+    } else {
+        update_irq(t, 1);
+    }
 }
 
 static void hpet_set_timer(HPETTimer *t)
@@ -525,6 +589,7 @@ static void hpet_ram_writel(void
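The backlog test at the heart of patch 5/5 can be isolated into a small pure function for illustration (field names follow the patch; the struct here is a stripped-down stand-in for HPETTimer):

```c
#include <stdint.h>

struct drift_state {
    uint64_t period;              /* current timer period, in HPET ticks */
    uint64_t prev_period;         /* period at the last delivered tick */
    uint64_t ticks_not_accounted; /* ticks not yet matched by an IRQ */
};

/* True when at least one full period beyond the "normal" two periods
 * (current + previous) is unaccounted for, i.e. an interrupt was lost
 * and compensation should start injecting extra ticks. */
static int has_tick_backlog(const struct drift_state *t)
{
    uint64_t backlog = 0;

    if (t->ticks_not_accounted >= t->period + t->prev_period) {
        backlog = t->ticks_not_accounted - (t->period + t->prev_period);
    }
    return backlog >= t->period;
}
```

Successful delivery subtracts prev_period from ticks_not_accounted, so in the steady state the backlog stays below one period and compensation stays off.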
[PATCH] kvm tools: Fix and improve the CPU register dump debug output code
* Pekka Enberg penb...@kernel.org wrote:

Ingo Molnar reported that 'kill -3' didn't work on his machine:

* Ingo Molnar mi...@elte.hu wrote:

This is really cumbersome to debug - is there some good way to get to the RIP that the guest is hanging in? If kvm would print that out to the host console (even if it's just the raw RIP initially) on a kill -3, that would help enormously.

Looks like the code should be doing that already - but the ioctl(KVM_GET_SREGS) hangs:

[pid 748] ioctl(6, KVM_GET_SREGS

Avi Kivity pointed out that it's not safe to call KVM_GET_SREGS (or other vcpu related ioctls) from other threads:

is it not OK to call KVM_GET_SREGS from other threads than the one that's doing KVM_RUN?

From Documentation/kvm/api.txt:

 - vcpu ioctls: These query and set attributes that control the operation of a single virtual cpu. Only run vcpu ioctls from the same thread that was used to create the vcpu.

Fix that up by using pthread_kill() to force the threads that are doing KVM_RUN to do the register dumps.
Reported-by: Ingo Molnar mi...@elte.hu
Cc: Asias He asias.he...@gmail.com
Cc: Avi Kivity a...@redhat.com
Cc: Cyrill Gorcunov gorcu...@gmail.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Prasad Joshi prasadjoshi...@gmail.com
Cc: Sasha Levin levinsasha...@gmail.com
Signed-off-by: Pekka Enberg penb...@kernel.org
---
 tools/kvm/kvm-run.c | 20 +---
 1 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/tools/kvm/kvm-run.c b/tools/kvm/kvm-run.c
index eb50b6a..58e2977 100644
--- a/tools/kvm/kvm-run.c
+++ b/tools/kvm/kvm-run.c
@@ -127,6 +127,18 @@ static const struct option options[] = {
 	OPT_END()
 };
 
+static void handle_sigusr1(int sig)
+{
+	struct kvm_cpu *cpu = current_kvm_cpu;
+
+	if (!cpu)
+		return;
+
+	kvm_cpu__show_registers(cpu);
+	kvm_cpu__show_code(cpu);
+	kvm_cpu__show_page_tables(cpu);
+}
+
 static void handle_sigquit(int sig)
 {
 	int i;
@@ -134,9 +146,10 @@ static void handle_sigquit(int sig)
 	for (i = 0; i < nrcpus; i++) {
 		struct kvm_cpu *cpu = kvm_cpus[i];
 
-		kvm_cpu__show_registers(cpu);
-		kvm_cpu__show_code(cpu);
-		kvm_cpu__show_page_tables(cpu);
+		if (!cpu)
+			continue;
+
+		pthread_kill(cpu->thread, SIGUSR1);
 	}
 
 	serial8250__inject_sysrq(kvm);

I can see a couple of problems with the debug printout code, which currently produces a stream of such dumps for each vcpu:

Registers:
 rip: rsp: 16ca flags: 00010002
 rax: rbx: rcx: rdx: rsi: rdi: rbp: 8000
 r8: r9: r10: r11: r12: r13: r14: r15:
 cr0: 6010 cr2: 0070 cr3: cr4: cr8:

Segment registers:
 register selector base limit type p dpl db s l g avl
 cs  f000 000f 031 3 0 1 0 0 0
 ss  1000 0001 031 3 0 1 0 0 0
 ds  1000 0001 031 3 0 1 0 0 0
 es  1000 0001 031 3 0 1 0 0 0
 fs  1000 0001 031 3 0 1 0 0 0
 gs  1000 0001 031 3 0 1 0 0 0
 tr  0b 1 0 0 0 0 0 0
 ldt 0 2 1 0 0 0 0 0 0
 gdt
 idt
 [ efer: apic base: fee00900 nmi: enabled ]

Interrupt bitmap:

Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 cf eb 0d 90 90 90 90 90 90 90 90 90 90 90 90 90 f6 c4 0e 75 4b

Stack:
 0x16ca: 00 00 00 00 00 00 00 00
 0x16d2: 00 00 00 00 00 00 00 00
 0x16da: 00 00 00 00 00 00 00 00
 0x16e2: 00 00 00 00 00 00 00 00

The problems are:

 - This does not work very well on SMP with lots of vcpus, because the printing is unserialized, resulting in a jumbled mess of an output, all vcpus trying to print to the console at once, often mixing lines and characters randomly.

 - stdout from a signal handler must be flushed, otherwise lines can remain buffered if someone saves the output via 'tee' for example.

 - the dumps from the various CPUs are not distinguishable - they are just dumped after each other with no
[PATCH] kvm tools: Dump vCPUs in order
* Ingo Molnar mi...@elte.hu wrote:

The patch below addresses these concerns, serializes the output, tidies up the printout, resulting in this new output:

There's one bug remaining that my patch does not address: the vCPUs are not printed in order:

# vCPU #0's dump:
# vCPU #2's dump:
# vCPU #24's dump:
# vCPU #5's dump:
# vCPU #39's dump:
# vCPU #38's dump:
# vCPU #51's dump:
# vCPU #11's dump:
# vCPU #10's dump:
# vCPU #12's dump:

This is undesirable as the order of printout is highly random, so successive dumps are difficult to compare. The patch below serializes the signalling itself. (this is on top of the previous patch)

The patch also tweaks the vCPU printout line a bit so that it does not start with '#', which is discarded if such messages are pasted into Git commit messages.

Signed-off-by: Ingo Molnar mi...@elte.hu

diff --git a/tools/kvm/kvm-run.c b/tools/kvm/kvm-run.c
index 221435d..00c70c7 100644
--- a/tools/kvm/kvm-run.c
+++ b/tools/kvm/kvm-run.c
@@ -25,6 +25,7 @@
 #include "kvm/term.h"
 #include "kvm/ioport.h"
 #include "kvm/threadpool.h"
+#include "kvm/barrier.h"
 
 /* header files for gitish interface */
 #include "kvm/kvm-run.h"
@@ -132,7 +133,7 @@ static const struct option options[] = {
  * Serialize debug printout so that the output of multiple vcpus does not
  * get mixed up:
  */
-static DEFINE_MUTEX(printout_mutex);
+static int printout_done;
 
 static void handle_sigusr1(int sig)
 {
@@ -141,13 +142,13 @@ static void handle_sigusr1(int sig)
 	if (!cpu)
 		return;
 
-	mutex_lock(&printout_mutex);
-	printf("\n#\n# vCPU #%ld's dump:\n#\n", cpu->cpu_id);
+	printf("\n #\n # vCPU #%ld's dump:\n #\n", cpu->cpu_id);
 	kvm_cpu__show_registers(cpu);
 	kvm_cpu__show_code(cpu);
 	kvm_cpu__show_page_tables(cpu);
 	fflush(stdout);
-	mutex_unlock(&printout_mutex);
+	printout_done = 1;
+	mb();
 }
 
 static void handle_sigquit(int sig)
@@ -160,7 +161,15 @@ static void handle_sigquit(int sig)
 		if (!cpu)
 			continue;
 
+		printout_done = 0;
 		pthread_kill(cpu->thread, SIGUSR1);
+		/*
+		 * Wait for the vCPU to dump state before signalling
+		 * the next thread. Since this is debug code it does
+		 * not matter that we are burning CPU time a bit:
+		 */
+		while (!printout_done)
+			mb();
 	}
 
 	serial8250__inject_sysrq(kvm);
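The flag handshake in Ingo's patch generalizes to any "poke a thread, spin until it reports done" serialization. A self-contained sketch, with a plain pthread standing in for pthread_kill()/SIGUSR1 delivery and the GCC builtin __sync_synchronize() playing the role of mb():

```c
#include <pthread.h>
#include <stdio.h>

static volatile int printout_done;

/* Worker: emit one dump, then signal completion through the flag. */
static void *dump_one_cpu(void *arg)
{
    printf("vCPU #%ld's dump\n", (long)arg); /* stand-in for register dump */
    fflush(stdout);                          /* flush before signalling */
    printout_done = 1;
    __sync_synchronize();
    return NULL;
}

/* Dump n "vCPUs" strictly one at a time; returns the number completed. */
static int run_serialized_dumps(int n)
{
    pthread_t tid;
    long id;
    int done = 0;

    for (id = 0; id < n; id++) {
        printout_done = 0;
        if (pthread_create(&tid, NULL, dump_one_cpu, (void *)id) != 0)
            break;
        while (!printout_done)          /* busy-wait; fine for debug code */
            __sync_synchronize();
        pthread_join(&tid, NULL);
        done++;
    }
    return done;
}
```

Because the signaller clears the flag before waking each worker and spins until the worker sets it, at most one dump is in flight at any moment, which is exactly what makes the output come out in vCPU order.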
Re: [PATCH] kvm: Add documentation for KVM_CAP_NR_VCPUS
On 05/07/2011 05:42 PM, Pekka Enberg wrote:

Document KVM_CAP_NR_VCPUS that can be used by userspace to determine the maximum number of VCPUs it can create with the KVM_CREATE_VCPU ioctl.

This capability was added in 2.6.26, so the documentation should state that if the capability is not available the user should assume 4 cpus max (the limit at the time).

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
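Putting Avi's advice into code, the userspace query might look like this (a sketch; KVM_CHECK_EXTENSION and KVM_CAP_NR_VCPUS are the real interfaces, and the fallback of 4 is the pre-2.6.26 limit he mentions):

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Ask the kernel how many VCPUs it recommends; if the capability is
 * absent (old kernel) or the ioctl fails, assume the historical
 * limit of 4. */
static int recommended_max_vcpus(int kvm_fd)
{
    int ret = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS);

    return ret > 0 ? ret : 4;
}
```

Here kvm_fd is an open file descriptor on /dev/kvm; KVM_CHECK_EXTENSION returns 0 (or an error) for unknown capabilities, which both collapse to the fallback.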
Re: [PATCH 2/2] KVM: Validate userspace_addr of memslot when registered
On 05/07/2011 10:35 AM, Takuya Yoshikawa wrote:

From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

This way, we can avoid checking the user space address many times when we read the guest memory. Although we could do the same for writes if we checked which slots are writable, we do not care about writes for now: reading the guest memory happens more often than writing.

Thanks, applied. I changed VERIFY_READ to VERIFY_WRITE, since the checks are exactly the same.
Re: [PATCH] kvm-s390: userspace access to guest storage keys
On 05/06/2011 01:25 PM, Carsten Otte wrote:

From: Carsten Otte co...@de.ibm.com

This patch gives userspace access to the guest visible storage keys. Three operations are supported:

KVM_S390_KEYOP_SSKE for setting storage keys, similar to the set storage key extended (SSKE) instruction.
KVM_S390_KEYOP_ISKE for reading storage key content, similar to the insert storage key extended (ISKE) instruction.
KVM_S390_KEYOP_RRBE for reading and resetting the page reference bit, similar to the reset reference bit extended (RRBE) instruction.

Note that all functions take userspace addresses as input, which typically differ from guest addresses. This work was requested by Alex Graf for guest live migration: different from x86, the guest's view of dirty and reference information is not stored in the page table entries that are part of the guest address space but is stored in the storage key instead. Thus, the storage key needs to be read, transferred, and written back on the migration target side.

And not in main memory, either?

Signed-off-by: Carsten Otte co...@de.ibm.com
---
 arch/s390/include/asm/kvm_host.h |    4 +
 arch/s390/kvm/kvm-s390.c         |  149 ++-
 include/linux/kvm.h              |    7 +
 Documentation/kvm/api.txt
 3 files changed, 157 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/s390/include/asm/kvm_host.h
===
--- linux-2.6.orig/arch/s390/include/asm/kvm_host.h
+++ linux-2.6/arch/s390/include/asm/kvm_host.h
@@ -47,6 +47,10 @@ struct sca_block {
 #define KVM_HPAGE_MASK(x)      (~(KVM_HPAGE_SIZE(x) - 1))
 #define KVM_PAGES_PER_HPAGE(x) (KVM_HPAGE_SIZE(x) / PAGE_SIZE)
 
+#define KVM_S390_KEYOP_SSKE 0x01
+#define KVM_S390_KEYOP_ISKE 0x02
+#define KVM_S390_KEYOP_RRBE 0x03

kvm_host.h is not exported to userspace. Use asm/kvm.h instead.

snip black magic
Re: KVM migration with different source and dest paths
On 05/06/2011 12:30 PM, Onkar Mahajan wrote:

Is it possible to migrate a KVM guest to different paths, like this:

path of the image of guest01 on host A: /home/joe/guest01.img
path of the image of guest01 on host B: /home/bill/image/temp/guest01.img

Is this possible? If it is, any pointers as to how to do this?

Are you referring to a single image which is accessible via two paths? Or two different images? Both are possible (the former by simply using the new path for the destination, the latter by using the migrate -b switch).
Re: X86EMUL_PROPAGATE_FAULT
On 05/05/2011 06:05 PM, Matteo wrote:

Hello to everybody, I am working on KVM version 2.6.38 and I'm facing a new problem with an emulated instruction whose implementation is already in kvm. The error shows up after the emulation of the RET opcode (C3 byte opcode). When trying to emulate the instruction at the address loaded after the pop made by the RET, I get an X86EMUL_PROPAGATE_FAULT error due to gpa == UNMAPPED_GVA, as you can see in the following debug trace:

---8<---
x86_decode_insn:2705 - Starting New Instruction Decode
x86_decode_insn:2709 - c->eip = ctxt->eip = 3226138255
x86_decode_insn:2759 - Opcode - c3
x86_decode_insn:2928 - Decode and fetch the source operand
x86_decode_insn:2931 - SrcNone
x86_decode_insn:3015 - Decode and fetch the second source operand
x86_decode_insn:3018 - Src2None
x86_decode_insn:3044 - Decode and fetch the destination operand
x86_decode_insn:3089 - ImplicitOps
x86_decode_insn:3092 - No Destination Operand
x86_emulate_instruction:4458 - Returned from x86_decode_insn with r = 0
x86_emulate_insn:3194 - starting special_insn...
x86_emulate_insn:3196 - c->eip = 3226138256
x86_emulate_insn:3565 - starting writeback...
writeback:1178 - c->eip = 2147483648
x86_emulate_instruction:4538 - Return from x86_emulate_insn with code r = 0
---8<---

So the next instruction will be emulated reading the opcode with eip=2147483648 as stated before, but the emulation fails with the following debug trace:

---8<---
x86_decode_insn:2705 - Starting New Instruction Decode
x86_decode_insn:2709 - c->eip = ctxt->eip = 2147483648
x86_decode_insn:2757 - Read opcode from eip
kvm_read_guest_virt_helper:3724 - gpa == UNMAPPED_GVA return X86EMUL_PROPAGATE_FAULT
do_fetch_insn_byte:573 - ops->fetch returns an error
---8<---

The instruction has returned to an EIP that is outside RAM, so kvm is unable to fetch the next instruction. This is likely due to a bug (in kvm or the guest) that has occurred much earlier.
Re: [PATCH 1/2] rcu: provide rcu_virt_note_context_switch() function.
On 05/04/2011 07:35 PM, Paul E. McKenney wrote:

On Wed, May 04, 2011 at 04:31:03PM +0300, Gleb Natapov wrote:

Provide rcu_virt_note_context_switch() for virtualization use to note a quiescent state during guest entry.

Very good, queued on -rcu. Unless you tell me otherwise, I will assume that you want to carry the patch modifying KVM to use this.

Is -rcu a fast-forward-only tree (like tip)? If so I'll merge it and apply patch 2.
Re: [PATCH 0/6] KVM: x86 emulator: Unused opt removal and some cleanups
On 05/01/2011 08:21 PM, Takuya Yoshikawa wrote:

Patches 0-4: Just remove unused opt
Patch 5: grpX emulation cleanup
Patch 6: jmp far emulation cleanup

Some functions introduced in patch 5/6 will be called by opcode::execute later.

Applied, thanks.
Re: Trouble adding kvm clock trace to qemu-kvm
On 04/30/2011 08:00 PM, Chris Thompson wrote:

I'm trying to add a trace to qemu-kvm that will log the value of the vcpu's clock when a specific interrupt gets pushed. I'm working with qemu-kvm-0.14.0 on the 2.6.32-31 kernel. I've added the following to kvm_arch_try_push_interrupts in qemu-kvm-x86.c:

    if (irq == 41) {
        /* Get the VCPU's TSC */
        struct kvm_clock_data clock;
        kvm_vcpu_ioctl(env, KVM_GET_CLOCK, &clock);
        uint64_t ticks = clock.clock;
        trace_kvm_clock_at_injection(ticks);
    }

This mechanism is only active with -no-kvm-irqchip; otherwise interrupt injection happens in the kernel.

And here's the trace event I added:

    kvm_clock_at_injection(uint64_t ticks) "interrupt 41 at clock %" PRIu64

I have that trace and the virtio_blk_req_complete trace enabled. An excerpt from the resulting trace output from simpletrace.py:

    virtio_blk_req_complete 288390365546367 30461.681 req=46972352 status=0
    kvm_clock_at_injection 288390365546578 0.211 ticks=46972352
    virtio_blk_req_complete 288390394870065 29323.487 req=46972352 status=0
    kvm_clock_at_injection 288390394870276 0.211 ticks=46972352

Am I getting the guest's clock incorrectly? And even if so, why is it the same as the request pointer that virtio_blk_req_complete reports? Any ideas are appreciated.

What is the 'ticks' field?
Re: [PATCH 06/30] nVMX: Decoding memory operands of VMX instructions
On 05/08/2011 11:18 AM, Nadav Har'El wrote:
> This patch includes a utility function for decoding pointer operands of VMX instructions issued by L1 (a guest hypervisor)
>
> +	/*
> +	 * TODO: throw #GP (and return 1) in various cases that the VM*
> +	 * instructions require it - e.g., offset beyond segment limit,
> +	 * unusable or unreadable/unwritable segment, non-canonical 64-bit
> +	 * address, and so on. Currently these are not checked.
> +	 */
> +	return 0;
> +}
> +

Note: emulate.c now contains a function (linearize()) which does these calculations. We need to generalize it and expose it so nvmx can make use of it.

There is no real security concern since these instructions are only allowed from cpl 0 anyway.
Re: [PATCH 15/30] nVMX: Move host-state field setup to a function
On 05/08/2011 11:22 AM, Nadav Har'El wrote:
> Move the setting of constant host-state fields (fields that do not change throughout the life of the guest) from vmx_vcpu_setup to a new common function vmx_set_constant_host_state(). This function will also be used to set the host state when running L2 guests.
>
>   */
>  static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>  {
> -	u32 host_sysenter_cs, msr_low, msr_high;
> -	u32 junk;
> +	u32 msr_low, msr_high;

Unused?
Re: [PATCH] kvm-s390: userspace access to guest storage keys
On 09.05.2011, at 10:43, Avi Kivity wrote:
> On 05/06/2011 01:25 PM, Carsten Otte wrote:
>> From: Carsten Otte <co...@de.ibm.com>
>>
>> This patch gives userspace access to the guest visible storage keys. Three operations are supported: KVM_S390_KEYOP_SSKE for setting storage keys, similar to the set storage key extended (SSKE) instruction. KVM_S390_KEYOP_ISKE for reading storage key content, similar to the insert storage key extended (ISKE) instruction. KVM_S390_KEYOP_RRBE for reading and resetting the page reference bit, similar to the reset reference bit extended (RRBE) instruction. Note that all functions take userspace addresses as input, which typically differ from guest addresses.
>>
>> This work was requested by Alex Graf for guest live migration: different from x86, the guest's view of dirty and reference information is not stored in the page table entries that are part of the guest address space but in the storage key instead. Thus, the storage key needs to be read, transferred, and written back on the migration target side.
>
> And not in main memory, either?

Nope - storage keys are only accessible using special instructions. They're not in RAM (visible to a guest) :).

Alex
Re: [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
On 05/08/2011 11:23 AM, Nadav Har'El wrote:
> This patch contains code to prepare the VMCS which can be used to actually run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (our desires for our own guests).
>
> +/*
> + * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
> + * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function merges it
> + * with L0's requirements for its guest (a.k.a. vmcs01), so we can run the L2
> + * guest in a way that will both be appropriate to L1's requests, and our
> + * needs. In addition to modifying the active vmcs (which is vmcs02), this
> + * function also has additional necessary side-effects, like setting various
> + * vcpu->arch fields.
> + */
> +static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> +{

[snip]

> +	vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);

I think this is wrong - anything having to do with vmcs linking will need to be emulated; we can't let the cpu see the real value (and even if we don't emulate, we have to translate addresses like you do for the apic access page).

> +	vmcs_write64(TSC_OFFSET,
> +		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);

This is probably wrong (everything with time is probably wrong), but we can deal with it (much) later.
Re: [PATCH] kvm tools: Rename pci_device to pci_hdr for clarity
On 05/07/2011 06:50 PM, Sasha Levin wrote:
> Signed-off-by: Sasha Levin <levinsasha...@gmail.com>
> ---
>  tools/kvm/virtio/blk.c | 14 +++++++-------
>  1 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/tools/kvm/virtio/blk.c b/tools/kvm/virtio/blk.c
> index accfc3e..cc3dc78 100644
> --- a/tools/kvm/virtio/blk.c
> +++ b/tools/kvm/virtio/blk.c
> @@ -45,7 +45,7 @@ struct blk_dev {
>  	struct virt_queue		vqs[NUM_VIRT_QUEUES];
>  	struct blk_dev_job		jobs[NUM_VIRT_QUEUES];
> -	struct pci_device_header	pci_device;
> +	struct pci_device_header	pci_hdr;
>  };
>
>  static struct blk_dev *bdevs[VIRTIO_BLK_MAX_DEV];
> @@ -103,7 +103,7 @@ static bool virtio_blk_pci_io_in(struct kvm *self, u16 port, void *data, int siz
>  		break;
>  	case VIRTIO_PCI_ISR:
>  		ioport__write8(data, 0x1);
> -		kvm__irq_line(self, bdev->pci_device.irq_line, 0);
> +		kvm__irq_line(self, bdev->pci_hdr.irq_line, 0);
>  		break;
>  	case VIRTIO_MSI_CONFIG_VECTOR:
>  		ioport__write16(data, bdev->config_vector);
> @@ -167,7 +167,7 @@ static void virtio_blk_do_io(struct kvm *kvm, void *param)
>  	while (virt_queue__available(vq))
>  		virtio_blk_do_io_request(kvm, bdev, vq);
> -	kvm__irq_line(kvm, bdev->pci_device.irq_line, 1);
> +	kvm__irq_line(kvm, bdev->pci_hdr.irq_line, 1);
>  }
>
>  static bool virtio_blk_pci_io_out(struct kvm *self, u16 port, void *data, int size, u32 count)
> @@ -283,7 +283,7 @@ void virtio_blk__init(struct kvm *self, struct disk_image *disk)
>  		.blk_config = (struct virtio_blk_config) {
>  			.capacity = disk->size / SECTOR_SIZE,
>  		},
> -		.pci_device = (struct pci_device_header) {
> +		.pci_hdr = (struct pci_device_header) {
>  			.vendor_id	= PCI_VENDOR_ID_REDHAT_QUMRANET,
>  			.device_id	= PCI_DEVICE_ID_VIRTIO_BLK,
>  			.header_type	= PCI_HEADER_TYPE_NORMAL,
> @@ -298,10 +298,10 @@ void virtio_blk__init(struct kvm *self, struct disk_image *disk)
>  	if (irq__register_device(PCI_DEVICE_ID_VIRTIO_BLK, &dev, &pin, &line) < 0)
>  		return;
> -	bdev->pci_device.irq_pin	= pin;
> -	bdev->pci_device.irq_line	= line;
> +	bdev->pci_hdr.irq_pin	= pin;
> +	bdev->pci_hdr.irq_line	= line;
> -	pci__register(&bdev->pci_device, dev);
> +	pci__register(&bdev->pci_hdr, dev);
>  	ioport__register(blk_dev_base_addr, &virtio_blk_io_ops, IOPORT_VIRTIO_BLK_SIZE);
>  }

Does this renaming apply to other devices as well? Mind to send followup patches?

--
Best Regards,
Asias He
Re: [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
Hi, and thanks again for the reviews.

On Mon, May 09, 2011, Avi Kivity wrote about "Re: [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12":
> > +	vmcs_write64(TSC_OFFSET,
> > +		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
>
> This is probably wrong (everything with time is probably wrong), but we can deal with it (much) later.

I thought this was right :-) Why do you believe it to be wrong?

L1 wants to add vmcs12->tsc_offset to its own TSC to generate L2's TSC. But L1's TSC is itself offset by vmx->nested.vmcs01_tsc_offset from L0's TSC. So their sum, vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset, is the offset of L2's TSC from L0's TSC. Am I missing something?

Thanks, Nadav.

--
Nadav Har'El | Monday, May 9 2011, 5 Iyyar 5771
n...@math.technion.ac.il | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il | I couldn't think of an interesting signature to put here... Maybe next time.
Re: [PATCH 15/30] nVMX: Move host-state field setup to a function
On Mon, May 09, 2011, Avi Kivity wrote about "Re: [PATCH 15/30] nVMX: Move host-state field setup to a function":
> >  static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
> >  {
> > -	u32 host_sysenter_cs, msr_low, msr_high;
> > -	u32 junk;
> > +	u32 msr_low, msr_high;
>
> Unused?

Well, it actually is used, because I left the GUEST_IA32_PAT setting in vmx_vcpu_setup. I guess I could have moved these two variables inside the if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) block, but I didn't. Similarly, the host_pat variable can also move inside the if(). I'll make these changes.

-- Nadav Har'El
Re: [PATCH 20/30] nVMX: Exiting from L2 to L1
On 05/08/2011 11:25 AM, Nadav Har'El wrote:
> This patch implements nested_vmx_vmexit(), called when the nested L2 guest exits and we want to run its L1 parent and let it handle this exit.
>
> Note that this will not necessarily be called on every L2 exit. L0 may decide to handle a particular exit on its own, without L1's involvement; in that case, L0 will handle the exit, and resume running L2, without running L1 and without calling nested_vmx_vmexit(). The logic for deciding whether to handle a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(), will appear in the next patch.
>
> +/*
> + * prepare_vmcs12 is part of what we need to do when the nested L2 guest exits
> + * and we want to prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12),
> + * and this function updates it to reflect the changes to the guest state while
> + * L2 was running (and perhaps made some exits which were handled directly by L0
> + * without going back to L1), and to reflect the exit reason.
> + * Note that we do not have to copy here all VMCS fields, just those that
> + * could have changed by the L2 guest or the exit - i.e., the guest-state and
> + * exit-information fields only. Other fields are modified by L1 with VMWRITE,
> + * which already writes to vmcs12 directly.
> + */
> +void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> +{

[snip]

> +	vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);

Again, this should be emulated, not assigned to the guest.
Re: [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
On 05/09/2011 01:27 PM, Nadav Har'El wrote:
> Hi, and thanks again for the reviews.
>
> On Mon, May 09, 2011, Avi Kivity wrote about "Re: [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12":
> > > +	vmcs_write64(TSC_OFFSET,
> > > +		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> >
> > This is probably wrong (everything with time is probably wrong), but we can deal with it (much) later.
>
> I thought this was right :-) Why do you believe it to be wrong?

Just out of principle, everything to do with time is wrong.

> L1 wants to add vmcs12->tsc_offset to its own TSC to generate L2's TSC. But L1's TSC is itself offset by vmx->nested.vmcs01_tsc_offset from L0's TSC. So their sum, vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset, is the offset of L2's TSC from L0's TSC. Am I missing something?

Only Zach knows.
Re: [PATCH 22/30] nVMX: Correct handling of interrupt injection
On 05/08/2011 11:26 AM, Nadav Har'El wrote:
> When KVM wants to inject an interrupt, the guest should think a real interrupt has happened. Normally (in the non-nested case) this means checking that the guest doesn't block interrupts (and if it does, inject when it doesn't - using the interrupt window VMX mechanism), and setting up the appropriate VMCS fields for the guest to receive the interrupt.
>
> However, when we are running a nested guest (L2) and its hypervisor (L1) requested exits on interrupts (as most hypervisors do), the most efficient thing to do is to exit L2, telling L1 that the exit was caused by an interrupt, the one we were injecting; only when L1 asked not to be notified of interrupts should we inject directly to the running L2 guest (i.e., the normal code path).
>
> However, properly doing what is described above requires invasive changes to the flow of the existing code, which we elected not to do at this stage. Instead we do something more simplistic and less efficient: we modify vmx_interrupt_allowed(), which kvm calls to see if it can inject the interrupt now, to exit from L2 to L1 before continuing the normal code. The normal kvm code then notices that L1 is blocking interrupts, and sets the interrupt window to inject the interrupt later to L1. Shortly after, L1 gets the interrupt while it is itself running, not as an exit from L2. The cost is an extra L1 exit (the interrupt window).
>
> Signed-off-by: Nadav Har'El <n...@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c | 35 +++++++++++++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
>
> --- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
> @@ -3675,9 +3675,25 @@ out:
>  	return ret;
>  }
>
> +/*
> + * In nested virtualization, check if L1 asked to exit on external interrupts.
> + * For most existing hypervisors, this will always return true.
> + */
> +static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
> +{
> +	return get_vmcs12(vcpu)->pin_based_vm_exec_control &
> +		PIN_BASED_EXT_INTR_MASK;
> +}
> +
>  static void enable_irq_window(struct kvm_vcpu *vcpu)
>  {
>  	u32 cpu_based_vm_exec_control;
> +	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
> +		/* We can get here when nested_run_pending caused
> +		 * vmx_interrupt_allowed() to return false. In this case, do
> +		 * nothing - the interrupt will be injected later.
> +		 */
> +		return;

Why not do (or schedule) the nested vmexit here? It's more natural than in vmx_interrupt_allowed(), which from its name you'd expect to only read stuff. I guess it can live for now if there's some unexpected complexity there.

>  	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>  	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
> @@ -3800,6 +3816,13 @@ static void vmx_set_nmi_mask(struct kvm_
>  static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
>  {
> +	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
> +		if (to_vmx(vcpu)->nested.nested_run_pending)
> +			return 0;
> +		nested_vmx_vmexit(vcpu, true);
> +		/* fall through to normal code, but now in L1, not L2 */
> +	}
> +
>  	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
>  		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
>  			(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
> @@ -5463,6 +5486,14 @@ static int vmx_handle_exit(struct kvm_vc
>  	if (vmx->emulation_required && emulate_invalid_guest_state)
>  		return handle_invalid_guest_state(vcpu);
>
> +	/*
> +	 * the KVM_REQ_EVENT optimization bit is only on for one entry, and if
> +	 * we did not inject a still-pending event to L1 now because of
> +	 * nested_run_pending, we need to re-enable this bit.
> +	 */
> +	if (vmx->nested.nested_run_pending)
> +		kvm_make_request(KVM_REQ_EVENT, vcpu);
> +
>  	if (exit_reason == EXIT_REASON_VMLAUNCH ||
>  	    exit_reason == EXIT_REASON_VMRESUME)
>  		vmx->nested.nested_run_pending = 1;
> @@ -5660,6 +5691,8 @@ static void __vmx_complete_interrupts(st
>  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>  {
> +	if (is_guest_mode(&vmx->vcpu))
> +		return;
>  	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
>  				  VM_EXIT_INSTRUCTION_LEN,
>  				  IDT_VECTORING_ERROR_CODE);
> @@ -5667,6 +5700,8 @@ static void vmx_complete_interrupts(stru
>  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
>  {
> +	if (is_guest_mode(vcpu))
> +		return;

Hmm. What if L0 injected something into L2?
Re: [PATCH 24/30] nVMX: Correct handling of idt vectoring info
On 05/08/2011 11:27 AM, Nadav Har'El wrote:
> This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested case. When a guest exits while handling an interrupt or exception, we get this information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1, there's nothing we need to do, because L1 will see this field in vmcs12, and handle it itself. However, when L2 exits and L0 handles the exit itself and plans to return to L2, L0 must inject this event to L2.
>
> In the normal non-nested case, the idt_vectoring_info case is discovered after the exit, and the decision to inject (though not the injection itself) is made at that point. However, in the nested case a decision of whether to return to L2 or L1 also happens during the injection phase (see the previous patches), so in the nested case we can only decide what to do about the idt_vectoring_info right after the injection, i.e., in the beginning of vmx_vcpu_run, which is the first time we know for sure if we're staying in L2 (i.e., nested_mode is true).
>
> +static void nested_handle_valid_idt_vectoring_info(struct vcpu_vmx *vmx)
> +{
> +	int irq  = vmx->idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
> +	int type = vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
> +	int errCodeValid = vmx->idt_vectoring_info &
> +		VECTORING_INFO_DELIVER_CODE_MASK;

Innovative coding style.

> +	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> +		irq | type | INTR_INFO_VALID_MASK | errCodeValid);
> +

Why not do a 1:1 copy?

> +	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> +		vmx->nested.vm_exit_instruction_len);
> +	if (errCodeValid)
> +		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> +			vmx->nested.idt_vectoring_error_code);
> +}
> +
>  #ifdef CONFIG_X86_64
>  #define R "r"
>  #define Q "q"
Re: Graphics pass-through
On 2011-05-05 17:17, Alex Williamson wrote:
> > And what about the host? When does Linux release the legacy range? Always or only when a specific (!=vga/vesa) framebuffer driver is loaded?
>
> Well, that's where it'd be nice if the vga arbiter was actually in more widespread use. It currently seems to be nothing more than a shared mutex, but it would actually be useful if it included backends to do the chipset vga routing changes. I think when I was testing this, I was externally poking the PCI bridge chipset to toggle the VGA_EN bit.

Right, we had to drop the approach to pass through the secondary card for now; the arbiter was not switching properly. Haven't checked yet if VGA_EN was properly set, though the kernel code looks like it should take care of this.

Even with handing out the primary adapter, we had only mixed success so far. The onboard adapter worked well (in VESA mode), but the NVIDIA is not displaying early boot messages at all. Maybe a vgabios issue. Windows was booting nevertheless - until we installed the NVIDIA drivers. Then it ran into a blue screen.

BTW, what ATI adapter did you use precisely, and what did work, what not?

One thing I was wondering: most modern adapters should be PCIe these days. Our NVIDIA definitely is. But so far we are claiming to have it attached to a PCI bus. That hides all the extended capabilities, e.g. Could this make some relevant difference?

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 0/30] nVMX: Nested VMX, v9
On 05/08/2011 11:15 AM, Nadav Har'El wrote:
> Hi,
>
> This is the ninth iteration of the nested VMX patch set. This iteration addresses all of the comments and requests that were raised by reviewers in the previous rounds, with only a few exceptions listed below.
>
> Some of the issues which were solved in this version include:
>
> * Overhauled the hardware VMCS (vmcs02) allocation. Previously we had up to 256 vmcs02s, one for each L2. Now we only have one, which is reused. We also have a compile-time option VMCS02_POOL_SIZE to keep a bigger pool of vmcs02s. This option will be useful in the future if vmcs02 won't be filled from scratch on each entry from L1 to L2 (currently, it is).
> * The vmcs01 structure, containing a copy of all fields from L1's VMCS, was unnecessary, as all the necessary values are either known to KVM or appear in vmcs12. This structure is now gone for good.
> * There is no longer a vmcs_fields sub-structure that everyone disliked. All the VMCS fields appear directly in the vmcs12 structure, which makes the code simpler and more readable.
> * Make sure that the vmcs12 fields have fixed sizes and locations, and add some extra padding, to support live migration and improve future-proofing.
> * For some fields, nested exit used to fail to return the host state as set by L1. Fixed that.
> * nested_vmx_exit_handled (deciding whether to let L1 handle an exit, or handle it in L0 and return to L2) is now more correct, and handles more exit reasons.
> * Complete overhaul of the cr0, exception bitmap, cr3 and cr4 handling code. The code is now shorter (uses existing functions like kvm_set_cr3, etc.), more readable, and more uniform (no pieces of code for enable_ept and not, less special code for cr0.TS, and none of that ugly cr0.PG monkey-business).
> * Use kvm_register_write(), kvm_rip_read(), etc. Got rid of the new and now unneeded function sync_cached_regs_to_vcms().
> * Fix return value of the VMX msrs to be more correct, and more constant (not to needlessly vary on different hosts).
> * Added some more missing verifications to vmcs12's fields (cleanly failing the nested entry if these verifications fail).
> * Expose the MSR-bitmap feature to L1. Every MSR access still exits to L0, but slow exits to L1 are avoided when L1's MSR bitmap doesn't want it.
> * Removed or rate-limited printouts which could be exploited by guests.
> * Fix VM_ENTRY_LOAD_IA32_PAT feature handling.
> * Fixed a potential bug and verified that nested vmx now works with both CONFIG_PREEMPT and CONFIG_SMP enabled.
> * Dozens of other code cleanups and bug fixes.
>
> Only a few issues from previous reviews remain unaddressed. These are:
>
> * The interrupt injection and IDT_VECTORING_INFO_FIELD handling code was still not rewritten. It works, though ;-)
> * No KVM autotests for nested VMX yet.
> * Merging of L0's and L1's MSR bitmaps (and IO bitmaps) is still not supported. As explained above, the current code uses L1's MSR bitmap to avoid costly exits to L1, but still suffers exits to L0 on each MSR access in L2.
> * Still no option for disabling some capabilities advertised to L1.
> * No support for the TPR_SHADOW feature for L1.
>
> This new set of patches applies to the current KVM trunk (I checked with 082f9eced53d50c136e42d072598da4be4b9ba23). If you wish, you can also check out an already-patched version of KVM from branch nvmx9 of the repository: git://github.com/nyh/kvm-nested-vmx.git
>
> About nested VMX:
> -----------------
>
> The following 30 patches implement nested VMX support. This feature enables a guest to use the VMX APIs in order to run its own nested guests. In other words, it allows running hypervisors (that use VMX) under KVM. Multiple guest hypervisors can be run concurrently, and each of those can in turn host multiple guests.
>
> The theory behind this work, our implementation, and its performance characteristics were presented in OSDI 2010 (the USENIX Symposium on Operating Systems Design and Implementation). Our paper was titled "The Turtles Project: Design and Implementation of Nested Virtualization", and was awarded "Jay Lepreau Best Paper". The paper is available online, at: http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
>
> This patch set does not include all the features described in the paper. In particular, this patch set is missing nested EPT (L1 can't use EPT and must use shadow page tables). It is also missing some features required to run VMWare hypervisors as a guest. These missing features will be sent as follow-on patches.
>
> Running nested VMX:
> -------------------
>
> The nested VMX feature is currently disabled by default. It must be explicitly enabled with the nested=1 option to the kvm-intel module. No modifications are required to user space (qemu). However, qemu's default emulated CPU type (qemu64) does not list the VMX CPU feature, so it must be explicitly enabled, by giving
Re: [PATCH] kvm-s390: userspace access to guest storage keys
On 05/09/2011 01:11 PM, Alexander Graf wrote:
> > And not in main memory, either?
>
> Nope - storage keys are only accessible using special instructions. They're not in RAM (visible to a guest) :).

Interesting, so where are they kept? An on-chip memory? That would limit the amount of main memory to that indexed by the chip. Extra off-chip memory?

Asking purely out of interest, this has no bearing on the patch.
Re: [PATCH] kvm tools: Rename pci_device to pci_hdr for clarity
On Mon, 2011-05-09 at 18:35 +0800, Asias He wrote:
> On 05/07/2011 06:50 PM, Sasha Levin wrote:
> > [pci_device -> pci_hdr rename patch quoted in full, snipped]
>
> Does this renaming apply to other devices as well? Mind to send followup patches?

No, it was virtio-blk specific. I named the var pci_device when I added multiple virtio-blk device support.

--
Sasha.
Re: [PATCH] kvm-s390: userspace access to guest storage keys
On 09.05.2011, at 13:20, Avi Kivity wrote:
> On 05/09/2011 01:11 PM, Alexander Graf wrote:
> > Nope - storage keys are only accessible using special instructions. They're not in RAM (visible to a guest) :).
>
> Interesting, so where are they kept? An on-chip memory? That would limit the amount of main memory to that indexed by the chip. Extra off-chip memory?

I'll leave the answer to this question to the real experts :). All I know is that the ISA defines special instructions to fetch them, so where they are stored is implementation dependent. In qemu, for example, they are kept in an extra array that's just malloc'ed.

Alex
Re: [PATCH 0/30] nVMX: Nested VMX, v9
On Mon, May 09, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> Okay, truly excellent. The code is now a lot more readable, and I'm almost beginning to understand it. The code comments are also very good, I wish we had the same quality comments in the rest of kvm.
>
> We can probably merge the next iteration if there aren't significant comments from others.

Thanks!

> The only worrying thing is the issue you raise in patch 8. Is there a simple fix you can push that addresses correctness?

I'll fix this for the next iteration. I wanted to avoid changing the existing vcpus_on_cpu machinery, but you're probably right - it's better to just do this correctly once and for all than to try to explain the problem away, or to pray that future processors also continue to work properly if you forget to vmclear a vmcs...

-- Nadav Har'El
KVM call agenda for May 10th
Please send in any agenda items you are interested in covering. From last week, we have already:

- import kvm headers into qemu, drop #ifdef maze (Jan)

Thanks, Juan.
Re: [PATCH 1/2] rcu: provide rcu_virt_note_context_switch() function.
On Mon, May 09, 2011 at 11:51:34AM +0300, Avi Kivity wrote:
> On 05/04/2011 07:35 PM, Paul E. McKenney wrote:
> > On Wed, May 04, 2011 at 04:31:03PM +0300, Gleb Natapov wrote:
> > > Provide rcu_virt_note_context_switch() for virtualization use, to note a quiescent state during guest entry.
> >
> > Very good, queued on -rcu. Unless you tell me otherwise, I will assume that you want to carry the patch modifying KVM to use this.
>
> Is -rcu a fast-forward-only tree (like tip)? If so I'll merge it and apply patch 2.

Yep, -rcu is subject to rebase and feeds into -tip. The patch is SHA 29ce83181dd757d3116bf774aafffc4b6b20 in git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu.git. Branch is rcu/next. My guess is that this commit will show up in -tip soon.

Thanx, Paul
Re: Trouble adding kvm clock trace to qemu-kvm
On Sat, Apr 30, 2011 at 6:00 PM, Chris Thompson cth...@cs.umn.edu wrote: I'm trying to add a trace to qemu-kvm that will log the value of the vcpu's clock when a specific interrupt gets pushed. I'm working with qemu-kvm-0.14.0 on the 2.6.32-31 kernel. I've added the following to kvm_arch_try_push_interrupts in qemu-kvm-x86.c: if (irq == 41) { // Get the VCPU's TSC struct kvm_clock_data clock; kvm_vcpu_ioctl(env, KVM_GET_CLOCK, &clock); uint64_t ticks = clock.clock; trace_kvm_clock_at_injection(ticks); } And here's the trace event I added: kvm_clock_at_injection(uint64_t ticks) interrupt 41 at clock %PRIu64 I have that trace and the virtio_blk_req_complete trace enabled. An excerpt from the resulting trace output from simpletrace.py: virtio_blk_req_complete 288390365546367 30461.681 req=46972352 status=0 kvm_clock_at_injection 288390365546578 0.211 ticks=46972352 virtio_blk_req_complete 288390394870065 29323.487 req=46972352 status=0 kvm_clock_at_injection 288390394870276 0.211 ticks=46972352 Did you modify simpletrace.py? The 288390365546367 field should not be there. The output format should be: trace-event-name delta-microseconds [arg0=val0...] It looks like your simpletrace.py may be pretty-printing trace records incorrectly. If you have a public git tree you can link to I'd be happy to check that simpletrace.py is working. Stefan
Re: VMSTATE_U64 vs. VMSTATE_UINT64
Jan Kiszka jan.kis...@siemens.com wrote: Hi guys, can anyone comment on commit e4d6d49061 (introduce VMSTATE_U64) in qemu-kvm again? I strongly suspect this thing was only introduced to be able to grab from a __u64 (for kvmclock) without generating a compiler warning that you may get when using uint64_t, right? Yes, it was that on 64 bit __u64 was unsigned long long while uint64_t was only unsigned long, or something like that. I have forgotten whether it was kvmclock or irqchip. I think that Marcelo also requested it at some point, but I don't remember the details :-( Later, Juan.
Re: Graphics pass-through
On Mon, 2011-05-09 at 13:14 +0200, Jan Kiszka wrote: On 2011-05-05 17:17, Alex Williamson wrote: And what about the host? When does Linux release the legacy range? Always or only when a specific (!=vga/vesa) framebuffer driver is loaded? Well, that's where it'd be nice if the vga arbiter was actually in more widespread use. It currently seems to be nothing more than a shared mutex, but it would actually be useful if it included backends to do the chipset vga routing changes. I think when I was testing this, I was externally poking the PCI bridge chipset to toggle the VGA_EN bit. Right, we had to drop the approach to pass through the secondary card for now, the arbiter was not switching properly. Haven't checked yet if VGA_EN was properly set, though the kernel code looks like it should take care of this. Even with handing out the primary adapter, we had only mixed success so far. The onboard adapter worked well (in VESA mode), but the NVIDIA is not displaying early boot messages at all. Maybe a vgabios issue. Windows was booting nevertheless - until we installed the NVIDIA drivers. Then it ran into a blue screen. Interesting, IIRC I could never get VESA modes to work. I believe I only had a basic VGA16 mode running in a Windows guest too. BTW, what ATI adapter did you use precisely, and what did work, what not? I have an old X550 (rv380?). I also have an Nvidia gs8400, but ISTR the ATI working better for me. One thing I was wondering: Most modern adapters should be PCIe these days. Our NVIDIA definitely is. But so far we are claiming to have it attached to a PCI bus. That caps all the extended capabilities, e.g. Could this make some relevant difference? The BIOS and early boot use shouldn't care too much about that, but I could imagine the high performance drivers potentially failing.
Thanks, Alex
Re: Graphics pass-through
On Mon, May 9, 2011 at 12:14 PM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-05-05 17:17, Alex Williamson wrote: And what about the host? When does Linux release the legacy range? Always or only when a specific (!=vga/vesa) framebuffer driver is loaded? Well, that's where it'd be nice if the vga arbiter was actually in more widespread use. It currently seems to be nothing more than a shared mutex, but it would actually be useful if it included backends to do the chipset vga routing changes. I think when I was testing this, I was externally poking PCI bridge chipset to toggle the VGA_EN bit. Right, we had to drop the approach to pass through the secondary card for now, the arbiter was not switching properly. Haven't checked yet if VGA_EN was properly set, though the kernel code looks like it should take care of this. Even with handing out the primary adapter, we had only mixed success so far. The onboard adapter worked well (in VESA mode), but the NVIDIA is not displaying early boot messages at all. Maybe a vgabios issue. Windows was booting nevertheless - until we installed the NVIDIA drivers. Than it ran into a blue screen. BTW, what ATI adapter did you use precisely, and what did work, what not? Not hijacking the mail thread. Just wanted to provide some inputs. Few days back I had tried passing through the secondary graphics card. I could pass-through two graphics cards to virtual machine. 
02:00.0 VGA compatible controller: ATI Technologies Inc Redwood [Radeon HD 5670] (prog-if 00 [VGA controller])
    Subsystem: PC Partner Limited Device e151
    Flags: bus master, fast devsel, latency 0, IRQ 87
    Memory at d000 (64-bit, prefetchable) [size=256M]
    Memory at fe6e (64-bit, non-prefetchable) [size=128K]
    I/O ports at b000 [size=256]
    Expansion ROM at fe6c [disabled] [size=128K]
    Capabilities: access denied
    Kernel driver in use: radeon
    Kernel modules: radeon
07:00.0 VGA compatible controller: nVidia Corporation G86 [Quadro NVS 290] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: nVidia Corporation Device 0492
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 24
    Region 0: Memory at fd00 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at d000 (64-bit, prefetchable) [size=256M]
    Region 3: Memory at fa00 (64-bit, non-prefetchable) [size=32M]
    Region 5: I/O ports at ec00 [size=128]
    Expansion ROM at fe9e [disabled] [size=128K]
    Capabilities: access denied
    Kernel driver in use: nouveau
    Kernel modules: nouveau, nvidiafb
Both of them are PCIe cards. I have one more ATI card and another NVIDIA card which do not work. One of the reasons the pass-through did not work is the limit SeaBIOS puts on the amount of PCI memory space: it places a hard limit of 256MB or so. Thus, some of the VGA devices that need more memory never worked for me. SeaBIOS allows this memory region to be extended to some value near 512MB, but even then the range is not enough. Another problem with SeaBIOS which limits the amount of memory space is: SeaBIOS allocates the BAR regions as they are encountered. As far as I know, the BAR regions should be naturally aligned.
Thus the simple strategy of SeaBIOS results in large fragmentation. Therefore, even after increasing the PCI memory space to 512MB, some BAR regions were left unallocated. I will confirm the details of the other graphics cards which do not work. Thanks and Regards, Prasad One thing I was wondering: Most modern adapters should be PCIe these days. Our NVIDIA definitely is. But so far we are claiming to have it attached to a PCI bus. That caps all the extended capabilities, e.g. Could this make some relevant difference? Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: Graphics pass-through
On 2011-05-09 16:29, Alex Williamson wrote: On Mon, 2011-05-09 at 13:14 +0200, Jan Kiszka wrote: On 2011-05-05 17:17, Alex Williamson wrote: And what about the host? When does Linux release the legacy range? Always or only when a specific (!=vga/vesa) framebuffer driver is loaded? Well, that's where it'd be nice if the vga arbiter was actually in more widespread use. It currently seems to be nothing more than a shared mutex, but it would actually be useful if it included backends to do the chipset vga routing changes. I think when I was testing this, I was externally poking PCI bridge chipset to toggle the VGA_EN bit. Right, we had to drop the approach to pass through the secondary card for now, the arbiter was not switching properly. Haven't checked yet if VGA_EN was properly set, though the kernel code looks like it should take care of this. Even with handing out the primary adapter, we had only mixed success so far. The onboard adapter worked well (in VESA mode), but the NVIDIA is not displaying early boot messages at all. Maybe a vgabios issue. Windows was booting nevertheless - until we installed the NVIDIA drivers. Than it ran into a blue screen. Interesting, IIRC I could never get VESA modes to work. I believe I only had a basic VGA16 mode running in a Windows guest too. BTW, what ATI adapter did you use precisely, and what did work, what not? I have an old X550 (rv380?). I also have an Nvidia gs8400, but ISTR the ATI working better for me. Is that Nvidia a PCIe adapter? Did it show BIOS / early boot messages properly? BTW, we are fighting with a Quadro FX 3800. One thing I was wondering: Most modern adapters should be PCIe these days. Our NVIDIA definitely is. But so far we are claiming to have it attached to a PCI bus. That caps all the extended capabilities e.g. Could this make some relevant difference? The BIOS and early boot use shouldn't care too much about that, but I could imagine the high performance drivers potentially failing. 
Thanks, Yeah, that was my thinking as well. But we will try to confirm this by tracing the BIOS activities. There are reports that some adapters do not allow reading the true cold-boot ROM content during runtime, thus booting those adapters inside the guest may fail to some degree. Anyway, I've hacked on the q35 patches until they allowed me to boot a Linux guest with an assigned PCIe Atheros WLAN adapter - all caps were suddenly visible. Those bits are now on their way to our test box. Let's see if they are able to change the BSOD a bit... Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: Graphics pass-through
On 2011-05-09 16:55, Prasad Joshi wrote: On Mon, May 9, 2011 at 12:14 PM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-05-05 17:17, Alex Williamson wrote: And what about the host? When does Linux release the legacy range? Always or only when a specific (!=vga/vesa) framebuffer driver is loaded? Well, that's where it'd be nice if the vga arbiter was actually in more widespread use. It currently seems to be nothing more than a shared mutex, but it would actually be useful if it included backends to do the chipset vga routing changes. I think when I was testing this, I was externally poking PCI bridge chipset to toggle the VGA_EN bit. Right, we had to drop the approach to pass through the secondary card for now, the arbiter was not switching properly. Haven't checked yet if VGA_EN was properly set, though the kernel code looks like it should take care of this. Even with handing out the primary adapter, we had only mixed success so far. The onboard adapter worked well (in VESA mode), but the NVIDIA is not displaying early boot messages at all. Maybe a vgabios issue. Windows was booting nevertheless - until we installed the NVIDIA drivers. Than it ran into a blue screen. BTW, what ATI adapter did you use precisely, and what did work, what not? Not hijacking the mail thread. Just wanted to provide some inputs. Much appreciated in fact! Few days back I had tried passing through the secondary graphics card. I could pass-through two graphics cards to virtual machine. 
02:00.0 VGA compatible controller: ATI Technologies Inc Redwood [Radeon HD 5670] (prog-if 00 [VGA controller]) Subsystem: PC Partner Limited Device e151 Flags: bus master, fast devsel, latency 0, IRQ 87 Memory at d000 (64-bit, prefetchable) [size=256M] Memory at fe6e (64-bit, non-prefetchable) [size=128K] I/O ports at b000 [size=256] Expansion ROM at fe6c [disabled] [size=128K] Capabilities: access denied Kernel driver in use: radeon Kernel modules: radeon 07:00.0 VGA compatible controller: nVidia Corporation G86 [Quadro NVS 290] (rev a1) (prog-if 00 [VGA controller]) Subsystem: nVidia Corporation Device 0492 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort-TAbort- MAbort- SERR- PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 24 Region 0: Memory at fd00 (32-bit, non-prefetchable) [size=16M] Region 1: Memory at d000 (64-bit, prefetchable) [size=256M] Region 3: Memory at fa00 (64-bit, non-prefetchable) [size=32M] Region 5: I/O ports at ec00 [size=128] Expansion ROM at fe9e [disabled] [size=128K] Capabilities: access denied Kernel driver in use: nouveau Kernel modules: nouveau, nvidiafb Both of them are PCIe cards. I have one more ATI card and another NVIDIA card which does not work. Interesting. That may rule out missing PCIe capabilities as source for the NVIDIA driver indisposition. Did you passed those cards each as primary to the guest, or was the guest seeing multiple adapters? I presume you only got output after early boot was completed, right? To avoid having to deal with legacy I/O forwarding, we started with a dual adapter setup in the hope to leave the primary guest adapter at know-to-work cirrus-vga. But already in a native setup with on-board primary + NVIDIA secondary, the NVIDIA Windows drivers refused to talk to its hardware in this constellation. 
One of the reason the pass-through did not work is because of the limit on amount of pci configuration memory by SeaBIOS. SeaBIOS places a hard limit of 256MB or so on the amount of PCI memory space. Thus, for some of the VGA device that need more memory never worked for me. SeaBIOS allows this memory region to be extended to some value near 512MB, but even then the range is not enough. Another problem with SeaBIOS which limits the amount of memory space is: SeaBIOS allocates the BAR regions as they are encountered. As far as I know, the BAR regions should be naturally aligned. Thus the simple strategy of the SeaBIOS results in large fragmentation. Therefore, even after increasing the PCI memory space to 512MB the BAR regions were unallocated. That's an interesting trace! We'll check this here, but I bet it contributes to the problems. Our FX 3800 has 1G memory... I will confirm you the details of other graphics cards which do not work. TiA, Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Graphics pass-through
On Mon, May 9, 2011 at 4:27 PM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-05-09 16:55, Prasad Joshi wrote: On Mon, May 9, 2011 at 12:14 PM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-05-05 17:17, Alex Williamson wrote: And what about the host? When does Linux release the legacy range? Always or only when a specific (!=vga/vesa) framebuffer driver is loaded? Well, that's where it'd be nice if the vga arbiter was actually in more widespread use. It currently seems to be nothing more than a shared mutex, but it would actually be useful if it included backends to do the chipset vga routing changes. I think when I was testing this, I was externally poking PCI bridge chipset to toggle the VGA_EN bit. Right, we had to drop the approach to pass through the secondary card for now, the arbiter was not switching properly. Haven't checked yet if VGA_EN was properly set, though the kernel code looks like it should take care of this. Even with handing out the primary adapter, we had only mixed success so far. The onboard adapter worked well (in VESA mode), but the NVIDIA is not displaying early boot messages at all. Maybe a vgabios issue. Windows was booting nevertheless - until we installed the NVIDIA drivers. Than it ran into a blue screen. BTW, what ATI adapter did you use precisely, and what did work, what not? Not hijacking the mail thread. Just wanted to provide some inputs. Much appreciated in fact! Few days back I had tried passing through the secondary graphics card. I could pass-through two graphics cards to virtual machine. 
02:00.0 VGA compatible controller: ATI Technologies Inc Redwood [Radeon HD 5670] (prog-if 00 [VGA controller]) Subsystem: PC Partner Limited Device e151 Flags: bus master, fast devsel, latency 0, IRQ 87 Memory at d000 (64-bit, prefetchable) [size=256M] Memory at fe6e (64-bit, non-prefetchable) [size=128K] I/O ports at b000 [size=256] Expansion ROM at fe6c [disabled] [size=128K] Capabilities: access denied Kernel driver in use: radeon Kernel modules: radeon 07:00.0 VGA compatible controller: nVidia Corporation G86 [Quadro NVS 290] (rev a1) (prog-if 00 [VGA controller]) Subsystem: nVidia Corporation Device 0492 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort-TAbort- MAbort- SERR- PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 24 Region 0: Memory at fd00 (32-bit, non-prefetchable) [size=16M] Region 1: Memory at d000 (64-bit, prefetchable) [size=256M] Region 3: Memory at fa00 (64-bit, non-prefetchable) [size=32M] Region 5: I/O ports at ec00 [size=128] Expansion ROM at fe9e [disabled] [size=128K] Capabilities: access denied Kernel driver in use: nouveau Kernel modules: nouveau, nvidiafb Both of them are PCIe cards. I have one more ATI card and another NVIDIA card which does not work. Interesting. That may rule out missing PCIe capabilities as source for the NVIDIA driver indisposition. Did you passed those cards each as primary to the guest, or was the guest seeing multiple adapters? I passed the graphics device as a primary device to the guest virtual machine, with -vga none parameter to disable the default vga device. I presume you only got output after early boot was completed, right? Yes you are correct. I got the display, only after the KMS was started. The initial BIOS messages were not displayed. 
To avoid having to deal with legacy I/O forwarding, we started with a dual adapter setup in the hope to leave the primary guest adapter at know-to-work cirrus-vga. But already in a native setup with on-board primary + NVIDIA secondary, the NVIDIA Windows drivers refused to talk to its hardware in this constellation. Windows operating system never worked for me with either of the graphics card. One of the reason the pass-through did not work is because of the limit on amount of pci configuration memory by SeaBIOS. SeaBIOS places a hard limit of 256MB or so on the amount of PCI memory space. Thus, for some of the VGA device that need more memory never worked for me. SeaBIOS allows this memory region to be extended to some value near 512MB, but even then the range is not enough. Another problem with SeaBIOS which limits the amount of memory space is: SeaBIOS allocates the BAR regions as they are encountered. As far as I know, the BAR regions should be naturally aligned. Thus the simple strategy of the SeaBIOS results in large fragmentation. Therefore, even after increasing the PCI memory space to 512MB the BAR regions were unallocated. That's an interesting trace! We'll check this here, but I bet it contributes to the problems. Our FX 3800 has 1G memory... Yes it
Re: Graphics pass-through
On Mon, 2011-05-09 at 17:27 +0200, Jan Kiszka wrote: On 2011-05-09 16:55, Prasad Joshi wrote: On Mon, May 9, 2011 at 12:14 PM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-05-05 17:17, Alex Williamson wrote: And what about the host? When does Linux release the legacy range? Always or only when a specific (!=vga/vesa) framebuffer driver is loaded? Well, that's where it'd be nice if the vga arbiter was actually in more widespread use. It currently seems to be nothing more than a shared mutex, but it would actually be useful if it included backends to do the chipset vga routing changes. I think when I was testing this, I was externally poking PCI bridge chipset to toggle the VGA_EN bit. Right, we had to drop the approach to pass through the secondary card for now, the arbiter was not switching properly. Haven't checked yet if VGA_EN was properly set, though the kernel code looks like it should take care of this. Even with handing out the primary adapter, we had only mixed success so far. The onboard adapter worked well (in VESA mode), but the NVIDIA is not displaying early boot messages at all. Maybe a vgabios issue. Windows was booting nevertheless - until we installed the NVIDIA drivers. Than it ran into a blue screen. BTW, what ATI adapter did you use precisely, and what did work, what not? Not hijacking the mail thread. Just wanted to provide some inputs. Much appreciated in fact! Few days back I had tried passing through the secondary graphics card. I could pass-through two graphics cards to virtual machine. 
02:00.0 VGA compatible controller: ATI Technologies Inc Redwood [Radeon HD 5670] (prog-if 00 [VGA controller]) Subsystem: PC Partner Limited Device e151 Flags: bus master, fast devsel, latency 0, IRQ 87 Memory at d000 (64-bit, prefetchable) [size=256M] Memory at fe6e (64-bit, non-prefetchable) [size=128K] I/O ports at b000 [size=256] Expansion ROM at fe6c [disabled] [size=128K] Capabilities: access denied Kernel driver in use: radeon Kernel modules: radeon 07:00.0 VGA compatible controller: nVidia Corporation G86 [Quadro NVS 290] (rev a1) (prog-if 00 [VGA controller]) Subsystem: nVidia Corporation Device 0492 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort-TAbort- MAbort- SERR- PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 24 Region 0: Memory at fd00 (32-bit, non-prefetchable) [size=16M] Region 1: Memory at d000 (64-bit, prefetchable) [size=256M] Region 3: Memory at fa00 (64-bit, non-prefetchable) [size=32M] Region 5: I/O ports at ec00 [size=128] Expansion ROM at fe9e [disabled] [size=128K] Capabilities: access denied Kernel driver in use: nouveau Kernel modules: nouveau, nvidiafb Both of them are PCIe cards. I have one more ATI card and another NVIDIA card which does not work. Interesting. That may rule out missing PCIe capabilities as source for the NVIDIA driver indisposition. Did you passed those cards each as primary to the guest, or was the guest seeing multiple adapters? I presume you only got output after early boot was completed, right? To avoid having to deal with legacy I/O forwarding, we started with a dual adapter setup in the hope to leave the primary guest adapter at know-to-work cirrus-vga. But already in a native setup with on-board primary + NVIDIA secondary, the NVIDIA Windows drivers refused to talk to its hardware in this constellation. 
One of the reason the pass-through did not work is because of the limit on amount of pci configuration memory by SeaBIOS. SeaBIOS places a hard limit of 256MB or so on the amount of PCI memory space. Thus, for some of the VGA device that need more memory never worked for me. SeaBIOS allows this memory region to be extended to some value near 512MB, but even then the range is not enough. Another problem with SeaBIOS which limits the amount of memory space is: SeaBIOS allocates the BAR regions as they are encountered. As far as I know, the BAR regions should be naturally aligned. Thus the simple strategy of the SeaBIOS results in large fragmentation. Therefore, even after increasing the PCI memory space to 512MB the BAR regions were unallocated. That's an interesting trace! We'll check this here, but I bet it contributes to the problems. Our FX 3800 has 1G memory... Yes, qemu leaves far too little MMIO space to think about assigning graphics cards. Both of my cards have 512MB and I hacked qemu to leave a bigger gap via something like: diff --git a/hw/pc.c b/hw/pc.c index 0ea6d10..a6376f8 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -879,6 +879,8 @@ void
Re: Graphics pass-through
On 2011-05-09 17:48, Alex Williamson wrote: On Mon, 2011-05-09 at 17:27 +0200, Jan Kiszka wrote: On 2011-05-09 16:55, Prasad Joshi wrote: On Mon, May 9, 2011 at 12:14 PM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-05-05 17:17, Alex Williamson wrote: And what about the host? When does Linux release the legacy range? Always or only when a specific (!=vga/vesa) framebuffer driver is loaded? Well, that's where it'd be nice if the vga arbiter was actually in more widespread use. It currently seems to be nothing more than a shared mutex, but it would actually be useful if it included backends to do the chipset vga routing changes. I think when I was testing this, I was externally poking PCI bridge chipset to toggle the VGA_EN bit. Right, we had to drop the approach to pass through the secondary card for now, the arbiter was not switching properly. Haven't checked yet if VGA_EN was properly set, though the kernel code looks like it should take care of this. Even with handing out the primary adapter, we had only mixed success so far. The onboard adapter worked well (in VESA mode), but the NVIDIA is not displaying early boot messages at all. Maybe a vgabios issue. Windows was booting nevertheless - until we installed the NVIDIA drivers. Than it ran into a blue screen. BTW, what ATI adapter did you use precisely, and what did work, what not? Not hijacking the mail thread. Just wanted to provide some inputs. Much appreciated in fact! Few days back I had tried passing through the secondary graphics card. I could pass-through two graphics cards to virtual machine. 
02:00.0 VGA compatible controller: ATI Technologies Inc Redwood [Radeon HD 5670] (prog-if 00 [VGA controller]) Subsystem: PC Partner Limited Device e151 Flags: bus master, fast devsel, latency 0, IRQ 87 Memory at d000 (64-bit, prefetchable) [size=256M] Memory at fe6e (64-bit, non-prefetchable) [size=128K] I/O ports at b000 [size=256] Expansion ROM at fe6c [disabled] [size=128K] Capabilities: access denied Kernel driver in use: radeon Kernel modules: radeon 07:00.0 VGA compatible controller: nVidia Corporation G86 [Quadro NVS 290] (rev a1) (prog-if 00 [VGA controller]) Subsystem: nVidia Corporation Device 0492 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort-TAbort- MAbort- SERR- PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 24 Region 0: Memory at fd00 (32-bit, non-prefetchable) [size=16M] Region 1: Memory at d000 (64-bit, prefetchable) [size=256M] Region 3: Memory at fa00 (64-bit, non-prefetchable) [size=32M] Region 5: I/O ports at ec00 [size=128] Expansion ROM at fe9e [disabled] [size=128K] Capabilities: access denied Kernel driver in use: nouveau Kernel modules: nouveau, nvidiafb Both of them are PCIe cards. I have one more ATI card and another NVIDIA card which does not work. Interesting. That may rule out missing PCIe capabilities as source for the NVIDIA driver indisposition. Did you passed those cards each as primary to the guest, or was the guest seeing multiple adapters? I presume you only got output after early boot was completed, right? To avoid having to deal with legacy I/O forwarding, we started with a dual adapter setup in the hope to leave the primary guest adapter at know-to-work cirrus-vga. But already in a native setup with on-board primary + NVIDIA secondary, the NVIDIA Windows drivers refused to talk to its hardware in this constellation. 
One of the reason the pass-through did not work is because of the limit on amount of pci configuration memory by SeaBIOS. SeaBIOS places a hard limit of 256MB or so on the amount of PCI memory space. Thus, for some of the VGA device that need more memory never worked for me. SeaBIOS allows this memory region to be extended to some value near 512MB, but even then the range is not enough. Another problem with SeaBIOS which limits the amount of memory space is: SeaBIOS allocates the BAR regions as they are encountered. As far as I know, the BAR regions should be naturally aligned. Thus the simple strategy of the SeaBIOS results in large fragmentation. Therefore, even after increasing the PCI memory space to 512MB the BAR regions were unallocated. That's an interesting trace! We'll check this here, but I bet it contributes to the problems. Our FX 3800 has 1G memory... Yes, qemu leaves far too little MMIO space to think about assigning graphics cards. Both of my cards have 512MB and I hacked qemu to leave a bigger gap via something like: diff --git a/hw/pc.c b/hw/pc.c index 0ea6d10..a6376f8 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -879,6 +879,8 @@ void pc_cpus_init(const char *cpu_model) }
Re: [PATCH 27/30] nVMX: Additional TSC-offset handling
On 05/08/2011 01:29 AM, Nadav Har'El wrote: In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to emulate this MSR write by L2 by modifying vmcs02.tsc_offset. We also need to set vmcs12.tsc_offset, for this change to survive the next nested entry (see prepare_vmcs02()). Both changes look correct to me. Zach
Re: 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)
On 05/08/2011 12:06 PM, Nikola Ciprich wrote: OK, I see.. the problem is that I'm trying to hunt down a bug causing hangs when 2.6.32 guests try to run tcpdump - this seems to be reproducible even on latest 2.6.32.x, and seems like it depends on kvm-clock.. So I was thinking about bisecting between 2.6.32 and latest git, which doesn't seem to suffer this problem, but hitting another (different) problem in 2.6.32 complicates things a bit :( If somebody has some hint on how to proceed, I'd be more than grateful.. cheers n. What are you bisecting, the host kernel or the guest kernel, and what version is the host kernel?
Re: 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)
The guest, because the latest kernels do not suffer this problem, so I'd like to find the fix so it can be pushed to -stable (we're using 2.6.32.x). The host is currently 2.6.37 (and I'm currently testing 2.6.38 as well). n. On Mon, May 09, 2011 at 10:32:26AM -0700, Zachary Amsden wrote: On 05/08/2011 12:06 PM, Nikola Ciprich wrote: OK, I see.. the problem is that I'm trying to hunt down a bug causing hangs when 2.6.32 guests try to run tcpdump - this seems to be reproducible even on latest 2.6.32.x, and seems like it depends on kvm-clock.. So I was thinking about bisecting between 2.6.32 and latest git, which doesn't seem to suffer this problem, but hitting another (different) problem in 2.6.32 complicates things a bit :( If somebody has some hint on how to proceed, I'd be more than grateful.. cheers n. What are you bisecting, the host kernel or the guest kernel, and what version is the host kernel? -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28. rijna 168, 709 01 Ostrava tel.: +420 596 603 142 fax: +420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz
Re: 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)
On 05/09/2011 11:25 AM, Nikola Ciprich wrote: [...] That's a pretty wide range to be bisecting, and I think we know for a fact there were some kvmclock-related bugs in that range. If you are looking for something causing problems with tcpdump, I'd suggest getting rid of kvmclock in your testing and using TSC instead; if you're looking to verify that kvmclock-related changes have been backported to -stable, rather than bisect and run into bugs, it would probably be faster to check the commit logs for arch/x86/kvm/x86.c and make sure you're not missing anything from me or Glauber that has been applied to the most recent branch. Zach
Re: 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)
That's a pretty wide range to be bisecting, and I think we know for a fact there were some kvmclock-related bugs in that range. that's true, I might try to pick those that seem related and see if it helps.. If you are looking for something causing problems with tcpdump, I'd suggest getting rid of kvmclock in your testing and using TSC instead; that's the problem, I can't reproduce the problems without kvm-clock enabled, so it must be related to it somehow.. if you're looking to verify that kvmclock-related changes have been backported to -stable, it would probably be faster to check the commit logs for arch/x86/kvm/x86.c [...] yup, I'll try and report... thanks for the hints! n.
[PATCH v2] kvm: Add documentation for KVM_CAP_NR_VCPUS
Document KVM_CAP_NR_VCPUS, which can be used by userspace to determine the maximum number of VCPUs it can create with the KVM_CREATE_VCPU ioctl. Cc: Avi Kivity a...@redhat.com Cc: Marcelo Tosatti mtosa...@redhat.com Cc: Jan Kiszka jan.kis...@web.de Signed-off-by: Pekka Enberg penb...@kernel.org
---
 Documentation/kvm/api.txt |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index 9bef4e4..533da6b 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -175,7 +175,10 @@
 Parameters: vcpu id (apic id on x86)
 Returns: vcpu fd on success, -1 on error

 This API adds a vcpu to a virtual machine. The vcpu id is a small integer
-in the range [0, max_vcpus).
+in the range [0, max_vcpus). You can use KVM_CAP_NR_VCPUS of the
+KVM_CHECK_EXTENSION ioctl() to determine the value for max_vcpus at run-time.
+If the KVM_CAP_NR_VCPUS does not exist, you should assume that max_vcpus is 4
+cpus max.

 4.8 KVM_GET_DIRTY_LOG (vm ioctl)
--
1.7.0.4
Re: [PATCH] kvm tools: PCI -- Make PCI device numbers being unique
From: Cyrill Gorcunov gorcu...@gmail.com Subject: [PATCH] kvm tools: PCI -- Make PCI device numbers being unique v2 PCI device numbers must be unique on a bus (as a part of the Bus/Device/Function tuple). Make it so. Note the patch is rather a fast fix, since we need a somewhat smarter PCI device manager (in particular, multiple virtio block devices should most probably lie on a separate PCI bus). v2: Sasha spotted the nit in virtio_rng__init; the ioport function was touched instead of irq__register_device. Hey, I don't like the new patch subject trend you're trying to start at all. You can make it kvm tools,pci: Make PCI device numbers unique but in this particular case PCI already appears in the title so kvm tools: Make PCI device numbers unique is the right thing to do. In addition, the changelog doesn't really tell me much. Does it fix something? Why would we need a smart PCI device manager and why is that relevant for this patch? Hmmh? Pekka
Re: [PATCH] kvm tools: PCI -- Make PCI device numbers being unique
On 05/09/2011 11:53 PM, Pekka Enberg wrote: [...] Hey, I don't like the new patch subject trend you're trying to start at all. You can make it kvm tools,pci: Make PCI device numbers unique but in this particular case PCI already appears in the title so kvm tools: Make PCI device numbers unique is the right thing to do. PCI stands for the kvm tools subsystem, but if you like the last one more -- I'm fine with it. In addition, the changelog doesn't really tell me much. Does it fix something? Why would we need a smart PCI device manager and why is that relevant for this patch? Hmmh? Pekka The thing is that at the moment the IDs passed to the MP table are incorrect; they are supposed to be 5 bits long (MP spec). The smart manager we need -- it's because there can be multiple virtio block devices and they _are_ to be separate PCI devices, ie with their own numbers and INTx# assignments. As a result we probably should have such virtio devices lie on a separate PCI bus, or if the number of PCI devices exceeds the width of the address line then we should pass them to another PCI bus. That is what I had in mind, but I'm not sure all this should come to the changelog. -- Cyrill
Re: [PATCH v3 1/3] PCI: Track the size of each saved capability data area
On Wed, 20 Apr 2011 14:31:33 -0600 Alex Williamson alex.william...@redhat.com wrote:

-struct pci_cap_saved_state {
-	struct hlist_node next;
+struct pci_cap_saved {
 	char cap_nr;
+	unsigned int size;
 	u32 data[0];
 };

+struct pci_cap_saved_state {
+	struct hlist_node next;
+	struct pci_cap_saved saved;
+};
+
 struct pcie_link_state;
 struct pci_vpd;
 struct pci_sriov;
@@ -366,7 +371,7 @@ static inline struct pci_cap_saved_state *pci_find_saved_cap(
 	struct hlist_node *pos;

 	hlist_for_each_entry(tmp, pos, &pci_dev->saved_cap_space, next) {
-		if (tmp->cap_nr == cap)
+		if (tmp->saved.cap_nr == cap)
 			return tmp;
 	}
 	return NULL;

Looks pretty good in general. But I think the naming makes it harder to read than it ought to be. So we have a pci_cap_saved_state, which implies capability info, and that's fine. But pci_cap_saved doesn't communicate much; maybe pci_cap_data or pci_cap_saved_data would be better? Thanks, -- Jesse Barnes, Intel Open Source Technology Center
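For what it's worth, Jesse's suggested rename might look like the following sketch. The name pci_cap_saved_data and the exact field layout are illustrative guesses at his suggestion, not code from a merged patch; the zero-length array mirrors the kernel's `u32 data[0]` idiom.

```c
#include <stddef.h>

/* Hypothetical layout following Jesse's naming suggestion: the inner
 * structure holds only the data that is actually saved and restored,
 * while the outer structure adds the list linkage. */
struct pci_cap_saved_data {
	char cap_nr;           /* which capability this blob belongs to */
	unsigned int size;     /* number of bytes saved in data[] */
	unsigned int data[0];  /* saved register contents (u32 in-kernel,
	                        * GNU zero-length array as in the kernel) */
};

struct pci_cap_saved_state {
	/* struct hlist_node next;  (list linkage in the real kernel) */
	struct pci_cap_saved_data saved;
};
```

A lookup helper would then compare `tmp->saved.cap_nr`, keeping the saved/restored blob clearly separated from bookkeeping.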
problem with DRBD backed iSCSI storage pool and KVM guests
So I've been struggling with configuring a proper HA stack using DRBD on two dedicated, back-end storage nodes and using KVM on two dedicated, front-end nodes (so four machines total). I'm stuck at just keeping an exported iSCSI LUN consistent for one VM while switching over on the back-end DRBD storage nodes. In my testing, I think I've narrowed it down to KVM's cache setting, but it doesn't make sense and it looks like it will inhibit things later for live migration based on what I've read on this list. So the stack looks something like this. I have the bottom DRBD layer set to use its normal write-back cache settings since both the back-end storage machines have battery backed units for their RAID controllers and I assume that writes only successfully return from the DRBD layer when both the storage controller and the network synchronization protocol (using protocol C of course) return a successful write (to the RAID controllers cache and the DRBD partner). I'm using a primary/secondary setup for the DRBD component. The next layer is the splicing up of the exported DRBD device itself. I'm using nested LVM (not cLVM) for this per the DRBD documentation. It's my understanding that cLVM shouldn't be necessary since the volume group is only active on the primary DRBD node, so no cluster locking should be needed. Hopefully that is correct. On to the iSCSI layer, I'm using tgtd on the target side on each back-end node and iscsid on the initiator side from the front-end nodes. I have the write cache on both the target and initiator disabled as much as I seemingly can. I'm passing the crazy option for this via tgtadm: --- mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0 since corosync is doing everything within the cluster stack to set up and tear down the iSCSI target and LUN's instead of defining the write-cache off option in /etc/tgt/targets.conf. 
I can confirm that sdparm --get=WCE returns: --- WCE 0 [cha: n, def: 0] as expected from the initiator. But I'm still not honestly sure that the target daemon isn't using the page cache on the current primary back-end node. This might be the source of my problem, but documentation on this is sparse and the mode_page above is the best I could find along with the check via sdparm. And finally, there's KVM itself. In order to test all of this, I created a random 1GB file from /dev/urandom on the guest (using RHEL 6 for both the host and the guest). I then copy the random file to a new file and force the current back-end primary node into standby. This successfully restarts the entire stack of components after less than say 10-15 seconds. I have the initiators set to: --- node.session.timeo.replacement_timeout = -1 which should hang forever if I understand the configuration file comments correctly and never report SCSI errors higher up the stack. Anyway, the failover finishes, I diff the two files and I also do md5sum on each. Now, this is the part where I'm stuck. If I define the virtual disk within KVM to use writethrough cache mode, then while I see a bunch of: --- Buffer I/O error on device dm-0, logical block ... end_request: I/O error, dev vda, sector ... those types of error messages, the cp finishes and the new file seems to be a bit-for-bit copy of the original. Everything appears to have worked. If I set the cache to none, which apparently I'll need to do anyway for live migration to work (which is the ultimate goal in all of this), then I see the same errors above (although they appear immediately as soon as I initiate the standby operation on the cluster, whereas in writethrough mode the messages don't show up for a bit) and not only do the files typically differ, it's usually not too long before the ext4 file system sitting on top of vda starts to become very unhappy and gets remounted in read-only mode. So am I missing something here?
Using the -1 for the iscsid configuration above, I assumed KVM would never even see any sort of errors at all, but instead would simply hang indefinitely until things came back. Anyone else running a setup like this? Thanks for reading. I can post configuration files as needed, or take this to the open-iscsi lists next if KVM doesn't appear to be the issue at this point. -- Mark Nipper ni...@bitgnome.net (XMPP) +1 979 575 3193 - There are 10 kinds of people in the world; those who know binary and those who don't.
Re: Patching guest kernel code for better performance from HOST
On Sunday 08 May 2011 02:22 AM, Alexander Graf wrote: On 07.05.2011, at 22:32, Dushyant Bansal wrote: Hi, On patching the 'mfmsr' instruction with 'lwz', the guest exits when it tries to execute that 'lwz' instruction. I am looking for possible causes for this exit. Here are the details: Initially, pc: 0xc0019420, instruction: 0x7ca6 [mfmsr r0] As this is a privileged instruction, this causes an exit. qemu-system-ppc-4443 [000] 19733.740013: kvm_book3s_exit: exit=0x700 | pc=0xc0019420 | inst=0x7ca6 | msr=0x1032 | dar=0xe1736a00 | srr1=0x1004d032 qemu-system-ppc-4443 [000] 19733.740029: kvm_book3s_patch: return=0 | pc=0xc0019420 | inst=0x7ca6 | msr=0x1032 | new_inst=0x8000f05c qemu-system-ppc-4443 [000] 19733.740030: kvm_ppc_instr: inst 2080374950 pc 0xc0019420 emulate 0 qemu-system-ppc-4443 [000] 19733.740037: kvm_book3s_reenter: reentry r=1 | pc=0xc0019420 I patched this instruction with: 0x8000f05c: lwz r0, -4096(offset of msr) This instruction reads the 'msr' field of the magic page into register r0. Then, I do not increment the pc value, so the guest restarts at the same pc, which now points to the new patched instruction. This 'lwz' instruction is causing an exit due to 'BOOK3S_INTERRUPT_PROGRAM' (exit_nr: 0x700). What could be the reason for this exit? As 'lwz' is not a privileged instruction, I am unable to think of any reason. Did you flush the icache after you patched the instruction? See the function flush_icache_range. Without it, your CPU still has the old instruction in its cache, making it trap again :). Thanks. I tried flush_icache_range((ulong)pc, (ulong)pc + 4); The system becomes unresponsive and I have to use a forced shutdown. Here, pc will have the address of the guest instruction, and flush_icache_range is called from the host. Maybe I am not using flush_icache_range in the correct way. Also, my host os is ppc64 and the guest is ppc32. I also tried: flush_cache_all() But the instruction is still present in the instruction cache.
Dushyant -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Patching guest kernel code for better performance from HOST
On 09.05.2011, at 12:34, Dushyant Bansal wrote: [...] I tried flush_icache_range((ulong)pc, (ulong)pc + 4); The system becomes unresponsive and I have to use a forced shutdown. Here, pc will have the address of the guest instruction, and flush_icache_range is called from the host. Maybe I am not using flush_icache_range in the correct way. Also, my host os is ppc64 and the guest is ppc32. I also tried: flush_cache_all() But the instruction is still present in the instruction cache. Just patch the _st function to flush the icache on the host virtual address every time it gets invoked :). Alex
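Alex's point about stale icache contents can be sketched in userspace with GCC's __builtin___clear_cache builtin, which plays the role of the kernel's flush_icache_range in this illustration. The helper name patch_insn is made up for the example; in the real KVM code the flush would live in the store (_st) path, targeting the host virtual address of the patched word.

```c
#include <stdint.h>

/* Overwrite one 32-bit instruction word and then invalidate the
 * instruction cache for that address range.  Without the flush, a
 * PowerPC core can keep executing the stale instruction it already
 * fetched -- exactly the "trap again" behaviour described above.
 * On x86 the flush is effectively a no-op, so the sketch runs
 * safely anywhere. */
static void patch_insn(uint32_t *pc, uint32_t new_inst)
{
	*pc = new_inst;                        /* data-side store */
	__builtin___clear_cache((char *)pc,    /* icache invalidate */
	                        (char *)(pc + 1));
}
```

The key detail from the thread: the flush must happen on the address actually used to write the instruction (the host virtual address when the host patches guest memory), not some other mapping of the same physical page.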