Re: [PATCH 08/18] virtio_ring: support for used_event idx feature

2011-05-09 Thread Rusty Russell
On Wed, 4 May 2011 23:51:38 +0300, Michael S. Tsirkin m...@redhat.com wrote:
 Add support for the used_event idx feature: when enabling
 interrupts, publish the current avail index value to
 the host so that we get interrupts on the next update.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  drivers/virtio/virtio_ring.c |   14 ++
  1 files changed, 14 insertions(+), 0 deletions(-)
 
 diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
 index 507d6eb..3a3ed75 100644
 --- a/drivers/virtio/virtio_ring.c
 +++ b/drivers/virtio/virtio_ring.c
 @@ -320,6 +320,14 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 	ret = vq->data[i];
 	detach_buf(vq, i);
 	vq->last_used_idx++;
 +	/* If we expect an interrupt for the next entry, tell host
 +	 * by writing event index and flush out the write before
 +	 * the read in the next get_buf call. */
 +	if (!(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)) {
 +		vring_used_event(&vq->vring) = vq->last_used_idx;
 +		virtio_mb();
 +	}
 +

Hmm, so you're still using the avail->flags; it's just that if thresholding
is enabled the host will ignore it?

It's a little subtle, but it keeps this patch small.  Perhaps we'll want
to make it more explicit later.
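For readers following the thread, the suppression test that used_event enables can be sketched in a few lines. This mirrors the wrap-safe comparison of the vring_need_event() helper introduced alongside this series; it is an illustration, not the driver code itself:

```c
#include <assert.h>
#include <stdint.h>

/* Notify the other side only if new_idx has moved past the published
 * event_idx since old_idx was last observed.  The free-running 16-bit
 * ring indices make the unsigned subtraction wrap-safe. */
static int need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

Publishing last_used_idx as the event index, as the hunk above does, therefore requests an interrupt on the very next used-ring update.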

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 14/18] virtio: add api for delayed callbacks

2011-05-09 Thread Rusty Russell
On Wed, 4 May 2011 23:52:33 +0300, Michael S. Tsirkin m...@redhat.com wrote:
 Add an API that tells the other side that callbacks
 should be delayed until a lot of work has been done.
 Implement using the new used_event feature.

Since you're going to add a capacity query anyway, why not add the
threshold argument here?

Then the caller can choose how much space it needs.  Maybe net and block
will want different things...
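With used_event in place, the threshold Rusty suggests would boil down to publishing an event index further ahead of the current position. A minimal sketch, with a hypothetical helper name (not the real API, which had not been settled at this point in the thread):

```c
#include <assert.h>
#include <stdint.h>

/* Instead of asking for a callback on the very next used buffer, publish
 * an event index 'thresh' entries ahead, so the host only interrupts
 * after that much work has completed.  16-bit wrap is intentional:
 * virtio ring indices are free-running. */
static uint16_t delayed_used_event(uint16_t last_used_idx, uint16_t thresh)
{
    return (uint16_t)(last_used_idx + thresh - 1);
}
```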

Cheers,
Rusty.


Re: [PATCH 3/3] virtio_ring: need_event api comment fix

2011-05-09 Thread Rusty Russell
On Thu, 5 May 2011 18:08:17 +0300, Michael S. Tsirkin m...@redhat.com wrote:
 fix typo in a comment: size -> side
 
 Reported-by: Stefan Hajnoczi stefa...@gmail.com
 Signed-off-by: Michael S. Tsirkin m...@redhat.com

I could smerge these together for you, but I *really* want benchmarks in
these commit messages.

Thanks,
Rusty.
PS. Was away last week, hence the delay on this...


[PATCH v4 0/5] hpet 'driftfix': alleviate time drift with HPET periodic timers

2011-05-09 Thread Ulrich Obergfell
Hi,

This is version 4 of a series of patches that I originally posted in:

http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01989.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01992.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01991.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01990.html

http://article.gmane.org/gmane.comp.emulators.kvm.devel/69325
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69326
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69327
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69328


Changes since version 3:

 in patch part 1/5 and part 4/5

 -  Added stub functions for 'target_reset_irq_delivered' and
'target_get_irq_delivered'. Added registration functions
that are used by apic code to replace the stubs.

 -  Removed NULL pointer checks from update_irq().

 in patch part 5/5

 -  A minor modification in hpet_timer_has_tick_backlog().

 -  Renamed the local variable 'irq_count' in hpet_timer() to
'period_count'.

 -  Driftfix-related fields in struct 'HPETTimer' are no longer
    being initialized/reset in hpet_reset(). Added the function
    hpet_timer_driftfix_reset() which is called when the guest

    -  sets the 'CFG_ENABLE' bit (overall enable)
       in the General Configuration Register.

    -  sets the 'TN_ENABLE' bit (timer N interrupt enable)
       in the Timer N Configuration and Capabilities Register.


Please review and please comment.

Regards,

Uli


Ulrich Obergfell (5):
  hpet 'driftfix': add hooks required to detect coalesced interrupts
(x86 apic only)
  hpet 'driftfix': add driftfix property to HPETState and DeviceInfo
  hpet 'driftfix': add fields to HPETTimer and VMStateDescription
  hpet 'driftfix': add code in update_irq() to detect coalesced
interrupts (x86 apic only)
  hpet 'driftfix': add code in hpet_timer() to compensate delayed
callbacks and coalesced interrupts

 hw/apic.c |4 ++
 hw/hpet.c |  119 ++--
 hw/pc.h   |   13 +++
 vl.c  |   13 +++
 4 files changed, 145 insertions(+), 4 deletions(-)



[PATCH v4 1/5] hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only)

2011-05-09 Thread Ulrich Obergfell
'target_get_irq_delivered' and 'target_reset_irq_delivered' point
to functions that are called by update_irq() to detect coalesced
interrupts. Initially they point to stub functions which pretend
successful interrupt injection. apic code calls two registration
functions to replace the stubs with apic_get_irq_delivered() and
apic_reset_irq_delivered().

This change can be replaced if a generic feedback infrastructure to
track coalesced IRQs for periodic, clock providing devices becomes
available.

Signed-off-by: Ulrich Obergfell uober...@redhat.com
---
 hw/apic.c |4 
 hw/pc.h   |   13 +
 vl.c  |   13 +
 3 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index a45b57f..94b1d15 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -17,6 +17,7 @@
  * License along with this library; if not, see <http://www.gnu.org/licenses/>
  */
 #include "hw.h"
+#include "pc.h"
 #include "apic.h"
 #include "ioapic.h"
 #include "qemu-timer.h"
@@ -1143,6 +1144,9 @@ static SysBusDeviceInfo apic_info = {
 
 static void apic_register_devices(void)
 {
+register_target_get_irq_delivered(apic_get_irq_delivered);
+register_target_reset_irq_delivered(apic_reset_irq_delivered);
+
 sysbus_register_withprop(apic_info);
 }
 
diff --git a/hw/pc.h b/hw/pc.h
index 1291e2d..79885f4 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -7,6 +7,19 @@
 #include fdc.h
 #include net.h
 
+extern int (*target_get_irq_delivered)(void);
+extern void (*target_reset_irq_delivered)(void);
+
+static inline void register_target_get_irq_delivered(int (*func)(void))
+{
+target_get_irq_delivered = func;
+}
+
+static inline void register_target_reset_irq_delivered(void (*func)(void))
+{
+target_reset_irq_delivered = func;
+}
+
 /* PC-style peripherals (also used by other machines).  */
 
 /* serial.c */
diff --git a/vl.c b/vl.c
index a143250..a2bbc61 100644
--- a/vl.c
+++ b/vl.c
@@ -233,6 +233,19 @@ const char *prom_envs[MAX_PROM_ENVS];
 const char *nvram = NULL;
 int boot_menu;
 
+static int target_get_irq_delivered_stub(void)
+{
+return 1;
+}
+
+static void target_reset_irq_delivered_stub(void)
+{
+return;
+}
+
+int (*target_get_irq_delivered)(void) = target_get_irq_delivered_stub;
+void (*target_reset_irq_delivered)(void) = target_reset_irq_delivered_stub;
+
 typedef struct FWBootEntry FWBootEntry;
 
 struct FWBootEntry {
-- 
1.6.2.5



[PATCH v4 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription

2011-05-09 Thread Ulrich Obergfell
The new fields in HPETTimer are covered by a separate VMStateDescription
which is a subsection of 'vmstate_hpet_timer'. They are only migrated if

-global hpet.driftfix=on

Signed-off-by: Ulrich Obergfell uober...@redhat.com
---
 hw/hpet.c |   33 +
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 7513065..7ab6e62 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -55,6 +55,10 @@ typedef struct HPETTimer {  /* timers */
 uint8_t wrap_flag;  /* timer pop will indicate wrap for one-shot 32-bit
  * mode. Next pop will be actual timer expiration.
  */
+uint64_t prev_period;
+uint64_t ticks_not_accounted;
+uint32_t irq_rate;
+uint32_t divisor;
 } HPETTimer;
 
 typedef struct HPETState {
@@ -246,6 +250,27 @@ static int hpet_post_load(void *opaque, int version_id)
 return 0;
 }
 
+static bool hpet_timer_driftfix_vmstate_needed(void *opaque)
+{
+    HPETTimer *t = opaque;
+
+    return (t->state->driftfix != 0);
+}
+
+static const VMStateDescription vmstate_hpet_timer_driftfix = {
+    .name = "hpet_timer_driftfix",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .minimum_version_id_old = 1,
+    .fields  = (VMStateField []) {
+        VMSTATE_UINT64(prev_period, HPETTimer),
+        VMSTATE_UINT64(ticks_not_accounted, HPETTimer),
+        VMSTATE_UINT32(irq_rate, HPETTimer),
+        VMSTATE_UINT32(divisor, HPETTimer),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static const VMStateDescription vmstate_hpet_timer = {
    .name = "hpet_timer",
 .version_id = 1,
@@ -260,6 +285,14 @@ static const VMStateDescription vmstate_hpet_timer = {
 VMSTATE_UINT8(wrap_flag, HPETTimer),
 VMSTATE_TIMER(qemu_timer, HPETTimer),
 VMSTATE_END_OF_LIST()
+},
+.subsections = (VMStateSubsection []) {
+{
+.vmsd = vmstate_hpet_timer_driftfix,
+.needed = hpet_timer_driftfix_vmstate_needed,
+}, {
+/* empty */
+}
 }
 };
 
-- 
1.6.2.5



[PATCH v4 4/5] hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only)

2011-05-09 Thread Ulrich Obergfell
update_irq() uses a method similar to 'rtc_td_hack' to detect
coalesced interrupts. The function entry addresses are retrieved
from 'target_get_irq_delivered' and 'target_reset_irq_delivered'.

This change can be replaced if a generic feedback infrastructure to
track coalesced IRQs for periodic, clock providing devices becomes
available.

Signed-off-by: Ulrich Obergfell uober...@redhat.com
---
 hw/hpet.c |   13 +++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 7ab6e62..e57c654 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -175,11 +175,12 @@ static inline uint64_t hpet_calculate_diff(HPETTimer *t, uint64_t current)
 }
 }
 
-static void update_irq(struct HPETTimer *timer, int set)
+static int update_irq(struct HPETTimer *timer, int set)
 {
     uint64_t mask;
     HPETState *s;
     int route;
+    int irq_delivered = 1;
 
     if (timer->tn <= 1 && hpet_in_legacy_mode(timer->state)) {
         /* if LegacyReplacementRoute bit is set, HPET specification requires
@@ -204,8 +205,16 @@ static void update_irq(struct HPETTimer *timer, int set)
             qemu_irq_raise(s->irqs[route]);
         } else {
             s->isr &= ~mask;
-            qemu_irq_pulse(s->irqs[route]);
+            if (s->driftfix) {
+                target_reset_irq_delivered();
+                qemu_irq_raise(s->irqs[route]);
+                irq_delivered = target_get_irq_delivered();
+                qemu_irq_lower(s->irqs[route]);
+            } else {
+                qemu_irq_pulse(s->irqs[route]);
+            }
         }
+    return irq_delivered;
 }
 
 static void hpet_pre_save(void *opaque)
-- 
1.6.2.5



[PATCH v4 2/5] hpet 'driftfix': add driftfix property to HPETState and DeviceInfo

2011-05-09 Thread Ulrich Obergfell
driftfix is a 'bit type' property. Compensation of delayed callbacks
and coalesced interrupts can be enabled with the command line option

-global hpet.driftfix=on

driftfix is 'off' (disabled) by default.

Signed-off-by: Ulrich Obergfell uober...@redhat.com
---
 hw/hpet.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 6ce07bc..7513065 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -72,6 +72,8 @@ typedef struct HPETState {
 uint64_t isr;   /* interrupt status reg */
 uint64_t hpet_counter;  /* main counter */
 uint8_t  hpet_id;   /* instance id */
+
+uint32_t driftfix;
 } HPETState;
 
 static uint32_t hpet_in_legacy_mode(HPETState *s)
@@ -738,6 +740,7 @@ static SysBusDeviceInfo hpet_device_info = {
 .qdev.props = (Property[]) {
        DEFINE_PROP_UINT8("timers", HPETState, num_timers, HPET_MIN_TIMERS),
        DEFINE_PROP_BIT("msi", HPETState, flags, HPET_MSI_SUPPORT, false),
+       DEFINE_PROP_BIT("driftfix", HPETState, driftfix, 0, false),
 DEFINE_PROP_END_OF_LIST(),
 },
 };
-- 
1.6.2.5



[PATCH v4 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-05-09 Thread Ulrich Obergfell
Loss of periodic timer interrupts caused by delayed callbacks and by
interrupt coalescing is compensated by gradually injecting additional
interrupts during subsequent timer intervals, starting at a rate of
one additional interrupt per interval. The injection of additional
interrupts is based on a backlog of unaccounted HPET clock periods
(new HPETTimer field 'ticks_not_accounted'). The backlog increases
due to delayed callbacks and coalesced interrupts, and it decreases
if an interrupt was injected successfully. If the backlog increases
while compensation is still in progress, the rate at which additional
interrupts are injected is increased too. A limit is imposed on the
backlog and on the rate.
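The backlog rule described above can be modelled in a few lines. Field names mirror the patch, but this is a toy sketch of the accounting only, not the emulator code:

```c
#include <assert.h>
#include <stdint.h>

struct toy_timer {
    uint64_t period;              /* HPET clock ticks per interval */
    uint64_t prev_period;         /* period of the last delivered interrupt */
    uint64_t ticks_not_accounted; /* backlog of undelivered periods */
};

/* Each elapsed timer period adds to the backlog. */
static void toy_elapse(struct toy_timer *t)
{
    t->ticks_not_accounted += t->period;
}

/* A successfully injected interrupt retires the oldest period from the
 * backlog; a coalesced one leaves it in place, to be compensated by
 * extra injections in later intervals. */
static void toy_inject(struct toy_timer *t, int delivered)
{
    if (delivered) {
        t->ticks_not_accounted -= t->prev_period;
        t->prev_period = t->period;
    }
}
```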

Injecting additional timer interrupts to compensate lost interrupts
can alleviate long term time drift. However, on a short time scale,
this method can have the side effect of making virtual machine time
intermittently pass slower and faster than real time (depending on
the guest's time keeping algorithm). Compensation is disabled by
default and can be enabled for guests where this behaviour may be
acceptable.

Signed-off-by: Ulrich Obergfell uober...@redhat.com
---
 hw/hpet.c |   70 +++-
 1 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index e57c654..519fc6b 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -31,6 +31,7 @@
 #include "hpet_emul.h"
 #include "sysbus.h"
 #include "mc146818rtc.h"
+#include <assert.h>
 
 //#define HPET_DEBUG
 #ifdef HPET_DEBUG
@@ -41,6 +42,9 @@
 
 #define HPET_MSI_SUPPORT0
 
+#define MAX_TICKS_NOT_ACCOUNTED (uint64_t)5 /* 5 sec */
+#define MAX_IRQ_RATE(uint32_t)10
+
 struct HPETState;
 typedef struct HPETTimer {  /* timers */
 uint8_t tn; /*timer number*/
@@ -324,14 +328,35 @@ static const VMStateDescription vmstate_hpet = {
 }
 };
 
+static void hpet_timer_driftfix_reset(HPETTimer *t)
+{
+    if (t->state->driftfix && timer_is_periodic(t)) {
+        t->ticks_not_accounted = t->prev_period = t->period;
+        t->irq_rate = 1;
+        t->divisor = 1;
+    }
+}
+
+static bool hpet_timer_has_tick_backlog(HPETTimer *t)
+{
+    uint64_t backlog = 0;
+
+    if (t->ticks_not_accounted >= t->period + t->prev_period) {
+        backlog = t->ticks_not_accounted - (t->period + t->prev_period);
+    }
+    return (backlog >= t->period);
+}
+
 /*
  * timer expiration callback
  */
 static void hpet_timer(void *opaque)
 {
     HPETTimer *t = opaque;
+    HPETState *s = t->state;
     uint64_t diff;
-
+    int irq_delivered = 0;
+    uint32_t period_count = 0;
     uint64_t period = t->period;
     uint64_t cur_tick = hpet_get_ticks(t->state);
 
 @@ -339,13 +364,37 @@ static void hpet_timer(void *opaque)
     if (t->config & HPET_TN_32BIT) {
         while (hpet_time_after(cur_tick, t->cmp)) {
             t->cmp = (uint32_t)(t->cmp + t->period);
+            t->ticks_not_accounted += t->period;
+            period_count++;
         }
     } else {
         while (hpet_time_after64(cur_tick, t->cmp)) {
             t->cmp += period;
+            t->ticks_not_accounted += period;
+            period_count++;
         }
     }
     diff = hpet_calculate_diff(t, cur_tick);
+    if (s->driftfix) {
+        if (t->ticks_not_accounted > MAX_TICKS_NOT_ACCOUNTED) {
+            t->ticks_not_accounted = t->period + t->prev_period;
+        }
+        if (hpet_timer_has_tick_backlog(t)) {
+            if (t->irq_rate == 1 || period_count > 1) {
+                t->irq_rate++;
+                t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE);
+            }
+            if (t->divisor == 0) {
+                assert(period_count);
+            }
+            if (period_count) {
+                t->divisor = t->irq_rate;
+            }
+            diff /= t->divisor--;
+        } else {
+            t->irq_rate = 1;
+        }
+    }
     qemu_mod_timer(t->qemu_timer,
                    qemu_get_clock_ns(vm_clock) + (int64_t)ticks_to_ns(diff));
 } else if (t->config & HPET_TN_32BIT && !timer_is_periodic(t)) {
@@ -356,7 +405,22 @@ static void hpet_timer(void *opaque)
             t->wrap_flag = 0;
         }
     }
-    update_irq(t, 1);
+    if (s->driftfix && timer_is_periodic(t) && period != 0) {
+        if (t->ticks_not_accounted >= t->period + t->prev_period) {
+            irq_delivered = update_irq(t, 1);
+            if (irq_delivered) {
+                t->ticks_not_accounted -= t->prev_period;
+                t->prev_period = t->period;
+            } else {
+                if (period_count) {
+                    t->irq_rate++;
+                    t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE);
+                }
+            }
+        }
+    } else {
+        update_irq(t, 1);
+    }
 }
 }
 
 static void hpet_set_timer(HPETTimer *t)
@@ -525,6 +589,7 @@ static void hpet_ram_writel(void 

[PATCH] kvm tools: Fix and improve the CPU register dump debug output code

2011-05-09 Thread Ingo Molnar

* Pekka Enberg penb...@kernel.org wrote:

 Ingo Molnar reported that 'kill -3' didn't work on his machine:
 
   * Ingo Molnar mi...@elte.hu wrote:
 
    This is really cumbersome to debug - is there some good way to get to the RIP
    that the guest is hanging in? If kvm would print that out to the host console
    (even if it's just the raw RIP initially) on a kill -3 that would help
    enormously.
 
   Looks like the code should be doing that already - but the ioctl(KVM_GET_SREGS)
   hangs:
 
 [pid   748] ioctl(6, KVM_GET_SREGS
 
 Avi Kivity pointed out that it's not safe to call KVM_GET_SREGS (or other vcpu
 related ioctls) from other threads:
 
is it not OK to call KVM_GET_SREGS from other threads than the one
that's doing KVM_RUN?
 
   From Documentation/kvm/api.txt:
 
- vcpu ioctls: These query and set attributes that control the operation
  of a single virtual cpu.
 
  Only run vcpu ioctls from the same thread that was used to create the
  vcpu.
 
 Fix that up by using pthread_kill() to force the threads that are doing KVM_RUN
 to do the register dumps.
 
 Reported-by: Ingo Molnar mi...@elte.hu
 Cc: Asias He asias.he...@gmail.com
 Cc: Avi Kivity a...@redhat.com
 Cc: Cyrill Gorcunov gorcu...@gmail.com
 Cc: Ingo Molnar mi...@elte.hu
 Cc: Prasad Joshi prasadjoshi...@gmail.com
 Cc: Sasha Levin levinsasha...@gmail.com
 Signed-off-by: Pekka Enberg penb...@kernel.org
 ---
  tools/kvm/kvm-run.c |   20 +---
  1 files changed, 17 insertions(+), 3 deletions(-)
 
 diff --git a/tools/kvm/kvm-run.c b/tools/kvm/kvm-run.c
 index eb50b6a..58e2977 100644
 --- a/tools/kvm/kvm-run.c
 +++ b/tools/kvm/kvm-run.c
 @@ -127,6 +127,18 @@ static const struct option options[] = {
   OPT_END()
  };
  
 +static void handle_sigusr1(int sig)
 +{
 + struct kvm_cpu *cpu = current_kvm_cpu;
 +
 + if (!cpu)
 + return;
 +
 + kvm_cpu__show_registers(cpu);
 + kvm_cpu__show_code(cpu);
 + kvm_cpu__show_page_tables(cpu);
 +}
 +
  static void handle_sigquit(int sig)
  {
   int i;
 @@ -134,9 +146,10 @@ static void handle_sigquit(int sig)
 	for (i = 0; i < nrcpus; i++) {
 		struct kvm_cpu *cpu = kvm_cpus[i];
 
 -		kvm_cpu__show_registers(cpu);
 -		kvm_cpu__show_code(cpu);
 -		kvm_cpu__show_page_tables(cpu);
 +		if (!cpu)
 +			continue;
 +
 +		pthread_kill(cpu->thread, SIGUSR1);
 	}
  
   serial8250__inject_sysrq(kvm);

i can see a couple of problems with the debug printout code, which currently 
produces a stream of such dumps for each vcpu:

Registers:
 rip:    rsp: 16ca flags: 00010002
 rax:    rbx:    rcx: 
 rdx:    rsi:    rdi: 
 rbp: 8000   r8:     r9:  
 r10:    r11:    r12: 
 r13:    r14:    r15: 
 cr0: 6010   cr2: 0070   cr3: 
 cr4:    cr8: 
Segment registers:
 register  selector  base  limit type  p dpl db s l g avl
 csf000  000f    031 3   0  1 0 0 0
 ss1000  0001    031 3   0  1 0 0 0
 ds1000  0001    031 3   0  1 0 0 0
 es1000  0001    031 3   0  1 0 0 0
 fs1000  0001    031 3   0  1 0 0 0
 gs1000  0001    031 3   0  1 0 0 0
 tr      0b1 0   0  0 0 0 0
 ldt         021 0   0  0 0 0 0
 gdt  
 idt  
 [ efer:   apic base: fee00900  nmi: enabled ]
Interrupt bitmap:
     
Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 cf eb 0d 90 90 90 90 
90 90 90 90 90 90 90 90 90 f6 c4 0e 75 4b 
Stack:
  0x16ca: 00 00 00 00  00 00 00 00
  0x16d2: 00 00 00 00  00 00 00 00
  0x16da: 00 00 00 00  00 00 00 00
  0x16e2: 00 00 00 00  00 00 00 00

The problems are:

 - This does not work very well on SMP with lots of vcpus, because the printing 
   is unserialized, resulting in a jumbled mess of an output, all vcpus trying 
   to print to the console at once, often mixing lines and characters randomly.

 - stdout from a signal handler must be flushed, otherwise lines can remain 
   buffered if someone saves the output via 'tee' for example.

 - the dumps from the various CPUs are not distinguishable - they are just
   dumped after each other with no 

[PATCH] kvm tools: Dump vCPUs in order

2011-05-09 Thread Ingo Molnar

* Ingo Molnar mi...@elte.hu wrote:

 The patch below addresses these concerns, serializes the output, tidies up the
 printout, resulting in this new output:

There's one bug remaining that my patch does not address: the vCPUs are not 
printed in order:

# vCPU #0's dump:
# vCPU #2's dump:
# vCPU #24's dump:
# vCPU #5's dump:
# vCPU #39's dump:
# vCPU #38's dump:
# vCPU #51's dump:
# vCPU #11's dump:
# vCPU #10's dump:
# vCPU #12's dump:

This is undesirable as the order of printout is highly random, so successive 
dumps are difficult to compare.

The patch below serializes the signalling itself. (this is on top of the 
previous patch)

The patch also tweaks the vCPU printout line a bit so that it does not start 
with '#', which is discarded if such messages are pasted into Git commit 
messages.

Signed-off-by: Ingo Molnar mi...@elte.hu

diff --git a/tools/kvm/kvm-run.c b/tools/kvm/kvm-run.c
index 221435d..00c70c7 100644
--- a/tools/kvm/kvm-run.c
+++ b/tools/kvm/kvm-run.c
@@ -25,6 +25,7 @@
 #include "kvm/term.h"
 #include "kvm/ioport.h"
 #include "kvm/threadpool.h"
+#include "kvm/barrier.h"
 
 /* header files for gitish interface  */
 #include "kvm/kvm-run.h"
@@ -132,7 +133,7 @@ static const struct option options[] = {
  * Serialize debug printout so that the output of multiple vcpus does not
  * get mixed up:
  */
-static DEFINE_MUTEX(printout_mutex);
+static int printout_done;
 
 static void handle_sigusr1(int sig)
 {
@@ -141,13 +142,13 @@ static void handle_sigusr1(int sig)
	if (!cpu)
		return;
 
-	mutex_lock(&printout_mutex);
-	printf("\n#\n# vCPU #%ld's dump:\n#\n", cpu->cpu_id);
+	printf("\n #\n # vCPU #%ld's dump:\n #\n", cpu->cpu_id);
	kvm_cpu__show_registers(cpu);
	kvm_cpu__show_code(cpu);
	kvm_cpu__show_page_tables(cpu);
	fflush(stdout);
-	mutex_unlock(&printout_mutex);
+	printout_done = 1;
+	mb();
 }
 
 static void handle_sigquit(int sig)
@@ -160,7 +161,15 @@ static void handle_sigquit(int sig)
if (!cpu)
continue;
 
+   printout_done = 0;
	pthread_kill(cpu->thread, SIGUSR1);
+   /*
+* Wait for the vCPU to dump state before signalling
+* the next thread. Since this is debug code it does
+* not matter that we are burning CPU time a bit:
+*/
+   while (!printout_done)
+   mb();
}
 
serial8250__inject_sysrq(kvm);
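The printout_done/mb() handshake in the hunk above can be expressed with C11 atomics. The sketch below models the coordinator/dumper pair in simplified form: an ordinary thread stands in for the signalled vCPU thread, and the acquire/release pair plays the role of mb():

```c
#include <pthread.h>
#include <stdatomic.h>

/* Flag published by the "dumper" once its output is complete. */
static atomic_int printout_done;
static int dumped;   /* stands in for the register-dump output */

static void *dumper(void *arg)
{
    (void)arg;
    dumped = 1;                                   /* "dump" the state */
    /* Release store: everything written above is visible to an
     * acquire load that observes printout_done == 1. */
    atomic_store_explicit(&printout_done, 1, memory_order_release);
    return NULL;
}

/* Coordinator: signal one dumper and busy-wait for it to finish,
 * exactly as handle_sigquit() waits before signalling the next vCPU. */
static int wait_for_dump(void)
{
    pthread_t tid;

    atomic_store(&printout_done, 0);
    pthread_create(&tid, NULL, dumper, NULL);
    while (!atomic_load_explicit(&printout_done, memory_order_acquire))
        ;                                         /* spin, as in the patch */
    pthread_join(tid, NULL);
    return dumped;
}
```

As Ingo notes, burning CPU time in the wait loop is acceptable here because this is debug-only code.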

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: Add documentation for KVM_CAP_NR_VCPUS

2011-05-09 Thread Avi Kivity

On 05/07/2011 05:42 PM, Pekka Enberg wrote:

Document KVM_CAP_NR_VCPUS, which userspace can use to determine the
maximum number of VCPUs it can create with the KVM_CREATE_VCPU ioctl.



This capability was added in 2.6.26; so the documentation should state 
that if the capability is not available the user should assume 4 cpus 
max (the limit at the time).
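A sketch of the resulting userspace logic. The helper name is hypothetical; the real caller would pass in the return value of ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS), which is 0 on kernels predating the capability:

```c
#include <assert.h>

/* Fall back to the historical limit of 4 vcpus when the capability
 * query returns 0 (capability absent) or an error. */
static int max_vcpus_from_cap(int cap_ret)
{
    return cap_ret > 0 ? cap_ret : 4;
}
```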


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 2/2] KVM: Validate userspace_addr of memslot when registered

2011-05-09 Thread Avi Kivity

On 05/07/2011 10:35 AM, Takuya Yoshikawa wrote:

From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

This way, we can avoid checking the user space address many times when
we read the guest memory.

Although we could do the same for writes if we checked which slots are
writable, we do not care about writes for now: reading guest memory
happens more often than writing.


Thanks, applied.  I changed VERIFY_READ to VERIFY_WRITE, since the 
checks are exactly the same.




Re: [PATCH] kvm-s390: userspace access to guest storage keys

2011-05-09 Thread Avi Kivity

On 05/06/2011 01:25 PM, Carsten Otte wrote:

From: Carsten Otte co...@de.ibm.com

This patch gives userspace access to the guest visible storage keys. Three
operations are supported:
KVM_S390_KEYOP_SSKE for setting storage keys, similar to the set storage key
extended (SSKE) instruction.
KVM_S390_KEYOP_ISKE for reading storage key content, similar to the insert
storage key extended (ISKE) instruction.
KVM_S390_KEYOP_RRBE for reading and resetting the page reference bit, similar
to the reset reference bit extended (RRBE) instruction.
Note that all functions take userspace addresses as input, which typically
differ from guest addresses.

This work was requested by Alex Graf for guest live migration: different from
x86, the guest's view of dirty and reference information is not stored in the
page table entries that are part of the guest address space, but in the
storage key instead. Thus, the storage key needs to be read, transferred,
and written back on the migration target side.



And not in main memory, either?


Signed-off-by: Carsten Otteco...@de.ibm.com
---
  arch/s390/include/asm/kvm_host.h |4 +
  arch/s390/kvm/kvm-s390.c |  149 ++-
  include/linux/kvm.h  |7 +


Documentation/kvm/api.txt 


  3 files changed, 157 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/s390/include/asm/kvm_host.h
===
--- linux-2.6.orig/arch/s390/include/asm/kvm_host.h
+++ linux-2.6/arch/s390/include/asm/kvm_host.h
@@ -47,6 +47,10 @@ struct sca_block {
  #define KVM_HPAGE_MASK(x) (~(KVM_HPAGE_SIZE(x) - 1))
  #define KVM_PAGES_PER_HPAGE(x)(KVM_HPAGE_SIZE(x) / PAGE_SIZE)

+#define KVM_S390_KEYOP_SSKE 0x01
+#define KVM_S390_KEYOP_ISKE 0x02
+#define KVM_S390_KEYOP_RRBE 0x03


kvm_host.h is not exported to userspace.  Use asm/kvm.h instead.

snip black magic



Re: KVM migration with different source and dest paths

2011-05-09 Thread Avi Kivity

On 05/06/2011 12:30 PM, Onkar Mahajan wrote:

Is it possible to migrate a KVM guest when the image paths differ, like this:

path of the image of guest01 on host A:
/home/joe/guest01.img

path of the image of guest01 on host B:
/home/bill/image/temp/guest01.img

Is this possible?

If it is, any pointers as to how to do this?



Are you referring to a single image which is accessible via two paths?  
Or two different images?


Both are possible (the former by simply using the new path for the 
destination, the latter by using the migrate -b switch).




Re: X86EMUL_PROPAGATE_FAULT

2011-05-09 Thread Avi Kivity

On 05/05/2011 06:05 PM, Matteo wrote:

Hello to everybody,

I am working on KVM version 2.6.38
and I'm facing a new problem on an emulated instruction
whose implementation is already in kvm.

The error shows up after the emulation of the RET opcode (C3 Byte 
Opcode).

When trying to emulate the instruction
at the address loaded after the pop instruction made by the RET
I get an X86EMUL_PROPAGATE_FAULT error due to a gpa == UNMAPPED_GVA
as you can see in the following debug trace:

---8<---
x86_decode_insn:2705 - Starting New Instruction Decode

x86_decode_insn:2709 - c->eip = ctxt->eip = 3226138255
x86_decode_insn:2759 - Opcode - c3
x86_decode_insn:2928 - Decode and fetch the source operand
x86_decode_insn:2931 - SrcNone
x86_decode_insn:3015 - Decode and fetch the second source operand
x86_decode_insn:3018 - Src2None
x86_decode_insn:3044 - Decode and fetch the destination operand
x86_decode_insn:3089 - ImplicitOps
x86_decode_insn:3092 - No Destination Operand
x86_emulate_instruction:4458 - Returned from x86_decode_insn with r = 0

x86_emulate_insn:3194 - starting special_insn...
x86_emulate_insn:3196 - c->eip = 3226138256
x86_emulate_insn:3565 - starting writeback...
writeback:1178 - c->eip = 2147483648
x86_emulate_instruction:4538 - Return from x86_emulate_insn with code r = 0
---8<---



So the next instruction will be emulated reading the opcode with 
eip=2147483648 as stated before

but the emulation fails with the following debug trace

---8<---
x86_decode_insn:2705 - Starting New Instruction Decode

x86_decode_insn:2709 - c->eip = ctxt->eip = 2147483648
x86_decode_insn:2757 - Read opcode from eip
kvm_read_guest_virt_helper:3724 - gpa == UNMAPPED_GVA return X86EMUL_PROPAGATE_FAULT

do_fetch_insn_byte:573 - ops->fetch returns an error
---8<---



The instruction has returned to an EIP that is outside RAM, so kvm is 
unable to fetch the next instruction.  This is likely due to a bug (in 
kvm or the guest) that has occurred much earlier.




Re: [PATCH 1/2] rcu: provide rcu_virt_note_context_switch() function.

2011-05-09 Thread Avi Kivity

On 05/04/2011 07:35 PM, Paul E. McKenney wrote:

On Wed, May 04, 2011 at 04:31:03PM +0300, Gleb Natapov wrote:
  Provide rcu_virt_note_context_switch() for virtualization use to note
  quiescent state during guest entry.

Very good, queued on -rcu.

Unless you tell me otherwise, I will assume that you want to carry the
patch modifying KVM to use this.


Is -rcu a fast-forward-only tree (like tip)?  If so I'll merge it and 
apply patch 2.




Re: [PATCH 0/6] KVM: x86 emulator: Unused opt removal and some cleanups

2011-05-09 Thread Avi Kivity

On 05/01/2011 08:21 PM, Takuya Yoshikawa wrote:

Patches 0-4: Just remove unused opt
Patch 5: grpX emulation cleanup
Patch 6: jmp far emulation cleanup

Some functions introduced in patch 5/6 will be called by
opcode::execute later.


Applied, thanks.



Re: Trouble adding kvm clock trace to qemu-kvm

2011-05-09 Thread Avi Kivity

On 04/30/2011 08:00 PM, Chris Thompson wrote:
I'm trying to add a trace to qemu-kvm that will log the value of the 
vcpu's clock when a specific interrupt gets pushed. I'm working with 
qemu-kvm-0.14.0 on the 2.6.32-31 kernel. I've added the following to 
kvm_arch_try_push_interrupts in qemu-kvm-x86.c:


if (irq == 41) {
// Get the VCPU's TSC
struct kvm_clock_data clock;
kvm_vcpu_ioctl(env, KVM_GET_CLOCK, &clock);
uint64_t ticks = clock.clock;
trace_kvm_clock_at_injection(ticks);
}



This mechanism is only active with -no-kvm-irqchip; otherwise interrupt 
injection happens in the kernel.



And here's the trace event I added:

kvm_clock_at_injection(uint64_t ticks) "interrupt 41 at clock %" PRIu64

I have that trace and the virtio_blk_req_complete trace enabled. An 
excerpt from the resulting trace output from simpletrace.py:


virtio_blk_req_complete 288390365546367 30461.681 req=46972352 status=0
kvm_clock_at_injection 288390365546578 0.211 ticks=46972352
virtio_blk_req_complete 288390394870065 29323.487 req=46972352 status=0
kvm_clock_at_injection 288390394870276 0.211 ticks=46972352

Am I getting the guest's clock incorrectly? And even if so, why is it 
the same as the request pointer that virtio_blk_req_complete reports?


Any ideas are appreciated.


What is the 'ticks' field?
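A side note not raised directly above: in the KVM API, KVM_GET_CLOCK is defined as a VM-scope ioctl rather than a per-vcpu one, so issuing it through a vcpu helper is suspect. Below is a minimal self-contained sketch of the intended flow; the ioctl helper is stubbed (real code would go through QEMU's kvm_vm_ioctl()), so treat names and values as illustrative only.

```c
#include <stdint.h>
#include <string.h>

/* Sketch only: KVM_GET_CLOCK is a VM ioctl, not a vcpu ioctl.
 * kvm_vm_ioctl_stub() stands in for the real helper so this compiles
 * on its own; it pretends the kernel reported 12345 ns. */
struct kvm_clock_data {
    uint64_t clock;   /* kvmclock value, in nanoseconds */
    uint32_t flags;
    uint32_t pad[9];
};

static int kvm_vm_ioctl_stub(int type, void *arg)
{
    (void)type;
    ((struct kvm_clock_data *)arg)->clock = 12345;
    return 0;
}

static uint64_t get_guest_clock(void)
{
    struct kvm_clock_data data;

    memset(&data, 0, sizeof(data));
    /* Real code would be: kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data); */
    kvm_vm_ioctl_stub(0 /* KVM_GET_CLOCK */, &data);
    return data.clock;
}
```

The key point is that the struct is passed by address and filled in by the kernel; passing it by value would leave the caller reading uninitialized stack data.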



Re: [PATCH 06/30] nVMX: Decoding memory operands of VMX instructions

2011-05-09 Thread Avi Kivity

On 05/08/2011 11:18 AM, Nadav Har'El wrote:

This patch includes a utility function for decoding pointer operands of VMX
instructions issued by L1 (a guest hypervisor).

+   /*
+* TODO: throw #GP (and return 1) in various cases that the VM*
+* instructions require it - e.g., offset beyond segment limit,
+* unusable or unreadable/unwritable segment, non-canonical 64-bit
+* address, and so on. Currently these are not checked.
+*/
+   return 0;
+}
+


Note: emulate.c now contains a function (linearize()) which does these 
calculations.  We need to generalize it and expose it so nvmx can make 
use of it.


There is no real security concern since these instructions are only 
allowed from cpl 0 anyway.
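For illustration, the checks listed in the TODO above might be sketched as follows. This is a hedged sketch, not KVM's linearize(): the function name and parameters are hypothetical, and only the two checks called out in the comment (segment limit, canonical 64-bit address) are shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of validity checks for a VMX instruction memory
 * operand; names and parameters are illustrative, not KVM's API. */
static bool vmx_operand_addr_valid(uint64_t addr, uint32_t offset,
                                   uint32_t seg_limit, bool long_mode)
{
    if (long_mode) {
        /* 64-bit mode: the address must be canonical, i.e. bits
         * 63:47 must be all zeros or all ones. */
        int64_t s = (int64_t)addr;
        return (s >> 47) == 0 || (s >> 47) == -1;
    }
    /* Protected mode: the offset must not exceed the segment limit. */
    return offset <= seg_limit;
}
```

A real implementation would also check segment usability and read/write permissions, which is exactly the generalization of linearize() being discussed.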




Re: [PATCH 15/30] nVMX: Move host-state field setup to a function

2011-05-09 Thread Avi Kivity

On 05/08/2011 11:22 AM, Nadav Har'El wrote:

Move the setting of constant host-state fields (fields that do not change
throughout the life of the guest) from vmx_vcpu_setup to a new common function
vmx_set_constant_host_state(). This function will also be used to set the
host state when running L2 guests.



   */
  static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
  {
-   u32 host_sysenter_cs, msr_low, msr_high;
-   u32 junk;
+   u32 msr_low, msr_high;



Unused?



Re: [PATCH] kvm-s390: userspace access to guest storage keys

2011-05-09 Thread Alexander Graf

On 09.05.2011, at 10:43, Avi Kivity wrote:

 On 05/06/2011 01:25 PM, Carsten Otte wrote:
 From: Carsten Otte co...@de.ibm.com
 
 This patch gives userspace access to the guest visible storage keys. Three
 operations are supported:
 KVM_S390_KEYOP_SSKE for setting storage keys, similar to the set storage key
 extended (SSKE) instruction.
 KVM_S390_KEYOP_ISKE for reading storage key content, similar to the insert
 storage key extended (ISKE) instruction.
 KVM_S390_KEYOP_RRBE for reading and resetting the page reference bit, similar
 to the reset reference bit extended (RRBE) instruction.
 Note that all functions take userspace addresses as input, which typically
 differ from guest addresses.
 
 This work was requested by Alex Graf for guest live migration: Different from
 x86, the guest's view of dirty and reference information is not stored in the
 page table entries that are part of the guest address space but is stored in
 the storage key instead. Thus, the storage key needs to be read, transferred,
 and written back on the migration target side.
 
 
 And not in main memory, either?

Nope - storage keys are only accessible using special instructions. They're not 
in RAM (visible to a guest) :).


Alex



Re: [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12

2011-05-09 Thread Avi Kivity

On 05/08/2011 11:23 AM, Nadav Har'El wrote:

This patch contains code to prepare the VMCS which can be used to actually
run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (our desires for our
own guests).
+/*
+ * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
+ * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function merges it
+ * with L0's requirements for its guest (a.k.a. vmcs01), so we can run the L2
+ * guest in a way that will both be appropriate to L1's requests, and our
+ * needs. In addition to modifying the active vmcs (which is vmcs02), this
+ * function also has additional necessary side-effects, like setting various
+ * vcpu-arch fields.
+ */
+static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{


snip


+   vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);


I think this is wrong - anything having to do with vmcs linking will 
need to be emulated; we can't let the cpu see the real value (and even 
if we don't emulate, we have to translate addresses like you do for the 
apic access page).



+   vmcs_write64(TSC_OFFSET,
+   vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);


This is probably wrong (everything with time is probably wrong), but we 
can deal with it (much) later.




Re: [PATCH] kvm tools: Rename pci_device to pci_hdr for clarity

2011-05-09 Thread Asias He
On 05/07/2011 06:50 PM, Sasha Levin wrote:
 Signed-off-by: Sasha Levin levinsasha...@gmail.com
 ---
  tools/kvm/virtio/blk.c |   14 +++---
  1 files changed, 7 insertions(+), 7 deletions(-)
 
 diff --git a/tools/kvm/virtio/blk.c b/tools/kvm/virtio/blk.c
 index accfc3e..cc3dc78 100644
 --- a/tools/kvm/virtio/blk.c
 +++ b/tools/kvm/virtio/blk.c
 @@ -45,7 +45,7 @@ struct blk_dev {
  
   struct virt_queue   vqs[NUM_VIRT_QUEUES];
   struct blk_dev_job  jobs[NUM_VIRT_QUEUES];
 - struct pci_device_header	pci_device;
 + struct pci_device_header	pci_hdr;
  };
  
  static struct blk_dev *bdevs[VIRTIO_BLK_MAX_DEV];
 @@ -103,7 +103,7 @@ static bool virtio_blk_pci_io_in(struct kvm *self, u16 
 port, void *data, int siz
   break;
   case VIRTIO_PCI_ISR:
   ioport__write8(data, 0x1);
 - kvm__irq_line(self, bdev->pci_device.irq_line, 0);
 + kvm__irq_line(self, bdev->pci_hdr.irq_line, 0);
   break;
   case VIRTIO_MSI_CONFIG_VECTOR:
   ioport__write16(data, bdev-config_vector);
 @@ -167,7 +167,7 @@ static void virtio_blk_do_io(struct kvm *kvm, void *param)
   while (virt_queue__available(vq))
   virtio_blk_do_io_request(kvm, bdev, vq);
  
 - kvm__irq_line(kvm, bdev->pci_device.irq_line, 1);
 + kvm__irq_line(kvm, bdev->pci_hdr.irq_line, 1);
  }
  
  static bool virtio_blk_pci_io_out(struct kvm *self, u16 port, void *data, 
 int size, u32 count)
 @@ -283,7 +283,7 @@ void virtio_blk__init(struct kvm *self, struct disk_image 
 *disk)
   .blk_config = (struct virtio_blk_config) {
 .capacity   = disk->size / SECTOR_SIZE,
   },
 - .pci_device = (struct pci_device_header) {
 + .pci_hdr = (struct pci_device_header) {
   .vendor_id  = PCI_VENDOR_ID_REDHAT_QUMRANET,
   .device_id  = PCI_DEVICE_ID_VIRTIO_BLK,
   .header_type= PCI_HEADER_TYPE_NORMAL,
 @@ -298,10 +298,10 @@ void virtio_blk__init(struct kvm *self, struct 
 disk_image *disk)
 if (irq__register_device(PCI_DEVICE_ID_VIRTIO_BLK, &dev, &pin, &line) < 0)
   return;
  
 - bdev->pci_device.irq_pin	= pin;
 - bdev->pci_device.irq_line	= line;
 + bdev->pci_hdr.irq_pin	= pin;
 + bdev->pci_hdr.irq_line	= line;
 
 - pci__register(&bdev->pci_device, dev);
 + pci__register(&bdev->pci_hdr, dev);
  
   ioport__register(blk_dev_base_addr, virtio_blk_io_ops, 
 IOPORT_VIRTIO_BLK_SIZE);
  }

Does this renaming apply to other devices as well? Mind to send followup
patches?

-- 
Best Regards,
Asias He


Re: [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12

2011-05-09 Thread Nadav Har'El
Hi, and thanks again for the reviews.

On Mon, May 09, 2011, Avi Kivity wrote about Re: [PATCH 17/30] nVMX: Prepare 
vmcs02 from vmcs01 and vmcs12:
 +vmcs_write64(TSC_OFFSET,
 +vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
 
 This is probably wrong (everything with time is probably wrong), but we 
 can deal with it (much) later.

I thought this was right :-) Why do you believe it to be wrong?

L1 wants to add vmcs12->tsc_offset to its own TSC to generate L2's TSC.
But L1's TSC is itself offset by vmx->nested.vmcs01_tsc_offset from L0's TSC.
So their sum, vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset, is the
offset of L2's TSC from L0's TSC. Am I missing something?

Thanks,
Nadav.

-- 
Nadav Har'El|Monday, May  9 2011, 5 Iyyar 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I couldn't think of an interesting
http://nadav.harel.org.il   |signature to put here... Maybe next time.


Re: [PATCH 15/30] nVMX: Move host-state field setup to a function

2011-05-09 Thread Nadav Har'El
On Mon, May 09, 2011, Avi Kivity wrote about Re: [PATCH 15/30] nVMX: Move 
host-state field setup to a function:
   static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
   {
 -u32 host_sysenter_cs, msr_low, msr_high;
 -u32 junk;
 +u32 msr_low, msr_high;
 
 
 Unused?

Well, it actually is used, because I left the GUEST_IA32_PAT setting in
vmx_vcpu_setup. I guess I could have moved these two variables inside
the if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) block, but I
didn't. Similarly, the host_pat variable can also move inside the if().

I'll make these changes.

-- 
Nadav Har'El|Monday, May  9 2011, 5 Iyyar 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Shortening Year-2000 to Y2K was just the
http://nadav.harel.org.il   |kind of thinking that caused that problem!


Re: [PATCH 20/30] nVMX: Exiting from L2 to L1

2011-05-09 Thread Avi Kivity

On 05/08/2011 11:25 AM, Nadav Har'El wrote:

This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; In that
case, L0 will handle the exit, and resume running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in the next patch.


/*
+ * prepare_vmcs12 is part of what we need to do when the nested L2 guest exits
+ * and we want to prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12),
+ * and this function updates it to reflect the changes to the guest state while
+ * L2 was running (and perhaps made some exits which were handled directly by L0
+ * without going back to L1), and to reflect the exit reason.
+ * Note that we do not have to copy here all VMCS fields, just those that
+ * could have changed by the L2 guest or the exit - i.e., the guest-state and
+ * exit-information fields only. Other fields are modified by L1 with VMWRITE,
+ * which already writes to vmcs12 directly.
+ */
+void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{

snip


+   vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);


Again, this should be emulated, not assigned to the guest.



Re: [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12

2011-05-09 Thread Avi Kivity

On 05/09/2011 01:27 PM, Nadav Har'El wrote:

Hi, and thanks again for the reviews.

On Mon, May 09, 2011, Avi Kivity wrote about Re: [PATCH 17/30] nVMX: Prepare vmcs02 
from vmcs01 and vmcs12:
  + vmcs_write64(TSC_OFFSET,
  + vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);

  This is probably wrong (everything with time is probably wrong), but we
  can deal with it (much) later.

I thought this was right :-) Why do you believe it to be wrong?


Just out of principle, everything to do with time is wrong.


L1 wants to add vmcs12->tsc_offset to its own TSC to generate L2's TSC.
But L1's TSC is itself offset by vmx->nested.vmcs01_tsc_offset from L0's TSC.
So their sum, vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset, is the
offset of L2's TSC from L0's TSC. Am I missing something?


Only Zach knows.
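For reference, the offset arithmetic Nadav describes can be sketched with illustrative helper functions (not kernel code): L1 sees l1_tsc = l0_tsc + vmcs01_tsc_offset, and wants L2 to see l2_tsc = l1_tsc + vmcs12->tsc_offset, so vmcs02 is programmed with the sum of the two offsets.

```c
#include <stdint.h>

/* Illustrative helpers for the nested TSC-offset composition. */
static uint64_t vmcs02_tsc_offset(uint64_t vmcs01_tsc_offset,
                                  uint64_t vmcs12_tsc_offset)
{
    /* The offset programmed into the hardware VMCS (vmcs02) is the
     * sum of L0's offset for L1 and L1's offset for L2. */
    return vmcs01_tsc_offset + vmcs12_tsc_offset;
}

static uint64_t l2_tsc(uint64_t l0_tsc, uint64_t vmcs01_tsc_offset,
                       uint64_t vmcs12_tsc_offset)
{
    /* What the L2 guest observes when it executes RDTSC. */
    return l0_tsc + vmcs02_tsc_offset(vmcs01_tsc_offset,
                                      vmcs12_tsc_offset);
}
```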



Re: [PATCH 22/30] nVMX: Correct handling of interrupt injection

2011-05-09 Thread Avi Kivity

On 05/08/2011 11:26 AM, Nadav Har'El wrote:

When KVM wants to inject an interrupt, the guest should think a real interrupt
has happened. Normally (in the non-nested case) this means checking that the
guest doesn't block interrupts (and if it does, inject when it doesn't - using
the interrupt window VMX mechanism), and setting up the appropriate VMCS
fields for the guest to receive the interrupt.

However, when we are running a nested guest (L2) and its hypervisor (L1)
requested exits on interrupts (as most hypervisors do), the most efficient
thing to do is to exit L2, telling L1 that the exit was caused by an
interrupt, the one we were injecting; Only when L1 asked not to be notified
of interrupts, we should inject directly to the running L2 guest (i.e.,
the normal code path).

However, properly doing what is described above requires invasive changes to
the flow of the existing code, which we elected not to do in this stage.
Instead we do something more simplistic and less efficient: we modify
vmx_interrupt_allowed(), which kvm calls to see if it can inject the interrupt
now, to exit from L2 to L1 before continuing the normal code. The normal kvm
code then notices that L1 is blocking interrupts, and sets the interrupt
window to inject the interrupt later to L1. Shortly after, L1 gets the
interrupt while it is itself running, not as an exit from L2. The cost is an
extra L1 exit (the interrupt window).

Signed-off-by: Nadav Har'Eln...@il.ibm.com
---
  arch/x86/kvm/vmx.c |   35 +++
  1 file changed, 35 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:20.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:20.0 +0300
@@ -3675,9 +3675,25 @@ out:
return ret;
  }

+/*
+ * In nested virtualization, check if L1 asked to exit on external interrupts.
+ * For most existing hypervisors, this will always return true.
+ */
+static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
+{
+   return get_vmcs12(vcpu)->pin_based_vm_exec_control &
+   PIN_BASED_EXT_INTR_MASK;
+}
+
  static void enable_irq_window(struct kvm_vcpu *vcpu)
  {
u32 cpu_based_vm_exec_control;
+   if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
+   /* We can get here when nested_run_pending caused
+* vmx_interrupt_allowed() to return false. In this case, do
+* nothing - the interrupt will be injected later.
+*/
+   return;


Why not do (or schedule) the nested vmexit here?  It's more natural than 
in vmx_interrupt_allowed() which from its name you'd expect to only read 
stuff.


I guess it can live for now if there's some unexpected complexity there.



cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3800,6 +3816,13 @@ static void vmx_set_nmi_mask(struct kvm_

  static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
  {
+   if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
+   if (to_vmx(vcpu)-nested.nested_run_pending)
+   return 0;
+   nested_vmx_vmexit(vcpu, true);
+   /* fall through to normal code, but now in L1, not L2 */
+   }
+
	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
		(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
@@ -5463,6 +5486,14 @@ static int vmx_handle_exit(struct kvm_vc
	if (vmx->emulation_required && emulate_invalid_guest_state)
return handle_invalid_guest_state(vcpu);

+   /*
+* the KVM_REQ_EVENT optimization bit is only on for one entry, and if
+* we did not inject a still-pending event to L1 now because of
+* nested_run_pending, we need to re-enable this bit.
+*/
+   if (vmx->nested.nested_run_pending)
+   kvm_make_request(KVM_REQ_EVENT, vcpu);
+
if (exit_reason == EXIT_REASON_VMLAUNCH ||
exit_reason == EXIT_REASON_VMRESUME)
	vmx->nested.nested_run_pending = 1;
@@ -5660,6 +5691,8 @@ static void __vmx_complete_interrupts(st

  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
  {
+   if (is_guest_mode(&vmx->vcpu))
+   return;
	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
  VM_EXIT_INSTRUCTION_LEN,
  IDT_VECTORING_ERROR_CODE);
@@ -5667,6 +5700,8 @@ static void vmx_complete_interrupts(stru

  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
  {
+   if (is_guest_mode(vcpu))
+   return;


Hmm.  What if L0 injected something into L2?


Re: [PATCH 24/30] nVMX: Correct handling of idt vectoring info

2011-05-09 Thread Avi Kivity

On 05/08/2011 11:27 AM, Nadav Har'El wrote:

This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
case.

When a guest exits while handling an interrupt or exception, we get this
information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
there's nothing we need to do, because L1 will see this field in vmcs12, and
handle it itself. However, when L2 exits and L0 handles the exit itself and
plans to return to L2, L0 must inject this event to L2.

In the normal non-nested case, the idt_vectoring_info case is discovered after
the exit, and the decision to inject (though not the injection itself) is made
at that point. However, in the nested case a decision of whether to return
to L2 or L1 also happens during the injection phase (see the previous
patches), so in the nested case we can only decide what to do about the
idt_vectoring_info right after the injection, i.e., in the beginning of
vmx_vcpu_run, which is the first time we know for sure if we're staying in
L2 (i.e., nested_mode is true).

+static void nested_handle_valid_idt_vectoring_info(struct vcpu_vmx *vmx)
+{
+   int irq  = vmx->idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
+   int type = vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
+   int errCodeValid = vmx->idt_vectoring_info &
+   VECTORING_INFO_DELIVER_CODE_MASK;


Innovative coding style.


+   vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+   irq | type | INTR_INFO_VALID_MASK | errCodeValid);
+


Why not do a 1:1 copy?


+   vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+   vmx->nested.vm_exit_instruction_len);
+   if (errCodeValid)
+   vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+   vmx->nested.idt_vectoring_error_code);
+}
+
  #ifdef CONFIG_X86_64
  #define R r
  #define Q q
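The 1:1 copy suggested above could look roughly like the sketch below: since IDT_VECTORING_INFO_FIELD and VM_ENTRY_INTR_INFO_FIELD share a layout, a valid value can be reinjected verbatim instead of being split into vector/type/error-code bits and reassembled. vmcs_write32() is stubbed so the sketch is self-contained, and the field constant is omitted; this is an illustration of the idea, not the actual KVM code.

```c
#include <stdint.h>

#define INTR_INFO_VALID_MASK (1u << 31)

/* Stand-in for the hardware VMCS field, so the sketch is testable. */
static uint32_t fake_vmcs_entry_intr_info;

static void vmcs_write32(int field, uint32_t value)
{
    (void)field;
    fake_vmcs_entry_intr_info = value;
}

/* 1:1 copy: reinject the exit's IDT-vectoring info verbatim as the
 * next entry's interruption info, if (and only if) it is valid. */
static void reinject_idt_vectoring_info(uint32_t idt_vectoring_info)
{
    if (idt_vectoring_info & INTR_INFO_VALID_MASK)
        vmcs_write32(0 /* VM_ENTRY_INTR_INFO_FIELD */,
                     idt_vectoring_info);
}
```

The instruction length and error code would still be written separately, as in the quoted patch.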




Re: Graphics pass-through

2011-05-09 Thread Jan Kiszka
On 2011-05-05 17:17, Alex Williamson wrote:
 And what about the host? When does Linux release the legacy range?
 Always or only when a specific (!=vga/vesa) framebuffer driver is loaded?
 
 Well, that's where it'd be nice if the vga arbiter was actually in more
 widespread use.  It currently seems to be nothing more than a shared
 mutex, but it would actually be useful if it included backends to do the
 chipset vga routing changes.  I think when I was testing this, I was
 externally poking PCI bridge chipset to toggle the VGA_EN bit.

Right, we had to drop the approach to pass through the secondary card
for now, the arbiter was not switching properly. Haven't checked yet if
VGA_EN was properly set, though the kernel code looks like it should
take care of this.

Even with handing out the primary adapter, we had only mixed success so
far. The onboard adapter worked well (in VESA mode), but the NVIDIA is
not displaying early boot messages at all. Maybe a vgabios issue.
Windows was booting nevertheless - until we installed the NVIDIA
drivers. Then it ran into a blue screen.

BTW, what ATI adapter did you use precisely, and what did work, what not?

One thing I was wondering: Most modern adapters should be PCIe these
days. Our NVIDIA definitely is. But so far we are claiming to have it
attached to a PCI bus. That caps all the extended capabilities e.g.
Could this make some relevant difference?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-09 Thread Avi Kivity

On 05/08/2011 11:15 AM, Nadav Har'El wrote:

Hi,

This is the ninth iteration of the nested VMX patch set. This iteration
addresses all of the comments and requests that were raised by reviewers in
the previous rounds, with only a few exceptions listed below.

Some of the issues which were solved in this version include:

  * Overhauled the hardware VMCS (vmcs02) allocation. Previously we had up to
256 vmcs02s, one for each L2. Now we only have one, which is reused.
We also have a compile-time option VMCS02_POOL_SIZE to keep a bigger pool
of vmcs02s. This option will be useful in the future if vmcs02 won't be
filled from scratch on each entry from L1 to L2 (currently, it is).

  * The vmcs01 structure, containing a copy of all fields from L1's VMCS, was
unnecessary, as all the necessary values are either known to KVM or appear
in vmcs12. This structure is now gone for good.

  * There is no longer a vmcs_fields sub-structure that everyone disliked.
All the VMCS fields appear directly in the vmcs12 structure, which makes
the code simpler and more readable.

  * Make sure that the vmcs12 fields have fixed sizes and location, and add
some extra padding, to support live migration and improve future-proofing.

  * For some fields, nested exit used to fail to return the host-state as set
by L1. Fixed that.

  * nested_vmx_exit_handled (deciding whether to let L1 handle an exit, or handle it
    in L0 and return to L2) is now more correct, and handles more exit reasons.

  * Complete overhaul of the cr0, exception bitmap, cr3 and cr4 handling code.
The code is now shorter (uses existing functions like kvm_set_cr3, etc.),
more readable, and more uniform (no pieces of code for enable_ept and not,
less special code for cr0.TS, and none of that ugly cr0.PG monkey-business).

  * Use kvm_register_write(), kvm_rip_read(), etc. Got rid of new and now
    unneeded function sync_cached_regs_to_vmcs().

  * Fix return value of the VMX msrs to be more correct, and more constant
(not to needlessly vary on different hosts).

  * Added some more missing verifications to vmcs12's fields (cleanly failing
the nested entry if these verifications fail).

  * Expose the MSR-bitmap feature to L1. Every MSR access still exits to L0,
but slow exits to L1 are avoided when L1's MSR bitmap doesn't want it.

  * Removed or rate limited printouts which could be exploited by guests.

  * Fix VM_ENTRY_LOAD_IA32_PAT feature handling.

  * Fixed potential bug and verified that nested vmx now works with both
CONFIG_PREEMPT and CONFIG_SMP enabled.

  * Dozens of other code cleanups and bug fixes.

Only a few issues from previous reviews remain unaddressed. These are:

  * The interrupt injection and IDT_VECTORING_INFO_FIELD handling code was
still not rewritten. It works, though ;-)

  * No KVM autotests for nested VMX yet.

  * Merging of L0's and L1's MSR bitmaps (and IO bitmaps) is still not
supported. As explained above, the current code uses L1's MSR bitmap
to avoid costly exits to L1, but still suffers exits to L0 on each
MSR access in L2.

  * Still no option for disabling some capabilities advertised to L1.

  * No support for TPR_SHADOW feature for L1.

This new set of patches applies to the current KVM trunk (I checked with
082f9eced53d50c136e42d072598da4be4b9ba23).
If you wish, you can also check out an already-patched version of KVM from
branch nvmx9 of the repository:
 git://github.com/nyh/kvm-nested-vmx.git


About nested VMX:
-

The following 30 patches implement nested VMX support. This feature enables
a guest to use the VMX APIs in order to run its own nested guests.
In other words, it allows running hypervisors (that use VMX) under KVM.
Multiple guest hypervisors can be run concurrently, and each of those can
in turn host multiple guests.

The theory behind this work, our implementation, and its performance
characteristics were presented in OSDI 2010 (the USENIX Symposium on
Operating Systems Design and Implementation). Our paper was titled
The Turtles Project: Design and Implementation of Nested Virtualization,
and was awarded Jay Lepreau Best Paper. The paper is available online, at:

http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

This patch set does not include all the features described in the paper.
In particular, this patch set is missing nested EPT (L1 can't use EPT and
must use shadow page tables). It is also missing some features required to
run VMWare hypervisors as a guest. These missing features will be sent as
follow-on patches.

Running nested VMX:
--

The nested VMX feature is currently disabled by default. It must be
explicitly enabled with the nested=1 option to the kvm-intel module.

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the VMX CPU feature, so it must be
explicitly enabled, by giving 

Re: [PATCH] kvm-s390: userspace access to guest storage keys

2011-05-09 Thread Avi Kivity

On 05/09/2011 01:11 PM, Alexander Graf wrote:


  And not in main memory, either?

Nope - storage keys are only accessible using special instructions. They're not 
in RAM (visible to a guest) :).



Interesting, so where are they kept?  An on-chip memory?  That would 
limit the amount of main memory to that indexed by the chip.  Extra 
off-chip memory?


Asking purely out of interest, this has no bearing on the patch.



Re: [PATCH] kvm tools: Rename pci_device to pci_hdr for clarity

2011-05-09 Thread Sasha Levin
On Mon, 2011-05-09 at 18:35 +0800, Asias He wrote:
 On 05/07/2011 06:50 PM, Sasha Levin wrote:
  Signed-off-by: Sasha Levin levinsasha...@gmail.com
  ---
   tools/kvm/virtio/blk.c |   14 +++---
   1 files changed, 7 insertions(+), 7 deletions(-)
  
  diff --git a/tools/kvm/virtio/blk.c b/tools/kvm/virtio/blk.c
  index accfc3e..cc3dc78 100644
  --- a/tools/kvm/virtio/blk.c
  +++ b/tools/kvm/virtio/blk.c
  @@ -45,7 +45,7 @@ struct blk_dev {
   
  struct virt_queue   vqs[NUM_VIRT_QUEUES];
  struct blk_dev_job  jobs[NUM_VIRT_QUEUES];
  -   struct pci_device_header	pci_device;
  +   struct pci_device_header	pci_hdr;
   };
   
   static struct blk_dev *bdevs[VIRTIO_BLK_MAX_DEV];
  @@ -103,7 +103,7 @@ static bool virtio_blk_pci_io_in(struct kvm *self, u16 
  port, void *data, int siz
  break;
  case VIRTIO_PCI_ISR:
  ioport__write8(data, 0x1);
  -   kvm__irq_line(self, bdev->pci_device.irq_line, 0);
  +   kvm__irq_line(self, bdev->pci_hdr.irq_line, 0);
  break;
  case VIRTIO_MSI_CONFIG_VECTOR:
  ioport__write16(data, bdev-config_vector);
  @@ -167,7 +167,7 @@ static void virtio_blk_do_io(struct kvm *kvm, void 
  *param)
  while (virt_queue__available(vq))
  virtio_blk_do_io_request(kvm, bdev, vq);
   
  -   kvm__irq_line(kvm, bdev->pci_device.irq_line, 1);
  +   kvm__irq_line(kvm, bdev->pci_hdr.irq_line, 1);
   }
   
   static bool virtio_blk_pci_io_out(struct kvm *self, u16 port, void *data, 
  int size, u32 count)
  @@ -283,7 +283,7 @@ void virtio_blk__init(struct kvm *self, struct 
  disk_image *disk)
  .blk_config = (struct virtio_blk_config) {
  .capacity   = disk->size / SECTOR_SIZE,
  },
  -   .pci_device = (struct pci_device_header) {
  +   .pci_hdr = (struct pci_device_header) {
  .vendor_id  = PCI_VENDOR_ID_REDHAT_QUMRANET,
  .device_id  = PCI_DEVICE_ID_VIRTIO_BLK,
  .header_type= PCI_HEADER_TYPE_NORMAL,
  @@ -298,10 +298,10 @@ void virtio_blk__init(struct kvm *self, struct 
  disk_image *disk)
  if (irq__register_device(PCI_DEVICE_ID_VIRTIO_BLK, &dev, &pin, &line) < 0)
  return;
   
  -   bdev->pci_device.irq_pin	= pin;
  -   bdev->pci_device.irq_line	= line;
  +   bdev->pci_hdr.irq_pin	= pin;
  +   bdev->pci_hdr.irq_line	= line;
   
  -   pci__register(&bdev->pci_device, dev);
  +   pci__register(&bdev->pci_hdr, dev);
   
  ioport__register(blk_dev_base_addr, virtio_blk_io_ops, 
  IOPORT_VIRTIO_BLK_SIZE);
   }
 
 Does this renaming apply to other devices as well? Mind to send followup
 patches?
 

No, it was virtio-blk specific.
I named the var pci_device when I've added multiple virtio-blk device
support.

-- 

Sasha.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm-s390: userspace access to guest storage keys

2011-05-09 Thread Alexander Graf

On 09.05.2011, at 13:20, Avi Kivity wrote:

 On 05/09/2011 01:11 PM, Alexander Graf wrote:
 
   And not in main memory, either?
 
 Nope - storage keys are only accessible using special instructions. They're 
 not in RAM (visible to a guest) :).
 
 
 Interesting, so where are they kept?  An on-chip memory?  That would limit 
 the amount of main memory to that indexed by the chip.  Extra off-chip memory?

I'll leave the answer to this question to the real experts :). All I know is
that the ISA defines special instructions to fetch them, so the question on 
where they are stored is implementation dependent. In qemu for example, they 
are kept in an extra array that's just malloc'ed.
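The QEMU approach Alex describes can be sketched as a toy model (names and layout here are assumptions, not the actual QEMU code): one key byte per guest page, held outside guest RAM, with ISKE/SSKE-style accessors.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12  /* s390 storage keys cover 4K pages */

/* Hypothetical model: one key byte per guest page, kept outside guest RAM. */
struct skey_map {
    uint8_t *keys;
    uint64_t npages;
};

static struct skey_map *skey_map_new(uint64_t ram_bytes)
{
    struct skey_map *m = malloc(sizeof(*m));
    m->npages = ram_bytes >> PAGE_SHIFT;
    m->keys = calloc(m->npages, 1);   /* keys start out as zero */
    return m;
}

/* Analogous to ISKE: fetch the key for the page containing addr. */
static uint8_t skey_get(struct skey_map *m, uint64_t addr)
{
    return m->keys[addr >> PAGE_SHIFT];
}

/* Analogous to SSKE: set the key for the page containing addr. */
static void skey_set(struct skey_map *m, uint64_t addr, uint8_t key)
{
    m->keys[addr >> PAGE_SHIFT] = key;
}
```

Since the architecture only defines accessor instructions, any implementation that keeps one key per page and answers those instructions is conforming; the array is simply the easiest such store.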

Alex



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-09 Thread Nadav Har'El
On Mon, May 09, 2011, Avi Kivity wrote about Re: [PATCH 0/30] nVMX: Nested 
VMX, v9:
 Okay, truly excellent.  The code is now a lot more readable, and I'm 
 almost beginning to understand it.  The code comments are also very 
 good, I wish we had the same quality comments in the rest of kvm.  We 
 can probably merge the next iteration if there aren't significant 
 comments from others.

Thanks!

 The only worrying thing is the issue you raise in patch 8.  Is there a 
 simple fix you can push that addresses correctness?

I'll fix this for the next iteration.
I wanted to avoid changing the existing vcpus_on_cpu machinery, but you're
probably right - it's better to just do this correctly once and for all than
to try to explain the problem away, or to pray that future processors also
continue to work properly if you forget to vmclear a vmcs...
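The machinery under discussion can be modeled as a per-CPU list of loaded VMCS-like objects that must all be cleared before the CPU's cached state can be assumed gone; a toy model, not the actual KVM code:

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4

/* Toy stand-in for a VMCS: tracks which CPU it is loaded on, -1 if none. */
struct vmcs_obj {
    int loaded_cpu;
    struct vmcs_obj *next;   /* link in the per-CPU list */
};

static struct vmcs_obj *loaded_on_cpu[NR_CPUS];

/* vmptrld-like: make the VMCS current on cpu and remember it there. */
static void vmcs_load(struct vmcs_obj *v, int cpu)
{
    v->loaded_cpu = cpu;
    v->next = loaded_on_cpu[cpu];
    loaded_on_cpu[cpu] = v;
}

/* vmclear-like sweep: flush every VMCS loaded on cpu, e.g. before kexec
 * or CPU offline, so no stale processor-cached VMCS state survives. */
static void vmcs_clear_all(int cpu)
{
    struct vmcs_obj *v = loaded_on_cpu[cpu];
    while (v) {
        struct vmcs_obj *next = v->next;
        v->loaded_cpu = -1;   /* "VMCLEAR" it */
        v->next = NULL;
        v = next;
    }
    loaded_on_cpu[cpu] = NULL;
}
```

The point of tracking the list explicitly is exactly what the thread says: you never have to hope a forgotten VMCS is harmless, because the sweep finds them all.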

-- 
Nadav Har'El|Monday, May  9 2011, 5 Iyyar 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |A diplomat thinks twice before saying
http://nadav.harel.org.il   |nothing.


KVM call agenda for May 10th

2011-05-09 Thread Juan Quintela

Please send in any agenda items you are interested in covering.

From last week, we have already:

 - import kvm headers into qemu, drop #ifdef maze (Jan)

Thanks, Juan.


Re: [PATCH 1/2] rcu: provide rcu_virt_note_context_switch() function.

2011-05-09 Thread Paul E. McKenney
On Mon, May 09, 2011 at 11:51:34AM +0300, Avi Kivity wrote:
 On 05/04/2011 07:35 PM, Paul E. McKenney wrote:
 On Wed, May 04, 2011 at 04:31:03PM +0300, Gleb Natapov wrote:
  Provide rcu_virt_note_context_switch() for virtualization use to note
   quiescent state during guest entry.
 
 Very good, queued on -rcu.
 
 Unless you tell me otherwise, I will assume that you want to carry the
 patch modifying KVM to use this.
 
 Is -rcu a fast-forward-only tree (like tip)?  If so I'll merge it
 and apply patch 2.

Yep, -rcu is subject to rebase and feeds into -tip.  The patch is
SHA 29ce83181dd757d3116bf774aafffc4b6b20 in

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu.git

Branch is rcu/next.  My guess is that this commit will show up in
-tip soon.
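A rough model of where such a call sits in a vcpu run loop (the kernel function is stubbed out here, and the names and signature are assumptions for illustration, not KVM code):

```c
#include <assert.h>

/* Stub standing in for the real kernel API: the real
 * rcu_virt_note_context_switch() reports a quiescent state for this CPU
 * so RCU grace periods need not disturb the vcpu while it runs guest code. */
static unsigned long quiescent_notes;

static void rcu_virt_note_context_switch(int cpu)
{
    (void)cpu;
    quiescent_notes++;   /* model: just count the reports */
}

/* Hypothetical vcpu loop body: note the quiescent state right before
 * guest entry, as the patch description says. */
static void vcpu_enter_guest_once(int cpu)
{
    rcu_virt_note_context_switch(cpu);
    /* ... VMLAUNCH/VMRESUME would happen here ... */
}
```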

Thanx, Paul


Re: Trouble adding kvm clock trace to qemu-kvm

2011-05-09 Thread Stefan Hajnoczi
On Sat, Apr 30, 2011 at 6:00 PM, Chris Thompson cth...@cs.umn.edu wrote:
 I'm trying to add a trace to qemu-kvm that will log the value of the vcpu's
 clock when a specific interrupt gets pushed. I'm working with
 qemu-kvm-0.14.0 on the 2.6.32-31 kernel. I've added the following to
 kvm_arch_try_push_interrupts in qemu-kvm-x86.c:

 if (irq == 41) {
    // Get the VCPU's TSC
    struct kvm_clock_data clock;
     kvm_vcpu_ioctl(env, KVM_GET_CLOCK, &clock);
    uint64_t ticks = clock.clock;
    trace_kvm_clock_at_injection(ticks);
 }

 And here's the trace event I added:

 kvm_clock_at_injection(uint64_t ticks) "interrupt 41 at clock %" PRIu64

 I have that trace and the virtio_blk_req_complete trace enabled. An excerpt
 from the resulting trace output from simpletrace.py:

 virtio_blk_req_complete 288390365546367 30461.681 req=46972352 status=0
 kvm_clock_at_injection 288390365546578 0.211 ticks=46972352
 virtio_blk_req_complete 288390394870065 29323.487 req=46972352 status=0
 kvm_clock_at_injection 288390394870276 0.211 ticks=46972352

Did you modify simpletrace.py?  The 288390365546367 field should
not be there.  The output format should be:
trace-event-name delta-microseconds [arg0=val0...]

It looks like your simpletrace.py may be pretty-printing trace records
incorrectly.
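Stefan's point can be sketched in C (the struct fields and names here are assumptions for illustration, not simpletrace's binary layout): the raw nanosecond timestamp is consumed to compute the printed delta and should not appear as a field of its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One parsed trace record: event name, raw timestamp in ns, one argument. */
struct record {
    const char *name;
    uint64_t timestamp_ns;
    uint64_t arg0;
};

/* Pretty-print the way the text describes, "name delta-us arg0=...":
 * the raw timestamp is only used to compute the delta, never printed.
 * Deltas (in ns) are also returned so callers can inspect them. */
static void print_records(const struct record *r, size_t n,
                          uint64_t *deltas_ns)
{
    uint64_t prev = n ? r[0].timestamp_ns : 0;
    for (size_t i = 0; i < n; i++) {
        deltas_ns[i] = r[i].timestamp_ns - prev;
        printf("%s %llu.%03llu ticks=%llu\n", r[i].name,
               (unsigned long long)(deltas_ns[i] / 1000),
               (unsigned long long)(deltas_ns[i] % 1000),
               (unsigned long long)r[i].arg0);
        prev = r[i].timestamp_ns;
    }
}
```

Run against the two adjacent records from the excerpt, this prints the 0.211 delta without the stray absolute timestamp.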

If you have a public git tree you can link to I'd be happy to check
that simpletrace.py is working.

Stefan


Re: VMSTATE_U64 vs. VMSTATE_UINT64

2011-05-09 Thread Juan Quintela
Jan Kiszka jan.kis...@siemens.com wrote:
 Hi guys,

 can anyone comment on commit e4d6d49061 (introduce VMSTATE_U64) in
 qemu-kvm again? I strongly suspect this thing was only introduced to be
 able to grab from a __u64 (for kvmclock) without generating a compiler
 warning that you may got when using uint64_t, right?

Yes, it was that on 64-bit u64 was unsigned long long while uint64_t was
only unsigned long, or something like that.  I have forgotten whether it
was kvmclock or irqchip.  I think that Marcelo also requested it at
some point, but I don't remember the details :-(
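The mismatch Juan describes can be demonstrated in a few lines (the typedef stands in for the kernel's __u64; whether uint64_t is unsigned long depends on the target ABI):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's __u64: unsigned long long on every arch,
 * while glibc's uint64_t is plain unsigned long on LP64 targets. */
typedef unsigned long long kernel_u64;

/* Same width, and values convert freely... */
static int widths_match(void)
{
    return sizeof(kernel_u64) == sizeof(uint64_t);
}

/* ...but on LP64 glibc, taking a uint64_t * to a kernel_u64 object,
 *     uint64_t *p = &some_kernel_u64;
 * draws an incompatible-pointer-type warning even though the
 * representation is identical. VMSTATE_U64 existed to avoid exactly
 * that warning when saving a __u64 field. */
```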

Later, Juan.


Re: Graphics pass-through

2011-05-09 Thread Alex Williamson
On Mon, 2011-05-09 at 13:14 +0200, Jan Kiszka wrote:
 On 2011-05-05 17:17, Alex Williamson wrote:
  And what about the host? When does Linux release the legacy range?
  Always or only when a specific (!=vga/vesa) framebuffer driver is loaded?
  
  Well, that's where it'd be nice if the vga arbiter was actually in more
  widespread use.  It currently seems to be nothing more than a shared
  mutex, but it would actually be useful if it included backends to do the
  chipset vga routing changes.  I think when I was testing this, I was
  externally poking PCI bridge chipset to toggle the VGA_EN bit.
 
 Right, we had to drop the approach to pass through the secondary card
 for now, the arbiter was not switching properly. Haven't checked yet if
 VGA_EN was properly set, though the kernel code looks like it should
 take care of this.
 
 Even with handing out the primary adapter, we had only mixed success so
 far. The onboard adapter worked well (in VESA mode), but the NVIDIA is
 not displaying early boot messages at all. Maybe a vgabios issue.
 Windows was booting nevertheless - until we installed the NVIDIA
  drivers. Then it ran into a blue screen.

Interesting, IIRC I could never get VESA modes to work.  I believe I
only had a basic VGA16 mode running in a Windows guest too.

 BTW, what ATI adapter did you use precisely, and what did work, what not?

I have an old X550 (rv380?).  I also have an Nvidia gs8400, but ISTR the
ATI working better for me.

 One thing I was wondering: Most modern adapters should be PCIe these
 days. Our NVIDIA definitely is. But so far we are claiming to have it
 attached to a PCI bus. That caps all the extended capabilities e.g.
 Could this make some relevant difference?

The BIOS and early boot use shouldn't care too much about that, but I
could imagine the high performance drivers potentially failing.  Thanks,

Alex




Re: Graphics pass-through

2011-05-09 Thread Prasad Joshi
On Mon, May 9, 2011 at 12:14 PM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2011-05-05 17:17, Alex Williamson wrote:
 And what about the host? When does Linux release the legacy range?
 Always or only when a specific (!=vga/vesa) framebuffer driver is loaded?

 Well, that's where it'd be nice if the vga arbiter was actually in more
 widespread use.  It currently seems to be nothing more than a shared
 mutex, but it would actually be useful if it included backends to do the
 chipset vga routing changes.  I think when I was testing this, I was
 externally poking PCI bridge chipset to toggle the VGA_EN bit.

 Right, we had to drop the approach to pass through the secondary card
 for now, the arbiter was not switching properly. Haven't checked yet if
 VGA_EN was properly set, though the kernel code looks like it should
 take care of this.

 Even with handing out the primary adapter, we had only mixed success so
 far. The onboard adapter worked well (in VESA mode), but the NVIDIA is
 not displaying early boot messages at all. Maybe a vgabios issue.
 Windows was booting nevertheless - until we installed the NVIDIA
 drivers. Then it ran into a blue screen.

 BTW, what ATI adapter did you use precisely, and what did work, what not?

Not hijacking the mail thread. Just wanted to provide some inputs.

Few days back I had tried passing through the secondary graphics card.
I could pass-through two graphics cards to virtual machine.

02:00.0 VGA compatible controller: ATI Technologies Inc Redwood
[Radeon HD 5670] (prog-if 00 [VGA controller])
Subsystem: PC Partner Limited Device e151
Flags: bus master, fast devsel, latency 0, IRQ 87
Memory at d000 (64-bit, prefetchable) [size=256M]
Memory at fe6e (64-bit, non-prefetchable) [size=128K]
I/O ports at b000 [size=256]
Expansion ROM at fe6c [disabled] [size=128K]
Capabilities: access denied
Kernel driver in use: radeon
Kernel modules: radeon

07:00.0 VGA compatible controller: nVidia Corporation G86 [Quadro NVS
290] (rev a1) (prog-if 00 [VGA controller])
   Subsystem: nVidia Corporation Device 0492
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast
>TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
   Latency: 0, Cache Line Size: 64 bytes
   Interrupt: pin A routed to IRQ 24
   Region 0: Memory at fd00 (32-bit, non-prefetchable) [size=16M]
   Region 1: Memory at d000 (64-bit, prefetchable) [size=256M]
   Region 3: Memory at fa00 (64-bit, non-prefetchable) [size=32M]
   Region 5: I/O ports at ec00 [size=128]
   Expansion ROM at fe9e [disabled] [size=128K]
   Capabilities: access denied
   Kernel driver in use: nouveau
   Kernel modules: nouveau, nvidiafb

Both of them are PCIe cards. I have one more ATI card and another
NVIDIA card which does not work.

One of the reasons the pass-through did not work is the limit SeaBIOS
places on the amount of PCI memory space: a hard limit of 256MB or so.
Thus, VGA devices that need more memory than that never worked for me.

SeaBIOS allows this memory region to be extended to some value near
512MB, but even then the range is not enough.

Another problem is how SeaBIOS hands out that memory space: it
allocates the BAR regions in the order they are encountered. As far
as I know, BAR regions must be naturally aligned, so this simple
strategy results in large fragmentation. Therefore, even after
increasing the PCI memory space to 512MB, some BAR regions remained
unallocated.
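The fragmentation effect can be illustrated with a toy allocator (the window and BAR sizes below are hypothetical, not SeaBIOS code): placing naturally aligned BARs in discovery order can exhaust a window that largest-first packing would fit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Place BARs at naturally aligned addresses in the given order, the way
 * a simple firmware pass might.  Sizes are powers of two.  Returns 1 if
 * everything fit in [base, limit), 0 if the window was exhausted. */
static int place_bars(const uint64_t *sizes, size_t n,
                      uint64_t base, uint64_t limit)
{
    uint64_t cur = base;
    for (size_t i = 0; i < n; i++) {
        cur = (cur + sizes[i] - 1) & ~(sizes[i] - 1);  /* natural alignment */
        if (cur + sizes[i] > limit)
            return 0;                                  /* window exhausted */
        cur += sizes[i];
    }
    return 1;
}
```

The hole left in front of a big, highly aligned BAR is pure waste under first-come placement; sorting BARs largest-first removes those holes, which is why later firmware allocators do exactly that.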

I will confirm you the details of other graphics cards which do not work.

Thanks and Regards,
Prasad


 One thing I was wondering: Most modern adapters should be PCIe these
 days. Our NVIDIA definitely is. But so far we are claiming to have it
 attached to a PCI bus. That caps all the extended capabilities e.g.
 Could this make some relevant difference?

 Jan

 --
 Siemens AG, Corporate Technology, CT T DE IT 1
 Corporate Competence Center Embedded Linux



Re: Graphics pass-through

2011-05-09 Thread Jan Kiszka
On 2011-05-09 16:29, Alex Williamson wrote:
 On Mon, 2011-05-09 at 13:14 +0200, Jan Kiszka wrote:
 On 2011-05-05 17:17, Alex Williamson wrote:
 And what about the host? When does Linux release the legacy range?
 Always or only when a specific (!=vga/vesa) framebuffer driver is loaded?

 Well, that's where it'd be nice if the vga arbiter was actually in more
 widespread use.  It currently seems to be nothing more than a shared
 mutex, but it would actually be useful if it included backends to do the
 chipset vga routing changes.  I think when I was testing this, I was
 externally poking PCI bridge chipset to toggle the VGA_EN bit.

 Right, we had to drop the approach to pass through the secondary card
 for now, the arbiter was not switching properly. Haven't checked yet if
 VGA_EN was properly set, though the kernel code looks like it should
 take care of this.

 Even with handing out the primary adapter, we had only mixed success so
 far. The onboard adapter worked well (in VESA mode), but the NVIDIA is
 not displaying early boot messages at all. Maybe a vgabios issue.
 Windows was booting nevertheless - until we installed the NVIDIA
  drivers. Then it ran into a blue screen.
 
 Interesting, IIRC I could never get VESA modes to work.  I believe I
 only had a basic VGA16 mode running in a Windows guest too.
 
 BTW, what ATI adapter did you use precisely, and what did work, what not?
 
 I have an old X550 (rv380?).  I also have an Nvidia gs8400, but ISTR the
 ATI working better for me.

Is that Nvidia a PCIe adapter? Did it show BIOS / early boot messages
properly?

BTW, we are fighting with a Quadro FX 3800.

 
 One thing I was wondering: Most modern adapters should be PCIe these
 days. Our NVIDIA definitely is. But so far we are claiming to have it
 attached to a PCI bus. That caps all the extended capabilities e.g.
 Could this make some relevant difference?
 
 The BIOS and early boot use shouldn't care too much about that, but I
 could imagine the high performance drivers potentially failing.  Thanks,

Yeah, that was my thinking as well. But we will try to confirm this by
tracing the BIOS activities. There are reports that some adapters do
not allow reading the true cold-boot ROM content at runtime, so
booting those adapters inside the guest may fail to some degree.

Anyway, I've hacked on the q35 patches until they allowed me to boot a
Linux guest with an assigned PCIe Atheros WLAN adapter - all caps were
suddenly visible. Those bits are now on their way to our test box. Let's
see if they are able to change the BSOD a bit...

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: Graphics pass-through

2011-05-09 Thread Jan Kiszka
On 2011-05-09 16:55, Prasad Joshi wrote:
 On Mon, May 9, 2011 at 12:14 PM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2011-05-05 17:17, Alex Williamson wrote:
 And what about the host? When does Linux release the legacy range?
 Always or only when a specific (!=vga/vesa) framebuffer driver is loaded?

 Well, that's where it'd be nice if the vga arbiter was actually in more
 widespread use.  It currently seems to be nothing more than a shared
 mutex, but it would actually be useful if it included backends to do the
 chipset vga routing changes.  I think when I was testing this, I was
 externally poking PCI bridge chipset to toggle the VGA_EN bit.

 Right, we had to drop the approach to pass through the secondary card
 for now, the arbiter was not switching properly. Haven't checked yet if
 VGA_EN was properly set, though the kernel code looks like it should
 take care of this.

 Even with handing out the primary adapter, we had only mixed success so
 far. The onboard adapter worked well (in VESA mode), but the NVIDIA is
 not displaying early boot messages at all. Maybe a vgabios issue.
 Windows was booting nevertheless - until we installed the NVIDIA
 drivers. Then it ran into a blue screen.

 BTW, what ATI adapter did you use precisely, and what did work, what not?
 
 Not hijacking the mail thread. Just wanted to provide some inputs.

Much appreciated in fact!

 
 Few days back I had tried passing through the secondary graphics card.
 I could pass-through two graphics cards to virtual machine.
 
 02:00.0 VGA compatible controller: ATI Technologies Inc Redwood
 [Radeon HD 5670] (prog-if 00 [VGA controller])
   Subsystem: PC Partner Limited Device e151
   Flags: bus master, fast devsel, latency 0, IRQ 87
   Memory at d000 (64-bit, prefetchable) [size=256M]
   Memory at fe6e (64-bit, non-prefetchable) [size=128K]
   I/O ports at b000 [size=256]
   Expansion ROM at fe6c [disabled] [size=128K]
   Capabilities: access denied
   Kernel driver in use: radeon
   Kernel modules: radeon
 
 07:00.0 VGA compatible controller: nVidia Corporation G86 [Quadro NVS
 290] (rev a1) (prog-if 00 [VGA controller])
Subsystem: nVidia Corporation Device 0492
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
 ParErr-Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast
 TAbort-TAbort- MAbort- SERR- PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 24
Region 0: Memory at fd00 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at d000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at fa00 (64-bit, non-prefetchable) [size=32M]
Region 5: I/O ports at ec00 [size=128]
Expansion ROM at fe9e [disabled] [size=128K]
Capabilities: access denied
Kernel driver in use: nouveau
Kernel modules: nouveau, nvidiafb
 
 Both of them are PCIe cards. I have one more ATI card and another
 NVIDIA card which does not work.

Interesting. That may rule out missing PCIe capabilities as the source
of the NVIDIA driver indisposition.

Did you pass those cards each as primary to the guest, or was the
guest seeing multiple adapters? I presume you only got output after
early boot was completed, right?

To avoid having to deal with legacy I/O forwarding, we started with a
dual adapter setup in the hope to leave the primary guest adapter at
know-to-work cirrus-vga. But already in a native setup with on-board
primary + NVIDIA secondary, the NVIDIA Windows drivers refused to talk
to its hardware in this constellation.

 
 One of the reason the pass-through did not work is because of the
 limit on amount of pci configuration memory by SeaBIOS. SeaBIOS places
 a hard limit of 256MB or so on the amount of PCI memory space. Thus,
 for some of the VGA device that need more memory never worked for me.
 
 SeaBIOS allows this memory region to be extended to some value near
 512MB, but even then the range is not enough.
 
 Another problem with SeaBIOS which limits the amount of memory space
 is: SeaBIOS allocates the BAR regions as they are encountered. As far
 as I know, the BAR regions should be naturally aligned. Thus the
 simple strategy of the SeaBIOS results in large fragmentation.
 Therefore, even after increasing the PCI memory space to 512MB the BAR
 regions were unallocated.

That's an interesting trace! We'll check this here, but I bet it
contributes to the problems. Our FX 3800 has 1G memory...

 
 I will confirm you the details of other graphics cards which do not work.

TiA,
Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: Graphics pass-through

2011-05-09 Thread Prasad Joshi
On Mon, May 9, 2011 at 4:27 PM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2011-05-09 16:55, Prasad Joshi wrote:
 On Mon, May 9, 2011 at 12:14 PM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2011-05-05 17:17, Alex Williamson wrote:
 And what about the host? When does Linux release the legacy range?
 Always or only when a specific (!=vga/vesa) framebuffer driver is loaded?

 Well, that's where it'd be nice if the vga arbiter was actually in more
 widespread use.  It currently seems to be nothing more than a shared
 mutex, but it would actually be useful if it included backends to do the
 chipset vga routing changes.  I think when I was testing this, I was
 externally poking PCI bridge chipset to toggle the VGA_EN bit.

 Right, we had to drop the approach to pass through the secondary card
 for now, the arbiter was not switching properly. Haven't checked yet if
 VGA_EN was properly set, though the kernel code looks like it should
 take care of this.

 Even with handing out the primary adapter, we had only mixed success so
 far. The onboard adapter worked well (in VESA mode), but the NVIDIA is
 not displaying early boot messages at all. Maybe a vgabios issue.
 Windows was booting nevertheless - until we installed the NVIDIA
 drivers. Then it ran into a blue screen.

 BTW, what ATI adapter did you use precisely, and what did work, what not?

 Not hijacking the mail thread. Just wanted to provide some inputs.

 Much appreciated in fact!


 Few days back I had tried passing through the secondary graphics card.
 I could pass-through two graphics cards to virtual machine.

 02:00.0 VGA compatible controller: ATI Technologies Inc Redwood
 [Radeon HD 5670] (prog-if 00 [VGA controller])
       Subsystem: PC Partner Limited Device e151
       Flags: bus master, fast devsel, latency 0, IRQ 87
       Memory at d000 (64-bit, prefetchable) [size=256M]
       Memory at fe6e (64-bit, non-prefetchable) [size=128K]
       I/O ports at b000 [size=256]
       Expansion ROM at fe6c [disabled] [size=128K]
       Capabilities: access denied
       Kernel driver in use: radeon
       Kernel modules: radeon

 07:00.0 VGA compatible controller: nVidia Corporation G86 [Quadro NVS
 290] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: nVidia Corporation Device 0492
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
 ParErr-Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast
 TAbort-TAbort- MAbort- SERR- PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 24
        Region 0: Memory at fd00 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at d000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at fa00 (64-bit, non-prefetchable) [size=32M]
        Region 5: I/O ports at ec00 [size=128]
        Expansion ROM at fe9e [disabled] [size=128K]
        Capabilities: access denied
        Kernel driver in use: nouveau
        Kernel modules: nouveau, nvidiafb

 Both of them are PCIe cards. I have one more ATI card and another
 NVIDIA card which does not work.

 Interesting. That may rule out missing PCIe capabilities as source for
 the NVIDIA driver indisposition.

 Did you pass those cards each as primary to the guest, or was the
 guest seeing multiple adapters?

I passed the graphics device as a primary device to the guest virtual
machine, with -vga none parameter to disable the default vga device.

 I presume you only got output after
 early boot was completed, right?

Yes, you are correct. I got the display only after KMS was started.
The initial BIOS messages were not displayed.


 To avoid having to deal with legacy I/O forwarding, we started with a
 dual adapter setup in the hope to leave the primary guest adapter at
 known-to-work cirrus-vga. But already in a native setup with on-board
 primary + NVIDIA secondary, the NVIDIA Windows drivers refused to talk
 to its hardware in this constellation.


The Windows operating system never worked for me with either of the graphics cards.


 One of the reason the pass-through did not work is because of the
 limit on amount of pci configuration memory by SeaBIOS. SeaBIOS places
 a hard limit of 256MB or so on the amount of PCI memory space. Thus,
 for some of the VGA device that need more memory never worked for me.

 SeaBIOS allows this memory region to be extended to some value near
 512MB, but even then the range is not enough.

 Another problem with SeaBIOS which limits the amount of memory space
 is: SeaBIOS allocates the BAR regions as they are encountered. As far
 as I know, the BAR regions should be naturally aligned. Thus the
 simple strategy of the SeaBIOS results in large fragmentation.
 Therefore, even after increasing the PCI memory space to 512MB the BAR
 regions were unallocated.

 That's an interesting trace! We'll check this here, but I bet it
 contributes to the problems. Our FX 3800 has 1G memory...

Yes it 

Re: Graphics pass-through

2011-05-09 Thread Alex Williamson
On Mon, 2011-05-09 at 17:27 +0200, Jan Kiszka wrote:
 On 2011-05-09 16:55, Prasad Joshi wrote:
  On Mon, May 9, 2011 at 12:14 PM, Jan Kiszka jan.kis...@siemens.com wrote:
  On 2011-05-05 17:17, Alex Williamson wrote:
  And what about the host? When does Linux release the legacy range?
  Always or only when a specific (!=vga/vesa) framebuffer driver is loaded?
 
  Well, that's where it'd be nice if the vga arbiter was actually in more
  widespread use.  It currently seems to be nothing more than a shared
  mutex, but it would actually be useful if it included backends to do the
  chipset vga routing changes.  I think when I was testing this, I was
  externally poking PCI bridge chipset to toggle the VGA_EN bit.
 
  Right, we had to drop the approach to pass through the secondary card
  for now, the arbiter was not switching properly. Haven't checked yet if
  VGA_EN was properly set, though the kernel code looks like it should
  take care of this.
 
  Even with handing out the primary adapter, we had only mixed success so
  far. The onboard adapter worked well (in VESA mode), but the NVIDIA is
  not displaying early boot messages at all. Maybe a vgabios issue.
  Windows was booting nevertheless - until we installed the NVIDIA
  drivers. Then it ran into a blue screen.
 
  BTW, what ATI adapter did you use precisely, and what did work, what not?
  
  Not hijacking the mail thread. Just wanted to provide some inputs.
 
 Much appreciated in fact!
 
  
  Few days back I had tried passing through the secondary graphics card.
  I could pass-through two graphics cards to virtual machine.
  
  02:00.0 VGA compatible controller: ATI Technologies Inc Redwood
  [Radeon HD 5670] (prog-if 00 [VGA controller])
  Subsystem: PC Partner Limited Device e151
  Flags: bus master, fast devsel, latency 0, IRQ 87
  Memory at d000 (64-bit, prefetchable) [size=256M]
  Memory at fe6e (64-bit, non-prefetchable) [size=128K]
  I/O ports at b000 [size=256]
  Expansion ROM at fe6c [disabled] [size=128K]
  Capabilities: access denied
  Kernel driver in use: radeon
  Kernel modules: radeon
  
  07:00.0 VGA compatible controller: nVidia Corporation G86 [Quadro NVS
  290] (rev a1) (prog-if 00 [VGA controller])
 Subsystem: nVidia Corporation Device 0492
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
  ParErr-Stepping- SERR+ FastB2B- DisINTx-
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast
  TAbort-TAbort- MAbort- SERR- PERR- INTx-
 Latency: 0, Cache Line Size: 64 bytes
 Interrupt: pin A routed to IRQ 24
 Region 0: Memory at fd00 (32-bit, non-prefetchable) [size=16M]
 Region 1: Memory at d000 (64-bit, prefetchable) [size=256M]
 Region 3: Memory at fa00 (64-bit, non-prefetchable) [size=32M]
 Region 5: I/O ports at ec00 [size=128]
 Expansion ROM at fe9e [disabled] [size=128K]
 Capabilities: access denied
 Kernel driver in use: nouveau
 Kernel modules: nouveau, nvidiafb
  
  Both of them are PCIe cards. I have one more ATI card and another
  NVIDIA card which does not work.
 
 Interesting. That may rule out missing PCIe capabilities as source for
 the NVIDIA driver indisposition.
 
  Did you pass those cards each as primary to the guest, or was the
 guest seeing multiple adapters? I presume you only got output after
 early boot was completed, right?
 
 To avoid having to deal with legacy I/O forwarding, we started with a
 dual adapter setup in the hope to leave the primary guest adapter at
  known-to-work cirrus-vga. But already in a native setup with on-board
 primary + NVIDIA secondary, the NVIDIA Windows drivers refused to talk
 to its hardware in this constellation.
 
  
  One of the reason the pass-through did not work is because of the
  limit on amount of pci configuration memory by SeaBIOS. SeaBIOS places
  a hard limit of 256MB or so on the amount of PCI memory space. Thus,
  for some of the VGA device that need more memory never worked for me.
  
  SeaBIOS allows this memory region to be extended to some value near
  512MB, but even then the range is not enough.
  
  Another problem with SeaBIOS which limits the amount of memory space
  is: SeaBIOS allocates the BAR regions as they are encountered. As far
  as I know, the BAR regions should be naturally aligned. Thus the
  simple strategy of the SeaBIOS results in large fragmentation.
  Therefore, even after increasing the PCI memory space to 512MB the BAR
  regions were unallocated.
 
 That's an interesting trace! We'll check this here, but I bet it
 contributes to the problems. Our FX 3800 has 1G memory...

Yes, qemu leaves far too little MMIO space to think about assigning
graphics cards.  Both of my cards have 512MB and I hacked qemu to leave
a bigger gap via something like:

diff --git a/hw/pc.c b/hw/pc.c
index 0ea6d10..a6376f8 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -879,6 +879,8 @@ void 
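The hunk above is cut off in the archive; the underlying idea, carving a larger below-4G PCI hole out of guest RAM so big GPU BARs fit, can be sketched like this (constants and names are hypothetical, not the actual pc.c change):

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1024ULL * 1024 * 1024)

/* Hypothetical: with a pci_hole-sized MMIO window reserved below 4G,
 * any guest RAM beyond (4G - pci_hole) is relocated above 4G, leaving
 * room below for the BARs of a passed-through graphics card. */
static void split_ram(uint64_t ram_size, uint64_t pci_hole,
                      uint64_t *below_4g, uint64_t *above_4g)
{
    uint64_t limit = 4 * GIB - pci_hole;

    if (ram_size > limit) {
        *below_4g = limit;
        *above_4g = ram_size - limit;
    } else {
        *below_4g = ram_size;
        *above_4g = 0;
    }
}
```

Enlarging the hole trades below-4G RAM for MMIO space, which is exactly the knob a 512MB or 1G frame buffer needs.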

Re: Graphics pass-through

2011-05-09 Thread Jan Kiszka
On 2011-05-09 17:48, Alex Williamson wrote:
 On Mon, 2011-05-09 at 17:27 +0200, Jan Kiszka wrote:
 On 2011-05-09 16:55, Prasad Joshi wrote:
 On Mon, May 9, 2011 at 12:14 PM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2011-05-05 17:17, Alex Williamson wrote:
 And what about the host? When does Linux release the legacy range?
 Always or only when a specific (!=vga/vesa) framebuffer driver is loaded?

 Well, that's where it'd be nice if the vga arbiter was actually in more
 widespread use.  It currently seems to be nothing more than a shared
 mutex, but it would actually be useful if it included backends to do the
 chipset vga routing changes.  I think when I was testing this, I was
 externally poking PCI bridge chipset to toggle the VGA_EN bit.

 Right, we had to drop the approach to pass through the secondary card
 for now, the arbiter was not switching properly. Haven't checked yet if
 VGA_EN was properly set, though the kernel code looks like it should
 take care of this.

 Even with handing out the primary adapter, we had only mixed success so
 far. The onboard adapter worked well (in VESA mode), but the NVIDIA is
 not displaying early boot messages at all. Maybe a vgabios issue.
 Windows was booting nevertheless - until we installed the NVIDIA
 drivers. Then it ran into a blue screen.

 BTW, what ATI adapter did you use precisely, and what did work, what not?

 Not hijacking the mail thread. Just wanted to provide some inputs.

 Much appreciated in fact!


 Few days back I had tried passing through the secondary graphics card.
 I could pass-through two graphics cards to virtual machine.

 02:00.0 VGA compatible controller: ATI Technologies Inc Redwood
 [Radeon HD 5670] (prog-if 00 [VGA controller])
 Subsystem: PC Partner Limited Device e151
 Flags: bus master, fast devsel, latency 0, IRQ 87
 Memory at d000 (64-bit, prefetchable) [size=256M]
 Memory at fe6e (64-bit, non-prefetchable) [size=128K]
 I/O ports at b000 [size=256]
 Expansion ROM at fe6c [disabled] [size=128K]
 Capabilities: access denied
 Kernel driver in use: radeon
 Kernel modules: radeon

 07:00.0 VGA compatible controller: nVidia Corporation G86 [Quadro NVS
 290] (rev a1) (prog-if 00 [VGA controller])
Subsystem: nVidia Corporation Device 0492
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
 ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast
 >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 24
Region 0: Memory at fd00 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at d000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at fa00 (64-bit, non-prefetchable) [size=32M]
Region 5: I/O ports at ec00 [size=128]
Expansion ROM at fe9e [disabled] [size=128K]
Capabilities: access denied
Kernel driver in use: nouveau
Kernel modules: nouveau, nvidiafb

 Both of them are PCIe cards. I have one more ATI card and another
 NVIDIA card which does not work.

 Interesting. That may rule out missing PCIe capabilities as source for
 the NVIDIA driver indisposition.

 Did you passed those cards each as primary to the guest, or was the
 guest seeing multiple adapters? I presume you only got output after
 early boot was completed, right?

 To avoid having to deal with legacy I/O forwarding, we started with a
 dual adapter setup in the hope to leave the primary guest adapter at
 know-to-work cirrus-vga. But already in a native setup with on-board
 primary + NVIDIA secondary, the NVIDIA Windows drivers refused to talk
 to its hardware in this constellation.


 One of the reasons the pass-through did not work is the limit SeaBIOS
 places on the amount of PCI memory space. SeaBIOS imposes a hard limit
 of 256MB or so on that space, so VGA devices that need more memory
 never worked for me.

 SeaBIOS allows this memory region to be extended to some value near
 512MB, but even then the range is not enough.

 Another problem with SeaBIOS which limits the usable memory space is
 that SeaBIOS allocates the BAR regions as they are encountered. As far
 as I know, the BAR regions should be naturally aligned, so this simple
 strategy results in large fragmentation. Therefore, even after
 increasing the PCI memory space to 512MB, some BAR regions were left
 unallocated.

 That's an interesting trace! We'll check this here, but I bet it
 contributes to the problems. Our FX 3800 has 1G memory...
 
 Yes, qemu leaves far too little MMIO space to think about assigning
 graphics cards.  Both of my cards have 512MB and I hacked qemu to leave
 a bigger gap via something like:
 
 diff --git a/hw/pc.c b/hw/pc.c
 index 0ea6d10..a6376f8 100644
 --- a/hw/pc.c
 +++ b/hw/pc.c
 @@ -879,6 +879,8 @@ void pc_cpus_init(const char *cpu_model)
  }
  

Re: [PATCH 27/30] nVMX: Additional TSC-offset handling

2011-05-09 Thread Zachary Amsden

On 05/08/2011 01:29 AM, Nadav Har'El wrote:

In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to
emulate this MSR write by L2 by modifying vmcs02.tsc_offset. We also need to
set vmcs12.tsc_offset, for this change to survive the next nested entry (see
prepare_vmcs02()).
   


Both changes look correct to me.

Zach
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)

2011-05-09 Thread Zachary Amsden

On 05/08/2011 12:06 PM, Nikola Ciprich wrote:

OK,
I see.. the problem is, that I'm trying to hunt down bug causing hangs
when 2.6.32 guests try to run tcpdump - this seems to be reproducible even on 
latest 2.6.32.x, and seems like it depends on kvm-clock..
So I was thinking about bisecting between 2.6.32 and latest git which doesn't 
seem to suffer this problem but hitting another (different) problem in 2.6.32 
complicates things a bit :(
If somebody has some hint on how to proceed, I'd be more than grateful..
cheers
n.
   


What are you bisecting, the host kernel or the guest kernel, and what 
version is the host kernel?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)

2011-05-09 Thread Nikola Ciprich
The guest, because latest kernels do not suffer this problem, so I'd like to
find fix so it can be pushed to -stable (we're using 2.6.32.x)
host is currently 2.6.37 (and i'm currently testing 2.6.38 as well)
n.

On Mon, May 09, 2011 at 10:32:26AM -0700, Zachary Amsden wrote:
 On 05/08/2011 12:06 PM, Nikola Ciprich wrote:
 OK,
 I see.. the problem is, that I'm trying to hunt down bug causing hangs
 when 2.6.32 guests try to run tcpdump - this seems to be reproducible even 
 on latest 2.6.32.x, and seems like it depends on kvm-clock..
 So I was thinking about bisecting between 2.6.32 and latest git which 
 doesn't seem to suffer this problem but hitting another (different) problem 
 in 2.6.32 complicates things a bit :(
 If somebody has some hint on how to proceed, I'd be more than
 grateful..
 cheers
 n.


 What are you bisecting, the host kernel or the guest kernel, and what  
 version is the host kernel?


-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:+420 596 621 273
mobil:  +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-




Re: 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)

2011-05-09 Thread Zachary Amsden

On 05/09/2011 11:25 AM, Nikola Ciprich wrote:

The guest, because latest kernels do not suffer this problem, so I'd like to
find fix so it can be pushed to -stable (we're using 2.6.32.x)
host is currently 2.6.37 (and i'm currently testing 2.6.38 as well)
n.


That's a pretty wide range to be bisecting, and I think we know for a 
fact there were some kvmclock related bugs in that range.


If you are looking for something causing problems with tcpdump, I'd 
suggest getting rid of kvmclock in your testing and using TSC instead; 
if you're looking to verify that kvmclock related changed have been 
backported to -stable, rather than bisect and run into bugs, it would 
probably be faster to check the commit logs for arch/x86/kvm/x86.c and 
make sure you're not missing anything from me or Glauber that has been 
applied to the most recent branch.


Zach


Re: 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)

2011-05-09 Thread Nikola Ciprich
 That's a pretty wide range to be bisecting, and I think we know for a  
 fact there were some kvmclock related bugs in that range.
thats true, I might try to pick those that seem related and see if it
helpts..

 If you are looking for something causing problems with tcpdump, I'd  
 suggest getting rid of kvmclock in your testing and using TSC instead;  
that's the problem, I can't reproduce the problems without kvm-clock
enabled, so it must be related to it somehow..

 if you're looking to verify that kvmclock related changed have been  
 backported to -stable, rather than bisect and run into bugs, it would  
 probably be faster to check the commit logs for arch/x86/kvm/x86.c and  
 make sure you're not missing anything from me or Glauber that has been  
 applied to the most recent branch.
yup, I'll try and report...
thanks for the hints!
n.



 Zach


-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:+420 596 621 273
mobil:  +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-




[PATCH v2] kvm: Add documentation for KVM_CAP_NR_VCPUS

2011-05-09 Thread Pekka Enberg
Document KVM_CAP_NR_VCPUS that can be used by the userspace to determine
maximum number of VCPUs it can create with the KVM_CREATE_VCPU ioctl.

Cc: Avi Kivity a...@redhat.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Cc: Jan Kiszka jan.kis...@web.de
Signed-off-by: Pekka Enberg penb...@kernel.org
---
 Documentation/kvm/api.txt |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index 9bef4e4..533da6b 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -175,7 +175,10 @@ Parameters: vcpu id (apic id on x86)
 Returns: vcpu fd on success, -1 on error
 
 This API adds a vcpu to a virtual machine.  The vcpu id is a small integer
-in the range [0, max_vcpus).
+in the range [0, max_vcpus).  The maximum value for max_vcpus can be
+queried at run-time with the KVM_CAP_NR_VCPUS extension of the
+KVM_CHECK_EXTENSION ioctl().  If KVM_CAP_NR_VCPUS is not available,
+you should assume that max_vcpus is at most 4.
 
 4.8 KVM_GET_DIRTY_LOG (vm ioctl)
 
-- 
1.7.0.4



Re: [PATCH] kvm tools: PCI -- Make PCI device numbers being unique

2011-05-09 Thread Pekka Enberg

From: Cyrill Gorcunov gorcu...@gmail.com
Subject: [PATCH] kvm tools: PCI -- Make PCI device numbers being unique v2

PCI device numbers must be unique on a bus (as a part
of Bus/Device/Function tuple). Make it so. Note the patch
is rather a fast fix since we need a bit more smart pci device
manager (in particular multiple virtio block devices most
probably should lay on a separate pci bus).

v2: Sasha spotted the nit in virtio_rng__init, ioport
   function was touched instead of irq__register_device.


Hey, I don't like the new patch subject trend you're trying to start at 
all. You can make it


  kvm tools,pci: Make PCI device numbers unique

but in this particular case PCI already appears in the title so

  kvm tools: Make PCI device numbers unique

is the right thing to do.

In addition, the changelog doesn't really tell me much. Does it fix 
something? Why would we need a smart pci device manager and why is that 
relevant for this patch? Hmmh?


Pekka


Re: [PATCH] kvm tools: PCI -- Make PCI device numbers being unique

2011-05-09 Thread Cyrill Gorcunov
On 05/09/2011 11:53 PM, Pekka Enberg wrote:
 From: Cyrill Gorcunov gorcu...@gmail.com
 Subject: [PATCH] kvm tools: PCI -- Make PCI device numbers being unique v2

 PCI device numbers must be unique on a bus (as a part
 of Bus/Device/Function tuple). Make it so. Note the patch
 is rather a fast fix since we need a bit more smart pci device
 manager (in particular multiple virtio block devices most
 probably should lay on a separate pci bus).

 v2: Sasha spotted the nit in virtio_rng__init, ioport
function was touched instead of irq__register_device.
 
 Hey, I don't like the new patch subject trend you're trying to start at all. 
 You can make it
 
   kvm tools,pci: Make PCI device numbers unique
 
 but in this particular case PCI already appears in the title so
 
  kvm tools: Make PCI device numbers unique
 
 is the right thing to do.

PCI stands for the kvm tools subsystem, but if you like the last one more -- I'm
fine with it.

 
 In addition, the changelog doesn't really tell me much. Does it fix something?
 Why would we need a smart pci device manager and why is that relevant for 
 this
 patch? Hmmh?
 
 Pekka

The thing is that at the moment the IDs passed to the MP table are incorrect;
they are supposed to be 5 bits long (MP spec). The smart manager we need --
it's because there can be multiple virtio block devices and they _are_ to be
separate PCI devices, i.e. with their own numbers and INTx# assignments. As a
result we should probably have such virtio devices lie on a separate PCI bus,
or if the number of PCI devices exceeds the width of the address line, we
should move them to another PCI bus. That is what I had in mind, but I'm not
sure all this should go into the changelog.

-- 
Cyrill
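
[Editorial aside: the 5-bit constraint Cyrill mentions comes from how PCI encodes the Bus/Device/Function tuple. A minimal sketch with a hypothetical helper, not kvm tools code:]

```c
#include <assert.h>
#include <stdint.h>

/* PCI addresses are Bus/Device/Function tuples: 8-bit bus, 5-bit device,
 * 3-bit function. A device number that does not fit in 5 bits cannot be
 * encoded, which is why the IDs handed to the MP table must stay within
 * (and be unique in) that range. */
uint16_t bdf(uint8_t bus, uint8_t dev, uint8_t fn)
{
    assert(dev < 32 && fn < 8); /* 5-bit device, 3-bit function */
    return (uint16_t)((uint16_t)bus << 8 | dev << 3 | fn);
}
```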


Re: [PATCH v3 1/3] PCI: Track the size of each saved capability data area

2011-05-09 Thread Jesse Barnes
On Wed, 20 Apr 2011 14:31:33 -0600
Alex Williamson alex.william...@redhat.com wrote:

 -struct pci_cap_saved_state {
 - struct hlist_node next;
 +struct pci_cap_saved {
   char cap_nr;
 + unsigned int size;
   u32 data[0];
  };
  
 +struct pci_cap_saved_state {
 + struct hlist_node next;
 + struct pci_cap_saved saved;
 +};
 +
  struct pcie_link_state;
  struct pci_vpd;
  struct pci_sriov;
 @@ -366,7 +371,7 @@ static inline struct pci_cap_saved_state 
 *pci_find_saved_cap(
   struct hlist_node *pos;
  
   hlist_for_each_entry(tmp, pos, pci_dev-saved_cap_space, next) {
 - if (tmp-cap_nr == cap)
 + if (tmp-saved.cap_nr == cap)
   return tmp;
   }
   return NULL;

Looks pretty good in general.  But I think the naming makes it harder
to read than it ought to be.

So we have a pci_cap_saved_state, which implies capability info, and
that's fine.

But pci_cap_saved doesn't communicate much; maybe pci_cap_data or
pci_cap_saved_data would be better?

Thanks,
-- 
Jesse Barnes, Intel Open Source Technology Center
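
[Editorial aside: the layout under discussion is the common header-plus-variable-data pattern. A minimal userspace sketch with hypothetical names and field types, not the kernel structures:]

```c
#include <stdlib.h>
#include <string.h>

/* A fixed header followed by a flexible array member, allocated as one
 * block. Kernel style spells the flexible member data[0]; C99 uses data[]. */
struct cap_saved {
    char cap_nr;
    unsigned int size;    /* bytes of data[] that are valid */
    unsigned char data[];
};

struct cap_saved *cap_saved_alloc(char cap_nr, unsigned int size)
{
    /* One allocation covers header and data area together. */
    struct cap_saved *s = malloc(sizeof(*s) + size);
    if (!s)
        return NULL;
    s->cap_nr = cap_nr;
    s->size = size;
    memset(s->data, 0, size);
    return s;
}
```

Recording `size` next to the data is what lets a restore path know how much to copy back per capability.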


problem with DRBD backed iSCSI storage pool and KVM guests

2011-05-09 Thread Mark Nipper
So I've been struggling with configuring a proper HA
stack using DRBD on two dedicated, back-end storage nodes and
using KVM on two dedicated, front-end nodes (so four machines
total).

I'm stuck at just keeping an exported iSCSI LUN
consistent for one VM while switching over on the back-end DRBD
storage nodes.

In my testing, I think I've narrowed it down to KVM's
cache setting, but it doesn't make sense and it looks like it
will inhibit things later for live migration based on what I've
read on this list.

So the stack looks something like this.  I have the
bottom DRBD layer set to use its normal write-back cache settings
since both the back-end storage machines have battery backed
units for their RAID controllers and I assume that writes only
successfully return from the DRBD layer when both the storage
controller and the network synchronization protocol (using
protocol C of course) return a successful write (to the RAID
controller's cache and the DRBD partner).  I'm using a
primary/secondary setup for the DRBD component.

The next layer is the splicing up of the exported DRBD
device itself.  I'm using nested LVM (not cLVM) for this per the
DRBD documentation.  It's my understanding that cLVM shouldn't be
necessary since the volume group is only active on the primary
DRBD node, so no cluster locking should be needed.  Hopefully
that is correct.

On to the iSCSI layer, I'm using tgtd on the target side
on each back-end node and iscsid on the initiator side from the
front-end nodes.  I have the write cache on both the target and
initiator disabled as much as I seemingly can.  I'm passing the
crazy option for this via tgtadm:
---
mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0

since corosync is doing everything within the cluster stack to
set up and tear down the iSCSI target and LUN's instead of
defining the write-cache off option in /etc/tgt/targets.conf.  I
confirm that sdparm --get=WCE returns:
---
WCE 0  [cha: n, def:  0]

as expected from the initiator.  But I'm still not honestly sure
that the target daemon isn't using the page cache on the current
primary back-end node.  This might be the source of my problem,
but documentation on this is sparse and the mode_page above is
the best I could find along with the check via sdparm.

And finally, there's KVM itself.  In order to test all of
this, I created a random 1GB file from /dev/urandom on the guest
(using RHEL 6 for both the host and the guest).  I then copy the
random file to a new file and force the current back-end primary
node into standby.  This successfully restarts the entire stack
of components after less than say 10-15 seconds.  I have the
initiators set to:
---
node.session.timeo.replacement_timeout = -1

which should hang forever if I understand the configuration file
comments correctly and never report SCSI errors higher up the
stack.

Anyway, the fail over finishes, I diff the two files and
I also do md5sum on each.  Now, this is the part where I'm stuck.
If I define the virtual disk within KVM to use writethrough cache
mode, then while I see a bunch of:
---
Buffer I/O error on device dm-0, logical block ...
end_request: I/O error, dev vda, sector ...

those types of error messages, the cp finishes and the new file
seems to be a bit for bit copy of the original.  Everything
appears to have worked.

If I set the cache to none, which apparently I'll need to
do anyway for live migration to work (which is the ultimate goal
in all of this), then I see the same errors above (although they appear
immediately as soon as I initiate the standby operation on the cluster,
whereas in writethrough mode the messages don't show up for a bit) and
not only do the files typically differ, it's
usually not too long before the ext4 file system sitting on top
of vda starts to become very unhappy and gets remounted in
read-only mode.

So am I missing something here?  Using the -1 for the
iscsid configuration above, I assumed KVM would never even see
any sort of errors at all, but instead would simply hang
indefinitely until things came back.  Anyone else running a setup
like this?

Thanks for reading.  I can post configuration files as
needed or take this to the open-iscsi lists next if KVM doesn't
appear to be the issue at this point.

-- 
Mark Nipper
ni...@bitgnome.net (XMPP)
+1 979 575 3193
-
There are 10 kinds of people in the world; those who know binary
and those who don't.


Re: Patching guest kernel code for better performance from HOST

2011-05-09 Thread Dushyant Bansal

On Sunday 08 May 2011 02:22 AM, Alexander Graf wrote:

On 07.05.2011, at 22:32, Dushyant Bansal wrote:


Hi,

On patching the 'mfmsr' instruction with 'lwz', the guest exits when it tries to
execute that 'lwz' instruction. I am looking for possible causes for this exit.

Here are the details:
Initially,
pc: 0xc0019420, instruction: 0x7ca6 [mfmsr r0]
As this is a privileged instruction, this causes an exit.

qemu-system-ppc-4443  [000] 19733.740013: kvm_book3s_exit: exit=0x700 | 
pc=0xc0019420 | inst=0x7ca6 | msr=0x1032 | dar=0xe1736a00 | 
srr1=0x1004d032
qemu-system-ppc-4443  [000] 19733.740029: kvm_book3s_patch: return=0 | 
pc=0xc0019420 | inst=0x7ca6 | msr=0x1032 | new_inst=0x8000f05c
qemu-system-ppc-4443  [000] 19733.740030: kvm_ppc_instr: inst 2080374950 pc 
0xc0019420 emulate 0
qemu-system-ppc-4443  [000] 19733.740037: kvm_book3s_reenter: reentry r=1 | 
pc=0xc0019420

I patched this instruction with:
0x8000f05c: lwz r0, -4096(offset of msr)
This instruction reads the 'msr' field of the magic page into register r0.

Then, I do not increment the pc value, so the guest starts at the same pc which 
now points to the new patched instruction.

This 'lwz' instruction is causing an exit due to 'BOOK3S_INTERRUPT_PROGRAM' 
(exit_nr: 0x700).
What could be the reason for this exit? As, 'lwz' is not a privileged 
instruction, I am unable to think of any reason.

Did you flush the icache after you patched the instruction? See the function 
flush_icache_range. Without, your CPU still has the old instruction in its 
cache, making it trap again :).

Thanks.

I tried flush_icache_range((ulong)pc, (ulong)pc + 4);
The system becomes unresponsive and I have to force a shutdown.

Here, pc will have the address of guest instruction and 
flush_icache_range is called from host. Maybe, I am not using 
flush_icache_range in the correct way.

Also, my host os is ppc64 and the guest is ppc32.

I also tried:  flush_cache_all()
But the instruction is still present in the instruction cache.


Dushyant
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Patching guest kernel code for better performance from HOST

2011-05-09 Thread Alexander Graf

On 09.05.2011, at 12:34, Dushyant Bansal wrote:

 On Sunday 08 May 2011 02:22 AM, Alexander Graf wrote:
 On 07.05.2011, at 22:32, Dushyant Bansal wrote:
 
 Hi,
 
 On patching 'mfmsr' instruction with 'lwz', guest exits when it tries to 
 execute that 'lwz' instruction. I am looking for possible causes for this 
 exit.
 
 Here are the details:
 Initially,
 pc: 0xc0019420, instruction: 0x7ca6 [mfmsr r0]
 As this is a privileged instruction, this causes an exit.
 
 qemu-system-ppc-4443  [000] 19733.740013: kvm_book3s_exit: exit=0x700 | 
 pc=0xc0019420 | inst=0x7ca6 | msr=0x1032 | dar=0xe1736a00 | 
 srr1=0x1004d032
 qemu-system-ppc-4443  [000] 19733.740029: kvm_book3s_patch: return=0 | 
 pc=0xc0019420 | inst=0x7ca6 | msr=0x1032 | new_inst=0x8000f05c
 qemu-system-ppc-4443  [000] 19733.740030: kvm_ppc_instr: inst 2080374950 pc 
 0xc0019420 emulate 0
 qemu-system-ppc-4443  [000] 19733.740037: kvm_book3s_reenter: reentry r=1 | 
 pc=0xc0019420
 
 I patched this instruction with:
 0x8000f05c: lwz r0, -4096(offset of msr)
 This instruction reads the 'msr' field of the magic page into register r0.
 
 Then, I do not increment the pc value, so the guest starts at the same pc 
 which now points to the new patched instruction.
 
 This 'lwz' instruction is causing an exit due to 'BOOK3S_INTERRUPT_PROGRAM' 
 (exit_nr: 0x700).
 What could be the reason for this exit? As, 'lwz' is not a privileged 
 instruction, I am unable to think of any reason.
 Did you flush the icache after you patched the instruction? See the function 
 flush_icache_range. Without, your CPU still has the old instruction in its 
 cache, making it trap again :).
 Thanks.
 
 I tried flush_icache_range((ulong)pc, (ulong)pc + 4);
 The system becomes unresponsive and I have to force a shutdown.
 
 Here, pc will have the address of guest instruction and flush_icache_range is 
 called from host. Maybe, I am not using flush_icache_range in the correct way.
 Also, my host os is ppc64 and the guest is ppc32.
 
 I also tried:  flush_cache_all()
 But the instruction is still present in the instruction cache.

Just patch the _st function to flush the icache on the host virtual address 
every time it gets invoked :).


Alex
