[PATCH v2 06/10] KVM: PPC: Book3S HV: Don't access runnable threads list without vcore lock

2012-09-27 Thread Paul Mackerras
There were a few places where we were traversing the list of runnable
threads in a virtual core, i.e. vc->runnable_threads, without holding
the vcore spinlock.  This extends the places where we hold the vcore
spinlock to cover everywhere that we traverse that list.

Since we possibly need to sleep inside kvmppc_book3s_hv_page_fault,
this moves the call of it from kvmppc_handle_exit out to
kvmppc_vcpu_run, where we don't hold the vcore lock.

In kvmppc_vcore_blocked, we don't actually need to check whether
all vcpus are ceded and don't have any pending exceptions, since the
caller has already done that.  The caller (kvmppc_run_vcpu) wasn't
actually checking for pending exceptions, so we add that.

The change of if to while in kvmppc_run_vcpu is to make sure that we
never call kvmppc_remove_runnable() when the vcore state is RUNNING or
EXITING.

Signed-off-by: Paul Mackerras pau...@samba.org
---
v2: move RESUME_PAGE_FAULT defn to book3s_hv.c

 arch/powerpc/include/asm/kvm_asm.h |1 +
 arch/powerpc/kvm/book3s_hv.c   |   67 ++--
 2 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index 76fdcfe..aabcdba 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -118,6 +118,7 @@
 
 #define RESUME_FLAG_NV  (1<<0)  /* Reload guest nonvolatile state? */
 #define RESUME_FLAG_HOST (1<<1)  /* Resume host? */
+#define RESUME_FLAG_ARCH1  (1<<2)
 
 #define RESUME_GUEST    0
 #define RESUME_GUEST_NV RESUME_FLAG_NV
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 77dec0f..3a737a4 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -57,6 +57,9 @@
 /* #define EXIT_DEBUG_SIMPLE */
 /* #define EXIT_DEBUG_INT */
 
+/* Used to indicate that a guest page fault needs to be handled */
+#define RESUME_PAGE_FAULT  (RESUME_GUEST | RESUME_FLAG_ARCH1)
+
 static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
 static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
 
@@ -431,7 +434,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
  struct task_struct *tsk)
 {
int r = RESUME_HOST;
-   int srcu_idx;
 
vcpu->stat.sum_exits++;
 
@@ -491,16 +493,12 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
 * have been handled already.
 */
case BOOK3S_INTERRUPT_H_DATA_STORAGE:
-   srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = kvmppc_book3s_hv_page_fault(run, vcpu,
-   vcpu->arch.fault_dar, vcpu->arch.fault_dsisr);
-   srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
+   r = RESUME_PAGE_FAULT;
break;
case BOOK3S_INTERRUPT_H_INST_STORAGE:
-   srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = kvmppc_book3s_hv_page_fault(run, vcpu,
-   kvmppc_get_pc(vcpu), 0);
-   srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
+   vcpu->arch.fault_dar = kvmppc_get_pc(vcpu);
+   vcpu->arch.fault_dsisr = 0;
+   r = RESUME_PAGE_FAULT;
break;
/*
 * This occurs if the guest executes an illegal instruction.
@@ -984,22 +982,24 @@ static int on_primary_thread(void)
  * Run a set of guest threads on a physical core.
 * Called with vc->lock held.
  */
-static int kvmppc_run_core(struct kvmppc_vcore *vc)
+static void kvmppc_run_core(struct kvmppc_vcore *vc)
 {
struct kvm_vcpu *vcpu, *vcpu0, *vnext;
long ret;
u64 now;
int ptid, i, need_vpa_update;
int srcu_idx;
+   struct kvm_vcpu *vcpus_to_update[threads_per_core];
 
/* don't start if any threads have a signal pending */
need_vpa_update = 0;
list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list) {
if (signal_pending(vcpu->arch.run_task))
-   return 0;
-   need_vpa_update |= vcpu->arch.vpa.update_pending |
-   vcpu->arch.slb_shadow.update_pending |
-   vcpu->arch.dtl.update_pending;
+   return;
+   if (vcpu->arch.vpa.update_pending ||
+   vcpu->arch.slb_shadow.update_pending ||
+   vcpu->arch.dtl.update_pending)
+   vcpus_to_update[need_vpa_update++] = vcpu;
}
 
/*
@@ -1019,8 +1019,8 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
 */
if (need_vpa_update) {
spin_unlock(&vc->lock);
-   list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
-   kvmppc_update_vpas(vcpu);
+   for (i = 0; i < need_vpa_update; ++i)
+   kvmppc_update_vpas(vcpus_to_update[i]);
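
(The remaining hunks are cut off in this archive. For context, a minimal sketch, not the patch's actual remaining hunks, of how the RESUME_PAGE_FAULT value set in the exit handler could be consumed from kvmppc_vcpu_run, where the vcore lock is not held and it is safe to sleep:)

	int srcu_idx;

	do {
		r = kvmppc_run_vcpu(run, vcpu);

		if (r == RESUME_PAGE_FAULT) {
			/* vcore lock is not held here, so sleeping is allowed */
			srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
			r = kvmppc_book3s_hv_page_fault(run, vcpu,
					vcpu->arch.fault_dar,
					vcpu->arch.fault_dsisr);
			srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
		}
	} while (r == RESUME_GUEST);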

[PATCH v2 09/10] KVM: PPC: Book3S HV: Fix accounting of stolen time

2012-09-27 Thread Paul Mackerras
Currently the code that accounts stolen time tends to overestimate the
stolen time, and will sometimes report more stolen time in a DTL
(dispatch trace log) entry than has elapsed since the last DTL entry.
This can cause guests to underflow the user or system time measured
for some tasks, leading to ridiculous CPU percentages and total runtimes
being reported by top and other utilities.

In addition, the current code was designed for the previous policy where
a vcore would only run when all the vcpus in it were runnable, and so
only counted stolen time on a per-vcore basis.  Now that a vcore can
run while some of the vcpus in it are doing other things in the kernel
(e.g. handling a page fault), we need to count the time when a vcpu task
is preempted while it is not running as part of a vcore as stolen also.

To do this, we bring back the BUSY_IN_HOST vcpu state and extend the
vcpu_load/put functions to count preemption time while the vcpu is
in that state.  Handling the transitions between the RUNNING and
BUSY_IN_HOST states requires checking and updating two variables
(accumulated time stolen and time last preempted), so we add a new
spinlock, vcpu->arch.tbacct_lock.  This protects both the per-vcpu
stolen/preempt-time variables, and the per-vcore variables while this
vcpu is running the vcore.

Finally, we now don't count time spent in userspace as stolen time.
The task could be executing in userspace on behalf of the vcpu, or
it could be preempted, or the vcpu could be genuinely stopped.  Since
we have no way of dividing up the time between these cases, we don't
count any of it as stolen.

Signed-off-by: Paul Mackerras pau...@samba.org
---
v2: just rediffed for context changes, no real change from previous version.

 arch/powerpc/include/asm/kvm_host.h |5 ++
 arch/powerpc/kvm/book3s_hv.c|  127 ++-
 2 files changed, 117 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 1e8cbd1..3093896 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -559,12 +559,17 @@ struct kvm_vcpu_arch {
unsigned long dtl_index;
u64 stolen_logged;
struct kvmppc_vpa slb_shadow;
+
+   spinlock_t tbacct_lock;
+   u64 busy_stolen;
+   u64 busy_preempt;
 #endif
 };
 
 /* Values for vcpu->arch.state */
 #define KVMPPC_VCPU_NOTREADY   0
 #define KVMPPC_VCPU_RUNNABLE   1
+#define KVMPPC_VCPU_BUSY_IN_HOST   2
 
 /* Values for vcpu->arch.io_gpr */
 #define KVM_MMIO_REG_MASK  0x001f
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 61d2934..8b3c470 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -60,23 +60,74 @@
 /* Used to indicate that a guest page fault needs to be handled */
 #define RESUME_PAGE_FAULT  (RESUME_GUEST | RESUME_FLAG_ARCH1)
 
+/* Used as a null value for timebase values */
+#define TB_NIL (~(u64)0)
+
 static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
 static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
 
+/*
+ * We use the vcpu_load/put functions to measure stolen time.
+ * Stolen time is counted as time when either the vcpu is able to
+ * run as part of a virtual core, but the task running the vcore
+ * is preempted or sleeping, or when the vcpu needs something done
+ * in the kernel by the task running the vcpu, but that task is
+ * preempted or sleeping.  Those two things have to be counted
+ * separately, since one of the vcpu tasks will take on the job
+ * of running the core, and the other vcpu tasks in the vcore will
+ * sleep waiting for it to do that, but that sleep shouldn't count
+ * as stolen time.
+ *
+ * Hence we accumulate stolen time when the vcpu can run as part of
+ * a vcore using vc->stolen_tb, and the stolen time when the vcpu
+ * needs its task to do other things in the kernel (for example,
+ * service a page fault) in busy_stolen.  We don't accumulate
+ * stolen time for a vcore when it is inactive, or for a vcpu
+ * when it is in state RUNNING or NOTREADY.  NOTREADY is a bit of
+ * a misnomer; it means that the vcpu task is not executing in
+ * the KVM_VCPU_RUN ioctl, i.e. it is in userspace or elsewhere in
+ * the kernel.  We don't have any way of dividing up that time
+ * between time that the vcpu is genuinely stopped, time that
+ * the task is actively working on behalf of the vcpu, and time
+ * that the task is preempted, so we don't count any of it as
+ * stolen.
+ *
+ * Updates to busy_stolen are protected by arch.tbacct_lock;
+ * updates to vc->stolen_tb are protected by the arch.tbacct_lock
+ * of the vcpu that has taken responsibility for running the vcore
+ * (i.e. vc->runner).  The stolen times are measured in units of
+ * timebase ticks.  (Note that the != TB_NIL checks below are
+ * purely defensive; they should never fail.)
+ */
+
 void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
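
(The rest of this hunk is truncated in the archive. Purely as an illustration of the bookkeeping described in the comment above, and not the patch's actual function body, the load side could look roughly like this; vc->preempt_tb, vc->vcore_state and VCORE_INACTIVE are assumed names, everything else comes from the hunks shown:)

	struct kvmppc_vcore *vc = vcpu->arch.vcore;

	spin_lock(&vcpu->arch.tbacct_lock);
	if (vc->runner == vcpu && vc->vcore_state != VCORE_INACTIVE &&
	    vc->preempt_tb != TB_NIL) {
		/* the vcore was preempted while this vcpu was its runner */
		vc->stolen_tb += mftb() - vc->preempt_tb;
		vc->preempt_tb = TB_NIL;
	}
	if (vcpu->arch.state == KVMPPC_VCPU_BUSY_IN_HOST &&
	    vcpu->arch.busy_preempt != TB_NIL) {
		/* the vcpu task was preempted while busy in the host */
		vcpu->arch.busy_stolen += mftb() - vcpu->arch.busy_preempt;
		vcpu->arch.busy_preempt = TB_NIL;
	}
	spin_unlock(&vcpu->arch.tbacct_lock);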

Re: [PATCH] virtio-blk: Disable callback in virtblk_done()

2012-09-27 Thread Asias He
On 09/27/2012 08:10 AM, Rusty Russell wrote:
 Asias He as...@redhat.com writes:
 
 On 09/25/2012 10:36 AM, Asias He wrote:
 This reduces unnecessary interrupts that host could send to guest while
 guest is in the progress of irq handling.

 If one vcpu is handling the irq, while another interrupt comes, in
 handle_edge_irq(), the guest will mask the interrupt via mask_msi_irq()
 which is a very heavy operation that goes all the way down to host.

 Signed-off-by: Asias He as...@redhat.com
 ---

 Here are some performance numbers on qemu:
 
 I assume this is with qemu using kvm, not qemu in soft emulation? :)

Of course.

 
 Before:
 -
   seq-read  : io=0 B, bw=269730KB/s, iops=67432 , runt= 62200msec
   seq-write : io=0 B, bw=339716KB/s, iops=84929 , runt= 49386msec
   rand-read : io=0 B, bw=270435KB/s, iops=67608 , runt= 62038msec
   rand-write: io=0 B, bw=354436KB/s, iops=88608 , runt= 47335msec
 clat (usec): min=101 , max=138052 , avg=14822.09, stdev=11771.01
 clat (usec): min=96 , max=81543 , avg=11798.94, stdev=7735.60
 clat (usec): min=128 , max=140043 , avg=14835.85, stdev=11765.33
 clat (usec): min=109 , max=147207 , avg=11337.09, stdev=5990.35
   cpu  : usr=15.93%, sys=60.37%, ctx=7764972, majf=0, minf=54
   cpu  : usr=32.73%, sys=120.49%, ctx=7372945, majf=0, minf=1
   cpu  : usr=18.84%, sys=58.18%, ctx=7775420, majf=0, minf=1
   cpu  : usr=24.20%, sys=59.85%, ctx=8307886, majf=0, minf=0
   vdb: ios=8389107/8368136, merge=0/0, ticks=19457874/14616506,
 in_queue=34206098, util=99.68%
  43: interrupt in total: 887320
 fio --exec_prerun=echo 3 > /proc/sys/vm/drop_caches --group_reporting
 --ioscheduler=noop --thread --bs=4k --size=512MB --direct=1 --numjobs=16
 --ioengine=libaio --iodepth=64 --loops=3 --ramp_time=0
 --filename=/dev/vdb --name=seq-read --stonewall --rw=read
 --name=seq-write --stonewall --rw=write --name=rnd-read --stonewall
 --rw=randread --name=rnd-write --stonewall --rw=randwrite

 After:
 -
   seq-read  : io=0 B, bw=309503KB/s, iops=77375 , runt= 54207msec
   seq-write : io=0 B, bw=448205KB/s, iops=112051 , runt= 37432msec
   rand-read : io=0 B, bw=311254KB/s, iops=77813 , runt= 53902msec
   rand-write: io=0 B, bw=377152KB/s, iops=94287 , runt= 44484msec
 clat (usec): min=81 , max=90588 , avg=12946.06, stdev=9085.94
 clat (usec): min=57 , max=72264 , avg=8967.97, stdev=5951.04
 clat (usec): min=29 , max=101046 , avg=12889.95, stdev=9067.91
 clat (usec): min=52 , max=106152 , avg=10660.56, stdev=4778.19
   cpu  : usr=15.05%, sys=57.92%, ctx=7710941, majf=0, minf=54
   cpu  : usr=26.78%, sys=101.40%, ctx=7387891, majf=0, minf=2
   cpu  : usr=19.03%, sys=58.17%, ctx=7681976, majf=0, minf=8
   cpu  : usr=24.65%, sys=58.34%, ctx=8442632, majf=0, minf=4
   vdb: ios=8389086/8361888, merge=0/0, ticks=17243780/12742010,
 in_queue=30078377, util=99.59%
  43: interrupt in total: 1259639
 fio --exec_prerun=echo 3 > /proc/sys/vm/drop_caches --group_reporting
 --ioscheduler=noop --thread --bs=4k --size=512MB --direct=1 --numjobs=16
 --ioengine=libaio --iodepth=64 --loops=3 --ramp_time=0
 --filename=/dev/vdb --name=seq-read --stonewall --rw=read
 --name=seq-write --stonewall --rw=write --name=rnd-read --stonewall
 --rw=randread --name=rnd-write --stonewall --rw=randwrite

  drivers/block/virtio_blk.c | 19 +++
  1 file changed, 11 insertions(+), 8 deletions(-)

 diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
 index 53b81d5..0bdde8f 100644
 --- a/drivers/block/virtio_blk.c
 +++ b/drivers/block/virtio_blk.c
 @@ -274,15 +274,18 @@ static void virtblk_done(struct virtqueue *vq)
 unsigned int len;
  
 spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
 -   while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
 -   if (vbr->bio) {
 -   virtblk_bio_done(vbr);
 -   bio_done = true;
 -   } else {
 -   virtblk_request_done(vbr);
 -   req_done = true;
 +   do {
 +   virtqueue_disable_cb(vq);
 +   while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
 +   if (vbr->bio) {
 +   virtblk_bio_done(vbr);
 +   bio_done = true;
 +   } else {
 +   virtblk_request_done(vbr);
 +   req_done = true;
 +   }
 }
 -   }
 +   } while (!virtqueue_enable_cb(vq));
 /* In case queue is stopped waiting for more buffers. */
 if (req_done)
 blk_start_queue(vblk->disk->queue);
 
 Fascinating.  Please just confirm that VIRTIO_RING_F_EVENT_IDX is
 enabled?

Sure. It is enabled ;-)

 
 I forgot about the cool hack which MST put in to defer event updates
 using disable_cb/enable_cb.

Hmm, are you talking about virtqueue_enable_cb_delayed()?

 
 Applied!
 

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Gleb Natapov
On Tue, Sep 25, 2012 at 10:54:21AM +0200, Avi Kivity wrote:
 On 09/25/2012 10:09 AM, Raghavendra K T wrote:
  On 09/24/2012 09:36 PM, Avi Kivity wrote:
  On 09/24/2012 05:41 PM, Avi Kivity wrote:
 
 
  case 2)
  rq1 : vcpu1->wait(lockA) (spinning)
  rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]
 
  I agree that checking rq1 length is not proper in this case, and as
  you
  rightly pointed out, we are in trouble here.
  nr_running()/num_online_cpus() would give more accurate picture here,
  but it seemed costly. May be load balancer save us a bit here in not
  running to such sort of cases. ( I agree load balancer is far too
  complex).
 
  In theory preempt notifier can tell us whether a vcpu is preempted or
  not (except for exits to userspace), so we can keep track of whether
  we're overcommitted in kvm itself.  It also avoids false positives
  from other guests and/or processes being overcommitted while our vm
  is fine.
 
  It also allows us to cheaply skip running vcpus.
 
  Hi Avi,
 
  Could you please elaborate on how preempt notifiers can be used
  here to keep track of overcommit or skip running vcpus?
 
  Are we planning set some flag in sched_out() handler etc?
 
 
 Keep a bitmap kvm->preempted_vcpus.
 
 In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
 flag and our bit in kvm->preempted_vcpus.  On sched_in, if the flag is
 set, clear our bit in kvm->preempted_vcpus.  We can also keep a counter
 of preempted vcpus.
 
 We can use the bitmap and the counter to quickly see if spinning is
 worthwhile (if the counter is zero, better to spin).  If not, we can use
 the bitmap to select target vcpus quickly.
 
 The only problem is that in order to keep this accurate we need to keep
 the preempt notifiers active during exits to userspace.  But we can
 prototype this without this change, and add it later if it works.
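
(A minimal sketch of the scheme Avi describes above. The preempt notifier callbacks and preempt_notifier_to_vcpu() exist in kvm_main.c; the kvm->preempted_vcpus bitmap, the kvm->nr_preempted counter and the vcpu->preempted flag are illustrative names, not an existing interface:)

static void kvm_sched_out(struct preempt_notifier *pn,
			  struct task_struct *next)
{
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	/* still TASK_RUNNING here means we were preempted, not sleeping */
	if (current->state == TASK_RUNNING) {
		vcpu->preempted = true;
		set_bit(vcpu->vcpu_id, vcpu->kvm->preempted_vcpus);
		atomic_inc(&vcpu->kvm->nr_preempted);
	}
}

static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
{
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	if (vcpu->preempted) {
		vcpu->preempted = false;
		clear_bit(vcpu->vcpu_id, vcpu->kvm->preempted_vcpus);
		atomic_dec(&vcpu->kvm->nr_preempted);
	}
}

(The PLE handler could then check kvm->nr_preempted to decide whether spinning is worthwhile and walk kvm->preempted_vcpus to pick yield candidates.)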
 
Can a user return notifier be used instead? Set the bit in
kvm->preempted_vcpus on return to userspace.

--
Gleb.


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Avi Kivity
On 09/25/2012 03:40 PM, Raghavendra K T wrote:
 On 09/24/2012 07:46 PM, Raghavendra K T wrote:
 On 09/24/2012 07:24 PM, Peter Zijlstra wrote:
 On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
 However Rik had a genuine concern in the cases where runqueue is not
 equally distributed and lockholder might actually be on a different run
 queue but not running.

 Load should eventually get distributed equally -- that's what the
 load-balancer is for -- so this is a temporary situation.

 We already try and favour the non running vcpu in this case, that's what
 yield_to_task_fair() is about. If its still not eligible to run, tough
 luck.

 Yes, I agree.


 Do you think instead of using rq->nr_running, we could get a global
 sense of load using avenrun (something like avenrun/num_onlinecpus)

 To what purpose? Also, global stuff is expensive, so you should try and
 stay away from it as hard as you possibly can.

 Yes, that concern alone had made me fall back to rq->nr_running.

 Will come back with the result soon.
 
 Got the result with the patches:
 So here is the result,
 
 Tried this on a 32 core ple box with HT disabled. 32 guest vcpus with
 1x and 2x overcommits
 
 Base = 3.6.0-rc5 + ple handler optimization patches
 A = Base + checking rq_running in vcpu_on_spin() patch
 B = Base + checking rq->nr_running in sched/core
 C = Base - PLE
 
 ---+---+---+---+---+
|Ebizzy result (rec/sec higher is better)   |
 ---+---+---+---+---+
|Base   | A |  B| C |
 ---+---+---+---+---+
 1x | 2374.1250 | 7273.7500 | 5690.8750 |  7364.3750|
 2x | 2536.2500 | 2458.5000 | 2426.3750 |48.5000|
 ---+---+---+---+---+
 
% improvements w.r.t BASE
 ---++++
|  A |B   | C  |
 ---++++
 1x | 206.37603  |  139.70410 |  210.19323 |
 2x | -3.06555   |  -4.33218  |  -98.08773 |
 ---++++
 
 we are getting the benefit of almost PLE disabled case with this
 approach. With patch B, we have dropped a bit in gain.
 (because we still would iterate vcpus until we decide to do a directed
 yield).

This gives us a good case for tracking preemption on a per-vm basis.  As
long as we aren't preempted, we can keep the PLE window high, and also
return immediately from the handler without looking for candidates.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Avi Kivity
On 09/25/2012 04:21 PM, Takuya Yoshikawa wrote:
 On Tue, 25 Sep 2012 10:12:49 +0200
 Avi Kivity a...@redhat.com wrote:
 
 It will.  The tradeoff is between false-positive costs (undercommit) and
 true positive costs (overcommit).  I think undercommit should perform
 well no matter what.
 
 If we utilize preempt notifiers to track overcommit dynamically, then we
 can vary the spin time dynamically.  Keep it long initially, as we get
 more preempted vcpus make it shorter.
 
 What will happen if we pin each vcpu thread to some core?
 I don't want to see so many vcpu threads moving around without
 being pinned at all.

If you do that you've removed a lot of flexibility from the scheduler,
so overcommit becomes even less likely to work well (a trivial example
is pinning two vcpus from the same vm to the same core -- it's so
obviously bad no one considers doing it).

 In that case, we don't want to make KVM do any work of searching
 a vcpu thread to yield to.

Why not?  If a vcpu thread on another core has been preempted, and is
the lock holder, and we can boost it, then we've fixed our problem.
Even if the spinning thread keeps spinning because it is the only task
eligible to run on its core.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Avi Kivity
On 09/25/2012 04:43 PM, Jiannan Ouyang wrote:
 I've actually implemented this preempted_bitmap idea. 

Interesting, please share the code if you can.

 However, I'm doing this to expose this information to the guest, so the
 guest is able to know if the lock holder is preempted or not before
 spining. Right now, I'm doing experiment to show that this idea works.
 
 I'm wondering what do you guys think of the relationship between the
 pv_ticketlock approach and PLE handler approach. Are we going to adopt
 PLE instead of the pv ticketlock, and why?

Right now we're searching for the best solution.  The tradeoffs are more
or less:

PLE:
- works for unmodified / non-Linux guests
- works for all types of spins (e.g. smp_call_function*())
- utilizes an existing hardware interface (PAUSE instruction) so likely
more robust compared to a software interface

PV:
- has more information, so it can perform better

Given these tradeoffs, if we can get PLE to work for moderate amounts of
overcommit then I'll prefer it (even if it slightly underperforms PV).
If we are unable to make it work well, then we'll have to add PV.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Avi Kivity
On 09/27/2012 09:44 AM, Gleb Natapov wrote:
 On Tue, Sep 25, 2012 at 10:54:21AM +0200, Avi Kivity wrote:
 On 09/25/2012 10:09 AM, Raghavendra K T wrote:
  On 09/24/2012 09:36 PM, Avi Kivity wrote:
  On 09/24/2012 05:41 PM, Avi Kivity wrote:
 
 
  case 2)
   rq1 : vcpu1->wait(lockA) (spinning)
   rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]
 
  I agree that checking rq1 length is not proper in this case, and as
  you
  rightly pointed out, we are in trouble here.
  nr_running()/num_online_cpus() would give more accurate picture here,
  but it seemed costly. May be load balancer save us a bit here in not
  running to such sort of cases. ( I agree load balancer is far too
  complex).
 
  In theory preempt notifier can tell us whether a vcpu is preempted or
  not (except for exits to userspace), so we can keep track of whether
   we're overcommitted in kvm itself.  It also avoids false positives
  from other guests and/or processes being overcommitted while our vm
  is fine.
 
  It also allows us to cheaply skip running vcpus.
 
  Hi Avi,
 
  Could you please elaborate on how preempt notifiers can be used
  here to keep track of overcommit or skip running vcpus?
 
  Are we planning set some flag in sched_out() handler etc?
 
 
  Keep a bitmap kvm->preempted_vcpus.
 
 In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
  flag and our bit in kvm->preempted_vcpus.  On sched_in, if the flag is
  set, clear our bit in kvm->preempted_vcpus.  We can also keep a counter
 of preempted vcpus.
 
 We can use the bitmap and the counter to quickly see if spinning is
 worthwhile (if the counter is zero, better to spin).  If not, we can use
 the bitmap to select target vcpus quickly.
 
 The only problem is that in order to keep this accurate we need to keep
 the preempt notifiers active during exits to userspace.  But we can
 prototype this without this change, and add it later if it works.
 
 Can a user return notifier be used instead? Set the bit in
 kvm->preempted_vcpus on return to userspace.
 

User return notifier is per-cpu, not per-task.  There is a new task_work
(linux/task_work.h) that does what you want.  With these
technicalities out of the way, I think it's the wrong idea.  If a vcpu
thread is in userspace, that doesn't mean it's preempted, there's no
point in boosting it if it's already running.

btw, we can have secondary effects.  A vcpu can be waiting for a lock in
the host kernel, or for a host page fault.  There's no point in boosting
anything for that.  Or a vcpu in userspace can be waiting for a lock
that is held by another thread, which has been preempted.  This is (like
I think Peter already said) a priority inheritance problem.  However
with fine-grained locking in userspace, we can make it go away.  The
guest kernel is unlikely to access one device simultaneously from two
threads (and if it does, we just need to improve the threading in the
device model).

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Gleb Natapov
On Thu, Sep 27, 2012 at 10:59:21AM +0200, Avi Kivity wrote:
 On 09/27/2012 09:44 AM, Gleb Natapov wrote:
  On Tue, Sep 25, 2012 at 10:54:21AM +0200, Avi Kivity wrote:
  On 09/25/2012 10:09 AM, Raghavendra K T wrote:
   On 09/24/2012 09:36 PM, Avi Kivity wrote:
   On 09/24/2012 05:41 PM, Avi Kivity wrote:
  
  
   case 2)
    rq1 : vcpu1->wait(lockA) (spinning)
    rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]
  
   I agree that checking rq1 length is not proper in this case, and as
   you
   rightly pointed out, we are in trouble here.
   nr_running()/num_online_cpus() would give more accurate picture here,
   but it seemed costly. May be load balancer save us a bit here in not
   running to such sort of cases. ( I agree load balancer is far too
   complex).
  
   In theory preempt notifier can tell us whether a vcpu is preempted or
   not (except for exits to userspace), so we can keep track of whether
    we're overcommitted in kvm itself.  It also avoids false positives
   from other guests and/or processes being overcommitted while our vm
   is fine.
  
   It also allows us to cheaply skip running vcpus.
  
   Hi Avi,
  
   Could you please elaborate on how preempt notifiers can be used
   here to keep track of overcommit or skip running vcpus?
  
   Are we planning set some flag in sched_out() handler etc?
  
  
   Keep a bitmap kvm->preempted_vcpus.
  
  In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
   flag and our bit in kvm->preempted_vcpus.  On sched_in, if the flag is
   set, clear our bit in kvm->preempted_vcpus.  We can also keep a counter
  of preempted vcpus.
  
  We can use the bitmap and the counter to quickly see if spinning is
  worthwhile (if the counter is zero, better to spin).  If not, we can use
  the bitmap to select target vcpus quickly.
  
  The only problem is that in order to keep this accurate we need to keep
  the preempt notifiers active during exits to userspace.  But we can
  prototype this without this change, and add it later if it works.
  
   Can a user return notifier be used instead? Set the bit in
   kvm->preempted_vcpus on return to userspace.
  
 
 User return notifier is per-cpu, not per-task.  There is a new task_work
 (linux/task_work.h) that does what you want.  With these
 technicalities out of the way, I think it's the wrong idea.  If a vcpu
 thread is in userspace, that doesn't mean it's preempted, there's no
 point in boosting it if it's already running.
 
Ah, so you want to set bit in kvm->preempted_vcpus if task is _not_
TASK_RUNNING in sched_out (you wrote opposite in your email)? If a task 
is in userspace it is definitely not preempted.
 
 btw, we can have secondary effects.  A vcpu can be waiting for a lock in
 the host kernel, or for a host page fault.  There's no point in boosting
 anything for that.  Or a vcpu in userspace can be waiting for a lock
 that is held by another thread, which has been preempted. 
Do you mean a userspace spinlock? Because otherwise a task that waits on
a kernel lock will sleep in the kernel.

This is (like
 I think Peter already said) a priority inheritance problem.  However
 with fine-grained locking in userspace, we can make it go away.  The
 guest kernel is unlikely to access one device simultaneously from two
 threads (and if it does, we just need to improve the threading in the
 device model).
 
 -- 
 error compiling committee.c: too many arguments to function

--
Gleb.


Re: [RFC PATCH] KVM: x86: Skip request checking branches in vcpu_enter_guest() more effectively

2012-09-27 Thread Avi Kivity
On 09/26/2012 04:06 AM, Takuya Yoshikawa wrote:
 On Mon, 24 Sep 2012 16:50:13 +0200
 Avi Kivity a...@redhat.com wrote:
 
 Afterwards, most exits are APIC and interrupt related, HLT, and MMIO.
 Of these, some are special (HLT, interrupt injection) and some are not
 (read/write most APIC registers).  I don't think one group dominates the
 other.  So already vcpu->requests processing is not such a slow path, it
 is relatively common.  We still see a lot of page faults during boot and
 during live migration though.
 
 With AVIC/APIC-V (still in the future) the mix will change again, with
 both special and non-special exits eliminated.  We'll be left mostly
 with APIC timer and HLT (and ICR writes for APIC-V).
 
 So maybe the direction of your patch makes sense.  Things like
 KVM_REQ_EVENT (or anything above 2-3% of exits) shouldn't be in
 vcpu->requests or maybe they deserve special treatment.
 
 I see the point.
 
 Since KVM_REQ_EVENT must be checked after handling some other requests,
 it needs special treatment anyway -- if we think defining it as the
 last flag for for_each_set_bit() is kind of special treatment.

Note that's an optimization - the code will work even if we don't do that.

But we can go further, like making it a bool, so it can be set without
an atomic operation.

It would depend on how frequent it is, in a real workload.

 
 As Gleb and you pointed out, KVM_REQ_STEAL_UPDATE needs to be fixed
 first not to be set unnecessarily.
 
 Then by special casing KVM_REQ_EVENT, one line change or moving it out
  from vcpu->requests, we can see if further improvement is needed.
 
 If a few requests exceed the threshold, 2-3% as you wrote?, we can also
 define a mask to indicate which requests should be treated as not unlikely.

Right, it will depend on each individual case.  For example synchronous
requests (can only happen from the vcpu thread) would be better off
outside vcpu->requests.

 
  BTW, schedule() is really rare?  We do either cond_resched() or
  heavy weight exit, no?
 
 If 25% of exits are HLT (like a ping workload), then 25% of your exits
 end up in schedule().
 
 On modern hardware, a relatively larger percentage of exits are
 heavyweight (same analysis as above).  On AVIC hardware most exits will
 be mmio, HLT, and host interrupts.  Of these only host interrupts that
 don't lead to a context switch will be lightweight.
 
  
  I always see vcpu threads actively move around the cores.
  (When I do not pin them.)
 
 Sure, but the frequency is quite low.  If not that's a bug.
 
 That's what I was originally testing for: if vcpu threads were being
 scheduled as expected.
 
 I forgot why I reached here.

Human minds also tend to move around a lot.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Avi Kivity
On 09/27/2012 11:11 AM, Gleb Natapov wrote:
 
 User return notifier is per-cpu, not per-task.  There is a new task_work
 (linux/task_work.h) that does what you want.  With these
 technicalities out of the way, I think it's the wrong idea.  If a vcpu
 thread is in userspace, that doesn't mean it's preempted, there's no
 point in boosting it if it's already running.
 
 Ah, so you want to set bit in kvm->preempted_vcpus if task is _not_
 TASK_RUNNING in sched_out (you wrote opposite in your email)? If a task 
 is in userspace it is definitely not preempted.

No, as I originally wrote.  If it's TASK_RUNNING when it saw sched_out,
then it is preempted (i.e. runnable), not sleeping on some waitqueue,
voluntarily (HLT) or involuntarily (page fault).

  
 btw, we can have secondary effects.  A vcpu can be waiting for a lock in
 the host kernel, or for a host page fault.  There's no point in boosting
 anything for that.  Or a vcpu in userspace can be waiting for a lock
 that is held by another thread, which has been preempted. 
 Do you mean userspace spinlock? Because otherwise task that's waits on
 a kernel lock will sleep in the kernel.

I meant a kernel mutex.

vcpu 0: take guest spinlock
vcpu 0: vmexit
vcpu 0: spin_lock(some_lock)
vcpu 1: take same guest spinlock
vcpu 1: PLE vmexit
vcpu 1: wtf?

Waiting on a host kernel spinlock is not too bad because we expect to be
out shortly.  Waiting on a host kernel mutex can be a lot worse.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/25/2012 08:30 PM, Dor Laor wrote:

On 09/24/2012 02:02 PM, Raghavendra K T wrote:

On 09/24/2012 02:12 PM, Dor Laor wrote:

In order to help PLE and pvticketlock converge I thought that a small
test code should be developed to test this in a predictable,
deterministic way.

The idea is to have a guest kernel module that spawn a new thread each
time you write to a /sys/ entry.

Each such a thread spins over a spin lock. The specific spin lock is
also chosen by the /sys/ interface. Let's say we have an array of spin
locks *10 times the amount of vcpus.

All the threads are running a
while (1) {

spin_lock(my_lock);
sum += execute_dummy_cpu_computation(time);
spin_unlock(my_lock);

if (sys_tells_thread_to_die()) break;
}

print_result(sum);
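
(A rough sketch of what one such guest-module thread could look like; struct spin_worker, execute_dummy_cpu_computation() and the sysfs plumbing that creates the workers are hypothetical:)

static int spin_worker_fn(void *data)
{
	struct spin_worker *w = data;	/* carries the chosen lock and hold time */
	u64 sum = 0;

	while (!kthread_should_stop()) {
		spin_lock(w->lock);
		sum += execute_dummy_cpu_computation(w->hold_time);
		spin_unlock(w->lock);
		cond_resched();
	}

	pr_info("spin worker %d: sum = %llu\n", w->id, (unsigned long long)sum);
	return 0;
}

(Each write to the /sys entry would spawn one such kthread, pointed at the lock chosen through the same interface.)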

Instead of calling the kernel's spin_lock functions, clone them and make
the ticket lock order deterministic and known (like a linear walk of all
the threads trying to catch that lock).


By Cloning you mean hierarchy of the locks?


No, I meant to clone the implementation of the current spin lock code in
order to set any order you may like for the ticket selection.
(even for a non pvticket lock version)

For instance, let's say you have N threads trying to grab the lock, you
can always make the ticket go linearly from 1->2...->N.
Not sure it's a good idea, just a recommendation.


Also I believe time should be passed via sysfs / hardcoded for each
type of lock we are mimicking


Yap





This way you can easy calculate:
1. the score of a single vcpu running a single thread
2. the score of sum of all thread scores when #thread==#vcpu all
taking the same spin lock. The overall sum should be close as
possible to #1.
 3. Like #2 but #threads > #vcpus and other versions of #total vcpus
 (belonging to all VMs) > #pcpus.
4. Create #thread == #vcpus but let each thread have it's own spin
lock
5. Like 4 + 2

Hopefully this way will allows you to judge and evaluate the exact
overhead of scheduling VMs and threads since you have the ideal result
in hand and you know what the threads are doing.

My 2 cents, Dor



Thank you,
I think this is an excellent idea. ( Though I am trying to put all the
pieces together you mentioned). So overall we should be able to measure
the performance of pvspinlock/PLE improvements with a deterministic
load in guest.

Only thing I am missing is,
How to generate different combinations of the lock.

Okay, let me see if I can come with a solid model for this.



Do you mean the various options for PLE/pvticket/other? I haven't
thought of it and assumed its static but it can also be controlled
through the temporary /sys interface.



No, I am not there yet.

So In summary, we are suffering with inconsistent benchmark result,
while measuring the benefit of our improvement in PLE/pvlock etc..

So good point from your suggestion is,
- Giving predictability to workload that runs in guest, so that we have
pi-pi comparison of improvement.

- we can easily tune the workload via sysfs, and we can have script to
automate them.

What is complicated is:
- How can we simulate a workload close to what we measure with
benchmarks?
- How can we mimic lock holding time/ lock hierarchy close to the way
it is seen with real workloads (for e.g. highly contended zone lru lock
with similar amount of lockholding times).
- How close it would be to when we forget about other types of spinning
(for e.g, flush_tlb).

So I feel it is not as trivial as it looks like.



Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Gleb Natapov
On Thu, Sep 27, 2012 at 11:33:56AM +0200, Avi Kivity wrote:
 On 09/27/2012 11:11 AM, Gleb Natapov wrote:
  
  User return notifier is per-cpu, not per-task.  There is a new task_work
  (linux/task_work.h) that does what you want.  With these
  technicalities out of the way, I think it's the wrong idea.  If a vcpu
  thread is in userspace, that doesn't mean it's preempted, there's no
  point in boosting it if it's already running.
  
   Ah, so you want to set bit in kvm->preempted_vcpus if task is _not_
  TASK_RUNNING in sched_out (you wrote opposite in your email)? If a task 
  is in userspace it is definitely not preempted.
 
 No, as I originally wrote.  If it's TASK_RUNNING when it saw sched_out,
 then it is preempted (i.e. runnable), not sleeping on some waitqueue,
 voluntarily (HLT) or involuntarily (page fault).
 
Of course, I got it all backwards. Need more coffee.

   
  btw, we can have secondary effects.  A vcpu can be waiting for a lock in
  the host kernel, or for a host page fault.  There's no point in boosting
  anything for that.  Or a vcpu in userspace can be waiting for a lock
  that is held by another thread, which has been preempted. 
  Do you mean userspace spinlock? Because otherwise task that's waits on
  a kernel lock will sleep in the kernel.
 
 I meant a kernel mutex.
 
 vcpu 0: take guest spinlock
 vcpu 0: vmexit
 vcpu 0: spin_lock(some_lock)
 vcpu 1: take same guest spinlock
 vcpu 1: PLE vmexit
 vcpu 1: wtf?
 
 Waiting on a host kernel spinlock is not too bad because we expect to be
 out shortly.  Waiting on a host kernel mutex can be a lot worse.
 
We can't do much about it without PV spinlock since there is no
information about what vcpu holds which guest spinlock, no?

--
Gleb.


Re: [PATCH] virtio-blk: Disable callback in virtblk_done()

2012-09-27 Thread Paolo Bonzini
On 27/09/2012 02:10, Rusty Russell wrote:
  +do {
  +virtqueue_disable_cb(vq);
  +while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
  +if (vbr->bio) {
  +virtblk_bio_done(vbr);
  +bio_done = true;
  +} else {
  +virtblk_request_done(vbr);
  +req_done = true;
  +}
   }
  -}
  +} while (!virtqueue_enable_cb(vq));
   /* In case queue is stopped waiting for more buffers. */
   if (req_done)
   blk_start_queue(vblk->disk->queue);
 Fascinating.  Please just confirm that VIRTIO_RING_F_EVENT_IDX is
 enabled?

Yeah, it's a nice and cheap trick.  Stefan, I see that you had this in
virtio-scsi since even before I picked it up.  Do you remember how you
came up with it?

Paolo


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Avi Kivity
On 09/27/2012 11:58 AM, Gleb Natapov wrote:
 
   
  btw, we can have secondary effects.  A vcpu can be waiting for a lock in
  the host kernel, or for a host page fault.  There's no point in boosting
  anything for that.  Or a vcpu in userspace can be waiting for a lock
  that is held by another thread, which has been preempted. 
  Do you mean userspace spinlock? Because otherwise task that's waits on
  a kernel lock will sleep in the kernel.
 
 I meant a kernel mutex.
 
 vcpu 0: take guest spinlock
 vcpu 0: vmexit
 vcpu 0: spin_lock(some_lock)
 vcpu 1: take same guest spinlock
 vcpu 1: PLE vmexit
 vcpu 1: wtf?
 
 Waiting on a host kernel spinlock is not too bad because we expect to be
 out shortly.  Waiting on a host kernel mutex can be a lot worse.
 
 We can't do much about it without PV spinlock since there is not
 information about what vcpu holds which guest spinlock, no?

It doesn't help.  If the lock holder is waiting for another lock in the
host kernel, boosting it doesn't help even if we know who it is.  We
need to boost the real lock holder, but we have no idea who it is (and
even if we did, we often can't do anything about it).


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Gleb Natapov
On Thu, Sep 27, 2012 at 12:04:58PM +0200, Avi Kivity wrote:
 On 09/27/2012 11:58 AM, Gleb Natapov wrote:
  

   btw, we can have secondary effects.  A vcpu can be waiting for a lock in
   the host kernel, or for a host page fault.  There's no point in boosting
   anything for that.  Or a vcpu in userspace can be waiting for a lock
   that is held by another thread, which has been preempted. 
   Do you mean userspace spinlock? Because otherwise task that's waits on
   a kernel lock will sleep in the kernel.
  
  I meant a kernel mutex.
  
  vcpu 0: take guest spinlock
  vcpu 0: vmexit
  vcpu 0: spin_lock(some_lock)
  vcpu 1: take same guest spinlock
  vcpu 1: PLE vmexit
  vcpu 1: wtf?
  
  Waiting on a host kernel spinlock is not too bad because we expect to be
  out shortly.  Waiting on a host kernel mutex can be a lot worse.
  
  We can't do much about it without PV spinlock since there is not
  information about what vcpu holds which guest spinlock, no?
 
 It doesn't help.  If the lock holder is waiting for another lock in the
 host kernel, boosting it doesn't help even if we know who it is.  We
 need to boost the real lock holder, but we have no idea who it is (and
 even if we did, we often can't do anything about it).
 
Without PV lock we will boost random preempted vcpu instead of going to
sleep in the situation you described.

--
Gleb.


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/26/2012 05:57 PM, Konrad Rzeszutek Wilk wrote:

On Tue, Sep 25, 2012 at 05:00:30PM +0200, Dor Laor wrote:

On 09/24/2012 02:02 PM, Raghavendra K T wrote:

On 09/24/2012 02:12 PM, Dor Laor wrote:

In order to help PLE and pvticketlock converge I thought that a small
test code should be developed to test this in a predictable,
deterministic way.

The idea is to have a guest kernel module that spawn a new thread each
time you write to a /sys/ entry.

Each such a thread spins over a spin lock. The specific spin lock is
also chosen by the /sys/ interface. Let's say we have an array of spin
locks *10 times the amount of vcpus.

All the threads are running a
while (1) {

spin_lock(my_lock);
sum += execute_dummy_cpu_computation(time);
spin_unlock(my_lock);

if (sys_tells_thread_to_die()) break;
}

print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and make
the ticket lock order deterministic and known (like a linear walk of all
the threads trying to catch that lock).


By Cloning you mean hierarchy of the locks?


No, I meant to clone the implementation of the current spin lock
code in order to set any order you may like for the ticket
selection.
(even for a non pvticket lock version)


Wouldn't that defeat the purpose of trying the test the different
implementations that try to fix the lock-holder preemption problem?
You want something that you can shoe-in for all work-loads - also
for this test system.


Hmm true. I think it is indeed difficult to shoe-in all workloads.



Re: [PATCH v4] kvm/fpu: Enable fully eager restore kvm FPU

2012-09-27 Thread Avi Kivity
On 09/26/2012 07:54 AM, Hao, Xudong wrote:
 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of Avi Kivity
 Sent: Tuesday, September 25, 2012 4:16 PM
 To: Hao, Xudong
 Cc: kvm@vger.kernel.org; Zhang, Xiantao
 Subject: Re: [PATCH v4] kvm/fpu: Enable fully eager restore kvm FPU
 
 On 09/25/2012 04:32 AM, Hao, Xudong wrote:
  
   btw, it is clear that long term the fpu will always be eagerly loaded,
   as hosts and guests (and hardware) are updated.  At that time it will
   make sense to remove the lazy fpu code entirely.  But maybe that time is
   here already, since exits are rare and so the guest has a lot of chance
   to use the fpu, so eager fpu saves the #NM vmexit.
  
   Can you check a kernel compile on a westmere system?  If eager fpu is
   faster there than lazy fpu, we can just make the fpu always eager and
   remove quite a bit of code.
  
  I remember westmere does not support Xsave, do you want performance of
 fxsave/fxrstor ?
 
 Yes.   If a westmere is fast enough then we can probably justify it.  If
 you can run tests on Sandy/Ivy Bridge, even better.
 
 Run kernel compile on westmere, eager fpu is about 0.4% faster, seems eager 
 does not benefit it too much, so remain lazy fpu for lazy_allowed fpu state?

Why not make it eager all the time then?  It will simplify the code
quite a bit, no?

All I was looking for was no regressions, a small speedup is just a bonus.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Avi Kivity
On 09/27/2012 12:08 PM, Gleb Natapov wrote:
 On Thu, Sep 27, 2012 at 12:04:58PM +0200, Avi Kivity wrote:
 On 09/27/2012 11:58 AM, Gleb Natapov wrote:
  

   btw, we can have secondary effects.  A vcpu can be waiting for a lock 
   in
   the host kernel, or for a host page fault.  There's no point in 
   boosting
   anything for that.  Or a vcpu in userspace can be waiting for a lock
   that is held by another thread, which has been preempted. 
   Do you mean userspace spinlock? Because otherwise task that's waits on
   a kernel lock will sleep in the kernel.
  
  I meant a kernel mutex.
  
  vcpu 0: take guest spinlock
  vcpu 0: vmexit
  vcpu 0: spin_lock(some_lock)
  vcpu 1: take same guest spinlock
  vcpu 1: PLE vmexit
  vcpu 1: wtf?
  
  Waiting on a host kernel spinlock is not too bad because we expect to be
  out shortly.  Waiting on a host kernel mutex can be a lot worse.
  
  We can't do much about it without PV spinlock since there is not
  information about what vcpu holds which guest spinlock, no?
 
 It doesn't help.  If the lock holder is waiting for another lock in the
 host kernel, boosting it doesn't help even if we know who it is.  We
 need to boost the real lock holder, but we have no idea who it is (and
 even if we did, we often can't do anything about it).
 
 Without PV lock we will boost random preempted vcpu instead of going to
 sleep in the situation you described.

True.  In theory boosting a random vcpu shouldn't have any negative
effects though.  Right now the problem is that the boosting itself is
expensive.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/26/2012 06:27 PM, Andrew Jones wrote:

On Mon, Sep 24, 2012 at 02:36:05PM +0200, Peter Zijlstra wrote:

On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:

On 09/24/2012 05:04 PM, Peter Zijlstra wrote:

On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, PLE handler may
prove very costly, because there is no need to iterate over vcpus
and do unsuccessful yield_to burning CPU.


What's the costly thing? The vm-exit, the yield (which should be a nop
if its the only task there) or something else entirely?


Both vmexit and yield_to() actually,

because unsuccessful yield_to() overall is costly in PLE handler.

This is because when we have large guests, say 32/16 vcpus, and one
vcpu is holding lock, rest of the vcpus waiting for the lock, when they
do PL-exit, each of the vcpu try to iterate over rest of vcpu list in
the VM and try to do directed yield (unsuccessful). (O(n^2) tries).

this results is fairly high amount of cpu burning and double run queue
lock contention.

(if they were spinning probably lock progress would have been faster).
As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which
seems little complex to achieve currently.


OK, so the vmexit stays and we need to improve yield_to.


Can't we do this check sooner as well, as it only requires per-cpu data?
If we do it way back in kvm_vcpu_on_spin, then we avoid get_pid_task()
and a bunch of read barriers from kvm_for_each_vcpu. Also, moving the test
into kvm code would allow us to do other kvm things as a result of the
check in order to avoid some vmexits. It looks like we should be able to
avoid some without much complexity by just making a per-vm ple_window
variable, and then, when we hit the nr_running == 1 condition, also doing
vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP))
Reset the window to the default value when we successfully yield (and
maybe we should limit the number of bumps).


We indeed checked early in original undercommit patch and it has given
result closer to PLE disabled case. But Agree with Peter that it is ugly 
to export nr_running info to ple handler.
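
(For reference, a minimal sketch of the early check being discussed. A helper like this did not exist at the time; single_task_running() below stands in for whatever interface the scheduler would have to export, which is exactly the ugliness being debated:)

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	/*
	 * Undercommit case: nothing else is runnable on this cpu, so a
	 * directed yield cannot find a better task to run.  Return
	 * immediately instead of walking kvm->vcpus.
	 */
	if (single_task_running())
		return;

	/* ... existing candidate search and yield_to() logic ... */
}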


Looking at the result and comparing result of A and C,

Base = 3.6.0-rc5 + ple handler optimization patches
A = Base + checking rq_running in vcpu_on_spin() patch
B = Base + checking rq->nr_running in sched/core
C = Base - PLE

   % improvements w.r.t BASE
---++++
   |  A |B   | C  |
---++++
1x | 206.37603  |  139.70410 |  210.19323 |


I have a feeling that vmexit has not caused significant overhead
compared to iterating over vcpus in PLE handler.. Does it not sound so?

But

vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP))


is worth trying. I will have to see it eventually.







Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Andrew Jones
On Thu, Sep 27, 2012 at 03:19:45PM +0530, Raghavendra K T wrote:
 On 09/25/2012 08:30 PM, Dor Laor wrote:
 On 09/24/2012 02:02 PM, Raghavendra K T wrote:
 On 09/24/2012 02:12 PM, Dor Laor wrote:
 In order to help PLE and pvticketlock converge I thought that a small
 test code should be developed to test this in a predictable,
 deterministic way.
 
 The idea is to have a guest kernel module that spawn a new thread each
 time you write to a /sys/ entry.
 
 Each such a thread spins over a spin lock. The specific spin lock is
 also chosen by the /sys/ interface. Let's say we have an array of spin
 locks *10 times the amount of vcpus.
 
 All the threads are running a
 while (1) {
 
 spin_lock(my_lock);
 sum += execute_dummy_cpu_computation(time);
 spin_unlock(my_lock);
 
 if (sys_tells_thread_to_die()) break;
 }
 
 print_result(sum);
 
 Instead of calling the kernel's spin_lock functions, clone them and make
 the ticket lock order deterministic and known (like a linear walk of all
 the threads trying to catch that lock).
 
 By Cloning you mean hierarchy of the locks?
 
 No, I meant to clone the implementation of the current spin lock code in
 order to set any order you may like for the ticket selection.
 (even for a non pvticket lock version)
 
 For instance, let's say you have N threads trying to grab the lock, you
 can always make the ticket go linearly from 1->2...->N.
 Not sure it's a good idea, just a recommendation.
 
 Also I believe time should be passed via sysfs / hardcoded for each
 type of lock we are mimicking
 
 Yap
 
 
 
 This way you can easy calculate:
 1. the score of a single vcpu running a single thread
 2. the score of sum of all thread scores when #thread==#vcpu all
 taking the same spin lock. The overall sum should be close as
 possible to #1.
  3. Like #2 but #threads > #vcpus and other versions of #total vcpus
  (belonging to all VMs) > #pcpus.
 4. Create #thread == #vcpus but let each thread have it's own spin
 lock
 5. Like 4 + 2
 
 Hopefully this way will allows you to judge and evaluate the exact
 overhead of scheduling VMs and threads since you have the ideal result
 in hand and you know what the threads are doing.
 
 My 2 cents, Dor
 
 
 Thank you,
 I think this is an excellent idea. ( Though I am trying to put all the
 pieces together you mentioned). So overall we should be able to measure
 the performance of pvspinlock/PLE improvements with a deterministic
 load in guest.
 
 Only thing I am missing is,
 How to generate different combinations of the lock.
 
 Okay, let me see if I can come with a solid model for this.
 
 
 Do you mean the various options for PLE/pvticket/other? I haven't
 thought of it and assumed its static but it can also be controlled
 through the temporary /sys interface.
 
 
 No, I am not there yet.
 
 So In summary, we are suffering with inconsistent benchmark result,
 while measuring the benefit of our improvement in PLE/pvlock etc..

Are you measuring the combined throughput of all running guests, or
just looking at the results of the benchmarks in a single test guest?

I've done some benchmarking as well and my stddevs look pretty good for
kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each
overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that
full sequence of tests (one with the overcommit levels in scrambled
order). The relative stddevs for each of the sets of 5 runs look pretty
good, and the data for the 2 runs match nicely as well.

To try and get consistent results I do the following 
- interleave the memory of all guests across all numa nodes on the
  machine
- echo 0 > /proc/sys/kernel/randomize_va_space on both host and test
  guest
- echo 3 > /proc/sys/vm/drop_caches on both host and test guest before
  each run
- use a ramdisk for the benchmark output files on all running guests
- no periodically running services installed on the test guest
- HT is turned off as you do, although I'd like to try running again
  with it turned back on

Although, I still need to run again measuring the combined throughput
of all running vms (including the ones launched just to generate busy
vcpus). Maybe my results won't be as consistent then...

Drew

 
 So good point from your suggestion is,
 - Giving predictability to workload that runs in guest, so that we have
 pi-pi comparison of improvement.
 
 - we can easily tune the workload via sysfs, and we can have script to
 automate them.
 
 What is complicated is:
 - How can we simulate a workload close to what we measure with
 benchmarks?
 - How can we mimic lock holding time / lock hierarchy close to the way
 it is seen with real workloads (e.g. a highly contended zone lru lock
 with a similar amount of lock-holding time)?
 - How close would it be once we leave out other types of spinning
 (e.g. flush_tlb)?
 
 So I feel it is not as trivial as it looks.
 

Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Dor Laor

On 09/27/2012 11:49 AM, Raghavendra K T wrote:

On 09/25/2012 08:30 PM, Dor Laor wrote:

On 09/24/2012 02:02 PM, Raghavendra K T wrote:

On 09/24/2012 02:12 PM, Dor Laor wrote:

In order to help PLE and pvticketlock converge I thought that a small
test code should be developed to test this in a predictable,
deterministic way.

The idea is to have a guest kernel module that spawn a new thread each
time you write to a /sys/ entry.

Each such a thread spins over a spin lock. The specific spin lock is
also chosen by the /sys/ interface. Let's say we have an array of spin
locks *10 times the amount of vcpus.

All the threads are running a
while (1) {
	spin_lock(my_lock);
	sum += execute_dummy_cpu_computation(time);
	spin_unlock(my_lock);

	if (sys_tells_thread_to_die()) break;
}

print_result(sum);
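A minimal sketch of such a guest module (names like spinstress and dummy_work
are invented for illustration; the lock array and the per-lock /sys selection
described above are left out, and the hold-time knob would really come from
the /sys interface):

/*
 * Hypothetical sketch only: a guest module that runs a spinning kthread.
 * The real module would spawn one thread per write to a /sys entry and
 * pick the lock and hold time from there.
 */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <linux/sched.h>
#include <linux/err.h>

static DEFINE_SPINLOCK(test_lock);		/* one lock; real version: an array */
static unsigned long hold_loops = 10000;	/* "time" knob, via /sys in reality */
static struct task_struct *worker;

static unsigned long dummy_work(unsigned long loops)
{
	unsigned long i, sum = 0;

	for (i = 0; i < loops; i++)		/* burn cycles while holding the lock */
		sum += i;
	return sum;
}

static int spin_worker(void *unused)
{
	unsigned long sum = 0;

	while (!kthread_should_stop()) {
		spin_lock(&test_lock);
		sum += dummy_work(hold_loops);
		spin_unlock(&test_lock);
		cond_resched();
	}
	pr_info("spinstress: sum=%lu\n", sum);
	return 0;
}

static int __init spinstress_init(void)
{
	worker = kthread_run(spin_worker, NULL, "spinstress");
	return IS_ERR(worker) ? PTR_ERR(worker) : 0;
}

static void __exit spinstress_exit(void)
{
	kthread_stop(worker);
}

module_init(spinstress_init);
module_exit(spinstress_exit);
MODULE_LICENSE("GPL");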

Instead of calling the kernel's spin_lock functions, clone them and make
the ticket lock order deterministic and known (like a linear walk of all
the threads trying to catch that lock).


By Cloning you mean hierarchy of the locks?


No, I meant to clone the implementation of the current spin lock code in
order to set any order you may like for the ticket selection.
(even for a non pvticket lock version)

For instance, let's say you have N threads trying to grab the lock, you
can always make the ticket go linearly from 1->2...->N.
Not sure it's a good idea, just a recommendation.
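One way to read that -- a rough fragment, not an existing kernel lock API:
give every test thread a fixed turn number and hand the "lock" around in
that order:

/* Hypothetical fixed-order lock: thread i enters only when now_serving == i */
static atomic_t now_serving = ATOMIC_INIT(0);

static void ordered_lock(int my_turn)
{
	while (atomic_read(&now_serving) != my_turn)
		cpu_relax();
	smp_mb();	/* acquire: keep the critical section after the check */
}

static void ordered_unlock(int nthreads)
{
	smp_mb();	/* release: keep the critical section before the handoff */
	atomic_set(&now_serving, (atomic_read(&now_serving) + 1) % nthreads);
}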


Also I believe time should be passed via sysfs / hardcoded for each
type of lock we are mimicking


Yap





This way you can easily calculate:
1. the score of a single vcpu running a single thread
2. the score of the sum of all thread scores when #threads == #vcpus, all
taking the same spin lock. The overall sum should be as close as
possible to #1.
3. Like #2 but #threads > #vcpus, and other variations where #total vcpus
(belonging to all VMs) > #pcpus.
4. Create #threads == #vcpus but let each thread have its own spin
lock
5. Like 4 + 2

Hopefully this way will allow you to judge and evaluate the exact
overhead of scheduling VMs and threads since you have the ideal result
in hand and you know what the threads are doing.

My 2 cents, Dor



Thank you,
I think this is an excellent idea. ( Though I am trying to put all the
pieces together you mentioned). So overall we should be able to measure
the performance of pvspinlock/PLE improvements with a deterministic
load in guest.

Only thing I am missing is,
How to generate different combinations of the lock.

Okay, let me see if I can come up with a solid model for this.



Do you mean the various options for PLE/pvticket/other? I haven't
thought of it and assumed it's static, but it can also be controlled
through the temporary /sys interface.



No, I am not there yet.

So in summary, we are suffering from inconsistent benchmark results
while measuring the benefit of our improvements in PLE/pvlock etc.

So good point from your suggestion is,
- Giving predictability to the workload that runs in the guest, so that we
have an apples-to-apples comparison of improvement.

- we can easily tune the workload via sysfs, and we can have scripts to
automate them.

What is complicated is:
- How can we simulate a workload close to what we measure with
benchmarks?
- How can we mimic lock holding time / lock hierarchy close to the way
it is seen with real workloads (e.g. a highly contended zone lru lock
with a similar amount of lock-holding time)?


You can spin for a similar instruction count to the one you're interested in.


- How close would it be once we leave out other types of spinning
(e.g. flush_tlb)?

So I feel it is not as trivial as it looks.



Indeed this is mainly a tool that can serve to optimize a few synthetic
workloads.
I still believe that it is worth going through this exercise, since a 100%
predictable and controlled case can help us purely assess the state of
the PLE and pvticket code. Otherwise we're dealing w/ too many parameters
and assumptions at once.


Dor



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Avi Kivity
On 09/27/2012 12:28 PM, Andrew Jones wrote:

 No, I am not there yet.
 
 So in summary, we are suffering from inconsistent benchmark results
 while measuring the benefit of our improvements in PLE/pvlock etc.
 
 Are you measuring the combined throughput of all running guests, or
 just looking at the results of the benchmarks in a single test guest?
 
 I've done some benchmarking as well and my stddevs look pretty good for
 kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each
 overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that
 full sequence of tests (one with the overcommit levels in scrambled
 order). The relative stddevs for each of the sets of 5 runs look pretty
 good, and the data for the 2 runs match nicely as well.
 
 To try and get consistent results I do the following 
 - interleave the memory of all guests across all numa nodes on the
   machine
 - echo 0 > /proc/sys/kernel/randomize_va_space on both host and test
   guest
 - echo 3 > /proc/sys/vm/drop_caches on both host and test guest before
   each run
 - use a ramdisk for the benchmark output files on all running guests
 - no periodically running services installed on the test guest
 - HT is turned off as you do, although I'd like to try running again
   with it turned back on
 
 Although, I still need to run again measuring the combined throughput
 of all running vms (including the ones launched just to generate busy
 vcpus). Maybe my results won't be as consistent then...
 


Another way to test is to execute

 perf stat -e 'kvm_exit exit_reason==40' sleep 10

to see how many PAUSEs were intercepted in a given time (except I just
invented the filter syntax).  The fewer we get, the more useful work the
system does.  This ignores kvm_vcpu_on_spin overhead though, so it's
just a rough measure.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/27/2012 02:06 PM, Avi Kivity wrote:

On 09/25/2012 03:40 PM, Raghavendra K T wrote:

On 09/24/2012 07:46 PM, Raghavendra K T wrote:

On 09/24/2012 07:24 PM, Peter Zijlstra wrote:

On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:

However Rik had a genuine concern in the cases where runqueue is not
equally distributed and lockholder might actually be on a different run
queue but not running.


Load should eventually get distributed equally -- that's what the
load-balancer is for -- so this is a temporary situation.

We already try and favour the non running vcpu in this case, that's what
yield_to_task_fair() is about. If its still not eligible to run, tough
luck.


Yes, I agree.




Do you think instead of using rq->nr_running, we could get a global
sense of load using avenrun (something like avenrun/num_online_cpus)?


To what purpose? Also, global stuff is expensive, so you should try and
stay away from it as hard as you possibly can.


Yes, that concern alone made me fall back to rq->nr_running.

Will come back with the result soon.


Got the results with the patches; here they are.

Tried this on a 32-core PLE box with HT disabled, with 32 guest vcpus at
1x and 2x overcommit.

Base = 3.6.0-rc5 + ple handler optimization patches
A = Base + checking rq_running in vcpu_on_spin() patch
B = Base + checking rq->nr_running in sched/core
C = Base - PLE

---+-----------+-----------+-----------+-----------+
   | Ebizzy result (rec/sec, higher is better)     |
---+-----------+-----------+-----------+-----------+
   |   Base    |     A     |     B     |     C     |
---+-----------+-----------+-----------+-----------+
1x | 2374.1250 | 7273.7500 | 5690.8750 | 7364.3750 |
2x | 2536.2500 | 2458.5000 | 2426.3750 |   48.5000 |
---+-----------+-----------+-----------+-----------+

% improvements w.r.t BASE
---+------------+------------+------------+
   |     A      |     B      |     C      |
---+------------+------------+------------+
1x |  206.37603 |  139.70410 |  210.19323 |
2x |   -3.06555 |   -4.33218 |  -98.08773 |
---+------------+------------+------------+

we are getting almost the benefit of the PLE-disabled case with this
approach. With patch B, the gain dropped a bit (because we still
iterate over vcpus until we decide to do a directed yield).


This gives us a good case for tracking preemption on a per-vm basis.  As
long as we aren't preempted, we can keep the PLE window high, and also
return immediately from the handler without looking for candidates.


1) So do you think, deferring preemption patch ( Vatsa was mentioning
long back)  is also another thing worth trying, so we reduce the chance
of LHP.

IIRC, with defer preemption:
we will have a hook in the spinlock/unlock path to measure the depth of
lock held, and share it with the host scheduler (maybe via MSRs now).
The host scheduler 'prefers' not to preempt a lock-holding vcpu (or
rather gives it, say, one chance).
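A rough guest-side sketch of that hook (the shared structure and the hook
points are invented here, not an existing interface; registering the area
with the host, e.g. via a new MSR, is not shown):

#include <linux/percpu.h>

/*
 * Hypothetical: the guest publishes its spinlock nesting depth in a
 * per-vcpu area the host scheduler can read before preempting the vcpu.
 */
struct kvm_lock_hint {
	u32 lock_depth;		/* number of spinlocks this vcpu currently holds */
};

static DEFINE_PER_CPU(struct kvm_lock_hint, lock_hint);

static inline void pv_lock_enter(void)		/* called from the lock path */
{
	this_cpu_inc(lock_hint.lock_depth);
	barrier();	/* publish before entering the critical section */
}

static inline void pv_lock_exit(void)		/* called from the unlock path */
{
	barrier();
	this_cpu_dec(lock_hint.lock_depth);
}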

2) Looking at the result (comparing A & C), I do feel we have
significant overhead in iterating over vcpus (when compared even to a
vmexit), so we would still need the undercommit fix suggested by PeterZ
(improving by 140%)?
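For reference, the undercommit fix is roughly of this shape (a sketch, not
the actual patch; rq_has_single_task() is a stand-in for whatever runqueue
check the real patch uses, e.g. rq->nr_running == 1):

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	/*
	 * Sketch: if nothing else is runnable on this physical CPU the
	 * guest is likely undercommitted, so a directed yield cannot help.
	 * Return to the guest instead of iterating over all vcpus.
	 */
	if (rq_has_single_task(smp_processor_id()))
		return;

	/* ... existing candidate search and kvm_vcpu_yield_to() ... */
}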

So looking back at the threads/discussions so far, I am trying to
summarize them. I feel these, at least, are the potential candidates
to go in:

1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
2) Dynamically changing PLE window (Avi/Andrew/Chegu)
3) preempt_notify handler to identify preempted VCPUs (Avi)
4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
6) Pv spinlock
7) Jiannan's proposed improvements
8) Defer preemption patches

Did we miss anything (or added extra?)

So here are my action items:
- I plan to repost this series with what PeterZ and Rik suggested, with
performance analysis.
- I'll go back and explore (3) and (6).

Please Let me know..








Re: vga passthrough // questions about pci passthrough

2012-09-27 Thread Martin Wolf

I waited 2 months to get a more stable Ubuntu 12.10;
now that I don't get crashes every 5 minutes I'm going to test again.

As you remember, my only two problems are
first the reboot issue of the guest OS
and second the Xonar D2X audio card.

But first I would like to work on the reboot issue since
I guess this is far more important.

When I read the FAQ I found that one problem could be the VBIOS ROM access.
I tried -device pci-assign,...,romfile=... but have not found any output
or change.

Is there some documentation of the romfile option to generate usable
output? For testing I downloaded the BIOS of my video card from techpowerup;
they have a huge database and my card and BIOS revision is listed.
Was that ok, or do I have to retrieve the BIOS from inside the VM?

thank you in advance

Am 19.07.2012 18:16, schrieb Alex Williamson:

On Thu, 2012-07-19 at 16:54 +0200, Martin Wolf wrote:

I now tried the alpha of Ubuntu 12.10, which includes qemu-kvm 1.1
and a 3.5 kernel version.

The MSR problem is gone now; I was able to select SandyBridge as the
CPU topology, and this removed the MSR error messages.
That's the positive part...

Unfortunately I had no success with the other topics (rebootability and
the soundcard).

Rebootability:
the guest VM still crashes with a page fault bluescreen.

Soundcard:
it would be really nice if someone would be able to give me
some advice on how to debug this problem.

If kernel/qemu-kvm update didn't help, then the device probably doesn't
support the interrupt disable mechanism that allows shared interrupts.
Therefore the device needs to be on an exclusive interrupt on the host.
Look in /proc/interrupts and see what else is on the same interrupt
line.  Your output below indicates the device is on IRQ 11.  Sometimes
moving the device to a different physical slot will change the IRQ and
give you an exclusive interrupt, otherwise you need to find the devices
sharing that interrupt line and disable them by attaching the device to
pci-stub.  Thanks,

Alex


Am 18.07.2012 16:23, schrieb Alex Williamson:

On Wed, 2012-07-18 at 14:37 +0200, Martin Wolf wrote:

soundcard logs added

On 18.07.2012 14:08, Martin Wolf wrote:

On 18.07.2012 11:26, Jan Kiszka wrote:

On 2012-07-18 07:45, Martin Wolf wrote:

Hello,

I was able to pass through an AMD 7870 video card to my Win7 guest
machine.

Would you add it to http://www.linux-kvm.org/page/VGA_device_assignment?

Sure, I will prepare something.

my host is ubuntu 12.04 with stock kernel.
my system contains:
dq67sw q67 mainboard
i5-2400s cpu
sapphire 7870 amd videocard
xonar d2x (problems to passthrough)

for full functionality i just needed two options

- kernel : iommu=on
- kvm module: ignore_msrs=1
(if i would not set it the guest os would crash with a bluescreen)

Can you report (= kernel log) which MSRs are unknown to KVM?

Jul 18 14:03:33 kvm-xen kernel: [  437.309931] kvm: 3347: cpu1
kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
Jul 18 14:03:33 kvm-xen kernel: [  437.522724] kvm: 3347: cpu1
kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
Jul 18 14:03:33 kvm-xen kernel: [  437.522733] kvm: 3347: cpu1 ignored
rdmsr: 0x1c9
Jul 18 14:03:33 kvm-xen kernel: [  437.522736] kvm: 3347: cpu1 ignored
rdmsr: 0x60
Jul 18 14:03:33 kvm-xen kernel: [  437.522752] kvm: 3347: cpu1 ignored
rdmsr: 0x1c9
Jul 18 14:03:33 kvm-xen kernel: [  437.522755] kvm: 3347: cpu1 ignored
rdmsr: 0x60
Jul 18 14:03:33 kvm-xen kernel: [  437.522821] kvm: 3347: cpu1 ignored
rdmsr: 0x1c9
Jul 18 14:03:33 kvm-xen kernel: [  437.522823] kvm: 3347: cpu1 ignored
rdmsr: 0x60
Jul 18 14:03:33 kvm-xen kernel: [  437.522834] kvm: 3347: cpu1
kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
Jul 18 14:03:33 kvm-xen kernel: [  437.522840] kvm: 3347: cpu1 ignored
rdmsr: 0x1c9
Jul 18 14:03:33 kvm-xen kernel: [  437.522842] kvm: 3347: cpu1 ignored
rdmsr: 0x60
Jul 18 14:03:33 kvm-xen kernel: [  437.522865] kvm: 3347: cpu1 ignored
rdmsr: 0x1c9
Jul 18 14:03:33 kvm-xen kernel: [  437.522867] kvm: 3347: cpu1 ignored
rdmsr: 0x60
Jul 18 14:03:33 kvm-xen kernel: [  437.522921] kvm: 3347: cpu1
kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
Jul 18 14:03:33 kvm-xen kernel: [  437.523005] kvm: 3347: cpu1
kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
Jul 18 14:03:33 kvm-xen kernel: [  437.523081] kvm: 3347: cpu1
kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
Jul 18 14:03:33 kvm-xen kernel: [  437.523175] kvm: 3347: cpu1
kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
Jul 18 14:03:33 kvm-xen kernel: [  437.523248] kvm: 3347: cpu1
kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
Jul 18 14:03:33 kvm-xen kernel: [  437.52] kvm: 3347: cpu1
kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
Jul 18 14:03:33 kvm-xen kernel: [  437.523430] kvm: 3347: cpu1
kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop

I hope that's the info you need. I booted it with ignore_msrs=1 since
if I don't do that I get less output.
(Do you need it without the option?)


the unigine benchmark ran flawlessly
also the benchmark 

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/27/2012 02:20 PM, Avi Kivity wrote:

On 09/25/2012 04:43 PM, Jiannan Ouyang wrote:

I've actually implemented this preempted_bitmap idea.


Interesting, please share the code if you can.


However, I'm doing this to expose this information to the guest, so the
guest is able to know if the lock holder is preempted or not before
spinning. Right now, I'm running experiments to show that this idea works.
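The general shape of such an interface might be the following (field names
and the shared-area mechanism are invented here; the actual implementation
may well differ):

/*
 * Hypothetical per-vcpu area shared with the guest (similar in spirit to
 * the steal-time area). The host marks the vcpu as preempted when its
 * thread is scheduled out; the guest checks it before spinning hard.
 */
struct vcpu_preempt_info {
	u32 preempted;			/* non-zero while the vcpu is scheduled out */
};

/* host side: called from the vcpu thread's preempt notifier callbacks */
static void vcpu_mark_preempted(struct vcpu_preempt_info *info, u32 val)
{
	ACCESS_ONCE(info->preempted) = val;	/* 1 on sched_out, 0 on sched_in */
}

/* guest side: only keep spinning if the ticket holder is still running */
static bool lock_holder_running(struct vcpu_preempt_info *holders, int holder_cpu)
{
	return !ACCESS_ONCE(holders[holder_cpu].preempted);
}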

I'm wondering what do you guys think of the relationship between the
pv_ticketlock approach and PLE handler approach. Are we going to adopt
PLE instead of the pv ticketlock, and why?


Right now we're searching for the best solution.  The tradeoffs are more
or less:

PLE:
- works for unmodified / non-Linux guests
- works for all types of spins (e.g. smp_call_function*())
- utilizes an existing hardware interface (PAUSE instruction) so likely
more robust compared to a software interface

PV:
- has more information, so it can perform better


Should we also consider that we always have an edge here for non-PLE
machine?



Given these tradeoffs, if we can get PLE to work for moderate amounts of
overcommit then I'll prefer it (even if it slightly underperforms PV).
If we are unable to make it work well, then we'll have to add PV.


Avi,
Thanks for this summary.. It is of great help to proceed in right
direction..



Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

2012-09-27 Thread Marcelo Tosatti
On Thu, Sep 27, 2012 at 12:50:16AM +, Auld, Will wrote:
 Marcelo,
 
 I think I am missing something. There should be no needed changes to current 
 algorithms that exist today. Does it seem that I have broken Zachary's 
 implementation somehow?

Yes. compute_guest_tsc() function must take ia32_tsc_adjust into
account. guest_read_tsc (and the SVM equivalent) also.


 
 Thanks,
 
 Will
 
 -Original Message-
 From: Marcelo Tosatti [mailto:mtosa...@redhat.com] 
 Sent: Wednesday, September 26, 2012 5:29 PM
 To: Auld, Will
 Cc: kvm@vger.kernel.org; Avi Kivity; Zhang, Xiantao; Liu, Jinsong
 Subject: Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
 
 On Wed, Sep 26, 2012 at 10:58:46PM +, Auld, Will wrote:
  Avi, Still working on your suggestions.
  
  Marcelo,
  
  The purpose is to be able to run guests that implement this change and not 
  require they revert to the older method of adjusting the TSC. I am making 
  no assumption about whether the guest checks to see if the times are good 
  enough or just runs an algorithm every time but in any case this would 
  allow the simpler, cleaner and less expensive algorithm to run if it 
  exists. 
 
 Will, you can choose to not expose the feature. Correct?
 
 Because this conflicts with the model that has been envisioned and developed 
 by Zachary... for that model to continue to be functional you'll have to make 
 sure the TSC emulation is adjusted accordingly to consider IA32_TSC_ADJUST 
 (for example, when trapping TSC).
 
 From that point of view, the patch below is incomplete.
 
 ... or KVM can choose to never expose the feature via CPUID and handle TSC
 consistency itself (I understand your perspective of getting a task complete,
 but unfortunately from my POV it's not so simple).
 
  Thanks,
  
  Will
  
  The purpose of the IA32_TSC_ADJUST control is to make it easier for the 
  operating system (host) to decrease the delta between cores to an 
  acceptable value, so that applications can make use of direct RDTSC, 
  correct?
  
  Why is it necessary for the guests to make use of such interface, if the 
  hypervisor could provide proper TSC?
  
  (not against exposing it to the guests, just thinking out loud).
  
  That is, if the purpose of the IA32_TSC_ADJUST is to provide proper 
  synchronized TSC across cores, and newer guests which should already make 
  use of paravirt clock interface, what is the point of exposing the 
  feature?
  
  -Original Message-
  From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
  Sent: Wednesday, September 26, 2012 2:35 PM
  To: Auld, Will
  Cc: kvm@vger.kernel.org; Avi Kivity; Zhang, Xiantao
  Subject: Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
  
  On Wed, Sep 19, 2012 at 05:44:46PM +, Auld, Will wrote:
   From 9982bb73460b05c1328068aae047b14b2294e2da Mon Sep 17 00:00:00
   2001
   From: Will Auld will.a...@intel.com
   Date: Wed, 12 Sep 2012 18:10:56 -0700
   Subject: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
   
   CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported
   
    Basic design is to emulate the MSR by allowing reads and writes to a
    guest vcpu specific location to store the value of the emulated MSR while
    adding the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST
    value will be included in all reads of the TSC MSR, whether through rdmsr
    or rdtsc. This is of course as long as the "use TSC counter offsetting"
    VM-execution control is enabled, as well as the IA32_TSC_ADJUST control.

    However, because hardware will only return TSC + IA32_TSC_ADJUST +
    vmcs tsc_offset for a guest process when it does an rdtsc (with the
    correct settings), the value of our virtualized IA32_TSC_ADJUST must be
    stored in one of these three locations.
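In code, the design described above amounts to roughly the following (a
sketch only, not the actual patch; ia32_tsc_adjust and
adjust_tsc_offset_guest() are placeholders for whatever field and helper
fold a delta into the vmcs tsc_offset):

static int set_msr_tsc_adjust(struct kvm_vcpu *vcpu, u64 data)
{
	/*
	 * Fold the delta between the old and new IA32_TSC_ADJUST values
	 * into the vmcs tsc_offset so that both RDTSC and RDMSR(IA32_TSC)
	 * in the guest observe the adjustment, and remember the raw value
	 * so a later RDMSR(IA32_TSC_ADJUST) returns what the guest wrote.
	 */
	adjust_tsc_offset_guest(vcpu, (s64)(data - vcpu->arch.ia32_tsc_adjust));
	vcpu->arch.ia32_tsc_adjust = data;
	return 0;
}

static int get_msr_tsc_adjust(struct kvm_vcpu *vcpu, u64 *data)
{
	*data = vcpu->arch.ia32_tsc_adjust;
	return 0;
}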
  
  The purpose of the IA32_TSC_ADJUST control is to make it easier for the 
  operating system (host) to decrease the delta between cores to an 
  acceptable value, so that applications can make use of direct RDTSC, 
  correct?
  
  Why is it necessary for the guests to make use of such interface, if the 
  hypervisor could provide proper TSC?
  
  (not against exposing it to the guests, just thinking out loud).
  
  That is, if the purpose of the IA32_TSC_ADJUST is to provide proper 
  synchronized TSC across cores, and newer guests which should already make 
  use of paravirt clock interface, what is the point of exposing the feature?
  
The argument against storing it in the actual MSR is performance.
This is likely to be seldom used while the save/restore is required
on every transition. IA32_TSC_ADJUST was created as a way to solve
some issues with writing TSC itself, so that is not an option either.
The remaining option, defined above as our solution, has the problem
of returning incorrect vmcs tsc_offset values (unless we intercept
and fix, not done here) as mentioned above. However, more
problematic is that storing the data in vmcs tsc_offset

Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/27/2012 03:58 PM, Andrew Jones wrote:

On Thu, Sep 27, 2012 at 03:19:45PM +0530, Raghavendra K T wrote:

On 09/25/2012 08:30 PM, Dor Laor wrote:

On 09/24/2012 02:02 PM, Raghavendra K T wrote:

On 09/24/2012 02:12 PM, Dor Laor wrote:

In order to help PLE and pvticketlock converge I thought that a small
test code should be developed to test this in a predictable,
deterministic way.

The idea is to have a guest kernel module that spawn a new thread each
time you write to a /sys/ entry.

Each such a thread spins over a spin lock. The specific spin lock is
also chosen by the /sys/ interface. Let's say we have an array of spin
locks *10 times the amount of vcpus.

All the threads are running a
while (1) {
	spin_lock(my_lock);
	sum += execute_dummy_cpu_computation(time);
	spin_unlock(my_lock);

	if (sys_tells_thread_to_die()) break;
}

print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and make
the ticket lock order deterministic and known (like a linear walk of all
the threads trying to catch that lock).


By Cloning you mean hierarchy of the locks?


No, I meant to clone the implementation of the current spin lock code in
order to set any order you may like for the ticket selection.
(even for a non pvticket lock version)

For instance, let's say you have N threads trying to grab the lock, you
can always make the ticket go linearly from 1->2...->N.
Not sure it's a good idea, just a recommendation.


Also I believe time should be passed via sysfs / hardcoded for each
type of lock we are mimicking


Yap





This way you can easily calculate:
1. the score of a single vcpu running a single thread
2. the score of the sum of all thread scores when #threads == #vcpus, all
taking the same spin lock. The overall sum should be as close as
possible to #1.
3. Like #2 but #threads > #vcpus, and other variations where #total vcpus
(belonging to all VMs) > #pcpus.
4. Create #threads == #vcpus but let each thread have its own spin
lock
5. Like 4 + 2

Hopefully this way will allow you to judge and evaluate the exact
overhead of scheduling VMs and threads since you have the ideal result
in hand and you know what the threads are doing.

My 2 cents, Dor



Thank you,
I think this is an excellent idea. ( Though I am trying to put all the
pieces together you mentioned). So overall we should be able to measure
the performance of pvspinlock/PLE improvements with a deterministic
load in guest.

Only thing I am missing is,
How to generate different combinations of the lock.

Okay, let me see if I can come up with a solid model for this.



Do you mean the various options for PLE/pvticket/other? I haven't
thought of it and assumed it's static, but it can also be controlled
through the temporary /sys interface.



No, I am not there yet.

So in summary, we are suffering from inconsistent benchmark results
while measuring the benefit of our improvements in PLE/pvlock etc.


Are you measuring the combined throughput of all running guests, or
just looking at the results of the benchmarks in a single test guest?

I've done some benchmarking as well and my stddevs look pretty good for
kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each
overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that
full sequence of tests (one with the overcommit levels in scrambled
order). The relative stddevs for each of the sets of 5 runs look pretty
good, and the data for the 2 runs match nicely as well.

To try and get consistent results I do the following
- interleave the memory of all guests across all numa nodes on the
   machine
- echo 0 > /proc/sys/kernel/randomize_va_space on both host and test
   guest


I was not doing this.


- echo 3 > /proc/sys/vm/drop_caches on both host and test guest before
   each run


was doing already as you know


- use a ramdisk for the benchmark output files on all running guests


Yes.. this is also helpful


- no periodically running services installed on the test guest
- HT is turned off as you do, although I'd like to try running again
   with it turned back on
Although, I still need to run again measuring the combined throughput
of all running vms (including the ones launched just to generate busy
vcpus). Maybe my results won't be as consistent then...


May be. I take average from all the VMs..



Drew



So good point from your suggestion is,
- Giving predictability to the workload that runs in the guest, so that we
have an apples-to-apples comparison of improvement.

- we can easily tune the workload via sysfs, and we can have scripts to
automate them.

What is complicated is:
- How can we simulate a workload close to what we measure with
benchmarks?
- How can we mimic lock holding time / lock hierarchy close to the way
it is seen with real workloads (e.g. a highly contended zone lru lock
with a similar amount of lock-holding time)?
- How close would it be once we leave out other types of spinning
(e.g. flush_tlb)?

So I feel it is not as trivial as it looks.


Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

2012-09-27 Thread Marcelo Tosatti
On Thu, Sep 27, 2012 at 08:31:22AM -0300, Marcelo Tosatti wrote:
 On Thu, Sep 27, 2012 at 12:50:16AM +, Auld, Will wrote:
  Marcelo,
  
  I think I am missing something. There should be no needed changes to 
  current algorithms that exist today. Does it seem that I have broken 
  Zachary's implementation somehow?
 
 Yes. compute_guest_tsc() function must take ia32_tsc_adjust into
 account. guest_read_tsc (and the SVM equivalent) also.

Also, must take into account VMX-SVM migration. In that case, you
should export IA32_TSC_ADJUST along with IA32_TSC MSR.

Which brings us back to the initial question: if there are other means
to provide a stable TSC, why use this MSR? For example, VMware guests have
no need to use this MSR (because the hypervisor provides TSC
guarantees).

Then we come back to the two questions: 

- Is there anyone from Intel working on the Linux host side, where it
  makes sense to use this?

- Are you sure its worthwhile to expose this to KVM guests?

  
  Thanks,
  
  Will
  
  -Original Message-
  From: Marcelo Tosatti [mailto:mtosa...@redhat.com] 
  Sent: Wednesday, September 26, 2012 5:29 PM
  To: Auld, Will
  Cc: kvm@vger.kernel.org; Avi Kivity; Zhang, Xiantao; Liu, Jinsong
  Subject: Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
  
  On Wed, Sep 26, 2012 at 10:58:46PM +, Auld, Will wrote:
   Avi, Still working on your suggestions.
   
   Marcelo,
   
   The purpose is to be able to run guests that implement this change and 
   not require they revert to the older method of adjusting the TSC. I am 
   making no assumption about whether the guest checks to see if the times 
   are good enough or just runs an algorithm every time but in any case this 
   would allow the simpler, cleaner and less expensive algorithm to run if 
   it exists. 
  
  Will, you can choose to not expose the feature. Correct?
  
  Because this conflicts with the model that has been envisioned and 
  developed by Zachary... for that model to continue to be functional you'll 
  have to make sure the TSC emulation is adjusted accordingly to consider 
  IA32_TSC_ADJUST (for example, when trapping TSC).
  
  From that point of view, the patch below is incomplete.
  
  ... or KVM can choose to never expose the feature via CPUID and handle TSC
  consistency itself (I understand your perspective of getting a task
  complete, but unfortunately from my POV it's not so simple).
  
   Thanks,
   
   Will
   
   The purpose of the IA32_TSC_ADJUST control is to make it easier for the 
   operating system (host) to decrease the delta between cores to an 
   acceptable value, so that applications can make use of direct RDTSC, 
   correct?
   
   Why is it necessary for the guests to make use of such interface, if the 
   hypervisor could provide proper TSC?
   
   (not against exposing it to the guests, just thinking out loud).
   
   That is, if the purpose of the IA32_TSC_ADJUST is to provide proper 
   synchronized TSC across cores, and newer guests which should already 
   make use of paravirt clock interface, what is the point of exposing the 
   feature?
   
   -Original Message-
   From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
   Sent: Wednesday, September 26, 2012 2:35 PM
   To: Auld, Will
   Cc: kvm@vger.kernel.org; Avi Kivity; Zhang, Xiantao
   Subject: Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
   
   On Wed, Sep 19, 2012 at 05:44:46PM +, Auld, Will wrote:
From 9982bb73460b05c1328068aae047b14b2294e2da Mon Sep 17 00:00:00
2001
From: Will Auld will.a...@intel.com
Date: Wed, 12 Sep 2012 18:10:56 -0700
Subject: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported

Basic design is to emulate the MSR by allowing reads and writes to a
guest vcpu specific location to store the value of the emulated MSR
while adding the value to the vmcs tsc_offset. In this way the
IA32_TSC_ADJUST value will be included in all reads of the TSC MSR,
whether through rdmsr or rdtsc. This is of course as long as the "use
TSC counter offsetting" VM-execution control is enabled, as well as the
IA32_TSC_ADJUST control.

However, because hardware will only return TSC + IA32_TSC_ADJUST +
vmcs tsc_offset for a guest process when it does an rdtsc (with the
correct settings), the value of our virtualized IA32_TSC_ADJUST must be
stored in one of these three locations.
   
   The purpose of the IA32_TSC_ADJUST control is to make it easier for the 
   operating system (host) to decrease the delta between cores to an 
   acceptable value, so that applications can make use of direct RDTSC, 
   correct?
   
   Why is it necessary for the guests to make use of such interface, if the 
   hypervisor could provide proper TSC?
   
   (not against exposing it to the guests, just thinking out loud).
   
   That is, if the purpose of the IA32_TSC_ADJUST is to provide proper 

Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Avi Kivity
On 09/27/2012 01:23 PM, Raghavendra K T wrote:

 This gives us a good case for tracking preemption on a per-vm basis.  As
 long as we aren't preempted, we can keep the PLE window high, and also
 return immediately from the handler without looking for candidates.
 
 1) So do you think, deferring preemption patch ( Vatsa was mentioning
 long back)  is also another thing worth trying, so we reduce the chance
 of LHP.

Yes, we have to keep it in mind.  It will be useful for fine grained
locks, not so much so coarse locks or IPIs.

I would still of course prefer a PLE solution, but if we can't get it to
work we can consider preemption deferral.

 
 IIRC, with defer preemption:
 we will have a hook in the spinlock/unlock path to measure the depth of
 lock held, and share it with the host scheduler (maybe via MSRs now).
 The host scheduler 'prefers' not to preempt a lock-holding vcpu (or
 rather gives it, say, one chance).

A downside is that we have to do that even when undercommitted.

Also there may be a lot of false positives (deferred preemptions even
when there is no contention).

 
 2) Looking at the result (comparing A & C), I do feel we have
 significant overhead in iterating over vcpus (when compared even to a
 vmexit), so we would still need the undercommit fix suggested by PeterZ
 (improving by 140%)?

Looking only at the current runqueue?  My worry is that it misses a lot
of cases.  Maybe try the current runqueue first and then others.

Or were you referring to something else?

 
 So looking back at threads/ discussions so far, I am trying to
 summarize, the discussions so far. I feel, at least here are the few
 potential candidates to go in:
 
 1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
 2) Dynamically changing PLE window (Avi/Andrew/Chegu)
 3) preempt_notify handler to identify preempted VCPUs (Avi)
 4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
 5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
 6) Pv spinlock
 7) Jiannan's proposed improvements
 8) Defer preemption patches
 
 Did we miss anything (or added extra?)
 
 So here are my action items:
 - I plan to repost this series with what PeterZ and Rik suggested, with
 performance analysis.
 - I'll go back and explore (3) and (6).
 
 Please Let me know..

Undoubtedly we'll think of more stuff.  But this looks like a good start.


-- 
error compiling committee.c: too many arguments to function


[PATCH] mm: compaction: cache if a pageblock was scanned and no pages were isolated -fix2

2012-09-27 Thread Mel Gorman
The clearing of PG_migrate_skip potentially takes a long time if the
zone is massive. Be safe and check if it needs to reschedule.

This is a fix for
mm-compaction-cache-if-a-pageblock-was-scanned-and-no-pages-were-isolated.patch

Signed-off-by: Mel Gorman mgor...@suse.de
---
 mm/compaction.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index fb07abb..722d10f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -85,6 +85,9 @@ static void reset_isolation_suitable(struct zone *zone)
/* Walk the zone and mark every pageblock as suitable for isolation */
	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
struct page *page;
+
+   cond_resched();
+
if (!pfn_valid(pfn))
continue;
 


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Avi Kivity
On 09/27/2012 01:26 PM, Raghavendra K T wrote:
 On 09/27/2012 02:20 PM, Avi Kivity wrote:
 On 09/25/2012 04:43 PM, Jiannan Ouyang wrote:
 I've actually implemented this preempted_bitmap idea.

 Interesting, please share the code if you can.

 However, I'm doing this to expose this information to the guest, so the
 guest is able to know if the lock holder is preempted or not before
 spinning. Right now, I'm running experiments to show that this idea works.

 I'm wondering what do you guys think of the relationship between the
 pv_ticketlock approach and PLE handler approach. Are we going to adopt
 PLE instead of the pv ticketlock, and why?

 Right now we're searching for the best solution.  The tradeoffs are more
 or less:

 PLE:
 - works for unmodified / non-Linux guests
 - works for all types of spins (e.g. smp_call_function*())
 - utilizes an existing hardware interface (PAUSE instruction) so likely
 more robust compared to a software interface

 PV:
 - has more information, so it can perform better
 
 Should we also consider that we always have an edge here for non-PLE
 machine?

True.  The deployment share for these is decreasing rapidly though.  I
hate optimizing for obsolete hardware.


-- 
error compiling committee.c: too many arguments to function


Re: KVM: injection failed, MSI lost (Operation not permitted)

2012-09-27 Thread Michael S. Tsirkin
On Tue, Sep 25, 2012 at 05:32:47PM +0200, Jan Kiszka wrote:
 On 2012-09-25 16:19, Michael S. Tsirkin wrote:
  Each time I kill qemu-kvm I get this error:
  KVM: injection failed, MSI lost (Operation not permitted)
  
  Jan, any idea?
 
 Race in shutting down resources?
 
 What do you do precisely? I never saw this.
 
 Jan

I boot guest like this:

/home/mst/qemu-test/bin/qemu-system-x86_64   -enable-kvm -cpu kvm64 -m
1G -drive file=/home/mst/rhel6-try.qcow2 -net
nic,model=virtio,netdev=foo -netdev id=bar -device e1000,netdev=bar
-redir tcp:8022::22 -netdev
tap,id=foo,ifname=msttap0,script=/home/mst/ifup,downscript=no,vhost=off
-vnc :1 -smp 16 -snapshot

At some later point I kill it from another shell:

kill `pgrep qemu`

 -- 
 Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
 Corporate Competence Center Embedded Linux


Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-27 Thread Mel Gorman
On Wed, Sep 26, 2012 at 09:49:30AM +0900, Minchan Kim wrote:
 On Tue, Sep 25, 2012 at 10:12:07AM +0100, Mel Gorman wrote:
  On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
   On Mon, 24 Sep 2012 10:39:38 +0100
   Mel Gorman mgor...@suse.de wrote:
   
On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:

 Also, what has to be done to avoid the polling altogether?  eg/ie, zap
 a pageblock's PB_migrate_skip synchronously, when something was done 
 to
 that pageblock which justifies repolling it?
 

The "something" event you are looking for is pages being freed or
allocated in the page allocator. A movable page being allocated in the block
or a page being freed should clear the PB_migrate_skip bit if it's set.
Unfortunately this would impact the fast path of the alloc and free paths
of the page allocator. I felt that that was too high a price to pay.
   
   We already do a similar thing in the page allocator: clearing of
   ->all_unreclaimable and ->pages_scanned.
  
  That is true but that is a simple write (shared cache line but still) to
  a struct zone. Worse, now that you point it out, that's pretty stupid. It
  should be checking if the value is non-zero before writing to it to avoid
  a cache line bounce.
  
  Clearing the PG_migrate_skip in this path to avoid the need to ever poll is
  not as cheap as it needs to be:

  set_pageblock_skip
    -> set_pageblock_flags_group
      -> page_zone
      -> page_to_pfn
      -> get_pageblock_bitmap
      -> pfn_to_bitidx
      -> __set_bit
  
   But that isn't on the fast
   path really - it happens once per pcp unload. 
  
  That's still an important enough path that I'm wary of making it fatter
  and that only covers the free path. To avoid the polling, the allocation
  side needs to be handled too. It could be shoved down into rmqueue() to
  put it into a slightly colder path but still, it's a price to pay to keep
  compaction happy.
  
   Can we do something
   like that?  Drop some hint into the zone without having to visit each
   page?
   
  
  Not without incurring a cost, but yes, it is possible to give a hint on when
  PG_migrate_skip should be cleared and move away from that time-based hammer.

  First, we'd introduce a variant of get_pageblock_migratetype() that returns
  all the bits for the pageblock flags and then helpers to extract either the
  migratetype or the PG_migrate_skip. We already are incurring the cost of
  get_pageblock_migratetype() so it will not be much more expensive than what
  is already there. If there is an allocation or free within a pageblock that
  has the PG_migrate_skip bit set then we increment a counter. When the counter
  reaches some to-be-decided threshold then compaction may clear all the
  bits. This would match the criteria of the clearing being based on activity.
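Very roughly, the allocator-side half could look like this (a sketch only;
the counter placement, field and threshold names are invented, and the
problems listed next apply to it directly):

/*
 * Hypothetical sketch: count allocator activity in pageblocks that
 * compaction marked as "skip"; once the count crosses a threshold,
 * compaction clears all the skip bits again.
 */
static inline void note_skipped_pageblock_activity(struct zone *zone,
						   unsigned long pb_flags)
{
	if (pb_flags & (1UL << PB_migrate_skip))
		atomic_inc(&zone->compact_skip_activity);	/* invented field */
}

static inline bool compact_should_clear_skip(struct zone *zone)
{
	/* COMPACT_SKIP_THRESHOLD is the "to-be-decided" threshold */
	return atomic_read(&zone->compact_skip_activity) > COMPACT_SKIP_THRESHOLD;
}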
  
  There are four potential problems with this
  
  1. The logic to retrieve all the bits and split them up will be a little
 convoluted but maybe it would not be that bad.
  
  2. The counter is a shared-writable cache line but obviously it could
 be moved to vmstat and incremented with inc_zone_page_state to offset
 the cost a little.
  
  3. The biggest weakness is that there is no way to know if the
 counter is incremented based on activity in a small subset of blocks.
  
  4. What should the threshold be?
  
  The first problem is minor but the other three are potentially a mess.
  Adding another vmstat counter is bad enough in itself but if the counter
  is incremented based on a small subset of pageblocks, the hint becomes
  potentially useless.
 
 Another idea is that we can add two bits (PG_check_migrate/PG_check_free)
 in pageblock_flags_group.
 In the allocation path, we can set PG_check_migrate in a pageblock.
 In the free path, we can set PG_check_free in a pageblock.
 And they are cleared by compaction's scan, like now.
 So we can discard 3 and 4 at least.
 

Adding a second bit does not fix problem 3 or problem 4 at all. With two
bits, all activity could be concentrated on two blocks - one migrate and
one free. The threshold still has to be selected.

 Another idea: let's cure it by fixing the fundamental problem.
 Make the zone's locks more fine-grained.

Far easier said than done and only covers the contention problem. It
does nothing for the scanning problem.

 As time goes by, systems use more memory but our zone locks
 aren't scalable. Recently, lru_lock and zone->lock contention reports
 aren't rare, so I think it's a good time to move to the next step.
 

Lock contention on both those locks recently was due to compaction
rather than something more fundamental.

 How about defining a struct sub_zone per 2G or 4G,
 so a zone can have several sub_zones depending on its size, a subzone can
 take over the current zone's role, and the zone is just a container of
 subzones?
 Of course, it's not easy to implement but I think someday we should
 go that way. Is it really overkill?
 

One one 

Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Andrew Theurer
On Thu, 2012-09-27 at 14:03 +0200, Avi Kivity wrote:
 On 09/27/2012 01:23 PM, Raghavendra K T wrote:
 
  This gives us a good case for tracking preemption on a per-vm basis.  As
  long as we aren't preempted, we can keep the PLE window high, and also
  return immediately from the handler without looking for candidates.
  
  1) So do you think, deferring preemption patch ( Vatsa was mentioning
  long back)  is also another thing worth trying, so we reduce the chance
  of LHP.
 
 Yes, we have to keep it in mind.  It will be useful for fine grained
 locks, not so much so coarse locks or IPIs.
 
 I would still of course prefer a PLE solution, but if we can't get it to
 work we can consider preemption deferral.
 
  
  IIRC, with defer preemption:
  we will have a hook in the spinlock/unlock path to measure the depth of
  lock held, and share it with the host scheduler (maybe via MSRs now).
  The host scheduler 'prefers' not to preempt a lock-holding vcpu (or
  rather gives it, say, one chance).
 
 A downside is that we have to do that even when undercommitted.
 
 Also there may be a lot of false positives (deferred preemptions even
 when there is no contention).
 
  
  2) Looking at the result (comparing A & C), I do feel we have
  significant overhead in iterating over vcpus (when compared even to a
  vmexit), so we would still need the undercommit fix suggested by PeterZ
  (improving by 140%)?
 
 Looking only at the current runqueue?  My worry is that it misses a lot
 of cases.  Maybe try the current runqueue first and then others.
 
 Or were you referring to something else?
 
  
  So looking back at threads/ discussions so far, I am trying to
  summarize, the discussions so far. I feel, at least here are the few
  potential candidates to go in:
  
  1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
  2) Dynamically changing PLE window (Avi/Andrew/Chegu)
  3) preempt_notify handler to identify preempted VCPUs (Avi)
  4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
  5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
  6) Pv spinlock
  7) Jiannan's proposed improvements
  8) Defer preemption patches
  
  Did we miss anything (or added extra?)
  
  So here are my action items:
  - I plan to repost this series with what PeterZ and Rik suggested, with
  performance analysis.
  - I'll go back and explore (3) and (6).
  
  Please Let me know..
 
 Undoubtedly we'll think of more stuff.  But this looks like a good start.

9) lazy gang-like scheduling with PLE to cover the non-gang-like
exceptions  (/me runs and hides from scheduler folks)

-Andrew Theurer



Re: [PATCH 13/15] KVM: ARM: Handle guest faults in KVM

2012-09-27 Thread Catalin Marinas
On 25 September 2012 13:38, Christoffer Dall
c.d...@virtualopensystems.com wrote:
 +
 + /*
 +  * If this is a write fault (think COW) we need to make sure the
 +  * existing page, which other CPUs might still read, doesn't go
 away
 +  * from under us, by calling gfn_to_pfn_prot(write_fault=true).
 +  * Therefore, we call gfn_to_pfn_prot(write_fault=false), which
 will
 +  * pin the existing page, then we get a new page for the user space
 +  * pte and map this in the stage-2 table where we also make sure to
 +  * flush the TLB for the VM, if there was an existing entry (the
 entry
 +  * was updated setting the write flag to the potentially new page).
 +  */
 + if (fault_status == FSC_PERM) {
 + pfn_existing = gfn_to_pfn_prot(vcpu->kvm, gfn, false, NULL);
 + if (is_error_pfn(pfn_existing))
 + return -EFAULT;
 + }
 +
 + pfn = gfn_to_pfn_prot(vcpu->kvm, gfn, write_fault, &writable);
 + if (is_error_pfn(pfn)) {
 + ret = -EFAULT;
 + goto out_put_existing;
 + }
 +
 + /* We need minimum second+third level pages */
 + ret = mmu_topup_memory_cache(memcache, 2, KVM_NR_MEM_OBJS);
 + if (ret)
 + goto out;
 + new_pte = pfn_pte(pfn, PAGE_KVM_GUEST);
 + if (writable)
 + pte_val(new_pte) |= L_PTE2_WRITE;
 + coherent_icache_guest_page(vcpu->kvm, gfn);

 why don't you flush icache only when guest has mapped executable page
 as __sync_icache_dcache function does currently?

 because we don't know if the guest will map the page executable. The
 guest may read the page through a normal load, which causes the fault,
 and subsequently execute it (even possible through different guest
 mappings). The only way to see this happening would be to mark all
 pages as non-executable and catch the fault when it occurs -
 unfortunately the HPFAR which gives us the IPA is not populated on
 execute never faults, so we would have to translate the PC's va to ipa
 using cp15 functionality when this happens, which is then also racy
 with other CPUs.

I think you can avoid the race in the stage 2 XN case. In the Hyp
exception handler entered because of a stage 2 XN bit you can get the
IPA via the CP15 ATS1CPR and PAR registers. If the address translation
failed because the same guest running on other CPU changed the stage 1
page table, you can simply return to the guest rather than switching
to host with incomplete information. The guest may handle its own
stage 1 fault and eventually trigger another stage 2 permission and
Hyp will try the address translation again. That's a very rare
situation, so just returning without handling it would not cause any
performance issues.
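Roughly like this (a sketch only; the short-descriptor PAR layout is shown,
with LPAE the 64-bit PAR would be read instead, and error handling beyond
the PAR.F check is elided):

/*
 * Sketch: translate the faulting PC through the guest's stage 1 tables
 * to recover the IPA when HPFAR is not populated. If the stage 1 walk
 * aborts (PAR.F set), return false and simply resume the guest so it
 * can resolve its own stage 1 fault first.
 */
static bool guest_va_to_ipa(unsigned long va, phys_addr_t *ipa)
{
	u32 par;

	asm volatile(
		"mcr	p15, 0, %1, c7, c8, 0\n"	/* ATS1CPR: stage 1, PL1 read */
		"isb\n"
		"mrc	p15, 0, %0, c7, c4, 0\n"	/* read 32-bit PAR */
		: "=r" (par) : "r" (va) : "memory");

	if (par & 1)					/* PAR.F: translation aborted */
		return false;

	*ipa = (par & 0xfffff000UL) | (va & 0xfffUL);
	return true;
}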

-- 
Catalin


Re: KVM: injection failed, MSI lost (Operation not permitted)

2012-09-27 Thread Jan Kiszka
On 2012-09-27 14:12, Michael S. Tsirkin wrote:
 On Tue, Sep 25, 2012 at 05:32:47PM +0200, Jan Kiszka wrote:
 On 2012-09-25 16:19, Michael S. Tsirkin wrote:
 Each time I kill qemu-kvm I get this error:
 KVM: injection failed, MSI lost (Operation not permitted)

 Jan, any idea?

 Race in shutting down resources?

 What do you do precisely? I never saw this.

 Jan
 
 I boot guest like this:
 
 /home/mst/qemu-test/bin/qemu-system-x86_64   -enable-kvm -cpu kvm64 -m
 1G -drive file=/home/mst/rhel6-try.qcow2 -net
 nic,model=virtio,netdev=foo -netdev id=bar -device e1000,netdev=bar
 -redir tcp:8022::22 -netdev
 tap,id=foo,ifname=msttap0,script=/home/mst/ifup,downscript=no,vhost=off
 -vnc :1 -smp 16 -snapshot
 
 At some later point I kill it from another shell:
 
 kill `pgrep qemu`

Hmm, cannot reproduce, even under flood-ping load on the virtio device,
ie. the one that is using MSI. I'm afraid you will have to trace the
event flow.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-27 Thread Mel Gorman
On Tue, Sep 25, 2012 at 01:03:52PM -0700, Andrew Morton wrote:
 On Tue, 25 Sep 2012 10:12:07 +0100
 Mel Gorman mgor...@suse.de wrote:
 
  First, we'd introduce a variant of get_pageblock_migratetype() that returns
  all the bits for the pageblock flags and then helpers to extract either the
  migratetype or the PG_migrate_skip. We already are incurring the cost of
  get_pageblock_migratetype() so it will not be much more expensive than what
  is already there. If there is an allocation or free within a pageblock that
  has the PG_migrate_skip bit set then we increment a counter. When the counter
  reaches some to-be-decided threshold then compaction may clear all the
  bits. This would match the criteria of the clearing being based on activity.
  
  There are four potential problems with this
  
  1. The logic to retrieve all the bits and split them up will be a little
     convoluted but maybe it would not be that bad.
  
  2. The counter is a shared-writable cache line but obviously it could
 be moved to vmstat and incremented with inc_zone_page_state to offset
 the cost a little.
  
  3. The biggest weakness is that there is no way to know if the
     counter is incremented based on activity in a small subset of blocks.
  
  4. What should the threshold be?
  
  The first problem is minor but the other three are potentially a mess.
  Adding another vmstat counter is bad enough in itself but if the counter
  is incremented based on a small subset of pageblocks, the hint becomes
  potentially useless.
  
  However, does this match what you have in mind or am I over-complicating
  things?
 
 Sounds complicated.
 
 Using wall time really does suck. 

I know, we spent a fair amount of effort getting rid of congestion_wait()
from paths it did not belong to for similar reasons.

 Are you sure you can't think of
 something more logical?
 

No, I'm not sure.

As a matter of general policy I should not encourage this but apparently
you can nag better code out of me, patch is below :). I would rather it
was added on top rather than merged with the time-based series so it can
be reverted easily if necessary.

 How would we demonstrate the suckage?  What would be the observable downside
 of switching that 5 seconds to 5 hours?
 

Reduced allocation success rates.

  Lets take an unlikely case - 128G single-node machine. That loop count
  on x86-64 would be 65536. It'll be fast enough, particularly in this
  path.
 
 That could easily exceed a millisecond.  Can/should we stick a
 cond_resched() in there?

Ok, I think it is very unlikely but not impossible. I posted a patch for
it already.

Here is a candidate patch that replaces the time heuristic with one that
is based on VM activity. My own testing indicate that scan rates are
slightly higher with this patch than the time heuristic but well within
acceptable limits.

Richard, can you also test this patch and make sure your test case has
not regressed again please?

---8---
mm: compaction: Clear PG_migrate_skip based on compaction and reclaim activity

Compaction caches if a pageblock was scanned and no pages were isolated
so that the pageblocks can be skipped in the future to reduce scanning.
This information is not cleared by the page allocator based on activity
due to the impact it would have on the page allocator fast paths. Hence
there is a requirement that something clear the cache or pageblocks will
be skipped forever. Currently the cache is cleared if there were a number
of recent allocation failures and it has not been cleared within the last
5 seconds. Time-based decisions like this are terrible as they have no
relationship to VM activity and are basically a big hammer.

Unfortunately, accurate heuristics would add cost to some hot paths so this
patch implements a rough heuristic. There are two cases where the cache
is cleared.

1. If a !kswapd process completes a compaction cycle (migrate and free
   scanner meet), the zone is marked compact_blockskip_flush. When kswapd
   goes to sleep, it will clear the cache. This is expected to be the
   common case where the cache is cleared. It does not really matter if
   kswapd happens to be asleep or going to sleep when the flag is set as
   it will be woken on the next allocation request.

2. If there have been multiple failures recently and compaction just
   finished being deferred then a process will clear the cache and start
   a full scan. This situation happens if there are multiple high-order
   allocation requests under heavy memory pressure.

The clearing of the PG_migrate_skip bits and other scans is inherently
racy but the race is harmless. For allocations that can fail such as
THP, they will simply fail. For requests that cannot fail, they will
retry the allocation. Tests indicated that scanning rates were roughly
similar to when the time-based heuristic was used and the allocation
success rates were similar.

Signed-off-by: Mel Gorman mgor...@suse.de
---
 include/linux/compaction.h | 

Re: [PATCH 06/15] KVM: ARM: Initial skeleton to compile KVM support

2012-09-27 Thread Will Deacon
On Wed, Sep 26, 2012 at 02:43:14AM +0100, Christoffer Dall wrote:
 On 09/25/2012 11:20 AM, Will Deacon wrote:
  +/* Multiprocessor Affinity Register */
  +#define MPIDR_CPUID	(0x3 << 0)
 
  I'm fairly sure we already have code under arch/arm/ for dealing with the
  mpidr. Let's re-use that rather than reinventing it here.
 

 I see some defines in topology.c - do you want some of these factored
 out into a header file that we can then also use from kvm? If so, where?

I guess either in topology.h or a new header (topology-bits.h).
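
Something as small as the following would do, assuming the topology-bits.h
name suggested above (the mask mirrors the MPIDR_CPUID define in the patch;
the header and guard names are made up for illustration):

	/* arch/arm/include/asm/topology-bits.h (hypothetical) */
	#ifndef __ASM_ARM_TOPOLOGY_BITS_H
	#define __ASM_ARM_TOPOLOGY_BITS_H

	/* Multiprocessor Affinity Register fields */
	#define MPIDR_CPUID_MASK	(0x3 << 0)
	#define MPIDR_CPUID(mpidr)	((mpidr) & MPIDR_CPUID_MASK)

	#endif /* __ASM_ARM_TOPOLOGY_BITS_H */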

  +#define EXCEPTION_NONE  0
  +#define EXCEPTION_RESET 0x80
  +#define EXCEPTION_UNDEFINED 0x40
  +#define EXCEPTION_SOFTWARE  0x20
  +#define EXCEPTION_PREFETCH  0x10
  +#define EXCEPTION_DATA  0x08
  +#define EXCEPTION_IMPRECISE 0x04
  +#define EXCEPTION_IRQ   0x02
  +#define EXCEPTION_FIQ   0x01
 
  Why the noise?
 

 these are simply cruft from a previous life of KVM/ARM.

Ok, then please get rid of them.

  +static inline enum vcpu_mode vcpu_mode(struct kvm_vcpu *vcpu)
  +{
  +   u8 mode = __vcpu_mode(vcpu-arch.regs.cpsr);
  +   BUG_ON(mode == 0xf);
  +   return mode;
  +}
 
  I noticed that you have a fair few BUG_ONs throughout the series. Fair
  enough, but for hyp code is that really the right thing to do? Killing the
  guest could make more sense, perhaps?

 the idea is to have BUG_ONs that are indeed BUG_ONs that we want to
 catch explicitly on the host. We have had a pass over the code to change
 all the BUG_ONs that can be provoked by the guest and inject the proper
 exceptions into the guest in this case. If you find places where this is
 not the case, it should be changed, and do let me know.

Ok, so are you saying that a BUG_ON due to some detected inconsistency with
one guest may not necessarily terminate the other guests? BUG_ONs in the
host seem like a bad idea if the host is able to continue with a subset of
guests.

 
  +static inline u32 *vcpu_pc(struct kvm_vcpu *vcpu)
  +{
  +   return vcpu_reg(vcpu, 15);
  +}
 
  If you stick a struct pt_regs into struct kvm_regs, you could reuse ARM_pc
  here etc.
 

 I prefer not to, because we'd have those registers presumably for usr
 mode and then we only define the others explicitly. I think it's much
 clearer to look at kvm_regs today.

I disagree and think that you should reuse as much of the arch/arm/ code as
possible. Not only does it make it easier to read by people who are familiar
with that code (and in turn get you more reviewers) but it also means that
we limit the amount of duplication that we have.

I think Marc (CC'd) had a go at this with some success.

  +#ifndef __ARM_KVM_HOST_H__
  +#define __ARM_KVM_HOST_H__
  +
  +#include asm/kvm.h
  +
  +#define KVM_MAX_VCPUS 4
 
  NR_CPUS?
 

 well this is defined by KVM generic code, and is common for other
 architectures.

I mean #define KVM_MAX_VCPUS NR_CPUS. The 4 seems arbitrary.

  +int __attribute_const__ kvm_target_cpu(void)
  +{
  +   unsigned int midr;
  +
  +   midr = read_cpuid_id();
  +   switch ((midr >> 4) & 0xfff) {
  +   case KVM_ARM_TARGET_CORTEX_A15:
  +   return KVM_ARM_TARGET_CORTEX_A15;
 
  I have this code already in perf_event.c. Can we move it somewhere common
  and share it? You should also check that the implementor field is 0x41.
 

 by all means, you can probably suggest a good place better than I can...

cputype.h?
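
For example, something along these lines (a sketch only; where the helper
ends up and the exact macro names depend on what gets factored into
cputype.h, and 0xc0f is the Cortex-A15 primary part number):

	#define MIDR_IMPLEMENTOR_ARM	0x41

	int __attribute_const__ kvm_target_cpu(void)
	{
		unsigned int midr = read_cpuid_id();

		/* only ARM Ltd. implementations for now */
		if (((midr >> 24) & 0xff) != MIDR_IMPLEMENTOR_ARM)
			return -EINVAL;

		switch ((midr >> 4) & 0xfff) {
		case 0xc0f:		/* Cortex-A15 primary part number */
			return KVM_ARM_TARGET_CORTEX_A15;
		default:
			return -EINVAL;
		}
	}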

  +#include linux/module.h
  +
  +EXPORT_SYMBOL_GPL(smp_send_reschedule);
 
  Erm...
 
  We already have arch/arm/kernel/armksyms.c for exports -- please use that.
  However, exporting such low-level operations sounds like a bad idea. How
  realistic is kvm-as-a-module on ARM anyway?
 

 at this point it's broken, so I'll just remove this and leave it as a
 fun project for some poor soul if anyone ever needs half the code
 outside the kernel as a module (the other half needs to be compiled in
 anyway).

Ok, that suits me. If it's broken, let's not include it in the initial
submission.

  +int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs 
  *regs)
  +{
  +   return -EINVAL;
  +}
  +
  +int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs 
  *regs)
  +{
  +   return -EINVAL;
  +}
 
  Again, all looks like this should be implemented using regsets from what I
  can tell.
 

 this API has been discussed to death on the KVM lists, and we can of
 course revive that if the regset makes it nicer - I'd prefer getting
 this upstream the way that it is now though, where GET_REG / SET_REG
 seems to be the way forward from a KVM perspective.

I'm sure the API has been discussed, but I've not seen that conversation and
I'm also not aware of whether or not regsets were considered as a
possibility for this stuff. The advantages of using them are:

1. It's less code for the arch to implement (and most of what you
need, you already have).

2. You can move the actual copying code into core 

Re: [PATCH 06/15] KVM: ARM: Initial skeleton to compile KVM support

2012-09-27 Thread Marc Zyngier
On 27/09/12 15:13, Will Deacon wrote:
 On Wed, Sep 26, 2012 at 02:43:14AM +0100, Christoffer Dall wrote:
 On 09/25/2012 11:20 AM, Will Deacon wrote:

 +static inline u32 *vcpu_pc(struct kvm_vcpu *vcpu)
 +{
 +   return vcpu_reg(vcpu, 15);
 +}

 If you stick a struct pt_regs into struct kvm_regs, you could reuse ARM_pc
 here etc.


 I prefer not to, because we'd have those registers presumably for usr
 mode and then we only define the others explicit. I think it's much
 clearer to look at kvm_regs today.
 
 I disagree and think that you should reuse as much of the arch/arm/ code as
 possible. Not only does it make it easier to read by people who are familiar
 with that code (and in turn get you more reviewers) but it also means that
 we limit the amount of duplication that we have.
 
 I think Marc (CC'd) had a go at this with some success.

Yup, I have it converted already. It requires a number of changes, but I
took this opportunity to do some other cleanup (world switch
save/restore code, mostly).

Patches are at the top of my kvm-cleanup branch.

M.
-- 
Jazz is not dead. It just smells funny...

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH 06/15] KVM: ARM: Initial skeleton to compile KVM support

2012-09-27 Thread Peter Maydell
On 27 September 2012 15:13, Will Deacon will.dea...@arm.com wrote:
 On Wed, Sep 26, 2012 at 02:43:14AM +0100, Christoffer Dall wrote:
 this API has been discussed to death on the KVM lists, and we can of
 course revive that if the regset makes it nicer - I'd prefer getting
 this upstream the way that it is now though, where GET_REG / SET_REG
 seems to be the way forward from a KVM perspective.

 I'm sure the API has been discussed, but I've not seen that conversation and
 I'm also not aware about whether or not regsets were considered as a
 possibility for this stuff.

Can you point me at some documentation for regsets? It's a bit
difficult to have a sensible conversation about their suitability
otherwise...

(The potentially tricky area to handle is the cp15 registers, where
it's quite likely that new registers might be added later and so
userspace has to be able to query the kernel for what is supported
and possibly deal with the kernel reporting attempts to set read
only bits within registers, etc. Using the same ABI for simpler
cases like the GPRs and FP registers is then just a matter of
being consistent in the interface we expose to userspace.)
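
(For comparison, the GET_REG/SET_REG direction mentioned above boils down to
userspace code along the following lines; this is only a sketch, and how the
cp15 register IDs are encoded is exactly what the ABI documentation would
have to pin down:

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	/* read a single guest register through the ONE_REG interface */
	static long get_one_reg(int vcpu_fd, __u64 reg_id, __u64 *val)
	{
		struct kvm_one_reg reg = {
			.id   = reg_id,	/* encodes arch, size and register index */
			.addr = (__u64)(unsigned long)val,
		};

		return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
	}
)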

-- PMM
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm:queue 41/42] arch/s390/kvm/interrupt.c:428:12: warning: ignoring return value of 'vcpu_load', declared with attribute warn_unused_result

2012-09-27 Thread Christian Borntraeger
On 20/09/12 07:22, Michael S. Tsirkin wrote:
 On Thu, Sep 20, 2012 at 08:16:08AM +0800, Fengguang Wu wrote:
 Hi Michael,

 FYI, there are new compile warnings show up in

 tree:   git://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
 head:   879238fecc051d95037ae76332916209a7770709
 commit: 9fc77441e5e1bf80b794cc546d2243ee9f4afb75 [41/42] KVM: make processes 
 waiting on vcpu mutex killable
 config: s390-defconfig

 All error/warnings:

 arch/s390/kvm/interrupt.c: In function 'kvm_s390_handle_wait':
 arch/s390/kvm/interrupt.c:428:12: warning: ignoring return value of 
 'vcpu_load', declared with attribute warn_unused_result [-Wunused-result]

 vim +428 arch/s390/kvm/interrupt.c
    418		add_wait_queue(&vcpu->arch.local_int.wq, &wait);
    419		while (list_empty(&vcpu->arch.local_int.list) &&
    420			list_empty(&vcpu->arch.local_int.float_int->list) &&
    421			(!vcpu->arch.local_int.timer_due) &&
    422			!signal_pending(current)) {
    423			set_current_state(TASK_INTERRUPTIBLE);
    424			spin_unlock_bh(&vcpu->arch.local_int.lock);
    425			spin_unlock(&vcpu->arch.local_int.float_int->lock);
    426			vcpu_put(vcpu);
    427			schedule();
  > 428			vcpu_load(vcpu);
    429			spin_lock(&vcpu->arch.local_int.float_int->lock);
    430			spin_lock_bh(&vcpu->arch.local_int.lock);

IIRC schedule will do vcpu put/load anyway via the preempt notifiers. So the 
right fix
is probably to just remove the vcpu_put/load around that schedule.
Will double check and send a patch.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 14/15] KVM: ARM: Handle I/O aborts

2012-09-27 Thread Will Deacon
On Sat, Sep 15, 2012 at 04:35:59PM +0100, Christoffer Dall wrote:
 When the guest accesses I/O memory this will create data abort
 exceptions and they are handled by decoding the HSR information
 (physical address, read/write, length, register) and forwarding reads
 and writes to QEMU which performs the device emulation.
 
 Certain classes of load/store operations do not support the syndrome
 information provided in the HSR and we therefore must be able to fetch
 the offending instruction from guest memory and decode it manually.
 
 We only support instruction decoding for valid reasonable MMIO operations
 where trapping them does not provide sufficient information in the HSR (no
 16-bit Thumb instructions provide register writeback that we care about).
 
 The following instruction types are NOT supported for MMIO operations
 despite the HSR not containing decode info:
  - any Load/Store multiple
  - any load/store exclusive
  - any load/store dual
  - anything with the PC as the dest register

[...]

 +
 +/**
 + * Load-Store instruction emulation
 + 
 */
 +
 +/*
 + * Must be ordered with LOADS first and WRITES afterwards
 + * for easy distinction when doing MMIO.
 + */
 +#define NUM_LD_INSTR  9
 +enum INSTR_LS_INDEXES {
 +   INSTR_LS_LDRBT, INSTR_LS_LDRT, INSTR_LS_LDR, INSTR_LS_LDRB,
 +   INSTR_LS_LDRD, INSTR_LS_LDREX, INSTR_LS_LDRH, INSTR_LS_LDRSB,
 +   INSTR_LS_LDRSH,
 +   INSTR_LS_STRBT, INSTR_LS_STRT, INSTR_LS_STR, INSTR_LS_STRB,
 +   INSTR_LS_STRD, INSTR_LS_STREX, INSTR_LS_STRH,
 +   NUM_LS_INSTR
 +};
 +
  +static u32 ls_instr[NUM_LS_INSTR][2] = {
  +   {0x04700000, 0x0d700000}, /* LDRBT */
  +   {0x04300000, 0x0d700000}, /* LDRT  */
  +   {0x04100000, 0x0c500000}, /* LDR   */
  +   {0x04500000, 0x0c500000}, /* LDRB  */
  +   {0x000000d0, 0x0e1000f0}, /* LDRD  */
  +   {0x01900090, 0x0ff000f0}, /* LDREX */
  +   {0x001000b0, 0x0e1000f0}, /* LDRH  */
  +   {0x001000d0, 0x0e1000f0}, /* LDRSB */
  +   {0x001000f0, 0x0e1000f0}, /* LDRSH */
  +   {0x04600000, 0x0d700000}, /* STRBT */
  +   {0x04200000, 0x0d700000}, /* STRT  */
  +   {0x04000000, 0x0c500000}, /* STR   */
  +   {0x04400000, 0x0c500000}, /* STRB  */
  +   {0x000000f0, 0x0e1000f0}, /* STRD  */
  +   {0x01800090, 0x0ff000f0}, /* STREX */
  +   {0x000000b0, 0x0e1000f0}  /* STRH  */
  +};
 +
 +static inline int get_arm_ls_instr_index(u32 instr)
 +{
 +   return kvm_instr_index(instr, ls_instr, NUM_LS_INSTR);
 +}
 +
 +/*
 + * Load-Store instruction decoding
 + */
  +#define INSTR_LS_TYPE_BIT  26
  +#define INSTR_LS_RD_MASK   0x0000f000
  +#define INSTR_LS_RD_SHIFT  12
  +#define INSTR_LS_RN_MASK   0x000f0000
  +#define INSTR_LS_RN_SHIFT  16
  +#define INSTR_LS_RM_MASK   0x0000000f
  +#define INSTR_LS_OFFSET12_MASK 0x00000fff

I'm afraid you're not going to thank me much for this, but it's high time we
unified the various instruction decoding functions we have under arch/arm/
and this seems like a good opportunity for that. For example, look at the
following snippets (there is much more in the files I list) in addition to
what you have:


asm/ptrace.h
-
#define PSR_T_BIT   0x00000020
#define PSR_F_BIT   0x00000040
#define PSR_I_BIT   0x00000080
#define PSR_A_BIT   0x00000100
#define PSR_E_BIT   0x00000200
#define PSR_J_BIT   0x01000000
#define PSR_Q_BIT   0x08000000
#define PSR_V_BIT   0x10000000
#define PSR_C_BIT   0x20000000
#define PSR_Z_BIT   0x40000000
#define PSR_N_BIT   0x80000000

mm/alignment.c
--
#define LDST_I_BIT(i)   (i & (1 << 26)) /* Immediate constant   */
#define LDST_P_BIT(i)   (i & (1 << 24)) /* Preindex             */
#define LDST_U_BIT(i)   (i & (1 << 23)) /* Add offset           */
#define LDST_W_BIT(i)   (i & (1 << 21)) /* Writeback            */
#define LDST_L_BIT(i)   (i & (1 << 20)) /* Load                 */

kernel/kprobes*.c
-
static void __kprobes
emulate_ldr(struct kprobe *p, struct pt_regs *regs)
{
	kprobe_opcode_t insn = p->opcode;
	unsigned long pc = (unsigned long)p->addr + 8;
	int rt = (insn >> 12) & 0xf;
	int rn = (insn >> 16) & 0xf;
	int rm = insn & 0xf;

kernel/opcodes.c

static const unsigned short cc_map[16] = {
	0xF0F0, /* EQ == Z set */
	0x0F0F, /* NE */
	0xCCCC, /* CS == C set */
	0x3333, /* CC */
	0xFF00, /* MI == N set */
	0x00FF, /* PL */
	0xAAAA, /* VS == V set */
	0x5555, /* VC 

Re: [kvm:queue 41/42] arch/s390/kvm/interrupt.c:428:12: warning: ignoring return value of 'vcpu_load', declared with attribute warn_unused_result

2012-09-27 Thread Avi Kivity
On 09/27/2012 05:02 PM, Christian Borntraeger wrote:
 On 20/09/12 07:22, Michael S. Tsirkin wrote:
 On Thu, Sep 20, 2012 at 08:16:08AM +0800, Fengguang Wu wrote:
 Hi Michael,

 FYI, there are new compile warnings show up in

 tree:   git://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
 head:   879238fecc051d95037ae76332916209a7770709
 commit: 9fc77441e5e1bf80b794cc546d2243ee9f4afb75 [41/42] KVM: make 
 processes waiting on vcpu mutex killable
 config: s390-defconfig

 All error/warnings:

 arch/s390/kvm/interrupt.c: In function 'kvm_s390_handle_wait':
 arch/s390/kvm/interrupt.c:428:12: warning: ignoring return value of 
 'vcpu_load', declared with attribute warn_unused_result [-Wunused-result]

 vim +428 arch/s390/kvm/interrupt.c
    418		add_wait_queue(&vcpu->arch.local_int.wq, &wait);
    419		while (list_empty(&vcpu->arch.local_int.list) &&
    420			list_empty(&vcpu->arch.local_int.float_int->list) &&
    421			(!vcpu->arch.local_int.timer_due) &&
    422			!signal_pending(current)) {
    423			set_current_state(TASK_INTERRUPTIBLE);
    424			spin_unlock_bh(&vcpu->arch.local_int.lock);
    425			spin_unlock(&vcpu->arch.local_int.float_int->lock);
    426			vcpu_put(vcpu);
    427			schedule();
  > 428			vcpu_load(vcpu);
    429			spin_lock(&vcpu->arch.local_int.float_int->lock);
    430			spin_lock_bh(&vcpu->arch.local_int.lock);
 
 IIRC schedule will do vcpu put/load anyway via the preempt notifiers. So the 
 right fix
 is probably to just remove the vcpu_put/load around that schedule.
 Will double check and send a patch.

Yes, this is correct.  On a preemptible kernel this happens all the time.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH 06/15] KVM: ARM: Initial skeleton to compile KVM support

2012-09-27 Thread Will Deacon
On Thu, Sep 27, 2012 at 03:45:50PM +0100, Peter Maydell wrote:
 On 27 September 2012 15:13, Will Deacon will.dea...@arm.com wrote:
  On Wed, Sep 26, 2012 at 02:43:14AM +0100, Christoffer Dall wrote:
  this API has been discussed to death on the KVM lists, and we can of
  course revive that if the regset makes it nicer - I'd prefer getting
  this upstream the way that it is now though, where GET_REG / SET_REG
  seems to be the way forward from a KVM perspective.
 
  I'm sure the API has been discussed, but I've not seen that conversation and
  I'm also not aware about whether or not regsets were considered as a
  possibility for this stuff.
 
 Can you point me at some documentation for regsets? It's a bit
 difficult to have a sensible conversation about their suitability
 otherwise...

The actual regset structure (struct user_regset) is internal to the kernel,
so it's not very well documented. As far as userspace interaction goes, the
usual method is via an iovec (see readv/writev) which is well documented, but
of course the kvm ioctl would still need documenting. For ptrace, that's just
a small paragraph in a user header:

/*
 * Generic ptrace interface that exports the architecture specific regsets
 * using the corresponding NT_* types (which are also used in the core dump).
 * Please note that the NT_PRSTATUS note type in a core dump contains a full
 * 'struct elf_prstatus'. But the user_regset for NT_PRSTATUS contains just the
 * elf_gregset_t that is the pr_reg field of 'struct elf_prstatus'. For all the
 * other user_regset flavors, the user_regset layout and the ELF core dump note
 * payload are exactly the same layout.
 *
 * This interface usage is as follows:
 *  struct iovec iov = { buf, len};
 *
 *  ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, iov);
 *
 * On the successful completion, iov.len will be updated by the kernel,
 * specifying how much the kernel has written/read to/from the user's iov.buf.
 */

but obviously you'd probably have something under Documentation/.
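
Fleshing the comment above out into something compilable, the userspace side
looks roughly like this (a sketch, using NT_PRSTATUS / elf_gregset_t as the
example regset):

	#include <sys/types.h>
	#include <sys/ptrace.h>
	#include <sys/procfs.h>
	#include <sys/uio.h>
	#include <elf.h>

	/* fetch the general-purpose registers of a stopped tracee */
	static long read_gp_regs(pid_t pid, elf_gregset_t *gregs)
	{
		struct iovec iov = {
			.iov_base = gregs,
			.iov_len  = sizeof(*gregs),
		};

		/* on success the kernel shrinks iov.iov_len to what it wrote */
		return ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov);
	}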

 (The potentially tricky area to handle is the cp15 registers, where
 it's quite likely that new registers might be added later and so
 userspace has to be able to query the kernel for what is supported
 and possibly deal with the kernel reporting attempts to set read
 only bits within registers, etc. Using the same ABI for simpler
 cases like the GPRs and FP registers is then just a matter of
 being consistent in the interface we expose to userspace.)

You *could* split up cp15 into lots of regsets but realistically that's going
to hurt down the line. GPR and FP would be good though.

Will
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvmarm] [PATCH 13/15] KVM: ARM: Handle guest faults in KVM

2012-09-27 Thread Marc Zyngier
On 27/09/12 04:11, Min-gyu Kim wrote:
 
 
 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of Christoffer Dall
 Sent: Tuesday, September 25, 2012 9:39 PM
 To: Min-gyu Kim
 Cc: kvm@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
 kvm...@lists.cs.columbia.edu; ??
 Subject: Re: [PATCH 13/15] KVM: ARM: Handle guest faults in KVM

 +
 +	/*
 +	 * If this is a write fault (think COW) we need to make sure the
 +	 * existing page, which other CPUs might still read, doesn't go away
 +	 * from under us, by calling gfn_to_pfn_prot(write_fault=true).
 +	 * Therefore, we call gfn_to_pfn_prot(write_fault=false), which will
 +	 * pin the existing page, then we get a new page for the user space
 +	 * pte and map this in the stage-2 table where we also make sure to
 +	 * flush the TLB for the VM, if there was an existing entry (the entry
 +	 * was updated setting the write flag to the potentially new page).
 +	 */
 +	if (fault_status == FSC_PERM) {
 +		pfn_existing = gfn_to_pfn_prot(vcpu->kvm, gfn, false, NULL);
 +		if (is_error_pfn(pfn_existing))
 +			return -EFAULT;
 +	}
 +
 +	pfn = gfn_to_pfn_prot(vcpu->kvm, gfn, write_fault, &writable);
 +	if (is_error_pfn(pfn)) {
 +		ret = -EFAULT;
 +		goto out_put_existing;
 +	}
 +
 +	/* We need minimum second+third level pages */
 +	ret = mmu_topup_memory_cache(memcache, 2, KVM_NR_MEM_OBJS);
 +	if (ret)
 +		goto out;
 +	new_pte = pfn_pte(pfn, PAGE_KVM_GUEST);
 +	if (writable)
 +		pte_val(new_pte) |= L_PTE2_WRITE;
 +	coherent_icache_guest_page(vcpu->kvm, gfn);

 why don't you flush icache only when guest has mapped executable page
 as __sync_icache_dcache function does currently?



 because we don't know if the guest will map the page executable. The guest
 may read the page through a normal load, which causes the fault, and
 subsequently execute it (even possible through different guest mappings).
 The only way to see this happening would be to mark all pages as non-
 executable and catch the fault when it occurs - unfortunately the HPFAR
 which gives us the IPA is not populated on execute never faults, so we
 would have to translate the PC's va to ipa using cp15 functionality when
 this happens, which is then also racy with other CPUs. So the question is
 really if this will even be an optimization, but it's definitely something
 that requires further investigation.
 
 OK. I understand your point.
 
 But if the guest maps a page for execution, the guest will flush the Icache
 from __sync_icache_dcache, so coherent_icache_guest_page doesn't seem to be
 necessary again. One case I'm not sure about is when the guest maps a kernel
 executable page (module loading) and then reuses that kernel executable page
 from the host (module unloading). But in that case, I think it is possible
 to reduce the number of flushes by limiting the address range that is
 flushed.

I think you're missing the major point:
When the guest maps a page for execution, it knows it has to synchronize
Icache and Dcache. But the guest never knows when we swap out a page
because of memory pressure.

When the guest eventually faults that page back in, chances are it will
be a different physical page, and the cache content may be inconsistent.
We must then sync Icache/Dcache for this page.

Now, as Christoffer mentioned, there's a number of schemes we could
potentially use to mitigate this effect (using the XN bit in the Stage2
page tables), but it remains to be seen how effective it will be.
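
For reference, the maintenance being discussed is roughly the following (a
minimal sketch assuming a PIPT icache; the real coherent_icache_guest_page
also has to cope with aliasing VIPT caches, and the helper name here is made
up):

	/* make a freshly (re)mapped guest page coherent for execution */
	static void sync_guest_icache(struct kvm *kvm, gfn_t gfn)
	{
		unsigned long hva = gfn_to_hva(kvm, gfn);

		if (kvm_is_error_hva(hva))
			return;

		/* clean D-cache to PoU and invalidate I-cache for this page */
		flush_icache_range(hva, hva + PAGE_SIZE);
	}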

M.
-- 
Jazz is not dead. It just smells funny...

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] s390/kvm: Fix vcpu_load handling in interrupt code

2012-09-27 Thread Christian Borntraeger
Recent changes (KVM: make processes waiting on vcpu mutex killable)
now require checking the return value of vcpu_load. This triggered
a warning in s390-specific kvm code. It turns out that we can actually
remove the put/load, since schedule will do the right thing via
the preempt notifiers.

Reported-by: Fengguang Wu fengguang...@intel.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 arch/s390/kvm/interrupt.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index 7556231..ff1e2f8 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -423,9 +423,7 @@ no_timer:
 		set_current_state(TASK_INTERRUPTIBLE);
 		spin_unlock_bh(&vcpu->arch.local_int.lock);
 		spin_unlock(&vcpu->arch.local_int.float_int->lock);
-		vcpu_put(vcpu);
 		schedule();
-		vcpu_load(vcpu);
 		spin_lock(&vcpu->arch.local_int.float_int->lock);
 		spin_lock_bh(&vcpu->arch.local_int.lock);
}
-- 
1.7.0.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: PPC: e500: fix allocation size error on g2h_tlb1_map

2012-09-27 Thread Marcelo Tosatti
On Tue, Sep 25, 2012 at 09:46:01AM +0200, Alexander Graf wrote:
 
 On 23.08.2012, at 03:04, Scott Wood wrote:
 
  We were only allocating half the bytes we need, which was made more
  obvious by a recent fix to the memset in  clear_tlb1_bitmap().
  
  Signed-off-by: Scott Wood scottw...@freescale.com
 
 Thanks, applied to kvm-ppc-next.
 
 Avi, Marcelo, this one should get applied to anything currently -stable as it 
 essentially means we could overrun an array that has been allocated too 
 small. How do we do this?
 
 
 Alex

Apparently Avi prefers that patches are sent directly to the -stable
tree.

Avi?

 
  ---
  arch/powerpc/kvm/e500_tlb.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
  
  diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
  index 43489a8..a27d134 100644
  --- a/arch/powerpc/kvm/e500_tlb.c
  +++ b/arch/powerpc/kvm/e500_tlb.c
  @@ -1385,7 +1385,7 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 
  *vcpu_e500)
  if (!vcpu_e500-gtlb_priv[1])
  goto err;
  
  -   vcpu_e500-g2h_tlb1_map = kzalloc(sizeof(unsigned int) *
  +   vcpu_e500-g2h_tlb1_map = kzalloc(sizeof(u64) *
vcpu_e500-gtlb_params[1].entries,
GFP_KERNEL);
  if (!vcpu_e500-g2h_tlb1_map)
  -- 
  1.7.9.5
  
  
  --
  To unsubscribe from this list: send the line unsubscribe kvm-ppc in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] s390/kvm: Fix vcpu_load handling in interrupt code

2012-09-27 Thread Avi Kivity
On 09/27/2012 05:29 PM, Christian Borntraeger wrote:
 Recent changes (KVM: make processes waiting on vcpu mutex killable)
 now requires to check the return value of vcpu_load. This triggered
 a warning in s390 specific kvm code. Turns out that we can actually
 remove the put/load, since schedule will do the right thing via
 the preempt notifiers.

Thanks, applied.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/7] KVM: MMU: fix release noslot pfn

2012-09-27 Thread Avi Kivity
On 09/24/2012 02:32 PM, Xiao Guangrong wrote:

 3 places after the whole patchset (There are some cleanups after this patch).
 
 and one by even stronger is_error_pfn(). 
 
 This one is:
 
 | if (!is_error_pfn(pfn)) {
 |kvm_release_pfn_clean(pfn);
 |return true;
 | }
 |
 | return false;
 
 We can change it to:
 
 | if (is_error_pfn(pfn))
 | return false;
 |
 | kvm_release_pfn_clean(pfn);
 | return true;
 
 I guess when/if other architectures add MMIO MMU
 caching they will need to guard kvm_release_pfn_clean() with is_noslot_pfn()
 too in most cases. I am not insisting, but as this patch shows it is
 easy to miss the check before calling the function.
 
 Sounds reasonable. I will consider it if Avi/Marcelo have no object on
 it.

I think it's a good idea.

Looks like we traded the unscalable error pages for these branches, I
think it's a reasonable tradeoff.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: PPC: e500: fix allocation size error on g2h_tlb1_map

2012-09-27 Thread Avi Kivity
On 09/27/2012 06:03 PM, Marcelo Tosatti wrote:
 On Tue, Sep 25, 2012 at 09:46:01AM +0200, Alexander Graf wrote:
 
 On 23.08.2012, at 03:04, Scott Wood wrote:
 
  We were only allocating half the bytes we need, which was made more
  obvious by a recent fix to the memset in  clear_tlb1_bitmap().
  
  Signed-off-by: Scott Wood scottw...@freescale.com
 
 Thanks, applied to kvm-ppc-next.
 
 Avi, Marcelo, this one should get applied to anything currently -stable as 
 it essentially means we could overrun an array that has been allocated too 
 small. How do we do this?
 
 
 Alex
 
 Apparently Avi prefers that patches are sent directly to the -stable
 tree.

We were discussing letting Greg honour Cc: sta...@vger.kernel.org
(currently he ignores them), not sending patches directly.

We still haven't told him to do so, but Alex, you can go ahead and add
the Cc: tag to the patch.

Do you have the auto-autotest setup ready?  I guess we can do it
manually until it is.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 13/15] KVM: ARM: Handle guest faults in KVM

2012-09-27 Thread Christoffer Dall
On Thu, Sep 27, 2012 at 8:39 AM, Catalin Marinas
catalin.mari...@arm.com wrote:
 On 25 September 2012 13:38, Christoffer Dall
 c.d...@virtualopensystems.com wrote:
 +
 + /*
 +  * If this is a write fault (think COW) we need to make sure the
 +  * existing page, which other CPUs might still read, doesn't go
 away
 +  * from under us, by calling gfn_to_pfn_prot(write_fault=true).
 +  * Therefore, we call gfn_to_pfn_prot(write_fault=false), which
 will
 +  * pin the existing page, then we get a new page for the user space
 +  * pte and map this in the stage-2 table where we also make sure to
 +  * flush the TLB for the VM, if there was an existing entry (the
 entry
 +  * was updated setting the write flag to the potentially new page).
 +  */
 + if (fault_status == FSC_PERM) {
 + pfn_existing = gfn_to_pfn_prot(vcpu-kvm, gfn, false, NULL);
 + if (is_error_pfn(pfn_existing))
 + return -EFAULT;
 + }
 +
 + pfn = gfn_to_pfn_prot(vcpu-kvm, gfn, write_fault, writable);
 + if (is_error_pfn(pfn)) {
 + ret = -EFAULT;
 + goto out_put_existing;
 + }
 +
 + /* We need minimum second+third level pages */
 + ret = mmu_topup_memory_cache(memcache, 2, KVM_NR_MEM_OBJS);
 + if (ret)
 + goto out;
 + new_pte = pfn_pte(pfn, PAGE_KVM_GUEST);
 + if (writable)
 + pte_val(new_pte) |= L_PTE2_WRITE;
 + coherent_icache_guest_page(vcpu-kvm, gfn);

 why don't you flush icache only when guest has mapped executable page
 as __sync_icache_dcache function does currently?

 because we don't know if the guest will map the page executable. The
 guest may read the page through a normal load, which causes the fault,
 and subsequently execute it (even possible through different guest
 mappings). The only way to see this happening would be to mark all
 pages as non-executable and catch the fault when it occurs -
 unfortunately the HPFAR which gives us the IPA is not populated on
 execute never faults, so we would have to translate the PC's va to ipa
 using cp15 functionality when this happens, which is then also racy
 with other CPUs.

 I think you can avoid the race in the stage 2 XN case. In the Hyp
 exception handler entered because of a stage 2 XN bit you can get the
 IPA via the CP15 ATS1CPR and PAR registers. If the address translation
 failed because the same guest running on other CPU changed the stage 1
 page table, you can simply return to the guest rather than switching
 to host with incomplete information. The guest may handle its own
 stage 1 fault and eventually trigger another stage 2 permission and
 Hyp will try the address translation again. That's a very rare
 situation, so just returning without handling it would not cause any
 performance issues.

you're right that the race is not a big issue, but it's not clear to
me that the trapping + ATS1CPR will be faster than just flushing
icache - we'll have to measure this.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 13/15] KVM: ARM: Handle guest faults in KVM

2012-09-27 Thread Catalin Marinas
On Thu, Sep 27, 2012 at 06:15:05PM +0100, Christoffer Dall wrote:
 On Thu, Sep 27, 2012 at 8:39 AM, Catalin Marinas
 catalin.mari...@arm.com wrote:
  On 25 September 2012 13:38, Christoffer Dall
  c.d...@virtualopensystems.com wrote:
  +
  + /*
  +  * If this is a write fault (think COW) we need to make sure the
  +  * existing page, which other CPUs might still read, doesn't go
  away
  +  * from under us, by calling gfn_to_pfn_prot(write_fault=true).
  +  * Therefore, we call gfn_to_pfn_prot(write_fault=false), which
  will
  +  * pin the existing page, then we get a new page for the user space
  +  * pte and map this in the stage-2 table where we also make sure to
  +  * flush the TLB for the VM, if there was an existing entry (the
  entry
  +  * was updated setting the write flag to the potentially new page).
  +  */
  + if (fault_status == FSC_PERM) {
  + pfn_existing = gfn_to_pfn_prot(vcpu-kvm, gfn, false, 
  NULL);
  + if (is_error_pfn(pfn_existing))
  + return -EFAULT;
  + }
  +
  + pfn = gfn_to_pfn_prot(vcpu-kvm, gfn, write_fault, writable);
  + if (is_error_pfn(pfn)) {
  + ret = -EFAULT;
  + goto out_put_existing;
  + }
  +
  + /* We need minimum second+third level pages */
  + ret = mmu_topup_memory_cache(memcache, 2, KVM_NR_MEM_OBJS);
  + if (ret)
  + goto out;
  + new_pte = pfn_pte(pfn, PAGE_KVM_GUEST);
  + if (writable)
  + pte_val(new_pte) |= L_PTE2_WRITE;
  + coherent_icache_guest_page(vcpu-kvm, gfn);
 
  why don't you flush icache only when guest has mapped executable page
  as __sync_icache_dcache function does currently?
 
  because we don't know if the guest will map the page executable. The
  guest may read the page through a normal load, which causes the fault,
  and subsequently execute it (even possible through different guest
  mappings). The only way to see this happening would be to mark all
  pages as non-executable and catch the fault when it occurs -
  unfortunately the HPFAR which gives us the IPA is not populated on
  execute never faults, so we would have to translate the PC's va to ipa
  using cp15 functionality when this happens, which is then also racy
  with other CPUs.
 
  I think you can avoid the race in the stage 2 XN case. In the Hyp
  exception handler entered because of a stage 2 XN bit you can get the
  IPA via the CP15 ATS1CPR and PAR registers. If the address translation
  failed because the same guest running on other CPU changed the stage 1
  page table, you can simply return to the guest rather than switching
  to host with incomplete information. The guest may handle its own
  stage 1 fault and eventually trigger another stage 2 permission and
  Hyp will try the address translation again. That's a very rare
  situation, so just returning without handling it would not cause any
  performance issues.
 
 you're right that the race is not a big issue, but it's not clear to
 me that the trapping + ATS1CPR will be faster than just flushing
 icache - we'll have to measure this.

I agree, it needs measuring first as it may not be worth the hassle.

-- 
Catalin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: vga passthrough // questions about pci passthrough

2012-09-27 Thread Alex Williamson
On Thu, 2012-09-27 at 13:19 +0200, Martin Wolf wrote:
 i waited 2 months to get a more stable ubuntu 12.10
 now that i dont get crashes every 5 minutes im going to test again.
 
 as you remember my two only problems are
 first the reboot issue of the guest os
 and second the xonar 2 d2x audio card.
 
 but first i would like to work on the reboot issue since
 i guess this is far more important.
 
 when i read the faq i found that one problem could be the VBIOS ROM access
 i tried -device pci-assign,...,romfile=... but have not found any output 
 or change.
 is there some documentation of the romfile command to generate usable
 output?

If you can read the rom from the host, this probably isn't adding
anything.  You can test this by finding the rom in /sys (find /sys -name
rom), match it up to the PCI device you're assigning.  Then do:

# echo 1 > rom
# xxd rom | head
# echo 0 > rom

If you get something out that starts with 55aa then qemu will be able to
read the rom directly.  If you're replacing the rom with the romfile=
option, you can boot a linux guest, do the same as above to read the rom
from the guest, save it to a file and compare md5sum of what the guest
sees to the file provided.  If qemu is reading the rom on the device
correctly, it's no surprise that providing one doesn't make any
difference.

 for testing i downloaded the bios of my videocard from techpowerup.
 they have a huge database and my card and biosrevision is listed.
 was that ok or do i have to retrieve the bios from inside the vm?

Thanks for mentioning that.  From other reports I've seen, it looks like
nvidia cards are the troublesome ones for being able to read the BIOS
from the host.  Thanks,

Alex


 thank you in advance
 
 Am 19.07.2012 18:16, schrieb Alex Williamson:
  On Thu, 2012-07-19 at 16:54 +0200, Martin Wolf wrote:
  i tried now the alpha of ubuntu 12.10 that includes qemu-kvm 1.1
  and a 3.5 kernel version.
 
  the msr problem is gone now, i was able to select sandybridge as
  cpu topology, this removed the MSR error messages.
  thats the positive part...
 
  unfortunately i had no success with the other topics (rebootability and
  the soundcard)
 
  rebootability:
  the guest vm still crashes with a page fault bluescreen
 
  soundcard:
  it would be really nice if someone would be able to give me
  some advise how to debug this problem.
  If kernel/qemu-kvm update didn't help, then the device probably doesn't
  support the interrupt disable mechanism that allows shared interrupts.
  Therefore the device needs to be on an exclusive interrupt on the host.
  Look in /proc/interrupts and see what else is on the same interrupt
  line.  Your output below indicates the device is on IRQ 11.  Sometimes
  moving the device to a different physical slot will change the IRQ and
  give you an exclusive interrupt, otherwise you need to find the devices
  sharing that interrupt line and disable them by attaching the device to
  pci-stub.  Thanks,
 
  Alex
 
  Am 18.07.2012 16:23, schrieb Alex Williamson:
  On Wed, 2012-07-18 at 14:37 +0200, Martin Wolf wrote:
  soundcard logs added
 
  On 18.07.2012 14:08, Martin Wolf wrote:
  On 18.07.2012 11:26, Jan Kiszka wrote:
  On 2012-07-18 07:45, Martin Wolf wrote:
  Hello,
 
  i was able to passthrough an AMD 7870 videocard to my win7 guest
  machine.
  Would you add it to 
  http://www.linux-kvm.org/page/VGA_device_assignment?
  sure, i will prepare something
  my host is ubuntu 12.04 with stock kernel.
  my system contains:
  dq67sw q67 mainboard
  i5-2400s cpu
  sapphire 7870 amd videocard
  xonar d2x (problems to passthrough)
 
  for full functionality i just needed two options
 
  - kernel : iommu=on
  - kvm module: ignore_msrs=1
  (if i would not set it the guest os would crash with a bluescreen)
  Can you report (= kernel log) which MSRs are unknown to KVM?
  Jul 18 14:03:33 kvm-xen kernel: [  437.309931] kvm: 3347: cpu1
  kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
  Jul 18 14:03:33 kvm-xen kernel: [  437.522724] kvm: 3347: cpu1
  kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
  Jul 18 14:03:33 kvm-xen kernel: [  437.522733] kvm: 3347: cpu1 ignored
  rdmsr: 0x1c9
  Jul 18 14:03:33 kvm-xen kernel: [  437.522736] kvm: 3347: cpu1 ignored
  rdmsr: 0x60
  Jul 18 14:03:33 kvm-xen kernel: [  437.522752] kvm: 3347: cpu1 ignored
  rdmsr: 0x1c9
  Jul 18 14:03:33 kvm-xen kernel: [  437.522755] kvm: 3347: cpu1 ignored
  rdmsr: 0x60
  Jul 18 14:03:33 kvm-xen kernel: [  437.522821] kvm: 3347: cpu1 ignored
  rdmsr: 0x1c9
  Jul 18 14:03:33 kvm-xen kernel: [  437.522823] kvm: 3347: cpu1 ignored
  rdmsr: 0x60
  Jul 18 14:03:33 kvm-xen kernel: [  437.522834] kvm: 3347: cpu1
  kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
  Jul 18 14:03:33 kvm-xen kernel: [  437.522840] kvm: 3347: cpu1 ignored
  rdmsr: 0x1c9
  Jul 18 14:03:33 kvm-xen kernel: [  437.522842] kvm: 3347: cpu1 ignored
  rdmsr: 0x60
  Jul 18 14:03:33 kvm-xen kernel: [  437.522865] kvm: 3347: cpu1 ignored
  rdmsr: 0x1c9
  Jul 18 

Re: vga passthrough // questions about pci passthrough

2012-09-27 Thread Martin Wolf

thank you for the information.

i will try what you mentioned...
do you have some additional information about rebooting a VM with a 
passed through videocard?

(amd / ati 7870)

thanks in advance

Am 27.09.2012 19:56, schrieb Alex Williamson:

On Thu, 2012-09-27 at 13:19 +0200, Martin Wolf wrote:

i waited 2 months to get a more stable ubuntu 12.10
now that i dont get crashes every 5 minutes im going to test again.

as you remember my two only problems are
first the reboot issue of the guest os
and second the xonar 2 d2x audio card.

but first i would like to work on the reboot issue since
i guess this is far more important.

when i read the faq i found that one problem could be the VBIOS ROM access
i tried -device pci-assign,...,romfile=... but have not found any output
or change.
is there some documentation of the romfile command to generate usable
output?

If you can read the rom from the host, this probably isn't adding
anything.  You can test this by finding the rom in /sys (find /sys -name
rom), match it up to the PCI device you're assigning.  Then do:

# echo 1 > rom
# xxd rom | head
# echo 0 > rom

If you get something out that starts with 55aa then qemu will be able to
read the rom directly.  If you're replacing the rom with the romfile=
option, you can boot a linux guest, do the same as above to read the rom
from the guest, save it to a file and compare md5sum of what the guest
sees to the file provided.  If qemu is reading the rom on the device
correctly, it's no surprise that providing one doesn't make any
difference.


for testing i downloaded the bios of my videocard from techpowerup.
they have a huge database and my card and biosrevision is listed.
was that ok or do i have to retrieve the bios from inside the vm?

Thanks for mentioning that.  From other reports I've seen, it looks like
nvidia cards are the troublesome ones for being able to read the BIOS
from the host.  Thanks,

Alex



thank you in advance

Am 19.07.2012 18:16, schrieb Alex Williamson:

On Thu, 2012-07-19 at 16:54 +0200, Martin Wolf wrote:

i tried now the alpha of ubuntu 12.10 that includes qemu-kvm 1.1
and a 3.5 kernel version.

the msr problem is gone now, i was able to select sandybridge as
cpu topology, this removed the MSR error messages.
thats the positive part...

unfortunately i had no success with the other topics (rebootability and
the soundcard)

rebootability:
the guest vm still crashes with a page fault bluescreen

soundcard:
it would be really nice if someone would be able to give me
some advise how to debug this problem.

If kernel/qemu-kvm update didn't help, then the device probably doesn't
support the interrupt disable mechanism that allows shared interrupts.
Therefore the device needs to be on an exclusive interrupt on the host.
Look in /proc/interrupts and see what else is on the same interrupt
line.  Your output below indicates the device is on IRQ 11.  Sometimes
moving the device to a different physical slot will change the IRQ and
give you an exclusive interrupt, otherwise you need to find the devices
sharing that interrupt line and disable them by attaching the device to
pci-stub.  Thanks,

Alex


Am 18.07.2012 16:23, schrieb Alex Williamson:

On Wed, 2012-07-18 at 14:37 +0200, Martin Wolf wrote:

soundcard logs added

On 18.07.2012 14:08, Martin Wolf wrote:

On 18.07.2012 11:26, Jan Kiszka wrote:

On 2012-07-18 07:45, Martin Wolf wrote:

Hello,

i was able to passthrough an AMD 7870 videocard to my win7 guest
machine.

Would you add it to http://www.linux-kvm.org/page/VGA_device_assignment?

sure, i will prepare something

my host is ubuntu 12.04 with stock kernel.
my system contains:
dq67sw q67 mainboard
i5-2400s cpu
sapphire 7870 amd videocard
xonar d2x (problems to passthrough)

for full functionality i just needed two options

- kernel : iommu=on
- kvm module: ignore_msrs=1
(if i would not set it the guest os would crash with a bluescreen)

Can you report (= kernel log) which MSRs are unknown to KVM?

Jul 18 14:03:33 kvm-xen kernel: [  437.309931] kvm: 3347: cpu1
kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
Jul 18 14:03:33 kvm-xen kernel: [  437.522724] kvm: 3347: cpu1
kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
Jul 18 14:03:33 kvm-xen kernel: [  437.522733] kvm: 3347: cpu1 ignored
rdmsr: 0x1c9
Jul 18 14:03:33 kvm-xen kernel: [  437.522736] kvm: 3347: cpu1 ignored
rdmsr: 0x60
Jul 18 14:03:33 kvm-xen kernel: [  437.522752] kvm: 3347: cpu1 ignored
rdmsr: 0x1c9
Jul 18 14:03:33 kvm-xen kernel: [  437.522755] kvm: 3347: cpu1 ignored
rdmsr: 0x60
Jul 18 14:03:33 kvm-xen kernel: [  437.522821] kvm: 3347: cpu1 ignored
rdmsr: 0x1c9
Jul 18 14:03:33 kvm-xen kernel: [  437.522823] kvm: 3347: cpu1 ignored
rdmsr: 0x60
Jul 18 14:03:33 kvm-xen kernel: [  437.522834] kvm: 3347: cpu1
kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
Jul 18 14:03:33 kvm-xen kernel: [  437.522840] kvm: 3347: cpu1 ignored
rdmsr: 0x1c9
Jul 18 14:03:33 kvm-xen kernel: [  437.522842] kvm: 3347: cpu1 ignored
rdmsr: 0x60
Jul 18 

Re: vga passthrough // questions about pci passthrough

2012-09-27 Thread Alex Williamson
On Thu, 2012-09-27 at 20:43 +0200, Martin Wolf wrote:
 thank you for the information.
 
 i will try what you mentioned...
 do you have some additional information about rebooting a VM with a 
 passed through videocard?
 (amd / ati 7870)

I don't.  Is the bsod on reboot only or does it also happen on shutdown?
There's a slim chance it could be traced by enabling debug in the
pci-assign driver and analyzing what the guest driver is trying to do.
I'm hoping that q35 chipset support might resolve some issues with vga
assignment as it exposes a topology that looks a bit more like one that
a driver would expect on physical hardware.  Thanks,

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: vga passthrough // questions about pci passthrough

2012-09-27 Thread Martin Wolf

when will that support be implemented?
how do i enable debug support for the pci-assign driver?


Am 27.09.2012 21:18, schrieb Alex Williamson:

On Thu, 2012-09-27 at 20:43 +0200, Martin Wolf wrote:

thank you for the information.

i will try what you mentioned...
do you have some additional information about rebooting a VM with a
passed through videocard?
(amd / ati 7870)

I don't.  Is the bsod on reboot only or does it also happen on shutdown?
There's a slim chance it could be traced by enabling debug in the
pci-assign driver and analyzing what the guest driver is trying to do.
I'm hoping that q35 chipset support might resolve some issues with vga
assignment as it exposes a topology that looks a bit more like one that
a driver would expect on physical hardware.  Thanks,

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
Adiumentum GmbH
Gf. Martin Wolf
Banderbacherstraße 76
90513 Zirndorf

0911 / 9601470
mw...@adiumentum.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: PPC: e500: fix allocation size error on g2h_tlb1_map

2012-09-27 Thread Alexander Graf

On 09/27/2012 06:35 PM, Avi Kivity wrote:

On 09/27/2012 06:03 PM, Marcelo Tosatti wrote:

On Tue, Sep 25, 2012 at 09:46:01AM +0200, Alexander Graf wrote:

On 23.08.2012, at 03:04, Scott Wood wrote:


We were only allocating half the bytes we need, which was made more
obvious by a recent fix to the memset in  clear_tlb1_bitmap().

Signed-off-by: Scott Woodscottw...@freescale.com

Thanks, applied to kvm-ppc-next.

Avi, Marcelo, this one should get applied to anything currently -stable as it 
essentially means we could overrun an array that has been allocated too small. 
How do we do this?


Alex

Apparently Avi prefers that patches are sent directly to the -stable
tree.

We were discussing letting Greg honour Cc: sta...@vger.kernel.org
(currently he ignores them), not sending patches directly.

We still haven't told him to do so, but Alex, you can go ahead and add
the Cc: tag to the patch.


Sure, I can certainly do that :).


Do you have the auto-autotest setup ready?  I guess we can do it
manually until it is.


I do have a local autotest setup. Or what exactly are you referring to?


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: vga passthrough // questions about pci passthrough

2012-09-27 Thread Alex Williamson
On Thu, 2012-09-27 at 21:37 +0200, Martin Wolf wrote:
 when will that support be implemented?
 how do i enable debug support for the pci-assign driver?

I believe the target is qemu 1.3 for q35.  Debug is enabled by
uncommenting the DEVICE_ASSIGNMENT_DEBUG define in hw/kvm/pci-assign.c
(or hw/device-assignment.c on qemu-kvm in an older tree) and rebuilding.
Thanks,

Alex

 Am 27.09.2012 21:18, schrieb Alex Williamson:
  On Thu, 2012-09-27 at 20:43 +0200, Martin Wolf wrote:
  thank you for the information.
 
  i will try what you mentioned...
  do you have some additional information about rebooting a VM with a
  passed through videocard?
  (amd / ati 7870)
  I don't.  Is the bsod on reboot only or does it also happen on shutdown?
  There's a slim chance it could be traced by enabling debug in the
  pci-assign driver and analyzing what the guest driver is trying to do.
  I'm hoping that q35 chipset support might resolve some issues with vga
  assignment as it exposes a topology that looks a bit more like one that
  a driver would expect on physical hardware.  Thanks,
 
  Alex
 
  --
  To unsubscribe from this list: send the line unsubscribe kvm in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHSET] kthread_worker: reimplement flush_kthread_work() to allow freeing during execution

2012-09-27 Thread Greg KH
On Mon, Sep 17, 2012 at 01:28:50PM -0700, Greg KH wrote:
 On Mon, Sep 17, 2012 at 12:40:16PM -0700, Tejun Heo wrote:
  On Fri, Sep 14, 2012 at 03:50:40PM -0700, Colin Cross wrote:
   This patch set fixes a reproducible crash I'm seeing on a 3.4.10
   kernel.  flush_kthread_worker (which is different from
   flush_kthread_work) is initializing a kthread_work and a completion on
   the stack, then queuing it and calling wait_for_completion.  Once the
   completion is signaled, flush_kthread_worker exits and the stack
   region used by the kthread_work may be immediately reused by another
   object on the stack, but kthread_worker_fn continues accessing its
   work pointer:
    work->func(work);                   <-- calls complete(), effectively frees work
    smp_wmb();  /* wmb worker-b0 paired with flush-b1 */
    work->done_seq = work->queue_seq;   <-- overwrites a new stack object
    smp_mb();   /* mb worker-b1 paired with flush-b0 */
    if (atomic_read(&work->flushing))
            wake_up_all(&work->done);   <-- or crashes here
   
   These patches fix the problem by not accessing work after work-func
   is called, and should be backported to stable.  They apply cleanly to
   3.4.10.  Upstream commits are 9a2e03d8ed518a61154f18d83d6466628e519f94
   and 46f3d976213452350f9d10b0c2780c2681f7075b.
  
  Yeah, you're right.  I wonder why this didn't come up before.  Greg,
  can you please pick up these two commits?
 
 Ok, will do, thanks for letting me know.

Now applied, thanks.

greg k-h
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 29/38] KVM: PPC: Book3S: PR: Rework irq disabling

2012-09-27 Thread Alexander Graf

On 17.08.2012, at 23:47, Benjamin Herrenschmidt wrote:

 
 +/* Please call after prepare_to_enter. This function puts the lazy ee state
 +   back to normal mode, without actually enabling interrupts. */
 +static inline void kvmppc_lazy_ee_enable(void)
 +{
 +#ifdef CONFIG_PPC64
 +/* Only need to enable IRQs by hard enabling them after this */
 +local_paca-irq_happened = 0;
 +local_paca-soft_enabled = 1;
 +#endif
 +}
 
 Smells like the above is the right spot for trace_hardirqs_on() an:

Hrm. Ok :).

 
 -__hard_irq_disable();
 +local_irq_disable();
  if (kvmppc_prepare_to_enter(vcpu)) {
 -/* local_irq_enable(); */
 +local_irq_enable();
  run-exit_reason = KVM_EXIT_INTR;
  r = -EINTR;
  } else {
  /* Going back to guest */
  kvm_guest_enter();
 +kvmppc_lazy_ee_enable();
  }
  }
 
 You should probably do kvmppc_lazy_ee_enable() before guest entry
 so the CPU doesn't appear to the rest of the world to have
 interrupts disabled while it's in the guest.

I don't think I understand. The patch adds kvmppc_lazy_ee_enable() right before 
guest entry here, no?

 
 @@ -1066,8 +1062,6 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
 #endif
  ulong ext_msr;
 
 -preempt_disable();
 -
  /* Check if we can run the vcpu at all */
  if (!vcpu-arch.sane) {
  kvm_run-exit_reason = KVM_EXIT_INTERNAL_ERROR;
 @@ -1081,9 +1075,9 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
   * really did time things so badly, then we just exit again due to
   * a host external interrupt.
   */
 -__hard_irq_disable();
 +local_irq_disable();
  if (kvmppc_prepare_to_enter(vcpu)) {
 -__hard_irq_enable();
 +local_irq_enable();
  kvm_run-exit_reason = KVM_EXIT_INTR;
  ret = -EINTR;
  goto out;
 @@ -1122,7 +1116,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
  if (vcpu-arch.shared-msr  MSR_FP)
  kvmppc_handle_ext(vcpu, BOOK3S_INTERRUPT_FP_UNAVAIL, MSR_FP);
 
 -kvm_guest_enter();
 +kvmppc_lazy_ee_enable();
 
 Same. BTW, why do you have two enter path ? Smells like a recipe for
 disaster :-)

Because this way we can keep r14-r31 in registers ;). But here too we call 
kvmppc_lazy_ee_enable() right before going into guest context, no?

I can't shake off the feeling I don't fully understand your comments :)


Alex

 
   ret = __kvmppc_vcpu_run(kvm_run, vcpu);
 
 @@ -1157,7 +1151,6 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
 
 out:
  vcpu-mode = OUTSIDE_GUEST_MODE;
 -preempt_enable();
  return ret;
 }
 
 diff --git a/arch/powerpc/kvm/book3s_rmhandlers.S 
 b/arch/powerpc/kvm/book3s_rmhandlers.S
 index 9ecf6e3..b2f8258 100644
 --- a/arch/powerpc/kvm/book3s_rmhandlers.S
 +++ b/arch/powerpc/kvm/book3s_rmhandlers.S
 @@ -170,20 +170,21 @@ kvmppc_handler_skip_ins:
  * Call kvmppc_handler_trampoline_enter in real mode
  *
  * On entry, r4 contains the guest shadow MSR
 + * MSR.EE has to be 0 when calling this function
  */
 _GLOBAL(kvmppc_entry_trampoline)
  mfmsr   r5
  LOAD_REG_ADDR(r7, kvmppc_handler_trampoline_enter)
  toreal(r7)
 
 -li  r9, MSR_RI
 -ori r9, r9, MSR_EE
 -andcr9, r5, r9  /* Clear EE and RI in MSR value */
  li  r6, MSR_IR | MSR_DR
 -ori r6, r6, MSR_EE
 -andcr6, r5, r6  /* Clear EE, DR and IR in MSR value */
 -MTMSR_EERI(r9)  /* Clear EE and RI in MSR */
 -mtsrr0  r7  /* before we set srr0/1 */
 +andcr6, r5, r6  /* Clear DR and IR in MSR value */
 +/*
 + * Set EE in HOST_MSR so that it's enabled when we get into our
 + * C exit handler function
 + */
 +ori r5, r5, MSR_EE
 +mtsrr0  r7
  mtsrr1  r6
  RFI
 
 diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
 index aae535f..2bd190c 100644
 --- a/arch/powerpc/kvm/booke.c
 +++ b/arch/powerpc/kvm/booke.c
 @@ -486,6 +486,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
  ret = -EINTR;
  goto out;
  }
 +kvmppc_lazy_ee_enable();
 
  kvm_guest_enter();
 
 Same.
 
 @@ -955,6 +956,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct 
 kvm_vcpu *vcpu,
  } else {
  /* Going back to guest */
  kvm_guest_enter();
 +kvmppc_lazy_ee_enable();
  }
  }
 
 Same.
 
 
 diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
 index 053bfef..545c183 100644
 --- a/arch/powerpc/kvm/powerpc.c
 +++ b/arch/powerpc/kvm/powerpc.c
 @@ -30,6 +30,7 @@
 #include asm/kvm_ppc.h
 #include asm/tlbflush.h
 #include asm/cputhreads.h
 +#include 

RE: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

2012-09-27 Thread Auld, Will
Marcelo,

I tagged my comments below with [auld] to make it easier to read. 

Thanks,

Will

-Original Message-
From: Marcelo Tosatti [mailto:mtosa...@redhat.com] 
Sent: Thursday, September 27, 2012 4:49 AM
To: Auld, Will
Cc: kvm@vger.kernel.org; Avi Kivity; Zhang, Xiantao; Liu, Jinsong
Subject: Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

On Thu, Sep 27, 2012 at 08:31:22AM -0300, Marcelo Tosatti wrote:
 On Thu, Sep 27, 2012 at 12:50:16AM +, Auld, Will wrote:
  Marcelo,
  
  I think I am missing something. There should be no changes needed to the
  algorithms that exist today. Does it seem that I have broken
  Zachary's implementation somehow?
 
 Yes. compute_guest_tsc() function must take ia32_tsc_adjust into 
 account. guest_read_tsc (and the SVM equivalent) also.

[auld] I don't see how that function is broken. 

Also, must take into account VMX-SVM migration. In that case, you should 
export IA32_TSC_ADJUST along with IA32_TSC MSR.

[auld] I'll give this more thought. There are two ways to go: allow this to 
work only on host processors with this feature, or enable it for all VMs 
independent of the underlying host processor capability. In the former case, 
migrating across host types (VMX to SVM) might be disallowed. In the latter 
case, sending only IA32_TSC on migration should be enough, as the delta would 
be accounted for in the tsc_offset of the control structure.

Which brings us back to the initial question, if there are other means to 
provide stable TSC, why use this MSR? For example, VMWare guests have no need 
to use this MSR (because the hypervisor provides TSC guarantees).

[auld] Using this MSR simplifies the process of synchronizing the TSC for each 
logical processor, because its value does not advance with the clock. How would 
you write the same value to all the IA32_TIME_STAMP_COUNTER MSRs? You would have 
to figure out what you want to write there, get all the processors to rendezvous 
at the same time, and have all logical processors complete their writes within a 
very small window. Contrast that with deciding the offset to write and then having 
all the logical processors write that offset: no worries about rendezvous, 
synchronizing the writes in time, and so on.
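
To make the contrast concrete, here is a minimal sketch of the two approaches
(illustrative only, not code from any patch in this thread; it assumes the
kernel's wrmsrl()/rdmsrl() helpers and uses local names for the architectural
MSR indices):

#include <linux/types.h>
#include <asm/msr.h>

/* Architectural MSR indices: IA32_TIME_STAMP_COUNTER and IA32_TSC_ADJUST */
#define TSC_MSR		0x00000010
#define TSC_ADJUST_MSR	0x0000003b

/* Without IA32_TSC_ADJUST: every CPU writes the TSC itself, and all the
 * writes must land inside a tiny window, so a rendezvous is required. */
static void sync_tsc_by_direct_write(u64 target_tsc)
{
	wrmsrl(TSC_MSR, target_tsc);
}

/* With IA32_TSC_ADJUST: the correction does not advance with time, so each
 * CPU can apply its measured delta whenever it gets to it, no rendezvous. */
static void sync_tsc_by_adjust(s64 delta_vs_cpu0)
{
	u64 adjust;

	rdmsrl(TSC_ADJUST_MSR, adjust);
	wrmsrl(TSC_ADJUST_MSR, adjust - delta_vs_cpu0);
}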

Then we come back to the two questions: 

- Is there anyone from Intel working on the Linux host side, where it
  makes sense to use this?

[auld] I am not aware of anyone working on this for Linux.

- Are you sure its worthwhile to expose this to KVM guests?

[auld] At least one OS that is commonly used as a guest is moving to implement 
this.

  
  Thanks,
  
  Will
  
  -Original Message-
  From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
  Sent: Wednesday, September 26, 2012 5:29 PM
  To: Auld, Will
  Cc: kvm@vger.kernel.org; Avi Kivity; Zhang, Xiantao; Liu, Jinsong
  Subject: Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
  
  On Wed, Sep 26, 2012 at 10:58:46PM +, Auld, Will wrote:
   Avi, Still working on your suggestions.
   
   Marcelo,
   
   The purpose is to be able to run guests that implement this change and 
   not require they revert to the older method of adjusting the TSC. I am 
   making no assumption about whether the guest checks to see if the times 
   are good enough or just runs an algorithm every time but in any case this 
   would allow the simpler, cleaner and less expensive algorithm to run if 
   it exists. 
  
  Will, you can choose to not expose the feature. Correct?
  
  Because this conflicts with the model that has been envisioned and 
  developed by Zachary... for that model to continue to be functional you'll 
  have to make sure the TSC emulation is adjusted accordingly to consider 
  IA32_TSC_ADJUST (for example, when trapping TSC).
  
  From that point of view, the patch below is incomplete.
  
   ... or KVM can choose to never expose the feature via CPUID and handle TSC 
   consistency itself (I understand your perspective of getting a task 
   complete, but unfortunately from my POV it's not so simple).
  
   Thanks,
   
   Will
   
   The purpose of the IA32_TSC_ADJUST control is to make it easier for the 
   operating system (host) to decrease the delta between cores to an 
   acceptable value, so that applications can make use of direct RDTSC, 
   correct?
   
   Why is it necessary for the guests to make use of such interface, if the 
   hypervisor could provide proper TSC?
   
   (not against exposing it to the guests, just thinking out loud).
   
   That is, if the purpose of the IA32_TSC_ADJUST is to provide proper 
   synchronized TSC across cores, and newer guests which should already 
   make use of paravirt clock interface, what is the point of exposing the 
   feature?
   
   -Original Message-
   From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
   Sent: Wednesday, September 26, 2012 2:35 PM
   To: Auld, Will
   Cc: kvm@vger.kernel.org; Avi Kivity; Zhang, Xiantao
   Subject: Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
   
   On Wed, Sep 19, 

Re: vga passthrough // questions about pci passthrough

2012-09-27 Thread Martin Wolf

Regarding your question whether the BSOD also happens on shutdown: no,
it does not. It just happens after an MSI message in the kernel log when
rebooting the VM.
(But the MSI message is also there on initial boot, where the system
works perfectly.)

In my opinion it happens at the moment the video driver is loaded.

On 27.09.2012 22:46, Alex Williamson wrote:

On Thu, 2012-09-27 at 21:37 +0200, Martin Wolf wrote:

When will that support be implemented?
How do I enable debug support for the pci-assign driver?

I believe the target is qemu 1.3 for q35.  Debug is enabled by
uncommenting the DEVICE_ASSIGNMENT_DEBUG define in hw/kvm/pci-assign.c
(or hw/device-assignment.c on qemu-kvm in an older tree) and rebuilding.
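
For reference, the change amounts to this (the exact comment style around the
define may differ in your tree):

--- a/hw/kvm/pci-assign.c
+++ b/hw/kvm/pci-assign.c
-/* #define DEVICE_ASSIGNMENT_DEBUG */
+#define DEVICE_ASSIGNMENT_DEBUG

After rebuilding, the extra debug output typically lands on qemu's stderr.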
Thanks,

Alex


On 27.09.2012 21:18, Alex Williamson wrote:

On Thu, 2012-09-27 at 20:43 +0200, Martin Wolf wrote:

Thank you for the information.

I will try what you mentioned...
Do you have some additional information about rebooting a VM with a
passed-through video card?
(AMD / ATI 7870)

I don't.  Is the bsod on reboot only or does it also happen on shutdown?
There's a slim chance it could be traced by enabling debug in the
pci-assign driver and analyzing what the guest driver is trying to do.
I'm hoping that q35 chipset support might resolve some issues with vga
assignment as it exposes a topology that looks a bit more like one that
a driver would expect on physical hardware.  Thanks,

Alex









--
Adiumentum GmbH
Gf. Martin Wolf
Banderbacherstraße 76
90513 Zirndorf

0911 / 9601470
mw...@adiumentum.com



[PATCH] KVM: PPC: Book3S HV: Allow DTL to be unregistered through KVM_SET_ONE_REG interface

2012-09-27 Thread Paul Mackerras
From 0912ff6f7d95aa559af75ba69aa7f661357f5dce Mon Sep 17 00:00:00 2001
From: Paul Mackerras pau...@samba.org
Date: Fri, 28 Sep 2012 15:26:07 +1000
Subject: [PATCH] KVM: PPC: Book3S HV: Allow DTL to be set to address 0,
 length 0

Recently-added code to allow the dispatch trace log buffer address
to be set with the KVM_SET_ONE_REG interface includes a check on
the length to make sure the buffer is at least one entry long.
This is appropriate when registering a buffer, but the interface
also allows for any existing buffer to be unregistered by specifying
a zero address.  In this case the length check is not appropriate,
so this makes the check conditional on the address being non-zero.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_hv.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 350b038..1eda13d 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -854,9 +854,8 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, union 
kvmppc_one_reg *val)
addr = val->vpaval.addr;
len = val->vpaval.length;
r = -EINVAL;
-   if (len < sizeof(struct dtl_entry))
-   break;
-   if (addr && !vcpu->arch.vpa.next_gpa)
+   if (addr && (len < sizeof(struct dtl_entry) ||
+                !vcpu->arch.vpa.next_gpa))
break;
len -= len % sizeof(struct dtl_entry);
r = set_vpa(vcpu, &vcpu->arch.dtl, addr, len);
-- 
1.7.10.4
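
For completeness, unregistering the DTL from userspace would look roughly like
this (the 128-bit VPA_DTL register is assumed to carry an address/length pair;
treat the layout as illustrative and check the ONE_REG documentation):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Unregister the dispatch trace log buffer: address 0, length 0. */
static int unregister_dtl(int vcpu_fd)
{
	uint64_t val[2] = { 0, 0 };	/* { address, length } -- layout assumed */
	struct kvm_one_reg reg = {
		.id   = KVM_REG_PPC_VPA_DTL,
		.addr = (uintptr_t)val,
	};

	return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
}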



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/27/2012 05:33 PM, Avi Kivity wrote:

On 09/27/2012 01:23 PM, Raghavendra K T wrote:


This gives us a good case for tracking preemption on a per-vm basis.  As
long as we aren't preempted, we can keep the PLE window high, and also
return immediately from the handler without looking for candidates.


1) So do you think the deferred-preemption patch (which Vatsa mentioned
a while back) is also worth trying, so that we reduce the chance
of LHP (lock-holder preemption)?


Yes, we have to keep it in mind.  It will be useful for fine grained
locks, not so much so coarse locks or IPIs.



Agree.


I would still of course prefer a PLE solution, but if we can't get it to
work we can consider preemption deferral.



Okay.



IIRC, with deferred preemption:
we would have a hook in the spinlock/unlock path to track the depth of locks
held, shared with the host scheduler (maybe via MSRs now).
The host scheduler 'prefers' not to preempt a lock-holding vcpu (or rather
gives it, say, one more chance).


A downside is that we have to do that even when undercommitted.

Also there may be a lot of false positives (deferred preemptions even
when there is no contention).


Yes. That is a worry.
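
For illustration, a very rough sketch of the guest-side hook described above;
the MSR index and the one-chance policy are hypothetical, not an existing KVM
interface:

#include <linux/percpu.h>
#include <asm/msr.h>

/* Hypothetical paravirt MSR the host scheduler would consult; the index
 * below is made up purely for illustration. */
#define MSR_KVM_PREEMPT_DEFER	0x4b564dff

static DEFINE_PER_CPU(unsigned int, lock_depth);

static inline void pv_lock_acquired(void)
{
	/* first lock taken on this vcpu: ask the host to defer preemption */
	if (this_cpu_inc_return(lock_depth) == 1)
		wrmsrl(MSR_KVM_PREEMPT_DEFER, 1);
}

static inline void pv_lock_released(void)
{
	/* last lock dropped: preempting this vcpu is cheap again */
	if (this_cpu_dec_return(lock_depth) == 0)
		wrmsrl(MSR_KVM_PREEMPT_DEFER, 0);
}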





2) Looking at the results (comparing A and C), I do feel we have
significant overhead in iterating over vcpus (when compared even to a vmexit),
so we would still need the undercommit fix suggested by PeterZ (improving by
140%)?


Looking only at the current runqueue?  My worry is that it misses a lot
of cases.  Maybe try the current runqueue first and then others.

Or were you referring to something else?


No. I was referring to the same thing.

However, I had also tried the following (which works well for checking the
undercommitted scenario), but I am thinking of using it only for yielding in
the overcommit case (yield on overcommit as suggested by Rik) and keeping the
undercommit patch as suggested by PeterZ.


[ patch is not in proper diff I suppose ].

Will test them.

Peter, can I post your patch with your From/SOB in V2?
Please let me know.

---
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 28f00bc..9ed3759 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1620,6 +1620,21 @@ bool kvm_vcpu_eligible_for_directed_yield(struct 
kvm_vcpu *vcpu)

return eligible;
 }
 #endif
+
+bool kvm_overcommitted()
+{
+   unsigned long load;
+
+   load = avenrun[0] + FIXED_1/200;
+   load = load >> FSHIFT;  /* drop the fixed-point fraction */
+   load = (load << 7) / num_online_cpus();  /* 128 == one runnable task per online cpu */
+
+   if (load > 128)
+   return true;
+
+   return false;
+}
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
struct kvm *kvm = me->kvm;
@@ -1629,6 +1644,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
int pass;
int i;

+   if (!kvm_overcommitted())
+   return;
+
kvm_vcpu_set_in_spin_loop(me, true);
/*
 * We boost the priority of a VCPU that is runnable but not



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread H. Peter Anvin

On 09/27/2012 10:38 PM, Raghavendra K T wrote:

+
+bool kvm_overcommitted()
+{


This better not be C...

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



[PATCH v2 06/10] KVM: PPC: Book3S HV: Don't access runnable threads list without vcore lock

2012-09-27 Thread Paul Mackerras
There were a few places where we were traversing the list of runnable
threads in a virtual core, i.e. vc-runnable_threads, without holding
the vcore spinlock.  This extends the places where we hold the vcore
spinlock to cover everywhere that we traverse that list.

Since we possibly need to sleep inside kvmppc_book3s_hv_page_fault,
this moves the call of it from kvmppc_handle_exit out to
kvmppc_vcpu_run, where we don't hold the vcore lock.

In kvmppc_vcore_blocked, we don't actually need to check whether
all vcpus are ceded and don't have any pending exceptions, since the
caller has already done that.  The caller (kvmppc_run_vcpu) wasn't
actually checking for pending exceptions, so we add that.

The change of if to while in kvmppc_run_vcpu is to make sure that we
never call kvmppc_remove_runnable() when the vcore state is RUNNING or
EXITING.

Signed-off-by: Paul Mackerras pau...@samba.org
---
v2: move RESUME_PAGE_FAULT defn to book3s_hv.c

 arch/powerpc/include/asm/kvm_asm.h |1 +
 arch/powerpc/kvm/book3s_hv.c   |   67 ++--
 2 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index 76fdcfe..aabcdba 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -118,6 +118,7 @@
 
 #define RESUME_FLAG_NV  (10)  /* Reload guest nonvolatile state? */
 #define RESUME_FLAG_HOST(11)  /* Resume host? */
+#define RESUME_FLAG_ARCH1  (12)
 
 #define RESUME_GUEST0
 #define RESUME_GUEST_NV RESUME_FLAG_NV
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 77dec0f..3a737a4 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -57,6 +57,9 @@
 /* #define EXIT_DEBUG_SIMPLE */
 /* #define EXIT_DEBUG_INT */
 
+/* Used to indicate that a guest page fault needs to be handled */
+#define RESUME_PAGE_FAULT  (RESUME_GUEST | RESUME_FLAG_ARCH1)
+
 static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
 static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
 
@@ -431,7 +434,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
  struct task_struct *tsk)
 {
int r = RESUME_HOST;
-   int srcu_idx;
 
vcpu-stat.sum_exits++;
 
@@ -491,16 +493,12 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
 * have been handled already.
 */
case BOOK3S_INTERRUPT_H_DATA_STORAGE:
-   srcu_idx = srcu_read_lock(vcpu-kvm-srcu);
-   r = kvmppc_book3s_hv_page_fault(run, vcpu,
-   vcpu-arch.fault_dar, vcpu-arch.fault_dsisr);
-   srcu_read_unlock(vcpu-kvm-srcu, srcu_idx);
+   r = RESUME_PAGE_FAULT;
break;
case BOOK3S_INTERRUPT_H_INST_STORAGE:
-   srcu_idx = srcu_read_lock(vcpu-kvm-srcu);
-   r = kvmppc_book3s_hv_page_fault(run, vcpu,
-   kvmppc_get_pc(vcpu), 0);
-   srcu_read_unlock(vcpu-kvm-srcu, srcu_idx);
+   vcpu-arch.fault_dar = kvmppc_get_pc(vcpu);
+   vcpu-arch.fault_dsisr = 0;
+   r = RESUME_PAGE_FAULT;
break;
/*
 * This occurs if the guest executes an illegal instruction.
@@ -984,22 +982,24 @@ static int on_primary_thread(void)
  * Run a set of guest threads on a physical core.
  * Called with vc-lock held.
  */
-static int kvmppc_run_core(struct kvmppc_vcore *vc)
+static void kvmppc_run_core(struct kvmppc_vcore *vc)
 {
struct kvm_vcpu *vcpu, *vcpu0, *vnext;
long ret;
u64 now;
int ptid, i, need_vpa_update;
int srcu_idx;
+   struct kvm_vcpu *vcpus_to_update[threads_per_core];
 
/* don't start if any threads have a signal pending */
need_vpa_update = 0;
list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list) {
if (signal_pending(vcpu->arch.run_task))
-   return 0;
-   need_vpa_update |= vcpu->arch.vpa.update_pending |
-   vcpu->arch.slb_shadow.update_pending |
-   vcpu->arch.dtl.update_pending;
+   return;
+   if (vcpu->arch.vpa.update_pending ||
+   vcpu->arch.slb_shadow.update_pending ||
+   vcpu->arch.dtl.update_pending)
+   vcpus_to_update[need_vpa_update++] = vcpu;
}

/*
@@ -1019,8 +1019,8 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
 */
if (need_vpa_update) {
spin_unlock(&vc->lock);
-   list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
-   kvmppc_update_vpas(vcpu);
+   for (i = 0; i < need_vpa_update; ++i)
+   kvmppc_update_vpas(vcpus_to_update[i]);

[PATCH v2 09/10] KVM: PPC: Book3S HV: Fix accounting of stolen time

2012-09-27 Thread Paul Mackerras
Currently the code that accounts stolen time tends to overestimate the
stolen time, and will sometimes report more stolen time in a DTL
(dispatch trace log) entry than has elapsed since the last DTL entry.
This can cause guests to underflow the user or system time measured
for some tasks, leading to ridiculous CPU percentages and total runtimes
being reported by top and other utilities.
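(As a rough illustration: if 10 ms of wall-clock time elapsed since the
previous DTL entry but the entry claims 12 ms stolen, the guest's accounting
ends up computing roughly 10 ms - 12 ms of used time, and the unsigned
subtraction wraps to a huge value, hence the ridiculous percentages and
runtimes.)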

In addition, the current code was designed for the previous policy where
a vcore would only run when all the vcpus in it were runnable, and so
only counted stolen time on a per-vcore basis.  Now that a vcore can
run while some of the vcpus in it are doing other things in the kernel
(e.g. handling a page fault), we need to count the time when a vcpu task
is preempted while it is not running as part of a vcore as stolen also.

To do this, we bring back the BUSY_IN_HOST vcpu state and extend the
vcpu_load/put functions to count preemption time while the vcpu is
in that state.  Handling the transitions between the RUNNING and
BUSY_IN_HOST states requires checking and updating two variables
(accumulated time stolen and time last preempted), so we add a new
spinlock, vcpu-arch.tbacct_lock.  This protects both the per-vcpu
stolen/preempt-time variables, and the per-vcore variables while this
vcpu is running the vcore.

Finally, we now don't count time spent in userspace as stolen time.
The task could be executing in userspace on behalf of the vcpu, or
it could be preempted, or the vcpu could be genuinely stopped.  Since
we have no way of dividing up the time between these cases, we don't
count any of it as stolen.

Signed-off-by: Paul Mackerras pau...@samba.org
---
v2: just rediffed for context changes, no real change from previous version.

 arch/powerpc/include/asm/kvm_host.h |5 ++
 arch/powerpc/kvm/book3s_hv.c|  127 ++-
 2 files changed, 117 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 1e8cbd1..3093896 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -559,12 +559,17 @@ struct kvm_vcpu_arch {
unsigned long dtl_index;
u64 stolen_logged;
struct kvmppc_vpa slb_shadow;
+
+   spinlock_t tbacct_lock;
+   u64 busy_stolen;
+   u64 busy_preempt;
 #endif
 };
 
 /* Values for vcpu->arch.state */
 #define KVMPPC_VCPU_NOTREADY   0
 #define KVMPPC_VCPU_RUNNABLE   1
+#define KVMPPC_VCPU_BUSY_IN_HOST   2
 
 /* Values for vcpu->arch.io_gpr */
 #define KVM_MMIO_REG_MASK  0x001f
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 61d2934..8b3c470 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -60,23 +60,74 @@
 /* Used to indicate that a guest page fault needs to be handled */
 #define RESUME_PAGE_FAULT  (RESUME_GUEST | RESUME_FLAG_ARCH1)
 
+/* Used as a null value for timebase values */
+#define TB_NIL (~(u64)0)
+
 static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
 static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
 
+/*
+ * We use the vcpu_load/put functions to measure stolen time.
+ * Stolen time is counted as time when either the vcpu is able to
+ * run as part of a virtual core, but the task running the vcore
+ * is preempted or sleeping, or when the vcpu needs something done
+ * in the kernel by the task running the vcpu, but that task is
+ * preempted or sleeping.  Those two things have to be counted
+ * separately, since one of the vcpu tasks will take on the job
+ * of running the core, and the other vcpu tasks in the vcore will
+ * sleep waiting for it to do that, but that sleep shouldn't count
+ * as stolen time.
+ *
+ * Hence we accumulate stolen time when the vcpu can run as part of
+ * a vcore using vc->stolen_tb, and the stolen time when the vcpu
+ * needs its task to do other things in the kernel (for example,
+ * service a page fault) in busy_stolen.  We don't accumulate
+ * stolen time for a vcore when it is inactive, or for a vcpu
+ * when it is in state RUNNING or NOTREADY.  NOTREADY is a bit of
+ * a misnomer; it means that the vcpu task is not executing in
+ * the KVM_VCPU_RUN ioctl, i.e. it is in userspace or elsewhere in
+ * the kernel.  We don't have any way of dividing up that time
+ * between time that the vcpu is genuinely stopped, time that
+ * the task is actively working on behalf of the vcpu, and time
+ * that the task is preempted, so we don't count any of it as
+ * stolen.
+ *
+ * Updates to busy_stolen are protected by arch.tbacct_lock;
+ * updates to vc->stolen_tb are protected by the arch.tbacct_lock
+ * of the vcpu that has taken responsibility for running the vcore
+ * (i.e. vc->runner).  The stolen times are measured in units of
+ * timebase ticks.  (Note that the != TB_NIL checks below are
+ * purely defensive; they should never fail.)
+ */
+
 void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
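	/*
	 * Roughly what the load side of the accounting described above does.
	 * This is reconstructed from the commit message and the comment; the
	 * field names follow the hunks above, but the real patch may differ.
	 */
	struct kvmppc_vcore *vc = vcpu->arch.vcore;

	spin_lock(&vcpu->arch.tbacct_lock);
	if (vc->runner == vcpu && vc->vcore_state != VCORE_INACTIVE &&
	    vc->preempt_tb != TB_NIL) {
		/* the task running this vcore was preempted: count it as stolen */
		vc->stolen_tb += mftb() - vc->preempt_tb;
		vc->preempt_tb = TB_NIL;
	}
	if (vcpu->arch.state == KVMPPC_VCPU_BUSY_IN_HOST &&
	    vcpu->arch.busy_preempt != TB_NIL) {
		/* the vcpu task was busy in the host and got preempted */
		vcpu->arch.busy_stolen += mftb() - vcpu->arch.busy_preempt;
		vcpu->arch.busy_preempt = TB_NIL;
	}
	spin_unlock(&vcpu->arch.tbacct_lock);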

Re: [PATCH 1/2] KVM: PPC: e500: fix allocation size error on g2h_tlb1_map

2012-09-27 Thread Avi Kivity
On 09/27/2012 06:03 PM, Marcelo Tosatti wrote:
 On Tue, Sep 25, 2012 at 09:46:01AM +0200, Alexander Graf wrote:
 
 On 23.08.2012, at 03:04, Scott Wood wrote:
 
  We were only allocating half the bytes we need, which was made more
  obvious by a recent fix to the memset in  clear_tlb1_bitmap().
  
  Signed-off-by: Scott Wood scottw...@freescale.com
 
 Thanks, applied to kvm-ppc-next.
 
 Avi, Marcelo, this one should get applied to anything currently -stable as 
 it essentially means we could overrun an array that has been allocated too 
 small. How do we do this?
 
 
 Alex
 
 Apparently Avi prefers that patches are sent directly to the -stable
 tree.

We were discussing letting Greg honour Cc: sta...@vger.kernel.org
(currently he ignores them), not sending patches directly.

We still haven't told him to do so, but Alex, you can go ahead and add
the Cc: tag to the patch.

Do you have the auto-autotest setup ready?  I guess we can do it
manually until it is.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 1/2] KVM: PPC: e500: fix allocation size error on g2h_tlb1_map

2012-09-27 Thread Alexander Graf

On 09/27/2012 06:35 PM, Avi Kivity wrote:

On 09/27/2012 06:03 PM, Marcelo Tosatti wrote:

On Tue, Sep 25, 2012 at 09:46:01AM +0200, Alexander Graf wrote:

On 23.08.2012, at 03:04, Scott Wood wrote:


We were only allocating half the bytes we need, which was made more
obvious by a recent fix to the memset in  clear_tlb1_bitmap().

Signed-off-by: Scott Wood scottw...@freescale.com

Thanks, applied to kvm-ppc-next.

Avi, Marcelo, this one should get applied to anything currently -stable as it 
essentially means we could overrun an array that has been allocated too small. 
How do we do this?


Alex

Apparently Avi prefers that patches are sent directly to the -stable
tree.

We were discussing letting Greg honour Cc: sta...@vger.kernel.org
(currently he ignores them), not sending patches directly.

We still haven't told him to do so, but Alex, you can go ahead and add
the Cc: tag to the patch.


Sure, I can certainly do that :).


Do you have the auto-autotest setup ready?  I guess we can do it
manually until it is.


I do have a local autotest setup. Or what exactly are you referring to?


Alex



Re: [PATCH 29/38] KVM: PPC: Book3S: PR: Rework irq disabling

2012-09-27 Thread Alexander Graf

On 17.08.2012, at 23:47, Benjamin Herrenschmidt wrote:

 
 +/* Please call after prepare_to_enter. This function puts the lazy ee state
 +   back to normal mode, without actually enabling interrupts. */
 +static inline void kvmppc_lazy_ee_enable(void)
 +{
 +#ifdef CONFIG_PPC64
 +/* Only need to enable IRQs by hard enabling them after this */
 +local_paca->irq_happened = 0;
 +local_paca->soft_enabled = 1;
 +#endif
 +}
 
 Smells like the above is the right spot for trace_hardirqs_on() an:

Hrm. Ok :).
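
Presumably something along these lines, then (untested sketch;
trace_hardirqs_on() is the generic helper from <linux/irqflags.h>):

static inline void kvmppc_lazy_ee_enable(void)
{
	/* tell lockdep/irq tracing that interrupts are logically on again */
	trace_hardirqs_on();
#ifdef CONFIG_PPC64
	/* Only need to enable IRQs by hard enabling them after this */
	local_paca->irq_happened = 0;
	local_paca->soft_enabled = 1;
#endif
}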

 
 -__hard_irq_disable();
 +local_irq_disable();
  if (kvmppc_prepare_to_enter(vcpu)) {
 -/* local_irq_enable(); */
 +local_irq_enable();
  run->exit_reason = KVM_EXIT_INTR;
  r = -EINTR;
  } else {
  /* Going back to guest */
  kvm_guest_enter();
 +kvmppc_lazy_ee_enable();
  }
  }
 
 You should probably do kvmppc_lazy_ee_enable() before guest entry
 so the CPU doesn't appear to the rest of the world to have
 interrupts disabled while it's in the guest.

I don't think I understand. The patch adds kvmppc_lazy_ee_enable() right before 
guest entry here, no?

 
 @@ -1066,8 +1062,6 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
 #endif
  ulong ext_msr;
 
 -preempt_disable();
 -
  /* Check if we can run the vcpu at all */
  if (!vcpu->arch.sane) {
  kvm_run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
 @@ -1081,9 +1075,9 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
   * really did time things so badly, then we just exit again due to
   * a host external interrupt.
   */
 -__hard_irq_disable();
 +local_irq_disable();
  if (kvmppc_prepare_to_enter(vcpu)) {
 -__hard_irq_enable();
 +local_irq_enable();
  kvm_run->exit_reason = KVM_EXIT_INTR;
  ret = -EINTR;
  goto out;
 @@ -1122,7 +1116,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
  if (vcpu->arch.shared->msr & MSR_FP)
  kvmppc_handle_ext(vcpu, BOOK3S_INTERRUPT_FP_UNAVAIL, MSR_FP);
 
 -kvm_guest_enter();
 +kvmppc_lazy_ee_enable();
 
 Same. BTW, why do you have two enter path ? Smells like a recipe for
 disaster :-)

Because this way we can keep r14-r31 in registers ;). But here too we call 
kvmppc_lazy_ee_enable() right before going into guest context, no?

I can't shake off the feeling I don't fully understand your comments :)


Alex

 
   ret = __kvmppc_vcpu_run(kvm_run, vcpu);
 
 @@ -1157,7 +1151,6 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
 
 out:
  vcpu->mode = OUTSIDE_GUEST_MODE;
 -preempt_enable();
  return ret;
 }
 
 diff --git a/arch/powerpc/kvm/book3s_rmhandlers.S 
 b/arch/powerpc/kvm/book3s_rmhandlers.S
 index 9ecf6e3..b2f8258 100644
 --- a/arch/powerpc/kvm/book3s_rmhandlers.S
 +++ b/arch/powerpc/kvm/book3s_rmhandlers.S
 @@ -170,20 +170,21 @@ kvmppc_handler_skip_ins:
  * Call kvmppc_handler_trampoline_enter in real mode
  *
  * On entry, r4 contains the guest shadow MSR
 + * MSR.EE has to be 0 when calling this function
  */
 _GLOBAL(kvmppc_entry_trampoline)
  mfmsr   r5
  LOAD_REG_ADDR(r7, kvmppc_handler_trampoline_enter)
  toreal(r7)
 
 -li  r9, MSR_RI
 -ori r9, r9, MSR_EE
 -andc    r9, r5, r9  /* Clear EE and RI in MSR value */
  li  r6, MSR_IR | MSR_DR
 -ori r6, r6, MSR_EE
 -andc    r6, r5, r6  /* Clear EE, DR and IR in MSR value */
 -MTMSR_EERI(r9)  /* Clear EE and RI in MSR */
 -mtsrr0  r7  /* before we set srr0/1 */
 +andc    r6, r5, r6  /* Clear DR and IR in MSR value */
 +/*
 + * Set EE in HOST_MSR so that it's enabled when we get into our
 + * C exit handler function
 + */
 +ori r5, r5, MSR_EE
 +mtsrr0  r7
  mtsrr1  r6
  RFI
 
 diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
 index aae535f..2bd190c 100644
 --- a/arch/powerpc/kvm/booke.c
 +++ b/arch/powerpc/kvm/booke.c
 @@ -486,6 +486,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
  ret = -EINTR;
  goto out;
  }
 +kvmppc_lazy_ee_enable();
 
  kvm_guest_enter();
 
 Same.
 
 @@ -955,6 +956,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct 
 kvm_vcpu *vcpu,
  } else {
  /* Going back to guest */
  kvm_guest_enter();
 +kvmppc_lazy_ee_enable();
  }
  }
 
 Same.
 
 
 diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
 index 053bfef..545c183 100644
 --- a/arch/powerpc/kvm/powerpc.c
 +++ b/arch/powerpc/kvm/powerpc.c
 @@ -30,6 +30,7 @@
 #include <asm/kvm_ppc.h>
 #include <asm/tlbflush.h>
 #include <asm/cputhreads.h>
 +#include 
