Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Izik Eidus

On 6/22/2011 3:21 AM, Chris Wright wrote:

* Nai Xia (nai@gmail.com) wrote:

Introduced kvm_mmu_notifier_test_and_clear_dirty(), 
kvm_mmu_notifier_dirty_update()
and their mmu_notifier interfaces to support KSM dirty bit tracking, which 
brings
significant performance gain in volatile pages scanning in KSM.
Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel EPT is
enabled to indicate that the dirty bits of underlying sptes are not updated by
hardware.
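For context, a minimal sketch of how the KSM side could consume these hooks. This is a hypothetical helper, not part of the posted patch; it assumes mm-side wrappers named mmu_notifier_dirty_update() and mmu_notifier_test_and_clear_dirty(), mirroring the __mmu_notifier_dirty_update() quoted further down:

/* Hypothetical KSM-side helper (illustration only): instead of re-hashing
 * a page to detect volatility, ask the secondary MMUs whether it has been
 * written since the last scan. */
static bool ksm_page_was_written(struct mm_struct *mm, unsigned long addr)
{
	/* No hardware dirty tracking (e.g. EPT): treat the page as volatile. */
	if (!mmu_notifier_dirty_update(mm))
		return true;

	return mmu_notifier_test_and_clear_dirty(mm, addr);
}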

Did you test with each of EPT, NPT and shadow?


Signed-off-by: Nai Xia <nai@gmail.com>
Acked-by: Izik Eidus <izik.ei...@ravellosystems.com>
---
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/mmu.c  |   36 +
  arch/x86/kvm/mmu.h  |3 +-
  arch/x86/kvm/vmx.c  |1 +
  include/linux/kvm_host.h|2 +-
  include/linux/mmu_notifier.h|   48 +++
  mm/mmu_notifier.c   |   33 ++
  virt/kvm/kvm_main.c |   27 ++
  8 files changed, 149 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d2ac8e2..f0d7aa0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -848,6 +848,7 @@ extern bool kvm_rebooting;
  int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
  int kvm_age_hva(struct kvm *kvm, unsigned long hva);
  int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
+int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva);
  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
  int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
  int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index aee3862..a5a0c51 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -979,6 +979,37 @@ out:
return young;
  }

+/*
+ * Caller is supposed to SetPageDirty(), it's not done inside this.
+ */
+static
+int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
+  unsigned long data)
+{
+   u64 *spte;
+   int dirty = 0;
+
+   if (!shadow_dirty_mask) {
+   WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
+   goto out;
+   }

This should never fire with the dirty_update() notifier test, right?
And that means that this whole optimization is for the shadow mmu case,
arguably the legacy case.



Hi Chris,
AMD npt does track the dirty bit in the nested page tables,
so the shadow_dirty_mask should not be 0 in that case...
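For reference, a sketch of where that mask comes from (assumed call sites from the kernel of that era, not part of this patch): the generic x86 setup keeps the real dirty bit for shadow paging and NPT, while vmx.c zeroes it when EPT is enabled, which is why dirty_update() would report 0 only there:

/* arch/x86/kvm/x86.c (shadow paging and NPT): dirty bit is tracked. */
kvm_mmu_set_mask_ptes(PT_USER_MASK, PT_ACCESSED_MASK,
		      PT_DIRTY_MASK, PT64_NX_MASK, 0ull);

/* arch/x86/kvm/vmx.c with EPT enabled: no accessed/dirty bits in the sptes. */
kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull, VMX_EPT_EXECUTABLE_MASK);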


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Nai Xia
On Wednesday 22 June 2011 14:15:51 Izik Eidus wrote:
 On 6/22/2011 3:21 AM, Chris Wright wrote:
  * Nai Xia (nai@gmail.com) wrote:
  Introduced kvm_mmu_notifier_test_and_clear_dirty(), 
  kvm_mmu_notifier_dirty_update()
  and their mmu_notifier interfaces to support KSM dirty bit tracking, which 
  brings
  significant performance gain in volatile pages scanning in KSM.
  Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel 
  EPT is
  enabled to indicate that the dirty bits of underlying sptes are not 
  updated by
  hardware.
  Did you test with each of EPT, NPT and shadow?
 
  Signed-off-by: Nai Xia <nai@gmail.com>
  Acked-by: Izik Eidus <izik.ei...@ravellosystems.com>
  ---
arch/x86/include/asm/kvm_host.h |1 +
arch/x86/kvm/mmu.c  |   36 +
arch/x86/kvm/mmu.h  |3 +-
arch/x86/kvm/vmx.c  |1 +
include/linux/kvm_host.h|2 +-
include/linux/mmu_notifier.h|   48 
  +++
mm/mmu_notifier.c   |   33 ++
virt/kvm/kvm_main.c |   27 ++
8 files changed, 149 insertions(+), 2 deletions(-)
 
  diff --git a/arch/x86/include/asm/kvm_host.h 
  b/arch/x86/include/asm/kvm_host.h
  index d2ac8e2..f0d7aa0 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -848,6 +848,7 @@ extern bool kvm_rebooting;
int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
int kvm_age_hva(struct kvm *kvm, unsigned long hva);
int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
  +int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva);
void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
  diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
  index aee3862..a5a0c51 100644
  --- a/arch/x86/kvm/mmu.c
  +++ b/arch/x86/kvm/mmu.c
  @@ -979,6 +979,37 @@ out:
 return young;
}
 
  +/*
  + * Caller is supposed to SetPageDirty(), it's not done inside this.
  + */
  +static
  +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
  + unsigned long data)
  +{
  +  u64 *spte;
  +  int dirty = 0;
  +
  +  if (!shadow_dirty_mask) {
  +  WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
  +  goto out;
  +  }
  This should never fire with the dirty_update() notifier test, right?
  And that means that this whole optimization is for the shadow mmu case,
  arguably the legacy case.
 
 
 Hi Chris,
 AMD npt does track the dirty bit in the nested page tables,
 so the shadow_dirty_mask should not be 0 in that case...
 
Hi Izik,
I think he meant that if the caller does the right thing (checks
!shadow_dirty_mask first), then kvm_test_and_clear_dirty_rmapp() will never
be called at all, so this test inside kvm_test_and_clear_dirty_rmapp() is
useless... As I said, I added this test in case this interface is ever abused
by other callers, just like a softer BUG_ON() --- the dirty bit is not
critical enough to bump into BUG().





Re: restricting users to only power control of VMs

2011-06-22 Thread Avi Kivity

On 06/22/2011 12:45 AM, Iordan Iordanov wrote:




It's a job for the management layer; I think it should be easy to script
libvirt to do this.



I read the documentation of libvirt, and out of the box, I don't see 
how this can be configured. So, I understand your reply as meaning 
that we need to write a program that uses the libvirt API to control this?


Yes.

--
error compiling committee.c: too many arguments to function



Re: Time sync in KVM Guest

2011-06-22 Thread Dor Laor

On 06/22/2011 06:17 AM, Chaitra Gorantla wrote:

Hi all,

We are working on a Fedora 15 host, and KVM is used to create a Fedora 14 guest.
The clock-source details are as below.

Our Host supports constant_tsc.

on HOST OS
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

on GUEST OS
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
kvm-clock

Wall-clock time is not in sync between the host and the guest.
The time on the host OS lags behind the guest by at least 2 seconds.
However, this time difference remains constant.


That's good


Does this mean that the host and guest clocks are synchronized?
By using kvm-clock as the clock source on the guest, should the time sync
happen by itself?
Please correct me if I am wrong...
How exactly can we find the time difference between the host and guest?


Usually a Linux OS reads the RTC CMOS to get the time when it boots.
Both the guest and the host do that using hwclock.

From then on, the time is tracked by the clock source as a delta.

It is recommended that both the guest and the host run ntpd to keep
their time updated. That's the only thing you need to do.


Thanks,
dor



Thanks in advance.







Re: [PATCH x86 kvm] Fix register corruption in pvclock_scale_delta

2011-06-22 Thread Ian Campbell
On Sun, 2011-06-19 at 15:44 +0300, Avi Kivity wrote:
 On 06/16/2011 06:50 AM, Zachary Amsden wrote:
  The 128-bit multiply in pvclock.h was missing an output constraint for 
  EDX which caused a register corruption to appear.

Was there any particular symptom associated with corruption at that
particular point or just general badness?

   Thanks to Ulrich 
  for diagnosing the EDX corruption and Avi for providing this fix 
  (which now I am sending back to you Avi..)
 
 Applied and queued, thanks.

Seems like a stable candidate also?

Ian.



Re: [PATCH x86 kvm] Fix register corruption in pvclock_scale_delta

2011-06-22 Thread Avi Kivity

On 06/22/2011 12:35 PM, Ian Campbell wrote:

On Sun, 2011-06-19 at 15:44 +0300, Avi Kivity wrote:
  On 06/16/2011 06:50 AM, Zachary Amsden wrote:
The 128-bit multiply in pvclock.h was missing an output constraint for
EDX which caused a register corruption to appear.

Was there any particular symptom associated with corruption at that
particular point or just general badness?

Thanks to Ulrich
for diagnosing the EDX corruption and Avi for providing this fix
(which now I am sending back to you Avi..)

  Applied and queued, thanks.

Seems like a stable candidate also?


Yes.  I see it's merged; Marcelo, can you backport it?

--
error compiling committee.c: too many arguments to function



Re: [PATCH x86 kvm] Fix register corruption in pvclock_scale_delta

2011-06-22 Thread Avi Kivity

On 06/22/2011 12:35 PM, Ian Campbell wrote:

On Sun, 2011-06-19 at 15:44 +0300, Avi Kivity wrote:
  On 06/16/2011 06:50 AM, Zachary Amsden wrote:
The 128-bit multiply in pvclock.h was missing an output constraint for
EDX which caused a register corruption to appear.

Was there any particular symptom associated with corruption at that
particular point or just general badness?



Time went backwards and forwards as vcpus were migrated across cpus.

The problem was in the host, not the guest, so Xen should not be affected.
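For reference, the kind of constraint being discussed, as a sketch from memory of the x86_64 pvclock_scale_delta(delta, mul_frac, shift) inline asm rather than the exact committed fix: the multiply clobbers RDX, so it must be declared as an output rather than left as an undeclared clobber:

	u64 product;
	unsigned long tmp;

	__asm__ (
		"mulq %[mul_frac] ; shrd $32, %[hi], %[lo]"
		: [lo] "=a" (product),
		  [hi] "=d" (tmp)	/* declare RDX as a clobbered output */
		: "0" (delta),
		  [mul_frac] "rm" ((u64)mul_frac));

Without the "=d" output, the compiler is free to keep a live value in RDX across the asm, which is how the corruption appeared.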

--
error compiling committee.c: too many arguments to function



Re: [PATCH x86 kvm] Fix register corruption in pvclock_scale_delta

2011-06-22 Thread Ian Campbell
On Wed, 2011-06-22 at 10:40 +0100, Avi Kivity wrote:
 On 06/22/2011 12:35 PM, Ian Campbell wrote:
  On Sun, 2011-06-19 at 15:44 +0300, Avi Kivity wrote:
On 06/16/2011 06:50 AM, Zachary Amsden wrote:
  The 128-bit multiply in pvclock.h was missing an output constraint for
  EDX which caused a register corruption to appear.
 
  Was there any particular symptom associated with corruption at that
  particular point or just general badness?
 
 
 Time went backwards and forwards as vcpus were migrated across cpus.

Oops!

 The problem was in the host, not the guest, so Xen should not be affected.

Good to know, thanks!

Ian.



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Avi Kivity

On 06/21/2011 04:32 PM, Nai Xia wrote:

Introduced kvm_mmu_notifier_test_and_clear_dirty(), 
kvm_mmu_notifier_dirty_update()
and their mmu_notifier interfaces to support KSM dirty bit tracking, which 
brings
significant performance gain in volatile pages scanning in KSM.
Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel EPT is
enabled to indicate that the dirty bits of underlying sptes are not updated by
hardware.




Can you quantify the performance gains?


+int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
+  unsigned long data)
+{
+   u64 *spte;
+   int dirty = 0;
+
+   if (!shadow_dirty_mask) {
+   WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
+   goto out;
+   }
+
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+   int _dirty;
+   u64 _spte = *spte;
+   BUG_ON(!(_spte & PT_PRESENT_MASK));
+   _dirty = _spte & PT_DIRTY_MASK;
+   if (_dirty) {
+   dirty = 1;
+   clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
+   }


Racy.  Also, needs a tlb flush eventually.


+   spte = rmap_next(kvm, rmapp, spte);
+   }
+out:
+   return dirty;
+}
+
  #define RMAP_RECYCLE_THRESHOLD 1000


  struct mmu_notifier_ops {
+   int (*dirty_update)(struct mmu_notifier *mn,
+struct mm_struct *mm);
+


I prefer to have test_and_clear_dirty() always return 1 in this case (if 
the spte is writeable), and drop this callback.
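Something along these lines, perhaps; a sketch of that suggestion with the structure of the posted function, not the posted code itself: when there is no hardware dirty tracking, report any writable spte as dirty instead of exposing a separate capability callback.

static int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
					  unsigned long data)
{
	u64 *spte;
	int dirty = 0;

	for (spte = rmap_next(kvm, rmapp, NULL); spte;
	     spte = rmap_next(kvm, rmapp, spte)) {
		u64 _spte = *spte;

		if (!shadow_dirty_mask) {
			/* No hardware dirty bit (EPT): be conservative and
			 * treat every writable spte as dirty. */
			if (is_writable_pte(_spte))
				dirty = 1;
			continue;
		}

		if (_spte & shadow_dirty_mask) {
			dirty = 1;
			clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
		}
	}

	return dirty;
}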

+int __mmu_notifier_dirty_update(struct mm_struct *mm)
+{
+   struct mmu_notifier *mn;
+   struct hlist_node *n;
+   int dirty_update = 0;
+
+   rcu_read_lock();
+	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn->ops->dirty_update)
+			dirty_update |= mn->ops->dirty_update(mn, mm);
+   }
+   rcu_read_unlock();
+


Should it not be = instead?


+   return dirty_update;
+}
+
  /*
   * This function can't run concurrently against mmu_notifier_register
 * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap


--
error compiling committee.c: too many arguments to function



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Izik Eidus

On 6/22/2011 1:43 PM, Avi Kivity wrote:

On 06/21/2011 04:32 PM, Nai Xia wrote:
Introduced kvm_mmu_notifier_test_and_clear_dirty(), 
kvm_mmu_notifier_dirty_update()
and their mmu_notifier interfaces to support KSM dirty bit tracking, 
which brings

significant performance gain in volatile pages scanning in KSM.
Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if 
intel EPT is
enabled to indicate that the dirty bits of underlying sptes are not 
updated by

hardware.




Can you quantify the performance gains?

+int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long 
*rmapp,

+   unsigned long data)
+{
+u64 *spte;
+int dirty = 0;
+
+if (!shadow_dirty_mask) {
+WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
+goto out;
+}
+
+spte = rmap_next(kvm, rmapp, NULL);
+while (spte) {
+int _dirty;
+u64 _spte = *spte;
+BUG_ON(!(_spte & PT_PRESENT_MASK));
+_dirty = _spte & PT_DIRTY_MASK;
+if (_dirty) {
+dirty = 1;
+clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
+}


Racy.  Also, needs a tlb flush eventually.


Hi, one of the issues is that the whole point of this patch is to avoid doing
a tlb flush at all.
But I see your point: other users will not expect such behavior, so maybe
there is a need for a flush_tlb=? parameter, or for adding another mmu
notifier call?



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Avi Kivity

On 06/22/2011 02:05 PM, Izik Eidus wrote:

+spte = rmap_next(kvm, rmapp, NULL);
+while (spte) {
+int _dirty;
+u64 _spte = *spte;
+BUG_ON(!(_spte & PT_PRESENT_MASK));
+_dirty = _spte & PT_DIRTY_MASK;
+if (_dirty) {
+dirty = 1;
+clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
+}


Racy.  Also, needs a tlb flush eventually.

+

Hi, one of the issues is that the whole point of this patch is not do 
tlb flush eventually,
But I see your point, because other users will not expect such 
behavior, so maybe there is need into a parameter

flush_tlb=?, or add another mmu notifier call?



If you don't flush the tlb, a subsequent write will not see that spte.d 
is clear and the write will happen.  So you'll see the page as clean 
even though it's dirty.  That's not acceptable.


--
error compiling committee.c: too many arguments to function



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Izik Eidus

On 6/22/2011 2:10 PM, Avi Kivity wrote:

On 06/22/2011 02:05 PM, Izik Eidus wrote:

+spte = rmap_next(kvm, rmapp, NULL);
+while (spte) {
+int _dirty;
+u64 _spte = *spte;
+BUG_ON(!(_spte & PT_PRESENT_MASK));
+_dirty = _spte & PT_DIRTY_MASK;
+if (_dirty) {
+dirty = 1;
+clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
+}


Racy.  Also, needs a tlb flush eventually.

+

Hi, one of the issues is that the whole point of this patch is not do 
tlb flush eventually,
But I see your point, because other users will not expect such 
behavior, so maybe there is need into a parameter

flush_tlb=?, or add another mmu notifier call?



If you don't flush the tlb, a subsequent write will not see that 
spte.d is clear and the write will happen.  So you'll see the page as 
clean even though it's dirty.  That's not acceptable.




Yes, but this is exactly what we want for this use case:
Right now ksm calculates the page hash to see if the page was changed; the
idea behind this patch is to use the dirty bit instead.
However, the guest might not really like the fact that we will flush its
tlb over and over again, especially in a periodic scan like ksm does.


So what we say here is: it is better to have a little junk in the unstable
tree that gets flushed eventually anyway, than to make the guest slower.
This race is something that does not affect the accuracy of ksm anyway, due
to the full memcmp that we will eventually perform...


Of course we trust that in most cases, because it takes ksm a few minutes to
come back to a random virtual address on real systems, a tlb flush will
already have been performed.


What do you think about having 2 calls: one that does the expected behavior
and does flush the tlb, and one that clearly says it doesn't flush the tlb

and explains its use case for ksm?


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Avi Kivity

On 06/22/2011 02:19 PM, Izik Eidus wrote:

On 6/22/2011 2:10 PM, Avi Kivity wrote:

On 06/22/2011 02:05 PM, Izik Eidus wrote:

+spte = rmap_next(kvm, rmapp, NULL);
+while (spte) {
+int _dirty;
+u64 _spte = *spte;
+BUG_ON(!(_spte & PT_PRESENT_MASK));
+_dirty = _spte & PT_DIRTY_MASK;
+if (_dirty) {
+dirty = 1;
+clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
+}


Racy.  Also, needs a tlb flush eventually.

+

Hi, one of the issues is that the whole point of this patch is not 
do tlb flush eventually,
But I see your point, because other users will not expect such 
behavior, so maybe there is need into a parameter

flush_tlb=?, or add another mmu notifier call?



If you don't flush the tlb, a subsequent write will not see that 
spte.d is clear and the write will happen.  So you'll see the page as 
clean even though it's dirty.  That's not acceptable.




Yes, but this is exactly what we want from this use case:
Right now ksm calculate the page hash to see if it was changed, the 
idea behind this patch is to use the dirty bit instead,
however the guest might not really like the fact that we will flush 
its tlb over and over again, specially in periodically scan like ksm 
does.


I see.



So what we say here is: it is better to have little junk in the 
unstable tree that get flushed eventualy anyway, instead of make the 
guest slower
this race is something that does not reflect accurate of ksm anyway 
due to the full memcmp that we will eventualy perform...


Ofcurse we trust that in most cases, beacuse it take ksm to get into a 
random virtual address in real systems few minutes, there will be 
already tlb flush performed.


What you think about having 2 calls: one that does the expected 
behivor and does flush the tlb, and one that clearly say it doesnt 
flush the tlb

and expline its use case for ksm?


Yes.  And if the unstable/fast callback is not provided, have the common 
code fall back to the stable/slow callback instead.


Or have a parameter that allows inaccurate results to be returned more 
quickly.


--
error compiling committee.c: too many arguments to function



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Avi Kivity

On 06/22/2011 02:24 PM, Avi Kivity wrote:

On 06/22/2011 02:19 PM, Izik Eidus wrote:

On 6/22/2011 2:10 PM, Avi Kivity wrote:

On 06/22/2011 02:05 PM, Izik Eidus wrote:

+spte = rmap_next(kvm, rmapp, NULL);
+while (spte) {
+int _dirty;
+u64 _spte = *spte;
+BUG_ON(!(_spte & PT_PRESENT_MASK));
+_dirty = _spte & PT_DIRTY_MASK;
+if (_dirty) {
+dirty = 1;
+clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
+}


Racy.  Also, needs a tlb flush eventually.

+

Hi, one of the issues is that the whole point of this patch is not 
do tlb flush eventually,
But I see your point, because other users will not expect such 
behavior, so maybe there is need into a parameter

flush_tlb=?, or add another mmu notifier call?



If you don't flush the tlb, a subsequent write will not see that 
spte.d is clear and the write will happen.  So you'll see the page 
as clean even though it's dirty.  That's not acceptable.




Yes, but this is exactly what we want from this use case:
Right now ksm calculate the page hash to see if it was changed, the 
idea behind this patch is to use the dirty bit instead,
however the guest might not really like the fact that we will flush 
its tlb over and over again, specially in periodically scan like ksm 
does.


I see.


Actually, this is dangerous.  If we use the dirty bit for other things, 
we will get data corruption.


For example we might want to map clean host pages as writeable-clean in 
the spte on a read fault so that we don't get a page fault when they get 
eventually written.


--
error compiling committee.c: too many arguments to function



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Avi Kivity

On 06/22/2011 02:28 PM, Avi Kivity wrote:


Actually, this is dangerous.  If we use the dirty bit for other 
things, we will get data corruption.


For example we might want to map clean host pages as writeable-clean 
in the spte on a read fault so that we don't get a page fault when 
they get eventually written.




Another example - we can use the dirty bit for dirty page logging.

So I think we can get away with a conditional tlb flush - only flush if 
the page was dirty.  That should be rare after the first pass, at least 
with small pages.


--
error compiling committee.c: too many arguments to function



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Nai Xia
On Wednesday 22 June 2011 19:28:08 Avi Kivity wrote:
 On 06/22/2011 02:24 PM, Avi Kivity wrote:
  On 06/22/2011 02:19 PM, Izik Eidus wrote:
  On 6/22/2011 2:10 PM, Avi Kivity wrote:
  On 06/22/2011 02:05 PM, Izik Eidus wrote:
  +spte = rmap_next(kvm, rmapp, NULL);
  +while (spte) {
  +int _dirty;
  +u64 _spte = *spte;
  +BUG_ON(!(_spte & PT_PRESENT_MASK));
  +_dirty = _spte & PT_DIRTY_MASK;
  +if (_dirty) {
  +dirty = 1;
  +clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
  +}
 
  Racy.  Also, needs a tlb flush eventually.
  +
 
  Hi, one of the issues is that the whole point of this patch is not 
  do tlb flush eventually,
  But I see your point, because other users will not expect such 
  behavior, so maybe there is need into a parameter
  flush_tlb=?, or add another mmu notifier call?
 
 
  If you don't flush the tlb, a subsequent write will not see that 
  spte.d is clear and the write will happen.  So you'll see the page 
  as clean even though it's dirty.  That's not acceptable.
 
 
  Yes, but this is exactly what we want from this use case:
  Right now ksm calculate the page hash to see if it was changed, the 
  idea behind this patch is to use the dirty bit instead,
  however the guest might not really like the fact that we will flush 
  its tlb over and over again, specially in periodically scan like ksm 
  does.
 
  I see.
 
 Actually, this is dangerous.  If we use the dirty bit for other things, 
 we will get data corruption.

Yeah, yeah, I actually clarified in a reply to Chris, about his similar
concern, that we are currently the _only_ user. :)
We can add the flushing when someone else needs to rely on this bit.

 
 For example we might want to map clean host pages as writeable-clean in 
 the spte on a read fault so that we don't get a page fault when they get 
 eventually written.
 
 


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Izik Eidus

On 6/22/2011 2:33 PM, Nai Xia wrote:

On Wednesday 22 June 2011 19:28:08 Avi Kivity wrote:

On 06/22/2011 02:24 PM, Avi Kivity wrote:

On 06/22/2011 02:19 PM, Izik Eidus wrote:

On 6/22/2011 2:10 PM, Avi Kivity wrote:

On 06/22/2011 02:05 PM, Izik Eidus wrote:

+spte = rmap_next(kvm, rmapp, NULL);
+while (spte) {
+int _dirty;
+u64 _spte = *spte;
+BUG_ON(!(_spte & PT_PRESENT_MASK));
+_dirty = _spte & PT_DIRTY_MASK;
+if (_dirty) {
+dirty = 1;
+clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
+}

Racy.  Also, needs a tlb flush eventually.

+

Hi, one of the issues is that the whole point of this patch is not
do tlb flush eventually,
But I see your point, because other users will not expect such
behavior, so maybe there is need into a parameter
flush_tlb=?, or add another mmu notifier call?


If you don't flush the tlb, a subsequent write will not see that
spte.d is clear and the write will happen.  So you'll see the page
as clean even though it's dirty.  That's not acceptable.


Yes, but this is exactly what we want from this use case:
Right now ksm calculate the page hash to see if it was changed, the
idea behind this patch is to use the dirty bit instead,
however the guest might not really like the fact that we will flush
its tlb over and over again, specially in periodically scan like ksm
does.

I see.

Actually, this is dangerous.  If we use the dirty bit for other things,
we will get data corruption.

Yeah,yeah, I actually clarified in a reply letter to Chris about his similar
concern that we are currently the _only_ user. :)
We can add the flushing when someone else should rely on this bit.



I suggest adding the flushing when someone else will use it as well.

Btw, I don't think this whole optimization is worthwhile for kvm guests in
the case where a tlb flush must be performed:
on a machine with a lot of cpus, it is much better for ksm to burn one cpu,
instead of slowing down all the others...
So while this patch will really make ksm look faster, the whole system
will be slower...


So in case you don't want to add the flushing when someone else will
rely on it, it would be better to use the dirty bit trick just for userspace
applications and not for kvm guests..



collect some information when qemu-kvm exit

2011-06-22 Thread lidong chen
I find that qemu-kvm outputs only a little information when it exits
abnormally. For example, if qemu-kvm exits with a segmentation fault, there
is no information in /var/log/libvirt/qemu/xx.log.

So I want to solve this by collecting some information when qemu-kvm exits.

My idea is to register some signal handlers and print some debug
information in the signal handler function, and to use atexit() to register
a function that runs on process termination.

The main thread already uses SIGBUS, SIGUSR2, SIGALRM and SIGIO, and the vcpu
threads use SIGBUS and SIGIPI.

The signals that should be registered for the vcpu threads:
   SIGHUP,
   SIGINT,
   SIGQUIT,
   SIGILL,
   SIGTRAP,
   SIGABRT,
   SIGFPE,
   SIGUSR1,
   SIGPIPE,
   SIGSEGV,
   SIGUSR2,
   SIGTERM,
   SIGSTKFLT,
   SIGTSTP,
   SIGXCPU,
   SIGXFSZ,
   SIGVTALRM,
   SIGIO,
   SIGSYS,

The signals that should be registered for the main thread:
   SIGHUP,
   SIGINT,
   SIGQUIT,
   SIGILL,
   SIGTRAP,
   SIGABRT,
   SIGFPE,
   SIGUSR1,
   SIGPIPE,
   SIGSEGV,
   SIGTERM,
   SIGSTKFLT,
   SIGTSTP,
   SIGXCPU,
   SIGXFSZ,
   SIGVTALRM,
   SIGSYS,

A simple example of this approach:
--- ../../BUILD/qemu-kvm-0.12.5/vl.c 2011-05-25 04:08:00.0 -0400
+++ vl.c 2011-06-22 06:57:51.0 -0400
+/* needs #include <execinfo.h> for backtrace()/backtrace_symbols() */
+static void sig_handler(int n)
+{
+    int j, nptrs;
+#define SIZE 100
+    void *buffer[SIZE];
+    char **strings;
+
+    nptrs = backtrace(buffer, SIZE);
+    printf("backtrace() returned %d addresses\n", nptrs);
+
+    /* The call backtrace_symbols_fd(buffer, nptrs, STDOUT_FILENO)
+       would produce similar output to the following: */
+
+    strings = backtrace_symbols(buffer, nptrs);
+    if (strings == NULL) {
+        perror("backtrace_symbols");
+        exit(EXIT_FAILURE);
+    }
+
+    for (j = 0; j < nptrs; j++)
+        printf("%s\n", strings[j]);
+
+    free(strings);
+
+    abort();
+}
+
+void bye(void) {
+    printf("That was all, folks\n");
+    int j, nptrs;
+#define SIZE 100
+    void *buffer[SIZE];
+    char **strings;
+
+    nptrs = backtrace(buffer, SIZE);
+    printf("backtrace() returned %d addresses\n", nptrs);
+
+    /* The call backtrace_symbols_fd(buffer, nptrs, STDOUT_FILENO)
+       would produce similar output to the following: */
+
+    strings = backtrace_symbols(buffer, nptrs);
+    if (strings == NULL) {
+        perror("backtrace_symbols");
+        exit(EXIT_FAILURE);
+    }
+
+    for (j = 0; j < nptrs; j++)
+        printf("%s\n", strings[j]);
+
+    free(strings);
+
+}
+
+
int main(int argc, char **argv, char **envp)
{
    const char *gdbstub_dev = NULL;
@@ -4954,6 +5010,17 @@

    init_clocks();

+
+    signal(SIGSEGV, sig_handler);
+
+    i = atexit(bye);
+    if (i != 0) {
+           fprintf(stderr, "cannot set exit function\n");
+           return EXIT_FAILURE;
+    }
+
+
+

--- ../../BUILD/qemu-kvm-0.12.5/qemu-kvm.c    2011-05-25
03:31:25.0 -0400
+++ qemu-kvm.c    2011-06-22 07:04:05.0 -0400
@@ -1883,6 +1883,10 @@

+
static void *ap_main_loop(void *_env)
{
    CPUState *env = _env;
@@ -1908,10 +1927,10 @@
    struct ioperm_data *data = NULL;
#endif

    current_env = env;
    env->thread_id = kvm_get_thread_id();
    sigfillset(&signals);
+    sigdelset(&signals, SIGSEGV);
    sigprocmask(SIG_BLOCK, &signals, NULL);
    kvm_create_vcpu(env, env-cpu_index);

The resulting log in /var/log/libvirt/qemu/xx.log:

LC_ALL=C PATH=/bin:/sbin:/usr/bin:/usr/sbin HOME=/ QEMU_AUDIO_DRV=none
/usr/bin/qemu-kvm -S -M pc-0.12 -enable-kvm -m 9767 -smp
16,sockets=16,cores=1,threads=1 -name sles11-4 -uuid
52841129-6b46-fef5-6a62-11c422597206 -nodefaults -chardev
socket,id=monitor,path=/var/lib/libvirt/qemu/sles11-4.monitor,server,nowait
-mon chardev=monitor,mode=readline -no-reboot -boot dc -drive
file=/mnt/disk2/c00104598/sles11sp1.raw,if=none,id=drive-virtio-disk0,boot=on
-device 
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-drive 
file=/opt/c00104598/iso/SLES-11-SP1.iso,if=none,media=cdrom,id=drive-ide0-1-0
-device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
-device virtio-net-pci,vlan=0,id=net0,mac=52:54:00:24:a5:1d,bus=pci.0,addr=0x5
-net tap,fd=15,vlan=0,name=hostnet0 -usb -vnc *:0 -k en-us -vga cirrus
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3
19:22:40.436: debug : qemuSecurityDACSetProcessLabel:411 : Dropping
privileges of VM to 0:0
backtrace() returned 16 addresses
/usr/bin/qemu-kvm(bye+0x2b) [0x43346e]
/lib64/libc.so.6(+0x355e5) [0x7f45c50605e5]
/lib64/libc.so.6(+0x35635) [0x7f45c5060635]
/usr/bin/qemu-kvm() [0x4489e2]
/usr/bin/qemu-kvm(virtio_queue_notify+0x89) [0x5bb841]
/usr/bin/qemu-kvm() [0x449bb3]
/usr/bin/qemu-kvm() [0x44a077]
/usr/bin/qemu-kvm() [0x4f2cb4]
/usr/bin/qemu-kvm(cpu_outw+0x20) [0x4f305d]
/usr/bin/qemu-kvm() [0x4513f8]
/usr/bin/qemu-kvm(kvm_run+0x2e0) [0x4539bb]
/usr/bin/qemu-kvm(kvm_cpu_exec+0x15) 

Re: [Qemu-devel] [PATCH 00/15] [PULL] qemu-kvm.git uq/master queue

2011-06-22 Thread Anthony Liguori

On 06/21/2011 12:07 PM, Marcelo Tosatti wrote:

The following changes since commit eb47d7c5d96060040931c42773ee07e61e547af9:

   hw/9118.c: Implement active-low interrupt support (2011-06-15 13:23:37 +0200)

are available in the git repository at:
   git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master


Pulled.  Thanks.

Regards,

Anthony Liguori



Andre Przywara (1):
   KVM: Fix XSAVE feature bit enumeration

Jan Kiszka (13):
   kvm: x86: Save/restore FPU OP, IP and DP
   Add kernel header update script
   Import kernel headers
   Switch build system to accompanied kernel headers
   kvm: Drop CONFIG_KVM_PARA
   kvm: ppc: Drop CONFIG_KVM_PPC_PVR
   kvm: Drop useless zero-initializations
   kvm: Drop KVM_CAP build dependencies
   kvm: x86: Drop KVM_CAP build dependencies
   kvm: ppc: Drop KVM_CAP build dependencies
   kvm: Clean up stubs
   kvm: x86: Pass KVMState to kvm_arch_get_supported_cpuid
   Remove unneeded kvm.h from cpu-exec.c

Marcelo Tosatti (1):
   kvm: fix FPU state subsection

  Makefile.target  |4 +-
  configure|  149 +--
  cpu-exec.c   |1 -
  hw/kvmclock.c|9 -
  kvm-all.c|   13 -
  kvm-stub.c   |   18 +-
  kvm.h|2 +-
  linux-headers/COPYING|  356 +++
  linux-headers/README |2 +
  linux-headers/asm-powerpc/kvm.h  |  275 
  linux-headers/asm-powerpc/kvm_para.h |   53 +++
  linux-headers/asm-s390/kvm.h |   44 ++
  linux-headers/asm-s390/kvm_para.h|   17 +
  linux-headers/asm-x86/hyperv.h   |  193 
  linux-headers/asm-x86/kvm.h  |  324 ++
  linux-headers/asm-x86/kvm_para.h |   79 
  linux-headers/linux/kvm.h|  804 ++
  linux-headers/linux/kvm_para.h   |   28 ++
  linux-headers/linux/vhost.h  |  130 ++
  linux-headers/linux/virtio_config.h  |   54 +++
  linux-headers/linux/virtio_ring.h|  163 +++
  scripts/update-linux-headers.sh  |   55 +++
  target-i386/cpu.h|4 +
  target-i386/cpuid.c  |   20 +-
  target-i386/kvm.c|  148 ++-
  target-i386/machine.c|   23 +
  target-ppc/kvm.c |   23 -
  target-s390x/cpu.h   |   10 -
  target-s390x/op_helper.c |1 +
  29 files changed, 2675 insertions(+), 327 deletions(-)
  create mode 100644 linux-headers/COPYING
  create mode 100644 linux-headers/README
  create mode 100644 linux-headers/asm-powerpc/kvm.h
  create mode 100644 linux-headers/asm-powerpc/kvm_para.h
  create mode 100644 linux-headers/asm-s390/kvm.h
  create mode 100644 linux-headers/asm-s390/kvm_para.h
  create mode 100644 linux-headers/asm-x86/hyperv.h
  create mode 100644 linux-headers/asm-x86/kvm.h
  create mode 100644 linux-headers/asm-x86/kvm_para.h
  create mode 100644 linux-headers/linux/kvm.h
  create mode 100644 linux-headers/linux/kvm_para.h
  create mode 100644 linux-headers/linux/vhost.h
  create mode 100644 linux-headers/linux/virtio_config.h
  create mode 100644 linux-headers/linux/virtio_ring.h
  create mode 100755 scripts/update-linux-headers.sh






[PATCH v2 01/22] KVM: MMU: fix walking shadow page table

2011-06-22 Thread Xiao Guangrong
Properly check for the last mapping, and do not walk to the next level if the
last spte has been reached
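In other words, with this change a caller's walker loop can safely treat the entry it stopped on as the final mapping (illustration only, with assumed local variables, not code from the patch):

	struct kvm_shadow_walk_iterator iterator;

	for_each_shadow_entry(vcpu, addr, iterator) {
		/* With the fix, a large or last-level spte is still yielded
		 * once, and the walk then stops instead of dereferencing it
		 * as a page-table address on the next step. */
		if (is_last_spte(*iterator.sptep, iterator.level))
			break;
	}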

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |9 +
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9c629b5..f474e93 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1517,10 +1517,6 @@ static bool shadow_walk_okay(struct 
kvm_shadow_walk_iterator *iterator)
	if (iterator->level < PT_PAGE_TABLE_LEVEL)
		return false;
 
-	if (iterator->level == PT_PAGE_TABLE_LEVEL)
-		if (is_large_pte(*iterator->sptep))
-			return false;
-
	iterator->index = SHADOW_PT_INDEX(iterator->addr, iterator->level);
	iterator->sptep	= ((u64 *)__va(iterator->shadow_addr)) + iterator->index;
	return true;
@@ -1528,6 +1524,11 @@ static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
 
 static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 {
+	if (is_last_spte(*iterator->sptep, iterator->level)) {
+		iterator->level = 0;
+		return;
+	}
+
	iterator->shadow_addr = *iterator->sptep & PT64_BASE_ADDR_MASK;
	--iterator->level;
 }
-- 
1.7.5.4



[PATCH v2 02/22] KVM: MMU: do not update slot bitmap if spte is nonpresent

2011-06-22 Thread Xiao Guangrong
Set slot bitmap only if the spte is present

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   15 +++
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f474e93..8316c2d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -743,9 +743,6 @@ static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t 
gfn)
struct kvm_mmu_page *sp;
unsigned long *rmapp;
 
-   if (!is_rmap_spte(*spte))
-   return 0;
-
	sp = page_header(__pa(spte));
	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
	rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp->role.level);
@@ -2087,11 +2084,13 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
	if (!was_rmapped && is_large_pte(*sptep))
		++vcpu->kvm->stat.lpages;
 
-	page_header_update_slot(vcpu->kvm, sptep, gfn);
-	if (!was_rmapped) {
-		rmap_count = rmap_add(vcpu, sptep, gfn);
-		if (rmap_count > RMAP_RECYCLE_THRESHOLD)
-			rmap_recycle(vcpu, sptep, gfn);
+	if (is_shadow_present_pte(*sptep)) {
+		page_header_update_slot(vcpu->kvm, sptep, gfn);
+		if (!was_rmapped) {
+			rmap_count = rmap_add(vcpu, sptep, gfn);
+			if (rmap_count > RMAP_RECYCLE_THRESHOLD)
+				rmap_recycle(vcpu, sptep, gfn);
+		}
	}
kvm_release_pfn_clean(pfn);
if (speculative) {
-- 
1.7.5.4



[PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary

2011-06-22 Thread Xiao Guangrong
If the range spans a page boundary, the mmio access can be broken; fix it
the same way as write emulation does.

And we already have the guest physical address, so use it to read the guest
data directly and avoid walking the guest page table again
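To illustrate the split performed below with made-up numbers: for an 8-byte read at gva 0x1ffa, "now = -addr & ~PAGE_MASK" yields the bytes left before the 4K page boundary.

	unsigned long addr = 0x1ffa;		/* 6 bytes before 0x2000 */
	unsigned int bytes = 8;
	unsigned int now = -addr & ~PAGE_MASK;	/* == 6 */

	/* The first call handles [0x1ffa, 0x2000); the second call then
	 * handles the remaining 2 bytes starting at 0x2000. */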

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/x86.c |   41 -
 1 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0b803f0..eb27be4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3944,14 +3944,13 @@ out:
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
 
-static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
- unsigned long addr,
- void *val,
- unsigned int bytes,
- struct x86_exception *exception)
+static int emulator_read_emulated_onepage(unsigned long addr,
+ void *val,
+ unsigned int bytes,
+ struct x86_exception *exception,
+ struct kvm_vcpu *vcpu)
 {
-   struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-   gpa_t gpa;
+   gpa_t gpa;
int handled;
 
	if (vcpu->mmio_read_completed) {
@@ -3971,8 +3970,7 @@ static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
		goto mmio;
 
-	if (kvm_read_guest_virt(ctxt, addr, val, bytes, exception)
-	    == X86EMUL_CONTINUE)
+	if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
return X86EMUL_CONTINUE;
 
 mmio:
@@ -4001,6 +3999,31 @@ mmio:
return X86EMUL_IO_NEEDED;
 }
 
+static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
+ unsigned long addr,
+ void *val,
+ unsigned int bytes,
+ struct x86_exception *exception)
+{
+   struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+
+   /* Crossing a page boundary? */
+	if (((addr + bytes - 1) ^ addr) & PAGE_MASK) {
+		int rc, now;
+
+		now = -addr & ~PAGE_MASK;
+   rc = emulator_read_emulated_onepage(addr, val, now, exception,
+   vcpu);
+   if (rc != X86EMUL_CONTINUE)
+   return rc;
+   addr += now;
+   val += now;
+   bytes -= now;
+   }
+   return emulator_read_emulated_onepage(addr, val, bytes, exception,
+   vcpu);
+}
+
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
const void *val, int bytes)
 {
-- 
1.7.5.4



[PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code

2011-06-22 Thread Xiao Guangrong
Introduce vcpu_gva_to_gpa to translate a gva to a gpa; we can use it
to clean up the common code between read emulation and write emulation

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/x86.c |   38 +-
 1 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eb27be4..c29ef96 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3944,6 +3944,27 @@ out:
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
 
+static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
+  gpa_t *gpa, struct x86_exception *exception,
+  bool write)
+{
+	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
+
+	if (write)
+		access |= PFERR_WRITE_MASK;
+
+	*gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
+
+	if (*gpa == UNMAPPED_GVA)
+		return -1;
+
+	/* For APIC access vmexit */
+	if ((*gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
+   return 1;
+
+   return 0;
+}
+
 static int emulator_read_emulated_onepage(unsigned long addr,
  void *val,
  unsigned int bytes,
@@ -3951,7 +3972,7 @@ static int emulator_read_emulated_onepage(unsigned long 
addr,
  struct kvm_vcpu *vcpu)
 {
gpa_t gpa;
-   int handled;
+   int handled, ret;
 
	if (vcpu->mmio_read_completed) {
		memcpy(val, vcpu->mmio_data, bytes);
@@ -3961,13 +3982,12 @@ static int emulator_read_emulated_onepage(unsigned long 
addr,
return X86EMUL_CONTINUE;
}
 
-	gpa = kvm_mmu_gva_to_gpa_read(vcpu, addr, exception);
+	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, false);
 
-	if (gpa == UNMAPPED_GVA)
+	if (ret < 0)
		return X86EMUL_PROPAGATE_FAULT;
 
-	/* For APIC access vmexit */
-	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
+	if (ret)
goto mmio;
 
	if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
@@ -4043,15 +4063,15 @@ static int emulator_write_emulated_onepage(unsigned 
long addr,
   struct kvm_vcpu *vcpu)
 {
gpa_t gpa;
-   int handled;
+   int handled, ret;
 
-	gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception);
+	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, true);
 
-	if (gpa == UNMAPPED_GVA)
+	if (ret < 0)
		return X86EMUL_PROPAGATE_FAULT;
 
	/* For APIC access vmexit */
-	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
+	if (ret)
		goto mmio;
 
if (emulator_write_phys(vcpu, gpa, val, bytes))
-- 
1.7.5.4



[PATCH v2 05/22] KVM: x86: abstract the operation for read/write emulation

2011-06-22 Thread Xiao Guangrong
The operations of read emulation and write emulation are very similar, so we
can abstract them; in a later patch, this abstraction is used to clean up the
duplicated code

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/x86.c |   72 
 1 files changed, 72 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c29ef96..887714f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4056,6 +4056,78 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
return 1;
 }
 
+struct read_write_emulator_ops {
+   int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val,
+ int bytes);
+   int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa,
+ void *val, int bytes);
+   int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
+  int bytes, void *val);
+   int (*read_write_exit_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
+   void *val, int bytes);
+   bool write;
+};
+
+static int read_prepare(struct kvm_vcpu *vcpu, void *val, int bytes)
+{
+	if (vcpu->mmio_read_completed) {
+		memcpy(val, vcpu->mmio_data, bytes);
+		trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes,
+			       vcpu->mmio_phys_addr, *(u64 *)val);
+		vcpu->mmio_read_completed = 0;
+   return 1;
+   }
+
+   return 0;
+}
+
+static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
+   void *val, int bytes)
+{
+	return !kvm_read_guest(vcpu->kvm, gpa, val, bytes);
+}
+
+static int write_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
+void *val, int bytes)
+{
+   return emulator_write_phys(vcpu, gpa, val, bytes);
+}
+
+static int write_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val)
+{
+   trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val);
+   return vcpu_mmio_write(vcpu, gpa, bytes, val);
+}
+
+static int read_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
+ void *val, int bytes)
+{
+   trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, bytes, gpa, 0);
+   return X86EMUL_IO_NEEDED;
+}
+
+static int write_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
+  void *val, int bytes)
+{
+	memcpy(vcpu->mmio_data, val, bytes);
+	memcpy(vcpu->run->mmio.data, vcpu->mmio_data, 8);
+   return X86EMUL_CONTINUE;
+}
+
+static struct read_write_emulator_ops read_emultor = {
+   .read_write_prepare = read_prepare,
+   .read_write_emulate = read_emulate,
+   .read_write_mmio = vcpu_mmio_read,
+   .read_write_exit_mmio = read_exit_mmio,
+};
+
+static struct read_write_emulator_ops write_emultor = {
+   .read_write_emulate = write_emulate,
+   .read_write_mmio = write_mmio,
+   .read_write_exit_mmio = write_exit_mmio,
+   .write = true,
+};
+
 static int emulator_write_emulated_onepage(unsigned long addr,
   const void *val,
   unsigned int bytes,
-- 
1.7.5.4



[PATCH v2 06/22] KVM: x86: cleanup the code of read/write emulation

2011-06-22 Thread Xiao Guangrong
Use the read/write operations to remove the duplicated code

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/x86.c |  149 ---
 1 files changed, 47 insertions(+), 102 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 887714f..baa5a11 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3965,85 +3965,6 @@ static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, 
unsigned long gva,
return 0;
 }
 
-static int emulator_read_emulated_onepage(unsigned long addr,
- void *val,
- unsigned int bytes,
- struct x86_exception *exception,
- struct kvm_vcpu *vcpu)
-{
-   gpa_t gpa;
-   int handled, ret;
-
-	if (vcpu->mmio_read_completed) {
-		memcpy(val, vcpu->mmio_data, bytes);
-		trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes,
-			       vcpu->mmio_phys_addr, *(u64 *)val);
-		vcpu->mmio_read_completed = 0;
-		return X86EMUL_CONTINUE;
-	}
-
-	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, false);
-
-	if (ret < 0)
-		return X86EMUL_PROPAGATE_FAULT;
-
-	if (ret)
-		goto mmio;
-
-	if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
-   return X86EMUL_CONTINUE;
-
-mmio:
-   /*
-* Is this MMIO handled locally?
-*/
-   handled = vcpu_mmio_read(vcpu, gpa, bytes, val);
-
-   if (handled == bytes)
-   return X86EMUL_CONTINUE;
-
-   gpa += handled;
-   bytes -= handled;
-   val += handled;
-
-   trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, bytes, gpa, 0);
-
-	vcpu->mmio_needed = 1;
-	vcpu->run->exit_reason = KVM_EXIT_MMIO;
-	vcpu->run->mmio.phys_addr = vcpu->mmio_phys_addr = gpa;
-	vcpu->mmio_size = bytes;
-	vcpu->run->mmio.len = min(vcpu->mmio_size, 8);
-	vcpu->run->mmio.is_write = vcpu->mmio_is_write = 0;
-	vcpu->mmio_index = 0;
-
-   return X86EMUL_IO_NEEDED;
-}
-
-static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
- unsigned long addr,
- void *val,
- unsigned int bytes,
- struct x86_exception *exception)
-{
-   struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-
-   /* Crossing a page boundary? */
-	if (((addr + bytes - 1) ^ addr) & PAGE_MASK) {
-		int rc, now;
-
-		now = -addr & ~PAGE_MASK;
-   rc = emulator_read_emulated_onepage(addr, val, now, exception,
-   vcpu);
-   if (rc != X86EMUL_CONTINUE)
-   return rc;
-   addr += now;
-   val += now;
-   bytes -= now;
-   }
-   return emulator_read_emulated_onepage(addr, val, bytes, exception,
-   vcpu);
-}
-
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
const void *val, int bytes)
 {
@@ -4128,16 +4049,22 @@ static struct read_write_emulator_ops write_emultor = {
.write = true,
 };
 
-static int emulator_write_emulated_onepage(unsigned long addr,
-  const void *val,
-  unsigned int bytes,
-  struct x86_exception *exception,
-  struct kvm_vcpu *vcpu)
+static int emulator_read_write_onepage(unsigned long addr, void *val,
+  unsigned int bytes,
+  struct x86_exception *exception,
+  struct kvm_vcpu *vcpu,
+  struct read_write_emulator_ops *ops)
+
 {
-   gpa_t gpa;
+   gpa_t gpa;
int handled, ret;
+	bool write = ops->write;
+
+	if (ops->read_write_prepare &&
+	    ops->read_write_prepare(vcpu, val, bytes))
+		return X86EMUL_CONTINUE;
 
-	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, true);
+	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, write);
 
	if (ret < 0)
return X86EMUL_PROPAGATE_FAULT;
@@ -4146,15 +4073,14 @@ static int emulator_write_emulated_onepage(unsigned 
long addr,
if (ret)
goto mmio;
 
-   if (emulator_write_phys(vcpu, gpa, val, bytes))
+	if (ops->read_write_emulate(vcpu, gpa, val, bytes))
return X86EMUL_CONTINUE;
 
 mmio:
-   trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val);
/*
 * Is this MMIO handled locally?
 */
-   handled = vcpu_mmio_write(vcpu, gpa, bytes, val);
+   handled = 

[PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path

2011-06-22 Thread Xiao Guangrong
If the page fault is caused by mmio, we can cache the mmio info; later, we do
not need to walk the guest page table again and can quickly tell it is an mmio
fault while we emulate the mmio instruction
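The caching helpers referenced below (vcpu_cache_mmio_info/vcpu_clear_mmio_info) live in the x86.h hunk that is cut off here; a rough sketch of their assumed shape, using the fields added to kvm_vcpu_arch above:

static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
					gva_t gva, gfn_t gfn, unsigned access)
{
	vcpu->arch.mmio_gva = gva & PAGE_MASK;
	vcpu->arch.access = access;
	vcpu->arch.mmio_gfn = gfn;
}

/* Clear the cache for one gva, or for everything when called with ~0ul. */
static inline void vcpu_clear_mmio_info(struct kvm_vcpu *vcpu, gva_t gva)
{
	if (gva != ~0ul && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
		return;

	vcpu->arch.mmio_gva = 0;
}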

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h |5 +
 arch/x86/kvm/mmu.c  |   21 +++--
 arch/x86/kvm/mmu.h  |   23 +++
 arch/x86/kvm/paging_tmpl.h  |   21 ++---
 arch/x86/kvm/x86.c  |   11 +++
 arch/x86/kvm/x86.h  |   36 
 6 files changed, 96 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da6bbee..7b0834a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -415,6 +415,11 @@ struct kvm_vcpu_arch {
u64 mcg_ctl;
u64 *mce_banks;
 
+   /* Cache MMIO info */
+   u64 mmio_gva;
+   unsigned access;
+   gfn_t mmio_gfn;
+
/* used for guest single stepping over the given code position */
unsigned long singlestep_rip;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8316c2d..7f53210 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -217,11 +217,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 
accessed_mask,
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
 
-static bool is_write_protection(struct kvm_vcpu *vcpu)
-{
-   return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
-}
-
 static int is_cpuid_PSE36(void)
 {
return 1;
@@ -243,11 +238,6 @@ static int is_large_pte(u64 pte)
	return pte & PT_PAGE_SIZE_MASK;
 }
 
-static int is_writable_pte(unsigned long pte)
-{
-	return pte & PT_WRITABLE_MASK;
-}
-
 static int is_dirty_gpte(unsigned long pte)
 {
 	return pte & PT_DIRTY_MASK;
@@ -2247,15 +2237,17 @@ static void kvm_send_hwpoison_signal(unsigned long 
address, struct task_struct *
send_sig_info(SIGBUS, info, tsk);
 }
 
-static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn)
+static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gva_t gva,
+  unsigned access, gfn_t gfn, pfn_t pfn)
 {
kvm_release_pfn_clean(pfn);
if (is_hwpoison_pfn(pfn)) {
-   kvm_send_hwpoison_signal(gfn_to_hva(kvm, gfn), current);
+		kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current);
return 0;
} else if (is_fault_pfn(pfn))
return -EFAULT;
 
+   vcpu_cache_mmio_info(vcpu, gva, gfn, access);
return 1;
 }
 
@@ -2337,7 +2329,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, 
int write, gfn_t gfn,
 
/* mmio */
if (is_error_pfn(pfn))
-		return kvm_handle_bad_page(vcpu->kvm, gfn, pfn);
+   return kvm_handle_bad_page(vcpu, v, ACC_ALL, gfn, pfn);
 
	spin_lock(&vcpu->kvm->mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
@@ -2564,6 +2556,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
		return;
 
+	vcpu_clear_mmio_info(vcpu, ~0ul);
	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
	if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
		hpa_t root = vcpu->arch.mmu.root_hpa;
@@ -2710,7 +2703,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t 
gpa, u32 error_code,
 
/* mmio */
if (is_error_pfn(pfn))
-		return kvm_handle_bad_page(vcpu->kvm, gfn, pfn);
+		return kvm_handle_bad_page(vcpu, 0, 0, gfn, pfn);
	spin_lock(&vcpu->kvm->mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
goto out_unlock;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 7086ca8..05310b1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -76,4 +76,27 @@ static inline int is_present_gpte(unsigned long pte)
	return pte & PT_PRESENT_MASK;
 }
 
+static inline int is_writable_pte(unsigned long pte)
+{
+	return pte & PT_WRITABLE_MASK;
+}
+
+static inline bool is_write_protection(struct kvm_vcpu *vcpu)
+{
+   return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
+}
+
+static inline bool check_write_user_access(struct kvm_vcpu *vcpu,
+  bool write_fault, bool user_fault,
+  unsigned long pte)
+{
+	if (unlikely(write_fault && !is_writable_pte(pte)
+		     && (user_fault || is_write_protection(vcpu))))
+   return false;
+
+	if (unlikely(user_fault && !(pte & PT_USER_MASK)))
+   return false;
+
+   return true;
+}
 #endif
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 1caeb4d..13978dc 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -201,11 +201,8 @@ walk:
break;
}
 
-   if (unlikely(write_fault  

[PATCH v2 08/22] KVM: MMU: optimize to handle dirty bit

2011-06-22 Thread Xiao Guangrong
If the dirty bit is not set, we can make the pte access read-only to avoid
having to handle the dirty bit everywhere
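
A minimal sketch of the idea (the helper name is hypothetical, the masks are
the ones already used by the mmu code): only the leaf gpte carries a
meaningful dirty bit, so the walker strips write permission while the gpte is
still clean; the first guest write then faults and lets KVM set the dirty bit
before granting write access.

static unsigned gpte_access_sketch(u64 gpte, bool last)
{
	unsigned access;

	/* base permissions come straight from the guest pte */
	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;

	/* leaf entry that is not yet dirty: map it read-only for now */
	if (last && !(gpte & PT_DIRTY_MASK))
		access &= ~ACC_WRITE_MASK;

	return access;
}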

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   13 +--
 arch/x86/kvm/paging_tmpl.h |   46 ++-
 2 files changed, 25 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7f53210..9e35577 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1923,7 +1923,7 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, 
gfn_t gfn,
 
 static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
unsigned pte_access, int user_fault,
-   int write_fault, int dirty, int level,
+   int write_fault, int level,
gfn_t gfn, pfn_t pfn, bool speculative,
bool can_unsync, bool host_writable)
 {
@@ -1938,8 +1938,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
spte = PT_PRESENT_MASK;
if (!speculative)
spte |= shadow_accessed_mask;
-   if (!dirty)
-   pte_access = ~ACC_WRITE_MASK;
+
if (pte_access  ACC_EXEC_MASK)
spte |= shadow_x_mask;
else
@@ -2023,7 +2022,7 @@ done:
 
 static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 unsigned pt_access, unsigned pte_access,
-int user_fault, int write_fault, int dirty,
+int user_fault, int write_fault,
 int *ptwrite, int level, gfn_t gfn,
 pfn_t pfn, bool speculative,
 bool host_writable)
@@ -2059,7 +2058,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
}
 
if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
- dirty, level, gfn, pfn, speculative, true,
+ level, gfn, pfn, speculative, true,
  host_writable)) {
if (write_fault)
*ptwrite = 1;
@@ -2129,7 +2128,7 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
 
for (i = 0; i  ret; i++, gfn++, start++)
mmu_set_spte(vcpu, start, ACC_ALL,
-access, 0, 0, 1, NULL,
+access, 0, 0, NULL,
 sp-role.level, gfn,
 page_to_pfn(pages[i]), true, true);
 
@@ -2193,7 +2192,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
int write,
unsigned pte_access = ACC_ALL;
 
mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access,
-0, write, 1, pt_write,
+0, write, pt_write,
 level, gfn, pfn, prefault, map_writable);
direct_pte_prefetch(vcpu, iterator.sptep);
++vcpu-stat.pf_fixed;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 13978dc..e06c9e4 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -101,11 +101,15 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
return (ret != orig_pte);
 }
 
-static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte)
+static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte,
+  bool last)
 {
unsigned access;
 
	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
+   if (last && !is_dirty_gpte(gpte))
+   access &= ~ACC_WRITE_MASK;
+
 #if PTTYPE == 64
	if (vcpu->arch.mmu.nx)
		access &= ~(gpte >> PT64_NX_SHIFT);
@@ -227,8 +231,6 @@ walk:
pte |= PT_ACCESSED_MASK;
}
 
-   pte_access = pt_access  FNAME(gpte_access)(vcpu, pte);
-
walker-ptes[walker-level - 1] = pte;
 
if ((walker-level == PT_PAGE_TABLE_LEVEL) ||
@@ -269,7 +271,7 @@ walk:
break;
}
 
-   pt_access = pte_access;
+   pt_access = FNAME(gpte_access)(vcpu, pte, false);
--walker-level;
}
 
@@ -293,6 +295,7 @@ walk:
walker-ptes[walker-level - 1] = pte;
}
 
+   pte_access = pt_access  FNAME(gpte_access)(vcpu, pte, true);
walker-pt_access = pt_access;
walker-pte_access = pte_access;
pgprintk(%s: pte %llx pte_access %x pt_access %x\n,
@@ -373,7 +376,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp,
return;
 
pgprintk(%s: gpte %llx spte %p\n, __func__, (u64)gpte, spte);
-   pte_access = sp-role.access  FNAME(gpte_access)(vcpu, gpte);
+   pte_access = sp-role.access  FNAME(gpte_access)(vcpu, gpte, true);
pfn = 

[PATCH v2 09/22] KVM: MMU: cleanup for FNAME(fetch)

2011-06-22 Thread Xiao Guangrong
gw->pte_access is the final access permission, since it is unified with
gw->pt_access when we walk the guest page table:

FNAME(walk_addr_generic):
	pte_access = pt_access & FNAME(gpte_access)(vcpu, pte, true);

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/paging_tmpl.h |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index e06c9e4..e9ee473 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -485,7 +485,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
if (!is_present_gpte(gw-ptes[gw-level - 1]))
return NULL;
 
-   direct_access = gw-pt_access  gw-pte_access;
+   direct_access = gw-pte_access;
 
top_level = vcpu-arch.mmu.root_level;
if (top_level == PT32E_ROOT_LEVEL)
@@ -543,7 +543,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
link_shadow_page(it.sptep, sp);
}
 
-   mmu_set_spte(vcpu, it.sptep, access, gw-pte_access  access,
+   mmu_set_spte(vcpu, it.sptep, access, gw-pte_access,
 user_fault, write_fault, ptwrite, it.level,
 gw-gfn, pfn, prefault, map_writable);
FNAME(pte_prefetch)(vcpu, gw, it.sptep);
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 10/22] KVM: MMU: rename 'pt_write' to 'emulate'

2011-06-22 Thread Xiao Guangrong
If 'pt_write' is true, we need to emulate the fault.  A later patch will need
to emulate the fault even when it is not a pt_write event, so rename the
variable to better fit its meaning

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   10 +-
 arch/x86/kvm/paging_tmpl.h |   16 
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9e35577..580ecf0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2023,7 +2023,7 @@ done:
 static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 unsigned pt_access, unsigned pte_access,
 int user_fault, int write_fault,
-int *ptwrite, int level, gfn_t gfn,
+int *emulate, int level, gfn_t gfn,
 pfn_t pfn, bool speculative,
 bool host_writable)
 {
@@ -2061,7 +2061,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
  level, gfn, pfn, speculative, true,
  host_writable)) {
if (write_fault)
-   *ptwrite = 1;
+   *emulate = 1;
kvm_mmu_flush_tlb(vcpu);
}
 
@@ -2184,7 +2184,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
int write,
 {
struct kvm_shadow_walk_iterator iterator;
struct kvm_mmu_page *sp;
-   int pt_write = 0;
+   int emulate = 0;
gfn_t pseudo_gfn;
 
for_each_shadow_entry(vcpu, (u64)gfn  PAGE_SHIFT, iterator) {
@@ -2192,7 +2192,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
int write,
unsigned pte_access = ACC_ALL;
 
mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access,
-0, write, pt_write,
+0, write, emulate,
 level, gfn, pfn, prefault, map_writable);
direct_pte_prefetch(vcpu, iterator.sptep);
++vcpu-stat.pf_fixed;
@@ -2220,7 +2220,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
int write,
   | shadow_accessed_mask);
}
}
-   return pt_write;
+   return emulate;
 }
 
 static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct 
*tsk)
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index e9ee473..1e534e0 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -473,7 +473,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, 
struct guest_walker *gw,
 static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 struct guest_walker *gw,
 int user_fault, int write_fault, int hlevel,
-int *ptwrite, pfn_t pfn, bool map_writable,
+int *emulate, pfn_t pfn, bool map_writable,
 bool prefault)
 {
unsigned access = gw-pt_access;
@@ -544,7 +544,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
}
 
mmu_set_spte(vcpu, it.sptep, access, gw-pte_access,
-user_fault, write_fault, ptwrite, it.level,
+user_fault, write_fault, emulate, it.level,
 gw-gfn, pfn, prefault, map_writable);
FNAME(pte_prefetch)(vcpu, gw, it.sptep);
 
@@ -578,7 +578,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
addr, u32 error_code,
int user_fault = error_code  PFERR_USER_MASK;
struct guest_walker walker;
u64 *sptep;
-   int write_pt = 0;
+   int emulate = 0;
int r;
pfn_t pfn;
int level = PT_PAGE_TABLE_LEVEL;
@@ -639,19 +639,19 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
addr, u32 error_code,
if (!force_pt_level)
transparent_hugepage_adjust(vcpu, walker.gfn, pfn, level);
sptep = FNAME(fetch)(vcpu, addr, walker, user_fault, write_fault,
-level, write_pt, pfn, map_writable, prefault);
+level, emulate, pfn, map_writable, prefault);
(void)sptep;
-   pgprintk(%s: shadow pte %p %llx ptwrite %d\n, __func__,
-sptep, *sptep, write_pt);
+   pgprintk(%s: shadow pte %p %llx emulate %d\n, __func__,
+sptep, *sptep, emulate);
 
-   if (!write_pt)
+   if (!emulate)
vcpu-arch.last_pt_write_count = 0; /* reset fork detector */
 
++vcpu-stat.pf_fixed;
trace_kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
spin_unlock(vcpu-kvm-mmu_lock);
 
-   return write_pt;
+   return emulate;
 
 out_unlock:
spin_unlock(vcpu-kvm-mmu_lock);
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body 

[PATCH v2 11/22] KVM: MMU: count used shadow pages on preparing path

2011-06-22 Thread Xiao Guangrong
Move counting of used shadow pages from the committing path to the preparing
path to reduce tlb flushes on some paths

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   10 +-
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 580ecf0..8233aba 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1039,7 +1039,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm 
*kvm, int nr)
percpu_counter_add(kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 {
ASSERT(is_empty_shadow_page(sp-spt));
hlist_del(sp-hash_link);
@@ -1048,7 +1048,6 @@ static void kvm_mmu_free_page(struct kvm *kvm, struct 
kvm_mmu_page *sp)
if (!sp-role.direct)
free_page((unsigned long)sp-gfns);
kmem_cache_free(mmu_page_header_cache, sp);
-   kvm_mod_used_mmu_pages(kvm, -1);
 }
 
 static unsigned kvm_page_table_hashfn(gfn_t gfn)
@@ -1655,6 +1654,7 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, 
struct kvm_mmu_page *sp,
/* Count self */
ret++;
list_move(sp-link, invalid_list);
+   kvm_mod_used_mmu_pages(kvm, -1);
} else {
list_move(sp-link, kvm-arch.active_mmu_pages);
kvm_reload_remote_mmus(kvm);
@@ -1678,7 +1678,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
do {
sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
WARN_ON(!sp-role.invalid || sp-root_count);
-   kvm_mmu_free_page(kvm, sp);
+   kvm_mmu_free_page(sp);
} while (!list_empty(invalid_list));
 
 }
@@ -1704,8 +1704,8 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned 
int goal_nr_mmu_pages)
page = container_of(kvm-arch.active_mmu_pages.prev,
struct kvm_mmu_page, link);
kvm_mmu_prepare_zap_page(kvm, page, invalid_list);
-   kvm_mmu_commit_zap_page(kvm, invalid_list);
}
+   kvm_mmu_commit_zap_page(kvm, invalid_list);
goal_nr_mmu_pages = kvm-arch.n_used_mmu_pages;
}
 
@@ -3312,9 +3312,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
sp = container_of(vcpu-kvm-arch.active_mmu_pages.prev,
  struct kvm_mmu_page, link);
kvm_mmu_prepare_zap_page(vcpu-kvm, sp, invalid_list);
-   kvm_mmu_commit_zap_page(vcpu-kvm, invalid_list);
++vcpu-kvm-stat.mmu_recycled;
}
+   kvm_mmu_commit_zap_page(vcpu-kvm, invalid_list);
 }
 
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 13/22] KVM: MMU: remove bypass_guest_pf

2011-06-22 Thread Xiao Guangrong
The idea is from Avi:
| Maybe it's time to kill off bypass_guest_pf=1.  It's not as effective as
| it used to be, since unsync pages always use shadow_trap_nonpresent_pte,
| and since we convert between the two nonpresent_ptes during sync and unsync.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 Documentation/kernel-parameters.txt |4 --
 arch/x86/include/asm/kvm_host.h |3 -
 arch/x86/kvm/mmu.c  |   83 ++-
 arch/x86/kvm/mmu_audit.c|   12 -
 arch/x86/kvm/paging_tmpl.h  |   51 +++--
 arch/x86/kvm/vmx.c  |   11 +
 arch/x86/kvm/x86.c  |1 -
 7 files changed, 33 insertions(+), 132 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index fd248a31..a06e4f1 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1159,10 +1159,6 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
for all guests.
Default is 1 (enabled) if in 64bit or 32bit-PAE mode
 
-   kvm-intel.bypass_guest_pf=
-   [KVM,Intel] Disables bypassing of guest page faults
-   on Intel chips. Default is 1 (enabled)
-
kvm-intel.ept=  [KVM,Intel] Disable extended page tables
(virtualized MMU) support on capable Intel chips.
Default is 1 (enabled)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7b0834a..42e577d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -266,8 +266,6 @@ struct kvm_mmu {
gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
struct x86_exception *exception);
gpa_t (*translate_gpa)(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access);
-   void (*prefetch_page)(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *page);
int (*sync_page)(struct kvm_vcpu *vcpu,
 struct kvm_mmu_page *sp);
void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva);
@@ -638,7 +636,6 @@ void kvm_mmu_module_exit(void);
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_mmu_create(struct kvm_vcpu *vcpu);
 int kvm_mmu_setup(struct kvm_vcpu *vcpu);
-void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte);
 void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
u64 dirty_mask, u64 nx_mask, u64 x_mask);
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c6fabf0..2c7b7a0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -186,8 +186,6 @@ static struct kmem_cache *pte_list_desc_cache;
 static struct kmem_cache *mmu_page_header_cache;
 static struct percpu_counter kvm_total_used_mmu_pages;
 
-static u64 __read_mostly shadow_trap_nonpresent_pte;
-static u64 __read_mostly shadow_notrap_nonpresent_pte;
 static u64 __read_mostly shadow_nx_mask;
 static u64 __read_mostly shadow_x_mask;/* mutual exclusive with 
nx_mask */
 static u64 __read_mostly shadow_user_mask;
@@ -199,13 +197,6 @@ static inline u64 rsvd_bits(int s, int e)
return ((1ULL  (e - s + 1)) - 1)  s;
 }
 
-void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte)
-{
-   shadow_trap_nonpresent_pte = trap_pte;
-   shadow_notrap_nonpresent_pte = notrap_pte;
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_set_nonpresent_ptes);
-
 void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
u64 dirty_mask, u64 nx_mask, u64 x_mask)
 {
@@ -229,8 +220,7 @@ static int is_nx(struct kvm_vcpu *vcpu)
 
 static int is_shadow_present_pte(u64 pte)
 {
-   return pte != shadow_trap_nonpresent_pte
-pte != shadow_notrap_nonpresent_pte;
+   return pte != 0ull;
 }
 
 static int is_large_pte(u64 pte)
@@ -777,9 +767,9 @@ static int set_spte_track_bits(u64 *sptep, u64 new_spte)
return 1;
 }
 
-static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte)
+static void drop_spte(struct kvm *kvm, u64 *sptep)
 {
-   if (set_spte_track_bits(sptep, new_spte))
+   if (set_spte_track_bits(sptep, 0ull))
rmap_remove(kvm, sptep);
 }
 
@@ -814,8 +804,7 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
BUG_ON((*spte  (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) != 
(PT_PAGE_SIZE_MASK|PT_PRESENT_MASK));
pgprintk(rmap_write_protect(large): spte %p %llx 
%lld\n, spte, *spte, gfn);
if (is_writable_pte(*spte)) {
-   drop_spte(kvm, spte,
- shadow_trap_nonpresent_pte);
+   drop_spte(kvm, spte);
--kvm-stat.lpages;
spte = NULL;
write_protected = 1;
@@ -836,7 +825,7 @@ static 

[PATCH v2 14/22] KVM: MMU: filter out the mmio pfn from the fault pfn

2011-06-22 Thread Xiao Guangrong
If the page fault is caused by mmio, the gfn cannot be found in the memslots,
and 'bad_pfn' is returned on the gfn_to_hva path, so we can use 'bad_pfn' to
identify the mmio page fault.
And, to clarify the meaning of mmio pfn, we return the fault page instead of
the bad page when the gfn is not allowed to prefetch
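
A rough sketch of how a fault path can then tell the three cases apart,
assuming the helpers added below (the classifier function itself is
hypothetical):

static int classify_fault_pfn_sketch(pfn_t pfn)
{
	if (is_mmio_pfn(pfn))		/* gfn had no memslot: an mmio access */
		return 1;		/* caller should emulate the access */

	if (is_invalid_pfn(pfn))	/* hwpoison or real fault: report error */
		return -EFAULT;

	return 0;			/* normal ram page, map it as usual */
}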

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c   |4 ++--
 include/linux/kvm_host.h |5 +
 virt/kvm/kvm_main.c  |   16 ++--
 3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2c7b7a0..4a51042 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2085,8 +2085,8 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu 
*vcpu, gfn_t gfn,
 
slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, no_dirty_log);
if (!slot) {
-   get_page(bad_page);
-   return page_to_pfn(bad_page);
+   get_page(fault_page);
+   return page_to_pfn(fault_page);
}
 
hva = gfn_to_hva_memslot(slot, gfn);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 31ebb59..3e548e8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -326,12 +326,17 @@ static inline struct kvm_memslots *kvm_memslots(struct 
kvm *kvm)
 static inline int is_error_hpa(hpa_t hpa) { return hpa  HPA_MSB; }
 
 extern struct page *bad_page;
+extern struct page *fault_page;
+
 extern pfn_t bad_pfn;
+extern pfn_t fault_pfn;
 
 int is_error_page(struct page *page);
 int is_error_pfn(pfn_t pfn);
 int is_hwpoison_pfn(pfn_t pfn);
 int is_fault_pfn(pfn_t pfn);
+int is_mmio_pfn(pfn_t pfn);
+int is_invalid_pfn(pfn_t pfn);
 int kvm_is_error_hva(unsigned long addr);
 int kvm_set_memory_region(struct kvm *kvm,
  struct kvm_userspace_memory_region *mem,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 11d2783..c7d41d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -101,8 +101,8 @@ static bool largepages_enabled = true;
 static struct page *hwpoison_page;
 static pfn_t hwpoison_pfn;
 
-static struct page *fault_page;
-static pfn_t fault_pfn;
+struct page *fault_page;
+pfn_t fault_pfn;
 
 inline int kvm_is_mmio_pfn(pfn_t pfn)
 {
@@ -931,6 +931,18 @@ int is_fault_pfn(pfn_t pfn)
 }
 EXPORT_SYMBOL_GPL(is_fault_pfn);
 
+int is_mmio_pfn(pfn_t pfn)
+{
+   return pfn == bad_pfn;
+}
+EXPORT_SYMBOL_GPL(is_mmio_pfn);
+
+int is_invalid_pfn(pfn_t pfn)
+{
+   return pfn == hwpoison_pfn || pfn == fault_pfn;
+}
+EXPORT_SYMBOL_GPL(is_invalid_pfn);
+
 static inline unsigned long bad_hva(void)
 {
return PAGE_OFFSET;
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 15/22] KVM: MMU: abstract some functions to handle fault pfn

2011-06-22 Thread Xiao Guangrong
Introduce handle_abnormal_pfn to handle a fault pfn on the page fault path,
and mmu_invalid_pfn to handle a fault pfn on the prefetch path

This is preparatory work for mmio page fault support

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   47 ---
 arch/x86/kvm/paging_tmpl.h |   12 +-
 2 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4a51042..4fe1cb3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2221,18 +2221,15 @@ static void kvm_send_hwpoison_signal(unsigned long 
address, struct task_struct *
send_sig_info(SIGBUS, info, tsk);
 }
 
-static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gva_t gva,
-  unsigned access, gfn_t gfn, pfn_t pfn)
+static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, pfn_t pfn)
 {
kvm_release_pfn_clean(pfn);
if (is_hwpoison_pfn(pfn)) {
kvm_send_hwpoison_signal(gfn_to_hva(vcpu-kvm, gfn), current);
return 0;
-   } else if (is_fault_pfn(pfn))
-   return -EFAULT;
+   }
 
-   vcpu_cache_mmio_info(vcpu, gva, gfn, access);
-   return 1;
+   return -EFAULT;
 }
 
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
@@ -2277,6 +2274,33 @@ static void transparent_hugepage_adjust(struct kvm_vcpu 
*vcpu,
}
 }
 
+static bool mmu_invalid_pfn(pfn_t pfn)
+{
+   return unlikely(is_invalid_pfn(pfn) || is_mmio_pfn(pfn));
+}
+
+static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+   pfn_t pfn, unsigned access, int *ret_val)
+{
+   bool ret = true;
+
+   /* The pfn is invalid, report the error! */
+   if (unlikely(is_invalid_pfn(pfn))) {
+   *ret_val = kvm_handle_bad_page(vcpu, gfn, pfn);
+   goto exit;
+   }
+
+   if (unlikely(is_mmio_pfn(pfn))) {
+   vcpu_cache_mmio_info(vcpu, gva, gfn, access);
+   *ret_val = 1;
+   goto exit;
+   }
+
+   ret = false;
+exit:
+   return ret;
+}
+
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 gva_t gva, pfn_t *pfn, bool write, bool *writable);
 
@@ -2311,9 +2335,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, 
int write, gfn_t gfn,
if (try_async_pf(vcpu, prefault, gfn, v, pfn, write, map_writable))
return 0;
 
-   /* mmio */
-   if (is_error_pfn(pfn))
-   return kvm_handle_bad_page(vcpu, v, ACC_ALL, gfn, pfn);
+   if (handle_abnormal_pfn(vcpu, v, gfn, pfn, ACC_ALL, r))
+   return r;
 
spin_lock(vcpu-kvm-mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
@@ -2685,9 +2708,9 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t 
gpa, u32 error_code,
if (try_async_pf(vcpu, prefault, gfn, gpa, pfn, write, map_writable))
return 0;
 
-   /* mmio */
-   if (is_error_pfn(pfn))
-   return kvm_handle_bad_page(vcpu, 0, 0, gfn, pfn);
+   if (handle_abnormal_pfn(vcpu, 0, gfn, pfn, ACC_ALL, r))
+   return r;
+
spin_lock(vcpu-kvm-mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
goto out_unlock;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 58a29d0..870be69 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -373,7 +373,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp,
pgprintk(%s: gpte %llx spte %p\n, __func__, (u64)gpte, spte);
pte_access = sp-role.access  FNAME(gpte_access)(vcpu, gpte, true);
pfn = gfn_to_pfn_atomic(vcpu-kvm, gpte_to_gfn(gpte));
-   if (is_error_pfn(pfn)) {
+   if (mmu_invalid_pfn(pfn)) {
kvm_release_pfn_clean(pfn);
return;
}
@@ -451,7 +451,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, 
struct guest_walker *gw,
gfn = gpte_to_gfn(gpte);
pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
  pte_access  ACC_WRITE_MASK);
-   if (is_error_pfn(pfn)) {
+   if (mmu_invalid_pfn(pfn)) {
kvm_release_pfn_clean(pfn);
break;
}
@@ -621,10 +621,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
addr, u32 error_code,
 map_writable))
return 0;
 
-   /* mmio */
-   if (is_error_pfn(pfn))
-   return kvm_handle_bad_page(vcpu, mmu_is_nested(vcpu) ? 0 :
- addr, walker.pte_access, walker.gfn, pfn);
+   if (handle_abnormal_pfn(vcpu, mmu_is_nested(vcpu) ? 0 : addr,
+   walker.gfn, pfn, walker.pte_access, r))
+   return r;
+

[PATCH v2 16/22] KVM: MMU: introduce the rules to modify shadow page table

2011-06-22 Thread Xiao Guangrong
Introduce some interfaces to modify the spte the way the Linux kernel does:
- mmu_spte_clear_track_bits: sets the spte from present to nonpresent, and
  tracks the state bits (accessed/dirty) of the spte
- mmu_spte_clear_no_track: the same as mmu_spte_clear_track_bits, except that
  it does not track the state bits
- mmu_spte_set: sets the spte from nonpresent to present
- mmu_spte_update: only updates the state bits

Setting a spte from present to present is no longer allowed; later we can
drop the atomic operations for X86_32 hosts, and this is the preparatory work
for reading sptes on X86_32 hosts outside the mmu lock
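
A minimal usage sketch of the rules above (the sptes and callers are
illustrative only; the helpers are the ones this patch introduces):

	/* nonpresent -> present: install a new mapping */
	mmu_spte_set(sptep, new_spte);

	/* changing only the state bits of a present spte, e.g. write
	 * protection, goes through mmu_spte_update() */
	mmu_spte_update(sptep, *sptep & ~PT_WRITABLE_MASK);

	/* present -> nonpresent for a last-level spte: the accessed/dirty
	 * bits are tracked so the backing pfn is marked accordingly */
	if (mmu_spte_clear_track_bits(sptep))
		rmap_remove(kvm, sptep);

	/* present -> nonpresent for an upper-level spte: nothing to track */
	mmu_spte_clear_no_track(sptep);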

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |  103 +++-
 1 files changed, 69 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4fe1cb3..5ceb64a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -299,12 +299,30 @@ static bool spte_is_bit_cleared(u64 old_spte, u64 
new_spte, u64 bit_mask)
return (old_spte  bit_mask)  !(new_spte  bit_mask);
 }
 
-static void update_spte(u64 *sptep, u64 new_spte)
+/* Rules for using mmu_spte_set:
+ * Set the sptep from nonpresent to present.
+ * Note: the sptep being assigned *must* be either not present
+ * or in a state where the hardware will not attempt to update
+ * the spte.
+ */
+static void mmu_spte_set(u64 *sptep, u64 new_spte)
+{
+   WARN_ON(is_shadow_present_pte(*sptep));
+   __set_spte(sptep, new_spte);
+}
+
+/* Rules for using mmu_spte_update:
+ * Update the state bits, it means the mapped pfn is not changged.
+ */
+static void mmu_spte_update(u64 *sptep, u64 new_spte)
 {
u64 mask, old_spte = *sptep;
 
WARN_ON(!is_rmap_spte(new_spte));
 
+   if (!is_shadow_present_pte(old_spte))
+   return mmu_spte_set(sptep, new_spte);
+
new_spte |= old_spte  shadow_dirty_mask;
 
mask = shadow_accessed_mask;
@@ -325,6 +343,42 @@ static void update_spte(u64 *sptep, u64 new_spte)
kvm_set_pfn_dirty(spte_to_pfn(old_spte));
 }
 
+/*
+ * Rules for using mmu_spte_clear_track_bits:
+ * It sets the sptep from present to nonpresent, and track the
+ * state bits, it is used to clear the last level sptep.
+ */
+static int mmu_spte_clear_track_bits(u64 *sptep)
+{
+   pfn_t pfn;
+   u64 old_spte = *sptep;
+
+   if (!spte_has_volatile_bits(old_spte))
+   __set_spte(sptep, 0ull);
+   else
+   old_spte = __xchg_spte(sptep, 0ull);
+
+   if (!is_rmap_spte(old_spte))
+   return 0;
+
+   pfn = spte_to_pfn(old_spte);
+   if (!shadow_accessed_mask || old_spte  shadow_accessed_mask)
+   kvm_set_pfn_accessed(pfn);
+   if (!shadow_dirty_mask || (old_spte  shadow_dirty_mask))
+   kvm_set_pfn_dirty(pfn);
+   return 1;
+}
+
+/*
+ * Rules for using mmu_spte_clear_no_track:
+ * Directly clear spte without caring the state bits of sptep,
+ * it is used to set the upper level spte.
+ */
+static void mmu_spte_clear_no_track(u64 *sptep)
+{
+   __set_spte(sptep, 0ull);
+}
+
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
  struct kmem_cache *base_cache, int min)
 {
@@ -746,30 +800,9 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
pte_list_remove(spte, rmapp);
 }
 
-static int set_spte_track_bits(u64 *sptep, u64 new_spte)
-{
-   pfn_t pfn;
-   u64 old_spte = *sptep;
-
-   if (!spte_has_volatile_bits(old_spte))
-   __set_spte(sptep, new_spte);
-   else
-   old_spte = __xchg_spte(sptep, new_spte);
-
-   if (!is_rmap_spte(old_spte))
-   return 0;
-
-   pfn = spte_to_pfn(old_spte);
-   if (!shadow_accessed_mask || old_spte  shadow_accessed_mask)
-   kvm_set_pfn_accessed(pfn);
-   if (!shadow_dirty_mask || (old_spte  shadow_dirty_mask))
-   kvm_set_pfn_dirty(pfn);
-   return 1;
-}
-
 static void drop_spte(struct kvm *kvm, u64 *sptep)
 {
-   if (set_spte_track_bits(sptep, 0ull))
+   if (mmu_spte_clear_track_bits(sptep))
rmap_remove(kvm, sptep);
 }
 
@@ -787,7 +820,7 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
BUG_ON(!(*spte  PT_PRESENT_MASK));
rmap_printk(rmap_write_protect: spte %p %llx\n, spte, *spte);
if (is_writable_pte(*spte)) {
-   update_spte(spte, *spte  ~PT_WRITABLE_MASK);
+   mmu_spte_update(spte, *spte  ~PT_WRITABLE_MASK);
write_protected = 1;
}
spte = rmap_next(kvm, rmapp, spte);
@@ -856,7 +889,8 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long 
*rmapp,
new_spte = ~PT_WRITABLE_MASK;
new_spte = ~SPTE_HOST_WRITEABLE;
new_spte = ~shadow_accessed_mask;
-   set_spte_track_bits(spte, 

[PATCH v2 17/22] KVM: MMU: clean up spte updating and clearing

2011-06-22 Thread Xiao Guangrong
Clean up the code between mmu_spte_clear_* and mmu_spte_update

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   75 +++-
 1 files changed, 39 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5ceb64a..339e1a3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -279,24 +279,51 @@ static u64 __xchg_spte(u64 *sptep, u64 new_spte)
 #endif
 }
 
-static bool spte_has_volatile_bits(u64 spte)
+static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
+{
+   return (old_spte  bit_mask)  !(new_spte  bit_mask);
+}
+
+static bool spte_has_volatile_bits(u64 spte, u64 keep_bits)
 {
+   bool access_nonvolatile = false, dirty_nonvolatile = false;
+
if (!shadow_accessed_mask)
return false;
 
-   if (!is_shadow_present_pte(spte))
-   return false;
+   if ((spte | keep_bits)  shadow_accessed_mask)
+   access_nonvolatile = true;
+
+   if (!is_writable_pte(spte) || ((spte | keep_bits)  shadow_dirty_mask))
+   dirty_nonvolatile = true;
 
-   if ((spte  shadow_accessed_mask) 
- (!is_writable_pte(spte) || (spte  shadow_dirty_mask)))
+   if (access_nonvolatile  dirty_nonvolatile)
return false;
 
return true;
 }
 
-static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
+static void track_spte_bits(u64 old_spte, u64 keep_bits, bool always_track)
 {
-   return (old_spte  bit_mask)  !(new_spte  bit_mask);
+   if (always_track ||
+ (spte_is_bit_cleared(old_spte, keep_bits, shadow_accessed_mask)))
+   kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+
+   if (always_track ||
+ (spte_is_bit_cleared(old_spte, keep_bits, shadow_dirty_mask)))
+   kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+}
+
+static u64 spte_get_and_update_clear(u64 *sptep, u64 new_spte)
+{
+   u64 old_spte = *sptep;
+
+   if (!spte_has_volatile_bits(old_spte, new_spte))
+   __set_spte(sptep, new_spte);
+   else
+   old_spte = __xchg_spte(sptep, new_spte);
+
+   return old_spte;
 }
 
 /* Rules for using mmu_spte_set:
@@ -316,7 +343,7 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
  */
 static void mmu_spte_update(u64 *sptep, u64 new_spte)
 {
-   u64 mask, old_spte = *sptep;
+   u64 old_spte = *sptep;
 
WARN_ON(!is_rmap_spte(new_spte));
 
@@ -324,23 +351,8 @@ static void mmu_spte_update(u64 *sptep, u64 new_spte)
return mmu_spte_set(sptep, new_spte);
 
new_spte |= old_spte  shadow_dirty_mask;
-
-   mask = shadow_accessed_mask;
-   if (is_writable_pte(old_spte))
-   mask |= shadow_dirty_mask;
-
-   if (!spte_has_volatile_bits(old_spte) || (new_spte  mask) == mask)
-   __set_spte(sptep, new_spte);
-   else
-   old_spte = __xchg_spte(sptep, new_spte);
-
-   if (!shadow_accessed_mask)
-   return;
-
-   if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask))
-   kvm_set_pfn_accessed(spte_to_pfn(old_spte));
-   if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
-   kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+   old_spte = spte_get_and_update_clear(sptep, new_spte);
+   track_spte_bits(old_spte, new_spte, false);
 }
 
 /*
@@ -350,22 +362,13 @@ static void mmu_spte_update(u64 *sptep, u64 new_spte)
  */
 static int mmu_spte_clear_track_bits(u64 *sptep)
 {
-   pfn_t pfn;
u64 old_spte = *sptep;
 
-   if (!spte_has_volatile_bits(old_spte))
-   __set_spte(sptep, 0ull);
-   else
-   old_spte = __xchg_spte(sptep, 0ull);
-
if (!is_rmap_spte(old_spte))
return 0;
 
-   pfn = spte_to_pfn(old_spte);
-   if (!shadow_accessed_mask || old_spte  shadow_accessed_mask)
-   kvm_set_pfn_accessed(pfn);
-   if (!shadow_dirty_mask || (old_spte  shadow_dirty_mask))
-   kvm_set_pfn_dirty(pfn);
+   old_spte = spte_get_and_update_clear(sptep, 0ull);
+   track_spte_bits(old_spte, 0ull, !shadow_accessed_mask);
return 1;
 }
 
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 18/22] KVM: MMU: do not need to set/clear spte atomically

2011-06-22 Thread Xiao Guangrong
Now the spte only changes from nonpresent to present or from present to
nonpresent, so we can use the same trick the Linux kernel uses to set/clear
the spte non-atomically

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   82 +++
 1 files changed, 69 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 339e1a3..0ba94e9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -259,26 +259,82 @@ static gfn_t pse36_gfn_delta(u32 gpte)
return (gpte  PT32_DIR_PSE36_MASK)  shift;
 }
 
+#ifdef CONFIG_X86_64
 static void __set_spte(u64 *sptep, u64 spte)
 {
-   set_64bit(sptep, spte);
+   *sptep = spte;
 }
 
-static u64 __xchg_spte(u64 *sptep, u64 new_spte)
+static void __update_clear_spte_fast(u64 *sptep, u64 spte)
 {
-#ifdef CONFIG_X86_64
-   return xchg(sptep, new_spte);
+   *sptep = spte;
+}
+
+static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
+{
+   return xchg(sptep, spte);
+}
 #else
-   u64 old_spte;
+union split_spte {
+   struct {
+   u32 spte_low;
+   u32 spte_high;
+   };
+   u64 spte;
+};
 
-   do {
-   old_spte = *sptep;
-   } while (cmpxchg64(sptep, old_spte, new_spte) != old_spte);
+static void __set_spte(u64 *sptep, u64 spte)
+{
+   union split_spte *ssptep, sspte;
 
-   return old_spte;
-#endif
+   ssptep = (union split_spte *)sptep;
+   sspte = (union split_spte)spte;
+
+   ssptep-spte_high = sspte.spte_high;
+
+   /*
+* If we map the spte from nonpresent to present, We should store
+* the high bits firstly, then set present bit, so cpu can not
+* fetch this spte while we are setting the spte.
+*/
+   smp_wmb();
+
+   ssptep-spte_low = sspte.spte_low;
 }
 
+static void __update_clear_spte_fast(u64 *sptep, u64 spte)
+{
+   union split_spte *ssptep, sspte;
+
+   ssptep = (union split_spte *)sptep;
+   sspte = (union split_spte)spte;
+
+   ssptep-spte_low = sspte.spte_low;
+
+   /*
+* If we map the spte from present to nonpresent, we should clear
+* present bit firstly to avoid vcpu fetch the old high bits.
+*/
+   smp_wmb();
+
+   ssptep-spte_high = sspte.spte_high;
+}
+
+static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
+{
+   union split_spte *ssptep, sspte, orig;
+
+   ssptep = (union split_spte *)sptep;
+   sspte = (union split_spte)spte;
+
+   /* xchg acts as a barrier before the setting of the high bits */
+   orig.spte_low = xchg(ssptep-spte_low, sspte.spte_low);
+   orig.spte_high = ssptep-spte_high = sspte.spte_high;
+
+   return orig.spte;
+}
+#endif
+
 static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
 {
return (old_spte  bit_mask)  !(new_spte  bit_mask);
@@ -319,9 +375,9 @@ static u64 spte_get_and_update_clear(u64 *sptep, u64 
new_spte)
u64 old_spte = *sptep;
 
if (!spte_has_volatile_bits(old_spte, new_spte))
-   __set_spte(sptep, new_spte);
+   __update_clear_spte_fast(sptep, new_spte);
else
-   old_spte = __xchg_spte(sptep, new_spte);
+   old_spte = __update_clear_spte_slow(sptep, new_spte);
 
return old_spte;
 }
@@ -379,7 +435,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
  */
 static void mmu_spte_clear_no_track(u64 *sptep)
 {
-   __set_spte(sptep, 0ull);
+   __update_clear_spte_fast(sptep, 0ull);
 }
 
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 19/22] KVM: MMU: lockless walking shadow page table

2011-06-22 Thread Xiao Guangrong
Use RCU to protect shadow page tables that are queued to be freed, so we can
walk them safely; the walk needs to be fast and is needed by the mmio page
fault path
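
A minimal sketch of the resulting caller-side pattern, using the begin/end
helpers added below (the walker body here is illustrative only):

static u64 get_last_spte_lockless_sketch(struct kvm_vcpu *vcpu, u64 addr)
{
	struct kvm_shadow_walk_iterator iterator;
	u64 spte, last = 0ull;

	/* rcu_read_lock() plus reader_counter keep the tables alive */
	walk_shadow_page_lockless_begin(vcpu);

	for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
		if (!is_shadow_present_pte(spte))
			break;
		last = spte;
	}

	walk_shadow_page_lockless_end(vcpu);
	return last;
}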

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h |8 +++
 arch/x86/kvm/mmu.c  |  132 ---
 2 files changed, 132 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 42e577d..87a868e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -233,6 +233,12 @@ struct kvm_mmu_page {
unsigned int unsync_children;
unsigned long parent_ptes;  /* Reverse mapping for parent_pte */
DECLARE_BITMAP(unsync_child_bitmap, 512);
+
+#ifdef CONFIG_X86_32
+   int clear_spte_count;
+#endif
+
+   struct rcu_head rcu;
 };
 
 struct kvm_pv_mmu_op_buffer {
@@ -477,6 +483,8 @@ struct kvm_arch {
u64 hv_guest_os_id;
u64 hv_hypercall;
 
+   atomic_t reader_counter;
+
#ifdef CONFIG_KVM_MMU_AUDIT
int audit_point;
#endif
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0ba94e9..dad7ad9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -182,6 +182,12 @@ struct kvm_shadow_walk_iterator {
 shadow_walk_okay((_walker));  \
 shadow_walk_next((_walker)))
 
+#define for_each_shadow_entry_lockless(_vcpu, _addr, _walker, spte)\
+   for (shadow_walk_init((_walker), _vcpu, _addr);\
+shadow_walk_okay((_walker)) \
+   ({ spte = mmu_spte_get_lockless(_walker.sptep); 1; });  \
+__shadow_walk_next((_walker), spte))
+
 static struct kmem_cache *pte_list_desc_cache;
 static struct kmem_cache *mmu_page_header_cache;
 static struct percpu_counter kvm_total_used_mmu_pages;
@@ -274,6 +280,11 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
 {
return xchg(sptep, spte);
 }
+
+static u64 __get_spte_lockless(u64 *sptep)
+{
+   return ACCESS_ONCE(*sptep);
+}
 #else
 union split_spte {
struct {
@@ -283,6 +294,18 @@ union split_spte {
u64 spte;
 };
 
+static void count_spte_clear(u64 *sptep, u64 spte)
+{
+   struct kvm_mmu_page *sp =  page_header(__pa(sptep));
+
+   if (is_shadow_present_pte(spte))
+   return;
+
+   /* Ensure the spte is completely set before we increase the count */
+   smp_wmb();
+   sp-clear_spte_count++;
+}
+
 static void __set_spte(u64 *sptep, u64 spte)
 {
union split_spte *ssptep, sspte;
@@ -318,6 +341,7 @@ static void __update_clear_spte_fast(u64 *sptep, u64 spte)
smp_wmb();
 
ssptep-spte_high = sspte.spte_high;
+   count_spte_clear(sptep, spte);
 }
 
 static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
@@ -330,9 +354,40 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
/* xchg acts as a barrier before the setting of the high bits */
orig.spte_low = xchg(ssptep-spte_low, sspte.spte_low);
orig.spte_high = ssptep-spte_high = sspte.spte_high;
+   count_spte_clear(sptep, spte);
 
return orig.spte;
 }
+
+/*
+ * The idea of using this lightweight way to get the spte on an x86_32 guest
+ * is from gup_get_pte(arch/x86/mm/gup.c).
+ * The difference is that we cannot catch the spte tlb flush if we leave
+ * guest mode, so we emulate it by increasing clear_spte_count when the spte
+ * is cleared.
+ */
+static u64 __get_spte_lockless(u64 *sptep)
+{
+   struct kvm_mmu_page *sp =  page_header(__pa(sptep));
+   union split_spte spte, *orig = (union split_spte *)sptep;
+   int count;
+
+retry:
+   count = sp-clear_spte_count;
+   smp_rmb();
+
+   spte.spte_low = orig-spte_low;
+   smp_rmb();
+
+   spte.spte_high = orig-spte_high;
+   smp_rmb();
+
+   if (unlikely(spte.spte_low != orig-spte_low ||
+ count != sp-clear_spte_count))
+   goto retry;
+
+   return spte.spte;
+}
 #endif
 
 static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
@@ -438,6 +493,28 @@ static void mmu_spte_clear_no_track(u64 *sptep)
__update_clear_spte_fast(sptep, 0ull);
 }
 
+static u64 mmu_spte_get_lockless(u64 *sptep)
+{
+   return __get_spte_lockless(sptep);
+}
+
+static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
+{
+   rcu_read_lock();
+   atomic_inc(vcpu-kvm-arch.reader_counter);
+
+   /* Increase the counter before walking shadow page table */
+   smp_mb__after_atomic_inc();
+}
+
+static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
+{
+   /* Decrease the counter after walking shadow page table finished */
+   smp_mb__before_atomic_dec();
+   atomic_dec(vcpu-kvm-arch.reader_counter);
+   rcu_read_unlock();
+}
+
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
  struct kmem_cache *base_cache, int min)
 {

[PATCH v2 20/22] KVM: MMU: reorganize struct kvm_shadow_walk_iterator

2011-06-22 Thread Xiao Guangrong
Reorganize it for good using the cache

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index dad7ad9..1319050 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -172,8 +172,8 @@ struct pte_list_desc {
 struct kvm_shadow_walk_iterator {
u64 addr;
hpa_t shadow_addr;
-   int level;
u64 *sptep;
+   int level;
unsigned index;
 };
 
-- 
1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 21/22] KVM: MMU: mmio page fault support

2011-06-22 Thread Xiao Guangrong
The idea is from Avi:

| We could cache the result of a miss in an spte by using a reserved bit, and
| checking the page fault error code (or seeing if we get an ept violation or
| ept misconfiguration), so if we get repeated mmio on a page, we don't need to
| search the slot list/tree.
| (https://lkml.org/lkml/2011/2/22/221)

When the page fault is caused by mmio, we cache the info in the shadow page
table and also set the reserved bits in the shadow page table, so if the mmio
is hit again, we can quickly identify it and emulate it directly.

Searching for an mmio gfn in the memslots is heavy since we need to walk all
memslots; this feature reduces that cost and also avoids walking the guest
page table for the soft mmu.
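
A rough sketch of the fast path this enables, assuming the mmio-spte helpers
added below (the function name here is hypothetical):

static int fast_mmio_check_sketch(struct kvm_vcpu *vcpu, u64 addr)
{
	u64 spte = walk_shadow_page_get_mmio_spte(vcpu, addr);

	if (!is_mmio_spte(spte))
		return 0;	/* not a cached mmio miss, take the slow path */

	/* decode the cached gfn/access and emulate without touching memslots */
	vcpu_cache_mmio_info(vcpu, addr, get_mmio_spte_gfn(spte),
			     get_mmio_spte_access(spte));
	return 1;
}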

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |  156 ++--
 arch/x86/kvm/mmu.h |1 +
 arch/x86/kvm/paging_tmpl.h |   18 --
 arch/x86/kvm/vmx.c |   10 +++-
 4 files changed, 173 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1319050..e69a47a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -197,6 +197,41 @@ static u64 __read_mostly shadow_x_mask;/* mutual 
exclusive with nx_mask */
 static u64 __read_mostly shadow_user_mask;
 static u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
+static u64 __read_mostly shadow_mmio_mask = (0xffull << 49 | 1ULL);
+
+static void mmu_spte_set(u64 *sptep, u64 spte);
+
+static void mark_mmio_spte(u64 *sptep, u64 gfn, unsigned access)
+{
+   access &= ACC_WRITE_MASK | ACC_USER_MASK;
+
+   mmu_spte_set(sptep, shadow_mmio_mask | access | gfn << PAGE_SHIFT);
+}
+
+static bool is_mmio_spte(u64 spte)
+{
+   return (spte & shadow_mmio_mask) == shadow_mmio_mask;
+}
+
+static gfn_t get_mmio_spte_gfn(u64 spte)
+{
+   return (spte & ~shadow_mmio_mask) >> PAGE_SHIFT;
+}
+
+static unsigned get_mmio_spte_access(u64 spte)
+{
+   return (spte & ~shadow_mmio_mask) & ~PAGE_MASK;
+}
+
+static bool set_mmio_spte(u64 *sptep, gfn_t gfn, pfn_t pfn, unsigned access)
+{
+   if (unlikely(is_mmio_pfn(pfn))) {
+   mark_mmio_spte(sptep, gfn, access);
+   return true;
+   }
+
+   return false;
+}
 
 static inline u64 rsvd_bits(int s, int e)
 {
@@ -226,7 +261,7 @@ static int is_nx(struct kvm_vcpu *vcpu)
 
 static int is_shadow_present_pte(u64 pte)
 {
-   return pte != 0ull;
+   return pte != 0ull && !is_mmio_spte(pte);
 }
 
 static int is_large_pte(u64 pte)
@@ -2123,6 +2158,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
u64 spte, entry = *sptep;
int ret = 0;
 
+   if (set_mmio_spte(sptep, gfn, pfn, pte_access))
+   return 0;
+
/*
 * We don't set the accessed bit, since we sometimes want to see
 * whether the guest actually used the pte (in order to detect
@@ -2258,6 +2296,9 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
kvm_mmu_flush_tlb(vcpu);
}
 
+   if (unlikely(is_mmio_spte(*sptep)  emulate))
+   *emulate = 1;
+
pgprintk(%s: setting spte %llx\n, __func__, *sptep);
pgprintk(instantiating %s PTE (%s) at %llx (%llx) addr %p\n,
 is_large_pte(*sptep)? 2MB : 4kB,
@@ -2484,7 +2525,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu 
*vcpu,
 
 static bool mmu_invalid_pfn(pfn_t pfn)
 {
-   return unlikely(is_invalid_pfn(pfn) || is_mmio_pfn(pfn));
+   return unlikely(is_invalid_pfn(pfn));
 }
 
 static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
@@ -2498,11 +2539,8 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, 
gva_t gva, gfn_t gfn,
goto exit;
}
 
-   if (unlikely(is_mmio_pfn(pfn))) {
+   if (unlikely(is_mmio_pfn(pfn)))
vcpu_cache_mmio_info(vcpu, gva, gfn, access);
-   *ret_val = 1;
-   goto exit;
-   }
 
ret = false;
 exit:
@@ -2816,6 +2854,86 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu 
*vcpu, gva_t vaddr,
return vcpu-arch.nested_mmu.translate_gpa(vcpu, vaddr, access);
 }
 
+static bool quickly_check_mmio_pf(struct kvm_vcpu *vcpu, u64 addr, bool direct)
+{
+   if (direct)
+   return vcpu_match_mmio_gpa(vcpu, addr);
+
+   return vcpu_match_mmio_gva(vcpu, addr);
+}
+
+static u64 walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr)
+{
+   struct kvm_shadow_walk_iterator iterator;
+   u64 spte, ret = 0ull;
+
+   walk_shadow_page_lockless_begin(vcpu);
+   for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
+   if (iterator.level == 1) {
+   ret = spte;
+   break;
+   }
+
+   if (!is_shadow_present_pte(spte))
+   break;
+   }
+   walk_shadow_page_lockless_end(vcpu);
+
+   return ret;
+}

[PATCH v2 22/22] KVM: MMU: trace mmio page fault

2011-06-22 Thread Xiao Guangrong
Add tracepoints to trace mmio page fault

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c  |5 
 arch/x86/kvm/mmutrace.h |   48 +++
 arch/x86/kvm/trace.h|   23 ++
 arch/x86/kvm/x86.c  |5 +++-
 4 files changed, 80 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e69a47a..5a4e36c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -205,6 +205,7 @@ static void mark_mmio_spte(u64 *sptep, u64 gfn, unsigned 
access)
 {
access = ACC_WRITE_MASK | ACC_USER_MASK;
 
+   trace_mark_mmio_spte(sptep, gfn, access);
mmu_spte_set(sptep, shadow_mmio_mask | access | gfn  PAGE_SHIFT);
 }
 
@@ -1913,6 +1914,8 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
kvm_mmu_isolate_pages(invalid_list);
sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
list_del_init(invalid_list);
+
+   trace_kvm_mmu_delay_free_pages(sp);
call_rcu(sp-rcu, free_pages_rcu);
return;
}
@@ -2902,6 +2905,8 @@ int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, 
u64 addr, bool direct)
 
if (direct)
addr = 0;
+
+   trace_handle_mmio_page_fault(addr, gfn, access);
vcpu_cache_mmio_info(vcpu, addr, gfn, access);
return 1;
}
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index b60b4fd..eed67f3 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -196,6 +196,54 @@ DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_prepare_zap_page,
TP_ARGS(sp)
 );
 
+DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_delay_free_pages,
+   TP_PROTO(struct kvm_mmu_page *sp),
+
+   TP_ARGS(sp)
+);
+
+TRACE_EVENT(
+   mark_mmio_spte,
+   TP_PROTO(u64 *sptep, gfn_t gfn, unsigned access),
+   TP_ARGS(sptep, gfn, access),
+
+   TP_STRUCT__entry(
+   __field(void *, sptep)
+   __field(gfn_t, gfn)
+   __field(unsigned, access)
+   ),
+
+   TP_fast_assign(
+   __entry-sptep = sptep;
+   __entry-gfn = gfn;
+   __entry-access = access;
+   ),
+
+   TP_printk(sptep:%p gfn %llx access %x, __entry-sptep, __entry-gfn,
+ __entry-access)
+);
+
+TRACE_EVENT(
+   handle_mmio_page_fault,
+   TP_PROTO(u64 addr, gfn_t gfn, unsigned access),
+   TP_ARGS(addr, gfn, access),
+
+   TP_STRUCT__entry(
+   __field(u64, addr)
+   __field(gfn_t, gfn)
+   __field(unsigned, access)
+   ),
+
+   TP_fast_assign(
+   __entry-addr = addr;
+   __entry-gfn = gfn;
+   __entry-access = access;
+   ),
+
+   TP_printk(addr:%llx gfn %llx access %x, __entry-addr, __entry-gfn,
+ __entry-access)
+);
+
 TRACE_EVENT(
kvm_mmu_audit,
TP_PROTO(struct kvm_vcpu *vcpu, int audit_point),
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 624f8cb..3ff898c 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -698,6 +698,29 @@ TRACE_EVENT(kvm_emulate_insn,
 #define trace_kvm_emulate_insn_start(vcpu) trace_kvm_emulate_insn(vcpu, 0)
 #define trace_kvm_emulate_insn_failed(vcpu) trace_kvm_emulate_insn(vcpu, 1)
 
+TRACE_EVENT(
+   vcpu_match_mmio,
+   TP_PROTO(gva_t gva, gpa_t gpa, bool write, bool gpa_match),
+   TP_ARGS(gva, gpa, write, gpa_match),
+
+   TP_STRUCT__entry(
+   __field(gva_t, gva)
+   __field(gpa_t, gpa)
+   __field(bool, write)
+   __field(bool, gpa_match)
+   ),
+
+   TP_fast_assign(
+   __entry-gva = gva;
+   __entry-gpa = gpa;
+   __entry-write = write;
+   __entry-gpa_match = gpa_match
+   ),
+
+   TP_printk(gva %#lx gpa %#llx %s %s, __entry-gva, __entry-gpa,
+ __entry-write ? Write : Read,
+ __entry-gpa_match ? GPA : GVA)
+);
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 78d5519..24a0f2c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3955,6 +3955,7 @@ static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, 
unsigned long gva,
  vcpu-arch.access)) {
*gpa = vcpu-arch.mmio_gfn  PAGE_SHIFT |
(gva  (PAGE_SIZE - 1));
+   trace_vcpu_match_mmio(gva, *gpa, write, false);
return 1;
}
 
@@ -3970,8 +3971,10 @@ static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, 
unsigned long gva,
if ((*gpa  PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
return 1;
 
-   if (vcpu_match_mmio_gpa(vcpu, *gpa))
+   if (vcpu_match_mmio_gpa(vcpu, *gpa)) {
+   

Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Andrea Arcangeli
On Tue, Jun 21, 2011 at 09:32:39PM +0800, Nai Xia wrote:
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index d48ec60..b407a69 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -4674,6 +4674,7 @@ static int __init vmx_init(void)
   kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
   VMX_EPT_EXECUTABLE_MASK);
   kvm_enable_tdp();
 + kvm_dirty_update = 0;
   } else
   kvm_disable_tdp();
  

Why not return !shadow_dirty_mask instead of adding a new var?

  struct mmu_notifier_ops {
 + int (*dirty_update)(struct mmu_notifier *mn,
 +  struct mm_struct *mm);
 +

Needs some docu.

I think dirty_update isn't a self-explanatory name; I think
has_test_and_clear_dirty would be better.

If we don't flush the smp tlb don't we risk that we'll insert pages in
the unstable tree that are volatile just because the dirty bit didn't
get set again on the spte?

The first patch, I guess, is a sign of hugetlbfs going a little over
the edge in trying to mix with the core VM... Passing that parameter
need_pte_unmap all over the place is not so nice; maybe it'd be possible
to fix this within hugetlbfs by using a different method to walk the hugetlb
vmas. I'd prefer that if possible.

Thanks,
Andrea
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Izik Eidus



If we don't flush the smp tlb don't we risk that we'll insert pages in
the unstable tree that are volatile just because the dirty bit didn't
get set again on the spte?


Yes, this is the trade-off we take; the unstable tree will be flushed
anyway -

so this is nothing that won't be recovered very soon after it happens...

and most likely the tlb will be flushed before ksm gets there anyway
(especially for a heavily modified page, which we don't want in the unstable
tree)

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Rik van Riel

On 06/22/2011 07:19 AM, Izik Eidus wrote:


So what we say here is: it is better to have little junk in the unstable
tree that get flushed eventualy anyway, instead of make the guest
slower
this race is something that does not reflect accurate of ksm anyway due
to the full memcmp that we will eventualy perform...


With 2MB pages, I am not convinced they will get flushed eventually,
because there is a good chance at least one of the 4kB pages inside
a 2MB page is in active use at all times.

I worry that the proposed changes may end up effectively preventing
KSM from scanning inside 2MB pages, when even one 4kB page inside
is in active use.  This could mean increased swapping on systems
that run low on memory, which can be a much larger performance penalty
than ksmd CPU use.

We need to scan inside 2MB pages when memory runs low, regardless
of the accessed or dirty bits.

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Chris Wright
* Izik Eidus (izik.ei...@ravellosystems.com) wrote:
 On 6/22/2011 3:21 AM, Chris Wright wrote:
 * Nai Xia (nai@gmail.com) wrote:
 +   if (!shadow_dirty_mask) {
 +   WARN(1, KVM: do NOT try to test dirty bit in EPT\n);
 +   goto out;
 +   }
 This should never fire with the dirty_update() notifier test, right?
 And that means that this whole optimization is for the shadow mmu case,
 arguably the legacy case.
 
 Hi Chris,
 AMD npt does track the dirty bit in the nested page tables,
 so the shadow_dirty_mask should not be 0 in that case...

Yeah, momentary lapse... ;)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/3] KVM: MMU: Clean up the error handling of walk_addr_generic()

2011-06-22 Thread Marcelo Tosatti
On Mon, Jun 20, 2011 at 11:29:47PM +0900, Takuya Yoshikawa wrote:
 From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
 
 Avoid two step jump to the error handling part.  This eliminates the use
 of the variables present and rsvd_fault.
 
 We also use the const type qualifier to show that write/user/fetch_fault
 do not change in the function.
 
 Both of these were suggested by Ingo Molnar.
 
 Cc: Ingo Molnar mi...@elte.hu
 Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
 ---
  v2-v3: only changelog update
 
  arch/x86/kvm/paging_tmpl.h |   64 +++
  1 files changed, 28 insertions(+), 36 deletions(-)
 
 diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
 index 1caeb4d..137aa45 100644
 --- a/arch/x86/kvm/paging_tmpl.h
 +++ b/arch/x86/kvm/paging_tmpl.h
 @@ -125,18 +125,17 @@ static int FNAME(walk_addr_generic)(struct guest_walker 
 *walker,
   gfn_t table_gfn;
   unsigned index, pt_access, uninitialized_var(pte_access);
   gpa_t pte_gpa;
 - bool eperm, present, rsvd_fault;
 - int offset, write_fault, user_fault, fetch_fault;
 -
 - write_fault = access  PFERR_WRITE_MASK;
 - user_fault = access  PFERR_USER_MASK;
 - fetch_fault = access  PFERR_FETCH_MASK;
 + bool eperm;
 + int offset;
 + const int write_fault = access  PFERR_WRITE_MASK;
 + const int user_fault  = access  PFERR_USER_MASK;
 + const int fetch_fault = access  PFERR_FETCH_MASK;
 + u16 errcode = 0;
  
   trace_kvm_mmu_pagetable_walk(addr, write_fault, user_fault,
fetch_fault);
  walk:
 - present = true;
 - eperm = rsvd_fault = false;
 + eperm = false;
   walker-level = mmu-root_level;
   pte   = mmu-get_cr3(vcpu);
  
 @@ -145,7 +144,7 @@ walk:
   pte = kvm_pdptr_read_mmu(vcpu, mmu, (addr  30)  3);
   trace_kvm_mmu_paging_element(pte, walker-level);
   if (!is_present_gpte(pte)) {
 - present = false;
 + errcode |= PFERR_PRESENT_MASK;
   goto error;
   }
   --walker-level;
 @@ -171,34 +170,34 @@ walk:
   real_gfn = mmu-translate_gpa(vcpu, gfn_to_gpa(table_gfn),
 PFERR_USER_MASK|PFERR_WRITE_MASK);
   if (unlikely(real_gfn == UNMAPPED_GVA)) {
 - present = false;
 - break;
 + errcode |= PFERR_PRESENT_MASK;
 + goto error;
   }
   real_gfn = gpa_to_gfn(real_gfn);
  
   host_addr = gfn_to_hva(vcpu-kvm, real_gfn);
   if (unlikely(kvm_is_error_hva(host_addr))) {
 - present = false;
 - break;
 + errcode |= PFERR_PRESENT_MASK;
 + goto error;
   }
  
   ptep_user = (pt_element_t __user *)((void *)host_addr + offset);
   if (unlikely(__copy_from_user(pte, ptep_user, sizeof(pte {
 - present = false;
 - break;
 + errcode |= PFERR_PRESENT_MASK;
 + goto error;
   }
  
   trace_kvm_mmu_paging_element(pte, walker-level);
  
   if (unlikely(!is_present_gpte(pte))) {
 - present = false;
 - break;
 + errcode |= PFERR_PRESENT_MASK;
 + goto error;
   }

Assignment of PFERR_PRESENT_MASK is inverted.
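
The inversion matters because the x86 page-fault error code sets the P bit
only when the walk hit a *present* entry (a permission or reserved-bit
violation), not when the entry is missing.  A minimal sketch of the intended
logic, reusing the names from the patch above (illustration only, not the
actual follow-up fix):

	if (unlikely(!is_present_gpte(pte)))
		goto error;			/* not present: P bit stays clear */

	if (unlikely(eperm)) {
		errcode |= PFERR_PRESENT_MASK;	/* present entry, access denied */
		goto error;
	}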



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Andrea Arcangeli
On Wed, Jun 22, 2011 at 11:39:40AM -0400, Rik van Riel wrote:
 On 06/22/2011 07:19 AM, Izik Eidus wrote:
 
  So what we say here is: it is better to have little junk in the unstable
  tree that get flushed eventualy anyway, instead of make the guest
  slower
  this race is something that does not reflect accurate of ksm anyway due
  to the full memcmp that we will eventualy perform...
 
 With 2MB pages, I am not convinced they will get flushed eventually,
 because there is a good chance at least one of the 4kB pages inside
 a 2MB page is in active use at all times.
 
 I worry that the proposed changes may end up effectively preventing
 KSM from scanning inside 2MB pages, when even one 4kB page inside
 is in active use.  This could mean increased swapping on systems
 that run low on memory, which can be a much larger performance penalty
 than ksmd CPU use.
 
 We need to scan inside 2MB pages when memory runs low, regardless
 of the accessed or dirty bits.

I guess we could fallback to the cksum when a THP is encountered
(repeating the test_and_clear_dirty also wouldn't give the expected
result if it's repeated on the same hugepmd for the next 4k virtual
address candidate for unstable tree insertion, so it'd need special
handling during the virtual walk anyway).
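
Roughly, the walk would have to remember which hugepmd it already cleared and
reuse that answer, something like (purely hypothetical sketch; the helper and
the state fields are made up for illustration):

	if (pmd_trans_huge(*pmd)) {
		if (state->last_pmd != pmd) {
			/* first 4k candidate under this hugepmd: clear once */
			state->last_pmd  = pmd;
			state->was_dirty = huge_pmd_test_and_clear_dirty(vma, haddr, pmd);
		}
		dirty = state->was_dirty;	/* reused for the next 511 candidates */
	}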

So it's getting a little hairy, skip on THP, skip on EPT, then I
wonder what is the common case that would be left using it...

Or we could evaluate with statistic how many less pages are inserted
into the unstable tree using the 2m dirty bit but clearly it'd be less
reliable, the algorithm really is meant to track the volatility of
what is later merged, not of a bigger chunk with unrelated data in it.

On a side note, khugepaged should also be changed to preserve the
dirty bit if at least one dirty bit of the ptes is dirty (currently
the hugepmd is always created dirty, it can never happen for an
hugepmd to be clean today so it wasn't preserved in khugepaged so far).
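
A rough sketch of what that could look like in the collapse path (hypothetical,
for illustration only, not the actual khugepaged change):

	/* while copying the subpages, remember whether any source pte was dirty */
	bool any_dirty = false;

	for (i = 0; i < HPAGE_PMD_NR; i++, _pte++) {
		if (pte_dirty(*_pte))
			any_dirty = true;
		/* ... existing subpage copy ... */
	}

	_pmd = mk_pmd(new_page, vma->vm_page_prot);
	if (any_dirty)
		_pmd = pmd_mkdirty(_pmd);	/* preserve the dirty bit only if needed */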


Re: [PATCH v3 1/3] KVM: MMU: Clean up the error handling of walk_addr_generic()

2011-06-22 Thread Avi Kivity

On 06/22/2011 07:46 PM, Marcelo Tosatti wrote:

if (unlikely(!is_present_gpte(pte))) {
  - present = false;
  - break;
  + errcode |= PFERR_PRESENT_MASK;
  + goto error;
}

Assignment of PFERR_PRESENT_MASK is inverted.



Note: kvm-unit-tests.git/x86/access.flat would have caught this.


--
error compiling committee.c: too many arguments to function



Re: [PATCH v2 01/22] KVM: MMU: fix walking shadow page table

2011-06-22 Thread Marcelo Tosatti
On Wed, Jun 22, 2011 at 10:28:04PM +0800, Xiao Guangrong wrote:
 Properly check the last mapping, and do not walk to the next level if last 
 spte
 is met
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
 ---
  arch/x86/kvm/mmu.c |9 +
  1 files changed, 5 insertions(+), 4 deletions(-)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 9c629b5..f474e93 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -1517,10 +1517,6 @@ static bool shadow_walk_okay(struct 
 kvm_shadow_walk_iterator *iterator)
   if (iterator->level < PT_PAGE_TABLE_LEVEL)
   return false;
  
  - if (iterator->level == PT_PAGE_TABLE_LEVEL)

Change to = PT_PAGE_TABLE_LEVEL, checks should be in shadow_walk_okay.

  - if (is_large_pte(*iterator->sptep))
 - return false;
 -
   iterator->index = SHADOW_PT_INDEX(iterator->addr, iterator->level);
   iterator->sptep = ((u64 *)__va(iterator->shadow_addr)) + iterator->index;
   return true;


Re: [Qemu-devel] [PATCH 03/12] Switch build system to accompanied kernel headers

2011-06-22 Thread Stefan Weil

On 08.06.2011 16:10, Jan Kiszka wrote:

This helps reduce our build-time checks for feature support in the
available Linux kernel headers. And it helps users that do not have
sufficiently recent headers installed on their build machine.

Consequently, the patch removes the build-time checks for kvm and vhost
in configure, the --kerneldir switch, and KVM_CFLAGS. Kernel headers are
supposed to be provided by QEMU only.

s390 needs some extra love as it carries redefinitions from kernel
headers.

CC: Alexander Graf ag...@suse.de
Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
Makefile.target | 4 +-
configure | 151 ++
target-s390x/cpu.h | 10 ---
target-s390x/op_helper.c | 1 +
4 files changed, 21 insertions(+), 145 deletions(-)

diff --git a/Makefile.target b/Makefile.target
index 5c22df8..be9c0e8 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -14,7 +14,7 @@ endif

TARGET_PATH=$(SRC_PATH)/target-$(TARGET_BASE_ARCH)
$(call set-vpath, $(SRC_PATH):$(TARGET_PATH):$(SRC_PATH)/hw)
-QEMU_CFLAGS+= -I.. -I$(TARGET_PATH) -DNEED_CPU_H
+QEMU_CFLAGS+= -I.. -I../linux-headers -I$(TARGET_PATH) -DNEED_CPU_H

include $(SRC_PATH)/Makefile.objs

@@ -37,8 +37,6 @@ ifndef CONFIG_HAIKU
LIBS+=-lm
endif

-kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o: 
QEMU_CFLAGS+=$(KVM_CFLAGS)

-
config-target.h: config-target.h-timestamp
config-target.h-timestamp: config-target.mak

diff --git a/configure b/configure
index d38b952..0e1dc46 100755
--- a/configure
+++ b/configure
@@ -113,8 +113,7 @@ curl=
curses=
docs=
fdt=
-kvm=
-kvm_para=
+kvm=yes
nptl=


Are you planning to add kvm support for all platforms which don't 
support it today?

If not, kvm=yes should be restricted to platforms with kvm support.

Otherwise, QEMU builds will fail very early:

 ERROR: Host kernel lacks signalfd() support,
 but KVM depends on it when the IO thread is disabled.

Of course, users of those non-kvm platforms can set --disable-kvm,
but I don't think that is the correct solution.

Even with kvm disabled, builds still fail for non-kvm systems:

 In file included from /qemu/hw/kvmclock.c:21:
 /qemu/linux-headers/linux/kvm_para.h:26:26: warning: asm/kvm_para.h: 
No such file or directory


Cheers,

Stefan



Re: [PATCH 03/12] Switch build system to accompanied kernel headers

2011-06-22 Thread Jan Kiszka
On 2011-06-22 22:51, Stefan Weil wrote:
 Am 08.06.2011 16:10, schrieb Jan Kiszka:
 This helps reducing our build-time checks for feature support in the
 available Linux kernel headers. And it helps users that do not have
 sufficiently recent headers installed on their build machine.

 Consequently, the patch removes and build-time checks for kvm and vhost
 in configure, the --kerneldir switch, and KVM_CFLAGS. Kernel headers are
 supposed to be provided by QEMU only.

 s390 needs some extra love as it carries redefinitions from kernel
 headers.

 CC: Alexander Graf ag...@suse.de
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
 Makefile.target | 4 +-
 configure | 151 ++
 target-s390x/cpu.h | 10 ---
 target-s390x/op_helper.c | 1 +
 4 files changed, 21 insertions(+), 145 deletions(-)

 diff --git a/Makefile.target b/Makefile.target
 index 5c22df8..be9c0e8 100644
 --- a/Makefile.target
 +++ b/Makefile.target
 @@ -14,7 +14,7 @@ endif

 TARGET_PATH=$(SRC_PATH)/target-$(TARGET_BASE_ARCH)
 $(call set-vpath, $(SRC_PATH):$(TARGET_PATH):$(SRC_PATH)/hw)
 -QEMU_CFLAGS+= -I.. -I$(TARGET_PATH) -DNEED_CPU_H
 +QEMU_CFLAGS+= -I.. -I../linux-headers -I$(TARGET_PATH) -DNEED_CPU_H

 include $(SRC_PATH)/Makefile.objs

 @@ -37,8 +37,6 @@ ifndef CONFIG_HAIKU
 LIBS+=-lm
 endif

 -kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o:
 QEMU_CFLAGS+=$(KVM_CFLAGS)
 -
 config-target.h: config-target.h-timestamp
 config-target.h-timestamp: config-target.mak

 diff --git a/configure b/configure
 index d38b952..0e1dc46 100755
 --- a/configure
 +++ b/configure
 @@ -113,8 +113,7 @@ curl=
 curses=
 docs=
 fdt=
 -kvm=
 -kvm_para=
 +kvm=yes
 nptl=
 
 Are you planning to add kvm support for all platforms which don't
 support it today?

That would mean replacing all their kernels with Linux. Will take a bit
longer.

I simply overshot with my cleanups:

diff --git a/configure b/configure
index 3286e33..e6847c4 100755
--- a/configure
+++ b/configure
@@ -113,7 +113,7 @@ curl=
 curses=
 docs=
 fdt=
-kvm=yes
+kvm=
 nptl=
 sdl=
 vnc=yes
@@ -129,7 +129,7 @@ xen=
 xen_ctrl_version=
 linux_aio=
 attr=
-vhost_net=yes
+vhost_net=
 xfs=

 gprof=no
@@ -457,6 +457,8 @@ Haiku)
   linux=yes
   linux_user=yes
   usb=linux
+  kvm=yes
+  vhost_net=yes
   if [ "$cpu" = "i386" -o "$cpu" = "x86_64" ] ; then
     audio_possible_drivers="$audio_possible_drivers fmod"
   fi

Will post this as a patch tomorrow.

 If not, kvm=yes should be restricted to platforms with kvm support.
 
 Otherwise, QEMU builds will fail very early:
 
  ERROR: Host kernel lacks signalfd() support,
  but KVM depends on it when the IO thread is disabled.
 
 Of course, users of those non-kvm platforms can set --disable-kvm,
 but I don't think that is the correct solution.
 
 Even with kvm disabled, builds still fail for non-kvm systems:
 
  In file included from /qemu/hw/kvmclock.c:21:
  /qemu/linux-headers/linux/kvm_para.h:26:26: warning: asm/kvm_para.h: No
 such file or directory

That indicates symlink emulation under Windows does not support
directories. Can you confirm this (check what
builddir/linux-headers/asm became)? Then we would have to link all
files in the arch header dir individually.

Jan






[PATCH] KVM test: get_started.py: Remove confusing question

2011-06-22 Thread Lucas Meneghel Rodrigues
After receiving some feedback on get_started.py, I resolved
to remove a question regarding NFS shares. Since the script
prints the directories clearly, anyone who wants to set up NFS
mounts or symlinks can do so on their own.

Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
---
 client/tests/kvm/get_started.py |7 ---
 1 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/client/tests/kvm/get_started.py b/client/tests/kvm/get_started.py
index c986f5e..5f76817 100755
--- a/client/tests/kvm/get_started.py
+++ b/client/tests/kvm/get_started.py
@@ -80,13 +80,6 @@ if __name__ == "__main__":
         else:
             logging.debug("Dir %s exists, not creating" %
                           sub_dir_path)
-    answer = utils.ask("Do you want to setup NFS mounts for some of those "
-                       "dirs?")
-    if answer == 'y':
-        logging.info("Exiting the script so you can setup the NFS mounts. "
-                     "When you are done, re-run this script.")
-        sys.exit(0)
-
     logging.info("")
     logging.info("2 - Creating config files from samples (copy the default "
                  "config samples to actual config files)")
-- 
1.7.5.4



Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support

2011-06-22 Thread Marcelo Tosatti

Xiao,

On Wed, Jun 22, 2011 at 10:36:16PM +0800, Xiao Guangrong wrote:
 The idea is from Avi:
 
 | We could cache the result of a miss in an spte by using a reserved bit, and
 | checking the page fault error code (or seeing if we get an ept violation or
 | ept misconfiguration), so if we get repeated mmio on a page, we don't need 
 to
 | search the slot list/tree.
 | (https://lkml.org/lkml/2011/2/22/221)
 
 When the page fault is caused by mmio, we cache the info in the shadow page
 table, and also set the reserved bits in the shadow page table, so if the mmio
 is caused again, we can quickly identify it and emulate it directly
 
 Searching an mmio gfn in the memslots is heavy since we need to walk all
 memslots; this cost can be reduced by this feature, which also avoids walking
 the guest page table for soft mmu.
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
 ---
  arch/x86/kvm/mmu.c |  156 
 ++--
  arch/x86/kvm/mmu.h |1 +
  arch/x86/kvm/paging_tmpl.h |   18 --
  arch/x86/kvm/vmx.c |   10 +++-
  4 files changed, 173 insertions(+), 12 deletions(-)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 1319050..e69a47a 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -197,6 +197,41 @@ static u64 __read_mostly shadow_x_mask;  /* mutual 
 exclusive with nx_mask */
  static u64 __read_mostly shadow_user_mask;
  static u64 __read_mostly shadow_accessed_mask;
  static u64 __read_mostly shadow_dirty_mask;
 +static u64 __read_mostly shadow_mmio_mask = (0xffull << 49 | 1ULL);
 +
 +static void mmu_spte_set(u64 *sptep, u64 spte);
 +
 +static void mark_mmio_spte(u64 *sptep, u64 gfn, unsigned access)
 +{
 +	access &= ACC_WRITE_MASK | ACC_USER_MASK;
 +
 +	mmu_spte_set(sptep, shadow_mmio_mask | access | gfn << PAGE_SHIFT);
 +}
 +
 +static bool is_mmio_spte(u64 spte)
 +{
 +	return (spte & shadow_mmio_mask) == shadow_mmio_mask;
 +}
 +
 +static gfn_t get_mmio_spte_gfn(u64 spte)
 +{
 +	return (spte & ~shadow_mmio_mask) >> PAGE_SHIFT;
 +}
 +
 +static unsigned get_mmio_spte_access(u64 spte)
 +{
 +	return (spte & ~shadow_mmio_mask) & ~PAGE_MASK;
 +}
 +
 +static bool set_mmio_spte(u64 *sptep, gfn_t gfn, pfn_t pfn, unsigned access)
 +{
 + if (unlikely(is_mmio_pfn(pfn))) {
 + mark_mmio_spte(sptep, gfn, access);
 + return true;
 + }
 +
 + return false;
 +}
  
  static inline u64 rsvd_bits(int s, int e)
  {
 @@ -226,7 +261,7 @@ static int is_nx(struct kvm_vcpu *vcpu)
  
  static int is_shadow_present_pte(u64 pte)
  {
  -	return pte != 0ull;
  +	return pte != 0ull && !is_mmio_spte(pte);
  }
  
  static int is_large_pte(u64 pte)
 @@ -2123,6 +2158,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
   u64 spte, entry = *sptep;
   int ret = 0;
  
 + if (set_mmio_spte(sptep, gfn, pfn, pte_access))
 + return 0;
 +


Should zap all mmio sptes when creating new memory slots (current qemu
never exhibits that pattern, but it's not an unsupported scenario).
Easier to zap all mmu in that case.

For shadow you need to remove mmio sptes on invlpg and all mmio sptes
under CR3 (mmu_sync_roots), other than clearing the gva/gpa variables.

Otherwise you can move the mmio info from an mmio spte back to
mmio_gva/mmio_gfn after a TLB flush, without rereading the guest
pagetable.
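
As background for the mark_mmio_spte()/get_mmio_spte_*() encoding above, here
is a small standalone illustration (plain userspace C, not KVM code; the mask
and ACC_* values just mirror the patch) of how the gfn and the access bits
round-trip under the reserved-bit mark:

	#include <stdint.h>
	#include <stdio.h>

	#define PAGE_SHIFT	12
	#define PAGE_MASK	(~((1ULL << PAGE_SHIFT) - 1))
	#define ACC_WRITE_MASK	2ULL
	#define ACC_USER_MASK	4ULL

	/* same shape as the patch: bits 49-56 plus bit 0 mark an mmio spte */
	static const uint64_t mmio_mask = (0xffULL << 49) | 1ULL;

	static uint64_t make_mmio_spte(uint64_t gfn, uint64_t access)
	{
		access &= ACC_WRITE_MASK | ACC_USER_MASK;
		return mmio_mask | access | (gfn << PAGE_SHIFT);
	}

	int main(void)
	{
		uint64_t spte = make_mmio_spte(0xfee00, ACC_WRITE_MASK | ACC_USER_MASK);

		printf("is_mmio: %d\n", (spte & mmio_mask) == mmio_mask);
		printf("gfn:     0x%llx\n",
		       (unsigned long long)((spte & ~mmio_mask) >> PAGE_SHIFT));
		printf("access:  0x%llx\n",
		       (unsigned long long)((spte & ~mmio_mask) & ~PAGE_MASK));
		return 0;
	}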

 +{
 + struct kvm_shadow_walk_iterator iterator;
 + u64 spte, ret = 0ull;
 +
 + walk_shadow_page_lockless_begin(vcpu);
 + for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
 + if (iterator.level == 1) {
 + ret = spte;
 + break;
 + }

Have you verified it is safe to walk sptes with parallel modifications
to them? Other than shadow page deletion which is in theory covered by
RCU. It would be good to have all cases written down.

 +
 + if (!is_shadow_present_pte(spte))
 + break;
 + }
 + walk_shadow_page_lockless_end(vcpu);
 +
 + return ret;
 +}
 +
 +/*
  + * If it is a real mmio page fault, return 1 and emulate the instruction
 + * directly, return 0 if it needs page fault path to fix it, -1 is
 + * returned if bug is detected.
 + */
 +int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool 
 direct)
 +{
 + u64 spte;
 +
 + if (quickly_check_mmio_pf(vcpu, addr, direct))
 + return 1;
 +
 + spte = walk_shadow_page_get_mmio_spte(vcpu, addr);
 +
 + if (is_mmio_spte(spte)) {
 + gfn_t gfn = get_mmio_spte_gfn(spte);
 + unsigned access = get_mmio_spte_access(spte);
 +
 + if (direct)
 + addr = 0;
 + vcpu_cache_mmio_info(vcpu, addr, gfn, access);
 + return 1;
 + }
 +
 + /*
 +  * It's ok if the gva is remapped by other cpus on shadow guest,
 +  * it's a BUG if the gfn is not a mmio page.
 +  */

If the gva is remapped there should be no mmio page fault in the first place.

Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Nai Xia
On Wed, Jun 22, 2011 at 11:39 PM, Rik van Riel r...@redhat.com wrote:
 On 06/22/2011 07:19 AM, Izik Eidus wrote:

 So what we say here is: it is better to have little junk in the unstable
 tree that get flushed eventualy anyway, instead of make the guest
 slower
 this race is something that does not reflect accurate of ksm anyway due
 to the full memcmp that we will eventualy perform...

 With 2MB pages, I am not convinced they will get flushed eventually,
 because there is a good chance at least one of the 4kB pages inside
 a 2MB page is in active use at all times.

 I worry that the proposed changes may end up effectively preventing
 KSM from scanning inside 2MB pages, when even one 4kB page inside
 is in active use.  This could mean increased swapping on systems
 that run low on memory, which can be a much larger performance penalty
 than ksmd CPU use.

 We need to scan inside 2MB pages when memory runs low, regardless
 of the accessed or dirty bits.

I agree on this point. The dirty bit and the young bit are by no means accurate.
Even on 4kB pages, there is always a chance that the pte is dirty but the
contents are actually the same. Yeah, the whole optimization contains
trade-offs, and trade-offs always have the possibility of annoying someone.
Just like the LRU approximations that rely on page bits, none of them is
perfect either. But I think it can benefit some people. So maybe we could
provide a generic, balanced solution plus fine-tuning interfaces, so that when
it really gets in the way of someone, he has a way to work around it.
Do you agree with my argument? :-)


 --
 All rights reversed



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Nai Xia
On Wed, Jun 22, 2011 at 11:03 PM, Andrea Arcangeli aarca...@redhat.com wrote:
 On Tue, Jun 21, 2011 at 09:32:39PM +0800, Nai Xia wrote:
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index d48ec60..b407a69 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -4674,6 +4674,7 @@ static int __init vmx_init(void)
               kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
                               VMX_EPT_EXECUTABLE_MASK);
               kvm_enable_tdp();
 +             kvm_dirty_update = 0;
       } else
               kvm_disable_tdp();


 Why not return !shadow_dirty_mask instead of adding a new var?

  struct mmu_notifier_ops {
 +     int (*dirty_update)(struct mmu_notifier *mn,
 +                          struct mm_struct *mm);
 +

 Needs some docu.

 I think dirty_update isn't self explanatory name. I think
 has_test_and_clear_dirty would be better.

 If we don't flush the smp tlb don't we risk that we'll insert pages in
 the unstable tree that are volatile just because the dirty bit didn't
 get set again on the spte?

 The first patch I guess it's a sign of hugetlbfs going a little over
 the edge in trying to mix with the core VM... Passing that parameter
 need_pte_unmap all over the place not so nice, maybe it'd be possible
 to fix within hugetlbfs to use a different method to walk the hugetlb
 vmas. I'd prefer that if possible.

OK, I'll have a try over other workarounds.
I am not feeling good about need_pte_unmap myself. :-)

Thanks for viewing!

-Nai


 Thanks,
 Andrea



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Andrea Arcangeli
On Thu, Jun 23, 2011 at 07:13:54AM +0800, Nai Xia wrote:
 I agree on this point. Dirty bit , young bit, is by no means accurate. Even
 on 4kB pages, there is always a chance that the pte are dirty but the contents
 are actually the same. Yeah, the whole optimization contains trade-offs and

Just a side note: the fact that the dirty bit would be set even when the
data is the same is actually a pro, not a con. If the content is the
same but the page was written to, it'd trigger a copy-on-write shortly
after merging the page, rendering the whole exercise wasteful. The
cksum plays a double role: it stabilizes the unstable tree, so
there's less chance of bad lookups, but it also keeps us from merging
stuff that is written to frequently, triggering copy-on-writes, and the
dirty bit would also catch overwrites with the same data, something
the cksum can't do.
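
For reference, this is roughly the role the cksum plays today in
cmp_and_merge_page() (paraphrased from memory, not a verbatim quote of ksm.c):

	checksum = calc_checksum(page);
	if (rmap_item->oldchecksum != checksum) {
		/* still volatile: remember the new cksum and skip this round */
		rmap_item->oldchecksum = checksum;
		return;
	}
	/* content stable across two scans: try stable tree, then unstable tree */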


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Rik van Riel

On 06/22/2011 07:13 PM, Nai Xia wrote:

On Wed, Jun 22, 2011 at 11:39 PM, Rik van Rielr...@redhat.com  wrote:

On 06/22/2011 07:19 AM, Izik Eidus wrote:


So what we say here is: it is better to have little junk in the unstable
tree that get flushed eventualy anyway, instead of make the guest
slower
this race is something that does not reflect accurate of ksm anyway due
to the full memcmp that we will eventualy perform...


With 2MB pages, I am not convinced they will get flushed eventually,
because there is a good chance at least one of the 4kB pages inside
a 2MB page is in active use at all times.

I worry that the proposed changes may end up effectively preventing
KSM from scanning inside 2MB pages, when even one 4kB page inside
is in active use.  This could mean increased swapping on systems
that run low on memory, which can be a much larger performance penalty
than ksmd CPU use.

We need to scan inside 2MB pages when memory runs low, regardless
of the accessed or dirty bits.


I agree on this point. Dirty bit , young bit, is by no means accurate. Even
on 4kB pages, there is always a chance that the pte are dirty but the contents
are actually the same. Yeah, the whole optimization contains trade-offs and
trades-offs always have the possibilities to annoy  someone.  Just like
page-bit-relying LRU approximations none of them is perfect too. But I think
it can benefit some people. So maybe we could just provide a generic balanced
solution but provide fine tuning interfaces to make sure tha when it really gets
in the way of someone, he has a way to walk around.
Do you agree on my argument? :-)


That's not an argument.

That is a if I wave my hands vigorously enough, maybe people
will let my patch in without thinking about what I wrote
style argument.

I believe your optimization makes sense for 4kB pages, but
is going to be counter-productive for 2MB pages.

Your approach of make ksmd skip over more pages, so it uses
less CPU is likely to reduce the effectiveness of ksm by not
sharing some pages.

For 4kB pages that is fine, because you'll get around to them
eventually.

However, the internal use of a 2MB page is likely to be quite
different.  Chances are most 2MB pages will have actively used,
barely used and free pages inside.

You absolutely want ksm to get at the barely used and free
sub-pages.  Having just one actively used 4kB sub-page prevent
ksm from merging any of the other 511 sub-pages is a problem.

--
All rights reversed


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Nai Xia
On Thu, Jun 23, 2011 at 12:55 AM, Andrea Arcangeli aarca...@redhat.com wrote:
 On Wed, Jun 22, 2011 at 11:39:40AM -0400, Rik van Riel wrote:
 On 06/22/2011 07:19 AM, Izik Eidus wrote:

  So what we say here is: it is better to have little junk in the unstable
  tree that get flushed eventualy anyway, instead of make the guest
  slower
  this race is something that does not reflect accurate of ksm anyway due
  to the full memcmp that we will eventualy perform...

 With 2MB pages, I am not convinced they will get flushed eventually,
 because there is a good chance at least one of the 4kB pages inside
 a 2MB page is in active use at all times.

 I worry that the proposed changes may end up effectively preventing
 KSM from scanning inside 2MB pages, when even one 4kB page inside
 is in active use.  This could mean increased swapping on systems
 that run low on memory, which can be a much larger performance penalty
 than ksmd CPU use.

 We need to scan inside 2MB pages when memory runs low, regardless
 of the accessed or dirty bits.

 I guess we could fallback to the cksum when a THP is encountered
 (repeating the test_and_clear_dirty also wouldn't give the expected
 result if it's repeated on the same hugepmd for the next 4k virtual
 address candidate for unstable tree insertion, so it'd need special
 handling during the virtual walk anyway).

 So it's getting a little hairy, skip on THP, skip on EPT, then I
 wonder what is the common case that would be left using it...

 Or we could evaluate with statistic how many less pages are inserted
 into the unstable tree using the 2m dirty bit but clearly it'd be less
 reliable, the algorithm really is meant to track the volatility of
 what is later merged, not of a bigger chunk with unrelated data in it.

On 2MB pages, I'd like to remind you and Rik that ksmd currently splits
huge pages before their sub-pages really get merged into the stable tree.
So when there are many 2MB pages, each with one 4kB subpage that keeps
changing, it is already a concern for ksmd to judge whether it's worthwhile
to split the 2MB page and get its sub-pages merged.
I think the policy for ksmd in a system should be "If you cannot do something
good, at least do nothing evil." So I really don't think we can satisfy _all_
people. Get a general method and give users one or two knobs to tune it for
the corner cases. What do you think of my proposal?


 On a side note, khugepaged should also be changed to preserve the
 dirty bit if at least one dirty bit of the ptes is dirty (currently
 the hugepmd is always created dirty, it can never happen for an
 hugepmd to be clean today so it wasn't preserved in khugepaged so far).


Thanks for the point that out. This is what I have overlooked!

thanks,
Nai


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Nai Xia
On Wed, Jun 22, 2011 at 11:03 PM, Andrea Arcangeli aarca...@redhat.com wrote:
 On Tue, Jun 21, 2011 at 09:32:39PM +0800, Nai Xia wrote:
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index d48ec60..b407a69 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -4674,6 +4674,7 @@ static int __init vmx_init(void)
               kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
                               VMX_EPT_EXECUTABLE_MASK);
               kvm_enable_tdp();
 +             kvm_dirty_update = 0;
       } else
               kvm_disable_tdp();


 Why not return !shadow_dirty_mask instead of adding a new var?

  struct mmu_notifier_ops {
 +     int (*dirty_update)(struct mmu_notifier *mn,
 +                          struct mm_struct *mm);
 +

 Needs some docu.

OK. I'll add it.


 I think dirty_update isn't self explanatory name. I think
 has_test_and_clear_dirty would be better.

Agreed.  So it will be the name in the next version. :)

Thanks,
Nai


 If we don't flush the smp tlb don't we risk that we'll insert pages in
 the unstable tree that are volatile just because the dirty bit didn't
 get set again on the spte?

 The first patch I guess it's a sign of hugetlbfs going a little over
 the edge in trying to mix with the core VM... Passing that parameter
 need_pte_unmap all over the place not so nice, maybe it'd be possible
 to fix within hugetlbfs to use a different method to walk the hugetlb
 vmas. I'd prefer that if possible.

 Thanks,
 Andrea



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Andrea Arcangeli
On Thu, Jun 23, 2011 at 07:19:06AM +0800, Nai Xia wrote:
 OK, I'll have a try over other workarounds.
 I am not feeling good about need_pte_unmap myself. :-)

The usual way is to check VM_HUGETLB in the caller and to call another
function that doesn't kmap. Casting pmd_t to pte_t isn't really nice
(but hey we're also doing that exceptionally in smaps_pte_range for
THP, but it safe there because we're casting the value of the pmd, not
the pointer to the pmd, so the kmap is done by the pte version of the
caller and not done by the pmd version of the caller).
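
Something along these lines in the caller is what I mean (just a sketch, the
two helpers are made-up names, not existing functions):

	if (is_vm_hugetlb_page(vma))
		/* hugetlb: operate on the huge pte directly, no kmap needed */
		dirty = test_and_clear_dirty_huge(mm, vma, addr);
	else
		/* regular vma: the pte variant keeps its kmap/kunmap pairing */
		dirty = test_and_clear_dirty_pte(mm, vma, addr);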

Is it done for migrate? Surely it's not for swapout ;).

 Thanks for viewing!

You're welcome!

JFYI I'll be offline on vacation for a week, starting tomorrow, so if
I don't answer in the next few days that's the reason but I'll follow
the progress in a week.

Thanks!
Andrea


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Andrea Arcangeli
On Thu, Jun 23, 2011 at 07:37:47AM +0800, Nai Xia wrote:
 On 2MB pages, I'd like to remind you and Rik that ksmd currently splits
 huge pages before their sub pages gets really merged to stable tree.
 So when there are many 2MB pages each having a 4kB subpage
 changed for all time, this is already a concern for ksmd to judge
 if it's worthwhile to split 2MB page and get its sub-pages merged.

Hmm not sure to follow. KSM memory density with THP on and off should
be identical. The cksum is computed on subpages so the fact the 4k
subpage is actually mapped by a hugepmd is invisible to KSM up to the
point we get a unstable_tree_search_insert/stable_tree_search lookup
succeeding.

 I think the policy for ksmd in a system should be If you cannot do sth good,
 at least do nothing evil. So I really don't think we can satisfy _all_ 
 people.
 Get a general method and give users one or two knobs to tune it when they
 are the corner cases. How do  you think of my proposal ?

I'm neutral, but if we get two methods for deciding the unstable tree
candidates, the default probably should prioritize on maximum merging
even if it takes more CPU (if one cares about performance of the core
dedicated to ksmd, KSM is likely going to be off or scanning at low
rate in the first place).

  On a side note, khugepaged should also be changed to preserve the
  dirty bit if at least one dirty bit of the ptes is dirty (currently
  the hugepmd is always created dirty, it can never happen for an
  hugepmd to be clean today so it wasn't preserved in khugepaged so far).
 
 
 Thanks for the point that out. This is what I have overlooked!

No prob. And its default scan rate is very slow compared to ksmd so
it was unlikely to generate too many false positive dirty bits even if
you were splitting hugepages through swap.


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Rik van Riel

On 06/22/2011 07:37 PM, Nai Xia wrote:


On 2MB pages, I'd like to remind you and Rik that ksmd currently splits
huge pages before their sub pages gets really merged to stable tree.


Your proposal appears to add a condition that causes ksmd to skip
doing that, which can cause the system to start using swap instead
of sharing memory.


So when there are many 2MB pages each having a 4kB subpage
changed for all time, this is already a concern for ksmd to judge
if it's worthwhile to split 2MB page and get its sub-pages merged.
I think the policy for ksmd in a system should be If you cannot do sth good,
at least do nothing evil. So I really don't think we can satisfy _all_ people.
Get a general method and give users one or two knobs to tune it when they
are the corner cases. How do  you think of my proposal ?


I think your proposal makes sense for 4kB pages, but the ksmd
policy for 2MB pages probably needs to be much more aggressive.

--
All rights reversed


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Nai Xia
On Thu, Jun 23, 2011 at 7:44 AM, Andrea Arcangeli aarca...@redhat.com wrote:
 On Thu, Jun 23, 2011 at 07:19:06AM +0800, Nai Xia wrote:
 OK, I'll have a try over other workarounds.
 I am not feeling good about need_pte_unmap myself. :-)

 The usual way is to check VM_HUGETLB in the caller and to call another
 function that doesn't kmap. Casting pmd_t to pte_t isn't really nice
 (but hey we're also doing that exceptionally in smaps_pte_range for
 THP, but it safe there because we're casting the value of the pmd, not
 the pointer to the pmd, so the kmap is done by the pte version of the
 caller and not done by the pmd version of the caller).

 Is it done for migrate? Surely it's not for swapout ;).

Thanks for the hint. :-)

You know, another thing I am worried about is that I did make
page_check_address() return a pmd version for skipping the tail subpages ...
I did detect a "scheduling while atomic" if I kunmap() the returned value. :-(


 Thanks for viewing!

 You're welcome!

 JFYI I'll be offline on vacation for a week, starting tomorrow, so if
 I don't answer in the next few days that's the reason but I'll follow
 the progress in a week.

Have a nice vacation man! Enjoy the sunlight, we all have enough
of code in rooms. ;-)


Thanks,
Nai


 Thanks!
 Andrea



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Nai Xia
On Thu, Jun 23, 2011 at 7:59 AM, Andrea Arcangeli aarca...@redhat.com wrote:
 On Thu, Jun 23, 2011 at 07:37:47AM +0800, Nai Xia wrote:
 On 2MB pages, I'd like to remind you and Rik that ksmd currently splits
 huge pages before their sub pages gets really merged to stable tree.
 So when there are many 2MB pages each having a 4kB subpage
 changed for all time, this is already a concern for ksmd to judge
 if it's worthwhile to split 2MB page and get its sub-pages merged.

 Hmm not sure to follow. KSM memory density with THP on and off should
 be identical. The cksum is computed on subpages so the fact the 4k
 subpage is actually mapped by a hugepmd is invisible to KSM up to the
 point we get a unstable_tree_search_insert/stable_tree_search lookup
 succeeding.

I agree on your points.

But, I mean, splitting the huge page into normal pages when some subpages
need to be merged may increase the CPU's TLB lookaside cost and _might_ hurt
the workload ksmd is scanning. If only a small portion of these false-negative
2MB pages really get merged eventually, maybe it's not worthwhile, right?

But, well, just like Rik said below, yes, ksmd should be more aggressive to
avoid much more time consuming cost for swapping.


 I think the policy for ksmd in a system should be If you cannot do sth good,
 at least do nothing evil. So I really don't think we can satisfy _all_ 
 people.
 Get a general method and give users one or two knobs to tune it when they
 are the corner cases. How do  you think of my proposal ?

 I'm neutral, but if we get two methods for deciding the unstable tree
 candidates, the default probably should prioritize on maximum merging
 even if it takes more CPU (if one cares about performance of the core
 dedicated to ksmd, KSM is likely going to be off or scanning at low
 rate in the first place).

I agree with you here.


thanks,

Nai

  On a side note, khugepaged should also be changed to preserve the
  dirty bit if at least one dirty bit of the ptes is dirty (currently
  the hugepmd is always created dirty, it can never happen for an
  hugepmd to be clean today so it wasn't preserved in khugepaged so far).
 

 Thanks for the point that out. This is what I have overlooked!

 No prob. And its default scan rate is very slow compared to ksmd so
 it was unlikely to generate too many false positive dirty bits even if
 you were splitting hugepages through swap.



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Nai Xia
On Thu, Jun 23, 2011 at 8:00 AM, Rik van Riel r...@redhat.com wrote:
 On 06/22/2011 07:37 PM, Nai Xia wrote:

 On 2MB pages, I'd like to remind you and Rik that ksmd currently splits
 huge pages before their sub pages gets really merged to stable tree.

 Your proposal appears to add a condition that causes ksmd to skip
 doing that, which can cause the system to start using swap instead
 of sharing memory.

Hmm, yes, no swapping. So we should make the checksum default
for huge pages, right?


 So when there are many 2MB pages each having a 4kB subpage
 changed for all time, this is already a concern for ksmd to judge
 if it's worthwhile to split 2MB page and get its sub-pages merged.
 I think the policy for ksmd in a system should be If you cannot do sth
 good,
 at least do nothing evil. So I really don't think we can satisfy _all_
 people.
 Get a general method and give users one or two knobs to tune it when they
 are the corner cases. How do  you think of my proposal ?

 I think your proposal makes sense for 4kB pages, but the ksmd
 policy for 2MB pages probably needs to be much more aggressive.

I now agree with you on the whole point. Let's fall back to the checksum.
Thanks for making my mind clear! :)

And shall we provide an interface for users who _really_ want to judge the
dirty bit at the pmd level? I think we should agree on this point before I
misunderstand you and spam you with my next submission :P


And thanks for your time viewing!

-Nai



 --
 All rights reversed



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Andrea Arcangeli
On Thu, Jun 23, 2011 at 08:31:56AM +0800, Nai Xia wrote:
 On Thu, Jun 23, 2011 at 7:59 AM, Andrea Arcangeli aarca...@redhat.com wrote:
  On Thu, Jun 23, 2011 at 07:37:47AM +0800, Nai Xia wrote:
  On 2MB pages, I'd like to remind you and Rik that ksmd currently splits
  huge pages before their sub pages gets really merged to stable tree.
  So when there are many 2MB pages each having a 4kB subpage
  changed for all time, this is already a concern for ksmd to judge
  if it's worthwhile to split 2MB page and get its sub-pages merged.
 
  Hmm not sure to follow. KSM memory density with THP on and off should
  be identical. The cksum is computed on subpages so the fact the 4k
  subpage is actually mapped by a hugepmd is invisible to KSM up to the
  point we get a unstable_tree_search_insert/stable_tree_search lookup
  succeeding.
 
 I agree on your points.
 
 But, I mean splitting the huge page into normal pages when some subpages
 need to be merged may increase the TLB lookside timing of CPU and
 _might_ hurt the workload ksmd is scanning. If only a small portion of false
 negative 2MB pages are really get merged eventually, maybe it's not 
 worthwhile,
 right?

Yes, there's no threshold to say "only split if we could merge more
than N subpages"; 1 subpage match in two different hugepages is enough
to split both and save just 4k, but then memory accesses will be slower
for both 2m ranges that have been split. But the point is that it
won't be slower than if THP was off in the first place. So in the end
all we gain is 4k saved, but we still run faster than THP off in the
other hugepages that haven't been split yet.

 But, well, just like Rik said below, yes, ksmd should be more aggressive to
 avoid much more time consuming cost for swapping.

Correct the above logic also follows the idea to always maximize
memory merging in KSM, which is why we've no threshold to wait N
subpages to be mergeable before we split the hugepage.

I'm unsure if admins in real life would then start to use those
thresholds even if we implemented them. The current way of enabling
KSM on a per-VM (not per-host) basis is pretty simple: the
performance-critical VM has KSM off, the non-performance-critical VM has
KSM on, and it prioritizes memory merging.


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Nai Xia
On Thu, Jun 23, 2011 at 7:28 AM, Rik van Riel r...@redhat.com wrote:
 On 06/22/2011 07:13 PM, Nai Xia wrote:

 On Wed, Jun 22, 2011 at 11:39 PM, Rik van Rielr...@redhat.com  wrote:

 On 06/22/2011 07:19 AM, Izik Eidus wrote:

 So what we say here is: it is better to have little junk in the unstable
 tree that get flushed eventualy anyway, instead of make the guest
 slower
 this race is something that does not reflect accurate of ksm anyway due
 to the full memcmp that we will eventualy perform...

 With 2MB pages, I am not convinced they will get flushed eventually,
 because there is a good chance at least one of the 4kB pages inside
 a 2MB page is in active use at all times.

 I worry that the proposed changes may end up effectively preventing
 KSM from scanning inside 2MB pages, when even one 4kB page inside
 is in active use.  This could mean increased swapping on systems
 that run low on memory, which can be a much larger performance penalty
 than ksmd CPU use.

 We need to scan inside 2MB pages when memory runs low, regardless
 of the accessed or dirty bits.

 I agree on this point. Dirty bit , young bit, is by no means accurate.
 Even
 on 4kB pages, there is always a chance that the pte are dirty but the
 contents
 are actually the same. Yeah, the whole optimization contains trade-offs
 and
 trades-offs always have the possibilities to annoy  someone.  Just like
 page-bit-relying LRU approximations none of them is perfect too. But I
 think
 it can benefit some people. So maybe we could just provide a generic
 balanced
 solution but provide fine tuning interfaces to make sure tha when it
 really gets
 in the way of someone, he has a way to walk around.
 Do you agree on my argument? :-)

 That's not an argument.

 That is a if I wave my hands vigorously enough, maybe people
 will let my patch in without thinking about what I wrote
 style argument.

Oh, NO, this is not what I meant.
Really sorry if I made myself look so evil...
I actually mean: skip or not, we agree on a point that will not
harm most people, and provide another interface to let someone
who _really_ wants to take another way.

I am by no means pushing the idea of skipping huge pages.
I was just not sure about it and wanted to get a precise idea from
you. And now I have it.



 I believe your optimization makes sense for 4kB pages, but
 is going to be counter-productive for 2MB pages.

 Your approach of make ksmd skip over more pages, so it uses
 less CPU is likely to reduce the effectiveness of ksm by not
 sharing some pages.

 For 4kB pages that is fine, because you'll get around to them
 eventually.

 However, the internal use of a 2MB page is likely to be quite
 different.  Chances are most 2MB pages will have actively used,
 barely used and free pages inside.

 You absolutely want ksm to get at the barely used and free
 sub-pages.  Having just one actively used 4kB sub-page prevent
 ksm from merging any of the other 511 sub-pages is a problem.

No, no,  I was just not sure about it. I meant we cannot satisfy
all people but I was not sure which one is good for most of them.

Sorry, again, if I didn't make it clear.


Nai


 --
 All rights reversed



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Nai Xia
On Thu, Jun 23, 2011 at 7:25 AM, Andrea Arcangeli aarca...@redhat.com wrote:
 On Thu, Jun 23, 2011 at 07:13:54AM +0800, Nai Xia wrote:
 I agree on this point. Dirty bit , young bit, is by no means accurate. Even
 on 4kB pages, there is always a chance that the pte are dirty but the 
 contents
 are actually the same. Yeah, the whole optimization contains trade-offs and

 Just a side note: the fact the dirty bit would be set even when the
 data is the same is actually a pros, not a cons. If the content is the
 same but the page was written to, it'd trigger a copy on write short
 after merging the page rendering the whole exercise wasteful. The
 cksum plays a double role, it both stabilizes the unstable tree, so
 there's less chance of bad lookups, but it also avoids us to merge
 stuff that is written to frequently triggering copy on writes, and the
 dirty bit would also catch overwrites with the same data, something
 the cksum can't do.

Good point. I actually have another version of ksm myself (off topic, but
if you want to take a glance: http://code.google.com/p/uksm/ :-) )
that did keep statistics, on a per-scan-round basis, of the ratio of pages
in a VMA that really got COWed due to KSM merging.

It's complicated to deduce precise information from the dirty bit and the
cksum alone.


Thanks,
Nai



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Nai Xia
On Thu, Jun 23, 2011 at 8:44 AM, Andrea Arcangeli aarca...@redhat.com wrote:
 On Thu, Jun 23, 2011 at 08:31:56AM +0800, Nai Xia wrote:
 On Thu, Jun 23, 2011 at 7:59 AM, Andrea Arcangeli aarca...@redhat.com 
 wrote:
  On Thu, Jun 23, 2011 at 07:37:47AM +0800, Nai Xia wrote:
  On 2MB pages, I'd like to remind you and Rik that ksmd currently splits
  huge pages before their sub pages gets really merged to stable tree.
  So when there are many 2MB pages each having a 4kB subpage
  changed for all time, this is already a concern for ksmd to judge
  if it's worthwhile to split 2MB page and get its sub-pages merged.
 
  Hmm not sure to follow. KSM memory density with THP on and off should
  be identical. The cksum is computed on subpages so the fact the 4k
  subpage is actually mapped by a hugepmd is invisible to KSM up to the
  point we get a unstable_tree_search_insert/stable_tree_search lookup
  succeeding.

 I agree on your points.

 But, I mean splitting the huge page into normal pages when some subpages
 need to be merged may increase the TLB lookside timing of CPU and
 _might_ hurt the workload ksmd is scanning. If only a small portion of false
 negative 2MB pages are really get merged eventually, maybe it's not 
 worthwhile,
 right?

 Yes, there's not threshold to say only split if we could merge more
 than N subpages, 1 subpage match in two different hugepages is enough
 to split both and save just 4k but then memory accesses will be slower
 for both 2m ranges that have been splitted. But the point is that it
 won't be slower than if THP was off in the first place. So in the end
 all we gain is 4k saved but we still run faster than THP off, in the
 other hugepages that haven't been splitted yet.

Yes, so ksmd is still doing good compared to THP off.
Thanks for making my mind clearer :)


 But, well, just like Rik said below, yes, ksmd should be more aggressive to
 avoid much more time consuming cost for swapping.

 Correct the above logic also follows the idea to always maximize
 memory merging in KSM, which is why we've no threshold to wait N
 subpages to be mergeable before we split the hugepage.

 I'm unsure if admins in real life would then start to use those
 thresholds even if we'd implement them. The current way of enabling
 KSM a per-VM (not per-host) basis is pretty simple: the performance
 critical VM has KSM off, non-performance critical VM has KSM on and it
 prioritizes on memory merging.

Hmm, yes, you are right.

Thanks,
Nai


Re: [PATCH v2 01/22] KVM: MMU: fix walking shadow page table

2011-06-22 Thread Xiao Guangrong
On 06/23/2011 01:13 AM, Marcelo Tosatti wrote:
 On Wed, Jun 22, 2011 at 10:28:04PM +0800, Xiao Guangrong wrote:
 Properly check the last mapping, and do not walk to the next level if last 
 spte
 is met

 Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
 ---
  arch/x86/kvm/mmu.c |9 +
  1 files changed, 5 insertions(+), 4 deletions(-)

 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 9c629b5..f474e93 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -1517,10 +1517,6 @@ static bool shadow_walk_okay(struct 
 kvm_shadow_walk_iterator *iterator)
   if (iterator->level < PT_PAGE_TABLE_LEVEL)
  return false;
  
  -if (iterator->level == PT_PAGE_TABLE_LEVEL)
 
 Change to = PT_PAGE_TABLE_LEVEL, checks should be in shadow_walk_okay.
 

OK, will fix, thanks!


Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support

2011-06-22 Thread Xiao Guangrong
Marcelo,

Thanks for your review!

On 06/23/2011 05:59 AM, Marcelo Tosatti wrote:

  static int is_large_pte(u64 pte)
 @@ -2123,6 +2158,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
  u64 spte, entry = *sptep;
  int ret = 0;
  
 +if (set_mmio_spte(sptep, gfn, pfn, pte_access))
 +return 0;
 +
 
 
 Should zap all mmio sptes when creating new memory slots (current qemu
 never exhibits that pattern, but its not an unsupported scenario).
 Easier to zap all mmu in that case.
 

OK, will do it in the next version.

 For shadow you need to remove mmio sptes on invlpg and all mmio sptes
 under CR3 (mmu_sync_roots), other than clearing the gva/gpa variables.
 

Oh, kvm_mmu_pte_write and invlpg do not zap mmio sptes; I will
fix these, thanks for pointing it out.

For mmu_sync_roots, we properly do it in FNAME(sync_page) in this patch:

static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
   int *nr_present)
{
if (unlikely(is_mmio_spte(*sptep))) {
if (gfn != get_mmio_spte_gfn(*sptep)) {
mmu_spte_clear_no_track(sptep);
return true;
}

(*nr_present)++;
mark_mmio_spte(sptep, gfn, access);
return true;
}

return false;
}

 Otherwise you can move the mmio info from an mmio spte back to
 mmio_gva/mmio_gfn after a TLB flush, without rereading the guest
 pagetable.
 

We do not read the guest page table when the mmio page fault occurs,
we just do it as you say:
- First, walk the shadow page table to get the mmio spte
- Then, cache the mmio spte info into mmio_gva/mmio_gfn

Did I miss your meaning?

 +{
 +struct kvm_shadow_walk_iterator iterator;
 +u64 spte, ret = 0ull;
 +
 +walk_shadow_page_lockless_begin(vcpu);
 +for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
 +if (iterator.level == 1) {
 +ret = spte;
 +break;
 +}
 
 Have you verified it is safe to walk sptes with parallel modifications
 to them? Other than shadow page deletion which is in theory covered by
 RCU. It would be good to have all cases written down.
 

Um, I think it is safe: while the spte is being written, we can get either the
old spte or the new spte, and both cases are OK for us. It is just like the
paging-structure cache on a real machine; the cpu can fetch the old spte even
if the page structure is changed by other cpus.

 +
 +if (!is_shadow_present_pte(spte))
 +break;
 +}
 +walk_shadow_page_lockless_end(vcpu);
 +
 +return ret;
 +}
 +
 +/*
 + * If it is a real mmio page fault, return 1 and emulate the instruction
 + * directly, return 0 if it needs page fault path to fix it, -1 is
 + * returned if bug is detected.
 + */
 +int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool 
 direct)
 +{
 +u64 spte;
 +
 +if (quickly_check_mmio_pf(vcpu, addr, direct))
 +return 1;
 +
 +spte = walk_shadow_page_get_mmio_spte(vcpu, addr);
 +
 +if (is_mmio_spte(spte)) {
 +gfn_t gfn = get_mmio_spte_gfn(spte);
 +unsigned access = get_mmio_spte_access(spte);
 +
 +if (direct)
 +addr = 0;
 +vcpu_cache_mmio_info(vcpu, addr, gfn, access);
 +return 1;
 +}
 +
 +/*
 + * It's ok if the gva is remapped by other cpus on shadow guest,
 + * it's a BUG if the gfn is not a mmio page.
 + */
 
 If the gva is remapped there should be no mmio page fault in the first
 place.
 

For example, as follow case:

VCPU 0                                  VCPU 1
gva is mapped to a mmio region
access gva
                                        remap gva to other page
emulate mmio access by kvm
(*the point we are discussing*)
                                        flush all tlb

In theory, it can happen.

 +if (direct  is_shadow_present_pte(spte))
 +return -1;
 
 An spte does not have to contain the present bit to generate a valid EPT
 misconfiguration (and an spte dump is still required in that case).
 Use !is_mmio_spte() instead.
 

We cannot use !is_mmio_spte() here, since the shadow page can be zapped at any
time, for example:

sp.spt[i] = mmio-spte

  VCPU 0                                VCPU 1
Access sp.spte[i], ept misconfig is occurred
                                        delete sp
                                        (if the number of shadow pages is out of
                                         the limit or page shrink is required,
                                         and other events...)

Walk shadow page out of the lock and get the
non-present spte
(*the point we are discussing*)

So, the bug we can detect is: it is an mmio access, but the spte points to a
normal page.

 
 +
 +/*
 + * If the page table is zapped by other cpus, let the page
 + * fault path 

Re: [PATCH 03/12] Switch build system to accompanied kernel headers

2011-06-22 Thread Stefan Weil

On 22.06.2011 23:37, Jan Kiszka wrote:

On 2011-06-22 22:51, Stefan Weil wrote:

If not, kvm=yes should be restricted to platforms with kvm support.

Otherwise, QEMU builds will fail very early:

ERROR: Host kernel lacks signalfd() support,
but KVM depends on it when the IO thread is disabled.

Of course, users of those non-kvm platforms can set --disable-kvm,
but I don't think that is the correct solution.

Even with kvm disabled, builds still fail for non-kvm systems:

In file included from /qemu/hw/kvmclock.c:21:
/qemu/linux-headers/linux/kvm_para.h:26:26: warning: asm/kvm_para.h: No
such file or directory


That indicates symlink emulation under Windows does not support
directories. Can you confirm this (check what
builddir/linux-headers/asm became)? Then we would have to link all
files in the arch header dir individually.

Jan


Even when cross compiling for w32 (on a linux host), kvmclock.c
does not compile:

$ LANG=C make CFLAGS=-g
  CC    i386-softmmu/kvmclock.o
In file included from /home/stefan/src/qemu/savannah/qemu/hw/kvmclock.c:20:
/home/stefan/src/qemu/savannah/qemu/linux-headers/linux/kvm.h:10:25: 
warning: linux/types.h: No such file or directory
/home/stefan/src/qemu/savannah/qemu/linux-headers/linux/kvm.h:12:25: 
warning: linux/ioctl.h: No such file or directory
In file included from 
/home/stefan/src/qemu/savannah/qemu/linux-headers/linux/kvm.h:13,

 from /home/stefan/src/qemu/savannah/qemu/hw/kvmclock.c:20:
../linux-headers/asm/kvm.h:32: error: expected specifier-qualifier-list 
before '__u32'
../linux-headers/asm/kvm.h:41: error: expected specifier-qualifier-list 
before '__u8'
../linux-headers/asm/kvm.h:61: error: expected specifier-qualifier-list 
before '__u64'


Is kvmclock.c really needed for non-kvm platforms? Or does it simply need
an obj-i386-$(CONFIG_KVM) entry in Makefile.target?

Stefan

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio_net sometimes didn't work

2011-06-22 Thread lidong chen
I find this problem has already been solved by this patch :)

Subject: [PATCH] [virtio] Replace virtio-net with native gPXE driver
http://git.etherboot.org/gpxe.git/commitdiff/180dcf4363bf1d8889a2398a4944e94ca150b311

2011/2/16 Michael S. Tsirkin m...@redhat.com:
 On Wed, Feb 16, 2011 at 04:00:15PM +0800, lidong chen wrote:
 how to info pci in qemu?

 just type it at the monitor prompt.

 the usb controller also used irq 11, i think the interrupt may be caused by
 this.
 i will modify qemu, ignore -usb, and try again.

 i added some debug code to printk the ioaddr of vdev.

         static int i;
         /* reading the ISR has the effect of also clearing it so it's very
          * important to save off the value. */
         isr = ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR);

         /* It's definitely not us if the ISR was not high */
         if (!isr) {
                 i++;
                 if (i == 1) {
                         printk(KERN_EMERG "2");
                         printk(KERN_EMERG "ioaddr %p", vp_dev->ioaddr);
                         i = 0;
                 }
                 return IRQ_NONE;
         }

 OK, so we seem to get it right.
 Try to stick the printout in qemu
    case VIRTIO_PCI_ISR:
        /* reading from the ISR also clears it. */
        ret = vdev->isr;
        vdev->isr = 0;
        qemu_set_irq(proxy->pci_dev.irq[0], 0);

 see whether it's the balloon device or the virtio net
 device that gets to handle the reads.
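
Something like the sketch below should show which device answers each ISR
read (only the fprintf line and the use of vdev->name are additions to the
snippet above; treat them as a guess, not tested code):

    case VIRTIO_PCI_ISR:
        /* reading from the ISR also clears it. */
        fprintf(stderr, "virtio ISR read: dev=%s isr=%u\n",
                vdev->name, vdev->isr);
        ret = vdev->isr;
        vdev->isr = 0;
        qemu_set_irq(proxy->pci_dev.irq[0], 0);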

 ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
 ACPI: PCI Interrupt :00:03.0[A] - Link [LNKC] - GSI 11 (level,
 high) - IRQ 11
 20ioaddr 0001c040020ioaddr 0001c040020ioaddr
 0001c040020ioaddr 0001c040020ioaddr
 0001c040020ioaddr 0001c040020ioaddr
 0001c040020ioaddr 0001c040020ioaddr
 0001c040020ioaddr 0001c0403irq 11: nobody cared (try booting
 with the irqpoll option)
  [<c01457b0>] __report_bad_irq+0x2b/0x69
  [<c0145979>] note_interrupt+0x18b/0x1b2
  [<c01452a9>] handle_IRQ_event+0x26/0x51
  [<c014537f>] __do_IRQ+0xab/0xdc
  [<c0106445>] do_IRQ+0x46/0x53
  [<c0104e8a>] common_interrupt+0x1a/0x20
  [<c01276f2>] __do_softirq+0x4f/0xc2
  [<c0127793>] do_softirq+0x2e/0x32
  [<c0104f3c>] apic_timer_interrupt+0x1c/0x30
  [<c02ae9fe>] _spin_unlock_irqrestore+0x6/0x7
  [<c01455dc>] setup_irq+0xab/0x108
  [<f8a9a09b>] vp_interrupt+0x0/0x114 [virtio_pci]
  [<c01456ad>] request_irq+0x74/0x90
  [<f8a9a2fb>] virtio_pci_probe+0x14c/0x1c2 [virtio_pci]
  [<c01d4096>] pci_device_probe+0x36/0x57
  [<c022f935>] driver_probe_device+0x42/0x8b
  [<c022fa23>] __driver_attach+0x4a/0x71
  [<c022f9d9>] __driver_attach+0x0/0x71
  [<c022f45a>] bus_for_each_dev+0x39/0x5b
  [<c022f89f>] driver_attach+0x11/0x13
  [<c022f9d9>] __driver_attach+0x0/0x71
  [<c022f17d>] bus_add_driver+0x64/0xfd
  [<c01d41f9>] __pci_register_driver+0x6c/0x8e
  [<f8830020>] virtio_pci_init+0x20/0x34 [virtio_pci]
  [<c013b216>] sys_init_module+0x1749/0x18c2
  [<c014ab9b>] generic_file_read+0x9a/0xaf
  [<c0167011>] vfs_read+0xa8/0x150
  [<c0103dcb>] sysenter_past_esp+0x54/0x79
 handlers:
 [<f8a9a09b>] (vp_interrupt+0x0/0x114 [virtio_pci])
 Disabling IRQ #11


 the output of lspci -vv
 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
       Subsystem: Unknown device 1af4:1100
       Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
 Stepping- SERR- FastB2B-
       Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
 <TAbort- <MAbort- >SERR- <PERR-

 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
       Subsystem: Unknown device 1af4:1100
       Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
 Stepping- SERR- FastB2B-
       Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
 <TAbort- <MAbort- >SERR- <PERR-
       Latency: 0

 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE
 [Natoma/Triton II] (prog-if 80 [Master])
       Subsystem: Unknown device 1af4:1100
       Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
 Stepping- SERR- FastB2B-
       Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
 <TAbort- <MAbort- >SERR- <PERR-
       Latency: 0
       Region 4: I/O ports at c000 [size=16]

 00:01.2 USB Controller: Intel Corporation 82371SB PIIX3 USB
 [Natoma/Triton II] (rev 01) (prog-if 00 [UHCI])
       Subsystem: Unknown device 1af4:1100
       Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
 Stepping- SERR- FastB2B-
       Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
 <TAbort- <MAbort- >SERR- <PERR-
       Latency: 0
       Interrupt: pin D routed to IRQ 11
       Region 4: I/O ports at c020 [size=32]

 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
       Subsystem: Unknown device 1af4:1100
       Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
 Stepping- SERR- FastB2B-
       Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
 <TAbort- <MAbort- >SERR- <PERR-
       Interrupt: pin A routed to IRQ 9