set_pte_at_notify regression

2014-01-10 Thread Izik Eidus

Hi,

It looks like commit 6bdb913f0a70a4dfb7f066fb15e2d6f960701d00 breaks the
semantics of set_pte_at_notify.
The new sequence of calling mmu_notifier_invalidate_range_start first, then
set_pte_at_notify, and then mmu_notifier_invalidate_range_end not only
triples the number of locks kvm has to take and release, but in addition
mmu_notifier_invalidate_range_start already zaps the pte entry from kvm, so by
the time set_pte_at_notify is called there is no spte left to update and the
call does nothing. The result is more vmexits for kvm from both do_wp_page and
replace_page, and broken set_pte_at_notify semantics.
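
A rough sketch of the two orderings (simplified pseudo-C, not the exact kernel
code) shows the problem:

    /* before 6bdb913f0a70: the change_pte notifier can update the spte in place */
    set_pte_at_notify(mm, addr, ptep, newpte);      /* ->change_pte(): kvm rewrites the spte */

    /* after 6bdb913f0a70: the range is torn down first */
    mmu_notifier_invalidate_range_start(mm, start, end);  /* kvm zaps the spte */
    set_pte_at_notify(mm, addr, ptep, newpte);            /* ->change_pte() finds nothing */
    mmu_notifier_invalidate_range_end(mm, start, end);

so the guest takes a fault (and a vmexit) on its next access instead of having
the spte updated in place.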


Thanks.


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Izik Eidus

On 6/22/2011 3:21 AM, Chris Wright wrote:

* Nai Xia (nai@gmail.com) wrote:

Introduced kvm_mmu_notifier_test_and_clear_dirty(), 
kvm_mmu_notifier_dirty_update()
and their mmu_notifier interfaces to support KSM dirty bit tracking, which 
brings
significant performance gain in volatile pages scanning in KSM.
Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel EPT is
enabled to indicate that the dirty bits of underlying sptes are not updated by
hardware.

Did you test with each of EPT, NPT and shadow?


Signed-off-by: Nai Xianai@gmail.com
Acked-by: Izik Eidusizik.ei...@ravellosystems.com
---
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/mmu.c  |   36 +
  arch/x86/kvm/mmu.h  |3 +-
  arch/x86/kvm/vmx.c  |1 +
  include/linux/kvm_host.h|2 +-
  include/linux/mmu_notifier.h|   48 +++
  mm/mmu_notifier.c   |   33 ++
  virt/kvm/kvm_main.c |   27 ++
  8 files changed, 149 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d2ac8e2..f0d7aa0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -848,6 +848,7 @@ extern bool kvm_rebooting;
  int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
  int kvm_age_hva(struct kvm *kvm, unsigned long hva);
  int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
+int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva);
  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
  int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
  int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index aee3862..a5a0c51 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -979,6 +979,37 @@ out:
return young;
  }

+/*
+ * Caller is supposed to SetPageDirty(), it's not done inside this.
+ */
+static
+int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp,
+  unsigned long data)
+{
+   u64 *spte;
+   int dirty = 0;
+
+   if (!shadow_dirty_mask) {
+   WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
+   goto out;
+   }

This should never fire with the dirty_update() notifier test, right?
And that means that this whole optimization is for the shadow mmu case,
arguably the legacy case.



Hi Chris,
AMD npt does track the dirty bit in the nested page tables,
so the shadow_dirty_mask should not be 0 in that case...
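
For what it's worth, the dirty_update hook should then boil down to reporting
whether hardware maintains the spte dirty bit at all; a minimal sketch (the
exact signature in Nai's patch may differ) would be:

    static int kvm_mmu_notifier_dirty_update(struct mmu_notifier *mn,
                                             struct mm_struct *mm)
    {
            /* 0 only when EPT is in use (no hardware dirty bit in the sptes
             * back then); non-zero for shadow paging and for AMD NPT */
            return !!shadow_dirty_mask;
    }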


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Izik Eidus

On 6/22/2011 1:43 PM, Avi Kivity wrote:

On 06/21/2011 04:32 PM, Nai Xia wrote:
Introduced kvm_mmu_notifier_test_and_clear_dirty(), 
kvm_mmu_notifier_dirty_update()
and their mmu_notifier interfaces to support KSM dirty bit tracking, 
which brings

significant performance gain in volatile pages scanning in KSM.
Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if 
intel EPT is
enabled to indicate that the dirty bits of underlying sptes are not 
updated by

hardware.




Can you quantify the performance gains?

+int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long 
*rmapp,

+   unsigned long data)
+{
+u64 *spte;
+int dirty = 0;
+
+if (!shadow_dirty_mask) {
+WARN(1, "KVM: do NOT try to test dirty bit in EPT\n");
+goto out;
+}
+
+spte = rmap_next(kvm, rmapp, NULL);
+while (spte) {
+int _dirty;
+u64 _spte = *spte;
+BUG_ON(!(_spte & PT_PRESENT_MASK));
+_dirty = _spte & PT_DIRTY_MASK;
+if (_dirty) {
+dirty = 1;
+clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
+}


Racy.  Also, needs a tlb flush eventually.


Hi, one of the issues is that the whole point of this patch is to avoid that
eventual tlb flush.
But I see your point: other users will not expect such behavior, so maybe we
need a flush_tlb parameter, or another mmu notifier call?



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Izik Eidus

On 6/22/2011 2:10 PM, Avi Kivity wrote:

On 06/22/2011 02:05 PM, Izik Eidus wrote:

+spte = rmap_next(kvm, rmapp, NULL);
+while (spte) {
+int _dirty;
+u64 _spte = *spte;
+BUG_ON(!(_spte & PT_PRESENT_MASK));
+_dirty = _spte & PT_DIRTY_MASK;
+if (_dirty) {
+dirty = 1;
+clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
+}


Racy.  Also, needs a tlb flush eventually.

+

Hi, one of the issues is that the whole point of this patch is not do 
tlb flush eventually,
But I see your point, because other users will not expect such 
behavior, so maybe there is need into a parameter

flush_tlb=?, or add another mmu notifier call?



If you don't flush the tlb, a subsequent write will not see that 
spte.d is clear and the write will happen.  So you'll see the page as 
clean even though it's dirty.  That's not acceptable.




Yes, but this is exactly what we want for this use case:
Right now ksm calculates a checksum of the page to see whether it changed; the
idea behind this patch is to use the dirty bit instead. However, the guest
might not really like us flushing its tlb over and over again, especially with
a periodic scan like the one ksm does.

So what we are saying is: it is better to have a little junk in the unstable
tree (which gets flushed eventually anyway) than to make the guest slower.
This race does not hurt ksm's accuracy anyway, because of the full memcmp we
eventually perform...

Of course we also trust that in most cases a tlb flush will already have
happened by the time ksm reaches a given virtual address again, since on real
systems that takes a few minutes.

What do you think about having 2 calls: one that does the expected behavior
and flushes the tlb, and one that clearly states it does not flush the tlb
and explains its use case for ksm?
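
To spell out the trade-off, the two scanning strategies look roughly like this
(hypothetical simplification; calc_checksum/oldchecksum are the existing
ksm.c names, the notifier call stands for the interface proposed in this
patch):

    /* today: detect volatile pages by hashing their contents */
    checksum = calc_checksum(page);                  /* reads the whole page */
    if (checksum != rmap_item->oldchecksum) {
            rmap_item->oldchecksum = checksum;
            return;                                  /* changed recently, skip it */
    }

    /* proposed: ask the secondary MMU instead of hashing */
    if (mmu_notifier_test_and_clear_dirty(mm, addr))
            return;                                  /* spte dirty bit was set, skip it */

The open question above is whether the second form must also flush the guest
tlb, so that a later write cannot slip past an already-cleared dirty bit.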


Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Izik Eidus

On 6/22/2011 2:33 PM, Nai Xia wrote:

On Wednesday 22 June 2011 19:28:08 Avi Kivity wrote:

On 06/22/2011 02:24 PM, Avi Kivity wrote:

On 06/22/2011 02:19 PM, Izik Eidus wrote:

On 6/22/2011 2:10 PM, Avi Kivity wrote:

On 06/22/2011 02:05 PM, Izik Eidus wrote:

+spte = rmap_next(kvm, rmapp, NULL);
+while (spte) {
+int _dirty;
+u64 _spte = *spte;
+BUG_ON(!(_spte & PT_PRESENT_MASK));
+_dirty = _spte & PT_DIRTY_MASK;
+if (_dirty) {
+dirty = 1;
+clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte);
+}

Racy.  Also, needs a tlb flush eventually.

+

Hi, one of the issues is that the whole point of this patch is not
do tlb flush eventually,
But I see your point, because other users will not expect such
behavior, so maybe there is need into a parameter
flush_tlb=?, or add another mmu notifier call?


If you don't flush the tlb, a subsequent write will not see that
spte.d is clear and the write will happen.  So you'll see the page
as clean even though it's dirty.  That's not acceptable.


Yes, but this is exactly what we want from this use case:
Right now ksm calculate the page hash to see if it was changed, the
idea behind this patch is to use the dirty bit instead,
however the guest might not really like the fact that we will flush
its tlb over and over again, specially in periodically scan like ksm
does.

I see.

Actually, this is dangerous.  If we use the dirty bit for other things,
we will get data corruption.

Yeah, yeah, I actually clarified in a reply to Chris, regarding his similar
concern, that we are currently the _only_ user. :)
We can add the flushing once someone else needs to rely on this bit.



I suggest adding the flushing when someone else starts using it as well.

Btw, I don't think this whole optimization is worthwhile for kvm guests if a
tlb flush must be performed: on a machine with a lot of cpus it is much better
for ksm to burn one cpu than to slow down all the others...
So while this patch would make ksm itself look faster, the whole system would
be slower...


So in case you don't want to add the flushing when someone else comes to rely
on it,
it would be better to use the dirty-bit trick just for userspace applications
and not for kvm guests..



Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking

2011-06-22 Thread Izik Eidus



If we don't flush the smp tlb don't we risk that we'll insert pages in
the unstable tree that are volatile just because the dirty bit didn't
get set again on the spte?


Yes, this is the trade-off we take; the unstable tree gets flushed
anyway -

so this is nothing that won't be recovered very soon after it happens...

and in most cases the tlb will have been flushed before ksm gets there anyway
(especially for a heavily modified page, which we don't want in the unstable
tree).



[PATCH] fix migration with big mem guests

2010-04-04 Thread Izik Eidus
Hi,

(Below is an explanation of the bug for anyone not familiar with it.)

In the beginning I tried to make this code run with
qemu_bh(), but the result was a performance catastrophe.

The reason is that the migration code just isn't built
to run at such high granularity; for example, stuff like:

static ram_addr_t ram_save_remaining(void)
{
ram_addr_t addr;
ram_addr_t count = 0;

    for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) {
if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG))
count++;
}

return count;
}

which gets called from ram_save_live(), was taking way too much time...
(Keep in mind that I tried to read only a small amount of data each time, and
to run it every time main_loop_wait() finished (from qemu_bh_poll()).)
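
For a sense of scale (illustrative numbers only, nothing measured): with 4 KiB
target pages, a 16 GB guest means

    16 GiB / 4 KiB = 4,194,304 dirty-bitmap checks per ram_save_remaining() call

and running that after every main_loop_wait() is exactly the kind of work the
bh-based approach ended up doing.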

Then I thought, ok - let's add a timer so the bh code runs only once in a
while - but the migration code already has a timer set, so it seemed to make
the most sense to use that one...

If anyone has a better idea how to solve this issue, I will be very
happy to hear it.

Thanks.

From 2d9c25f1fee61f50cb130769c3779707a6ef90d9 Mon Sep 17 00:00:00 2001
From: Izik Eidus iei...@redhat.com
Date: Mon, 5 Apr 2010 02:05:09 +0300
Subject: [PATCH] qemu-kvm: fix migration with large mem

For guests with a lot of memory that contain pages
whose bytes are all identical, we can
spend a lot of time reading memory from the guest
(is_dup_page()).

This happens because ram_save_live() limits how much
we can send to the destination but not how much we read
from the guest, so when we have many is_dup_page()
hits we might read a huge amount of data without updating important
stuff like the timers...

The guest loses all its responsiveness and reports many soft lockups
internally.

This patch adds a limit on how much we can read from the guest in each
iteration.

Thanks.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 vl.c |6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/vl.c b/vl.c
index d959fdb..777988d 100644
--- a/vl.c
+++ b/vl.c
@@ -174,6 +174,8 @@ int main(int argc, char **argv)
 
 #define DEFAULT_RAM_SIZE 128
 
+#define MAX_SAVE_BLOCK_READ 10 * 1024 * 1024
+
 #define MAX_VIRTIO_CONSOLES 1
 
 static const char *data_dir;
@@ -2854,6 +2856,7 @@ static int ram_save_live(Monitor *mon, QEMUFile *f, int 
stage, void *opaque)
 uint64_t bytes_transferred_last;
 double bwidth = 0;
 uint64_t expected_time = 0;
+int data_read = 0;
 
 if (stage  0) {
 cpu_physical_memory_set_dirty_tracking(0);
@@ -2883,10 +2886,11 @@ static int ram_save_live(Monitor *mon, QEMUFile *f, int 
stage, void *opaque)
 bytes_transferred_last = bytes_transferred;
 bwidth = qemu_get_clock_ns(rt_clock);
 
-while (!qemu_file_rate_limit(f)) {
+while (!qemu_file_rate_limit(f) && data_read < MAX_SAVE_BLOCK_READ) {
 int ret;
 
 ret = ram_save_block(f);
+data_read += ret * TARGET_PAGE_SIZE;
 bytes_transferred += ret * TARGET_PAGE_SIZE;
 if (ret == 0) /* no more blocks */
 break;
-- 
1.6.6.1



Re: problems getting KSM running on newer 2.6.32.x kernels

2010-02-09 Thread Izik Eidus
On Tue, 09 Feb 2010 12:51:47 +0200
Avi Kivity a...@redhat.com wrote:

 On 02/01/2010 12:37 PM, Nikola Ciprich wrote:
  Hello,
  it seems to me that after upgrading to some 2.6.32.x release, KSM stopped 
  working
  for me. I'm not exactly sure which update did this, but enabling KSM doesn't
  seem to do anything, ksmd process just sleeps and doesn't merge any memory.
  Early 2.6.32 versions worked correctly for me, now I'm using 2.6.32.7 and 
  qemu-kvm-0.12.2.
 
  I'm enabling it using:
 
  echo 262144 > /sys/kernel/mm/ksm/pages_to_scan
  echo 100 > /sys/kernel/mm/ksm/sleep_millisecs
  echo 1 > /sys/kernel/mm/ksm/run

Does it happen to you only on 2.6.32.x? What happens with 2.6.33.x?

I have tested it on 2.6.33.x and it seems to work...

 
 
 
 Izik?
 



Re: [PATCH] RFC: alias rework

2010-01-26 Thread Izik Eidus
On Tue, 26 Jan 2010 16:14:47 +0200
Avi Kivity a...@redhat.com wrote:

 On 01/25/2010 10:40 PM, Izik Eidus wrote:
 
  Or is this a feature you need?
   
 
  I dont need it (I asked Avi to do something), So he said he want to nuke 
  the aliasing
  from kvm and keep supporting the old userspace`s
 
  Do you have any other way to achive this?
 
  Btw I do realize it might be better not to push this patch and just keep 
  the old
  way of treating aliasing as we have now, I really don`t mind.
 
 
 How about implementing an alias pointing at a deleted slot as an invalid 
 slot?
 
 If the slot comes back later, we can revalidate it.
 

Ok, I didn't notice this invalid memslot flag.
I will add it, but I will still keep update_aliased_memslot()
in order to update the userspace virtual address...


[PATCH] RFC: alias rework

2010-01-25 Thread Izik Eidus
From f94dcd1ccabbcdb51ed7c37c5f58f00a5c1b7eec Mon Sep 17 00:00:00 2001
From: Izik Eidus iei...@redhat.com
Date: Mon, 25 Jan 2010 15:49:41 +0200
Subject: [PATCH] RFC: alias rework

This patch removes the old way of aliasing inside kvm
and moves to aliasing via the same virtual addresses.

This patch is really just an early RFC to find out whether you guys
like this direction; I still need to clean up some parts of it
and test it more before I feel it is ready to be merged...

Comments are more than welcome.

Thanks.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/ia64/include/asm/kvm_host.h |1 +
 arch/ia64/kvm/kvm-ia64.c |5 --
 arch/powerpc/kvm/powerpc.c   |5 --
 arch/s390/include/asm/kvm_host.h |1 +
 arch/s390/kvm/kvm-s390.c |5 --
 arch/x86/include/asm/kvm_host.h  |   19 --
 arch/x86/include/asm/vmx.h   |6 +-
 arch/x86/kvm/mmu.c   |   19 ++-
 arch/x86/kvm/x86.c   |  114 +++--
 include/linux/kvm_host.h |   11 +--
 virt/kvm/kvm_main.c  |   80 +++---
 11 files changed, 107 insertions(+), 159 deletions(-)

diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h
index a362e67..d5377c2 100644
--- a/arch/ia64/include/asm/kvm_host.h
+++ b/arch/ia64/include/asm/kvm_host.h
@@ -24,6 +24,7 @@
 #define __ASM_KVM_HOST_H
 
 #define KVM_MEMORY_SLOTS 32
+#define KVM_ALIAS_SLOTS 0
 /* memory slots that does not exposed to userspace */
 #define KVM_PRIVATE_MEM_SLOTS 4
 
diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 0618898..3d2559e 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1947,11 +1947,6 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
return vcpu-arch.timer_fired;
 }
 
-gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn)
-{
-   return gfn;
-}
-
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 {
return (vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE) ||
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 51aedd7..50b7d5f 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -35,11 +35,6 @@
 #define CREATE_TRACE_POINTS
 #include trace.h
 
-gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn)
-{
-   return gfn;
-}
-
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
 {
return !(v-arch.msr  MSR_WE) || !!(v-arch.pending_exceptions);
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 27605b6..6a2112e 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -21,6 +21,7 @@
 
 #define KVM_MAX_VCPUS 64
 #define KVM_MEMORY_SLOTS 32
+#define KVM_ALIAS_SLOTS 0
 /* memory slots that does not exposed to userspace */
 #define KVM_PRIVATE_MEM_SLOTS 4
 
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 8f09959..5d63f6b 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -741,11 +741,6 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
 {
 }
 
-gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn)
-{
-   return gfn;
-}
-
 static int __init kvm_s390_init(void)
 {
int ret;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a1f0b5d..2d2509f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -367,24 +367,7 @@ struct kvm_vcpu_arch {
u64 hv_vapic;
 };
 
-struct kvm_mem_alias {
-   gfn_t base_gfn;
-   unsigned long npages;
-   gfn_t target_gfn;
-#define KVM_ALIAS_INVALID 1UL
-   unsigned long flags;
-};
-
-#define KVM_ARCH_HAS_UNALIAS_INSTANTIATION
-
-struct kvm_mem_aliases {
-   struct kvm_mem_alias aliases[KVM_ALIAS_SLOTS];
-   int naliases;
-};
-
 struct kvm_arch {
-   struct kvm_mem_aliases *aliases;
-
unsigned int n_free_mmu_pages;
unsigned int n_requested_mmu_pages;
unsigned int n_alloc_mmu_pages;
@@ -674,8 +657,6 @@ void kvm_disable_tdp(void);
 int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
 int complete_pio(struct kvm_vcpu *vcpu);
 
-struct kvm_memory_slot *gfn_to_memslot_unaliased(struct kvm *kvm, gfn_t gfn);
-
 static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
 {
struct page *page = pfn_to_page(shadow_page  PAGE_SHIFT);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 43f1e9b..bf52a32 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -347,9 +347,9 @@ enum vmcs_field {
 
 #define AR_RESERVD_MASK 0xfffe0f00
 
-#define TSS_PRIVATE_MEMSLOT(KVM_MEMORY_SLOTS + 0)
-#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT   (KVM_MEMORY_SLOTS + 1)
-#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT (KVM_MEMORY_SLOTS + 2)
+#define TSS_PRIVATE_MEMSLOT(KVM_MEMORY_SLOTS + 
KVM_ALIAS_SLOTS + 0)
+#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT   (KVM_MEMORY_SLOTS + 
KVM_ALIAS_SLOTS + 1)
+#define

Re: [PATCH] RFC: alias rework

2010-01-25 Thread Izik Eidus
On Mon, 25 Jan 2010 17:45:53 -0200
Marcelo Tosatti mtosa...@redhat.com wrote:

 Izik,
 
 On Mon, Jan 25, 2010 at 03:53:44PM +0200, Izik Eidus wrote:
  From f94dcd1ccabbcdb51ed7c37c5f58f00a5c1b7eec Mon Sep 17 00:00:00 2001
  From: Izik Eidus iei...@redhat.com
  Date: Mon, 25 Jan 2010 15:49:41 +0200
  Subject: [PATCH] RFC: alias rework
  
  This patch remove the old way of aliasing inside kvm
  and move into using aliasing with the same virtual addresses
  
  This patch is really just early RFC just to know if you guys
  like this direction, and I need to clean some parts of it
  and test it more before I feel it ready to be merged...
  
  Comments are more than welcome.
  
  Thanks.
  
  Signed-off-by: Izik Eidus iei...@redhat.com
  ---
   arch/ia64/include/asm/kvm_host.h |1 +
   arch/ia64/kvm/kvm-ia64.c |5 --
   arch/powerpc/kvm/powerpc.c   |5 --
   arch/s390/include/asm/kvm_host.h |1 +
   arch/s390/kvm/kvm-s390.c |5 --
   arch/x86/include/asm/kvm_host.h  |   19 --
   arch/x86/include/asm/vmx.h   |6 +-
   arch/x86/kvm/mmu.c   |   19 ++-
   arch/x86/kvm/x86.c   |  114 
  +++--
   include/linux/kvm_host.h |   11 +--
   virt/kvm/kvm_main.c  |   80 +++---
   11 files changed, 107 insertions(+), 159 deletions(-)
  
 
  @@ -2661,7 +2611,18 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
  struct kvm_memslots *slots, *old_slots;
   
  spin_lock(kvm-mmu_lock);
  +   for (i = KVM_MEMORY_SLOTS; i  KVM_MEMORY_SLOTS +
  + KVM_ALIAS_SLOTS; ++i) {
 
 The plan is to kill KVM_ALIAS_SLOTS (aliases will share the 32 mem
 slots), right?

Hrmm, I think we have to keep these 4 extra KVM_ALIAS_SLOTS on top of
KVM_MEMORY_SLOTS to preserve the same behavior for old userspaces,
because some userspace apps may already be using all 32 slots?

I don't mind removing it if you guys don't think this is the case.

 
  +#ifdef CONFIG_X86
  +
  +static void update_alias_slots(struct kvm *kvm, struct kvm_memory_slot 
  *slot)
  +{
  +   int i;
  +
  +   for (i = KVM_MEMORY_SLOTS; i  KVM_MEMORY_SLOTS + KVM_ALIAS_SLOTS;
  +++i) {
  +   struct kvm_memory_slot *alias_memslot =
  +   kvm-memslots-memslots[i];
  +   unsigned long size = slot-npages  PAGE_SHIFT;
  +
   +   if (alias_memslot->real_base_gfn >= slot->base_gfn &&
   +   alias_memslot->real_base_gfn < slot->base_gfn + size) {
  +   if (slot-dirty_bitmap) {
  +   unsigned long bitmap_addr;
  +   unsigned long dirty_offset;
  +   unsigned long offset_addr =
  +   (alias_memslot-real_base_gfn -
  +   slot-base_gfn)  PAGE_SHIFT;
  +   alias_memslot-userspace_addr = 
  +   slot-userspace_addr + offset_addr;
  +
  +   dirty_offset =
  +   ALIGN(offset_addr, BITS_PER_LONG) / 8;
  +   bitmap_addr = (unsigned long) 
  slot-dirty_bitmap;
  +   bitmap_addr += dirty_offset;
  +   alias_memslot-dirty_bitmap = (unsigned long 
  *)bitmap_addr;
  +   alias_memslot-base_gfn = 
  alias_memslot-real_base_gfn;
  +   alias_memslot-npages = 
  alias_memslot-real_npages;
  +   } else if (!slot-rmap) {
  +   alias_memslot-base_gfn = 0;
  +   alias_memslot-npages = 0;
  +   }
  +   }
  +   }
  +}
  +
  +#endif
 
 Can't see why is this needed. What is the problem with nuking child
 aliases when deleting a real memslot?

The problem is that this memslot still points at the host virtual address.
This means that gfn_to_memslot/gfn_to_page will still work on those gfns and
will return pages mapped at the virtual address that userspace asked KVM to
remove.

Thanks.



Re: [PATCH] RFC: alias rework

2010-01-25 Thread Izik Eidus
On Mon, 25 Jan 2010 18:49:25 -0200
Marcelo Tosatti mtosa...@redhat.com wrote:

 On Mon, Jan 25, 2010 at 10:40:32PM +0200, Izik Eidus wrote:
  On Mon, 25 Jan 2010 18:20:39 -0200
  Marcelo Tosatti mtosa...@redhat.com wrote:
  
   With current code, if a memslot is deleted, access through any aliases
   that use it will fail (BTW it looks this is not properly handled, but
   thats a separate problem).
  
  
  Yea I had some still open concerns about this code (this why I sent it on 
  RFC)
  
   
   So AFAICS there is no requirement for an alias to continue operable 
   if its parent memslot is deleted.
  
  
  With this patch alias will stop to opearte when the parent is deleted
  just like the behivor with the current code...
  
  base_gfn will be set to 0 and npages will be set to 0 as well
  (the true values wil be hide in real_base_gfn...), so gfn_to_memslot
  and gfn_to_page will fail
 
 But you adjust the alias (and keep it valid) if dirty logging is
 enabled?

I am sorry, you probably got confused because the code is wrong.
The alias adjustment should happen in every case of:
 if (slot->rmap is valid (!NULL)):
 this means we have a NEW parent slot mapped at the gfn
 that the alias is mapped to, and we want the userspace address
 of the alias slot to intersect with the new parent slot.

and the later adjustment of the dirty_bitmap should happen only in the case
of if (slot->dirty_bitmap is valid (!NULL)):
 the alias slot needs to mark_page_dirty into the bitmap of the new parent slot.

I hope this makes things clearer.
(I think there is another small issue there, but I will send it once this is
no longer an RFC.)
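
Written as code, the intended structure is roughly this (a sketch of the
intent only, not the corrected patch; parent_bitmap/offset_addr stand in for
the computed values):

    if (slot->rmap) {
            /* a new parent slot covers the alias gfn range:
             * always retarget the alias at the parent's userspace mapping */
            alias_memslot->userspace_addr = slot->userspace_addr + offset_addr;
            if (slot->dirty_bitmap)
                    /* and only then point the alias dirty bitmap
                     * into the parent's bitmap */
                    alias_memslot->dirty_bitmap = parent_bitmap + dirty_offset;
    } else {
            /* parent deleted: invalidate the alias */
            alias_memslot->base_gfn = 0;
            alias_memslot->npages = 0;
    }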

 
   
   Or is this a feature you need?
  
  
  I dont need it (I asked Avi to do something), So he said he want to nuke 
  the aliasing
  from kvm and keep supporting the old userspace`s
 
 With feature i meant keeping the alias around when parent slot is
 deleted.


The code doesn't try to do that; in fact:
} else if (!slot->rmap) {
alias_memslot->base_gfn = 0;
alias_memslot->npages = 0;
}
is there to invalidate the alias slot.

Sorry if I made too much of a mess :).

 
  Do you have any other way to achive this?
 
 No.
 
  Btw I do realize it might be better not to push this patch and just keep 
  the old
  way of treating aliasing as we have now, I really don`t mind.
  
   
   Motivation is that nukeing aliases is simpler than adjusting them.
   
  
  Agree.
 



PATCH: kvm-userspace: ksm support

2009-10-04 Thread Izik Eidus
From a8ca226de8efb4f0447e4ef87bf034cf18996745 Mon Sep 17 00:00:00 2001
From: Izik Eidus iei...@redhat.com
Date: Sun, 4 Oct 2009 14:01:31 +0200
Subject: [PATCH] kvm-userspace: add ksm support

Calling to madvise(MADV_MERGEABLE) on the memory allocations.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 exec.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index 5c9edf7..406d2cb 100644
--- a/exec.c
+++ b/exec.c
@@ -2538,6 +2538,9 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
 new_block->host = file_ram_alloc(size, mem_path);
 if (!new_block->host) {
 new_block->host = qemu_vmalloc(size);
+#ifdef MADV_MERGEABLE
+madvise(new_block->host, size, MADV_MERGEABLE);
+#endif
 }
 new_block->offset = last_ram_offset;
 new_block->length = size;
-- 
1.5.6.5



[PATCH 0/3] ksm support for kvm v2

2009-09-23 Thread Izik Eidus
Hope I fixed everything I was asked to...
please tell me if I forgot anything.

Izik Eidus (3):
  kvm: dont hold pagecount reference for mapped sptes pages
  add SPTE_HOST_WRITEABLE flag to the shadow ptes
  add support for change_pte mmu notifiers

 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   86 ++
 arch/x86/kvm/paging_tmpl.h  |   18 +++-
 virt/kvm/kvm_main.c |   14 ++
 4 files changed, 98 insertions(+), 21 deletions(-)



[PATCH 1/3] kvm: dont hold pagecount reference for mapped sptes pages

2009-09-23 Thread Izik Eidus
When using mmu notifiers, we are allowed to drop the page count
reference taken by get_user_pages on a page that is mapped
inside the shadow page tables.

This is needed so the pagecount-versus-mapcount check stays balanced.

(Right now kvm increases the pagecount but does not increase the
mapcount when mapping a page into a shadow page table entry,
so comparing pagecount against mapcount gives no
reliable result.)
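
For reference, the check this is meant to keep balanced on the ksm side looks
roughly like the following (paraphrased from mm/ksm.c of that era, not quoted
exactly):

    /* one reference per mapping, plus the one we hold ourselves; any extra
     * reference (e.g. a long-lived get_user_pages pin) means someone else
     * may still write to the page, so refuse to share it */
    if (page_mapcount(page) + 1 + swapped != page_count(page))
            goto out_unlock;

With kvm holding an extra get_user_pages reference for every mapped spte, that
comparison could never succeed for guest pages.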

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/kvm/mmu.c |7 ++-
 1 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index eca41ae..6c67b23 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -634,9 +634,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
	if (*spte & shadow_accessed_mask)
kvm_set_pfn_accessed(pfn);
if (is_writeble_pte(*spte))
-   kvm_release_pfn_dirty(pfn);
-   else
-   kvm_release_pfn_clean(pfn);
+   kvm_set_pfn_dirty(pfn);
	rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt], sp->role.level);
	if (!*rmapp) {
		printk(KERN_ERR "rmap_remove: %p %llx 0->BUG\n", spte, *spte);
@@ -1877,8 +1875,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
page_header_update_slot(vcpu-kvm, sptep, gfn);
if (!was_rmapped) {
rmap_count = rmap_add(vcpu, sptep, gfn);
-   if (!is_rmap_spte(*sptep))
-   kvm_release_pfn_clean(pfn);
+   kvm_release_pfn_clean(pfn);
		if (rmap_count > RMAP_RECYCLE_THRESHOLD)
rmap_recycle(vcpu, sptep, gfn);
} else {
-- 
1.5.6.5



[PATCH 2/3] add SPTE_HOST_WRITEABLE flag to the shadow ptes

2009-09-23 Thread Izik Eidus
This flag indicates that the host physical page the spte points to
is write protected, and therefore we can't make the spte writable
unless we run get_user_pages(write = 1).

(This is needed for change_pte support in kvm.)

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/kvm/mmu.c |   15 +++
 arch/x86/kvm/paging_tmpl.h |   18 +++---
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6c67b23..5cd8b4e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -156,6 +156,8 @@ module_param(oos_shadow, bool, 0644);
 #define CREATE_TRACE_POINTS
 #include mmutrace.h
 
+#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
 struct kvm_rmap_desc {
@@ -1754,7 +1756,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
unsigned pte_access, int user_fault,
int write_fault, int dirty, int level,
gfn_t gfn, pfn_t pfn, bool speculative,
-   bool can_unsync)
+   bool can_unsync, bool reset_host_protection)
 {
u64 spte;
int ret = 0;
@@ -1781,6 +1783,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
spte |= kvm_x86_ops-get_mt_mask(vcpu, gfn,
kvm_is_mmio_pfn(pfn));
 
+   if (reset_host_protection)
+   spte |= SPTE_HOST_WRITEABLE;
+
	spte |= (u64)pfn << PAGE_SHIFT;
 
	if ((pte_access & ACC_WRITE_MASK)
@@ -1826,7 +1831,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 unsigned pt_access, unsigned pte_access,
 int user_fault, int write_fault, int dirty,
 int *ptwrite, int level, gfn_t gfn,
-pfn_t pfn, bool speculative)
+pfn_t pfn, bool speculative,
+bool reset_host_protection)
 {
int was_rmapped = 0;
int was_writeble = is_writeble_pte(*sptep);
@@ -1858,7 +1864,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
}
 
if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
- dirty, level, gfn, pfn, speculative, true)) {
+ dirty, level, gfn, pfn, speculative, true,
+ reset_host_protection)) {
if (write_fault)
*ptwrite = 1;
kvm_x86_ops-tlb_flush(vcpu);
@@ -1906,7 +1913,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
int write,
if (iterator.level == level) {
mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL,
 0, write, 1, pt_write,
-level, gfn, pfn, false);
+level, gfn, pfn, false, true);
++vcpu-stat.pf_fixed;
break;
}
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index d2fec9c..cfd2424 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -273,9 +273,13 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, 
struct kvm_mmu_page *page,
if (mmu_notifier_retry(vcpu, vcpu-arch.update_pte.mmu_seq))
return;
kvm_get_pfn(pfn);
+   /*
+* we call mmu_set_spte() with reset_host_protection = true because
+* vcpu->arch.update_pte.pfn was fetched from get_user_pages(write = 1).
+*/
	mmu_set_spte(vcpu, spte, page->role.access, pte_access, 0, 0,
		 gpte & PT_DIRTY_MASK, NULL, PT_PAGE_TABLE_LEVEL,
-gpte_to_gfn(gpte), pfn, true);
+gpte_to_gfn(gpte), pfn, true, true);
 }
 
 /*
@@ -308,7 +312,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 user_fault, write_fault,
 gw-ptes[gw-level-1]  PT_DIRTY_MASK,
 ptwrite, level,
-gw-gfn, pfn, false);
+gw-gfn, pfn, false, true);
break;
}
 
@@ -558,6 +562,7 @@ static void FNAME(prefetch_page)(struct kvm_vcpu *vcpu,
 static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
int i, offset, nr_present;
+   bool reset_host_protection;
 
offset = nr_present = 0;
 
@@ -595,9 +600,16 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp)
 
nr_present++;
	pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
+   if (!(sp->spt[i] & SPTE_HOST_WRITEABLE)) {
+   pte_access &= ~ACC_WRITE_MASK;
+   reset_host_protection = 0;
+   } else

[PATCH 3/3] add support for change_pte mmu notifiers

2009-09-23 Thread Izik Eidus
This is needed so that ksm can map pages directly into kvm's
shadow page tables.
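
The intended flow, sketched end to end (simplified, exact signatures differ):

    /* ksm merges a page and installs a read-only pte to the shared copy */
    set_pte_at_notify(mm, addr, ptep, newpte);
      -> mn->ops->change_pte(mn, mm, addr, newpte)   /* mmu notifier hook */
         -> kvm_set_spte_hva(kvm, hva, newpte)       /* added by this patch */
            -> kvm_set_pte_rmapp(): rewrite the matching sptes to the new pfn
               with PT_WRITABLE_MASK cleared, so the guest keeps running on
               the shared page without taking a fault.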

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   64 +-
 virt/kvm/kvm_main.c |   14 
 3 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3be0004..d838922 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -796,6 +796,7 @@ asmlinkage void kvm_handle_fault_on_reboot(void);
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5cd8b4e..0905ca2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -748,7 +748,7 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
return write_protected;
 }
 
-static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp, u64 data)
 {
u64 *spte;
int need_tlb_flush = 0;
@@ -763,8 +763,47 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long 
*rmapp)
return need_tlb_flush;
 }
 
-static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
- int (*handler)(struct kvm *kvm, unsigned long *rmapp))
+static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp, u64 data)
+{
+   int need_flush = 0;
+   u64 *spte, new_spte;
+   pte_t *ptep = (pte_t *)data;
+   pfn_t new_pfn;
+
+   WARN_ON(pte_huge(*ptep));
+   new_pfn = pte_pfn(*ptep);
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+   BUG_ON(!is_shadow_present_pte(*spte));
+   rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", spte, *spte);
+   need_flush = 1;
+   if (pte_write(*ptep)) {
+   rmap_remove(kvm, spte);
+   __set_spte(spte, shadow_trap_nonpresent_pte);
+   spte = rmap_next(kvm, rmapp, NULL);
+   } else {
+   new_spte = *spte & ~PT64_BASE_ADDR_MASK;
+   new_spte |= new_pfn << PAGE_SHIFT;
+
+   if (!pte_write(*ptep)) {
+   new_spte &= ~PT_WRITABLE_MASK;
+   new_spte &= ~SPTE_HOST_WRITEABLE;
+   if (is_writeble_pte(*spte))
+   kvm_set_pfn_dirty(spte_to_pfn(*spte));
+   }
+   __set_spte(spte, new_spte);
+   spte = rmap_next(kvm, rmapp, spte);
+   }
+   }
+   if (need_flush)
+   kvm_flush_remote_tlbs(kvm);
+
+   return 0;
+}
+
+static int kvm_handle_hva(struct kvm *kvm, unsigned long hva, u64 data,
+ int (*handler)(struct kvm *kvm, unsigned long *rmapp,
+u64 data))
 {
int i, j;
int retval = 0;
@@ -786,13 +825,15 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long 
hva,
		if (hva >= start && hva < end) {
gfn_t gfn_offset = (hva - start)  PAGE_SHIFT;
 
-   retval |= handler(kvm, memslot-rmap[gfn_offset]);
+   retval |= handler(kvm, memslot-rmap[gfn_offset],
+ data);
 
for (j = 0; j  KVM_NR_PAGE_SIZES - 1; ++j) {
int idx = gfn_offset;
idx /= KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL + 
j);
retval |= handler(kvm,
-   memslot-lpage_info[j][idx].rmap_pde);
+   memslot-lpage_info[j][idx].rmap_pde,
+   data);
}
}
}
@@ -802,10 +843,15 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long 
hva,
 
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
 {
-   return kvm_handle_hva(kvm, hva, kvm_unmap_rmapp);
+   return kvm_handle_hva(kvm, hva, 0, kvm_unmap_rmapp);
+}
+
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+{
+   kvm_handle_hva(kvm, hva, (u64)pte, kvm_set_pte_rmapp);
 }
 
-static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
+static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp, u64 data)
 {
u64 *spte;
int young = 0;
@@ -841,13 +887,13 @@ static void rmap_recycle(struct kvm_vcpu *vcpu, u64 
*spte, gfn_t gfn

Re: [PATCH 3/3] add support for change_pte mmu notifiers

2009-09-23 Thread Izik Eidus

Izik Eidus wrote:

this is needed for kvm if it want ksm to directly map pages into its
shadow page tables.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   64 +-
 virt/kvm/kvm_main.c |   14 
 3 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3be0004..d838922 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -796,6 +796,7 @@ asmlinkage void kvm_handle_fault_on_reboot(void);
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5cd8b4e..0905ca2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -748,7 +748,7 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
return write_protected;
 }
 
-static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)

+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp, u64 data)
 {
u64 *spte;
int need_tlb_flush = 0;
@@ -763,8 +763,47 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long 
*rmapp)
return need_tlb_flush;
 }
 
-static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,

- int (*handler)(struct kvm *kvm, unsigned long *rmapp))
+static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp, u64 data)
+{
+   int need_flush = 0;
+   u64 *spte, new_spte;
+   pte_t *ptep = (pte_t *)data;
+   pfn_t new_pfn;
+
+   WARN_ON(pte_huge(*ptep));
+   new_pfn = pte_pfn(*ptep);
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+   BUG_ON(!is_shadow_present_pte(*spte));
+   rmap_printk(kvm_set_pte_rmapp: spte %p %llx\n, spte, *spte);
+   need_flush = 1;
+   if (pte_write(*ptep)) {
+   rmap_remove(kvm, spte);
+   __set_spte(spte, shadow_trap_nonpresent_pte);
+   spte = rmap_next(kvm, rmapp, NULL);
+   } else {
+   new_spte = *spte & ~PT64_BASE_ADDR_MASK;
+   new_spte |= new_pfn << PAGE_SHIFT;
+
+   if (!pte_write(*ptep)) {
  


Just noticed that this if is not needed (we can only get here when the
if (pte_write(*ptep)) { check a few lines above was not taken)..

I will resend

+   new_spte &= ~PT_WRITABLE_MASK;
+   new_spte &= ~SPTE_HOST_WRITEABLE;
+   if (is_writeble_pte(*spte))
+   kvm_set_pfn_dirty(spte_to_pfn(*spte));
+   }
+   __set_spte(spte, new_spte);
+   spte = rmap_next(kvm, rmapp, spte);
+   }
+   }
+   if (need_flush)
+   kvm_flush_remote_tlbs(kvm);
+




[PATCH 0/3] kvm ksm support v3

2009-09-23 Thread Izik Eidus
Change from v2: removed the unused if.

Thanks.

Izik Eidus (3):
  kvm: dont hold pagecount reference for mapped sptes pages
  add SPTE_HOST_WRITEABLE flag to the shadow ptes
  add support for change_pte mmu notifiers

 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   84 ++
 arch/x86/kvm/paging_tmpl.h  |   18 +++-
 virt/kvm/kvm_main.c |   14 ++
 4 files changed, 96 insertions(+), 21 deletions(-)



[PATCH 1/3] kvm: dont hold pagecount reference for mapped sptes pages

2009-09-23 Thread Izik Eidus
When using mmu notifiers, we are allowed to drop the page count
reference taken by get_user_pages on a page that is mapped
inside the shadow page tables.

This is needed so the pagecount-versus-mapcount check stays balanced.

(Right now kvm increases the pagecount but does not increase the
mapcount when mapping a page into a shadow page table entry,
so comparing pagecount against mapcount gives no
reliable result.)

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/kvm/mmu.c |7 ++-
 1 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index eca41ae..6c67b23 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -634,9 +634,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
	if (*spte & shadow_accessed_mask)
kvm_set_pfn_accessed(pfn);
if (is_writeble_pte(*spte))
-   kvm_release_pfn_dirty(pfn);
-   else
-   kvm_release_pfn_clean(pfn);
+   kvm_set_pfn_dirty(pfn);
	rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt], sp->role.level);
	if (!*rmapp) {
		printk(KERN_ERR "rmap_remove: %p %llx 0->BUG\n", spte, *spte);
@@ -1877,8 +1875,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
page_header_update_slot(vcpu-kvm, sptep, gfn);
if (!was_rmapped) {
rmap_count = rmap_add(vcpu, sptep, gfn);
-   if (!is_rmap_spte(*sptep))
-   kvm_release_pfn_clean(pfn);
+   kvm_release_pfn_clean(pfn);
		if (rmap_count > RMAP_RECYCLE_THRESHOLD)
rmap_recycle(vcpu, sptep, gfn);
} else {
-- 
1.5.6.5



[PATCH 2/3] add SPTE_HOST_WRITEABLE flag to the shadow ptes

2009-09-23 Thread Izik Eidus
This flag indicates that the host physical page the spte points to
is write protected, and therefore we can't make the spte writable
unless we run get_user_pages(write = 1).

(This is needed for change_pte support in kvm.)

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/kvm/mmu.c |   15 +++
 arch/x86/kvm/paging_tmpl.h |   18 +++---
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6c67b23..5cd8b4e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -156,6 +156,8 @@ module_param(oos_shadow, bool, 0644);
 #define CREATE_TRACE_POINTS
 #include mmutrace.h
 
+#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
 struct kvm_rmap_desc {
@@ -1754,7 +1756,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
unsigned pte_access, int user_fault,
int write_fault, int dirty, int level,
gfn_t gfn, pfn_t pfn, bool speculative,
-   bool can_unsync)
+   bool can_unsync, bool reset_host_protection)
 {
u64 spte;
int ret = 0;
@@ -1781,6 +1783,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
spte |= kvm_x86_ops-get_mt_mask(vcpu, gfn,
kvm_is_mmio_pfn(pfn));
 
+   if (reset_host_protection)
+   spte |= SPTE_HOST_WRITEABLE;
+
	spte |= (u64)pfn << PAGE_SHIFT;
 
	if ((pte_access & ACC_WRITE_MASK)
@@ -1826,7 +1831,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 unsigned pt_access, unsigned pte_access,
 int user_fault, int write_fault, int dirty,
 int *ptwrite, int level, gfn_t gfn,
-pfn_t pfn, bool speculative)
+pfn_t pfn, bool speculative,
+bool reset_host_protection)
 {
int was_rmapped = 0;
int was_writeble = is_writeble_pte(*sptep);
@@ -1858,7 +1864,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
}
 
if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
- dirty, level, gfn, pfn, speculative, true)) {
+ dirty, level, gfn, pfn, speculative, true,
+ reset_host_protection)) {
if (write_fault)
*ptwrite = 1;
kvm_x86_ops-tlb_flush(vcpu);
@@ -1906,7 +1913,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
int write,
if (iterator.level == level) {
mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL,
 0, write, 1, pt_write,
-level, gfn, pfn, false);
+level, gfn, pfn, false, true);
++vcpu-stat.pf_fixed;
break;
}
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index d2fec9c..cfd2424 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -273,9 +273,13 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, 
struct kvm_mmu_page *page,
if (mmu_notifier_retry(vcpu, vcpu-arch.update_pte.mmu_seq))
return;
kvm_get_pfn(pfn);
+   /*
+* we call mmu_set_spte() with reset_host_protection = true because
+* vcpu->arch.update_pte.pfn was fetched from get_user_pages(write = 1).
+*/
	mmu_set_spte(vcpu, spte, page->role.access, pte_access, 0, 0,
		 gpte & PT_DIRTY_MASK, NULL, PT_PAGE_TABLE_LEVEL,
-gpte_to_gfn(gpte), pfn, true);
+gpte_to_gfn(gpte), pfn, true, true);
 }
 
 /*
@@ -308,7 +312,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 user_fault, write_fault,
 gw-ptes[gw-level-1]  PT_DIRTY_MASK,
 ptwrite, level,
-gw-gfn, pfn, false);
+gw-gfn, pfn, false, true);
break;
}
 
@@ -558,6 +562,7 @@ static void FNAME(prefetch_page)(struct kvm_vcpu *vcpu,
 static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
int i, offset, nr_present;
+   bool reset_host_protection;
 
offset = nr_present = 0;
 
@@ -595,9 +600,16 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp)
 
nr_present++;
	pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
+   if (!(sp->spt[i] & SPTE_HOST_WRITEABLE)) {
+   pte_access &= ~ACC_WRITE_MASK;
+   reset_host_protection = 0;
+   } else

[PATCH 3/3] add support for change_pte mmu notifiers

2009-09-23 Thread Izik Eidus
This is needed so that ksm can map pages directly into kvm's
shadow page tables.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   62 +-
 virt/kvm/kvm_main.c |   14 +
 3 files changed, 68 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3be0004..d838922 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -796,6 +796,7 @@ asmlinkage void kvm_handle_fault_on_reboot(void);
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5cd8b4e..ceec065 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -748,7 +748,7 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
return write_protected;
 }
 
-static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp, u64 data)
 {
u64 *spte;
int need_tlb_flush = 0;
@@ -763,8 +763,45 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long 
*rmapp)
return need_tlb_flush;
 }
 
-static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
- int (*handler)(struct kvm *kvm, unsigned long *rmapp))
+static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp, u64 data)
+{
+   int need_flush = 0;
+   u64 *spte, new_spte;
+   pte_t *ptep = (pte_t *)data;
+   pfn_t new_pfn;
+
+   WARN_ON(pte_huge(*ptep));
+   new_pfn = pte_pfn(*ptep);
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+   BUG_ON(!is_shadow_present_pte(*spte));
+   rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", spte, *spte);
+   need_flush = 1;
+   if (pte_write(*ptep)) {
+   rmap_remove(kvm, spte);
+   __set_spte(spte, shadow_trap_nonpresent_pte);
+   spte = rmap_next(kvm, rmapp, NULL);
+   } else {
+   new_spte = *spte & ~PT64_BASE_ADDR_MASK;
+   new_spte |= new_pfn << PAGE_SHIFT;
+
+   new_spte &= ~PT_WRITABLE_MASK;
+   new_spte &= ~SPTE_HOST_WRITEABLE;
+   if (is_writeble_pte(*spte))
+   kvm_set_pfn_dirty(spte_to_pfn(*spte));
+   __set_spte(spte, new_spte);
+   spte = rmap_next(kvm, rmapp, spte);
+   }
+   }
+   if (need_flush)
+   kvm_flush_remote_tlbs(kvm);
+
+   return 0;
+}
+
+static int kvm_handle_hva(struct kvm *kvm, unsigned long hva, u64 data,
+ int (*handler)(struct kvm *kvm, unsigned long *rmapp,
+u64 data))
 {
int i, j;
int retval = 0;
@@ -786,13 +823,15 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long 
hva,
		if (hva >= start && hva < end) {
gfn_t gfn_offset = (hva - start)  PAGE_SHIFT;
 
-   retval |= handler(kvm, memslot-rmap[gfn_offset]);
+   retval |= handler(kvm, memslot-rmap[gfn_offset],
+ data);
 
for (j = 0; j  KVM_NR_PAGE_SIZES - 1; ++j) {
int idx = gfn_offset;
idx /= KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL + 
j);
retval |= handler(kvm,
-   memslot-lpage_info[j][idx].rmap_pde);
+   memslot-lpage_info[j][idx].rmap_pde,
+   data);
}
}
}
@@ -802,10 +841,15 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long 
hva,
 
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
 {
-   return kvm_handle_hva(kvm, hva, kvm_unmap_rmapp);
+   return kvm_handle_hva(kvm, hva, 0, kvm_unmap_rmapp);
+}
+
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+{
+   kvm_handle_hva(kvm, hva, (u64)pte, kvm_set_pte_rmapp);
 }
 
-static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
+static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp, u64 data)
 {
u64 *spte;
int young = 0;
@@ -841,13 +885,13 @@ static void rmap_recycle(struct kvm_vcpu *vcpu, u64 
*spte, gfn_t gfn)
gfn = unalias_gfn(vcpu-kvm, gfn);
rmapp = gfn_to_rmap(vcpu-kvm, gfn, sp-role.level

Re: [PATCH 2/3] add SPTE_HOST_WRITEABLE flag to the shadow ptes

2009-09-14 Thread Izik Eidus

Marcelo Tosatti wrote:
Why can't you use the writable bit in the spte? So that you can only 
sync a writeable spte if it was writeable before, in sync_page?
  


I could, but then we would add overhead for read-only gptes that become
writable in the guest...
If you prefer to take a fault when syncing a guest gpte that changed from
read-only to writable, I can change it...


What do you prefer?


Is there any other need for the extra bit?
  


No




Re: [PATCH 2/3] add SPTE_HOST_WRITEABLE flag to the shadow ptes

2009-09-12 Thread Izik Eidus

Marcelo Tosatti wrote:

On Thu, Sep 10, 2009 at 07:38:57PM +0300, Izik Eidus wrote:
  

this flag notify that the host physical page we are pointing to from
the spte is write protected, and therefore we cant change its access
to be write unless we run get_user_pages(write = 1).

(this is needed for change_pte support in kvm)

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/kvm/mmu.c |   15 +++
 arch/x86/kvm/paging_tmpl.h |   18 +++---
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 62d2f86..a7151b8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -156,6 +156,8 @@ module_param(oos_shadow, bool, 0644);
 #define CREATE_TRACE_POINTS
 #include mmutrace.h
 
+#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)

+
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
 struct kvm_rmap_desc {

@@ -1754,7 +1756,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
unsigned pte_access, int user_fault,
int write_fault, int dirty, int level,
gfn_t gfn, pfn_t pfn, bool speculative,
-   bool can_unsync)
+   bool can_unsync, bool reset_host_protection)


 bool host_pte_writeable ?

  


Sure.



Re: [PATCH 3/3] add support for change_pte mmu notifiers

2009-09-12 Thread Izik Eidus

Marcelo Tosatti wrote:

On Sat, Sep 12, 2009 at 09:41:10AM +0300, Izik Eidus wrote:
  

Marcelo Tosatti wrote:


On Thu, Sep 10, 2009 at 07:38:58PM +0300, Izik Eidus wrote:
  
  

this is needed for kvm if it want ksm to directly map pages into its
shadow page tables.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   70 ++
 virt/kvm/kvm_main.c |   14 
 3 files changed, 77 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6046e6f..594d131 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -797,6 +797,7 @@ asmlinkage void kvm_handle_fault_on_reboot(void);
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a7151b8..3fd19f2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -282,6 +282,11 @@ static pfn_t spte_to_pfn(u64 pte)
return (pte  PT64_BASE_ADDR_MASK)  PAGE_SHIFT;
 }
 +static pte_t ptep_val(pte_t *ptep)
+{
+   return *ptep;
+}
+
 static gfn_t pse36_gfn_delta(u32 gpte)
 {
int shift = 32 - PT32_DIR_PSE36_SHIFT - PAGE_SHIFT;
@@ -748,7 +753,8 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
return write_protected;
 }
 -static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
+  unsigned long data)
 {
u64 *spte;
int need_tlb_flush = 0;
@@ -763,8 +769,48 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long 
*rmapp)
return need_tlb_flush;
 }
 +static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
+unsigned long data)
+{
+   int need_flush = 0;
+   u64 *spte, new_spte;
+   pte_t *ptep = (pte_t *)data;
+   pfn_t new_pfn;
+
+   new_pfn = pte_pfn(ptep_val(ptep));
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+   BUG_ON(!is_shadow_present_pte(*spte));
+   rmap_printk(kvm_set_pte_rmapp: spte %p %llx\n, spte, *spte);
+   need_flush = 1;
+   if (pte_write(ptep_val(ptep))) {
+   rmap_remove(kvm, spte);
+   __set_spte(spte, shadow_trap_nonpresent_pte);
+   spte = rmap_next(kvm, rmapp, NULL);
+   } else {
+   new_spte = *spte & ~PT64_BASE_ADDR_MASK;
+   new_spte |= new_pfn << PAGE_SHIFT;
+
+   if (!pte_write(ptep_val(ptep))) {
+   new_spte &= ~PT_WRITABLE_MASK;
+   new_spte &= ~SPTE_HOST_WRITEABLE;
+   if (is_writeble_pte(*spte))
+   kvm_set_pfn_dirty(spte_to_pfn(*spte));
+   }
+   __set_spte(spte, new_spte);
+   spte = rmap_next(kvm, rmapp, spte);
+   }
+   }
+   if (need_flush)
+   kvm_flush_remote_tlbs(kvm);
+
+   return 0;
+}
+
 static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
- int (*handler)(struct kvm *kvm, unsigned long *rmapp))
+ unsigned long data,
+ int (*handler)(struct kvm *kvm, unsigned long *rmapp,
+unsigned long data))
 {
int i, j;
int retval = 0;
@@ -786,13 +832,15 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long 
hva,
		if (hva >= start && hva < end) {
			gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;

-			retval |= handler(kvm, &memslot->rmap[gfn_offset]);
+			retval |= handler(kvm, &memslot->rmap[gfn_offset],
+					  data);
			for (j = 0; j < KVM_NR_PAGE_SIZES - 1; ++j) {
				int idx = gfn_offset;
				idx /= KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL + j);
				retval |= handler(kvm,
-					&memslot->lpage_info[j][idx].rmap_pde);
+					&memslot->lpage_info[j][idx].rmap_pde,
+   data);



If change_pte is called to modify a largepage pte, and the shadow has
that largepage mapped with 4k sptes, you'll set the wrong pfn. That is,
the patch does not attempt to handle different page sizes properly

[PATCH 0/3] ksm support for kvm

2009-09-10 Thread Izik Eidus
Hi,

The following series adds ksm support to the kvm mmu.

Thanks.




[PATCH 2/3] add SPTE_HOST_WRITEABLE flag to the shadow ptes

2009-09-10 Thread Izik Eidus
this flag notifies that the host physical page we are pointing to from
the spte is write protected, and therefore we can't change its access
to writable unless we run get_user_pages(write = 1).

(this is needed for change_pte support in kvm)

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/kvm/mmu.c |   15 +++
 arch/x86/kvm/paging_tmpl.h |   18 +++---
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 62d2f86..a7151b8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -156,6 +156,8 @@ module_param(oos_shadow, bool, 0644);
 #define CREATE_TRACE_POINTS
 #include "mmutrace.h"
 
+#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
 struct kvm_rmap_desc {
@@ -1754,7 +1756,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
unsigned pte_access, int user_fault,
int write_fault, int dirty, int level,
gfn_t gfn, pfn_t pfn, bool speculative,
-   bool can_unsync)
+   bool can_unsync, bool reset_host_protection)
 {
u64 spte;
int ret = 0;
@@ -1781,6 +1783,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
		spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
			kvm_is_mmio_pfn(pfn));
 
+	if (reset_host_protection)
+		spte |= SPTE_HOST_WRITEABLE;
+
	spte |= (u64)pfn << PAGE_SHIFT;
 
	if ((pte_access & ACC_WRITE_MASK)
@@ -1826,7 +1831,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 unsigned pt_access, unsigned pte_access,
 int user_fault, int write_fault, int dirty,
 int *ptwrite, int level, gfn_t gfn,
-pfn_t pfn, bool speculative)
+pfn_t pfn, bool speculative,
+bool reset_host_protection)
 {
int was_rmapped = 0;
int was_writeble = is_writeble_pte(*sptep);
@@ -1858,7 +1864,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
}
 
if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
- dirty, level, gfn, pfn, speculative, true)) {
+ dirty, level, gfn, pfn, speculative, true,
+ reset_host_protection)) {
if (write_fault)
*ptwrite = 1;
kvm_x86_ops-tlb_flush(vcpu);
@@ -1906,7 +1913,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
int write,
if (iterator.level == level) {
mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL,
 0, write, 1, pt_write,
-level, gfn, pfn, false);
+level, gfn, pfn, false, true);
++vcpu-stat.pf_fixed;
break;
}
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index d2fec9c..c9256ee 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -273,9 +273,13 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, 
struct kvm_mmu_page *page,
	if (mmu_notifier_retry(vcpu, vcpu->arch.update_pte.mmu_seq))
		return;
	kvm_get_pfn(pfn);
+	/*
+	 * we call mmu_set_spte() with reset_host_protection = true because
+	 * vcpu->arch.update_pte.pfn was fetched from get_user_pages(write = 1).
+	 */
	mmu_set_spte(vcpu, spte, page->role.access, pte_access, 0, 0,
		     gpte & PT_DIRTY_MASK, NULL, PT_PAGE_TABLE_LEVEL,
-gpte_to_gfn(gpte), pfn, true);
+gpte_to_gfn(gpte), pfn, true, true);
 }
 
 /*
@@ -308,7 +312,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 user_fault, write_fault,
			     gw->ptes[gw->level-1] & PT_DIRTY_MASK,
			     ptwrite, level,
-			     gw->gfn, pfn, false);
+			     gw->gfn, pfn, false, true);
break;
}
 
@@ -558,6 +562,7 @@ static void FNAME(prefetch_page)(struct kvm_vcpu *vcpu,
 static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
int i, offset, nr_present;
+bool reset_host_protection;
 
offset = nr_present = 0;
 
@@ -595,9 +600,16 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp)
 
nr_present++;
		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
+		if (!(sp->spt[i] & SPTE_HOST_WRITEABLE)) {
+			pte_access &= ~PT_WRITABLE_MASK;
+			reset_host_protection = 0;
+		} else

[PATCH 3/3] add support for change_pte mmu notifiers

2009-09-10 Thread Izik Eidus
this is needed for kvm if it want ksm to directly map pages into its
shadow page tables.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   70 ++
 virt/kvm/kvm_main.c |   14 
 3 files changed, 77 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6046e6f..594d131 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -797,6 +797,7 @@ asmlinkage void kvm_handle_fault_on_reboot(void);
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a7151b8..3fd19f2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -282,6 +282,11 @@ static pfn_t spte_to_pfn(u64 pte)
	return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
 }
 
+static pte_t ptep_val(pte_t *ptep)
+{
+   return *ptep;
+}
+
 static gfn_t pse36_gfn_delta(u32 gpte)
 {
int shift = 32 - PT32_DIR_PSE36_SHIFT - PAGE_SHIFT;
@@ -748,7 +753,8 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
return write_protected;
 }
 
-static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
+  unsigned long data)
 {
u64 *spte;
int need_tlb_flush = 0;
@@ -763,8 +769,48 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long 
*rmapp)
return need_tlb_flush;
 }
 
+static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
+unsigned long data)
+{
+   int need_flush = 0;
+   u64 *spte, new_spte;
+   pte_t *ptep = (pte_t *)data;
+   pfn_t new_pfn;
+
+   new_pfn = pte_pfn(ptep_val(ptep));
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+   BUG_ON(!is_shadow_present_pte(*spte));
+		rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", spte, *spte);
+   need_flush = 1;
+   if (pte_write(ptep_val(ptep))) {
+   rmap_remove(kvm, spte);
+   __set_spte(spte, shadow_trap_nonpresent_pte);
+   spte = rmap_next(kvm, rmapp, NULL);
+   } else {
+			new_spte = *spte & ~(PT64_BASE_ADDR_MASK);
+			new_spte |= new_pfn << PAGE_SHIFT;
+
+			if (!pte_write(ptep_val(ptep))) {
+				new_spte &= ~PT_WRITABLE_MASK;
+				new_spte &= ~SPTE_HOST_WRITEABLE;
+   if (is_writeble_pte(*spte))
+   kvm_set_pfn_dirty(spte_to_pfn(*spte));
+   }
+   __set_spte(spte, new_spte);
+   spte = rmap_next(kvm, rmapp, spte);
+   }
+   }
+   if (need_flush)
+   kvm_flush_remote_tlbs(kvm);
+
+   return 0;
+}
+
 static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
- int (*handler)(struct kvm *kvm, unsigned long *rmapp))
+ unsigned long data,
+ int (*handler)(struct kvm *kvm, unsigned long *rmapp,
+unsigned long data))
 {
int i, j;
int retval = 0;
@@ -786,13 +832,15 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long 
hva,
		if (hva >= start && hva < end) {
			gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
 
-			retval |= handler(kvm, &memslot->rmap[gfn_offset]);
+			retval |= handler(kvm, &memslot->rmap[gfn_offset],
+					  data);
 
			for (j = 0; j < KVM_NR_PAGE_SIZES - 1; ++j) {
				int idx = gfn_offset;
				idx /= KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL + j);
				retval |= handler(kvm,
-					&memslot->lpage_info[j][idx].rmap_pde);
+					&memslot->lpage_info[j][idx].rmap_pde,
+   data);
}
}
}
@@ -802,10 +850,16 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long 
hva,
 
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
 {
-   return kvm_handle_hva(kvm, hva, kvm_unmap_rmapp);
+   return kvm_handle_hva(kvm, hva, 0, kvm_unmap_rmapp);
+}
+
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte

Re: kvm userspace: ksm support

2009-08-03 Thread Izik Eidus

Brian Jackson wrote:
If someone wanted to play around with ksm in qemu-kvm-0.x.x would it be as 
simple as adding the below additions to kvm_setup_guest_memory in kvm-all.c


qemu-kvm-0.x.x doesn't tell me much, but if it is the function that
registers the memory then yes...


(I just remember that qemu used to have something called phys_ram_base;
in that case it would just be a matter of calling madvise on phys_ram_base
with phys_ram_size as the length)
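
(A minimal sketch of what I mean - assuming the old layout where all guest RAM
is one contiguous mapping of phys_ram_size bytes starting at phys_ram_base;
those names are from the older qemu and may differ in your tree:)

#ifdef MADV_MERGEABLE
    madvise(phys_ram_base, phys_ram_size, MADV_MERGEABLE);
#endif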


 
(and adding the necessary kernel changes of course)?



On Tuesday 28 July 2009 11:39:59 am Izik Eidus wrote:
  

This patch is not for inclusion just rfc.

Thanks.


From 1297b86aa257100b3d819df9f9f0932bf4f7f49d Mon Sep 17 00:00:00 2001
From: Izik Eidus iei...@redhat.com
Date: Tue, 28 Jul 2009 19:14:26 +0300
Subject: [PATCH] kvm userspace: ksm support

rfc for ksm support to kvm userpsace.

thanks

Signed-off-by: Izik Eidus iei...@redhat.com
---
 exec.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index f6d9ec9..375cc18 100644
--- a/exec.c
+++ b/exec.c
@@ -2595,6 +2595,9 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
 new_block-host = file_ram_alloc(size, mem_path);
 if (!new_block-host) {
 new_block-host = qemu_vmalloc(size);
+#ifdef MADV_MERGEABLE
+madvise(new_block-host, size, MADV_MERGEABLE);
+#endif
 }
 new_block-offset = last_ram_offset;
 new_block-length = size;





Re: kvm userspace: ksm support

2009-08-03 Thread Izik Eidus

Brian Jackson wrote:

On Monday 03 August 2009 01:09:38 pm Izik Eidus wrote:
  

Brian Jackson wrote:


If someone wanted to play around with ksm in qemu-kvm-0.x.x would it be
as simple as adding the below additions to kvm_setup_guest_memory in
kvm-all.c
  

qemu-kvm-0.x.x doesnt tell me much, but if it is the function that
register the memory than yes...

(I just remember that qemu used to have something called phys_ram_base,
in that case it would be just making madvise on phys_ram_base with the
same of phys_ram_size)



Sorry, I'm using qemu-kvm-0.10.6


This is what qemu_ram_alloc looks like:



/* XXX: better than nothing */
ram_addr_t qemu_ram_alloc(ram_addr_t size)
{
ram_addr_t addr;
    if ((phys_ram_alloc_offset + size) > phys_ram_size) {
        fprintf(stderr, "Not enough memory (requested_size = %" PRIu64 ", max memory = %" PRIu64 ")\n",
                (uint64_t)size, (uint64_t)phys_ram_size);
abort();
}
addr = phys_ram_alloc_offset;
phys_ram_alloc_offset = TARGET_PAGE_ALIGN(phys_ram_alloc_offset + size);

if (kvm_enabled())
kvm_setup_guest_memory(phys_ram_base + addr, size);

return addr;
}


And this is what my new kvm_setup_guest_memory looks like:


void kvm_setup_guest_memory(void *start, size_t size)
{
if (!kvm_has_sync_mmu()) {
#ifdef MADV_DONTFORK
int ret = madvise(start, size, MADV_DONTFORK);

if (ret) {
            perror("madvice");
            exit(1);
        }
#else
        fprintf(stderr,
                "Need MADV_DONTFORK in absence of synchronous KVM MMU\n");
exit(1);
#endif
}
#ifdef MADV_MERGEABLE
madvise(start, size, MADV_MERGEABLE);
#endif
}



Look okay?


  


Yes.


kvm userspace: ksm support

2009-07-28 Thread Izik Eidus
This patch is not for inclusion just rfc.

Thanks.


From 1297b86aa257100b3d819df9f9f0932bf4f7f49d Mon Sep 17 00:00:00 2001
From: Izik Eidus iei...@redhat.com
Date: Tue, 28 Jul 2009 19:14:26 +0300
Subject: [PATCH] kvm userspace: ksm support

rfc for ksm support to kvm userspace.

thanks

Signed-off-by: Izik Eidus iei...@redhat.com
---
 exec.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index f6d9ec9..375cc18 100644
--- a/exec.c
+++ b/exec.c
@@ -2595,6 +2595,9 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
 new_block-host = file_ram_alloc(size, mem_path);
 if (!new_block-host) {
 new_block-host = qemu_vmalloc(size);
+#ifdef MADV_MERGEABLE
+madvise(new_block-host, size, MADV_MERGEABLE);
+#endif
 }
 new_block-offset = last_ram_offset;
 new_block-length = size;
-- 
1.5.6.5



Re: kvm userspace: ksm support

2009-07-28 Thread Izik Eidus

Anthony Liguori wrote:

Izik Eidus wrote:

This patch is not for inclusion just rfc.
  


The madvise() interface looks really nice :-)

Thanks.


From 1297b86aa257100b3d819df9f9f0932bf4f7f49d Mon Sep 17 00:00:00 2001
From: Izik Eidus iei...@redhat.com
Date: Tue, 28 Jul 2009 19:14:26 +0300
Subject: [PATCH] kvm userspace: ksm support

rfc for ksm support to kvm userpsace.

thanks

Signed-off-by: Izik Eidus iei...@redhat.com
---
 exec.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index f6d9ec9..375cc18 100644
--- a/exec.c
+++ b/exec.c
@@ -2595,6 +2595,9 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
 new_block-host = file_ram_alloc(size, mem_path);
 if (!new_block-host) {
 new_block-host = qemu_vmalloc(size);
+#ifdef MADV_MERGEABLE
+madvise(new_block-host, size, MADV_MERGEABLE);
+#endif
  


Are madvise calls additive?

Do we need to change the madvise balloon calls to include 
MADV_MERGEABLE or will this carry the property forever?


You mean: when we later make other madvise calls, will that remove
the MADV_MERGEABLE from that memory?

If yes, the answer is no; it should still be left in the vma->vm_flags...



I'd suggest doing the following in osdep.h too:

#if !defined(MADV_MERGABLE)
#define MADV_MERGABLE MADV_NORMAL
#endif

To avoid #ifdefs in .c files.


I tried to follow the way the DONTFORK madvise is handled...

So you are saying to just throw this thing into osdep.h instead of that c file?
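
(For illustration only, a sketch of that variant - the fallback define moves to
osdep.h and the call site in exec.c loses the #ifdef; using MADV_NORMAL as the
fallback value is Anthony's suggestion above, so on platforms without the flag
the call simply keeps the default behavior:)

/* osdep.h */
#ifndef MADV_MERGEABLE
#define MADV_MERGEABLE MADV_NORMAL
#endif

/* exec.c */
new_block->host = qemu_vmalloc(size);
madvise(new_block->host, size, MADV_MERGEABLE);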



Regards,

Anthony Liguori


 }
 new_block-offset = last_ram_offset;
 new_block-length = size;
  






Re: [patch 1/2] KVM: MMU: make __kvm_mmu_free_some_pages handle empty list

2009-07-28 Thread Izik Eidus

Marcelo Tosatti wrote:

From: Izik Eidus iei...@redhat.com

First check if the list is empty before attempting to look at list
entries.

Signed-off-by: Izik Eidus iei...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/kvm/mmu.c
===
--- kvm.orig/arch/x86/kvm/mmu.c
+++ kvm/arch/x86/kvm/mmu.c
@@ -2625,7 +2625,8 @@ EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page
 
 void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)

 {
-   while (vcpu-kvm-arch.n_free_mmu_pages  KVM_REFILL_PAGES) {
-	while (vcpu->kvm->arch.n_free_mmu_pages < KVM_REFILL_PAGES) {
+	while (vcpu->kvm->arch.n_free_mmu_pages < KVM_REFILL_PAGES &&
+	       !list_empty(&vcpu->kvm->arch.active_mmu_pages)) {
		struct kvm_mmu_page *sp;
 
		sp = container_of(vcpu->kvm->arch.active_mmu_pages.prev,


  

ack


Re: [patch 2/2] KVM: MMU: fix bogus alloc_mmu_pages assignment

2009-07-28 Thread Izik Eidus

Marcelo Tosatti wrote:

Remove the bogus n_free_mmu_pages assignment from alloc_mmu_pages.

It breaks accounting of mmu pages, since n_free_mmu_pages is modified
but the real number of pages remains the same.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/kvm/mmu.c
===
--- kvm.orig/arch/x86/kvm/mmu.c
+++ kvm/arch/x86/kvm/mmu.c
@@ -2706,14 +2706,6 @@ static int alloc_mmu_pages(struct kvm_vc
 
 	ASSERT(vcpu);
 
-	spin_lock(&vcpu->kvm->mmu_lock);
-	if (vcpu->kvm->arch.n_requested_mmu_pages)
-		vcpu->kvm->arch.n_free_mmu_pages =
-			vcpu->kvm->arch.n_requested_mmu_pages;
-	else
-		vcpu->kvm->arch.n_free_mmu_pages =
-			vcpu->kvm->arch.n_alloc_mmu_pages;
-	spin_unlock(&vcpu->kvm->mmu_lock);
/*
 * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
 * Therefore we need to allocate shadow page tables in the first


  

ack


Re: [PATCH 2/2] kvm: change the dirty page tracking to work with dirty bity

2009-06-11 Thread Izik Eidus

Avi Kivity wrote:

Izik Eidus wrote:
change the dirty page tracking to work with the dirty bit instead of
page faults.
right now the dirty page tracking works with the help of page faults: when we
want to track a page for being dirty, we write protect it and we mark it dirty
when we get a write page fault. this code moves that into looking at the
dirty bit of the spte.

  


I'm concerned about performance during the later stages of live 
migration.  Even if only 1000 pages are dirty, you still have to look 
at 2,000,000 or more ptes (for an 8GB guest).  That's a lot of overhead.


I think we need to use the page table hierarchy, write protect the 
upper page table so we know which page tables we need to look at.





Great idea, so i add another bitmap for the page directory?
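
(Something along these lines is what the idea would look like - a very rough
sketch with made-up names, not against the real kvm structures: one bit per
shadow page directory entry, set on write faults through that pde, so the
harvesting pass can skip every page table that was never written through:)

static int harvest_dirty(struct kvm *kvm, unsigned long *pde_dirty_bitmap,
                         int nr_pdes)
{
        int i, found = 0;

        for (i = 0; i < nr_pdes; i++) {
                if (!test_and_clear_bit(i, pde_dirty_bitmap))
                        continue;                       /* whole page table clean */
                found |= scan_ptes_of_pde(kvm, i);      /* made-up helper */
        }
        return found;
}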
 
+static int vmx_dirty_bit_support(void)

+{
+return false;
+}
  


It's false only when ept is enabled.



Yea, that i found out already



[PATCH 0/2] *** SUBJECT HERE ***

2009-06-10 Thread Izik Eidus
RFC: move the dirty page tracking to use dirty bit

Well, I was bored this morning and have had this idea for a while; I didn't test it
too much... first I want to hear what people think.

Thanks.

Izik Eidus (2):
  kvm: fix dirty bit tracking for slots with large pages
  kvm: change the dirty page tracking to work with dirty bit instead of
page fault

 arch/ia64/kvm/kvm-ia64.c|4 
 arch/powerpc/kvm/powerpc.c  |4 
 arch/s390/kvm/kvm-s390.c|4 
 arch/x86/include/asm/kvm_host.h |3 +++
 arch/x86/kvm/mmu.c  |   32 +---
 arch/x86/kvm/svm.c  |7 +++
 arch/x86/kvm/vmx.c  |7 +++
 arch/x86/kvm/x86.c  |   21 ++---
 include/linux/kvm_host.h|1 +
 virt/kvm/kvm_main.c |   17 -
 10 files changed, 89 insertions(+), 11 deletions(-)



[PATCH 1/2] kvm: fix dirty bit tracking for slots with large pages

2009-06-10 Thread Izik Eidus
When a slot is already allocated and is being asked to be tracked we need to
break the large pages.

This code flushes the mmu when someone asks a slot to start dirty bit tracking.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 virt/kvm/kvm_main.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 669eb4a..4a60c72 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1160,6 +1160,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
new.userspace_addr = mem-userspace_addr;
else
new.userspace_addr = 0;
+
+   kvm_arch_flush_shadow(kvm);
}
	if (npages && !new.lpage_info) {
largepages = 1 + (base_gfn + npages - 1) / KVM_PAGES_PER_HPAGE;
-- 
1.5.6.5



[PATCH 2/2] kvm: change the dirty page tracking to work with dirty bit instead of page fault

2009-06-10 Thread Izik Eidus
right now the dirty page tracking works with the help of page faults: when we
want to track a page for being dirty, we write protect it and we mark it dirty
when we get a write page fault. this code moves that into looking at the dirty
bit of the spte.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/ia64/kvm/kvm-ia64.c|4 
 arch/powerpc/kvm/powerpc.c  |4 
 arch/s390/kvm/kvm-s390.c|4 
 arch/x86/include/asm/kvm_host.h |3 +++
 arch/x86/kvm/mmu.c  |   32 +---
 arch/x86/kvm/svm.c  |7 +++
 arch/x86/kvm/vmx.c  |7 +++
 arch/x86/kvm/x86.c  |   21 ++---
 include/linux/kvm_host.h|1 +
 virt/kvm/kvm_main.c |   15 ++-
 10 files changed, 87 insertions(+), 11 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 3199221..5914128 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1809,6 +1809,10 @@ void kvm_arch_exit(void)
kvm_vmm_info = NULL;
 }
 
+void kvm_arch_get_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
+{
+}
+
 static int kvm_ia64_sync_dirty_log(struct kvm *kvm,
struct kvm_dirty_log *log)
 {
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 2cf915e..6beb368 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -418,6 +418,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
kvm_dirty_log *log)
return -ENOTSUPP;
 }
 
+void kvm_arch_get_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
+{
+}
+
 long kvm_arch_vm_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
 {
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 981ab04..ab6f115 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -130,6 +130,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
return 0;
 }
 
+void kvm_arch_get_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
+{
+}
+
 long kvm_arch_vm_ioctl(struct file *filp,
   unsigned int ioctl, unsigned long arg)
 {
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c7b0cc2..8a24149 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -527,6 +527,7 @@ struct kvm_x86_ops {
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*get_tdp_level)(void);
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+   int (*dirty_bit_support)(void);
 };
 
 extern struct kvm_x86_ops *kvm_x86_ops;
@@ -796,4 +797,6 @@ int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
 int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
 
+int is_dirty_and_clean_rmapp(struct kvm *kvm, unsigned long *rmapp);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 809cce0..3ec6a7d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -140,6 +140,8 @@ module_param(oos_shadow, bool, 0644);
 #define ACC_USER_MASKPT_USER_MASK
 #define ACC_ALL  (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
 
+#define SPTE_DONT_DIRTY (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
 struct kvm_rmap_desc {
@@ -629,6 +631,25 @@ static u64 *rmap_next(struct kvm *kvm, unsigned long 
*rmapp, u64 *spte)
return NULL;
 }
 
+int is_dirty_and_clean_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+   u64 *spte;
+   int dirty = 0;
+
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+		if (*spte & PT_DIRTY_MASK) {
+			set_shadow_pte(spte, (*spte &= ~PT_DIRTY_MASK) |
+				       SPTE_DONT_DIRTY);
+   dirty = 1;
+   }
+   spte = rmap_next(kvm, rmapp, spte);
+   }
+
+   return dirty;
+}
+
+
 static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 {
unsigned long *rmapp;
@@ -1676,7 +1697,10 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 
*shadow_pte,
 * whether the guest actually used the pte (in order to detect
 * demand paging).
 */
-   spte = shadow_base_present_pte | shadow_dirty_mask;
+   spte = shadow_base_present_pte;
+	if (!(spte & SPTE_DONT_DIRTY))
+   spte |= shadow_dirty_mask;
+
if (!speculative)
spte |= shadow_accessed_mask;
if (!dirty)
@@ -1725,8 +1749,10 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 
*shadow_pte,
}
}
 
-	if (pte_access & ACC_WRITE_MASK)
-		mark_page_dirty(vcpu->kvm, gfn);
+	if (!shadow_dirty_mask) {
+		if (pte_access & ACC_WRITE_MASK)
+			mark_page_dirty(vcpu->kvm, gfn);
+   }
 
 set_pte

Re: [PATCH 2/2] kvm: change the dirty page tracking to work with dirty bit instead of page fault

2009-06-10 Thread Izik Eidus

Few quick thoughts:


 
+void kvm_arch_get_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)

+{
+}
+
 long kvm_arch_vm_ioctl(struct file *filp,
   unsigned int ioctl, unsigned long arg)
 {
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c7b0cc2..8a24149 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -527,6 +527,7 @@ struct kvm_x86_ops {
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*get_tdp_level)(void);
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+   int (*dirty_bit_support)(void);
 };
 
 extern struct kvm_x86_ops *kvm_x86_ops;

@@ -796,4 +797,6 @@ int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
 int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
 
+int is_dirty_and_clean_rmapp(struct kvm *kvm, unsigned long *rmapp);

+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 809cce0..3ec6a7d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -140,6 +140,8 @@ module_param(oos_shadow, bool, 0644);
 #define ACC_USER_MASKPT_USER_MASK
 #define ACC_ALL  (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
 
+#define SPTE_DONT_DIRTY (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)

+
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
 struct kvm_rmap_desc {

@@ -629,6 +631,25 @@ static u64 *rmap_next(struct kvm *kvm, unsigned long 
*rmapp, u64 *spte)
return NULL;
 }
 
+int is_dirty_and_clean_rmapp(struct kvm *kvm, unsigned long *rmapp)

+{
+   u64 *spte;
+   int dirty = 0;
+
  


Here we should add:

if (!shadow_dirty_mask)
   return 0;



+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+		if (*spte & PT_DIRTY_MASK) {
+			set_shadow_pte(spte, (*spte &= ~PT_DIRTY_MASK) |
+				       SPTE_DONT_DIRTY);
+   dirty = 1;
+   }
+   spte = rmap_next(kvm, rmapp, spte);
+   }
+
+   return dirty;
+}
+




  */
@@ -1982,9 +1995,11 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 
 	/* If nothing is dirty, don't bother messing with page tables. */

if (is_dirty) {
-   spin_lock(kvm-mmu_lock);
-   kvm_mmu_slot_remove_write_access(kvm, log-slot);
-   spin_unlock(kvm-mmu_lock);
+	if (kvm_x86_ops->dirty_bit_support()) {
  


This should be: if (kvm_x86_ops->dirty_bit_support()) ->
if (!kvm_x86_ops->dirty_bit_support())



+   spin_lock(kvm-mmu_lock);
+   kvm_mmu_slot_remove_write_access(kvm, log-slot);
+   spin_unlock(kvm-mmu_lock);
+   }
  




Re: [PATCH 1/2] kvm: fix dirty bit tracking for slots with large pages

2009-06-10 Thread Izik Eidus

Avi Kivity wrote:

Izik Eidus wrote:
When slot is already allocted and being asked to be tracked we need 
to break the

large pages.

This code flush the mmu when someone ask a slot to start dirty bit 
tracking.


Signed-off-by: Izik Eidus iei...@redhat.com
---
 virt/kvm/kvm_main.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 669eb4a..4a60c72 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1160,6 +1160,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 new.userspace_addr = mem-userspace_addr;
 else
 new.userspace_addr = 0;
+
+kvm_arch_flush_shadow(kvm);
 }
 if (npages  !new.lpage_info) {
 largepages = 1 + (base_gfn + npages - 1) / KVM_PAGES_PER_HPAGE;
  


Ryan, can you try this out with your large page migration failures?


Wait, I think it is in the wrong place. I am sending a second series :(


[PATCH 0/2] RFC use dirty bit for page dirty tracking (v2)

2009-06-10 Thread Izik Eidus
RFC move to dirty bit tracking using the page table dirty bit (v2)

(BTW, it seems like the vnc code in mainline has some bugs; I wasted 2
hours debugging a rendering bug that I thought was related to this series,
but it turned out not to be related)

Thanks.

Izik Eidus (2):
  kvm: fix dirty bit tracking for slots with large pages
  kvm: change the dirty page tracking to work with dirty bity

 arch/ia64/kvm/kvm-ia64.c|4 +++
 arch/powerpc/kvm/powerpc.c  |4 +++
 arch/s390/kvm/kvm-s390.c|4 +++
 arch/x86/include/asm/kvm_host.h |3 ++
 arch/x86/kvm/mmu.c  |   42 --
 arch/x86/kvm/svm.c  |7 ++
 arch/x86/kvm/vmx.c  |7 ++
 arch/x86/kvm/x86.c  |   26 ---
 include/linux/kvm_host.h|1 +
 virt/kvm/kvm_main.c |8 ++-
 10 files changed, 98 insertions(+), 8 deletions(-)



[PATCH 1/2] kvm: fix dirty bit tracking for slots with large pages

2009-06-10 Thread Izik Eidus
When a slot is already allocated and is being asked to be tracked we need to
break the large pages.

This code flushes the mmu when someone asks a slot to start dirty bit tracking.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 virt/kvm/kvm_main.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 669eb4a..3046e9c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1194,6 +1194,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
if (!new.dirty_bitmap)
goto out_free;
memset(new.dirty_bitmap, 0, dirty_bytes);
+   if (old.npages)
+   kvm_arch_flush_shadow(kvm);
}
 #endif /* not defined CONFIG_S390 */
 
-- 
1.5.6.5



Re: [PATCH 2/2] kvm: change the dirty page tracking to work with dirty bity

2009-06-10 Thread Izik Eidus

Izik Eidus wrote:

+static int vmx_dirty_bit_support(void)
+{
+   return false;
+}
+
  



Again, idiotic bug: this should be:
return tdp_enable == false;


...
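
(So the pair of callbacks would look roughly like this - a sketch only; the
variable names and the svm side are my assumptions, the point being that NPT
does maintain the dirty bit in the nested page tables while EPT does not:)

static int vmx_dirty_bit_support(void)
{
        /* the spte dirty bit is only usable with shadow paging, i.e. no EPT */
        return tdp_enable == false;
}

static int svm_dirty_bit_support(void)
{
        return true;    /* assumption: NPT keeps dirty bits in the npt entries */
}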




Re: [PATCH 2/2] kvm: change the dirty page tracking to work with dirty bity

2009-06-10 Thread Izik Eidus

Marcelo Tosatti wrote:

On Wed, Jun 10, 2009 at 07:23:25PM +0300, Izik Eidus wrote:
  

change the dirty page tracking to work with the dirty bit instead of page faults.
right now the dirty page tracking works with the help of page faults: when we
want to track a page for being dirty, we write protect it and we mark it dirty
when we get a write page fault. this code moves that into looking at the dirty
bit of the spte.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/ia64/kvm/kvm-ia64.c|4 +++
 arch/powerpc/kvm/powerpc.c  |4 +++
 arch/s390/kvm/kvm-s390.c|4 +++
 arch/x86/include/asm/kvm_host.h |3 ++
 arch/x86/kvm/mmu.c  |   42 --
 arch/x86/kvm/svm.c  |7 ++
 arch/x86/kvm/vmx.c  |7 ++
 arch/x86/kvm/x86.c  |   26 ---
 include/linux/kvm_host.h|1 +
 virt/kvm/kvm_main.c |6 -
 10 files changed, 96 insertions(+), 8 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 3199221..5914128 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1809,6 +1809,10 @@ void kvm_arch_exit(void)
kvm_vmm_info = NULL;
 }
 
+void kvm_arch_get_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)

+{
+}
+
 static int kvm_ia64_sync_dirty_log(struct kvm *kvm,
struct kvm_dirty_log *log)
 {
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 2cf915e..6beb368 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -418,6 +418,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
kvm_dirty_log *log)
return -ENOTSUPP;
 }



  
 



#ifndef KVM_ARCH_HAVE_DIRTY_LOG
  

+void kvm_arch_get_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
+{
+}
+


#endif

in virt/kvm/main.c


  

index c7b0cc2..8a24149 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -527,6 +527,7 @@ struct kvm_x86_ops {
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*get_tdp_level)(void);
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+   int (*dirty_bit_support)(void);
 };
 
 extern struct kvm_x86_ops *kvm_x86_ops;

@@ -796,4 +797,6 @@ int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
 int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
 
+int is_dirty_and_clean_rmapp(struct kvm *kvm, unsigned long *rmapp);

+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 809cce0..500e0e2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -140,6 +140,8 @@ module_param(oos_shadow, bool, 0644);
 #define ACC_USER_MASKPT_USER_MASK
 #define ACC_ALL  (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
 
+#define SPTE_DONT_DIRTY (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)

+
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
 struct kvm_rmap_desc {

@@ -629,6 +631,29 @@ static u64 *rmap_next(struct kvm *kvm, unsigned long 
*rmapp, u64 *spte)
return NULL;
 }
 
+int is_dirty_and_clean_rmapp(struct kvm *kvm, unsigned long *rmapp)

+{
+   u64 *spte;
+   int dirty = 0;
+
+   if (!shadow_dirty_mask)
+   return 0;
+
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+		if (*spte & PT_DIRTY_MASK) {
+			set_shadow_pte(spte, (*spte &= ~PT_DIRTY_MASK) |
+				       SPTE_DONT_DIRTY);
+   dirty = 1;
+   break;
+   }
+   spte = rmap_next(kvm, rmapp, spte);
+   }
+
+   return dirty;
+}
+
+
 static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 {
unsigned long *rmapp;
@@ -1381,11 +1406,17 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
 static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
int ret;
+   int i;
+
	++kvm->stat.mmu_shadow_zapped;
	ret = mmu_zap_unsync_children(kvm, sp);
	kvm_mmu_page_unlink_children(kvm, sp);
	kvm_mmu_unlink_parents(kvm, sp);
	kvm_flush_remote_tlbs(kvm);
+	for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
+		if (sp->spt[i] & PT_DIRTY_MASK)
+			mark_page_dirty(kvm, sp->gfns[i]);
+   }



Also need to transfer dirty bit in other places probably.
  



Yes, i can think about some other case, but maybe i can avoid it using 
some trick.



  

	if (!sp->role.invalid && !sp->role.direct)
		unaccount_shadowed(kvm, sp->gfn);
	if (sp->unsync)
@@ -1676,7 +1707,10 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 
*shadow_pte,
 * whether the guest actually used the pte (in order to detect
 * demand paging).
 */
-   spte = shadow_base_present_pte | shadow_dirty_mask;
+   spte

Re: [PATCH 2/2] kvm: change the dirty page tracking to work with dirty bity

2009-06-10 Thread Izik Eidus

Izik Eidus wrote:

Marcelo Tosatti wrote:



 
 /* Free page dirty bitmap if unneeded */

-	if (!(new.flags & KVM_MEM_LOG_DIRTY_PAGES))
+	if (!(new.flags & KVM_MEM_LOG_DIRTY_PAGES)) {
		new.dirty_bitmap = NULL;
+		if (old.flags & KVM_MEM_LOG_DIRTY_PAGES)
+			kvm_arch_flush_shadow(kvm);
+}



Whats this for?
  


We have added all this SPTE_DONT_DIRTY...; when we stop dirty bit
tracking, we want to go back to setting the dirty bit for the spte
inside set_spte(), so that writing to the page will be faster.


Another way would be doing something like kvm_arch_clean_dont_dirty(), 
might be better than flushing the whole shadow page tables.
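
(Roughly something like this, as a sketch only - a hypothetical per-rmap helper
that the existing hva walker could drive; it just strips SPTE_DONT_DIRTY so
set_spte() starts setting the dirty bit again, without zapping anything:)

static int kvm_clean_dont_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp)
{
        u64 *spte;

        spte = rmap_next(kvm, rmapp, NULL);
        while (spte) {
                if (*spte & SPTE_DONT_DIRTY)
                        set_shadow_pte(spte, *spte & ~SPTE_DONT_DIRTY);
                spte = rmap_next(kvm, rmapp, spte);
        }
        return 0;
}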




Re: [PATCH 5/5] add ksm kernel shared memory driver.

2009-05-13 Thread Izik Eidus

Anthony Liguori wrote:

Chris Wright wrote:

* Andrew Morton (a...@linux-foundation.org) wrote:
 

Breaks ppc64 allmodconfig because that architecture doesn't export its
copy_user_page() to modules.



Things like this and updating to use madvise() I think all point towards
s/tristate/bool/.  I don't think CONFIG_KSM=M has huge benefit.
  


I agree.


I am sending in one sec the madvise patch that will kick it away from
being a module anyway...




Regards,

Anthony Liguori





Re: [PATCH 5/5] add ksm kernel shared memory driver.

2009-05-13 Thread Izik Eidus

Andrew Morton wrote:

On Mon, 20 Apr 2009 04:36:06 +0300
Izik Eidus iei...@redhat.com wrote:

  

Ksm is a driver that allows merging identical pages between one or more
applications, in a way invisible to the applications that use it.
Pages that are merged are marked as readonly and are COWed when any
application tries to change them.

Ksm is used for cases where using fork() is not suitable;
one of these cases is where the pages of the application keep changing
dynamically and the application cannot know in advance what pages are
going to be identical.

Ksm works by walking over the memory pages of the applications it
scans in order to find identical pages.
It uses two sorted data structures, called the stable and unstable trees,
to find the identical pages in an effective way.

When ksm finds two identical pages, it marks them as readonly and merges
them into a single page;
after the pages are marked as readonly and merged into one page, linux
will treat these pages as normal copy_on_write pages and will fork them
when write access happens to them.

Ksm scans just the memory areas that were registered to be scanned by it.

...
+   copy_user_highpage(kpage, page1, addr1, vma);
...



Breaks ppc64 allmodconfig because that architecture doesn't export its
copy_user_page() to modules.

Architectures are inconsistent about this.  x86 _does_ export it,
because it bounces it to the exported copy_page().

So can I ask that you sit down and work out upon which architectures it
really makes sense to offer KSM?  Disallow the others in Kconfig and
arrange for copy_user_highpage() to be available on the allowed architectures?
  


Hi

Is there some way (a script) that I can run that will allow compiling this
code for every possible arch?


(I don't mind allowing it just for archs that support virtualization -
x86, ia64, powerpc, s390 - but is it the right thing to do?)

Thanks.
  




Re: [PATCH 5/5] add ksm kernel shared memory driver.

2009-04-30 Thread Izik Eidus
On Tue, 28 Apr 2009 02:12:00 +0300
Izik Eidus iei...@redhat.com wrote:

 Andrew Morton wrote:

  Breaks sparc64 and probably lots of other architectures:
 
  mm/ksm.c: In function `try_to_merge_two_pages_alloc':
  mm/ksm.c:697: error: `_PAGE_RW' undeclared (first use in this
  function)
 
  there should be an official arch-independent way of manipulating
  vma-vm_page_prot, but I'm not immediately finding it.

 Hi,
 
 vm_get_page_prot() will probably do the work.
 
 I will send you patch that fix it,
 but first i am waiting for Andrea and Chris to say they are happy
 with small changes that i made to the api after conversation i had
 with them (about checking if this api is robust enough so we wont
 have to change it later)
 
 When i will get their acks, i will send you patch against this
 togather with the api (until then it is ok to just leave it only for
 x86)
 
 changes are:
 1) limiting the number of memory regions registered per file
 descriptor 
 - so while (1){ (ioctl(KSM_REGISTER_MEMORY_REGION()) ) wont omm the
 host
 
 2) checking if memory is overlap in registration (more effective to 
 ignore such cases)
 
 3) allow removing specific memoy regions inside fd.
 
 Thanks.
 

Hi,

The following patches change the api to be more robust; the resulting change of
the api came after a conversation I had with Andrea and Chris about how
to make the api as stable as we can.

In addition I hope this patchset fixes the cross compilation problems; I
compiled it on itanium (which doesn't have _PAGE_RW) and it seems to work.

Thanks.
From 108b720636d1e679e8d5378469fa1220ce1e6963 Mon Sep 17 00:00:00 2001
From: Izik Eidus iei...@redhat.com
Date: Thu, 30 Apr 2009 20:36:57 +0300
Subject: [PATCH 09/13] ksm: limiting the num of mem regions user can register 
per fd.

Right now a user can open a /dev/ksm fd and register an unlimited number of
regions; such behavior may allocate an unlimited amount of kernel memory
and get the whole host into an out of memory situation.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 mm/ksm.c |   15 +++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 6165276..d58db6b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -48,6 +48,9 @@ static int rmap_hash_size;
 module_param(rmap_hash_size, int, 0);
MODULE_PARM_DESC(rmap_hash_size, "Hash table size for the reverse mapping");
 
+static int regions_per_fd;
+module_param(regions_per_fd, int, 0);
+
 /*
  * ksm_mem_slot - hold information for an userspace scanning range
  * (the scanning for this region will be from addr untill addr +
@@ -67,6 +70,7 @@ struct ksm_mem_slot {
  */
 struct ksm_sma {
struct list_head sma_slots;
+   int nregions;
 };
 
 /**
@@ -453,6 +457,11 @@ static int ksm_sma_ioctl_register_memory_region(struct 
ksm_sma *ksm_sma,
struct ksm_mem_slot *slot;
int ret = -EPERM;
 
+	if ((ksm_sma->nregions + 1) > regions_per_fd) {
+   ret = -EBUSY;
+   goto out;
+   }
+
slot = kzalloc(sizeof(struct ksm_mem_slot), GFP_KERNEL);
if (!slot) {
ret = -ENOMEM;
@@ -473,6 +482,7 @@ static int ksm_sma_ioctl_register_memory_region(struct 
ksm_sma *ksm_sma,
 
	list_add_tail(&slot->link, &slots);
	list_add_tail(&slot->sma_link, &ksm_sma->sma_slots);
+	ksm_sma->nregions++;
 
	up_write(&slots_lock);
return 0;
@@ -511,6 +521,7 @@ static int ksm_sma_ioctl_remove_memory_region(struct 
ksm_sma *ksm_sma)
		mmput(slot->mm);
		list_del(&slot->sma_link);
		kfree(slot);
+		ksm_sma->nregions--;
	}
	up_write(&slots_lock);
return 0;
@@ -1389,6 +1400,7 @@ static int ksm_dev_ioctl_create_shared_memory_area(void)
}
 
	INIT_LIST_HEAD(&ksm_sma->sma_slots);
+	ksm_sma->nregions = 0;
 
	fd = anon_inode_getfd("ksm-sma", &ksm_sma_fops, ksm_sma, 0);
	if (fd < 0)
@@ -1631,6 +1643,9 @@ static int __init ksm_init(void)
if (r)
goto out_free1;
 
+   if (!regions_per_fd)
+   regions_per_fd = 1024;
+
	ksm_thread = kthread_run(ksm_scan_thread, NULL, "kksmd");
	if (IS_ERR(ksm_thread)) {
		printk(KERN_ERR "ksm: creating kthread failed\n");
-- 
1.5.6.5

From f24a9aa8c049c951a33613909951d115be5f84cd Mon Sep 17 00:00:00 2001
From: Izik Eidus iei...@redhat.com
Date: Thu, 30 Apr 2009 20:37:17 +0300
Subject: [PATCH 10/13] ksm: dont allow overlap memory addresses registrations.

subjects say it all.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 mm/ksm.c |   58 ++
 1 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index d58db6b..982dfff 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -451,21 +451,71 @@ static void remove_page_from_tree(struct mm_struct *mm,
remove_rmap_item_from_tree(rmap_item);
 }
 
+static inline int is_intersecting_address(unsigned long addr

Re: [PATCH 5/5] add ksm kernel shared memory driver.

2009-04-27 Thread Izik Eidus

Andrew Morton wrote:

On Mon, 20 Apr 2009 04:36:06 +0300
Izik Eidus iei...@redhat.com wrote:

  

Ksm is driver that allow merging identical pages between one or more
applications in way unvisible to the application that use it.
Pages that are merged are marked as readonly and are COWed when any
application try to change them.



Breaks sparc64 and probably lots of other architectures:

mm/ksm.c: In function `try_to_merge_two_pages_alloc':
mm/ksm.c:697: error: `_PAGE_RW' undeclared (first use in this function)

there should be an official arch-independent way of manipulating
vma-vm_page_prot, but I'm not immediately finding it.
  

Hi,

vm_get_page_prot() will probably do the work.

I will send you a patch that fixes it,
but first I am waiting for Andrea and Chris to say they are happy with
small changes that I made to the api after a conversation I had with them
(about checking if this api is robust enough so we won't have to change
it later).


When I get their acks, I will send you a patch against this together
with the api (until then it is ok to just leave it only for x86).


changes are:
1) limiting the number of memory regions registered per file descriptor
- so while (1) { ioctl(KSM_REGISTER_MEMORY_REGION()) } won't OOM the host


2) checking if memory overlaps in registration (more effective to
ignore such cases)


3) allow removing specific memory regions inside an fd.

Thanks.
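
(For the record, the arch-independent direction I have in mind is roughly the
following - a sketch, not the final patch; the helper name is made up:)

static pgprot_t ksm_readonly_prot(struct vm_area_struct *vma)
{
        /* build a write-protected protection from the vma's own flags,
         * instead of clearing _PAGE_RW by hand */
        return vm_get_page_prot(vma->vm_flags & ~VM_WRITE);
}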



An alternative (and quite inferior) fix would be to disable ksm on
architectures which don't implement _PAGE_RW.  That's most of them.

  




Re: [PATCH 5/5] add ksm kernel shared memory driver.

2009-04-20 Thread Izik Eidus

Avi Kivity wrote:

Alan Cox wrote:

The minor number you are using already belongs to another project.

10,234 is free but it would be good to know what device naming is
proposed. I imagine other folks would like to know why you aren't using
sysfs or similar or extending /dev/kvm ?
  


ksm was deliberately made independent of kvm.  While there may or may 
not be uses of ksm without kvm (you could run ordinary qemu, but no 
one would do this in a production deployment), keeping them separate 
helps avoid unnecessary interdependencies.  For example all tlb 
flushes are mediated through mmu notifiers instead of ksm hooking 
directly into kvm.



Yes; besides, I do use sysfs for controlling the ksm behavior.
Ioctls are provided as an easier way for an application to register its memory.


[PATCH 0/5] ksm - dynamic page sharing driver for linux v4

2009-04-19 Thread Izik Eidus
/pages_to_scan
   echo 1 > /sys/kernel/mm/ksm/sleep
   echo 1 > /sys/kernel/mm/ksm/run
   (Or any other numbers...)


Ok, you are ready :-)

(Just remember, memory that is swapped isn't scanned by ksm until it
 comes back to memory, so don't try to raise a lot of VMs together)


Thanks.


Izik Eidus (5):
  MMU_NOTIFIERS: add set_pte_at_notify()
  add get_pte(): helper function: fetching pte for va
  add page_wrprotect(): write protecting page.
  add replace_page(): change the page pte is pointing to.
  add ksm kernel shared memory driver.

 include/linux/ksm.h  |   48 ++
 include/linux/miscdevice.h   |1 +
 include/linux/mm.h   |   29 +
 include/linux/mmu_notifier.h |   34 +
 include/linux/rmap.h |   11 +
 mm/Kconfig   |6 +
 mm/Makefile  |1 +
 mm/ksm.c | 1675 ++
 mm/memory.c  |   90 +++-
 mm/mmu_notifier.c|   20 +
 mm/rmap.c|  139 
 11 files changed, 2052 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c



[PATCH 1/5] MMU_NOTIFIERS: add set_pte_at_notify()

2009-04-19 Thread Izik Eidus
this macro allows setting the pte in the shadow page tables directly,
instead of flushing the shadow page table entry and then getting a vmexit
in order to set it.

This function is an optimization for kvm/users of mmu_notifiers for COW
pages; it is useful for kvm when ksm is used, because it allows kvm
not to have to receive a VMEXIT and only then map the shared page into
the mmu shadow pages, but instead map it directly at the same time
linux maps the page into the host page table.

this mmu notifier macro works by calling a callback that will map
the physical page directly into the shadow page tables.

(users of mmu_notifiers that didn't implement the set_pte_at_notify()
callback will just receive the mmu_notifier_invalidate_page callback)
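
(To illustrate how a secondary mmu is expected to consume this - a minimal
sketch, not kvm's actual code; the my_* names are made up and
my_remap_secondary() stands for whatever updates the secondary mapping:)

static void my_change_pte(struct mmu_notifier *mn, struct mm_struct *mm,
                          unsigned long address, pte_t pte)
{
        /* remap the secondary mmu to the new pfn with the new protection */
        my_remap_secondary(mn, address, pte_pfn(pte), pte_write(pte));
}

static const struct mmu_notifier_ops my_notifier_ops = {
        .change_pte = my_change_pte,
        /* other callbacks (invalidate_page etc.) omitted */
};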

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/mmu_notifier.h |   34 ++
 mm/memory.c  |   10 --
 mm/mmu_notifier.c|   20 
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b77486d..8bb245f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,15 @@ struct mmu_notifier_ops {
 struct mm_struct *mm,
 unsigned long address);
 
+   /* 
+   * change_pte is called in cases that pte mapping into page is changed
+   * for example when ksm mapped pte to point into a new shared page.
+   */
+   void (*change_pte)(struct mmu_notifier *mn,
+  struct mm_struct *mm,
+  unsigned long address,
+  pte_t pte);
+
/*
 * Before this is invoked any secondary MMU is still ok to
 * read/write to the page previously pointed to by the Linux
@@ -154,6 +163,8 @@ extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
  unsigned long address);
+extern void __mmu_notifier_change_pte(struct mm_struct *mm, 
+ unsigned long address, pte_t pte);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
@@ -175,6 +186,13 @@ static inline int mmu_notifier_clear_flush_young(struct 
mm_struct *mm,
return 0;
 }
 
+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+  unsigned long address, pte_t pte)
+{
+   if (mm_has_notifiers(mm))
+   __mmu_notifier_change_pte(mm, address, pte);
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address)
 {
@@ -236,6 +254,16 @@ static inline void mmu_notifier_mm_destroy(struct 
mm_struct *mm)
__young;\
 })
 
+#define set_pte_at_notify(__mm, __address, __ptep, __pte)  \
+({ \
+   struct mm_struct *___mm = __mm; \
+   unsigned long ___address = __address;   \
+   pte_t ___pte = __pte;   \
+   \
+   set_pte_at(__mm, __address, __ptep, ___pte);\
+   mmu_notifier_change_pte(___mm, ___address, ___pte); \
+})
+
 #else /* CONFIG_MMU_NOTIFIER */
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -248,6 +276,11 @@ static inline int mmu_notifier_clear_flush_young(struct 
mm_struct *mm,
return 0;
 }
 
+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+  unsigned long address, pte_t pte)
+{
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address)
 {
@@ -273,6 +306,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct 
*mm)
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define ptep_clear_flush_notify ptep_clear_flush
+#define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */
 
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..1e1a14b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2051,9 +2051,15 @@ gotten:
 * seen in the presence of one thread doing SMC and another
 * thread doing COW.
 */
-   ptep_clear_flush_notify(vma, address, page_table);
+   ptep_clear_flush(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address

[PATCH 2/5] add get_pte(): helper function: fetching pte for va

2009-04-19 Thread Izik Eidus
get_pte() receives the mm_struct of a task and a virtual address and returns
the pte corresponding to it.

this function returns NULL in case it couldn't fetch the pte.
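
(Typical usage would look like this - an illustrative sketch only; note the pte
is mapped with pte_offset_map(), so the caller is expected to pte_unmap() it:)

static int va_is_present(struct mm_struct *mm, unsigned long addr)
{
        pte_t *ptep = get_pte(mm, addr);
        int present = 0;

        if (ptep) {
                present = pte_present(*ptep);
                pte_unmap(ptep);
        }
        return present;
}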

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/mm.h |   24 
 1 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bff1f0d..9a34109 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -894,6 +894,30 @@ int vma_wants_writenotify(struct vm_area_struct *vma);
 
 extern pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr, 
spinlock_t **ptl);
 
+static inline pte_t *get_pte(struct mm_struct *mm, unsigned long addr)
+{
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep = NULL;
+
+   pgd = pgd_offset(mm, addr);
+   if (!pgd_present(*pgd))
+   goto out;
+
+   pud = pud_offset(pgd, addr);
+   if (!pud_present(*pud))
+   goto out;
+
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   goto out;
+
+   ptep = pte_offset_map(pmd, addr);
+out:
+   return ptep;
+}
+
 #ifdef __PAGETABLE_PUD_FOLDED
 static inline int __pud_alloc(struct mm_struct *mm, pgd_t *pgd,
unsigned long address)
-- 
1.5.6.5



[PATCH 4/5] add replace_page(): change the page pte is pointing to.

2009-04-19 Thread Izik Eidus
replace_page() allows changing the mapping of a pte from one physical page
to a different physical page.

this function works by removing oldpage from the rmap and calling
put_page on it, and by setting the pte to point to newpage and
inserting it into the rmap using page_add_file_rmap().

note: newpage must be a non anonymous page. the reason for this is:
replace_page() is built to allow mapping one page into more than one
virtual address; the mapping of this page can happen at different
offsets inside each vma, and therefore we cannot trust page->index
anymore.

the side effect of this issue is that newpage cannot be anything but a
kernel allocated page that is not swappable.
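
(An illustrative call site - a sketch of how a caller such as ksm would use it,
with the page protection choice being an assumption, not the exact ksm code:)

static int merge_one_page(struct vm_area_struct *vma, struct page *oldpage,
                          struct page *kpage, pte_t orig_pte)
{
        /* oldpage is assumed already write protected and verified identical
         * to kpage before we get here */
        pgprot_t prot = vm_get_page_prot(vma->vm_flags & ~VM_WRITE);

        return replace_page(vma, oldpage, kpage, orig_pte, prot);
}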

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/mm.h |5 +++
 mm/memory.c|   80 
 2 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9a34109..a0ddfb5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1264,6 +1264,11 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned 
long addr,
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot);
+#endif
+
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
unsigned int foll_flags);
 #define FOLL_WRITE 0x01/* check pte is writable */
diff --git a/mm/memory.c b/mm/memory.c
index 1e1a14b..d6e53c2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1567,6 +1567,86 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned 
long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+/**
+ * replace_page - replace page in vma with new page
+ * @vma:  vma that hold the pte oldpage is pointed by.
+ * @oldpage:  the page we are replacing with newpage
+ * @newpage:  the page we replace oldpage with
+ * @orig_pte: the original value of the pte
+ * @prot: page protection bits
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ *
+ * Note: @newpage must not be an anonymous page because replace_page() does
+ * not change the mapping of @newpage to have the same values as @oldpage.
+ * @newpage can be mapped in several vmas at different offsets (page->index).
+ */
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep;
+   spinlock_t *ptl;
+   unsigned long addr;
+   int ret;
+
+   BUG_ON(PageAnon(newpage));
+
+   ret = -EFAULT;
+   addr = page_address_in_vma(oldpage, vma);
+   if (addr == -EFAULT)
+   goto out;
+
+   pgd = pgd_offset(mm, addr);
+   if (!pgd_present(*pgd))
+   goto out;
+
+   pud = pud_offset(pgd, addr);
+   if (!pud_present(*pud))
+   goto out;
+
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   goto out;
+
+   ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+   if (!ptep)
+   goto out;
+
+   if (!pte_same(*ptep, orig_pte)) {
+   pte_unmap_unlock(ptep, ptl);
+   goto out;
+   }
+
+   ret = 0;
+   get_page(newpage);
+   page_add_file_rmap(newpage);
+
+   flush_cache_page(vma, addr, pte_pfn(*ptep));
+   ptep_clear_flush(vma, addr, ptep);
+   set_pte_at_notify(mm, addr, ptep, mk_pte(newpage, prot));
+
+   page_remove_rmap(oldpage);
+   if (PageAnon(oldpage)) {
+   dec_mm_counter(mm, anon_rss);
+   inc_mm_counter(mm, file_rss);
+   }
+   put_page(oldpage);
+
+   pte_unmap_unlock(ptep, ptl);
+
+out:
+   return ret;
+}
+EXPORT_SYMBOL_GPL(replace_page);
+
+#endif
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results
-- 
1.5.6.5



[PATCH 5/5] add ksm kernel shared memory driver.

2009-04-19 Thread Izik Eidus
Ksm is a driver that allows merging identical pages between one or more
applications, in a way that is invisible to the applications that use it.
Pages that are merged are marked as read-only and are COWed when any
application tries to change them.

Ksm is used for cases where using fork() is not suitable,
one of these cases being where the pages of the application keep changing
dynamically and the application cannot know in advance which pages are
going to be identical.

Ksm works by walking over the memory pages of the applications it
scans in order to find identical pages.
It uses two sorted data structures, called the stable and unstable trees,
to find the identical pages in an effective way.

When ksm finds two identical pages, it marks them as read-only and merges
them into a single page; after the pages are marked as read-only and
merged into one page, linux will treat these pages as normal
copy_on_write pages and will fork them when write access happens to them.

Ksm scans just the memory areas that were registered to be scanned by it.
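
As a rough sketch of the per-page decision the scanner makes (all function
names below are illustrative, not the driver's actual ones):

/* Illustrative sketch of one scan step over a candidate page. */
static void scan_one_page(struct page *page)
{
	struct page *match;

	/* pages in the stable tree are already write protected */
	match = stable_tree_search(page);
	if (match) {
		merge_with_stable_page(page, match);
		return;
	}

	/* otherwise look for a partner among not-yet-shared pages */
	match = unstable_tree_search_insert(page);
	if (match && merge_pages(page, match) == 0) {
		/* a new shared page was created: promote it */
		stable_tree_insert(match);
	}
	/* no match: the page stays as it is until a later pass */
}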

Signed-off-by: Izik Eidus iei...@redhat.com
Signed-off-by: Chris Wright chr...@redhat.com
Signed-off-by: Andrea Arcangeli aarca...@redhat.com
---
 include/linux/ksm.h|   48 ++
 include/linux/miscdevice.h |1 +
 mm/Kconfig |6 +
 mm/Makefile|1 +
 mm/ksm.c   | 1675 
 5 files changed, 1731 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 000..2c11e9a
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,48 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#include <asm/types.h>
+
+#define KSM_API_VERSION 1
+
+#define ksm_control_flags_run 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+   __u32 npages; /* number of pages to share */
+   __u32 pad;
+   __u64 addr; /* the begining of the virtual address */
+__u64 reserved_bits;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+
+#define KSM_GET_API_VERSION  _IO(KSMIO,   0x00)
+/*
+ * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory reagion fd
+ */
+#define KSM_CREATE_SHARED_MEMORY_AREA    _IO(KSMIO,   0x01) /* return SMA fd */
+
+/* ioctls for SMA fds */
+
+/*
+ * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
+ * scanned by kvm.
+ */
+#define KSM_REGISTER_MEMORY_REGION   _IOW(KSMIO,  0x20,\
+ struct ksm_memory_region)
+/*
+ * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
+ */
+#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO,   0x21)
+
+#endif
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index beb6ec9..297c0bb 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -30,6 +30,7 @@
 #define HPET_MINOR 228
 #define FUSE_MINOR 229
 #define KVM_MINOR  232
+#define KSM_MINOR  233
 #define MISC_DYNAMIC_MINOR 255
 
 struct device;
diff --git a/mm/Kconfig b/mm/Kconfig
index 57971d2..fb8ac63 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -225,3 +225,9 @@ config HAVE_MLOCKED_PAGE_BIT
 
 config MMU_NOTIFIER
bool
+
+config KSM
+   tristate "Enable KSM for page sharing"
+   help
+ Enable the KSM kernel module to allow page sharing of equal pages
+ among different tasks.
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..b885513 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
diff --git a/mm/ksm.c b/mm/ksm.c
new file mode 100644
index 000..7fd4158
--- /dev/null
+++ b/mm/ksm.c
@@ -0,0 +1,1675 @@
+/*
+ * Memory merging driver for Linux
+ *
+ * This module enables dynamic sharing of identical pages found in different
+ * memory areas, even if they are not shared by fork()
+ *
+ * Copyright (C) 2008 Red Hat, Inc.
+ * Authors:
+ * Izik Eidus
+ * Andrea Arcangeli
+ * Chris Wright
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/vmalloc.h>
+#include <linux/file.h>
+#include <linux/mman.h>
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <linux/pagemap.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/spinlock.h>
+#include <linux/jhash.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/wait.h>

Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux v3

2009-04-16 Thread Izik Eidus

Nick Piggin wrote:

On Wednesday 15 April 2009 08:09:03 Andrew Morton wrote:
  

On Thu,  9 Apr 2009 06:58:37 +0300
Izik Eidus iei...@redhat.com wrote:



KSM is a linux driver that allows dynamicly sharing identical memory
pages between one or more processes.
  

Generally looks OK to me.  But that doesn't mean much.  We should rub
bottles with words like hugh and nick on them to be sure.



I haven't looked too closely at it yet sorry. Hugh has a great eye for
these details, though, hint hint :)

As everyone knows, my favourite thing is to say nasty things about any
new feature that adds complexity to common code.


The whole idea, and the way I wrote it, is that it won't touch common
code; I didn't change the linux mm logic anywhere.

The worst thing we have added is helper functions.


 I feel like crying to
hear about how many more instances of MS Office we can all run, if only
we apply this patch.


And more instances of linux guests...


 And the poorly written HPC app just sounds like
scrapings from the bottom of justification barrel.
  


So if you have a big rendering application that loads gigabytes of
geometrical data handled by many threads, and each thread sometimes
changes this geometrical data without wanting the other threads to
notice it, how would you share it in the traditional way? After shared
data gets COWed once, how will you recollect it again when it becomes
identical?

KSM does it for applications transparently.

KSM's original motivation was indeed KVM, where it is highly needed;
you may check what VMware says about the fact that they have much better
overcommit than Hyper-V / Xen:


http://blogs.vmware.com/virtualreality/2008/03/cheap-hyperviso.html

It is important to understand that in virtualization environments there
are cases where memory is much more critical than any other resource
for higher density.


Together with KSM, KVM will have the same memory overcommit abilities
that VMware has.

I'm sorry, maybe I'm way off with my understanding of how important
this is. There isn't too much help in the changelog. A discussion of
where the memory savings comes from,


The memory saving comes from identical libraries, identical kernels and
zeroed pages - that is for virtualization.
The library code will always be identical among similar guests, so why
keep this code in multiple places in host memory?



 and how far does things like
sharing of fs image, or ballooning goes and how much extra savings we
get from this...


Ballooning is much worse when it comes to performance, because what it
does is shrink the guest memory; with KSM we find identical pages and
merge them into one page, so we don't get a guest performance loss.



 with people from other hypervisors involved as well.
Have I missed this kind of discussion?

Careful what you wish for, ay? :)
  




Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-15 Thread Izik Eidus

Andrew Morton wrote:

On Thu,  9 Apr 2009 06:58:41 +0300
Izik Eidus iei...@redhat.com wrote:

  


Confused.  In the covering email you indicated that v2 of the patchset
had abandoned ioctls and had moved the interface to sysfs.
  
We have abandoned the ioctls that control the ksm behaviour (how much cpu
it takes, how many kernel pages it may allocate and so on...),
but we still use ioctls to register the application memory to be used
with ksm.



It would be good to completely (and briefly) describe KSM's proposed
userspace intefaces in the changelog or somewhere.  I'm a bit confused.
  


I will post a new, clean description of the ksm api with V4.




  


+static pte_t *get_pte(struct mm_struct *mm, unsigned long addr)
+{
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep = NULL;
+
+   pgd = pgd_offset(mm, addr);
+   if (!pgd_present(*pgd))
+   goto out;
+
+   pud = pud_offset(pgd, addr);
+   if (!pud_present(*pud))
+   goto out;
+
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   goto out;
+
+   ptep = pte_offset_map(pmd, addr);
+out:
+   return ptep;
+}



hm, this looks very generic.  Does it duplicate anything which core
kernel already provides? 


I don't think so.


 If not, perhaps core kernel should provide
this (perhaps after some reorganisation).
  


A quick grep on the code shows me at least 2 places that can use this function.
One is:
remove_migration_pte() inside migrate.c
and the other is:
page_check_address() inside rmap.c

I will post an inline get_ptep() function with V4; worst case I will get
nacked.


  

...

+static int rmap_hash_init(void)
+{
+   if (!rmap_hash_size) {
+   struct sysinfo sinfo;
+
+   si_meminfo(sinfo);
+   rmap_hash_size = sinfo.totalram / 10;



One slot per ten pages of physical memory?  Is this too large, too
small or just right?
  


It highly depends on the number of processes / memory regions that will be
registered inside ksm.

It is a module parameter, so the user can change it to whatever they want.

  

+   }
+   nrmaps_hash = rmap_hash_size;
+   rmap_hash = vmalloc(nrmaps_hash * sizeof(struct hlist_head));
+   if (!rmap_hash)
+   return -ENOMEM;
+   memset(rmap_hash, 0, nrmaps_hash * sizeof(struct hlist_head));
+   return 0;
+}
+

...

+static void break_cow(struct mm_struct *mm, unsigned long addr)
+{
+   struct page *page[1];
+
+   down_read(&mm->mmap_sem);
+   if (get_user_pages(current, mm, addr, 1, 1, 0, page, NULL)) {
+   put_page(page[0]);
+   }
+   up_read(&mm->mmap_sem);
+}



- unneeded braces around single statement

- that single statement is over-indented.

- and it seems wrong.  If get_user_pages() returned, say, -ENOMEM, we
  end up doing put_page(random-uninitialised-address-from-stack-go-oops)?
  


Good catch.
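
One way to address it, keeping the same get_user_pages() call, would be
(a sketch, not the actual V4 fix):

static void break_cow(struct mm_struct *mm, unsigned long addr)
{
	struct page *page[1];
	int ret;

	down_read(&mm->mmap_sem);
	/* write-fault the address; a negative return means no page was pinned */
	ret = get_user_pages(current, mm, addr, 1, 1, 0, page, NULL);
	if (ret == 1)
		put_page(page[0]);
	up_read(&mm->mmap_sem);
}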

  

...

+static int ksm_sma_ioctl_register_memory_region(struct ksm_sma *ksm_sma,
+   struct ksm_memory_region *mem)
+{
+   struct ksm_mem_slot *slot;
+   int ret = -EPERM;
+
+   slot = kzalloc(sizeof(struct ksm_mem_slot), GFP_KERNEL);
+   if (!slot) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   slot->mm = get_task_mm(current);
+   if (!slot->mm)
+   goto out_free;
+   slot->addr = mem->addr;
+   slot->npages = mem->npages;
+
+   down_write(&slots_lock);
+
+   list_add_tail(&slot->link, &slots);
+   list_add_tail(&slot->sma_link, &ksm_sma->sma_slots);
+
+   up_write(&slots_lock);
+   return 0;
+
+out_free:
+   kfree(slot);
+out:
+   return ret;
+}



So this function pins the mm_struct.  I wonder what the implications of
this are. 


The mm struct won't go away until the file is closed... (the application
closes the file descriptor, or the application dies).



 Not much, I guess.  Some comments in the code which explain
the object lifecycles would be nice.
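
In other words, the lifecycle comment being asked for boils down to
something like this in the SMA fd's release path (a sketch; the function
name and the private_data use are assumptions, only the slot fields come
from the code above):

/*
 * Sketch: the mm reference taken with get_task_mm() at registration
 * time is only dropped when the SMA file descriptor goes away.
 */
static int ksm_sma_release_sketch(struct inode *inode, struct file *filp)
{
	struct ksm_sma *ksm_sma = filp->private_data;
	struct ksm_mem_slot *slot, *tmp;

	down_write(&slots_lock);
	list_for_each_entry_safe(slot, tmp, &ksm_sma->sma_slots, sma_link) {
		list_del(&slot->link);
		list_del(&slot->sma_link);
		mmput(slot->mm);	/* pairs with get_task_mm() */
		kfree(slot);
	}
	up_write(&slots_lock);

	kfree(ksm_sma);
	return 0;
}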

  


...

+static int memcmp_pages(struct page *page1, struct page *page2)
+{
+   char *addr1, *addr2;
+   int r;
+
+   addr1 = kmap_atomic(page1, KM_USER0);
+   addr2 = kmap_atomic(page2, KM_USER1);
+   r = memcmp(addr1, addr2, PAGE_SIZE);
+   kunmap_atomic(addr1, KM_USER0);
+   kunmap_atomic(addr2, KM_USER1);
+   return r;
+}



I wonder if this code all does enough cpu cache flushing to be able to
guarantee that it's looking at valid data.  Not my area, and presumably
not an issue on x86.
  


Andrea pointed out in a previous reply that, because we run
page_wrprotect() on these pages, memcmp_pages() should be stable.


  

...

+static int try_to_merge_one_page(struct mm_struct *mm,
+struct vm_area_struct *vma,
+struct page *oldpage,
+struct page *newpage

Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-15 Thread Izik Eidus

Jeremy Fitzhardinge wrote:

Andrew Morton wrote:

+static pte_t *get_pte(struct mm_struct *mm, unsigned long addr)
+{
+pgd_t *pgd;
+pud_t *pud;
+pmd_t *pmd;
+pte_t *ptep = NULL;
+
+pgd = pgd_offset(mm, addr);
+if (!pgd_present(*pgd))
+goto out;
+
+pud = pud_offset(pgd, addr);
+if (!pud_present(*pud))
+goto out;
+
+pmd = pmd_offset(pud, addr);
+if (!pmd_present(*pmd))
+goto out;
+
+ptep = pte_offset_map(pmd, addr);
+out:
+return ptep;
+}



hm, this looks very generic.  Does it duplicate anything which core
kernel already provides?  If not, perhaps core kernel should provide
this (perhaps after some reorganisation).
  


It is lookup_address(), except that it works on user addresses, and as
such is very useful.


But ksm needs the pgd offset of an mm struct, not the kernel pgd, so
maybe changing it to take the pgd offset would be nice..


Another thing: it is just for x86 right now, so it probably needs to move
out to the common code.


But it would need to deal with returning a level so it can deal with 
large pages in usermode, and have some well-defined semantics on 
whether the caller is responsible for unmapping the returned thing 
(ie, only if its a pte).


I implemented this myself a couple of months ago, but I can't find it 
anywhere...



+static int memcmp_pages(struct page *page1, struct page *page2)
+{
+char *addr1, *addr2;
+int r;
+
+addr1 = kmap_atomic(page1, KM_USER0);
+addr2 = kmap_atomic(page2, KM_USER1);
+r = memcmp(addr1, addr2, PAGE_SIZE);
+kunmap_atomic(addr1, KM_USER0);
+kunmap_atomic(addr2, KM_USER1);
+return r;
+}



I wonder if this code all does enough cpu cache flushing to be able to
guarantee that it's looking at valid data.  Not my area, and presumably
not an issue on x86.
  


Shouldn't that be kmap_atomic's job anyway?  Otherwise it would be 
hard to use on any virtual-tag/indexed cache machine.


   J




Re: [PATCH 1/3] kvm: dont hold pagecount reference for mapped sptes pages.

2009-04-12 Thread Izik Eidus

Izik Eidus wrote:

Marcelo Tosatti wrote:

On Tue, Mar 31, 2009 at 03:00:02AM +0300, Izik Eidus wrote:
 

When using mmu notifiers, we are allowed to remove the page count
reference taken by get_user_pages to a specific page that is mapped
inside the shadow page tables.

This is needed so we can balance the pagecount against mapcount
checking.

(Right now kvm increases the pagecount and does not increase the
mapcount when mapping a page into a shadow page table entry,
so when comparing pagecount against mapcount, you have no
reliable result.)



IMO ifdef'ing CONFIG_MMU_NOTIFIERS here (and keeping the ref if unset)
instead of in the backward compat code gives less room for headaches.

  

That was the first version of this patch, Avi preferred not to do it...


Avi, do you mind if I change it to use the IFDEF?


Re: [PATCH 1/3] kvm: dont hold pagecount reference for mapped sptes pages.

2009-04-09 Thread Izik Eidus

Marcelo Tosatti wrote:

On Tue, Mar 31, 2009 at 03:00:02AM +0300, Izik Eidus wrote:
  

When using mmu notifiers, we are allowed to remove the page count
reference taken by get_user_pages to a specific page that is mapped
inside the shadow page tables.

This is needed so we can balance the pagecount against mapcount
checking.

(Right now kvm increases the pagecount and does not increase the
mapcount when mapping a page into a shadow page table entry,
so when comparing pagecount against mapcount, you have no
reliable result.)



IMO ifdef'ing CONFIG_MMU_NOTIFIERS here (and keeping the ref if unset)
instead of in the backward compat code gives less room for headaches.

  

That was the first version of this patch, Avi preferred not to do it...


[PATCH 3/4] add replace_page(): change the page pte is pointing to.

2009-04-08 Thread Izik Eidus
replace_page() allows changing the mapping of a pte from one physical page
to a different physical page.

The function works by removing oldpage from the rmap and calling
put_page() on it, and by setting the pte to point at newpage and
inserting newpage into the rmap with page_add_file_rmap().

Note: newpage must be a non-anonymous page.  The reason is that
replace_page() is built to allow mapping one page at more than one
virtual address, and the mapping of this page can be at different
offsets inside each vma, so we can no longer trust page->index.

The side effect of this is that newpage cannot be anything but a
kernel-allocated page that is not swappable.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/mm.h |5 +++
 mm/memory.c|   80 
 2 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bff1f0d..7a831ce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1240,6 +1240,11 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned 
long addr,
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot);
+#endif
+
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
unsigned int foll_flags);
 #define FOLL_WRITE 0x01/* check pte is writable */
diff --git a/mm/memory.c b/mm/memory.c
index 1e1a14b..d6e53c2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1567,6 +1567,86 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned 
long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+/**
+ * replace_page - replace page in vma with new page
+ * @vma:  vma that hold the pte oldpage is pointed by.
+ * @oldpage:  the page we are replacing with newpage
+ * @newpage:  the page we replace oldpage with
+ * @orig_pte: the original value of the pte
+ * @prot: page protection bits
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ *
+ * Note: @newpage must not be an anonymous page because replace_page() does
+ * not change the mapping of @newpage to have the same values as @oldpage.
+ * @newpage can be mapped in several vmas at different offsets (page->index).
+ */
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep;
+   spinlock_t *ptl;
+   unsigned long addr;
+   int ret;
+
+   BUG_ON(PageAnon(newpage));
+
+   ret = -EFAULT;
+   addr = page_address_in_vma(oldpage, vma);
+   if (addr == -EFAULT)
+   goto out;
+
+   pgd = pgd_offset(mm, addr);
+   if (!pgd_present(*pgd))
+   goto out;
+
+   pud = pud_offset(pgd, addr);
+   if (!pud_present(*pud))
+   goto out;
+
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   goto out;
+
+   ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+   if (!ptep)
+   goto out;
+
+   if (!pte_same(*ptep, orig_pte)) {
+   pte_unmap_unlock(ptep, ptl);
+   goto out;
+   }
+
+   ret = 0;
+   get_page(newpage);
+   page_add_file_rmap(newpage);
+
+   flush_cache_page(vma, addr, pte_pfn(*ptep));
+   ptep_clear_flush(vma, addr, ptep);
+   set_pte_at_notify(mm, addr, ptep, mk_pte(newpage, prot));
+
+   page_remove_rmap(oldpage);
+   if (PageAnon(oldpage)) {
+   dec_mm_counter(mm, anon_rss);
+   inc_mm_counter(mm, file_rss);
+   }
+   put_page(oldpage);
+
+   pte_unmap_unlock(ptep, ptl);
+
+out:
+   return ret;
+}
+EXPORT_SYMBOL_GPL(replace_page);
+
+#endif
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results
-- 
1.5.6.5



[PATCH 1/4] MMU_NOTIFIERS: add set_pte_at_notify()

2009-04-08 Thread Izik Eidus
This macro allows setting the pte in the shadow page tables directly,
instead of flushing the shadow page table entry and then getting a vmexit
in order to set it.

This is an optimization for kvm/users of mmu_notifiers for COW
pages. It is useful for kvm when ksm is used, because it allows kvm
not to have to receive a VMEXIT and only then map the shared page into
the mmu shadow pages; instead it maps it directly at the same time
linux maps the page into the host page table.

This mmu notifier macro works by calling a callback that will map
the physical page directly into the shadow page tables.

(Users of mmu_notifiers that didn't implement the set_pte_at_notify()
callback will just receive the mmu_notifier_invalidate_page callback.)
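
For a secondary MMU that registered an mmu_notifier, wiring up the new
callback amounts to updating its own translation in place rather than
waiting for the next fault; a purely illustrative sketch (none of these
names are from the patch):

static void example_change_pte(struct mmu_notifier *mn,
			       struct mm_struct *mm,
			       unsigned long address, pte_t pte)
{
	struct example_smmu *smmu = container_of(mn, struct example_smmu, mn);

	/*
	 * Re-point the secondary mapping of 'address' at the new
	 * physical page instead of dropping it and faulting later.
	 */
	example_smmu_set_translation(smmu, address, pte_pfn(pte),
				     pte_write(pte));
}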

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/mmu_notifier.h |   34 ++
 mm/memory.c  |   10 --
 mm/mmu_notifier.c|   20 
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b77486d..8bb245f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,15 @@ struct mmu_notifier_ops {
 struct mm_struct *mm,
 unsigned long address);
 
+   /* 
+   * change_pte is called in cases that pte mapping into page is changed
+   * for example when ksm mapped pte to point into a new shared page.
+   */
+   void (*change_pte)(struct mmu_notifier *mn,
+  struct mm_struct *mm,
+  unsigned long address,
+  pte_t pte);
+
/*
 * Before this is invoked any secondary MMU is still ok to
 * read/write to the page previously pointed to by the Linux
@@ -154,6 +163,8 @@ extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
  unsigned long address);
+extern void __mmu_notifier_change_pte(struct mm_struct *mm, 
+ unsigned long address, pte_t pte);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
@@ -175,6 +186,13 @@ static inline int mmu_notifier_clear_flush_young(struct 
mm_struct *mm,
return 0;
 }
 
+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+  unsigned long address, pte_t pte)
+{
+   if (mm_has_notifiers(mm))
+   __mmu_notifier_change_pte(mm, address, pte);
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address)
 {
@@ -236,6 +254,16 @@ static inline void mmu_notifier_mm_destroy(struct 
mm_struct *mm)
__young;\
 })
 
+#define set_pte_at_notify(__mm, __address, __ptep, __pte)  \
+({ \
+   struct mm_struct *___mm = __mm; \
+   unsigned long ___address = __address;   \
+   pte_t ___pte = __pte;   \
+   \
+   set_pte_at(__mm, __address, __ptep, ___pte);\
+   mmu_notifier_change_pte(___mm, ___address, ___pte); \
+})
+
 #else /* CONFIG_MMU_NOTIFIER */
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -248,6 +276,11 @@ static inline int mmu_notifier_clear_flush_young(struct 
mm_struct *mm,
return 0;
 }
 
+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+  unsigned long address, pte_t pte)
+{
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address)
 {
@@ -273,6 +306,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct 
*mm)
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define ptep_clear_flush_notify ptep_clear_flush
+#define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */
 
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..1e1a14b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2051,9 +2051,15 @@ gotten:
 * seen in the presence of one thread doing SMC and another
 * thread doing COW.
 */
-   ptep_clear_flush_notify(vma, address, page_table);
+   ptep_clear_flush(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address

[PATCH 0/4] ksm - dynamic page sharing driver for linux v3

2009-04-08 Thread Izik Eidus
 and recreate the entire unstable tree each round
  of memory scanning - so if we have corruption, it will be fixed
  when we rebuild the tree.
   b) Ksm uses an RB-tree whose balancing is based on the node color
  and not on the content, so even if a page gets corrupted, it still
  takes the same amount of time to search it.

3) In addition to the unstable tree, ksm holds another tree called the
   stable tree - this tree is an RB-tree that is sorted by the pages'
   content and all its pages are write protected, and therefore it can't
   get corrupted.
   Each time ksm finds two identical pages using the unstable tree,
   it will create a new write-protected shared page, and this page will be
   inserted into the stable tree and kept there; the
   stable tree, unlike the unstable tree, is never thrown away, so each
   page that we find is kept inside it.

Taking into account the three levels that described above, the algorithm
work like that:

search primary tree (sorted by entire page contents, pages write protected)
- if match found, merge
- if no match found...
  - search secondary tree (sorted by entire page contents, pages not write
protected)
- if match found, merge
  - remove from secondary tree and insert merged page into primary tree
- if no match found...
  - checksum
- if checksum hasn't changed
  - insert into secondary tree
- if it has, store updated checksum (note: first time this page
  is handled it won't have a checksum, so checksum will appear
  as changed, so it takes two passes w/ no other matches to
  get into secondary tree)
  - do not insert into any tree, will see it again on next pass

The basic idea of this algorithm is that even if the unstable tree doesn't
promise to find two identical pages in the first round, we will
probably find them in the second or the third or the tenth round;
then, after we have found these two identical pages just once, we insert
them into the stable tree, where they are protected forever.
So the whole idea of the unstable tree is just to build the stable tree,
and then we find the identical pages using it.
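
The checksum gating that guards the unstable tree (see the list above)
amounts to roughly this per-page test (a sketch with illustrative names;
calc_checksum() is the jhash-based helper from the ksm.c patch):

/* Sketch of the volatility filter in front of the unstable tree. */
static int page_seems_stable(struct page *page, u32 *oldchecksum)
{
	u32 checksum = calc_checksum(page);

	if (checksum != *oldchecksum) {
		/* contents changed since the last pass: remember and skip */
		*oldchecksum = checksum;
		return 0;
	}
	/* unchanged for a whole pass: worth inserting into the tree */
	return 1;
}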

The current implementation can be improved a lot:
we don't have to calculate an expensive checksum, we can just use the host
dirty bit.

Currently we don't support swapping of shared pages (other pages that are
not shared can be swapped, i.e. all the pages that we didn't find to be
identical to other pages).

Walking the tree, we keep calling get_user_pages(); we can optimize it
by saving the pfn and using mmu notifiers to know when the virtual address
mapping has changed.

We currently scan just programs that were registered to be used by ksm; we
would later want to add the ability to tell ksm to scan PIDs (so you can
scan closed binary applications as well).

Right now ksm scanning is done by just one thread; multiple scanner
support might be needed.

This driver is very useful for KVM, as in cases of running multiple guest
operating systems of the same type.
(For desktop workloads we have achieved more than x2 memory overcommit
(more like x3).)

This driver has found users other than KVM, for example CERN,
Fons Rademakers:
on many-core machines we run one large detector simulation program per core.
These simulation programs are identical but run each in their own process and
need about 2 - 2.5 GB RAM.
We typically buy machines with 2GB RAM per core and so have a problem running
one of these programs per core.
Of the 2 - 2.5 GB about 700MB is identical data in the form of magnetic field
maps, detector geometry, etc.
Currently people have been trying to start one program, initialize the geometry
and field maps and then fork it N times, to have the data shared.
With KSM this would be done automatically by the system so it sounded extremely
attractive when Andrea presented it.

I am sending another series of patches for the kvm kernel and kvm-userspace
that will allow users of kvm to test ksm with it.
The kvm patches apply to Avi's git tree.



Izik Eidus (4):
  MMU_NOTIFIERS: add set_pte_at_notify()
  add page_wrprotect(): write protecting page.
  add replace_page(): change the page pte is pointing to.
  add ksm kernel shared memory driver.

 include/linux/ksm.h  |   48 ++
 include/linux/miscdevice.h   |1 +
 include/linux/mm.h   |5 +
 include/linux/mmu_notifier.h |   34 +
 include/linux/rmap.h |   11 +
 mm/Kconfig   |6 +
 mm/Makefile  |1 +
 mm/ksm.c | 1674 ++
 mm/memory.c  |   90 +++-
 mm/mmu_notifier.c|   20 +
 mm/rmap.c|  139 
 11 files changed, 2027 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c


[PATCH 2/4] add page_wrprotect(): write protecting page.

2009-04-08 Thread Izik Eidus
This patch adds a new function called page_wrprotect().
page_wrprotect() is used to take a page and mark all the ptes that
point into it as readonly.

The function works by walking the rmap of the page and setting
each pte related to the page as readonly.

The odirect_sync parameter is used to protect against possible races
with O_DIRECT while we are marking the ptes as readonly,
as noted by Andrea Arcangeli:

While thinking at get_user_pages_fast I figured another worse way
things can go wrong with ksm and o_direct: think a thread writing
constantly to the last 512bytes of a page, while another thread read
and writes to/from the first 512bytes of the page. We can lose
O_DIRECT reads, the very moment we mark any pte wrprotected...

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/rmap.h |   11 
 mm/rmap.c|  139 ++
 2 files changed, 150 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b35bc0e..469376d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,6 +118,10 @@ static inline int try_to_munlock(struct page *page)
 }
 #endif
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
+#endif
+
 #else  /* !CONFIG_MMU */
 
 #define anon_vma_init()do {} while (0)
@@ -132,6 +136,13 @@ static inline int page_mkclean(struct page *page)
return 0;
 }
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+static inline int page_wrprotect(struct page *page, int *odirect_sync,
+int count_offset)
+{
+   return 0;
+}
+#endif
 
 #endif /* CONFIG_MMU */
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 1652166..95c55ea 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -585,6 +585,145 @@ int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
+ int *odirect_sync, int count_offset)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   unsigned long address;
+   pte_t *pte;
+   spinlock_t *ptl;
+   int ret = 0;
+
+   address = vma_address(page, vma);
+   if (address == -EFAULT)
+   goto out;
+
+   pte = page_check_address(page, mm, address, &ptl, 0);
+   if (!pte)
+   goto out;
+
+   if (pte_write(*pte)) {
+   pte_t entry;
+
+   flush_cache_page(vma, address, pte_pfn(*pte));
+   /*
+* Ok this is tricky, when get_user_pages_fast() run it doesnt
+* take any lock, therefore the check that we are going to make
+* with the pagecount against the mapcount is racey and
+* O_DIRECT can happen right after the check.
+* So we clear the pte and flush the tlb before the check
+* this assure us that no O_DIRECT can happen after the check
+* or in the middle of the check.
+*/
+   entry = ptep_clear_flush(vma, address, pte);
+   /*
+* Check that no O_DIRECT or similar I/O is in progress on the
+* page
+*/
+   if ((page_mapcount(page) + count_offset) != page_count(page)) {
+   *odirect_sync = 0;
+   set_pte_at_notify(mm, address, pte, entry);
+   goto out_unlock;
+   }
+   entry = pte_wrprotect(entry);
+   set_pte_at_notify(mm, address, pte, entry);
+   }
+   ret = 1;
+
+out_unlock:
+   pte_unmap_unlock(pte, ptl);
+out:
+   return ret;
+}
+
+static int page_wrprotect_file(struct page *page, int *odirect_sync,
+  int count_offset)
+{
+   struct address_space *mapping;
+   struct prio_tree_iter iter;
+   struct vm_area_struct *vma;
+   pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+   int ret = 0;
+
+   mapping = page_mapping(page);
+   if (!mapping)
+   return ret;
+
+   spin_lock(&mapping->i_mmap_lock);
+
+   vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
+   ret += page_wrprotect_one(page, vma, odirect_sync,
+ count_offset);
+
+   spin_unlock(&mapping->i_mmap_lock);
+
+   return ret;
+}
+
+static int page_wrprotect_anon(struct page *page, int *odirect_sync,
+  int count_offset)
+{
+   struct vm_area_struct *vma;
+   struct anon_vma *anon_vma;
+   int ret = 0;
+
+   anon_vma = page_lock_anon_vma(page);
+   if (!anon_vma)
+   return ret;
+
+   /*
+* If the page is inside the swap cache, its _count number was
+* increased by one, therefore we have to increase

[PATCH 4/4] add ksm kernel shared memory driver.

2009-04-08 Thread Izik Eidus
Ksm is a driver that allows merging identical pages between one or more
applications, in a way that is invisible to the applications that use it.
Pages that are merged are marked as read-only and are COWed when any
application tries to change them.

Ksm is used for cases where using fork() is not suitable,
one of these cases being where the pages of the application keep changing
dynamically and the application cannot know in advance which pages are
going to be identical.

Ksm works by walking over the memory pages of the applications it
scans in order to find identical pages.
It uses two sorted data structures, called the stable and unstable trees,
to find the identical pages in an effective way.

When ksm finds two identical pages, it marks them as read-only and merges
them into a single page; after the pages are marked as read-only and
merged into one page, linux will treat these pages as normal
copy_on_write pages and will fork them when write access happens to them.

Ksm scans just the memory areas that were registered to be scanned by it.

Ksm api:

KSM_GET_API_VERSION:
Give userspace the api version of the module.

KSM_CREATE_SHARED_MEMORY_AREA:
Create a shared memory region fd, which later allows the user to register
the memory region to scan by using:
KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION

KSM_REGISTER_MEMORY_REGION:
Register a userspace virtual address range to be scanned by ksm.
This ioctl uses the ksm_memory_region structure:
ksm_memory_region:
__u32 npages;
 number of pages to share inside this memory region.
__u32 pad;
__u64 addr:
 the beginning of the virtual address of this region.
__u64 reserved_bits;
 reserved bits for future usage.

KSM_REMOVE_MEMORY_REGION:
Remove a memory region from ksm.
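
A minimal userspace sketch of driving this api (error handling omitted;
the device path, ioctl names and structure come from the description
above, the surrounding function is illustrative):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/ksm.h>

static int register_region_with_ksm(void *addr, unsigned int npages)
{
	struct ksm_memory_region region = {
		.npages = npages,
		.addr = (unsigned long)addr,
	};
	int ksm_fd, sma_fd;

	ksm_fd = open("/dev/ksm", O_RDWR);

	/* one shared-memory-area fd per registering task */
	sma_fd = ioctl(ksm_fd, KSM_CREATE_SHARED_MEMORY_AREA);

	/* ask ksm to scan npages starting at addr */
	return ioctl(sma_fd, KSM_REGISTER_MEMORY_REGION, &region);
}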

Signed-off-by: Izik Eidus iei...@redhat.com
Signed-off-by: Chris Wright chr...@redhat.com
Signed-off-by: Andrea Arcangeli aarca...@redhat.com
---
 include/linux/ksm.h|   48 ++
 include/linux/miscdevice.h |1 +
 mm/Kconfig |6 +
 mm/Makefile|1 +
 mm/ksm.c   | 1674 
 5 files changed, 1730 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 000..2c11e9a
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,48 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#include <asm/types.h>
+
+#define KSM_API_VERSION 1
+
+#define ksm_control_flags_run 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+   __u32 npages; /* number of pages to share */
+   __u32 pad;
+   __u64 addr; /* the begining of the virtual address */
+__u64 reserved_bits;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+
+#define KSM_GET_API_VERSION  _IO(KSMIO,   0x00)
+/*
+ * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory reagion fd
+ */
+#define KSM_CREATE_SHARED_MEMORY_AREA    _IO(KSMIO,   0x01) /* return SMA fd */
+
+/* ioctls for SMA fds */
+
+/*
+ * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
+ * scanned by kvm.
+ */
+#define KSM_REGISTER_MEMORY_REGION   _IOW(KSMIO,  0x20,\
+ struct ksm_memory_region)
+/*
+ * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
+ */
+#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO,   0x21)
+
+#endif
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index beb6ec9..297c0bb 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -30,6 +30,7 @@
 #define HPET_MINOR 228
 #define FUSE_MINOR 229
 #define KVM_MINOR  232
+#define KSM_MINOR  233
 #define MISC_DYNAMIC_MINOR 255
 
 struct device;
diff --git a/mm/Kconfig b/mm/Kconfig
index b53427a..3f3fd04 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -223,3 +223,9 @@ config HAVE_MLOCKED_PAGE_BIT
 
 config MMU_NOTIFIER
bool
+
+config KSM
+   tristate "Enable KSM for page sharing"
+   help
+ Enable the KSM kernel module to allow page sharing of equal pages
+ among different tasks.
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..b885513 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
diff --git a/mm/ksm.c b/mm/ksm.c
new file mode 100644
index 000..a15a92d
--- /dev/null
+++ b/mm/ksm.c
@@ -0,0 +1,1674 @@
+/*
+ * Memory merging driver for Linux
+ *
+ * This module enables dynamic sharing of identical pages

Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-06 Thread Izik Eidus

Andrey Panin wrote:

On 094, 04 04, 2009 at 05:35:22PM +0300, Izik Eidus wrote:

SNIP

  

+static inline u32 calc_checksum(struct page *page)
+{
+   u32 checksum;
+   void *addr = kmap_atomic(page, KM_USER0);
+   checksum = jhash(addr, PAGE_SIZE, 17);



Why is jhash2() not used here? It's faster and leads to smaller code size.
  


Because I didn't know; I will check that and change it.

Thanks.

(We should really use the in-cpu CRC for Intel Nehalem, and the dirty bit
for the rest of the architectures...)


  

+   kunmap_atomic(addr, KM_USER0);
+   return checksum;
+}
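
For reference, jhash2() walks the buffer as u32 words, so the suggested
change would look roughly like this (a sketch of the proposal, not posted
code):

static inline u32 calc_checksum(struct page *page)
{
	u32 checksum;
	void *addr = kmap_atomic(page, KM_USER0);

	/* PAGE_SIZE is a multiple of 4, so hash it as u32 words */
	checksum = jhash2(addr, PAGE_SIZE / 4, 17);
	kunmap_atomic(addr, KM_USER0);
	return checksum;
}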



  




[PATCH 3/4] add replace_page(): change the page pte is pointing to.

2009-04-04 Thread Izik Eidus
replace_page() allows changing the mapping of a pte from one physical page
to a different physical page.

The function works by removing oldpage from the rmap and calling
put_page() on it, and by setting the pte to point at newpage and
inserting newpage into the rmap with page_add_file_rmap().

Note: newpage must be a non-anonymous page.  The reason is that
replace_page() is built to allow mapping one page at more than one
virtual address, and the mapping of this page can be at different
offsets inside each vma, so we can no longer trust page->index.

The side effect of this is that newpage cannot be anything but a
kernel-allocated page that is not swappable.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/mm.h |5 +++
 mm/memory.c|   80 
 2 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bff1f0d..7a831ce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1240,6 +1240,11 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned 
long addr,
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot);
+#endif
+
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
unsigned int foll_flags);
 #define FOLL_WRITE 0x01/* check pte is writable */
diff --git a/mm/memory.c b/mm/memory.c
index 1e1a14b..d6e53c2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1567,6 +1567,86 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned 
long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+/**
+ * replace_page - replace page in vma with new page
+ * @vma:  vma that hold the pte oldpage is pointed by.
+ * @oldpage:  the page we are replacing with newpage
+ * @newpage:  the page we replace oldpage with
+ * @orig_pte: the original value of the pte
+ * @prot: page protection bits
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ *
+ * Note: @newpage must not be an anonymous page because replace_page() does
+ * not change the mapping of @newpage to have the same values as @oldpage.
+ * @newpage can be mapped in several vmas at different offsets (page->index).
+ */
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep;
+   spinlock_t *ptl;
+   unsigned long addr;
+   int ret;
+
+   BUG_ON(PageAnon(newpage));
+
+   ret = -EFAULT;
+   addr = page_address_in_vma(oldpage, vma);
+   if (addr == -EFAULT)
+   goto out;
+
+   pgd = pgd_offset(mm, addr);
+   if (!pgd_present(*pgd))
+   goto out;
+
+   pud = pud_offset(pgd, addr);
+   if (!pud_present(*pud))
+   goto out;
+
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   goto out;
+
+   ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+   if (!ptep)
+   goto out;
+
+   if (!pte_same(*ptep, orig_pte)) {
+   pte_unmap_unlock(ptep, ptl);
+   goto out;
+   }
+
+   ret = 0;
+   get_page(newpage);
+   page_add_file_rmap(newpage);
+
+   flush_cache_page(vma, addr, pte_pfn(*ptep));
+   ptep_clear_flush(vma, addr, ptep);
+   set_pte_at_notify(mm, addr, ptep, mk_pte(newpage, prot));
+
+   page_remove_rmap(oldpage);
+   if (PageAnon(oldpage)) {
+   dec_mm_counter(mm, anon_rss);
+   inc_mm_counter(mm, file_rss);
+   }
+   put_page(oldpage);
+
+   pte_unmap_unlock(ptep, ptl);
+
+out:
+   return ret;
+}
+EXPORT_SYMBOL_GPL(replace_page);
+
+#endif
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results
-- 
1.5.6.5



[PATCH 2/4] add page_wrprotect(): write protecting page.

2009-04-04 Thread Izik Eidus
This patch adds a new function called page_wrprotect().
page_wrprotect() is used to take a page and mark all the ptes that
point into it as readonly.

The function works by walking the rmap of the page and setting
each pte related to the page as readonly.

The odirect_sync parameter is used to protect against possible races
with O_DIRECT while we are marking the ptes as readonly,
as noted by Andrea Arcangeli:

While thinking at get_user_pages_fast I figured another worse way
things can go wrong with ksm and o_direct: think a thread writing
constantly to the last 512bytes of a page, while another thread read
and writes to/from the first 512bytes of the page. We can lose
O_DIRECT reads, the very moment we mark any pte wrprotected...

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/rmap.h |   11 
 mm/rmap.c|  139 ++
 2 files changed, 150 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b35bc0e..469376d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,6 +118,10 @@ static inline int try_to_munlock(struct page *page)
 }
 #endif
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
+#endif
+
 #else  /* !CONFIG_MMU */
 
 #define anon_vma_init()do {} while (0)
@@ -132,6 +136,13 @@ static inline int page_mkclean(struct page *page)
return 0;
 }
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+static inline int page_wrprotect(struct page *page, int *odirect_sync,
+int count_offset)
+{
+   return 0;
+}
+#endif
 
 #endif /* CONFIG_MMU */
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 1652166..95c55ea 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -585,6 +585,145 @@ int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
+ int *odirect_sync, int count_offset)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   unsigned long address;
+   pte_t *pte;
+   spinlock_t *ptl;
+   int ret = 0;
+
+   address = vma_address(page, vma);
+   if (address == -EFAULT)
+   goto out;
+
+   pte = page_check_address(page, mm, address, &ptl, 0);
+   if (!pte)
+   goto out;
+
+   if (pte_write(*pte)) {
+   pte_t entry;
+
+   flush_cache_page(vma, address, pte_pfn(*pte));
+   /*
+* Ok this is tricky, when get_user_pages_fast() run it doesnt
+* take any lock, therefore the check that we are going to make
+* with the pagecount against the mapcount is racey and
+* O_DIRECT can happen right after the check.
+* So we clear the pte and flush the tlb before the check
+* this assure us that no O_DIRECT can happen after the check
+* or in the middle of the check.
+*/
+   entry = ptep_clear_flush(vma, address, pte);
+   /*
+* Check that no O_DIRECT or similar I/O is in progress on the
+* page
+*/
+   if ((page_mapcount(page) + count_offset) != page_count(page)) {
+   *odirect_sync = 0;
+   set_pte_at_notify(mm, address, pte, entry);
+   goto out_unlock;
+   }
+   entry = pte_wrprotect(entry);
+   set_pte_at_notify(mm, address, pte, entry);
+   }
+   ret = 1;
+
+out_unlock:
+   pte_unmap_unlock(pte, ptl);
+out:
+   return ret;
+}
+
+static int page_wrprotect_file(struct page *page, int *odirect_sync,
+  int count_offset)
+{
+   struct address_space *mapping;
+   struct prio_tree_iter iter;
+   struct vm_area_struct *vma;
+   pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+   int ret = 0;
+
+   mapping = page_mapping(page);
+   if (!mapping)
+   return ret;
+
+   spin_lock(&mapping->i_mmap_lock);
+
+   vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
+   ret += page_wrprotect_one(page, vma, odirect_sync,
+ count_offset);
+
+   spin_unlock(&mapping->i_mmap_lock);
+
+   return ret;
+}
+
+static int page_wrprotect_anon(struct page *page, int *odirect_sync,
+  int count_offset)
+{
+   struct vm_area_struct *vma;
+   struct anon_vma *anon_vma;
+   int ret = 0;
+
+   anon_vma = page_lock_anon_vma(page);
+   if (!anon_vma)
+   return ret;
+
+   /*
+* If the page is inside the swap cache, its _count number was
+* increased by one, therefore we have to increase

[PATCH 1/4] MMU_NOTIFIERS: add set_pte_at_notify()

2009-04-04 Thread Izik Eidus
This macro allows setting the pte in the shadow page tables directly,
instead of flushing the shadow page table entry and then getting a vmexit
in order to set it.

This is an optimization for kvm/users of mmu_notifiers for COW
pages. It is useful for kvm when ksm is used, because it allows kvm
not to have to receive a VMEXIT and only then map the shared page into
the mmu shadow pages; instead it maps it directly at the same time
linux maps the page into the host page table.

This mmu notifier macro works by calling a callback that will map
the physical page directly into the shadow page tables.

(Users of mmu_notifiers that didn't implement the set_pte_at_notify()
callback will just receive the mmu_notifier_invalidate_page callback.)

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/mmu_notifier.h |   34 ++
 mm/memory.c  |   10 --
 mm/mmu_notifier.c|   20 
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b77486d..8bb245f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,15 @@ struct mmu_notifier_ops {
 struct mm_struct *mm,
 unsigned long address);
 
+   /* 
+   * change_pte is called in cases that pte mapping into page is changed
+   * for example when ksm mapped pte to point into a new shared page.
+   */
+   void (*change_pte)(struct mmu_notifier *mn,
+  struct mm_struct *mm,
+  unsigned long address,
+  pte_t pte);
+
/*
 * Before this is invoked any secondary MMU is still ok to
 * read/write to the page previously pointed to by the Linux
@@ -154,6 +163,8 @@ extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
  unsigned long address);
+extern void __mmu_notifier_change_pte(struct mm_struct *mm, 
+ unsigned long address, pte_t pte);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
@@ -175,6 +186,13 @@ static inline int mmu_notifier_clear_flush_young(struct 
mm_struct *mm,
return 0;
 }
 
+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+  unsigned long address, pte_t pte)
+{
+   if (mm_has_notifiers(mm))
+   __mmu_notifier_change_pte(mm, address, pte);
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address)
 {
@@ -236,6 +254,16 @@ static inline void mmu_notifier_mm_destroy(struct 
mm_struct *mm)
__young;\
 })
 
+#define set_pte_at_notify(__mm, __address, __ptep, __pte)  \
+({ \
+   struct mm_struct *___mm = __mm; \
+   unsigned long ___address = __address;   \
+   pte_t ___pte = __pte;   \
+   \
+   set_pte_at(__mm, __address, __ptep, ___pte);\
+   mmu_notifier_change_pte(___mm, ___address, ___pte); \
+})
+
 #else /* CONFIG_MMU_NOTIFIER */
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -248,6 +276,11 @@ static inline int mmu_notifier_clear_flush_young(struct 
mm_struct *mm,
return 0;
 }
 
+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+  unsigned long address, pte_t pte)
+{
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address)
 {
@@ -273,6 +306,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct 
*mm)
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define ptep_clear_flush_notify ptep_clear_flush
+#define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */
 
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..1e1a14b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2051,9 +2051,15 @@ gotten:
 * seen in the presence of one thread doing SMC and another
 * thread doing COW.
 */
-   ptep_clear_flush_notify(vma, address, page_table);
+   ptep_clear_flush(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address

[PATCH 0/4] ksm - dynamic page sharing driver for linux v2

2009-04-04 Thread Izik Eidus
...
  - checksum
- if checksum hasn't changed
  - insert into secondary tree
- if it has, store updated checksum (note: first time this page
  is handled it won't have a checksum, so checksum will appear
  as changed, so it takes two passes w/ no other matches to
  get into secondary tree)
  - do not insert into any tree, will see it again on next pass

The basic idea of this algorithm is that even if the unstable tree doesn't
promise to find two identical pages in the first round, we will
probably find them in the second or the third or the tenth round;
then, after we have found these two identical pages just once, we insert
them into the stable tree, where they are protected forever.
So the whole idea of the unstable tree is just to build the stable tree,
and then we find the identical pages using it.

The current implementation can be improved a lot:
we don't have to calculate an expensive checksum, we can just use the host
dirty bit.

Currently we don't support swapping of shared pages (other pages that are
not shared can be swapped, i.e. all the pages that we didn't find to be
identical to other pages).

Walking the tree, we keep calling get_user_pages(); we can optimize it
by saving the pfn and using mmu notifiers to know when the virtual address
mapping has changed.

We currently scan just programs that were registered to be used by ksm; we
would later want to add the ability to tell ksm to scan PIDs (so you can
scan closed binary applications as well).

Right now ksm scanning is done by just one thread; multiple scanner
support might be needed.

This driver is very useful for KVM, as in cases of running multiple guest
operating systems of the same type.
(For desktop workloads we have achieved more than x2 memory overcommit
(more like x3).)

This driver has found users other than KVM, for example CERN,
Fons Rademakers:
on many-core machines we run one large detector simulation program per core.
These simulation programs are identical but run each in their own process and
need about 2 - 2.5 GB RAM.
We typically buy machines with 2GB RAM per core and so have a problem running
one of these programs per core.
Of the 2 - 2.5 GB about 700MB is identical data in the form of magnetic field
maps, detector geometry, etc.
Currently people have been trying to start one program, initialize the geometry
and field maps and then fork it N times, to have the data shared.
With KSM this would be done automatically by the system so it sounded extremely
attractive when Andrea presented it.

I am sending another series of patches for the kvm kernel and kvm-userspace
that will allow users of kvm to test ksm with it.
The kvm patches apply to Avi's git tree.


Izik Eidus (4):
  MMU_NOTIFIERS: add set_pte_at_notify()
  add page_wrprotect(): write protecting page.
  add replace_page(): change the page pte is pointing to.
  add ksm kernel shared memory driver.

 include/linux/ksm.h  |   48 ++
 include/linux/miscdevice.h   |1 +
 include/linux/mm.h   |5 +
 include/linux/mmu_notifier.h |   34 +
 include/linux/rmap.h |   11 +
 mm/Kconfig   |6 +
 mm/Makefile  |1 +
 mm/ksm.c | 1668 ++
 mm/memory.c  |   90 +++-
 mm/mmu_notifier.c|   20 +
 mm/rmap.c|  139 
 11 files changed, 2021 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c



[PATCH 4/4] add ksm kernel shared memory driver.

2009-04-04 Thread Izik Eidus
Ksm is a driver that allows merging identical pages between one or more
applications, in a way invisible to the applications that use it.
Pages that are merged are marked as read-only and are COWed when any
application tries to change them.

Ksm is used for cases where using fork() is not suitable;
one of these cases is where the pages of the application keep changing
dynamically and the application cannot know in advance what pages are
going to be identical.

Ksm works by walking over the memory pages of the applications it
scans in order to find identical pages.
It uses two sorted data structures called the stable and unstable trees
to find the identical pages in an effective way.

When ksm finds two identical pages, it marks them as read-only and merges
them into a single page;
after the pages are marked as read-only and merged into one page, linux
will treat these pages as normal copy-on-write pages and will copy them
when write access happens to them.

Ksm scans just memory areas that were registered to be scanned by it.

Ksm api:

KSM_GET_API_VERSION:
Give the userspace the api version of the module.

KSM_CREATE_SHARED_MEMORY_AREA:
Create a shared memory region fd, which later allows the user to register
the memory regions to scan by using:
KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION

KSM_REGISTER_MEMORY_REGION:
Register userspace virtual address range to be scanned by ksm.
This ioctl is using the ksm_memory_region structure:
ksm_memory_region:
__u32 npages;
 number of pages to share inside this memory region.
__u32 pad;
__u64 addr:
the beginning of the virtual address of this region.
__u64 reserved_bits;
reserved bits for future usage.

KSM_REMOVE_MEMORY_REGION:
Remove memory region from ksm.
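
To make the flow concrete, registering a range from userspace is expected to look
roughly like this (an illustrative sketch only, not part of the patch; it assumes
the header ends up installed as <linux/ksm.h> and skips most error handling):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/ksm.h>

int register_region_with_ksm(void *addr, unsigned long npages)
{
	struct ksm_memory_region region;
	int ksm_fd, sma_fd;

	ksm_fd = open("/dev/ksm", O_RDWR);
	if (ksm_fd < 0)
		return -1;

	/* per the description above this returns the module's api version */
	if (ioctl(ksm_fd, KSM_GET_API_VERSION) != KSM_API_VERSION)
		fprintf(stderr, "unexpected ksm api version\n");

	/* create the shared memory area fd that the region ioctls act on */
	sma_fd = ioctl(ksm_fd, KSM_CREATE_SHARED_MEMORY_AREA);
	if (sma_fd < 0)
		return -1;

	region.npages = npages;
	region.pad = 0;
	region.addr = (unsigned long)addr;
	region.reserved_bits = 0;

	/* from now on ksm may merge identical pages inside this range */
	return ioctl(sma_fd, KSM_REGISTER_MEMORY_REGION, &region);
}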

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/ksm.h|   48 ++
 include/linux/miscdevice.h |1 +
 mm/Kconfig |6 +
 mm/Makefile|1 +
 mm/ksm.c   | 1668 
 5 files changed, 1724 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 000..2c11e9a
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,48 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#include <asm/types.h>
+
+#define KSM_API_VERSION 1
+
+#define ksm_control_flags_run 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+   __u32 npages; /* number of pages to share */
+   __u32 pad;
+   __u64 addr; /* the begining of the virtual address */
+__u64 reserved_bits;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+
+#define KSM_GET_API_VERSION  _IO(KSMIO,   0x00)
+/*
+ * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory region fd
+ */
+#define KSM_CREATE_SHARED_MEMORY_AREA    _IO(KSMIO,   0x01) /* return SMA fd */
+
+/* ioctls for SMA fds */
+
+/*
+ * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
+ * scanned by kvm.
+ */
+#define KSM_REGISTER_MEMORY_REGION   _IOW(KSMIO,  0x20,\
+ struct ksm_memory_region)
+/*
+ * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
+ */
+#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO,   0x21)
+
+#endif
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index beb6ec9..297c0bb 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -30,6 +30,7 @@
 #define HPET_MINOR 228
 #define FUSE_MINOR 229
 #define KVM_MINOR  232
+#define KSM_MINOR  233
 #define MISC_DYNAMIC_MINOR 255
 
 struct device;
diff --git a/mm/Kconfig b/mm/Kconfig
index b53427a..3f3fd04 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -223,3 +223,9 @@ config HAVE_MLOCKED_PAGE_BIT
 
 config MMU_NOTIFIER
bool
+
+config KSM
+   tristate "Enable KSM for page sharing"
+   help
+ Enable the KSM kernel module to allow page sharing of equal pages
+ among different tasks.
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..b885513 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
diff --git a/mm/ksm.c b/mm/ksm.c
new file mode 100644
index 000..fb59a08
--- /dev/null
+++ b/mm/ksm.c
@@ -0,0 +1,1668 @@
+/*
+ * Memory merging driver for Linux
+ *
+ * This module enables dynamic sharing of identical pages found in different
+ * memory areas, even if they are not shared by fork()
+ *
+ * Copyright (C

Re: [PATCH 5/4] update ksm userspace interfaces

2009-04-03 Thread Izik Eidus

Gerd Hoffmann wrote:

Izik Eidus wrote:
  

The main problem that ksm will face when removing the fd interface is:
right now when you register memory into ksm, you open fd, and then ksm
do get_task_mm(), we will do mmput when the file will be closed



Did you test whenever it really cleans up in case you kill -9 qemu?

I recently did something simliar with the result that the extra
reference hold on mm_struct prevented the process memory from being
zapped ...

cheers,
  Gerd
  

Did you use mmput() after you called get_task_mm() ???
get_task_mm() does nothing besides atomic_inc(&mm->mm_users);

and mmput() does nothing besides decrementing this counter and checking
whether any references to it are left.


Am I missing anything?


Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-02 Thread Izik Eidus

Anthony Liguori wrote:

Chris Wright wrote:

* Anthony Liguori (anth...@codemonkey.ws) wrote:
 
The ioctl() interface is quite bad for what you're doing.  You're  
telling the kernel extra information about a VA range in 
userspace.   That's what madvise is for.  You're tweaking simple 
read/write values of  kernel infrastructure.  That's what sysfs is for.



I agree re: sysfs (brought it up myself before).  As far as madvise vs.
ioctl, the one thing that comes from the ioctl is fops->release to
automagically unregister memory on exit.


This is precisely why ioctl() is a bad interface.  fops->release isn't 
tied to the process but rather tied to the open file.  The file can 
stay open long after the process exits either by a fork()'d child 
inheriting the file descriptor or through something more sinister like 
SCM_RIGHTS.


In fact, a common mistake is to leak file descriptors by not closing 
them when exec()'ing a process.  Instead of just delaying a close, if 
you rely on this behavior to unregister memory regions, you could 
potentially have badness happen in the kernel if ksm attempted to 
access an invalid memory region. 

How could such badness ever happen in the kernel?
Ksm works by virtual addresses! It fetches the pages by using
get_user_pages(), the mm struct is protected by get_task_mm(), and in
addition we take down_read(&mmap_sem).


So how could ksm ever access an invalid memory region unless the host
page table or get_task_mm() stopped working!


When someone registers memory for scanning we do get_task_mm(); the
reference is dropped when the file is closed, or when he says that he
doesn't want this to be registered anymore and calls the unregister ioctl.



You can argue about the API, but saying Ksm is insecure is a mathematical
claim, please show me a scenario!



Re: [PATCH 5/4] update ksm userspace interfaces

2009-04-02 Thread Izik Eidus

Chris Wright wrote:

* Anthony Liguori (anth...@codemonkey.ws) wrote:
  
Using an interface like madvise() would force the issue to be dealt with  
properly from the start :-)



Yeah, I'm not at all opposed to it.

This updates to madvise for register and sysfs for control.

madvise issues:
- MADV_SHAREABLE
  - register only ATM, can add MADV_UNSHAREABLE to allow an app to proactively
unregister, but need a cleanup when -mm goes away via exit/exec
  - will register a region per vma, should probably push the whole thing
into vma rather than keep [mm,addr,len] tuple in ksm

  

The main problem that ksm will face when removing the fd interface is:
right now when you register memory into ksm you open an fd, and then ksm
does get_task_mm(); we do mmput when the file is closed
(note that this doesn't mean that if you fork and don't close the fd the
memory won't go away; get_task_mm() doesn't protect the vmas inside
the mm structure and therefore they can still get removed).


So if we move to madvise and we remove the get_task_mm() usage, we
will have to add a notification to exit_mm() so ksm will know it should
stop using this mm structure, and drop it from all the trees' data...


Is this what we want?
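
Just to make the alternative concrete, app-side registration under the madvise()
idea would look roughly like this (MADV_SHAREABLE is only the name proposed in this
thread, it is not a real madvise() advice value and the number below is made up):

#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical: the MADV_SHAREABLE flag proposed above does not exist. */
#ifndef MADV_SHAREABLE
#define MADV_SHAREABLE	64
#endif

static int ksm_register(void *addr, size_t len)
{
	/* would replace the KSM_REGISTER_MEMORY_REGION ioctl */
	return madvise(addr, len, MADV_SHAREABLE);
}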


Re: [PATCH 5/4] update ksm userspace interfaces

2009-04-02 Thread Izik Eidus

Chris Wright wrote:

* Izik Eidus (iei...@redhat.com) wrote:
  

Is this what we want?



How about baby steps...

admit that ioctl to control plane is better done via sysfs?
  

Yes


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2009-04-02 Thread Izik Eidus

Jesper Juhl wrote:

Hi,

On Tue, 31 Mar 2009, Izik Eidus wrote:

  

KSM is a linux driver that allows dynamicly sharing identical memory
pages between one or more processes.

Unlike tradtional page sharing that is made at the allocation of the
memory, ksm do it dynamicly after the memory was created.
Memory is periodically scanned; identical pages are identified and
merged.
The sharing is unnoticeable by the process that use this memory.
(the shared pages are marked as readonly, and in case of write
do_wp_page() take care to create new copy of the page)

To find identical pages ksm use algorithm that is split into three
primery levels:

1) Ksm will start scan the memory and will calculate checksum for each
   page that is registred to be scanned.
   (In the first round of the scanning, ksm would only calculate
this checksum for all the pages)




One question;

Calcolating a checksum is a fine way to find pages that are likely to be 
identical


I don't use the checksum as in a hash table; the checksum is not used to find 
identical pages by them having similar data...
the checksum is used to let me know that the page was not changed for a 
while and that it is worth checking for pages identical to it...
In the future we will want to use the page table dirty bit for this, as 
taking the checksum is somewhat expensive


, but there is no guarantee that two pages with the same 
checksum really are identical - there *will* be checksum collisions 
eventually. So, I really hope that your implementation actually checks 
that two pages that it find that have identical checksums really are 100% 
identical by comparing them bit by bit before throwing one away.
  

We do that :-)

If you rely only on a checksum then eventually a user will get bitten by a 
checksum collision and, in the best case, something will crash, and in the 
worst case, data will silently be corrupted.


Do you rely only on the checksum or do you actually compare pages to check 
they are 100% identical before sharing?
  


I do a 100% compare of the pages before I share them.

I must admit that I have not read through the patch to find the answer, I 
just read your description and became concerned.


  

Don't worry, me neither :-)
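
To make the heuristic concrete, a small standalone sketch (illustration only,
the helper names are made up, this is not the ksm.c code): the checksum only
answers "did this page change since the last pass", the decision to actually
merge is always a full compare:

#include <stdint.h>
#include <stddef.h>

/* illustrative only: any cheap checksum works, collisions are harmless here */
static uint32_t page_checksum(const unsigned char *page, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum = sum * 31 + page[i];
	return sum;
}

/*
 * Returns 1 when the page kept the same checksum since the previous pass,
 * i.e. it looks stable enough to be worth a full memcmp() against other
 * pages; volatile pages just get their stored checksum refreshed.
 */
static int page_looks_stable(const unsigned char *page, size_t len,
			     uint32_t *stored_checksum)
{
	uint32_t sum = page_checksum(page, len);

	if (sum == *stored_checksum)
		return 1;
	*stored_checksum = sum;
	return 0;
}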



Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-01 Thread Izik Eidus

KAMEZAWA Hiroyuki wrote:

On Tue, 31 Mar 2009 15:21:53 +0300
Izik Eidus iei...@redhat.com wrote:
  
  
  

kpage is actually what going to be KsmPage - the shared page...

Right now this pages are not swappable..., after ksm will be merged we 
will make this pages swappable as well...




sure.

  

If so, please
 - show the amount of kpage
 
 - allow users to set limit for usage of kpages. or preserve kpages at boot or

   by user's command.
  
  
kpage actually save memory..., and limiting the number of them, would 
make you limit the number of shared pages...





Ah, I'm working for memory control cgroup. And *KSM* will be out of control.
It's ok to make the default limit value as INFINITY. but please add knobs.
  
Sure, when I post V2 I will take care of this issue (I will do it 
after I get a little bit more review of ksm.c :-))




Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-01 Thread Izik Eidus

Anthony Liguori wrote:

Andrea Arcangeli wrote:

On Tue, Mar 31, 2009 at 10:54:57AM -0500, Anthony Liguori wrote:
 
You can still disable ksm and simply return ENOSYS for the MADV_ 
flag.  You 



Anthony, the biggest problem with madvise() is that it is a real system 
call api; at this stage of ksm I wouldn't want to commit to api changes of 
linux...


The ioctl itself is restricting, madvise is much more...

Can we defer this issue to after ksm is merged, and after all the big 
new features that we want to add to ksm are merged?
(Then the api would be much more stable and we would be able to ask people 
on the list about changing the api, but for a new driver that is yet to be 
merged it is kind of overkill to add an api to linux.)


What do you think?



Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-03-31 Thread Izik Eidus

KAMEZAWA Hiroyuki wrote:

On Tue, 31 Mar 2009 02:59:20 +0300
Izik Eidus iei...@redhat.com wrote:

  

Ksm is driver that allow merging identical pages between one or more
applications in way unvisible to the application that use it.
Pages that are merged are marked as readonly and are COWed when any
application try to change them.

Ksm is used for cases where using fork() is not suitable,
one of this cases is where the pages of the application keep changing
dynamicly and the application cannot know in advance what pages are
going to be identical.

Ksm works by walking over the memory pages of the applications it
scan in order to find identical pages.
It uses a two sorted data strctures called stable and unstable trees
to find in effective way the identical pages.

When ksm finds two identical pages, it marks them as readonly and merges
them into single one page,
after the pages are marked as readonly and merged into one page, linux
will treat this pages as normal copy_on_write pages and will fork them
when write access will happen to them.

Ksm scan just memory areas that were registred to be scanned by it.

Ksm api:

KSM_GET_API_VERSION:
Give the userspace the api version of the module.

KSM_CREATE_SHARED_MEMORY_AREA:
Create shared memory reagion fd, that latter allow the user to register
the memory region to scan by using:
KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION

KSM_START_STOP_KTHREAD:
Return information about the kernel thread, the inforamtion is returned
using the ksm_kthread_info structure:
ksm_kthread_info:
__u32 sleep:
number of microsecoends to sleep between each iteration of
scanning.

__u32 pages_to_scan:
number of pages to scan for each iteration of scanning.

__u32 max_pages_to_merge:
maximum number of pages to merge in each iteration of scanning
(so even if there are still more pages to scan, we stop this
iteration)

__u32 flags:
   flags to control ksmd (right now just ksm_control_flags_run
  available)

KSM_REGISTER_MEMORY_REGION:
Register userspace virtual address range to be scanned by ksm.
This ioctl is using the ksm_memory_region structure:
ksm_memory_region:
__u32 npages;
 number of pages to share inside this memory region.
__u32 pad;
__u64 addr:
the begining of the virtual address of this region.

KSM_REMOVE_MEMORY_REGION:
Remove memory region from ksm.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/ksm.h|   69 +++
 include/linux/miscdevice.h |1 +
 mm/Kconfig |6 +
 mm/Makefile|1 +
 mm/ksm.c   | 1431 
 5 files changed, 1508 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 000..5776dce
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,69 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+#include linux/types.h
+#include linux/ioctl.h
+
+#include asm/types.h
+
+#define KSM_API_VERSION 1
+
+#define ksm_control_flags_run 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+   __u32 npages; /* number of pages to share */
+   __u32 pad;
+   __u64 addr; /* the begining of the virtual address */
+__u64 reserved_bits;
+};
+
+struct ksm_kthread_info {
+   __u32 sleep; /* number of microsecoends to sleep */
+   __u32 pages_to_scan; /* number of pages to scan */
+   __u32 flags; /* control flags */
+__u32 pad;
+__u64 reserved_bits;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+
+#define KSM_GET_API_VERSION  _IO(KSMIO,   0x00)
+/*
+ * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory reagion fd
+ */
+#define KSM_CREATE_SHARED_MEMORY_AREA_IO(KSMIO,   0x01) /* return SMA fd */
+/*
+ * KSM_START_STOP_KTHREAD - control the kernel thread scanning speed
+ * (can stop the kernel thread from working by setting running = 0)
+ */
+#define KSM_START_STOP_KTHREAD  _IOW(KSMIO,  0x02,\
+ struct ksm_kthread_info)
+/*
+ * KSM_GET_INFO_KTHREAD - return information about the kernel thread
+ * scanning speed.
+ */
+#define KSM_GET_INFO_KTHREAD_IOW(KSMIO,  0x03,\
+ struct ksm_kthread_info)
+
+
+/* ioctls for SMA fds */
+
+/*
+ * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
+ * scanned by kvm.
+ */
+#define KSM_REGISTER_MEMORY_REGION   _IOW(KSMIO,  0x20,\
+ struct ksm_memory_region)
+/*
+ * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
+ */
+#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO,   0x21)
+
+#endif
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index a820f81

Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-03-31 Thread Izik Eidus

Anthony Liguori wrote:

Izik Eidus wrote:

Ksm is driver that allow merging identical pages between one or more
applications in way unvisible to the application that use it.
Pages that are merged are marked as readonly and are COWed when any
application try to change them.

Ksm is used for cases where using fork() is not suitable,
one of this cases is where the pages of the application keep changing
dynamicly and the application cannot know in advance what pages are
going to be identical.

Ksm works by walking over the memory pages of the applications it
scan in order to find identical pages.
It uses a two sorted data strctures called stable and unstable trees
to find in effective way the identical pages.

When ksm finds two identical pages, it marks them as readonly and merges
them into single one page,
after the pages are marked as readonly and merged into one page, linux
will treat this pages as normal copy_on_write pages and will fork them
when write access will happen to them.

Ksm scan just memory areas that were registred to be scanned by it.

Ksm api:

KSM_GET_API_VERSION:
Give the userspace the api version of the module.

KSM_CREATE_SHARED_MEMORY_AREA:
Create shared memory reagion fd, that latter allow the user to register
the memory region to scan by using:
KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION

KSM_START_STOP_KTHREAD:
Return information about the kernel thread, the inforamtion is returned
using the ksm_kthread_info structure:
ksm_kthread_info:
__u32 sleep:
number of microsecoends to sleep between each iteration of
scanning.

__u32 pages_to_scan:
number of pages to scan for each iteration of scanning.

__u32 max_pages_to_merge:
maximum number of pages to merge in each iteration of scanning
(so even if there are still more pages to scan, we stop this
iteration)

__u32 flags:
   flags to control ksmd (right now just ksm_control_flags_run
  available)
  


Wouldn't this make more sense as a sysfs interface?


I believe using ioctl for registering the memory of applications makes it 
easier.
Ksm doesn't have any complicated API that would benefit from sysfs 
(besides adding more complexity)


That is, the KSM_START_STOP_KTHREAD part, not necessarily the rest of 
the API.


What do you mean?


Regards,

Anthony Liguori





Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2009-03-31 Thread Izik Eidus

Anthony Liguori wrote:

Izik Eidus wrote:

I am sending another seires of patchs for kvm kernel and kvm-userspace
that would allow users of kvm to test ksm with it.
The kvm patchs would apply to Avi git tree.
  
Any reason to not take these through upstream QEMU instead of 
kvm-userspace?  In principle, I don't see anything that would prevent 
normal QEMU from almost making use of this functionality.  That would 
make it one less thing to eventually have to merge...


The changes for kvm-userspace were just provided for testing it...
After we have ksm inside the kernel we will send another patch to 
qemu-devel that will add support for it.




Regards,

Anthony Liguori




[PATCH 0/4] ksm - dynamic page sharing driver for linux

2009-03-30 Thread Izik Eidus
.
Of the 2 - 2.5 GB about 700MB is identical data in the form of magnetic field
maps, detector geometry, etc.
Currently people have been trying to start one program, initialize the geometry
and field maps and then fork it N times, to have the data shared.
With KSM this would be done automatically by the system so it sounded extremely
attractive when Andrea presented it.

I am sending another series of patches for the kvm kernel and kvm-userspace
that allow users of kvm to test ksm with it.
The kvm patches apply to Avi's git tree.

Izik Eidus (4):
  MMU_NOTIFIERS: add set_pte_at_notify()
  add page_wrprotect(): write protecting page.
  add replace_page(): change the page pte is pointing to.
  add ksm kernel shared memory driver.

 include/linux/ksm.h  |   69 ++
 include/linux/miscdevice.h   |1 +
 include/linux/mm.h   |5 +
 include/linux/mmu_notifier.h |   34 +
 include/linux/rmap.h |   11 +
 mm/Kconfig   |6 +
 mm/Makefile  |1 +
 mm/ksm.c | 1431 ++
 mm/memory.c  |   90 +++-
 mm/mmu_notifier.c|   20 +
 mm/rmap.c|  139 
 11 files changed, 1805 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c



[PATCH 1/4] MMU_NOTIFIERS: add set_pte_at_notify()

2009-03-30 Thread Izik Eidus
this macro allows setting the pte in the shadow page tables directly,
instead of flushing the shadow page table entry and then getting a vmexit
in order to set it.

This function is an optimization for kvm/users of mmu_notifiers for COW
pages; it is useful for kvm when ksm is used, because it allows kvm
not to have to receive a VMEXIT and only then map the shared page into
the mmu shadow pages, but instead map it directly at the same time
linux maps the page into the host page table.

this mmu notifier macro works by calling a callback that will map
the physical page directly into the shadow page tables.

(users of mmu_notifiers that didn't implement the set_pte_at_notify()
callback will just receive the mmu_notifier_invalidate_page callback)

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/mmu_notifier.h |   34 ++
 mm/memory.c  |   10 --
 mm/mmu_notifier.c|   20 
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b77486d..8bb245f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,15 @@ struct mmu_notifier_ops {
 struct mm_struct *mm,
 unsigned long address);
 
+   /* 
+   * change_pte is called in cases that pte mapping into page is changed
+   * for example when ksm mapped pte to point into a new shared page.
+   */
+   void (*change_pte)(struct mmu_notifier *mn,
+  struct mm_struct *mm,
+  unsigned long address,
+  pte_t pte);
+
/*
 * Before this is invoked any secondary MMU is still ok to
 * read/write to the page previously pointed to by the Linux
@@ -154,6 +163,8 @@ extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
  unsigned long address);
+extern void __mmu_notifier_change_pte(struct mm_struct *mm, 
+ unsigned long address, pte_t pte);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
@@ -175,6 +186,13 @@ static inline int mmu_notifier_clear_flush_young(struct 
mm_struct *mm,
return 0;
 }
 
+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+  unsigned long address, pte_t pte)
+{
+   if (mm_has_notifiers(mm))
+   __mmu_notifier_change_pte(mm, address, pte);
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address)
 {
@@ -236,6 +254,16 @@ static inline void mmu_notifier_mm_destroy(struct 
mm_struct *mm)
__young;\
 })
 
+#define set_pte_at_notify(__mm, __address, __ptep, __pte)  \
+({ \
+   struct mm_struct *___mm = __mm; \
+   unsigned long ___address = __address;   \
+   pte_t ___pte = __pte;   \
+   \
+   set_pte_at(__mm, __address, __ptep, ___pte);\
+   mmu_notifier_change_pte(___mm, ___address, ___pte); \
+})
+
 #else /* CONFIG_MMU_NOTIFIER */
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -248,6 +276,11 @@ static inline int mmu_notifier_clear_flush_young(struct 
mm_struct *mm,
return 0;
 }
 
+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+  unsigned long address, pte_t pte)
+{
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address)
 {
@@ -273,6 +306,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct 
*mm)
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define ptep_clear_flush_notify ptep_clear_flush
+#define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */
 
diff --git a/mm/memory.c b/mm/memory.c
index baa999e..0382a34 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2031,9 +2031,15 @@ gotten:
 * seen in the presence of one thread doing SMC and another
 * thread doing COW.
 */
-   ptep_clear_flush_notify(vma, address, page_table);
+   ptep_clear_flush(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address

[PATCH 2/4] add page_wrprotect(): write protecting page.

2009-03-30 Thread Izik Eidus
this patch adds a new function called page_wrprotect();
page_wrprotect() is used to take a page and mark all the ptes that
point into it as readonly.

The function works by walking the rmap of the page and setting
each pte related to the page as readonly.

The odirect_sync parameter is used to protect against possible races
with O_DIRECT while we are marking the ptes as readonly,
as noted by Andrea Arcangeli:

While thinking at get_user_pages_fast I figured another worse way
things can go wrong with ksm and o_direct: think a thread writing
constantly to the last 512bytes of a page, while another thread read
and writes to/from the first 512bytes of the page. We can lose
O_DIRECT reads, the very moment we mark any pte wrprotected...

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/rmap.h |   11 
 mm/rmap.c|  139 ++
 2 files changed, 150 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b35bc0e..469376d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,6 +118,10 @@ static inline int try_to_munlock(struct page *page)
 }
 #endif
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
+#endif
+
 #else  /* !CONFIG_MMU */
 
 #define anon_vma_init()do {} while (0)
@@ -132,6 +136,13 @@ static inline int page_mkclean(struct page *page)
return 0;
 }
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+static inline int page_wrprotect(struct page *page, int *odirect_sync,
+int count_offset)
+{
+   return 0;
+}
+#endif
 
 #endif /* CONFIG_MMU */
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 1652166..95c55ea 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -585,6 +585,145 @@ int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
+ int *odirect_sync, int count_offset)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   unsigned long address;
+   pte_t *pte;
+   spinlock_t *ptl;
+   int ret = 0;
+
+   address = vma_address(page, vma);
+   if (address == -EFAULT)
+   goto out;
+
+   pte = page_check_address(page, mm, address, &ptl, 0);
+   if (!pte)
+   goto out;
+
+   if (pte_write(*pte)) {
+   pte_t entry;
+
+   flush_cache_page(vma, address, pte_pfn(*pte));
+   /*
+* Ok this is tricky, when get_user_pages_fast() run it doesnt
+* take any lock, therefore the check that we are going to make
+* with the pagecount against the mapcount is racey and
+* O_DIRECT can happen right after the check.
+* So we clear the pte and flush the tlb before the check
+* this assure us that no O_DIRECT can happen after the check
+* or in the middle of the check.
+*/
+   entry = ptep_clear_flush(vma, address, pte);
+   /*
+* Check that no O_DIRECT or similar I/O is in progress on the
+* page
+*/
+   if ((page_mapcount(page) + count_offset) != page_count(page)) {
+   *odirect_sync = 0;
+   set_pte_at_notify(mm, address, pte, entry);
+   goto out_unlock;
+   }
+   entry = pte_wrprotect(entry);
+   set_pte_at_notify(mm, address, pte, entry);
+   }
+   ret = 1;
+
+out_unlock:
+   pte_unmap_unlock(pte, ptl);
+out:
+   return ret;
+}
+
+static int page_wrprotect_file(struct page *page, int *odirect_sync,
+  int count_offset)
+{
+   struct address_space *mapping;
+   struct prio_tree_iter iter;
+   struct vm_area_struct *vma;
+   pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+   int ret = 0;
+
+   mapping = page_mapping(page);
+   if (!mapping)
+   return ret;
+
+   spin_lock(&mapping->i_mmap_lock);
+
+   vma_prio_tree_foreach(vma, iter, &mapping->i_mmap, pgoff, pgoff)
+   ret += page_wrprotect_one(page, vma, odirect_sync,
+ count_offset);
+
+   spin_unlock(&mapping->i_mmap_lock);
+
+   return ret;
+}
+
+static int page_wrprotect_anon(struct page *page, int *odirect_sync,
+  int count_offset)
+{
+   struct vm_area_struct *vma;
+   struct anon_vma *anon_vma;
+   int ret = 0;
+
+   anon_vma = page_lock_anon_vma(page);
+   if (!anon_vma)
+   return ret;
+
+   /*
+* If the page is inside the swap cache, its _count number was
+* increased by one, therefore we have to increase

[PATCH 3/4] add replace_page(): change the page pte is pointing to.

2009-03-30 Thread Izik Eidus
replace_page() allows changing the mapping of a pte from one physical page
to a different physical page.

this function works by removing oldpage from the rmap and calling
put_page on it, and by setting the pte to point to newpage and
inserting it into the rmap using page_add_file_rmap().

note: newpage must be a non-anonymous page; the reason for this is:
replace_page() is built to allow mapping one page into more than one
virtual address, the mapping of this page can happen at different
offsets inside each vma, and therefore we cannot trust page->index
anymore.

the side effect of this issue is that newpage cannot be anything but a
kernel allocated page that is not swappable.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/mm.h |5 +++
 mm/memory.c|   80 
 2 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065cdf8..b19e4c2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1237,6 +1237,11 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned 
long addr,
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot);
+#endif
+
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
unsigned int foll_flags);
 #define FOLL_WRITE 0x01/* check pte is writable */
diff --git a/mm/memory.c b/mm/memory.c
index 0382a34..3946e79 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1562,6 +1562,86 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned 
long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+/**
+ * replace_page - replace page in vma with new page
+ * @vma:  vma that hold the pte oldpage is pointed by.
+ * @oldpage:  the page we are replacing with newpage
+ * @newpage:  the page we replace oldpage with
+ * @orig_pte: the original value of the pte
+ * @prot: page protection bits
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ *
+ * Note: @newpage must not be an anonymous page because replace_page() does
+ * not change the mapping of @newpage to have the same values as @oldpage.
+ * @newpage can be mapped in several vmas at different offsets (page->index).
+ */
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep;
+   spinlock_t *ptl;
+   unsigned long addr;
+   int ret;
+
+   BUG_ON(PageAnon(newpage));
+
+   ret = -EFAULT;
+   addr = page_address_in_vma(oldpage, vma);
+   if (addr == -EFAULT)
+   goto out;
+
+   pgd = pgd_offset(mm, addr);
+   if (!pgd_present(*pgd))
+   goto out;
+
+   pud = pud_offset(pgd, addr);
+   if (!pud_present(*pud))
+   goto out;
+
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   goto out;
+
+   ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+   if (!ptep)
+   goto out;
+
+   if (!pte_same(*ptep, orig_pte)) {
+   pte_unmap_unlock(ptep, ptl);
+   goto out;
+   }
+
+   ret = 0;
+   get_page(newpage);
+   page_add_file_rmap(newpage);
+
+   flush_cache_page(vma, addr, pte_pfn(*ptep));
+   ptep_clear_flush(vma, addr, ptep);
+   set_pte_at_notify(mm, addr, ptep, mk_pte(newpage, prot));
+
+   page_remove_rmap(oldpage);
+   if (PageAnon(oldpage)) {
+   dec_mm_counter(mm, anon_rss);
+   inc_mm_counter(mm, file_rss);
+   }
+   put_page(oldpage);
+
+   pte_unmap_unlock(ptep, ptl);
+
+out:
+   return ret;
+}
+EXPORT_SYMBOL_GPL(replace_page);
+
+#endif
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results
-- 
1.5.6.5



[PATCH 0/3] kvm support for ksm

2009-03-30 Thread Izik Eidus
apply it against Avi git tree.

Izik Eidus (3):
  kvm: dont hold pagecount reference for mapped sptes pages.
  kvm: add SPTE_HOST_WRITEABLE flag to the shadow ptes.
  kvm: add support for change_pte mmu notifiers

 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   89 ---
 arch/x86/kvm/paging_tmpl.h  |   16 ++-
 virt/kvm/kvm_main.c |   14 ++
 4 files changed, 101 insertions(+), 19 deletions(-)



[PATCH 1/3] kvm: dont hold pagecount reference for mapped sptes pages.

2009-03-30 Thread Izik Eidus
When using mmu notifiers, we are allowed to remove the page count
reference taken by get_user_pages to a specific page that is mapped
inside the shadow page tables.

This is needed so we can balance the pagecount against mapcount
checking.

(Right now kvm increases the pagecount and does not increase the
mapcount when mapping a page into a shadow page table entry,
so when comparing pagecount against mapcount, you have no
reliable result.)

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/kvm/mmu.c |7 ++-
 1 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b625ed4..df8fbaf 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -567,9 +567,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
if (*spte & shadow_accessed_mask)
kvm_set_pfn_accessed(pfn);
if (is_writeble_pte(*spte))
-   kvm_release_pfn_dirty(pfn);
-   else
-   kvm_release_pfn_clean(pfn);
+   kvm_set_pfn_dirty(pfn);
rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt], is_large_pte(*spte));
if (!*rmapp) {
printk(KERN_ERR "rmap_remove: %p %llx 0->BUG\n", spte, *spte);
@@ -1812,8 +1810,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*shadow_pte,
page_header_update_slot(vcpu->kvm, shadow_pte, gfn);
if (!was_rmapped) {
rmap_add(vcpu, shadow_pte, gfn, largepage);
-   if (!is_rmap_pte(*shadow_pte))
-   kvm_release_pfn_clean(pfn);
+   kvm_release_pfn_clean(pfn);
} else {
if (was_writeble)
kvm_release_pfn_dirty(pfn);
-- 
1.5.6.5



[PATCH 2/3] kvm: add SPTE_HOST_WRITEABLE flag to the shadow ptes.

2009-03-30 Thread Izik Eidus
this flag notifies that the host physical page we are pointing to from
the spte is write protected, and therefore we can't change its access
to writable unless we run get_user_pages(write = 1).

(this is needed for change_pte support in kvm)

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/kvm/mmu.c |   14 ++
 arch/x86/kvm/paging_tmpl.h |   16 +---
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index df8fbaf..6b4d795 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -138,6 +138,8 @@ module_param(oos_shadow, bool, 0644);
 #define ACC_USER_MASKPT_USER_MASK
 #define ACC_ALL  (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
 
+#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
 struct kvm_rmap_desc {
@@ -1676,7 +1678,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 
*shadow_pte,
unsigned pte_access, int user_fault,
int write_fault, int dirty, int largepage,
int global, gfn_t gfn, pfn_t pfn, bool speculative,
-   bool can_unsync)
+   bool can_unsync, bool reset_host_protection)
 {
u64 spte;
int ret = 0;
@@ -1719,6 +1721,8 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 
*shadow_pte,
kvm_x86_ops->get_mt_mask_shift();
spte |= mt_mask;
}
+   if (reset_host_protection)
+   spte |= SPTE_HOST_WRITEABLE;
 
spte |= (u64)pfn  PAGE_SHIFT;
 
@@ -1764,7 +1768,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*shadow_pte,
 unsigned pt_access, unsigned pte_access,
 int user_fault, int write_fault, int dirty,
 int *ptwrite, int largepage, int global,
-gfn_t gfn, pfn_t pfn, bool speculative)
+gfn_t gfn, pfn_t pfn, bool speculative,
+bool reset_host_protection)
 {
int was_rmapped = 0;
int was_writeble = is_writeble_pte(*shadow_pte);
@@ -1793,7 +1798,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*shadow_pte,
was_rmapped = 1;
}
if (set_spte(vcpu, shadow_pte, pte_access, user_fault, write_fault,
- dirty, largepage, global, gfn, pfn, speculative, true)) {
+ dirty, largepage, global, gfn, pfn, speculative, true,
+ reset_host_protection)) {
if (write_fault)
*ptwrite = 1;
kvm_x86_ops->tlb_flush(vcpu);
@@ -1840,7 +1846,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
int write,
|| (largepage && iterator.level == PT_DIRECTORY_LEVEL)) {
mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL,
 0, write, 1, pt_write,
-largepage, 0, gfn, pfn, false);
+largepage, 0, gfn, pfn, false, true);
++vcpu->stat.pf_fixed;
break;
}
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index eae9499..9fdacd0 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -259,10 +259,14 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, 
struct kvm_mmu_page *page,
if (mmu_notifier_retry(vcpu, vcpu->arch.update_pte.mmu_seq))
return;
kvm_get_pfn(pfn);
+   /*
+* we call mmu_set_spte() with reset_host_protection = true because
+* vcpu->arch.update_pte.pfn was fetched from get_user_pages(write = 1).
+*/
mmu_set_spte(vcpu, spte, page->role.access, pte_access, 0, 0,
 gpte & PT_DIRTY_MASK, NULL, largepage,
 gpte & PT_GLOBAL_MASK, gpte_to_gfn(gpte),
-pfn, true);
+pfn, true, true);
 }
 
 /*
@@ -297,7 +301,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
gw->ptes[gw->level-1] & PT_DIRTY_MASK,
ptwrite, largepage,
gw->ptes[gw->level-1] & PT_GLOBAL_MASK,
-gw->gfn, pfn, false);
+gw->gfn, pfn, false, true);
break;
}
 
@@ -547,6 +551,7 @@ static void FNAME(prefetch_page)(struct kvm_vcpu *vcpu,
 static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
int i, offset, nr_present;
+bool reset_host_protection = 1;
 
offset = nr_present = 0;
 
@@ -584,9 +589,14 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp)
 
nr_present++;
pte_access

[PATCH 3/3] kvm: add support for change_pte mmu notifiers

2009-03-30 Thread Izik Eidus
this is needed for kvm if it wants ksm to directly map pages into its
shadow page tables.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   68 +++
 virt/kvm/kvm_main.c |   14 
 3 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8351c4d..9062729 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -791,5 +791,6 @@ asmlinkage void kvm_handle_fault_on_reboot(void);
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6b4d795..f8816dd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -257,6 +257,11 @@ static pfn_t spte_to_pfn(u64 pte)
return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
 }
 
+static pte_t ptep_val(pte_t *ptep)
+{
+   return *ptep;
+}
+
 static gfn_t pse36_gfn_delta(u32 gpte)
 {
int shift = 32 - PT32_DIR_PSE36_SHIFT - PAGE_SHIFT;
@@ -678,7 +683,8 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
return write_protected;
 }
 
-static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
+  unsigned long data)
 {
u64 *spte;
int need_tlb_flush = 0;
@@ -693,8 +699,48 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long 
*rmapp)
return need_tlb_flush;
 }
 
+static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
+unsigned long data)
+{
+   int need_flush = 0;
+   u64 *spte, new_spte;
+   pte_t *ptep = (pte_t *)data;
+   pfn_t new_pfn;
+
+   new_pfn = pte_pfn(ptep_val(ptep));
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+   BUG_ON(!is_shadow_present_pte(*spte));
+   rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", spte, *spte);
+   need_flush = 1;
+   if (pte_write(ptep_val(ptep))) {
+   rmap_remove(kvm, spte);
+   set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+   spte = rmap_next(kvm, rmapp, NULL);
+   } else {
+   new_spte = *spte & ~PT64_BASE_ADDR_MASK;
+   new_spte |= new_pfn << PAGE_SHIFT;
+
+   if (!pte_write(ptep_val(ptep))) {
+   new_spte &= ~PT_WRITABLE_MASK;
+   new_spte &= ~SPTE_HOST_WRITEABLE;
+   if (is_writeble_pte(*spte))
+   kvm_set_pfn_dirty(spte_to_pfn(*spte));
+   }
+   set_shadow_pte(spte, new_spte);
+   spte = rmap_next(kvm, rmapp, spte);
+   }
+   }
+   if (need_flush)
+   kvm_flush_remote_tlbs(kvm);
+
+   return 0;
+}
+
 static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
- int (*handler)(struct kvm *kvm, unsigned long *rmapp))
+ unsigned long data,
+ int (*handler)(struct kvm *kvm, unsigned long *rmapp,
+unsigned long data))
 {
int i;
int retval = 0;
@@ -715,11 +761,13 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long 
hva,
end = start + (memslot->npages << PAGE_SHIFT);
if (hva >= start && hva < end) {
gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
-   retval |= handler(kvm, &memslot->rmap[gfn_offset]);
+   retval |= handler(kvm, &memslot->rmap[gfn_offset],
+ data);
retval |= handler(kvm,
  &memslot->lpage_info[
  gfn_offset /
- KVM_PAGES_PER_HPAGE].rmap_pde);
+ KVM_PAGES_PER_HPAGE].rmap_pde,
+ data);
}
}
 
@@ -728,10 +776,16 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long 
hva,
 
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
 {
-   return kvm_handle_hva(kvm, hva, kvm_unmap_rmapp);
+   return kvm_handle_hva(kvm, hva, 0, kvm_unmap_rmapp);
+}
+
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+{
+   kvm_handle_hva(kvm, hva, (unsigned long)pte, kvm_set_pte_rmapp);
 }
 
-static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
+static int

[PATCH 0/2] kvm-userspace ksm support

2009-03-30 Thread Izik Eidus
Apply it against Avi kvm-userspace git tree.

Izik Eidus (2):
  qemu: add ksm support
  qemu: add ksmctl.

 qemu/ksm.h |   70 
 qemu/vl.c  |   34 +
 user/Makefile  |6 +++-
 user/config-x86-common.mak |2 +-
 user/ksmctl.c  |   69 +++
 5 files changed, 179 insertions(+), 2 deletions(-)
 create mode 100644 qemu/ksm.h
 create mode 100644 user/ksmctl.c



[PATCH 1/2] qemu: add ksm support

2009-03-30 Thread Izik Eidus
Signed-off-by: Izik Eidus iei...@redhat.com
---
 qemu/ksm.h |   70 
 qemu/vl.c  |   34 +
 2 files changed, 104 insertions(+), 0 deletions(-)
 create mode 100644 qemu/ksm.h

diff --git a/qemu/ksm.h b/qemu/ksm.h
new file mode 100644
index 000..2fb91a8
--- /dev/null
+++ b/qemu/ksm.h
@@ -0,0 +1,70 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+
+#include <sys/types.h>
+#include <sys/ioctl.h>
+
+#include <asm/types.h>
+
+#define KSM_API_VERSION 1
+
+#define ksm_control_flags_run 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+   __u32 npages; /* number of pages to share */
+   __u32 pad;
+   __u64 addr; /* the begining of the virtual address */
+__u64 reserved_bits;
+};
+
+struct ksm_kthread_info {
+   __u32 sleep; /* number of microsecoends to sleep */
+   __u32 pages_to_scan; /* number of pages to scan */
+   __u32 flags; /* control flags */
+__u32 pad;
+__u64 reserved_bits;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+
+#define KSM_GET_API_VERSION  _IO(KSMIO,   0x00)
+/*
+ * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory reagion fd
+ */
+#define KSM_CREATE_SHARED_MEMORY_AREA    _IO(KSMIO,   0x01) /* return SMA fd */
+/*
+ * KSM_START_STOP_KTHREAD - control the kernel thread scanning speed
+ * (can stop the kernel thread from working by setting running = 0)
+ */
+#define KSM_START_STOP_KTHREAD  _IOW(KSMIO,  0x02,\
+ struct ksm_kthread_info)
+/*
+ * KSM_GET_INFO_KTHREAD - return information about the kernel thread
+ * scanning speed.
+ */
+#define KSM_GET_INFO_KTHREAD    _IOW(KSMIO,  0x03,\
+ struct ksm_kthread_info)
+
+
+/* ioctls for SMA fds */
+
+/*
+ * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
+ * scanned by kvm.
+ */
+#define KSM_REGISTER_MEMORY_REGION   _IOW(KSMIO,  0x20,\
+ struct ksm_memory_region)
+/*
+ * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
+ */
+#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO,   0x21)
+
+#endif
diff --git a/qemu/vl.c b/qemu/vl.c
index c52d2d7..54a9dd9 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -130,6 +130,7 @@ int main(int argc, char **argv)
 #define main qemu_main
 #endif /* CONFIG_COCOA */
 
+#include "ksm.h"
 #include hw/hw.h
 #include hw/boards.h
 #include hw/usb.h
@@ -4873,6 +4874,37 @@ static void termsig_setup(void)
 
 #endif
 
+static int ksm_register_memory(void)
+{
+int fd;
+int ksm_fd;
+int r = 1;
+struct ksm_memory_region ksm_region;
+
+fd = open("/dev/ksm", O_RDWR | O_TRUNC, (mode_t)0600);
+if (fd == -1)
+goto out;
+
+ksm_fd = ioctl(fd, KSM_CREATE_SHARED_MEMORY_AREA);
+if (ksm_fd == -1)
+goto out_free;
+
+ksm_region.npages = phys_ram_size / TARGET_PAGE_SIZE;
+ksm_region.addr = (unsigned long)phys_ram_base;
+r = ioctl(ksm_fd, KSM_REGISTER_MEMORY_REGION, &ksm_region);
+if (r)
+goto out_free1;
+
+return r;
+
+out_free1:
+close(ksm_fd);
+out_free:
+close(fd);
+out:
+return r;
+}
+
 int main(int argc, char **argv, char **envp)
 {
 #ifdef CONFIG_GDBSTUB
@@ -5862,6 +5894,8 @@ int main(int argc, char **argv, char **envp)
 /* init the dynamic translator */
 cpu_exec_init_all(tb_size * 1024 * 1024);
 
+ksm_register_memory();
+
 bdrv_init();
 dma_helper_init();
 
-- 
1.5.6.5



[PATCH 2/2] qemu: add ksmctl.

2009-03-30 Thread Izik Eidus
userspace tool to control the ksm kernel thread

Signed-off-by: Izik Eidus iei...@redhat.com
---
 user/Makefile  |6 +++-
 user/config-x86-common.mak |2 +-
 user/ksmctl.c  |   69 
 3 files changed, 75 insertions(+), 2 deletions(-)
 create mode 100644 user/ksmctl.c

diff --git a/user/Makefile b/user/Makefile
index cf7f8ed..a291b37 100644
--- a/user/Makefile
+++ b/user/Makefile
@@ -39,6 +39,10 @@ autodepend-flags = -MMD -MF $(dir $*).$(notdir $*).d
 
 LDFLAGS += -pthread -lrt
 
+ksmctl_objs= ksmctl.o
+ksmctl: $(ksmctl_objs)
+   $(CC) $(LDFLAGS) $^ -o $@
+
 kvmtrace_objs= kvmtrace.o
 
 kvmctl: $(kvmctl_objs)
@@ -56,4 +60,4 @@ $(libcflat): $(cflatobjs)
 -include .*.d
 
 clean: arch_clean
-   $(RM) kvmctl kvmtrace *.o *.a .*.d $(libcflat) $(cflatobjs)
+   $(RM) ksmctl kvmctl kvmtrace *.o *.a .*.d $(libcflat) $(cflatobjs)
diff --git a/user/config-x86-common.mak b/user/config-x86-common.mak
index e789fd4..4303aee 100644
--- a/user/config-x86-common.mak
+++ b/user/config-x86-common.mak
@@ -1,6 +1,6 @@
#This is a make file with common rules for both x86 & x86-64
 
-all: kvmctl kvmtrace test_cases
+all: ksmctl kvmctl kvmtrace test_cases
 
 kvmctl_objs= main.o iotable.o ../libkvm/libkvm.a
 balloon_ctl: balloon_ctl.o
diff --git a/user/ksmctl.c b/user/ksmctl.c
new file mode 100644
index 000..034469f
--- /dev/null
+++ b/user/ksmctl.c
@@ -0,0 +1,69 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include "../qemu/ksm.h"
+
+int main(int argc, char *argv[])
+{
+   int fd;
+   int used = 0;
+   int fd_start;
+   struct ksm_kthread_info info;
+   
+
+   if (argc < 2) {
+   fprintf(stderr, "usage: %s {start npages sleep | stop | info}\n",
+   argv[0]);
+   exit(1);
+   }
+
+   fd = open("/dev/ksm", O_RDWR | O_TRUNC, (mode_t)0600);
+   if (fd == -1) {
+   fprintf(stderr, "could not open /dev/ksm\n");
+   exit(1);
+   }
+
+   if (!strncmp(argv[1], "start", strlen(argv[1]))) {
+   used = 1;
+   if (argc < 4) {
+   fprintf(stderr,
+   "usage: %s start npages_to_scan sleep\n",
+   argv[0]);
+   exit(1);
+   }
+   info.pages_to_scan = atoi(argv[2]);
+   info.sleep = atoi(argv[3]);
+   info.flags = ksm_control_flags_run;
+
+   fd_start = ioctl(fd, KSM_START_STOP_KTHREAD, &info);
+   if (fd_start == -1) {
+   fprintf(stderr, "KSM_START_KTHREAD failed\n");
+   exit(1);
+   }
+   printf("created scanner\n");
+   }
+
+   if (!strncmp(argv[1], "stop", strlen(argv[1]))) {
+   used = 1;
+   info.flags = 0;
+   fd_start = ioctl(fd, KSM_START_STOP_KTHREAD, &info);
+   printf("stopped scanner\n");
+   }
+
+   if (!strncmp(argv[1], "info", strlen(argv[1]))) {
+   used = 1;
+   ioctl(fd, KSM_GET_INFO_KTHREAD, &info);
+printf("flags %d, pages_to_scan %d, sleep_time %d\n",
+info.flags, info.pages_to_scan, info.sleep);
+   }
+
+   if (!used)
+   fprintf(stderr, "unknown command %s\n", argv[1]);
+
+   return 0;
+}
-- 
1.5.6.5



[PATCH 4/4] add ksm kernel shared memory driver.

2009-03-30 Thread Izik Eidus
Ksm is a driver that allows merging identical pages between one or more
applications, in a way invisible to the applications that use it.
Pages that are merged are marked as read-only and are COWed when any
application tries to change them.

Ksm is used for cases where using fork() is not suitable;
one of these cases is where the pages of the application keep changing
dynamically and the application cannot know in advance what pages are
going to be identical.

Ksm works by walking over the memory pages of the applications it
scans in order to find identical pages.
It uses two sorted data structures called the stable and unstable trees
to find the identical pages in an effective way.

When ksm finds two identical pages, it marks them as read-only and merges
them into a single page;
after the pages are marked as read-only and merged into one page, linux
will treat these pages as normal copy-on-write pages and will copy them
when write access happens to them.

Ksm scans just memory areas that were registered to be scanned by it.

Ksm api:

KSM_GET_API_VERSION:
Give the userspace the api version of the module.

KSM_CREATE_SHARED_MEMORY_AREA:
Create a shared memory region fd, which later allows the user to register
the memory regions to scan by using:
KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION

KSM_START_STOP_KTHREAD:
Return information about the kernel thread; the information is returned
using the ksm_kthread_info structure:
ksm_kthread_info:
__u32 sleep:
number of microseconds to sleep between each iteration of
scanning.

__u32 pages_to_scan:
number of pages to scan for each iteration of scanning.

__u32 max_pages_to_merge:
maximum number of pages to merge in each iteration of scanning
(so even if there are still more pages to scan, we stop this
iteration)

__u32 flags:
   flags to control ksmd (right now just ksm_control_flags_run
  available)

KSM_REGISTER_MEMORY_REGION:
Register userspace virtual address range to be scanned by ksm.
This ioctl is using the ksm_memory_region structure:
ksm_memory_region:
__u32 npages;
 number of pages to share inside this memory region.
__u32 pad;
__u64 addr:
the beginning of the virtual address of this region.

KSM_REMOVE_MEMORY_REGION:
Remove memory region from ksm.

Signed-off-by: Izik Eidus iei...@redhat.com
---
 include/linux/ksm.h|   69 +++
 include/linux/miscdevice.h |1 +
 mm/Kconfig |6 +
 mm/Makefile|1 +
 mm/ksm.c   | 1431 
 5 files changed, 1508 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 000..5776dce
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,69 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#include <asm/types.h>
+
+#define KSM_API_VERSION 1
+
+#define ksm_control_flags_run 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+   __u32 npages; /* number of pages to share */
+   __u32 pad;
+   __u64 addr; /* the beginning of the virtual address */
+   __u64 reserved_bits;
+};
+
+struct ksm_kthread_info {
+   __u32 sleep; /* number of microseconds to sleep */
+   __u32 pages_to_scan; /* number of pages to scan */
+   __u32 flags; /* control flags */
+   __u32 pad;
+   __u64 reserved_bits;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+
+#define KSM_GET_API_VERSION  _IO(KSMIO,   0x00)
+/*
+ * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory region fd
+ */
+#define KSM_CREATE_SHARED_MEMORY_AREA _IO(KSMIO,   0x01) /* return SMA fd */
+/*
+ * KSM_START_STOP_KTHREAD - control the kernel thread scanning speed
+ * (can stop the kernel thread from working by setting running = 0)
+ */
+#define KSM_START_STOP_KTHREAD  _IOW(KSMIO,  0x02,\
+ struct ksm_kthread_info)
+/*
+ * KSM_GET_INFO_KTHREAD - return information about the kernel thread
+ * scanning speed.
+ */
+#define KSM_GET_INFO_KTHREAD _IOW(KSMIO,  0x03,\
+ struct ksm_kthread_info)
+
+
+/* ioctls for SMA fds */
+
+/*
+ * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
+ * scanned by ksm.
+ */
+#define KSM_REGISTER_MEMORY_REGION   _IOW(KSMIO,  0x20,\
+ struct ksm_memory_region)
+/*
+ * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
+ */
+#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO,   0x21)
+
+#endif
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index a820f81..6d4f8df 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -29,6 +29,7

Re: Live memory allocation?

2009-03-26 Thread Izik Eidus

Tomasz Chmielewski wrote:

Evert schrieb:

Hi all,

According to the Wikipedia ( 
http://en.wikipedia.org/wiki/Comparison_of_platform_virtual_machines 
) both VirtualBox  VMware server support something called 'Live 
memory allocation'.

Does KVM support this as well?


What does this term mean exactly? Is it the same as ballooning used 
by KVM?



I guess it refers to memory allocation on first access to the
memory areas, meaning the memory allocation will be made only when it
is really going to be used.



(But this is just a guess)
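
If that is indeed what "live memory allocation" means, it is ordinary
demand paging, which a small host-side sketch can show (illustrative
only, not KVM code):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 256UL << 20;	/* reserve 256 MB of address space */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* RSS has barely grown at this point (see VmRSS in /proc/self/status) */
	memset(p, 1, len);		/* first touch: pages get allocated now */
	printf("touched %zu bytes\n", len);
	return 0;
}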
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Porting KVM to Mac OS?

2009-01-25 Thread Izik Eidus

Alexander Graf wrote:

Hi,

On 25.01.2009, at 09:16, Neo Jia wrote:


hi,

I am thinking if it is possible to port KVM to Mac OS (leopard). Is
there anybody doing this already?


I've considered doing it, but haven't gotten around to it, due to lack 
of inspiration.
The biggest problem IMHO is the sync. Rewriting a kvm module for Mac 
OS X should be fairly easy, but you'll miss all the good bugfixes from 
upstream. Using the upstream code with a wrapper on the other hand is 
probably a really big hassle, because osx doesn't really know about 
mmu notifiers and a lot of other Linux internal things.


You can use hardware breakpoints to emulate mmu notifier behavior for a
kernel that you cannot patch.
The only issue is that because the source is closed, you may never know
where the right place to put the notification is.




So if you come up with a good idea for this problem, I'd be glad to 
help you out as much as time permits :-).


Alex


If it is possible, which KVM release should I use as a start?

Thanks,
Neo
--
I would remember that if researchers were not ambitious
probably today we haven't the technology we are using!
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: [PATCH] kvm-userspace: set pci mem to start at 0xc100000 and vesa to 0xc000000

2009-01-17 Thread Izik Eidus

Chris Wright wrote:

* Izik Eidus (iei...@redhat.com) wrote:
  

This patch makes the pci mem region larger (1 GiB now).
This is needed for pci devices that require a large amount of memory,
such as video cards.

For PAE guests this patch is not an issue, because the guest OS will map
the rest of the ram after 0x1...,
for 32-bit guests that aren't PAE, it means the maximum memory that would
be available now is 3 GiB.



Seems a little heavy handed.

a) Given the size...code could be cleaned up so that a simple constant
change doesn't need to touch so much code.
  


Yea it probably can...


b) It is brute force.  I'm not sure it really matters all that much to
limit a 32-bit (non-PAE) guest to 3G, but it's a little extreme for the
cases that don't care about the large hole. 


Is there any way to make it dynamic based on the requirements of the
devices that are part of the launched VM?
  


There is (you need to transfer data to the bios, but it is possible...);
the thing is, there was concern that

it will make Windows crazy if you keep changing the devices' physical
mapping.

Avi, what do you think?


thanks,
-chris
  


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/1] pci hole remapping

2009-01-10 Thread Izik Eidus

Kind of simple; I will send one to qemu later (need to check something
first).

Spice needs this; it allows more memory cache (badly needed when running
with multiple screens).
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm-userspace: set pci mem to start at 0xc100000 and vesa to 0xc000000

2009-01-10 Thread Izik Eidus
This patch makes the pci mem region larger (1 GiB now).
This is needed for pci devices that require a large amount of memory,
such as video cards.

For PAE guests this patch is not an issue, because the guest OS will map
the rest of the ram after 0x1...,
for 32-bit guests that aren't PAE, it means the maximum memory that would
be available now is 3 GiB.
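
For reference, a rough sketch of the guest physical layout this patch
implies (boundaries taken from the diff below; the "above 4G" part is the
above_4g_mem_size handling in qemu/hw/pc.c):

/*
 *   0x00000000 - 0xBFFFFFFF   guest RAM below 4G (at most 3 GiB)
 *   0xC0000000 - end of hole  PCI memory hole, VESA LFB at its base
 *   above 4G                  remaining RAM, seen by PAE/64-bit guests
 */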

Signed-off-by: Izik Eidus iei...@redhat.com
---
 bios/acpi-dsdt.dsl  |2 +-
 bios/rombios.c  |2 +-
 bios/rombios32.c|   10 +-
 qemu/hw/pc.c|6 +++---
 qemu/hw/vga_int.h   |2 +-
 qemu/hw/vmware_vga.c|4 ++--
 qemu/kvm-tpr-opt.c  |2 +-
 qemu/qemu-kvm.c |2 +-
 vgabios/vbe.h   |2 +-
 vgabios/vbe_display_api.txt |2 +-
 10 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/bios/acpi-dsdt.dsl b/bios/acpi-dsdt.dsl
index d67616d..78061ab 100755
--- a/bios/acpi-dsdt.dsl
+++ b/bios/acpi-dsdt.dsl
@@ -226,7 +226,7 @@ DefinitionBlock (
 ,, , AddressRangeMemory, TypeStatic)
 DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, 
NonCacheable, ReadWrite,
 0x00000000, // Address Space Granularity
-0xE0000000, // Address Range Minimum
+0xC0000000, // Address Range Minimum
 0xFEBFFFFF, // Address Range Maximum
 0x00000000, // Address Translation Offset
 0x1EC00000, // Address Length
diff --git a/bios/rombios.c b/bios/rombios.c
index c4f6ccd..146dd52 100644
--- a/bios/rombios.c
+++ b/bios/rombios.c
@@ -9829,7 +9829,7 @@ pcibios_init_sel_reg:
 pcibios_init_iomem_bases:
   push bp
   mov  bp, sp
-  mov  eax, #0xe0000000 ;; base for memory init
+  mov  eax, #0xc0000000 ;; base for memory init
   push eax
   mov  ax, #0xc000 ;; base for i/o init
   push ax
diff --git a/bios/rombios32.c b/bios/rombios32.c
index ab37e13..dceaff6 100755
--- a/bios/rombios32.c
+++ b/bios/rombios32.c
@@ -565,8 +565,8 @@ void setup_mtrr(void)
 wrmsr_smp(MSR_MTRRfix4K_E8000, 0);
 wrmsr_smp(MSR_MTRRfix4K_F0000, 0);
 wrmsr_smp(MSR_MTRRfix4K_F8000, 0);
-/* Mark 3.5-4GB as UC, anything not specified defaults to WB */
-wrmsr_smp(MTRRphysBase_MSR(0), 0xe0000000ull | 0);
+/* Mark 3-4GB as UC, anything not specified defaults to WB */
+wrmsr_smp(MTRRphysBase_MSR(0), 0xc0000000ull | 0);
 wrmsr_smp(MTRRphysMask_MSR(0), ~(0x20000000ull - 1) | 0x800);
 wrmsr_smp(MSR_MTRRdefType, 0xc06);
 }
@@ -924,8 +924,8 @@ static void pci_bios_init_device(PCIDevice *d)
 case 0x0300: /* Display controller - VGA compatible controller */
 if (vendor_id != 0x1234)
 goto default_map;
-/* VGA: map frame buffer to default Bochs VBE address */
-pci_set_io_region_addr(d, 0, 0xE0000000);
+/* VGA: map frame buffer */
+pci_set_io_region_addr(d, 0, 0xC0000000);
 break;
 case 0x0800: /* Generic system peripheral - PIC */
 if (vendor_id == PCI_VENDOR_ID_IBM) {
@@ -1016,7 +1016,7 @@ void pci_for_each_device(void (*init_func)(PCIDevice *d))
 void pci_bios_init(void)
 {
 pci_bios_io_addr = 0xc000;
-pci_bios_mem_addr = 0xf0000000;
+pci_bios_mem_addr = 0xc0000000 + 0x1000000;
 pci_bios_bigmem_addr = ram_size;
 if (pci_bios_bigmem_addr < 0x90000000)
 pci_bios_bigmem_addr = 0x90000000;
diff --git a/qemu/hw/pc.c b/qemu/hw/pc.c
index c470646..c04d9b6 100644
--- a/qemu/hw/pc.c
+++ b/qemu/hw/pc.c
@@ -821,9 +821,9 @@ static void pc_init1(ram_addr_t ram_size, int vga_ram_size,
 BlockDriverState *hd[MAX_IDE_BUS * MAX_IDE_DEVS];
 BlockDriverState *fd[MAX_FD];
 
-if (ram_size >= 0xe0000000 ) {
-above_4g_mem_size = ram_size - 0xe0000000;
-below_4g_mem_size = 0xe0000000;
+if (ram_size >= 0xc0000000 ) {
+above_4g_mem_size = ram_size - 0xc0000000;
+below_4g_mem_size = 0xc0000000;
 } else {
 below_4g_mem_size = ram_size;
 }
diff --git a/qemu/hw/vga_int.h b/qemu/hw/vga_int.h
index 65ac68a..64f594a 100644
--- a/qemu/hw/vga_int.h
+++ b/qemu/hw/vga_int.h
@@ -59,7 +59,7 @@
 #define VBE_DISPI_LFB_ENABLED   0x40
 #define VBE_DISPI_NOCLEARMEM0x80
 
-#define VBE_DISPI_LFB_PHYSICAL_ADDRESS  0xE0000000
+#define VBE_DISPI_LFB_PHYSICAL_ADDRESS  0xC0000000
 
 #ifdef CONFIG_BOCHS_VBE
 
diff --git a/qemu/hw/vmware_vga.c b/qemu/hw/vmware_vga.c
index e30d03f..028bf81 100644
--- a/qemu/hw/vmware_vga.c
+++ b/qemu/hw/vmware_vga.c
@@ -119,14 +119,14 @@ struct pci_vmsvga_state_s {
 # define SVGA_IO_BASE  SVGA_LEGACY_BASE_PORT
 # define SVGA_IO_MUL   1
 # define SVGA_FIFO_SIZE    0x10000
-# define SVGA_MEM_BASE 0xe0000000
+# define SVGA_MEM_BASE 0xc0000000
 # define SVGA_PCI_DEVICE_IDPCI_DEVICE_ID_VMWARE_SVGA2
 #else
 # define SVGA_ID   SVGA_ID_1
 # define

[PATCH 0/2] remove kvm vmap usage

2008-12-28 Thread Izik Eidus
Remove the vmap usage from kvm; this is needed both for ksm and for
get_user_pages != write.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] KVM: introducing kvm_read_guest_virt, kvm_write_guest_virt.

2008-12-28 Thread Izik Eidus
This commit changes the name of emulator_read_std to kvm_read_guest_virt,
and adds a new function named kvm_write_guest_virt that allows writing to a
guest virtual address.
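
As an illustration of how the new helper is meant to be used (a
hypothetical caller, not part of this patch), an emulation path that
needs to copy a result back into guest-virtual memory could do:

static int write_result_to_guest(struct kvm_vcpu *vcpu, gva_t dst, u64 value)
{
	return kvm_write_guest_virt(dst, &value, sizeof(value), vcpu);
}

X86EMUL_PROPAGATE_FAULT and X86EMUL_UNHANDLEABLE are returned exactly as
kvm_read_guest_virt reports them today.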

Signed-off-by: Izik Eidus iei...@redhat.com
---
 arch/x86/include/asm/kvm_host.h |4 ---
 arch/x86/kvm/x86.c  |   56 +-
 2 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ab8ef1d..a129700 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -608,10 +608,6 @@ void kvm_inject_nmi(struct kvm_vcpu *vcpu);
 
 void fx_init(struct kvm_vcpu *vcpu);
 
-int emulator_read_std(unsigned long addr,
- void *val,
- unsigned int bytes,
- struct kvm_vcpu *vcpu);
 int emulator_write_emulated(unsigned long addr,
const void *val,
unsigned int bytes,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aa4575c..c812209 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1973,10 +1973,8 @@ static struct kvm_io_device *vcpu_find_mmio_dev(struct 
kvm_vcpu *vcpu,
return dev;
 }
 
-int emulator_read_std(unsigned long addr,
-void *val,
-unsigned int bytes,
-struct kvm_vcpu *vcpu)
+int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
+   struct kvm_vcpu *vcpu)
 {
void *data = val;
int r = X86EMUL_CONTINUE;
@@ -1984,27 +1982,57 @@ int emulator_read_std(unsigned long addr,
while (bytes) {
	gpa_t gpa = vcpu->arch.mmu.gva_to_gpa(vcpu, addr);
	unsigned offset = addr & (PAGE_SIZE-1);
-   unsigned tocopy = min(bytes, (unsigned)PAGE_SIZE - offset);
+   unsigned toread = min(bytes, (unsigned)PAGE_SIZE - offset);
int ret;
 
if (gpa == UNMAPPED_GVA) {
r = X86EMUL_PROPAGATE_FAULT;
goto out;
}
-   ret = kvm_read_guest(vcpu-kvm, gpa, data, tocopy);
+   ret = kvm_read_guest(vcpu-kvm, gpa, data, toread);
	if (ret < 0) {
r = X86EMUL_UNHANDLEABLE;
goto out;
}
 
-   bytes -= tocopy;
-   data += tocopy;
-   addr += tocopy;
+   bytes -= toread;
+   data += toread;
+   addr += toread;
}
 out:
return r;
 }
-EXPORT_SYMBOL_GPL(emulator_read_std);
+
+int kvm_write_guest_virt(gva_t addr, void *val, unsigned int bytes,
+struct kvm_vcpu *vcpu)
+{
+   void *data = val;
+   int r = X86EMUL_CONTINUE;
+
+   while (bytes) {
+   gpa_t gpa = vcpu->arch.mmu.gva_to_gpa(vcpu, addr);
+   unsigned offset = addr & (PAGE_SIZE-1);
+   unsigned towrite = min(bytes, (unsigned)PAGE_SIZE - offset);
+   int ret;
+
+   if (gpa == UNMAPPED_GVA) {
+   r = X86EMUL_PROPAGATE_FAULT;
+   goto out;
+   }
+   ret = kvm_write_guest(vcpu-kvm, gpa, data, towrite);
+   if (ret < 0) {
+   r = X86EMUL_UNHANDLEABLE;
+   goto out;
+   }
+
+   bytes -= towrite;
+   data += towrite;
+   addr += towrite;
+   }
+out:
+   return r;
+}
+
 
 static int emulator_read_emulated(unsigned long addr,
  void *val,
@@ -2026,8 +2054,8 @@ static int emulator_read_emulated(unsigned long addr,
	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
goto mmio;
 
-   if (emulator_read_std(addr, val, bytes, vcpu)
-   == X86EMUL_CONTINUE)
+   if (kvm_read_guest_virt(addr, val, bytes, vcpu)
+   == X86EMUL_CONTINUE)
return X86EMUL_CONTINUE;
if (gpa == UNMAPPED_GVA)
return X86EMUL_PROPAGATE_FAULT;
@@ -2230,7 +2258,7 @@ void kvm_report_emulation_failure(struct kvm_vcpu *vcpu, 
const char *context)
 
rip_linear = rip + get_segment_base(vcpu, VCPU_SREG_CS);
 
-   emulator_read_std(rip_linear, (void *)opcodes, 4, vcpu);
+   kvm_read_guest_virt(rip_linear, (void *)opcodes, 4, vcpu);
 
	printk(KERN_ERR "emulation failed (%s) rip %lx %02x %02x %02x %02x\n",
	       context, rip, opcodes[0], opcodes[1], opcodes[2], opcodes[3]);
@@ -2238,7 +2266,7 @@ void kvm_report_emulation_failure(struct kvm_vcpu *vcpu, 
const char *context)
 EXPORT_SYMBOL_GPL(kvm_report_emulation_failure);
 
 static struct x86_emulate_ops emulate_ops = {
-   .read_std= emulator_read_std,
+   .read_std= kvm_read_guest_virt,
.read_emulated
