Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-09 Thread Andrea Arcangeli
On Fri, May 09, 2008 at 08:37:29PM +0200, Peter Zijlstra wrote:
 Another possibility, would something like this work?
 
  
  /*
   * null out the begin function, no new begin calls can be made
   */
  rcu_assign_pointer(my_notifier.invalidate_start_begin, NULL);
 
  /*
   * lock/unlock all rmap locks in any order - this ensures that any
   * pending start() will have its end() function called.
   */
  mm_barrier(mm);
 
  /*
   * now that no new start() call can be made and all start()/end() pairs
   * are complete we can remove the notifier.
   */
  mmu_notifier_remove(mm, my_notifier);
 
 
 This requires a mmu_notifier instance per attached mm and that
 __mmu_notifier_invalidate_range_start() uses rcu_dereference() to obtain
 the function.
 
 But I think its enough to ensure that:
 
   for each start an end will be called

We don't need that, it's perfectly ok if start is called but end is
not; it's ok to unregister in the middle as I guarantee ->release is
called before mmu_notifier_unregister returns (if ->release is needed
at all, which is not the case for KVM/GRU).

Unregister is already solved with srcu/rcu without any additional
complication as we don't need the guarantee that for each start an end
will be called.

 It can however happen that end is called without start - but we could
 handle that I think.

The only reason mm_lock() was introduced is to solve register, to
guarantee that for each end there was a start. We can't handle end
called without start in the driver.

The reason the driver must be prevented from registering in the middle
of start/end is that, if that ever happens, the driver has no way to
know it must stop the secondary mmu page faults from calling
get_user_pages and instantiating sptes/secondary tlbs on pages that
will be freed as soon as zap_page_range starts.



[kvm-devel] [PATCH 001/001] mmu-notifier-core v17

2008-05-09 Thread Andrea Arcangeli
From: Andrea Arcangeli [EMAIL PROTECTED]

With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean secondary pte. In the GRU case
there's no actual secondary pte and there's only a secondary tlb,
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case for KVM, where a
secondary tlb miss will walk sptes in hardware and will refill the
secondary tlb transparently to software if the corresponding spte is
present). The same way zap_page_range has to invalidate the pte before
freeing the page, the spte (and secondary tlb) must also be
invalidated before any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte because they're part of the guest working set. Furthermore a spte
unmap event can immediately lead to a page being freed when the pin is
released (so requiring the same complex and relatively slow tlb_gather
smp-safe logic we have in zap_page_range, which can be avoided
completely if the spte unmap event doesn't require an unpin of the
page previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and
know when the VM is swapping or freeing or doing anything on the
primary MMU so that the secondary MMU code can drop sptes before the
pages are freed, avoiding all page pinning and allowing 100% reliable
swapping of guest physical address space. Furthermore it spares the
code that tears down the secondary MMU mappings from implementing a
tlb_gather-like logic as in zap_page_range, which would require many
IPIs to flush other cpu tlbs for each fixed number of sptes unmapped.

To give an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings
will be invalidated, and the next secondary-mmu page fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an
updated spte or secondary-tlb mapping on the copied page. Or it will
set up a readonly spte or readonly tlb mapping if it's a guest read,
i.e. if it calls get_user_pages with write=0. This is just an example.

This allows mapping any page pointed to by any pte (and in turn visible
in the primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU,
or a full MMU with both sptes and a secondary tlb like the
shadow-pagetable layer with kvm), or into a remote DMA in software like
XPMEM (hence the need to schedule in XPMEM code to send the invalidate
to the remote node, while there is no need to schedule in kvm/gru as
it's an immediate event, like invalidating a primary-mmu pte).
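
To make the driver side of this more concrete, here is a minimal sketch of
how a secondary MMU driver could attach to an mm. It assumes the ops
structure and mmu_notifier_register keep the names used in this patchset,
and the callback bodies are only placeholders, not real driver code:

#include <linux/mmu_notifier.h>
#include <linux/mm.h>

struct drv_notifier {
	struct mmu_notifier mn;
	/* driver-private secondary MMU state would live here */
};

static void drv_invalidate_page(struct mmu_notifier *mn,
				struct mm_struct *mm, unsigned long address)
{
	/* drop the spte/secondary tlb entry for this address and flush */
}

static void drv_invalidate_range_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end)
{
	/* stop secondary faults in [start, end) and drop existing sptes */
}

static void drv_invalidate_range_end(struct mmu_notifier *mn,
				     struct mm_struct *mm,
				     unsigned long start, unsigned long end)
{
	/* secondary faults may call get_user_pages and map sptes again */
}

static const struct mmu_notifier_ops drv_ops = {
	.invalidate_page	= drv_invalidate_page,
	.invalidate_range_start	= drv_invalidate_range_start,
	.invalidate_range_end	= drv_invalidate_range_end,
};

/* called when the secondary MMU context (e.g. a KVM VM) is created */
static int drv_attach(struct drv_notifier *n, struct mm_struct *mm)
{
	n->mn.ops = &drv_ops;
	return mmu_notifier_register(&n->mn, mm);
}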

At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for
   list_del_rcu too)

2) mm_take_all_locks() to register the mmu notifier when the whole VM
   isn't doing anything with mm. This allows mmu notifier users to
   keep track of whether the VM is in the middle of the
   invalidate_range_begin/end critical section with an atomic counter
   increased in range_begin and decreased in range_end (a sketch of
   this counting scheme follows this list). No secondary MMU page
   fault is allowed to map any spte or secondary tlb reference while
   the VM is in the middle of range_begin/end, as any page returned by
   get_user_pages in that critical section could later immediately be
   freed without any further ->invalidate_page notification
   (invalidate_range_begin/end works on ranges and ->invalidate_page
   isn't called immediately before freeing the page). To stop all page
   freeing and pagetable overwrites the mmap_sem must be taken in
   write mode and all other anon_vma/i_mmap locks must be taken too.

3) It'd be a waste to add branches in the VM if nobody could possibly
   run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be
   enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take
   advantage of mmu notifiers, but this already allows compiling a
   KVM external module against a kernel with mmu notifiers enabled, and
   from the next pull from kvm.git we'll start using them. And
   GRU/XPMEM will also be able to continue their development by enabling
   KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code
   to the mainline kernel. Then they can also enable MMU_NOTIFIERS in
   the same way KVM does it (even if KVM=n). This guarantees nobody
   selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n.
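
Here is the counting scheme of point 2 spelled out as code. This is only
a sketch with made-up driver names (drv_*), not something in the patch:

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <asm/atomic.h>

struct drv_mmu {
	struct mmu_notifier mn;
	atomic_t invalidate_count;	/* range_begins minus range_ends so far */
};

static void drv_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
			    unsigned long start, unsigned long end)
{
	struct drv_mmu *d = container_of(mn, struct drv_mmu, mn);

	atomic_inc(&d->invalidate_count);
	/* drop all sptes/secondary tlb entries covering [start, end) */
}

static void drv_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
			  unsigned long start, unsigned long end)
{
	struct drv_mmu *d = container_of(mn, struct drv_mmu, mn);

	atomic_dec(&d->invalidate_count);
}

/*
 * Secondary MMU fault path: refuse to instantiate a new spte while the
 * primary MMU is inside a range_begin/end critical section, because a
 * page returned by get_user_pages now could be freed before range_end.
 */
static int drv_may_map_spte(struct drv_mmu *d)
{
	return atomic_read(&d->invalidate_count) == 0;
}

Registering under mm_take_all_locks() is what guarantees the counter can
never see a range_end without having seen the matching range_begin.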

The mmu_notifier_register call can fail because mm_take_all_locks may
be interrupted

Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-08 Thread Andrea Arcangeli
On Thu, May 08, 2008 at 09:11:33AM -0700, Linus Torvalds wrote:
 Btw, this is an issue only on 32-bit x86, because on 64-bit one we already 
 have the padding due to the alignment of the 64-bit pointers in the 
 list_head (so there's already empty space there).
 
 On 32-bit, the alignment of list-head is obviously just 32 bits, so right 
 now the structure is perfectly packed and doesn't have any empty space. 
 But that's just because the spinlock is unnecessarily big.
 
 (Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the 
 structure really will grow. That's a very odd configuration, though, and 
 not one I feel we really need to care about).

I see two ways to implement it:

1) use #ifdef and make it zero overhead for 64bit only without playing
any non obvious trick.

struct anon_vma {
   spinlock_t lock;
#ifdef CONFIG_MMU_NOTIFIER
   int global_mm_lock:1;
#endif

struct address_space {
   spinlock_t   private_lock;
#ifdef CONFIG_MMU_NOTIFIER
   int global_mm_lock:1;
#endif

2) add a:

#define AS_GLOBAL_MM_LOCK   (__GFP_BITS_SHIFT + 2)  /* global_mm_locked */

and use address_space->flags with bitops

And as Andrew pointed out to me by PM, for the anon_vma we can use the
LSB of the list.next/prev because the list can't be browsed while the
lock is taken, so taking the lock, then setting the bit, and
clearing the bit before unlocking, is safe. The LSB will always read 0,
even if it's under list_add modification, when the global spinlock isn't
taken. And after taking the anon_vma lock we can switch the LSB
from 0 to 1 without races and the 1 will be protected by the
global spinlock.

The above solution is zero cost for 32bit too, so I prefer it.
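
A rough sketch of the LSB trick for the anon_vma case; the helper names
are mine and purely illustrative, not from any posted patch:

#include <linux/bitops.h>
#include <linux/rmap.h>

/*
 * Illustration only: bit 0 of anon_vma->head.next is always 0 outside the
 * anon_vma lock because list_head pointers are at least 4-byte aligned,
 * and the list is never walked while the lock is held.  So the bit can be
 * set under the anon_vma lock to mark "global mm lock taken" and cleared
 * before unlocking, at zero space cost on 32bit and 64bit alike.
 */
static inline void anon_vma_mark_global_mm_lock(struct anon_vma *anon_vma)
{
	/* caller holds the anon_vma lock */
	__set_bit(0, (unsigned long *) &anon_vma->head.next);
}

static inline void anon_vma_clear_global_mm_lock(struct anon_vma *anon_vma)
{
	/* caller still holds the anon_vma lock */
	__clear_bit(0, (unsigned long *) &anon_vma->head.next);
}

static inline int anon_vma_global_mm_lock_taken(struct anon_vma *anon_vma)
{
	return test_bit(0, (unsigned long *) &anon_vma->head.next);
}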

So I now agree with you this is a great idea on how to remove sort()
and vmalloc and especially vfree without increasing the VM footprint.

I'll send an update with this for review very shortly and I hope this
goes in so KVM will be able to swap and do many other things very well
starting in 2.6.26.

Thanks a lot,
Andrea



[kvm-devel] [PATCH 04 of 11] free-pgtables

2008-05-07 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1210115130 -7200
# Node ID 34f6a4bf67ce66714ba2d5c13a5fed241d34fb09
# Parent  d60d200565abde6a8ed45271e53cde9c5c75b426
free-pgtables

Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables() and we cannot sleep while gathering pages for a tlb
flush.

Move the tlb_gather/tlb_finish call to free_pgtables() to be done
for each vma. This may add a number of tlb flushes depending on the
number of vmas that cannot be coalesced into one.

The first pointer argument to free_pgtables() can then be dropped.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -772,8 +772,8 @@ int walk_page_range(const struct mm_stru
void *private);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
-   unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+   unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -272,9 +272,11 @@ void free_pgd_range(struct mmu_gather **
} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
-   unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+   unsigned long ceiling)
 {
+   struct mmu_gather *tlb;
+
while (vma) {
struct vm_area_struct *next = vma->vm_next;
unsigned long addr = vma->vm_start;
@@ -286,7 +288,8 @@ void free_pgtables(struct mmu_gather **t
unlink_file_vma(vma);
 
if (is_vm_hugetlb_page(vma)) {
-   hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+   tlb = tlb_gather_mmu(vma->vm_mm, 0);
+   hugetlb_free_pgd_range(&tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
} else {
/*
@@ -299,9 +302,11 @@ void free_pgtables(struct mmu_gather **t
anon_vma_unlink(vma);
unlink_file_vma(vma);
}
-   free_pgd_range(tlb, addr, vma->vm_end,
+   tlb = tlb_gather_mmu(vma->vm_mm, 0);
+   free_pgd_range(&tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
}
+   tlb_finish_mmu(tlb, addr, vma->vm_end);
vma = next;
}
 }
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1759,9 +1759,9 @@ static void unmap_region(struct mm_struc
update_hiwater_rss(mm);
unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
-   free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+   tlb_finish_mmu(tlb, start, end);
+   free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 next? next->vm_start: 0);
-   tlb_finish_mmu(tlb, start, end);
 }
 
 /*
@@ -2060,8 +2060,8 @@ void exit_mmap(struct mm_struct *mm)
/* Use -1 here to ensure all VMAs in the mm are unmapped */
end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
-   free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+   free_pgtables(vma, FIRST_USER_ADDRESS, 0);
 
/*
 * Walk the list again, actually closing and freeing it,



[kvm-devel] [PATCH 03 of 11] invalidate_page outside PT lock

2008-05-07 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1210115129 -7200
# Node ID d60d200565abde6a8ed45271e53cde9c5c75b426
# Parent  c5badbefeee07518d9d1acca13e94c981420317c
invalidate_page outside PT lock

Moves all mmu notifier methods outside the PT lock (first and not last
step to make them sleep capable).

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -210,35 +210,6 @@ static inline void mmu_notifier_mm_destr
__mmu_notifier_mm_destroy(mm);
 }
 
-/*
- * These two macros will sometime replace ptep_clear_flush.
- * ptep_clear_flush is implemented as a macro itself, so this also is
- * implemented as a macro until ptep_clear_flush is converted to an
- * inline function, to diminish the risk of compilation failure. The
- * invalidate_page method over time can be moved outside the PT lock
- * and these two macros can be later removed.
- */
-#define ptep_clear_flush_notify(__vma, __address, __ptep)  \
-({ \
-   pte_t __pte;\
-   struct vm_area_struct *___vma = __vma;  \
-   unsigned long ___address = __address;   \
-   __pte = ptep_clear_flush(___vma, ___address, __ptep);   \
-   mmu_notifier_invalidate_page(___vma->vm_mm, ___address);\
-   __pte;  \
-})
-
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep)
\
-({ \
-   int __young;\
-   struct vm_area_struct *___vma = __vma;  \
-   unsigned long ___address = __address;   \
-   __young = ptep_clear_flush_young(___vma, ___address, __ptep);   \
-   __young |= mmu_notifier_clear_flush_young(___vma->vm_mm,\
- ___address);  \
-   __young;\
-})
-
 #else /* CONFIG_MMU_NOTIFIER */
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -274,9 +245,6 @@ static inline void mmu_notifier_mm_destr
 {
 }
 
-#define ptep_clear_flush_young_notify ptep_clear_flush_young
-#define ptep_clear_flush_notify ptep_clear_flush
-
 #endif /* CONFIG_MMU_NOTIFIER */
 
 #endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -188,11 +188,13 @@ __xip_unmap (struct address_space * mapp
if (pte) {
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
-   pteval = ptep_clear_flush_notify(vma, address, pte);
+   pteval = ptep_clear_flush(vma, address, pte);
page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
+   /* must invalidate_page _before_ freeing the page */
+   mmu_notifier_invalidate_page(mm, address);
page_cache_release(page);
}
}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1714,9 +1714,10 @@ static int do_wp_page(struct mm_struct *
 */
page_table = pte_offset_map_lock(mm, pmd, address,
 ptl);
-   page_cache_release(old_page);
+   new_page = NULL;
if (!pte_same(*page_table, orig_pte))
goto unlock;
+   page_cache_release(old_page);
 
page_mkwrite = 1;
}
@@ -1732,6 +1733,7 @@ static int do_wp_page(struct mm_struct *
if (ptep_set_access_flags(vma, address, page_table, entry,1))
update_mmu_cache(vma, address, entry);
ret |= VM_FAULT_WRITE;
+   old_page = new_page = NULL;
goto unlock;
}
 
@@ -1776,7 +1778,7 @@ gotten:
 * seen in the presence of one thread doing SMC and another
 * thread doing COW.
 */
-   ptep_clear_flush_notify(vma, address, page_table);
+   ptep_clear_flush(vma, address, page_table);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
lru_cache_add_active(new_page);
@@ -1788,12 +1790,18 @@ gotten

[kvm-devel] [PATCH 10 of 11] export zap_page_range for XPMEM

2008-05-07 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1210115797 -7200
# Node ID 5b2eb7d28a4517daf91b08b4dcfbb58fd2b42d0b
# Parent  94eaa1515369e8ef183e2457f6f25a7f36473d70
export zap_page_range for XPMEM

XPMEM would have used sys_madvise() except that madvise_dontneed()
returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages
XPMEM imports from other partitions and is also true for uncached pages
allocated locally via the mspec allocator.  XPMEM needs zap_page_range()
functionality for these types of pages as well as 'normal' pages.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -954,6 +954,7 @@ unsigned long zap_page_range(struct vm_a
 
return unmap_vmas(vma, address, end, &nr_accounted, details);
 }
+EXPORT_SYMBOL_GPL(zap_page_range);
 
 /*
  * Do a quick page-table lookup for a single page.



[kvm-devel] [PATCH 11 of 11] mmap sems

2008-05-07 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1210115798 -7200
# Node ID eb924315351f6b056428e35c983ad28040420fea
# Parent  5b2eb7d28a4517daf91b08b4dcfbb58fd2b42d0b
mmap sems

This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -79,6 +79,9 @@ generic_file_direct_IO(int rw, struct ki
  *
 *  ->i_mutex  (generic_file_buffered_write)
 *    ->mmap_sem   (fault_in_pages_readable->do_page_fault)
+ *
+ *    When taking multiple mmap_sems, one should lock the lowest-addressed
+ *    one first proceeding on up to the highest-addressed one.
 *
 *  ->i_mutex
 *    ->i_alloc_sem (various)
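
As an illustration of the ordering rule added above, a caller that needs
two mmap_sems at once could do something like this (sketch only, lockdep
annotations omitted, not part of the patch):

#include <linux/mm_types.h>
#include <linux/rwsem.h>

/* Take both mmap_sems, lowest-addressed semaphore first, so that two
 * tasks locking the same pair in opposite order cannot deadlock. */
static void lock_two_mmap_sems(struct mm_struct *mm1, struct mm_struct *mm2)
{
	struct rw_semaphore *low = &mm1->mmap_sem;
	struct rw_semaphore *high = &mm2->mmap_sem;

	if ((unsigned long) low > (unsigned long) high) {
		low = &mm2->mmap_sem;
		high = &mm1->mmap_sem;
	}
	down_write(low);
	if (low != high)
		down_write(high);
}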



[kvm-devel] [PATCH 06 of 11] rwsem contended

2008-05-07 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1210115132 -7200
# Node ID 0621238970155f8ff2d60ca4996dcdd470f9c6ce
# Parent  20bc6a66a86ef6bd60919cc77ff51d4af741b057
rwsem contended

Add a function to rw_semaphores to check if there are any processes
waiting for the semaphore. Add rwsem_needbreak to sched.h that works
in the same way as spinlock_needbreak().

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -57,6 +57,8 @@ extern void up_write(struct rw_semaphore
  */
 extern void downgrade_write(struct rw_semaphore *sem);
 
+extern int rwsem_is_contended(struct rw_semaphore *sem);
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 /*
  * nested locking. NOTE: rwsems are not allowed to recurse
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2030,6 +2030,15 @@ static inline int spin_needbreak(spinloc
 #endif
 }
 
+static inline int rwsem_needbreak(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_PREEMPT
+   return rwsem_is_contended(sem);
+#else
+   return 0;
+#endif
+}
+
 /*
  * Reevaluate whether the task has signals pending delivery.
  * Wake the task if so.
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -305,6 +305,18 @@ void __downgrade_write(struct rw_semapho
spin_unlock_irqrestore(&sem->wait_lock, flags);
 }
 
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+   /*
+* Racy check for an empty list. False positives or negatives
+* would be okay. False positive may cause a useless dropping of
+* locks. False negatives may cause locks to be held a bit
+* longer until the next check.
+*/
+   return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
 EXPORT_SYMBOL(__init_rwsem);
 EXPORT_SYMBOL(__down_read);
 EXPORT_SYMBOL(__down_read_trylock);
diff --git a/lib/rwsem.c b/lib/rwsem.c
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -251,6 +251,18 @@ asmregparm struct rw_semaphore *rwsem_do
return sem;
 }
 
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+   /*
+* Racy check for an empty list. False positives or negatives
+* would be okay. False positive may cause a useless dropping of
+* locks. False negatives may cause locks to be held a bit
+* longer until the next check.
+*/
+   return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
 EXPORT_SYMBOL(rwsem_down_read_failed);
 EXPORT_SYMBOL(rwsem_down_write_failed);
 EXPORT_SYMBOL(rwsem_wake);



[kvm-devel] [PATCH 09 of 11] mm_lock-rwsem

2008-05-07 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1210115508 -7200
# Node ID 94eaa1515369e8ef183e2457f6f25a7f36473d70
# Parent  6b384bb988786aa78ef07440180e4b2948c4c6a2
mm_lock-rwsem

Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
conversion.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1084,10 +1084,10 @@ extern int install_special_mapping(struc
   unsigned long flags, struct page **pages);
 
 struct mm_lock_data {
-   spinlock_t **i_mmap_locks;
-   spinlock_t **anon_vma_locks;
-   size_t nr_i_mmap_locks;
-   size_t nr_anon_vma_locks;
+   struct rw_semaphore **i_mmap_sems;
+   struct rw_semaphore **anon_vma_sems;
+   size_t nr_i_mmap_sems;
+   size_t nr_anon_vma_sems;
 };
 extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
 extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2255,8 +2255,8 @@ int install_special_mapping(struct mm_st
 
 static int mm_lock_cmp(const void *a, const void *b)
 {
-   unsigned long _a = (unsigned long)*(spinlock_t **)a;
-   unsigned long _b = (unsigned long)*(spinlock_t **)b;
+   unsigned long _a = (unsigned long)*(struct rw_semaphore **)a;
+   unsigned long _b = (unsigned long)*(struct rw_semaphore **)b;
 
cond_resched();
if (_a < _b)
@@ -2266,7 +2266,7 @@ static int mm_lock_cmp(const void *a, co
return 0;
 }
 
-static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems,
  int anon)
 {
struct vm_area_struct *vma;
@@ -2275,59 +2275,59 @@ static unsigned long mm_lock_sort(struct
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (anon) {
if (vma->anon_vma)
-   locks[i++] = &vma->anon_vma->lock;
+   sems[i++] = &vma->anon_vma->sem;
} else {
if (vma->vm_file && vma->vm_file->f_mapping)
-   locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock;
+   sems[i++] = &vma->vm_file->f_mapping->i_mmap_sem;
}
}
 
if (!i)
goto out;
 
-   sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL);
+   sort(sems, i, sizeof(struct rw_semaphore *), mm_lock_cmp, NULL);
 
 out:
return i;
 }
 
 static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm,
- spinlock_t **locks)
+ struct rw_semaphore **sems)
 {
-   return mm_lock_sort(mm, locks, 1);
+   return mm_lock_sort(mm, sems, 1);
 }
 
 static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm,
-   spinlock_t **locks)
+   struct rw_semaphore **sems)
 {
-   return mm_lock_sort(mm, locks, 0);
+   return mm_lock_sort(mm, sems, 0);
 }
 
-static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock)
+static void mm_lock_unlock(struct rw_semaphore **sems, size_t nr, int lock)
 {
-   spinlock_t *last = NULL;
+   struct rw_semaphore *last = NULL;
size_t i;
 
for (i = 0; i < nr; i++)
/*  Multiple vmas may use the same lock. */
-   if (locks[i] != last) {
-   BUG_ON((unsigned long) last > (unsigned long) locks[i]);
-   last = locks[i];
+   if (sems[i] != last) {
+   BUG_ON((unsigned long) last > (unsigned long) sems[i]);
+   last = sems[i];
if (lock)
-   spin_lock(last);
+   down_write(last);
else
-   spin_unlock(last);
+   up_write(last);
}
 }
 
-static inline void __mm_lock(spinlock_t **locks, size_t nr)
+static inline void __mm_lock(struct rw_semaphore **sems, size_t nr)
 {
-   mm_lock_unlock(locks, nr, 1);
+   mm_lock_unlock(sems, nr, 1);
 }
 
-static inline void __mm_unlock(spinlock_t **locks, size_t nr)
+static inline void __mm_unlock(struct rw_semaphore **sems, size_t nr)
 {
-   mm_lock_unlock(locks, nr, 0);
+   mm_lock_unlock(sems, nr, 0);
 }
 
 /*
@@ -2358,10 +2358,10 @@ static inline void __mm_unlock(spinlock_
  * of vmas is defined in /proc/sys/vm/max_map_count.
  *
  * mm_lock() can fail if memory allocation fails. The worst case
- * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *),
- * so around 1Mbyte, but in practice it'll be much less

[kvm-devel] [PATCH 05 of 11] unmap vmas tlb flushing

2008-05-07 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1210115131 -7200
# Node ID 20bc6a66a86ef6bd60919cc77ff51d4af741b057
# Parent  34f6a4bf67ce66714ba2d5c13a5fed241d34fb09
unmap vmas tlb flushing

Move the tlb flushing inside of unmap vmas. This saves us from passing
a pointer to the TLB structure around and simplifies the callers.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -744,8 +744,7 @@ struct page *vm_normal_page(struct vm_ar
 
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
-   struct vm_area_struct *start_vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *);
 
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -849,7 +849,6 @@ static unsigned long unmap_page_range(st
 
 /**
  * unmap_vmas - unmap a range of memory covered by a list of vma's
- * @tlbp: address of the caller's struct mmu_gather
  * @vma: the starting vma
  * @start_addr: virtual address at which to start unmapping
  * @end_addr: virtual address at which to end unmapping
@@ -861,20 +860,13 @@ static unsigned long unmap_page_range(st
  * Unmap all pages in the vma list.
  *
  * We aim to not hold locks for too long (for scheduling latency reasons).
- * So zap pages in ZAP_BLOCK_SIZE bytecounts.  This means we need to
- * return the ending mmu_gather to the caller.
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts.
  *
  * Only addresses between `start' and `end' will be unmapped.
  *
  * The VMA list must be sorted in ascending virtual address order.
- *
- * unmap_vmas() assumes that the caller will flush the whole unmapped address
- * range after unmap_vmas() returns.  So the only responsibility here is to
- * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
- * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
-   struct vm_area_struct *vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *details)
 {
@@ -883,9 +875,14 @@ unsigned long unmap_vmas(struct mmu_gath
int tlb_start_valid = 0;
unsigned long start = start_addr;
spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
-   int fullmm = (*tlbp)->fullmm;
+   int fullmm;
+   struct mmu_gather *tlb;
struct mm_struct *mm = vma->vm_mm;
 
+   lru_add_drain();
+   tlb = tlb_gather_mmu(mm, 0);
+   update_hiwater_rss(mm);
+   fullmm = tlb->fullmm;
mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;
@@ -912,7 +909,7 @@ unsigned long unmap_vmas(struct mmu_gath
(HPAGE_SIZE / PAGE_SIZE);
start = end;
} else
-   start = unmap_page_range(*tlbp, vma,
+   start = unmap_page_range(tlb, vma,
start, end, &zap_work, details);
 
if (zap_work > 0) {
@@ -920,22 +917,23 @@ unsigned long unmap_vmas(struct mmu_gath
break;
}
 
-   tlb_finish_mmu(*tlbp, tlb_start, start);
+   tlb_finish_mmu(tlb, tlb_start, start);
 
if (need_resched() ||
(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
if (i_mmap_lock) {
-   *tlbp = NULL;
+   tlb = NULL;
goto out;
}
cond_resched();
}
 
-   *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
+   tlb = tlb_gather_mmu(vma->vm_mm, fullmm);
tlb_start_valid = 0;
zap_work = ZAP_BLOCK_SIZE;
}
}
+   tlb_finish_mmu(tlb, start_addr, end_addr);
 out:
mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
return start;   /* which is now the end (or restart) address */
@@ -951,18 +949,10 @@ unsigned long zap_page_range(struct vm_a
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address

[kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1210115136 -7200
# Node ID 6b384bb988786aa78ef07440180e4b2948c4c6a2
# Parent  58f716ad4d067afb6bdd1b5f7042e19d854aae0d
anon-vma-rwsem

Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows the calling of sleeping functions from reverse map traversal as
needed for the notifier callbacks, and increases the potential concurrency.

Rcu is used in some context to guarantee the presence of the anon_vma
(try_to_unmap) while we acquire the anon_vma lock. We cannot take a
semaphore within an rcu critical section. Add a refcount to the anon_vma
structure which allow us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma. I think this is a bug
because the anon_vma may become empty and get scheduled to be freed
but then we increase the refcount again when the migration entries are
removed.

The refcount in general allows a shortening of RCU critical sections since
we can do a rcu_unlock after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.
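
For example, the speculative reference lets a long path pin the anon_vma
with the refcount instead of staying inside rcu for its whole duration.
A sketch using the helpers added below, and assuming grab_anon_vma expects
the rcu read lock to be held by the caller (its body isn't shown in this
excerpt):

#include <linux/rmap.h>
#include <linux/rcupdate.h>

static struct anon_vma *pin_anon_vma_for_long_path(struct page *page)
{
	struct anon_vma *anon_vma;

	rcu_read_lock();
	anon_vma = grab_anon_vma(page);	/* takes the refcount, may return NULL */
	rcu_read_unlock();

	/* caller drops the reference with put_anon_vma() when done */
	return anon_vma;
}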

However:
- Atomic overhead increases in situations where a new reference
  to the anon_vma has to be established or removed. Overhead also increases
  when a speculative reference is used (try_to_unmap,
  page_mkclean, page migration).
- There is the potential for more frequent processor change due to up_xxx
  letting waiting tasks run first. This results in f.e. the Aim9 brk
  performance test to got down by 10-15%.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -25,7 +25,8 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-   spinlock_t lock;/* Serialize access to vma list */
+   atomic_t refcount;  /* vmas on the list */
+   struct rw_semaphore sem;/* Serialize access to vma list */
struct list_head head;  /* List of private related vmas */
 };
 
@@ -43,18 +44,31 @@ static inline void anon_vma_free(struct 
kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+   atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+   if (atomic_dec_and_test(&anon_vma->refcount))
+   anon_vma_free(anon_vma);
+}
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
-   spin_lock(&anon_vma->lock);
+   down_write(&anon_vma->sem);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
-   spin_unlock(&anon_vma->lock);
+   up_write(&anon_vma->sem);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s
return;
 
/*
-* We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+* We hold either the mmap_sem lock or a reference on the
+* anon_vma. So no need to call page_lock_anon_vma.
 */
anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-   spin_lock(&anon_vma->lock);
+   down_read(&anon_vma->sem);
 
list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
remove_migration_pte(vma, old, new);
 
-   spin_unlock(&anon_vma->lock);
+   up_read(&anon_vma->sem);
 }
 
 /*
@@ -630,7 +631,7 @@ static int unmap_and_move(new_page_t get
int rc = 0;
int *result = NULL;
struct page *newpage = get_new_page(page, private, result);
-   int rcu_locked = 0;
+   struct anon_vma *anon_vma = NULL;
int charge = 0;
 
if (!newpage)
@@ -654,16 +655,14 @@ static int unmap_and_move(new_page_t get
}
/*
* By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
-* we cannot notice that anon_vma is freed while we migrates a page.
+* we cannot notice that anon_vma is freed while we migrate a page.
 * This rcu_read_lock() delays freeing anon_vma pointer until the end
 * of migration. File cache pages are no problem because of page_lock()
 * File Caches may use write_page() or lock_page

[kvm-devel] [PATCH 07 of 11] i_mmap_rwsem

2008-05-07 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1210115135 -7200
# Node ID 58f716ad4d067afb6bdd1b5f7042e19d854aae0d
# Parent  0621238970155f8ff2d60ca4996dcdd470f9c6ce
i_mmap_rwsem

The conversion to a rwsem allows notifier callbacks during rmap traversal
for files. A rw style lock also allows concurrent walking of the
reverse map so that multiple processors can expire pages in the same memory
area of the same process. So it increases the potential concurrency.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/Documentation/vm/locking b/Documentation/vm/locking
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@ expand_stack(), it is hard to come up wi
 expand_stack(), it is hard to come up with a destructive scenario without 
 having the vmlist protection in this case.
 
-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_sem and the kmem cache
 c_spinlock spinlocks.  This is okay, since the kmem code asks for pages after
 dropping c_spinlock.  The page_table_lock also nests with pagecache_lock and
 pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str
if (!vma_shareable(vma, addr))
return;
 
-   spin_lock(&mapping->i_mmap_lock);
+   down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str
put_page(virt_to_page(spte));
spin_unlock(&mm->page_table_lock);
 out:
-   spin_unlock(&mapping->i_mmap_lock);
+   up_read(&mapping->i_mmap_sem);
 }
 
 /*
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -454,10 +454,10 @@ static int hugetlb_vmtruncate(struct ino
pgoff = offset >> PAGE_SHIFT;
 
i_size_write(inode, offset);
-   spin_lock(&mapping->i_mmap_lock);
+   down_read(&mapping->i_mmap_sem);
if (!prio_tree_empty(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-   spin_unlock(&mapping->i_mmap_lock);
+   up_read(&mapping->i_mmap_sem);
truncate_hugepages(inode, offset);
return 0;
 }
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode
INIT_LIST_HEAD(&inode->i_devices);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
rwlock_init(&inode->i_data.tree_lock);
-   spin_lock_init(&inode->i_data.i_mmap_lock);
+   init_rwsem(&inode->i_data.i_mmap_sem);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -502,7 +502,7 @@ struct address_space {
unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-   spinlock_t		i_mmap_lock;	/* protect tree, count, list */
+   struct rw_semaphore	i_mmap_sem;	/* protect tree, count, list */
unsigned int		truncate_count;	/* Cover race condition with truncate */
unsigned long		nrpages;	/* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -735,7 +735,7 @@ struct zap_details {
struct address_space *check_mapping;	/* Check page->mapping if set */
pgoff_t first_index;			/* Lowest page->index to unmap */
pgoff_t last_index;			/* Highest page->index to unmap */
-   spinlock_t *i_mmap_lock;	/* For unmap_mapping_range: */
+   struct rw_semaphore *i_mmap_sem;	/* For unmap_mapping_range: */
unsigned long truncate_count;   /* Compare vm_truncate_count */
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -297,12 +297,12 @@ static int dup_mmap(struct mm_struct *mm
atomic_dec(&inode->i_writecount);
 
/* insert tmp into the share list, just after mpnt */
-   spin_lock(&file->f_mapping->i_mmap_lock);
+   down_write(&file->f_mapping->i_mmap_sem);
tmp->vm_truncate_count = mpnt

[kvm-devel] [PATCH 01 of 11] mmu-notifier-core

2008-05-07 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1210096013 -7200
# Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c
# Parent  5026689a3bc323a26d33ad882c34c4c9c9a3ecd8
mmu-notifier-core

With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean secondary pte. In the GRU case
there's no actual secondary pte and there's only a secondary tlb,
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case for KVM, where a
secondary tlb miss will walk sptes in hardware and will refill the
secondary tlb transparently to software if the corresponding spte is
present). The same way zap_page_range has to invalidate the pte before
freeing the page, the spte (and secondary tlb) must also be
invalidated before any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte because they're part of the guest working set. Furthermore a spte
unmap event can immediately lead to a page being freed when the pin is
released (so requiring the same complex and relatively slow tlb_gather
smp-safe logic we have in zap_page_range, which can be avoided
completely if the spte unmap event doesn't require an unpin of the
page previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and
know when the VM is swapping or freeing or doing anything on the
primary MMU so that the secondary MMU code can drop sptes before the
pages are freed, avoiding all page pinning and allowing 100% reliable
swapping of guest physical address space. Furthermore it spares the
code that tears down the secondary MMU mappings from implementing a
tlb_gather-like logic as in zap_page_range, which would require many
IPIs to flush other cpu tlbs for each fixed number of sptes unmapped.

To give an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings
will be invalidated, and the next secondary-mmu page fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an
updated spte or secondary-tlb mapping on the copied page. Or it will
set up a readonly spte or readonly tlb mapping if it's a guest read,
i.e. if it calls get_user_pages with write=0. This is just an example.

This allows mapping any page pointed to by any pte (and in turn visible
in the primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU,
or a full MMU with both sptes and a secondary tlb like the
shadow-pagetable layer with kvm), or into a remote DMA in software like
XPMEM (hence the need to schedule in XPMEM code to send the invalidate
to the remote node, while there is no need to schedule in kvm/gru as
it's an immediate event, like invalidating a primary-mmu pte).

At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for
   list_del_rcu too)

2) mm_lock() to register the mmu notifier when the whole VM isn't
   doing anything with mm. This allows mmu notifier users to keep
   track of whether the VM is in the middle of the
   invalidate_range_begin/end critical section with an atomic counter
   increased in range_begin and decreased in range_end. No secondary
   MMU page fault is allowed to map any spte or secondary tlb
   reference while the VM is in the middle of range_begin/end, as any
   page returned by get_user_pages in that critical section could
   later immediately be freed without any further ->invalidate_page
   notification (invalidate_range_begin/end works on ranges and
   ->invalidate_page isn't called immediately before freeing the
   page). To stop all page freeing and pagetable overwrites the
   mmap_sem must be taken in write mode and all other
   anon_vma/i_mmap locks must be taken in virtual address order. The
   order is critical so that mm_lock(mm1) and mm_lock(mm2) running
   concurrently cannot trigger lock inversion deadlocks.

3) It'd be a waste to add branches in the VM if nobody could possibly
   run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be
   enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take
   advantage of mmu notifiers, but this already allows compiling a
   KVM external module against a kernel with mmu notifiers enabled, and
   from the next pull from kvm.git we'll start using them. And
   GRU/XPMEM will also be able to continue their development by enabling
   KVM=m in their config, until they submit all GRU

Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-05-07 Thread Andrea Arcangeli
On Tue, Apr 29, 2008 at 06:03:40PM +0200, Andrea Arcangeli wrote:
 Christoph if you've interest in evolving anon-vma-sem and i_mmap_sem
 yourself in this direction, you're very welcome to go ahead while I

In case you didn't notice this already, for a further explanation of
why semaphores run slower for small critical sections and why the
conversion from spinlock to rwsem should happen under a config option,
see the "AIM7 40% regression with 2.6.26-rc1" thread.



[kvm-devel] [PATCH 00 of 11] mmu notifier #v16

2008-05-07 Thread Andrea Arcangeli
Hello,

this is the last update of the mmu notifier patch.

Jack asked for a __mmu_notifier_register that can be called under mmap_sem
held in write mode.

Here is an update with that change, plus allowing ->release not to be
implemented (a two-liner change to mmu_notifier.c).

The entire diff between v15 and v16 of mmu-notifier-core was posted in a
separate email.



Re: [kvm-devel] [PATCH 02 of 11] get_task_mm

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 10:59:48AM -0500, Robin Holt wrote:
 You can drop this patch.
 
 This turned out to be a race in xpmem.  It appeared as if it were a
 race in get_task_mm, but it really is not.  The current->mm field is
 cleared under the task_lock and the task_lock is grabbed by get_task_mm.

100% agreed, I'll nuke it as it seems really a noop.

 I have been testing your v15 version without this patch and have not
 encountered the problem again (now that I fixed my xpmem race).

Great. About your other deadlock I'm curious if my deadlock fix for
the i_mmap_sem patch helped. That was crashing kvm with a VM 2G in the
swap + a swaphog allocating and freeing another 2G of swap in a
loop. I couldn't reproduce any other problem with KVM since I fixed
that bit regardless if I apply only mmu-notifier-core (2.6.26 version)
or the full patchset (post 2.6.26).



Re: [kvm-devel] [PATCH 03 of 11] invalidate_page outside PT lock

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 01:39:43PM -0400, Rik van Riel wrote:
 Would it be an idea to merge them into one, so the first patch
 introduces the right conventions directly?

The only reason this isn't merged into one is that this requires
non-obvious (though not difficult) changes to the core VM code. I wanted to
keep an obviously safe approach for 2.6.26. The other conventions are
only needed by XPMEM, and XPMEM can't work without all the other patches anyway.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 01:56:23PM -0700, Linus Torvalds wrote:
 This also looks very debatable indeed. The only performance numbers quoted 
 are:
 
This results in f.e. the Aim9 brk performance test to got down by 10-15%.
 
 which just seems like a total disaster.
 
 The whole series looks bad, in fact. Lack of authorship, bad single-line 

Glad you agree. Note that the fact that the whole series looks bad is
_exactly_ why I couldn't let Christoph keep going with
mmu-notifier-core at the very end of his patchset. I had to move it at
the top to have a chance to get the KVM and GRU requirements merged
in 2.6.26.

I think the spinlock->rwsem conversion is ok under a config option; as
you can see I complained myself about various of those patches and I'll
take care they're in a mergeable state the moment I submit them. What
XPMEM requires are different semantics for the methods, and we never
had to do any blocking I/O during vmtruncate before; now we have to.
And I don't see a problem in making the conversion from
spinlock to rwsem only if CONFIG_XPMEM=y, as I doubt XPMEM works on
anything but ia64.

Please ignore all patches but mmu-notifier-core. I regularly forward
_only_ mmu-notifier-core to Andrew, that's the only one that is in
merge-ready status, everything else is just so XPMEM can test and we
can keep discussing it to bring it in a mergeable state like
mmu-notifier-core already is.



Re: [kvm-devel] [PATCH 01 of 11] mmu-notifier-core

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 01:30:39PM -0700, Linus Torvalds wrote:
 
 
 On Wed, 7 May 2008, Andrew Morton wrote:
  
  The patch looks OK to me.
 
 As far as I can tell, authorship has been destroyed by at least two of the 
 patches (ie Christoph seems to be the author, but Andrea seems to have 
 dropped that fact).

I can't follow this, please be more specific.

About the patches I merged from Christoph, I didn't touch them at all
(except for fixing a kernel crashing bug in them plus some reject
fix). Initially I didn't even add a signed-off-by: andrea, and I only
had the signed-off-by: christoph. But then he said I had to add my
signed-off-by too, while I thought at most an acked-by was
required. So if I got any attribution on Christoph work it's only
because he explicitly requested it as it was passing through my
maintenance line. In any case, all patches except mmu-notifier-core
are irrelevant in this context and I'm entirely fine to give Christoph
the whole attribution of the whole patchset including the whole
mmu-notifier-core where most of the code is mine.

We had many discussions with Christoph, Robin and Jack, but I can
assure you nobody had a single problem with regard to attribution.

About all patches except mmu-notifier-core: Christoph, Robin and
everyone (especially myself) agrees those patches can't yet be merged
in 2.6.26.

With regard to the post-2.6.26 material, I think adding a config
option to make the change at compile time is ok. And there's no other
way to deal with it in a clean way, as vmtruncate has to tear down
pagetables, and if the i_mmap_lock is a spinlock there's no way to
notify secondary mmus about it, if the ->invalidate_range_start method
has to allocate an skb, send it through the network and wait for I/O
completion with schedule().

 Yeah, too late and no upside.

No upside for all the people setting CONFIG_KVM=n, true, but no downside
for them either, and that's the important fact!

And for all the people setting CONFIG_KVM!=n, I should provide some
background here. KVM MM development is halted without this, that
includes: paging, ballooning, tlb flushing at large, pci-passthrough
removing page pin as a whole, etc...

Everyone on kvm-devel talks about mmu-notifiers; check the last VT-d
patch from Intel where Antony (IBM/qemu/kvm) wonders how to handle
things without mmu notifiers (mlock whatever).

Rusty agreed we had to get mmu notifiers in 2.6.26 so much that he has
gone as far as writing his own ultrasimple mmu notifier
implementation, unfortunately too simple as invalidate_range_start was
missing and we can't remove the page pinning and avoid doing
spte=invalid;tlbflush;unpin for every group of sptes released without
it. And without mm_lock invalidate_range_start can't be implemented in
a generic way (to work for GRU/XPMEM too).

 That locking code is also too ugly to live, at least without some 
 serious arguments for why it has to be done that way. Sorting the locks? 
 In a vmalloc'ed area?  And calling this something innocuous like 
 mm_lock()? Hell no. 

That's only invoked in mmu_notifier_register; mm_lock is explicitly
documented as a heavyweight function. In the KVM case it's only called
when a VM is created, and that's an irrelevant cpu cost compared to the
time it takes the OS to boot in the VM... (especially without real mode
emulation, with direct NPT-like secondary-mmu paging).

mm_lock solved the fundamental race in the range_start/end
invalidation model (that will allow GRU to do a single tlb flush for
the whole range that is going to be freed by
zap_page_range/unmap_vmas/whatever). Christoph merged mm_lock in his
EMM versions of mmu notifiers, moments after I released it, I think he
wouldn't have done it if there was a better way.
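
For reference, the registration path that mm_lock serializes looks roughly
like this. It is a sketch of the description above, not the literal patch
code; in particular mmu_notifier_mm->list is my shorthand for wherever the
notifier hlist actually lives:

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

static int mmu_notifier_register_sketch(struct mmu_notifier *mn,
					struct mm_struct *mm)
{
	struct mm_lock_data data;
	int ret;

	down_write(&mm->mmap_sem);
	/* takes every anon_vma/i_mmap lock of this mm; may fail on vmalloc */
	ret = mm_lock(mm, &data);
	if (!ret) {
		/*
		 * No invalidate_range_begin/end can be running here, so the
		 * new notifier is guaranteed to see a start for every end.
		 */
		hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);
		mm_unlock(mm, &data);
	}
	up_write(&mm->mmap_sem);
	return ret;
}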

 That code needs some serious re-thinking.

Even if you're totally right: with Nick's mmu notifiers, Rusty's mmu
notifiers, my original mmu notifiers, Christoph's first version of my
mmu notifiers, my new mmu notifiers, Christoph's EMM version
of my new mmu notifiers, my latest mmu notifiers, all the people
making suggestions and testing the code and needing the code badly,
and further patches waiting for inclusion during 2.6.27 in this area, it
must be obvious to everyone that there's zero chance this code won't
evolve over time to perfection; but we can't wait for it to be perfect
before starting to use it, or we're screwed. Even if it's entirely broken
this will allow kvm development to continue and then we'll fix it (but
don't worry, it works great at runtime and there are no race
conditions; Jack and Robin are also using it with zero problems with
GRU and XPMEM, just in case the KVM testing going great isn't enough).

Furthermore the API has been frozen for months now; everyone agrees with
all the fundamental blocks in the mmu-notifier-core patch (to be complete,
Christoph would like to replace invalidate_page with an
invalidate_range_start/end pair, but that's a minor detail).

And most important, we need something in now, regardless of which
API. We can 

Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote:
  had to do any blocking I/O during vmtruncate before, now we have to.
 
 I really suspect we don't really have to, and that it would be better to 
 just fix the code that does that.

I'll let you discuss it with Christoph and Robin. The moment I heard
the schedule inside ->invalidate_page() requirement I reacted the same
way you did. But I don't see any other real solution for XPMEM other
than spin-looping, halting the scheduler for ages, while the ack is
received from the network device.

But mm_lock is required even without XPMEM. And srcu is also required
without XPMEM to allow ->release to schedule (however downgrading srcu
to rcu will result in a very small patch, srcu and rcu are about the
same with a kernel supporting preempt=y like 2.6.26).

 I literally think that mm_lock() is an unbelievable piece of utter and 
 horrible CRAP.
 
 There's simply no excuse for code like that.

I think it's a great smp scalability optimization over the global lock
you're proposing below.

 No, the simple solution is to just make up a whole new upper-level lock, 
 and get that lock *first*. You can then take all the multiple locks at a 
 lower level in any order you damn well please. 

Unfortunately the lock you're talking about would be:

static spinlock_t global_lock = ...

There's no way to make it more granular.

So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock
we'd need to take that extremely wide spinlock first (and even worse,
later it would become a rwsem when XPMEM is selected, making the VM
even slower than it already becomes when XPMEM support is selected at
compile time).

 And yes, it's one more lock, and yes, it serializes stuff, but:
 
  - that code had better not be critical anyway, because if it was, then 
the whole vmalloc+sort+lock+vunmap sh*t was wrong _anyway_

mmu_notifier_register can take ages. No problem.

  - parallelism is overrated: it doesn't matter one effing _whit_ if 
something is a hundred times more parallel, if it's also a hundred 
times *SLOWER*.

mmu_notifier_register is fine to be hundred times slower (preempt-rt
will turn all locks in spinlocks so no problem).

 And here's an admission that I lied: it wasn't *all* clearly crap. I did 
 like one part, namely list_del_init_rcu(), but that one should have been 
 in a separate patch. I'll happily apply that one.

Sure, I'll split it from the rest if the mmu-notifier-core isn't merged.

My objective has been:

1) add zero overhead to the VM before anybody starts a VM with kvm and
   still zero overhead for all other tasks except the task where the
   VM runs.  The only exception is the unlikely(!mm->mmu_notifier_mm)
   check, which is optimized away too when CONFIG_KVM=n. And even for
   that check my invalidate_page reduces the number of branches to the
   absolute minimum possible (see the sketch after this list).

2) avoid any new cacheline collision in the fast paths to allow numa
   systems not to nearly-crash (mm->mmu_notifier_mm will be shared and
   never written, except during the first mmu_notifier_register)

3) avoid any risk of introducing regressions in 2.6.26 (the patch must
   be obviously safe). Even if mm_lock were a bad idea like you say,
   it's orders of magnitude safer, even if entirely broken, than
   messing with the VM core locking in 2.6.26.
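
For reference, the fast-path hook mentioned in point 1 has roughly this
shape (a sketch of the inline-wrapper style, not a verbatim quote of the
patch): the hot path pays only the single branch on mm->mmu_notifier_mm,
and with MMU_NOTIFIER=n the wrapper compiles away entirely.

/* Sketch of the hook shape; the real header also provides empty stubs
 * when CONFIG_MMU_NOTIFIER is not set. */
static inline int mm_has_notifiers(struct mm_struct *mm)
{
	return unlikely(mm->mmu_notifier_mm);
}

static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
						unsigned long address)
{
	/* single predictable branch in the VM fast path */
	if (mm_has_notifiers(mm))
		__mmu_notifier_invalidate_page(mm, address);
}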

mm_lock (or whatever name you like to give it, I admit mm_lock may not
be worrisome enough for people not to get the idea of calling it in a
fast path) is going to be the real deal for the long term to allow
mmu_notifier_register to serialize against
invalidate_range_start/end. If I fail in 2.6.26 I'll offer
maintainership to Christoph as promised, and you'll find him pushing
for mm_lock to be merged (as XPMEM/GRU aren't technologies running on
cellphones where your global wide spinlock is optimized away at
compile time, and he also has to deal with XPMEM where such a spinlock
would need to become a rwsem as the anon_vma->sem has to be taken
after it), but let's assume you're entirely right here that mm_lock is
going to be dropped and there's a better way: it's still a fine
solution for 2.6.26.

And if you prefer I can move the whole mm_lock() from mmap.c/mm.h to
mmu_notifier.[ch] so you don't get any pollution in the core VM; mm_lock
will then be invisible to everything except callers of
mmu_notifier_register(), and it will be trivial to remove later if you
really want to add a global spinlock, as there's no way to be more
granular than a _global_ numa-wide spinlock taken before any
i_mmap_lock/anon_vma->lock without my mm_lock.


Re: [kvm-devel] [PATCH 01 of 11] mmu-notifier-core

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 03:11:10PM -0700, Linus Torvalds wrote:
 
 
 On Wed, 7 May 2008, Andrea Arcangeli wrote:
 
   As far as I can tell, authorship has been destroyed by at least two of 
   the 
   patches (ie Christoph seems to be the author, but Andrea seems to have 
   dropped that fact).
  
  I can't follow this, please be more specific.
 
 The patches were sent to lkml without *any* indication that you weren't 
 actually the author.
 
 So if Andrew had merged them, they would have been merged as yours.

I rechecked and I guarantee that the patches where Christoph isn't
listed were developed by myself and he didn't write a single line of
them. In any case I expect Christoph to review (he's CCed) and to
point me to any attribution error. The only mistake I made once in that
area was to give too _little_ attribution to myself and he asked me to
add myself in the signed-off, so I added myself by Christoph's own
request, but be sure I didn't remove him!



Re: [kvm-devel] [PATCH 01 of 11] mmu-notifier-core

2008-05-07 Thread Andrea Arcangeli
On Thu, May 08, 2008 at 12:27:58AM +0200, Andrea Arcangeli wrote:
 I rechecked and I guarantee that the patches where Christoph isn't
 listed were developed by myself and he didn't write a single line of
 them. In any case I expect Christoph to review (he's CCed) and to
 point me to any attribution error. The only mistake I made once in that
 area was to give too _little_ attribution to myself and he asked me to
 add myself in the signed-off, so I added myself by Christoph's own
 request, but be sure I didn't remove him!

By PM (I guess he's scared to post to this thread ;) Chris is telling
me that what you mean perhaps is that I should add a From: Christoph in
the body of the email when the first signed-off-by is from Christoph,
to indicate the first signoff was by him and that the patch in turn was
started by him. I thought the order of the signoffs was enough, but if
that From: was mandatory and missing, then any such error obviously
wasn't intentional, especially given I only left a signed-off-by:
christoph on his patches until he asked me to add my signoff
too. Correcting it is trivial given I carefully ordered the signoffs so
that the author is at the top of the signoff list.

At least for mmu-notifier-core given I obviously am the original
author of that code, I hope the From: of the email was enough even if
an additional From: andrea was missing in the body.

Also you can be sure that Christoph and especially Robin (XPMEM) will
be more than happy if all patches with Christoph at the top of the
signed-off-by are merged in 2.6.26 despite there being no From:
christoph at the top of the body ;). So I don't see a big deal here...



Re: [kvm-devel] [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 03:31:08PM -0700, Roland Dreier wrote:
 I think the point you're missing is that any patches written by
 Christoph need a line like
 
 From: Christoph Lameter [EMAIL PROTECTED]
 
 at the top of the body so that Christoph becomes the author when it is
 committed into git.  The Signed-off-by: line needs to be preserved too
 of course, but it is not sufficient by itself.

Ok so I see the problem Linus is referring to now (I received the hint
by PM too), I thought the order of the signed-off-by was relevant, it
clearly isn't or we're wasting space ;)



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 03:31:03PM -0700, Andrew Morton wrote:
 Nope.  We only need to take the global lock before taking *two or more* of
 the per-vma locks.
 
 I really wish I'd thought of that.

I don't see how you can avoid taking the system-wide-global lock
before every single anon_vma->lock/i_mmap_lock out there without
mm_lock.

Please note, we can't allow a thread to be in the middle of
zap_page_range while mmu_notifier_register runs.

vmtruncate takes one single lock, the i_mmap_lock of the inode. Not
more than one lock, and we'd still have to take the global-system-wide
lock _before_ this single i_mmap_lock and no other lock at all.

Please elaborate, thanks!



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 03:44:24PM -0700, Linus Torvalds wrote:
 
 
 On Thu, 8 May 2008, Andrea Arcangeli wrote:
  
  Unfortunately the lock you're talking about would be:
  
  static spinlock_t global_lock = ...
  
  There's no way to make it more granular.
 
 Right. So what? 
 
 It's still about a million times faster than what the code does now.

mmu_notifier_register only runs when windows or linux or macosx
boots. Who could ever care about the msec spent in mm_lock compared to
the time it takes linux to boot?

What you're proposing is to slow down AIM and certain benchmarks 20% or
more for all users, just to save at most 1msec when starting a VM.

 Rewrite the code, or not. I don't care. I'll very happily not merge crap 
 for the rest of my life.

If you want the global lock I'll do it, no problem, I just think it's
an obviously inferior solution for 99% of users out there (including
kvm users that will also have to take that lock while kvm userland runs).

In my view the most we should do in this area is to reduce further the
max number of locks to take if max_map_count already isn't enough.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
To remove mm_lock without adding a horrible system-wide lock before
every i_mmap_lock etc., we'd have to remove
invalidate_range_begin/end. Then we can return to the older approach of
doing only invalidate_page and serializing it with the PT lock against
get_user_pages. That works fine for KVM, but GRU will have to flush the
tlb every time we drop the PT lock, that means once per each 512
ptes on x86-64 etc., instead of a single time for the whole range
regardless of how large the range is.
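
To illustrate the trade-off, a minimal sketch of what "one flush per 512
ptes, under the PT lock" would mean, assuming a hypothetical
mmu_notifier_invalidate_pages() hook (this is not merged code and the real
zap_pte_range details are elided):

static unsigned long sketch_zap_pte_range(struct mm_struct *mm, pmd_t *pmd,
					  unsigned long addr, unsigned long end)
{
	unsigned long start = addr;
	spinlock_t *ptl;
	pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

	do {
		/* ... clear the pte and free the page, as zap_pte_range does ... */
	} while (pte++, addr += PAGE_SIZE, addr != end);

	/*
	 * One secondary-mmu invalidate per pte page (512 ptes on x86-64),
	 * issued before the PT lock is dropped; get_user_pages serializes
	 * on the same PT lock.  mmu_notifier_invalidate_pages() is the
	 * hypothetical hook discussed in this thread, not a merged API.
	 */
	mmu_notifier_invalidate_pages(mm, start, addr);
	pte_unmap_unlock(pte - 1, ptl);
	return addr;
}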



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
Hi Andrew,

On Wed, May 07, 2008 at 03:59:14PM -0700, Andrew Morton wrote:
    CPU0:                   CPU1:
 
    spin_lock(global_lock)
    spin_lock(a->lock);     spin_lock(b->lock);
                            <== mmu_notifier_register()
    spin_lock(b->lock);     spin_unlock(b->lock);
    spin_lock(a->lock);
    spin_unlock(a->lock);
 
 also OK.

But the problem is that we've to stop the critical section in the
place I marked with "<==" while mmu_notifier_register
runs. Otherwise the driver calling mmu_notifier_register won't know if
it's safe to start establishing secondary sptes/tlbs. If the driver
establishes sptes/tlbs with get_user_pages/follow_page, the page
could be freed immediately afterwards when zap_page_range starts.

So if CPU1 doesn't take the global_lock before proceeding in
zap_page_range (inside vmtruncate, the i_mmap_lock that is represented
as b->lock above) we're in trouble.

What we can do is to replace mm_lock with a spin_lock(global_lock)
only if all the places that take i_mmap_lock take the global lock
first, and that hurts the scalability of the fast paths that are
performance critical like vmtruncate and anon_vma->lock. Perhaps
they're not so performance critical, but surely they're much more
performance critical than mmu_notifier_register ;).
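
A rough sketch of what that would mean for the fast paths, purely to
illustrate the cost being argued about (hypothetical names, not proposed
code):

/* Every vmtruncate/rmap path would have to nest the system-wide lock
 * outside the per-mapping lock it actually needs, serializing all of
 * them across all cpus. */
static DEFINE_SPINLOCK(global_mmu_lock);

static void sketch_truncate_lock(struct address_space *mapping)
{
	spin_lock(&global_mmu_lock);		/* numa-wide serialization */
	spin_lock(&mapping->i_mmap_lock);	/* the lock truncate really needs */
}

static void sketch_truncate_unlock(struct address_space *mapping)
{
	spin_unlock(&mapping->i_mmap_lock);
	spin_unlock(&global_mmu_lock);
}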

The idea of polluting various scalable paths like the truncate()
syscall in the VM with a global spinlock frightens me; I'd rather
return to invalidate_page() inside the PT lock, removing both
invalidate_range_start/end. Then all serialization against the mmu
notifiers will be provided by the PT lock that the secondary mmu page
fault also has to take in get_user_pages (or follow_page). In any case
that is a better solution that won't slow down the VM when
MMU_NOTIFIER=y, even if it's a bit slower for GRU; for KVM performance
is about the same with or without invalidate_range_start/end. I didn't
think anybody could care about how long mmu_notifier_register takes
until it returns, compared to all the heavyweight operations that
happen when starting a VM (not only in the kernel but in the guest too).

In fact, if it's security we worry about here, we can put a cap on the
_time_ mmu_notifier_register can take before it fails, and we fail to
start a VM if it takes more than 5sec; that's still fine as the
failure could happen for other reasons too, like a vmalloc shortage,
and we already handle it just fine. This 5sec delay can't possibly
happen in practice anyway in the only interesting scenario, just like
the vmalloc shortage. This is obviously a superior solution to
polluting the VM with a useless global spinlock that will destroy
truncate/AIM on numa.

Anyway Christoph, I uploaded my last version here:


http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v16

(applies and runs fine on 26-rc1)

You're more than welcome to takeover from it, I kind of feel my time
now may be better spent to emulate the mmu-notifier-core with kprobes.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Thu, May 08, 2008 at 09:28:38AM +1000, Benjamin Herrenschmidt wrote:
 
 On Thu, 2008-05-08 at 00:44 +0200, Andrea Arcangeli wrote:
  
  Please note, we can't allow a thread to be in the middle of
  zap_page_range while mmu_notifier_register runs.
 
 You said yourself that mmu_notifier_register can be as slow as you
 want ... what about you use stop_machine for it ? I'm not even joking
 here :-)

We can put a cap on the time plus a cap on the number of vmas. It's
not important if it fails; in the only useful case we know of, it won't
be slow at all. The failure can happen because the time cap or the vma
cap triggers or because there's a vmalloc shortage. We handle the
failure in userland of course. There are zillions of allocations needed
anyway and any one of them can fail, so this isn't a new failure path,
it's the same failure path that always existed before mmu_notifiers
existed.

I can't possibly see how adding a new global wide lock that forces all
truncates to be serialized against each other, practically eliminating
the need of the i_mmap_lock, could be superior to an approach that
doesn't cause any overhead to the VM at all and only requires kvm to
pay an additional cost at startup.

Furthermore the only reason I had to implement mm_lock was to fix the
invalidate_range_start/end model; if we go with only invalidate_page
and invalidate_pages called inside the PT lock and we use the PT lock
to serialize, we don't need mm_lock anymore and no new lock from the
VM either. I tried to push for that, but everyone else wanted
invalidate_range_start/end. I only did the only possible thing to do:
make invalidate_range_start safe, to make everyone happy without
slowing down the VM.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 06:02:49PM -0700, Linus Torvalds wrote:
 You replace mm_lock() with the sequence that Andrew gave you (and I 
 described):
 
   spin_lock(global_lock)
   .. get all locks UNORDERED ..
   spin_unlock(global_lock)
 
 and you're now done. You have your mm_lock() (which still needs to be 
 renamed - it should be a mmu_notifier_lock() or something like that), 
 but you don't need the insane sorting. At most you apparently need a way 
 to recognize duplicates (so that you don't deadlock on yourself), which 
 looks like a simple bit-per-vma.
 
 The global lock doesn't protect any data structures itself - it just 
 protects two of these mm_lock() functions from ABBA'ing on each other!

I thought the thing to remove was the "get all locks". I didn't
realize the major problem was only the sorting of the array.

I'll add the global lock, it's worth it as it drops the worst case
number of steps by log(65536) times. Furthermore two concurrent
mmu_notifier_lock calls will surely run faster as it'll decrease the
cacheline collisions. Since you ask to call it mmu_notifier_lock I'll
also move it to mmu_notifier.[ch] as a consequence.
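
A minimal sketch of that direction, assuming the existing mm_lock/mm_unlock
keep doing the unordered per-vma lock taking (hypothetical code, not the
final patch):

/* A global mutex only serializes concurrent callers of the renamed
 * mmu_notifier_lock() against each other, so two of them can never
 * ABBA on the per-vma spinlocks they take in arbitrary order. */
static DEFINE_MUTEX(mmu_notifier_mm_lock_mutex);

static int mmu_notifier_lock(struct mm_struct *mm, struct mm_lock_data *data)
{
	int ret;

	mutex_lock(&mmu_notifier_mm_lock_mutex);
	ret = mm_lock(mm, data);	/* existing code: collect and take the locks */
	if (ret)
		mutex_unlock(&mmu_notifier_mm_lock_mutex);
	return ret;
}

static void mmu_notifier_unlock(struct mm_struct *mm, struct mm_lock_data *data)
{
	mm_unlock(mm, data);		/* release the per-vma locks first */
	mutex_unlock(&mmu_notifier_mm_lock_mutex);
}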



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
Sorry for not having completely answered this. I initially thought
stop_machine could work when you mentioned it, but I don't think it
can, even after removing the xpmem block-inside-mmu-notifier-method
requirement.

For stop_machine to solve this (besides being slower and potentially
not any safer, as running stop_machine in a loop isn't nice), we'd need
to prevent preemption in between invalidate_range_start/end.

I think there are two ways:

1) add global lock around mm_lock to remove the sorting

2) remove invalidate_range_start/end, nuke mm_lock as a consequence of
   it, and replace all three with invalidate_pages issued inside the
   PT lock, one invalidation for each 512 pte_t modified, so
   serialization against get_user_pages becomes trivial; but this will
   not be ok at all for SGI as it increases their invalidation
   frequency a lot

For KVM both ways are almost the same.

I'll implement 1 now then we'll see...



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 06:39:48PM -0700, Linus Torvalds wrote:
 
 
 On Wed, 7 May 2008, Christoph Lameter wrote:
  
   (That said, we're not running out of vm flags yet, and if we were, we 
   could just add another word. We're already wasting that space right now 
   on 
   64-bit by calling it unsigned long).
  
  We sure have enough flags.
 
 Oh, btw, I was wrong - we wouldn't want to mark the vma's (they are 
 unique), we need to mark the address spaces/anonvma's. So the flag would 
 need to be in the struct anon_vma (and struct address_space), not in the 
 vma itself. My bad. So the flag wouldn't be one of the VM_xyzzy flags, and 
 would require adding a new field to struct anon_vma()
 
 And related to that brain-fart of mine, that obviously also means that 
 yes, the locking has to be stronger than mm-mmap_sem held for writing, 
 so yeah, it would have be a separate global spinlock (or perhaps a 
 blocking lock if you have some reason to protect anything else with this 

So because the bitflag can't prevent taking the same lock twice on two
different vmas in the same mm, we still can't remove the sorting, and
the global lock won't buy much other than reducing the collisions. I
can add it though.

I think it's more interesting to put a cap on the number of vmas at
min(1024, max_map_count). The sort time on an 8k array is effectively
constant. kvm runs with 127 vmas allocated...
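
A sketch of what such a cap could look like (hypothetical helper, assuming
the existing sysctl_max_map_count; userland already copes with registration
failure):

/* Refuse registration when the mm has too many vmas for mm_lock to
 * stay cheap.  Purely illustrative, not merged code. */
static int mm_lock_cap_exceeded(struct mm_struct *mm)
{
	int cap = min(1024, sysctl_max_map_count);

	return mm->map_count > cap;	/* caller fails with -ENOMEM/-EAGAIN */
}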



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 06:57:05PM -0700, Linus Torvalds wrote:
 Take five minutes. Take a deep breadth. And *think* about actually reading 
 what I wrote.
 
 The bitflag *can* prevent taking the same lock twice. It just needs to be 
 in the right place.

It's not that I didn't read it, but to do that I'd have to grow every
anon_vma by 8 bytes. I thought it was implicit that the conclusion of
your email was that it couldn't possibly make sense to grow the size of
each anon_vma by 33% when nobody has loaded the kvm or gru or xpmem
kernel modules. It surely isn't my preferred solution, while capping
the number of vmas to 1024 means sort() will make around 10240 steps;
Matt can tell the exact number. The big cost shouldn't be in
sort. Even 512 vmas will be more than enough for us in fact. Note that
I've a cond_resched in the sort compare function and I can re-add the
signal_pending check. I had the signal_pending check in the original
version that didn't use sort() but was doing an inner loop; I thought
signal_pending wasn't necessary after speeding it up with sort(). But
I can add it again, so then we'll only fail to abort inside sort() and
we'll be able to break the loop while taking all the spinlocks, but
with such a small array that can't be an issue, and the result will
surely run faster than stop_machine, with zero ram and cpu overhead for
the VM (besides, stop_machine can't work or we'd need to disable
preemption between invalidate_range_start/end, even after removing the
xpmem schedule-inside-mmu-notifier requirement).
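
For context, the compare callback being discussed has roughly this shape (a
sketch, not a verbatim quote of the patch). lib/sort.c invokes it
O(N*log(N)) times, which is why the cond_resched() sits there and why a
signal_pending() check alone can't break out of sort() itself:

/* Sort the collected spinlock pointers by address so two concurrent
 * mm_lock callers take them in the same order. */
static int mm_lock_cmp(const void *a, const void *b)
{
	cond_resched();		/* keep the O(N*log(N)) pass preemptible */
	if ((unsigned long)*(spinlock_t **)a < (unsigned long)*(spinlock_t **)b)
		return -1;
	else if (a == b)
		return 0;
	else
		return 1;
}

/* used as: sort(locks, nlocks, sizeof(spinlock_t *), mm_lock_cmp, NULL); */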



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 06:12:32PM -0700, Christoph Lameter wrote:
 Andrea's mm_lock could have wider impact. It is the first effective 
 way that I have seen of temporarily holding off reclaim from an address 
 space. It sure is a brute force approach.

The only improvement I can imagine on mm_lock, after changing the name
to global_mm_lock(), is to reestablish the signal_pending check in
the loop that takes the spinlocks and to backoff, and to put the cap at
512 vmas so that the ram wasted on anon_vmas wouldn't save more than
10-100usec at most (plus the vfree, which may be a bigger cost, but
we're ok to pay it and it surely isn't security related).

Then in the long term we need to talk to Matt about returning a
parameter from the sort compare function to break the loop. After that
we remove the 512 vma cap and mm_lock is free to run as long as it
wants, like /dev/urandom; nobody can care less how long it runs before
returning as long as it reacts to signals.

This is the right way if we want to support XPMEM/GRU efficiently and
without introducing unnecessary regressions in the VM fastpaths and VM
footprint.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 08:10:33PM -0700, Christoph Lameter wrote:
 On Thu, 8 May 2008, Andrea Arcangeli wrote:
 
  to the sort function to break the loop. After that we remove the 512
  vma cap and mm_lock is free to run as long as it wants like
  /dev/urandom, nobody can care less how long it will run before
  returning as long as it reacts to signals.
 
 Look Linus has told you what to do. Why not simply do it?

When it looked like we could use vm_flags to remove sort, that looked
an ok optimization, no problem with optimizations, I'm all for
optimizations if they cost nothing to the VM fast paths and VM footprint.

But removing sort isn't worth it if it takes away ram from the VM even
when global_mm_lock will never be called.

sort is like /dev/urandom so after sort is fixed to handle signals
(and I expect Matt will help with this) we'll remove the 512 vmas cap.
In the meantime we can live with the 512 vmas cap that guarantees sort
won't take more than a few dozen usec. Removing sort() is the only
thing that the anon vma bitflag can achieve and it's clearly not worth
it and it would go in the wrong direction (fixing sort to handle
signals is clearly the right direction, if sort is a concern at all).

Adding the global lock around global_mm_lock to avoid one
global_mm_lock to collide against another global_mm_lock is sure ok
with me, if that's still wanted now that it's clear removing sort
isn't worth it, I'm neutral on this.

Christoph please go ahead and add the bitflag to anon-vma yourself if
you want. If something isn't technically right I don't do it no matter
who asks it.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 09:14:45PM -0700, Linus Torvalds wrote:
 IOW, you didn't even look at it, did you?

Actually I looked both at the struct and at the slab alignment just in
case it was changed recently. Now after reading your mail I also
compiled it just in case.

2.6.26-rc1

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
anon_vma             260    576     24  144    1 : tunables  120   60    8 : slabdata      4      4      0
                                    ^^  ^^^

2.6.26-rc1 + below patch

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -27,6 +27,7 @@ struct anon_vma {
 struct anon_vma {
spinlock_t lock;/* Serialize access to vma list */
struct list_head head;  /* List of private related vmas */
+   int flag:1;
 };
 
 #ifdef CONFIG_MMU

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
anon_vma             250    560     32  112    1 : tunables  120   60    8 : slabdata      5      5      0
                                    ^^  ^^^

Not a big deal to grow it 33%, sure, it's so small anyway, but I don't
see the point in growing it. sort() can't be interrupted by signals
yet, and until it can we can cap it to 512 vmas, making the worst case
take a dozen usecs; I fail to see what you have against sort().

Again: if a vma bitflag + global lock could have avoided sort and run
O(N) instead of the current O(N*log(N)) I would have done that
immediately, in fact I was in the process of doing it when you posted
the followup. Nothing personal here, just staying technical. Hope you
too.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Thu, May 08, 2008 at 08:30:20AM +0300, Pekka Enberg wrote:
 On Thu, May 8, 2008 at 8:27 AM, Pekka Enberg [EMAIL PROTECTED] wrote:
  You might want to read carefully what Linus wrote:
 
The one that already has a 4 byte padding thing on x86-64 just after the
spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have 
  two
bytes of padding if we didn't just make the spinlock type unconditionally
32 bits rather than the 16 bits we actually _use_?
 
   So you need to add the flag _after_ ->lock and _before_ ->head
 
 Oh, I should have taken my morning coffee first: before ->lock works
 obviously as well.

Sorry, Linus is right: I didn't realize the "after the spinlock" was
literally after the spinlock; I didn't see the 4 byte padding when I
read the code and put the flag:1 in. If put between ->lock and ->head
it doesn't take more memory on x86-64, as described literally. So the
next step would be to find another place like that in the
address_space. Perhaps after the private_lock using the same trick, or
perhaps the slab alignment won't actually alter the number of slabs per
page regardless.

I leave that to Christoph, he's surely better than me at doing this; I
give it up entirely and I consider my attempt to merge a total failure
and I strongly regret it.

On a side note the anon_vma will change to this when XPMEM support is
compiled in:

 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	atomic_t refcount;	/* vmas on the list */
+	struct rw_semaphore sem;/* Serialize access to vma list */
 	struct list_head head;	/* List of private related vmas */
 };

not sure if it'll grow in size or not after that but let's say it's
not a big deal.



Re: [kvm-devel] [PATCH 01 of 11] mmu-notifier-core

2008-05-06 Thread Andrea Arcangeli
On Mon, May 05, 2008 at 02:46:25PM -0500, Jack Steiner wrote:
 If a task fails to unmap a GRU segment, they still exist at the start of

Yes, this will also happen in case the well behaved task receives
SIGKILL, so you can test it that way too.

 exit. On the ->release callout, I set a flag in the container of my
 mmu_notifier that exit has started. As VMAs are cleaned up, TLB flushes
 are skipped because the flag is set. When the GRU VMA is deleted, I free

GRU TLB flushes aren't skipped because your flag is set but because
__mmu_notifier_release has already executed
list_del_init_rcu(&grunotifier->hlist) before proceeding with
unmap_vmas.

 my structure containing the notifier.

As long as nobody can write through the already established gru tlbs
and nobody can establish new tlbs after exit_mmap runs, you don't
strictly need ->release.

 I _think_ works. Do you see any problems?

You can remove the flag and ->release and ->clear_flush_young (if you
keep clear_flush_young implemented it should return 0). The
synchronize_rcu after mmu_notifier_register can also be dropped thanks
to mm_lock(). gru_drop_mmu_notifier should be careful with current->mm
if you're using an fd and the fd can be passed to a different task
through unix sockets (you should probably fail any operation if
current->mm != gru->mm).

The way I use ->release in KVM is to set the root hpa to -1UL
(invalid) as a debug trap. That's only for debugging because even if
tlb entries and sptes are still established on the secondary mmu they
are only relevant when the cpu jumps to guest mode and that can never
happen again after exit_mmap is started.
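
As an illustration only (the kvm field names below are assumptions from
memory, not a quote of the real code), the debug-trap style ->release looks
roughly like this:

/* Sketch: poison the shadow root in every vcpu so any attempt to enter
 * guest mode after exit_mmap would trip immediately.  vcpus[] and
 * arch.mmu.root_hpa are assumed field names, possibly inexact. */
static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
				     struct mm_struct *mm)
{
	struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);
	int i;

	for (i = 0; i < KVM_MAX_VCPUS; i++)
		if (kvm->vcpus[i])
			kvm->vcpus[i]->arch.mmu.root_hpa = -1ull; /* debug trap */
}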

 I should also mention that I have an open-coded function that possibly
 belongs in mmu_notifier.c. A user is allowed to have multiple GRU segments.
 Each GRU has a couple of data structures linked to the VMA. All, however,
 need to share the same notifier. I currently open code a function that
 scans the notifier list to determine if a GRU notifier already exists.
 If it does, I update a refcnt  use it. Otherwise, I register a new
 one. All of this is protected by the mmap_sem.
 
 Just in case I mangled the above description, I'll attach a copy of the GRU 
 mmuops
 code.

Well, that function needs fixing w.r.t. srcu. Are you sure you want to
search for mn->ops == &gru_mmuops and not for mn == gmn?  And if you
search for mn, why can't you keep track of the mn being registered or
unregistered outside of the mmu_notifier layer? Set a bitflag in the
container after mmu_notifier_register returns and clear it after
_unregister returns. I doubt saving one bitflag is worth searching the
list, and your approach makes it obvious that you've to protect the
bitflag and the register/unregister under write-mmap_sem
yourself. Otherwise the find function will return an object that can
be freed at any time if somebody calls unregister and
kfree. (synchronize_srcu in mmu_notifier_unregister won't wait for
anything but some outstanding srcu_read_lock)
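
A sketch of that suggestion with hypothetical GRU-ish names, using the
__mmu_notifier_register variant discussed elsewhere in this thread, which
expects the caller to hold the mmap_sem in write mode:

/* Track the registration state in the driver's own container instead
 * of searching the notifier list.  All fields and names are made up
 * for illustration; both fields are protected by the caller's
 * down_write(&mm->mmap_sem). */
struct gru_mm_struct {
	struct mmu_notifier	ms_notifier;
	int			ms_refcnt;
	int			ms_registered;
};

static void gru_attach_mm(struct gru_mm_struct *gms, struct mm_struct *mm)
{
	/* caller holds down_write(&mm->mmap_sem) */
	if (!gms->ms_registered) {
		__mmu_notifier_register(&gms->ms_notifier, mm);
		gms->ms_registered = 1;
	}
	gms->ms_refcnt++;
}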



[kvm-devel] mmu notifier v15 - v16 diff

2008-05-06 Thread Andrea Arcangeli
Hello everyone,

This is to allow GRU code to call __mmu_notifier_register inside the
mmap_sem (write mode is required as documented in the patch).

It also removes the requirement to implement ->release as it's not
guaranteed all users will really need it.

I didn't integrate the search function as we can sort that out after
2.6.26 is out and it wasn't entirely obvious it's really needed, as
the driver should be able to track if a mmu notifier is registered in
the container.

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -29,10 +29,25 @@ struct mmu_notifier_ops {
/*
 * Called either by mmu_notifier_unregister or when the mm is
 * being destroyed by exit_mmap, always before all pages are
-* freed. It's mandatory to implement this method. This can
-* run concurrently with other mmu notifier methods and it
+* freed. This can run concurrently with other mmu notifier
+* methods (the ones invoked outside the mm context) and it
 * should tear down all secondary mmu mappings and freeze the
-* secondary mmu.
+* secondary mmu. If this method isn't implemented you've to
+* be sure that nothing could possibly write to the pages
+* through the secondary mmu by the time the last thread with
+* tsk->mm == mm exits.
+*
+* As side note: the pages freed after -release returns could
+* be immediately reallocated by the gart at an alias physical
+* address with a different cache model, so if -release isn't
+* implemented because all _software_ driven memory accesses
+* through the secondary mmu are terminated by the time the
+* last thread of this mm quits, you've also to be sure that
+* speculative _hardware_ operations can't allocate dirty
+* cachelines in the cpu that could not be snooped and made
+* coherent with the other read and write operations happening
+* through the gart alias address, so leading to memory
+* corruption.
 */
void (*release)(struct mmu_notifier *mn,
struct mm_struct *mm);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2340,13 +2340,20 @@ static inline void __mm_unlock(spinlock_
 /*
  * This operation locks against the VM for all pte/vma/mm related
  * operations that could ever happen on a certain mm. This includes
- * vmtruncate, try_to_unmap, and all page faults. The holder
- * must not hold any mm related lock. A single task can't take more
- * than one mm_lock in a row or it would deadlock.
+ * vmtruncate, try_to_unmap, and all page faults.
  *
- * The mmap_sem must be taken in write mode to block all operations
- * that could modify pagetables and free pages without altering the
- * vma layout (for example populate_range() with nonlinear vmas).
+ * The caller must take the mmap_sem in read or write mode before
+ * calling mm_lock(). The caller isn't allowed to release the mmap_sem
+ * until mm_unlock() returns.
+ *
+ * While mm_lock() itself won't strictly require the mmap_sem in write
+ * mode to be safe, in order to block all operations that could modify
+ * pagetables and free pages without need of altering the vma layout
+ * (for example populate_range() with nonlinear vmas) the mmap_sem
+ * must be taken in write mode by the caller.
+ *
+ * A single task can't take more than one mm_lock in a row or it would
+ * deadlock.
  *
  * The sorting is needed to avoid lock inversion deadlocks if two
  * tasks run mm_lock at the same time on different mm that happen to
@@ -2377,17 +2384,13 @@ int mm_lock(struct mm_struct *mm, struct
 {
spinlock_t **anon_vma_locks, **i_mmap_locks;
 
-	down_write(&mm->mmap_sem);
 	if (mm->map_count) {
 		anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
-		if (unlikely(!anon_vma_locks)) {
-			up_write(&mm->mmap_sem);
+		if (unlikely(!anon_vma_locks))
 			return -ENOMEM;
-		}
 
 		i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
 		if (unlikely(!i_mmap_locks)) {
-			up_write(&mm->mmap_sem);
vfree(anon_vma_locks);
return -ENOMEM;
}
@@ -2426,10 +2429,12 @@ static void mm_unlock_vfree(spinlock_t *
 /*
  * mm_unlock doesn't require any memory allocation and it won't fail.
  *
+ * The mmap_sem cannot be released until mm_unlock returns.
+ *
  * All memory has been previously allocated by mm_lock and it'll be
  * all freed before returning. Only after mm_unlock returns, the
  * caller is allowed to free and forget the mm_lock_data structure.
- * 
+ *
  * mm_unlock runs in O(N) where N is the max number of VMAs in the
  * mm. The max number of vmas is defined in
  * 

Re: [kvm-devel] [PATCH 01 of 11] mmu-notifier-core

2008-05-05 Thread Andrea Arcangeli
On Mon, May 05, 2008 at 11:21:13AM -0500, Jack Steiner wrote:
 The GRU does the registration/deregistration of mmu notifiers from 
 mmap/munmap.
 At this point, the mmap_sem is already held writeable. I hit a deadlock
 in mm_lock.

It'd have been better to know about this detail earlier, but frankly
this is a minor problem; the important thing is that we all agree
together on the more difficult parts ;).

 A quick fix would be to do one of the following:
 
   - move the mmap_sem locking to the caller of the [de]registration 
 routines.
 Since the first/last thing done in mm_lock/mm_unlock is to
 acquire/release mmap_sem, this change does not cause major changes.

I don't like this solution very much. Neither GRU nor KVM will call
mmu_notifier_register inside mmap_sem protected sections, so I think
the default mmu_notifier_register should be smp safe by itself without
requiring additional locks to be artificially taken externally
(especially because the need for mmap_sem in write mode is a very
mmu_notifier internal detail).

   - add a flag to mmu_notifier_[un]register routines to indicate
 if mmap_sem is already locked.

The interface would change like this:

#define MMU_NOTIFIER_REGISTER_MMAP_SEM (1<<0)
void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm,
   unsigned long mmu_notifier_flags);

A third solution is to add:

/*
 * This can be called instead of mmu_notifier_register after
 * taking the mmap_sem in write mode (read mode isn't enough).
 */
void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm);

Do you still prefer the bitflag, or do you prefer
__mmu_notifier_register? It's ok either way, except
__mmu_notifier_register could be removed in a backwards compatible
way; the bitflag can't.
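
A hypothetical usage sketch of that third option (illustrative only;
everything except __mmu_notifier_register itself is made up):

/* The driver takes, or already holds, the mmap_sem in write mode
 * around the registration, as the prototype above requires. */
static void my_register_notifier(struct mmu_notifier *mn, struct mm_struct *mm)
{
	down_write(&mm->mmap_sem);
	__mmu_notifier_register(mn, mm);	/* relies on the caller's mmap_sem */
	up_write(&mm->mmap_sem);
}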

 I've temporarily deleted the mm_lock locking of mmap_sem and am continuing to
 test. More later

Sure! In the meantime go ahead this way.

Another very minor change I've been thinking about is to make
->release not mandatory. It happens that with KVM ->release isn't
strictly required because after mm_users reaches 0, no guest could
possibly run anymore. So I'm using ->release only for debugging by
placing -1UL in the root shadow pagetable, to be sure ;). So because
at least one user won't strictly require ->release, being consistent
in having all methods optional may be nicer. Alternatively we could
make them all mandatory and if somebody doesn't need one of the
methods it should implement it as a dummy function. Both ways have
pros and cons, but they don't make any difference to us in practice.
If I've to change the patch for the mmap_sem taken during registration
I may as well clean up this minor bit.

Also note the rculist.h patch you sent earlier won't work against
mainline so I can't incorporate it in my patchset, Andrew will have to
apply it as mmu-notifier-core-mm after incorporating mmu-notifier-core
into -mm.

Until a new update is released, mmu-notifier-core v15 remains ok for
merging, no known bugs, here we're talking about a new and simple
feature and a tiny cleanup that nobody can notice anyway.



Re: [kvm-devel] [PATCH 01 of 11] mmu-notifier-core

2008-05-05 Thread Andrea Arcangeli
On Mon, May 05, 2008 at 12:25:06PM -0500, Jack Steiner wrote:
 Agree. My apologies... I should have caught it.

No problem.

 __mmu_notifier_register/__mmu_notifier_unregister seems like a better way to
 go, although either is ok.

If you also like __mmu_notifier_register more I'll go with it. The
bitflag seems like a bit of overkill as I can't see the need of any
other bitflag besides this one, and it also can't be removed as
easily in case you find a way to call it outside the lock later.

 Let me finish my testing. At one time, I did not use ->release but
 with all the locking & teardown changes, I need to do some reverification.

If you didn't implement it you shall apply this patch, but you shall
read carefully the comment I've written that covers that usage case.

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -29,10 +29,25 @@ struct mmu_notifier_ops {
/*
 * Called either by mmu_notifier_unregister or when the mm is
 * being destroyed by exit_mmap, always before all pages are
-* freed. It's mandatory to implement this method. This can
-* run concurrently with other mmu notifier methods and it
+* freed. This can run concurrently with other mmu notifier
+* methods (the ones invoked outside the mm context) and it
 * should tear down all secondary mmu mappings and freeze the
-* secondary mmu.
+* secondary mmu. If this method isn't implemented you've to
+* be sure that nothing could possibly write to the pages
+* through the secondary mmu by the time the last thread with
+* tsk->mm == mm exits.
+*
+* As side note: the pages freed after -release returns could
+* be immediately reallocated by the gart at an alias physical
+* address with a different cache model, so if -release isn't
+* implemented because all memory accesses through the
+* secondary mmu implicitly are terminated by the time the
+* last thread of this mm quits, you've also to be sure that
+* speculative hardware operations can't allocate dirty
+* cachelines in the cpu that could not be snooped and made
+* coherent with the other read and write operations happening
+* through the gart alias address, leading to memory
+* corruption.
 */
void (*release)(struct mmu_notifier *mn,
struct mm_struct *mm);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -59,7 +59,8 @@ void __mmu_notifier_release(struct mm_st
 * from establishing any more sptes before all the
 * pages in the mm are freed.
 */
-	mn->ops->release(mn, mm);
+	if (mn->ops->release)
+		mn->ops->release(mn, mm);
 	srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
 	spin_lock(&mm->mmu_notifier_mm->lock);
}
@@ -251,7 +252,8 @@ void mmu_notifier_unregister(struct mmu_
 * guarantee -release is called before freeing the
 * pages.
 */
-	mn->ops->release(mn, mm);
+	if (mn->ops->release)
+		mn->ops->release(mn, mm);
 	srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
 } else
 	spin_unlock(&mm->mmu_notifier_mm->lock);



Re: [kvm-devel] [PATCH 01 of 11] mmu-notifier-core

2008-05-04 Thread Andrea Arcangeli
On Sun, May 04, 2008 at 02:13:45PM -0500, Robin Holt wrote:
  diff --git a/mm/Kconfig b/mm/Kconfig
  --- a/mm/Kconfig
  +++ b/mm/Kconfig
  @@ -205,3 +205,6 @@ config VIRT_TO_BUS
   config VIRT_TO_BUS
  def_bool y
  depends on !ARCH_NO_VIRT_TO_BUS
  +
  +config MMU_NOTIFIER
  +   bool
 
 Without some text following the bool keyword, I am not even asked for
 this config setting on my ia64 build.

Yes, this was explicitly asked by Andrew after his review. This is the
explanation pasted from the changelog.

3) It'd be a waste to add branches in the VM if nobody could possibly
   run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled
   if CONFIG_KVM=m/y. In the current kernel kvm won't yet take
   advantage of mmu notifiers, but this already allows to compile a
   KVM external module against a kernel with mmu notifiers enabled and
   from the next pull from kvm.git we'll start using them. And
   GRU/XPMEM will also be able to continue the development by enabling
   KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code
   to the mainline kernel. Then they can also enable MMU_NOTIFIERS in
   the same way KVM does it (even if KVM=n). This guarantees nobody
   selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n.



[kvm-devel] kvm mmu notifier update

2008-05-03 Thread Andrea Arcangeli
I updated the mmu notifier patch. There was a problem in shutting down
the vm in mmu_notifier_release instead of waiting for the last
filp->release, because all vcpus will be freed and the
filp->private_data will point to already freed memory when the vcpu fd
is closed.

It seems we may really not need to do anything in the mmu notifier
release method, because for us the shadow pagetables are meaningless
if no guest can run, and the ->release method is only invoked when all
tasks with current->mm == kvm->mm have already quit. After that the
guest can't possibly run anymore and the ioctl becomes useless too. So
I changed the code to only invalidate the root of the spte radix tree
in every vcpu, just for debugging. If any guest attempts to run after
mmu notifier release runs, we'll then notice. No spte can be
established after ->release returns.

Probably ->release shouldn't be mandatory to implement, but from a
different point of view it may also pay off to make all methods
mandatory to implement, as a micro-optimization to avoid the null
pointer check before invoking the notifier (and in the future to also
fail registration if the api is extended and a module isn't updated,
to decrease the risk of runtime failure).

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 8d45fab..ce3251c 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
+	select MMU_NOTIFIER
 	select ANON_INODES
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3d769c3..978da9b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -651,6 +651,101 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
account_shadowed(kvm, gfn);
 }
 
+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+	u64 *spte, *curr_spte;
+	int need_tlb_flush = 0;
+
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		BUG_ON(!(*spte & PT_PRESENT_MASK));
+		rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
+		curr_spte = spte;
+		spte = rmap_next(kvm, rmapp, spte);
+		rmap_remove(kvm, curr_spte);
+		set_shadow_pte(curr_spte, shadow_trap_nonpresent_pte);
+		need_tlb_flush = 1;
+	}
+	return need_tlb_flush;
+}
+
+int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
+{
+	int i;
+	int need_tlb_flush = 0;
+
+	/*
+	 * If mmap_sem isn't taken, we can look the memslots with only
+	 * the mmu_lock by skipping over the slots with userspace_addr == 0.
+	 */
+	for (i = 0; i < kvm->nmemslots; i++) {
+		struct kvm_memory_slot *memslot = &kvm->memslots[i];
+		unsigned long start = memslot->userspace_addr;
+		unsigned long end;
+
+		/* mmu_lock protects userspace_addr */
+		if (!start)
+			continue;
+
+		end = start + (memslot->npages << PAGE_SHIFT);
+		if (hva >= start && hva < end) {
+			gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+			need_tlb_flush |= kvm_unmap_rmapp(kvm,
+							  &memslot->rmap[gfn_offset]);
+		}
+	}
+
+	return need_tlb_flush;
+}
+
+static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+	u64 *spte;
+	int young = 0;
+
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		int _young;
+		u64 _spte = *spte;
+		BUG_ON(!(_spte & PT_PRESENT_MASK));
+		_young = _spte & PT_ACCESSED_MASK;
+		if (_young) {
+			young = !!_young;
+			set_shadow_pte(spte, _spte & ~PT_ACCESSED_MASK);
+		}
+		spte = rmap_next(kvm, rmapp, spte);
+	}
+	return young;
+}
+
+int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+{
+	int i;
+	int young = 0;
+
+	/*
+	 * If mmap_sem isn't taken, we can look the memslots with only
+	 * the mmu_lock by skipping over the slots with userspace_addr == 0.
+	 */
+	for (i = 0; i < kvm->nmemslots; i++) {
+		struct kvm_memory_slot *memslot = &kvm->memslots[i];
+		unsigned long start = memslot->userspace_addr;
+		unsigned long end;
+
+		/* mmu_lock protects userspace_addr */
+		if (!start)
+			continue;
+
+		end = start + (memslot->npages << PAGE_SHIFT);
+		if (hva >= start && hva < end) {
+			gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+			young |= kvm_age_rmapp(kvm

[kvm-devel] [PATCH 00 of 11] mmu notifier #v15

2008-05-02 Thread Andrea Arcangeli
Hello everyone,

1/11 is the latest version of the mmu-notifier-core patch.

As usual all later 2-11/11 patches follows but those aren't meant for 2.6.26.

Thanks!
Andrea



[kvm-devel] [PATCH 02 of 11] get_task_mm

2008-05-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1209740185 -7200
# Node ID c85c85c4be165eb6de16136bb97cf1fa7fd5c88f
# Parent  1489529e7b53d3f2dab8431372aa4850ec821caa
get_task_mm

get_task_mm should not succeed if mmput() is running and has reduced
the mm_users count to zero. This can occur if a processor follows
a task's pointer to an mm struct because that pointer is only cleared
after the mmput().

If get_task_mm() succeeds after mmput() reduced the mm_users to zero then
we have the lovely situation that one portion of the kernel is doing
all the teardown work for an mm while another portion is happily using
it.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -465,7 +465,8 @@ struct mm_struct *get_task_mm(struct tas
 		if (task->flags & PF_BORROWED_MM)
 			mm = NULL;
 		else
-			atomic_inc(&mm->mm_users);
+			if (!atomic_inc_not_zero(&mm->mm_users))
+				mm = NULL;
 	}
 	task_unlock(task);
 	return mm;



[kvm-devel] [PATCH 01 of 11] mmu-notifier-core

2008-05-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1209740175 -7200
# Node ID 1489529e7b53d3f2dab8431372aa4850ec821caa
# Parent  5026689a3bc323a26d33ad882c34c4c9c9a3ecd8
mmu-notifier-core

With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean secondary pte. In GRU case
there's no actual secondary pte and there's only a secondary tlb
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case of KVM where the a
secondary tlb miss will walk sptes in hardware and it will refill the
secondary tlb transparently to software if the corresponding spte is
present). The same way zap_page_range has to invalidate the pte before
freeing the page, the spte (and secondary tlb) must also be
invalidated before any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte because they're part of the guest working set. Furthermore a spte
unmap event can immediately lead to a page to be freed when the pin is
released (so requiring the same complex and relatively slow tlb_gather
smp safe logic we have in zap_page_range and that can be avoided
completely if the spte unmap event doesn't require an unpin of the
page previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk-mm and
know when the VM is swapping or freeing or doing anything on the
primary MMU so that the secondary MMU code can drop sptes before the
pages are freed, avoiding all page pinning and allowing 100% reliable
swapping of guest physical address space. Furthermore it saves the
code that tears down the mappings of the secondary MMU from having to
implement a logic like tlb_gather in zap_page_range, which would require
many IPIs to flush other cpu tlbs for each fixed number of sptes unmapped.

To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings
will be invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an
updated spte or secondary-tlb-mapping on the copied page. Or it will
setup a readonly spte or readonly tlb mapping if it's a guest-read, if
it calls get_user_pages with write=0. This is just an example.

This allows to map any page pointed by any pte (and in turn visible in
the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU,
or a full MMU with both sptes and secondary-tlb like the
shadow-pagetable layer with kvm), or a remote DMA in software like
XPMEM (hence needing of schedule in XPMEM code to send the invalidate
to the remote node, while no need to schedule in kvm/gru as it's an
immediate event like invalidating primary-mmu pte).

At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for
   list_del_rcu too)

2) mm_lock() to register the mmu notifier when the whole VM isn't
   doing anything with mm. This allows mmu notifier users to keep
   track if the VM is in the middle of the invalidate_range_begin/end
   critical section with an atomic counter increase in range_begin and
   decreased in range_end (see the sketch after this list). No
   secondary MMU page fault is allowed to map any spte or secondary
   tlb reference, while the VM is in the middle of range_begin/end as
   any page returned by get_user_pages in that critical section could
   later immediately be freed without any further ->invalidate_page
   notification (invalidate_range_begin/end works on ranges and
   ->invalidate_page isn't called immediately
   before freeing the page). To stop all page freeing and pagetable
   overwrites the mmap_sem must be taken in write mode and all other
   anon_vma/i_mmap locks must be taken in virtual address order. The
   order is critical to avoid mm_lock(mm1) and mm_lock(mm2) running
   concurrently to trigger lock inversion deadlocks.

3) It'd be a waste to add branches in the VM if nobody could possibly
   run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled
   if CONFIG_KVM=m/y. In the current kernel kvm won't yet take
   advantage of mmu notifiers, but this already allows to compile a
   KVM external module against a kernel with mmu notifiers enabled and
   from the next pull from kvm.git we'll start using them. And
   GRU/XPMEM will also be able to continue the development by enabling
   KVM=m in their config, until they submit all GRU
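
To make dependency 2 concrete, here is the counter sketch referenced
above (a hypothetical driver with invented names; the real code must
also hold a lock shared with the notifier methods around the check,
this only illustrates the accounting):

#include <linux/mmu_notifier.h>
#include <linux/mm.h>

struct my_notifier {
	struct mmu_notifier mn;
	atomic_t invalidate_count;	/* > 0 while inside range_begin/end */
};

static void my_range_begin(struct mmu_notifier *mn, struct mm_struct *mm,
			   unsigned long start, unsigned long end)
{
	struct my_notifier *m = container_of(mn, struct my_notifier, mn);

	atomic_inc(&m->invalidate_count);
	/* drop all secondary mappings covering [start, end) here */
}

static void my_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
			 unsigned long start, unsigned long end)
{
	struct my_notifier *m = container_of(mn, struct my_notifier, mn);

	atomic_dec(&m->invalidate_count);
}

/*
 * Secondary MMU fault path: pages returned by get_user_pages while the
 * counter is elevated may be freed with no further ->invalidate_page,
 * so refuse to map them and let the caller retry after range_end.
 */
static int my_secondary_fault(struct my_notifier *m, unsigned long hva)
{
	if (atomic_read(&m->invalidate_count))
		return -EAGAIN;
	/* safe to call get_user_pages() and instantiate the mapping */
	return 0;
}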

[kvm-devel] [PATCH 05 of 11] unmap vmas tlb flushing

2008-05-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1209740186 -7200
# Node ID a8ac53b928dfcea0ccb326fb7d71f908f0df85f4
# Parent  14e9f5a12bb1657fa6756e18d5dac71d4ad1a55e
unmap vmas tlb flushing

Move the tlb flushing inside of unmap vmas. This saves us from passing
a pointer to the TLB structure around and simplifies the callers.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -744,8 +744,7 @@ struct page *vm_normal_page(struct vm_ar
 
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
-   struct vm_area_struct *start_vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long 
start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *);
 
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -849,7 +849,6 @@ static unsigned long unmap_page_range(st
 
 /**
  * unmap_vmas - unmap a range of memory covered by a list of vma's
- * @tlbp: address of the caller's struct mmu_gather
  * @vma: the starting vma
  * @start_addr: virtual address at which to start unmapping
  * @end_addr: virtual address at which to end unmapping
@@ -861,20 +860,13 @@ static unsigned long unmap_page_range(st
  * Unmap all pages in the vma list.
  *
  * We aim to not hold locks for too long (for scheduling latency reasons).
- * So zap pages in ZAP_BLOCK_SIZE bytecounts.  This means we need to
- * return the ending mmu_gather to the caller.
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts.
  *
  * Only addresses between `start' and `end' will be unmapped.
  *
  * The VMA list must be sorted in ascending virtual address order.
- *
- * unmap_vmas() assumes that the caller will flush the whole unmapped address
- * range after unmap_vmas() returns.  So the only responsibility here is to
- * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
- * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
-   struct vm_area_struct *vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *details)
 {
@@ -883,9 +875,14 @@ unsigned long unmap_vmas(struct mmu_gath
int tlb_start_valid = 0;
unsigned long start = start_addr;
spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
-   int fullmm = (*tlbp)->fullmm;
+   int fullmm;
+   struct mmu_gather *tlb;
struct mm_struct *mm = vma->vm_mm;
 
+   lru_add_drain();
+   tlb = tlb_gather_mmu(mm, 0);
+   update_hiwater_rss(mm);
+   fullmm = tlb->fullmm;
mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;
@@ -912,7 +909,7 @@ unsigned long unmap_vmas(struct mmu_gath
(HPAGE_SIZE / PAGE_SIZE);
start = end;
} else
-   start = unmap_page_range(*tlbp, vma,
+   start = unmap_page_range(tlb, vma,
start, end, zap_work, details);
 
if (zap_work > 0) {
@@ -920,22 +917,23 @@ unsigned long unmap_vmas(struct mmu_gath
break;
}
 
-   tlb_finish_mmu(*tlbp, tlb_start, start);
+   tlb_finish_mmu(tlb, tlb_start, start);
 
if (need_resched() ||
(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
if (i_mmap_lock) {
-   *tlbp = NULL;
+   tlb = NULL;
goto out;
}
cond_resched();
}
 
-   *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
+   tlb = tlb_gather_mmu(vma->vm_mm, fullmm);
tlb_start_valid = 0;
zap_work = ZAP_BLOCK_SIZE;
}
}
+   tlb_finish_mmu(tlb, start_addr, end_addr);
 out:
mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
return start;   /* which is now the end (or restart) address */
@@ -951,18 +949,10 @@ unsigned long zap_page_range(struct vm_a
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address

[kvm-devel] [PATCH 04 of 11] free-pgtables

2008-05-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1209740185 -7200
# Node ID 14e9f5a12bb1657fa6756e18d5dac71d4ad1a55e
# Parent  ea8fc9187b6d3ef2742061b4f62598afe55281cf
free-pgtables

Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables() and we cannot sleep while gathering pages for a tlb
flush.

Move the tlb_gather/tlb_finish call to free_pgtables() to be done
for each vma. This may add a number of tlb flushes depending on the
number of vmas that cannot be coalesced into one.

The first pointer argument to free_pgtables() can then be dropped.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -772,8 +772,8 @@ int walk_page_range(const struct mm_stru
void *private);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
-   unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+   unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -272,9 +272,11 @@ void free_pgd_range(struct mmu_gather **
} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
-   unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+   unsigned long ceiling)
 {
+   struct mmu_gather *tlb;
+
while (vma) {
struct vm_area_struct *next = vma->vm_next;
unsigned long addr = vma->vm_start;
@@ -286,7 +288,8 @@ void free_pgtables(struct mmu_gather **t
unlink_file_vma(vma);
 
if (is_vm_hugetlb_page(vma)) {
-   hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+   tlb = tlb_gather_mmu(vma->vm_mm, 0);
+   hugetlb_free_pgd_range(&tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
} else {
/*
@@ -299,9 +302,11 @@ void free_pgtables(struct mmu_gather **t
anon_vma_unlink(vma);
unlink_file_vma(vma);
}
-   free_pgd_range(tlb, addr, vma->vm_end,
+   tlb = tlb_gather_mmu(vma->vm_mm, 0);
+   free_pgd_range(&tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
}
+   tlb_finish_mmu(tlb, addr, vma->vm_end);
vma = next;
}
 }
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1759,9 +1759,9 @@ static void unmap_region(struct mm_struc
update_hiwater_rss(mm);
unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
-   free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+   tlb_finish_mmu(tlb, start, end);
+   free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 next? next->vm_start: 0);
-   tlb_finish_mmu(tlb, start, end);
 }
 
 /*
@@ -2060,8 +2060,8 @@ void exit_mmap(struct mm_struct *mm)
/* Use -1 here to ensure all VMAs in the mm are unmapped */
end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
-   free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+   free_pgtables(vma, FIRST_USER_ADDRESS, 0);
 
/*
 * Walk the list again, actually closing and freeing it,



[kvm-devel] [PATCH 06 of 11] rwsem contended

2008-05-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1209740186 -7200
# Node ID 74b873f3ea07012e2fc864f203edf1179865feb1
# Parent  a8ac53b928dfcea0ccb326fb7d71f908f0df85f4
rwsem contended

Add a function to rw_semaphores to check if there are any processes
waiting for the semaphore. Add rwsem_needbreak to sched.h that works
in the same way as spinlock_needbreak().

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -57,6 +57,8 @@ extern void up_write(struct rw_semaphore
  */
 extern void downgrade_write(struct rw_semaphore *sem);
 
+extern int rwsem_is_contended(struct rw_semaphore *sem);
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 /*
  * nested locking. NOTE: rwsems are not allowed to recurse
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2030,6 +2030,15 @@ static inline int spin_needbreak(spinloc
 #endif
 }
 
+static inline int rwsem_needbreak(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_PREEMPT
+   return rwsem_is_contended(sem);
+#else
+   return 0;
+#endif
+}
+
 /*
  * Reevaluate whether the task has signals pending delivery.
  * Wake the task if so.
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -305,6 +305,18 @@ void __downgrade_write(struct rw_semapho
spin_unlock_irqrestore(&sem->wait_lock, flags);
 }
 
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+   /*
+* Racy check for an empty list. False positives or negatives
+* would be okay. False positive may cause a useless dropping of
+* locks. False negatives may cause locks to be held a bit
+* longer until the next check.
+*/
+   return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
 EXPORT_SYMBOL(__init_rwsem);
 EXPORT_SYMBOL(__down_read);
 EXPORT_SYMBOL(__down_read_trylock);
diff --git a/lib/rwsem.c b/lib/rwsem.c
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -251,6 +251,18 @@ asmregparm struct rw_semaphore *rwsem_do
return sem;
 }
 
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+   /*
+* Racy check for an empty list. False positives or negatives
+* would be okay. False positive may cause a useless dropping of
+* locks. False negatives may cause locks to be held a bit
+* longer until the next check.
+*/
+   return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
 EXPORT_SYMBOL(rwsem_down_read_failed);
 EXPORT_SYMBOL(rwsem_down_write_failed);
 EXPORT_SYMBOL(rwsem_wake);
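
For illustration, a hedged sketch of a hypothetical caller (not part of
this patch) that breaks the lock the same way spin_needbreak() users
do: give up the semaphore when someone is queued, reschedule, and let
the caller restart the traversal.

#include <linux/rwsem.h>
#include <linux/sched.h>
#include <linux/list.h>

/* Hypothetical item type, for illustration only. */
struct scan_item {
	struct list_head node;
};

/* Returns 0 if the whole list was scanned, -EAGAIN if the lock was broken. */
static int scan_items(struct rw_semaphore *sem, struct list_head *head)
{
	struct scan_item *item;
	int ret = 0;

	down_read(sem);
	list_for_each_entry(item, head, node) {
		/* ... expensive per-item work under the rwsem ... */
		if (rwsem_needbreak(sem)) {
			ret = -EAGAIN;	/* list may change, caller restarts */
			break;
		}
	}
	up_read(sem);
	if (ret == -EAGAIN)
		cond_resched();
	return ret;
}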



[kvm-devel] [PATCH 07 of 11] i_mmap_rwsem

2008-05-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1209740186 -7200
# Node ID de28c85baef11b90c993047ca851a2f52c85a5be
# Parent  74b873f3ea07012e2fc864f203edf1179865feb1
i_mmap_rwsem

The conversion to a rwsem allows notifier callbacks during rmap traversal
for files. A rw style lock also allows concurrent walking of the
reverse map so that multiple processors can expire pages in the same memory
area of the same process. So it increases the potential concurrency.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/Documentation/vm/locking b/Documentation/vm/locking
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@ expand_stack(), it is hard to come up wi
 expand_stack(), it is hard to come up with a destructive scenario without 
 having the vmlist protection in this case.
 
-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_sem and the kmem cache
 c_spinlock spinlocks.  This is okay, since the kmem code asks for pages after
 dropping c_spinlock.  The page_table_lock also nests with pagecache_lock and
 pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str
if (!vma_shareable(vma, addr))
return;
 
-   spin_lock(&mapping->i_mmap_lock);
+   down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str
put_page(virt_to_page(spte));
spin_unlock(&mm->page_table_lock);
 out:
-   spin_unlock(&mapping->i_mmap_lock);
+   up_read(&mapping->i_mmap_sem);
 }
 
 /*
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -454,10 +454,10 @@ static int hugetlb_vmtruncate(struct ino
pgoff = offset >> PAGE_SHIFT;
 
i_size_write(inode, offset);
-   spin_lock(&mapping->i_mmap_lock);
+   down_read(&mapping->i_mmap_sem);
if (!prio_tree_empty(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-   spin_unlock(&mapping->i_mmap_lock);
+   up_read(&mapping->i_mmap_sem);
truncate_hugepages(inode, offset);
return 0;
 }
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode
INIT_LIST_HEAD(&inode->i_devices);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
rwlock_init(&inode->i_data.tree_lock);
-   spin_lock_init(&inode->i_data.i_mmap_lock);
+   init_rwsem(&inode->i_data.i_mmap_sem);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -502,7 +502,7 @@ struct address_space {
unsigned inti_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root   i_mmap; /* tree of private and shared 
mappings */
struct list_headi_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-   spinlock_t  i_mmap_lock;/* protect tree, count, list */
+   struct rw_semaphore i_mmap_sem; /* protect tree, count, list */
unsigned inttruncate_count; /* Cover race condition with 
truncate */
unsigned long   nrpages;/* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -735,7 +735,7 @@ struct zap_details {
struct address_space *check_mapping;/* Check page-mapping if set */
pgoff_t first_index;/* Lowest page-index to unmap 
*/
pgoff_t last_index; /* Highest page-index to unmap 
*/
-   spinlock_t *i_mmap_lock;/* For unmap_mapping_range: */
+   struct rw_semaphore *i_mmap_sem;/* For unmap_mapping_range: */
unsigned long truncate_count;   /* Compare vm_truncate_count */
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -297,12 +297,12 @@ static int dup_mmap(struct mm_struct *mm
atomic_dec(&inode->i_writecount);
 
/* insert tmp into the share list, just after mpnt */
-   spin_lock(&file->f_mapping->i_mmap_lock);
+   down_write(&file->f_mapping->i_mmap_sem);
tmp->vm_truncate_count = mpnt

[kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1209740186 -7200
# Node ID 0be678c52e540d5f5d5fd9af549b57b9bb018d32
# Parent  de28c85baef11b90c993047ca851a2f52c85a5be
anon-vma-rwsem

Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows the calling of sleeping functions from reverse map traversal as
needed for the notifier callbacks. It increases possible concurrency.

Rcu is used in some context to guarantee the presence of the anon_vma
(try_to_unmap) while we acquire the anon_vma lock. We cannot take a
semaphore within an rcu critical section. Add a refcount to the anon_vma
structure which allow us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma. I think this is a bug
because the anon_vma may become empty and get scheduled to be freed
but then we increase the refcount again when the migration entries are
removed.

The refcount in general allows a shortening of RCU critical sections since
we can do a rcu_unlock after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.
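
The take-a-reference-inside-RCU pattern described above looks roughly
like the sketch below (a paraphrase of what the grab_anon_vma() helper
declared by this patch is meant to do, not a verbatim copy of it):

#include <linux/mm.h>
#include <linux/rmap.h>
#include <linux/rcupdate.h>

/*
 * Sketch only: RCU keeps the anon_vma memory valid while we try to take
 * a reference; once the refcount is held we can drop rcu_read_lock()
 * and sleep on anon_vma->sem. Returns NULL if the anon_vma is already
 * being freed.
 */
static struct anon_vma *grab_anon_vma_sketch(struct page *page)
{
	struct anon_vma *anon_vma = NULL;
	unsigned long mapping;

	rcu_read_lock();
	mapping = (unsigned long)page->mapping;
	if ((mapping & PAGE_MAPPING_ANON) && page_mapped(page)) {
		anon_vma = (struct anon_vma *)(mapping - PAGE_MAPPING_ANON);
		if (!atomic_inc_not_zero(&anon_vma->refcount))
			anon_vma = NULL;
	}
	rcu_read_unlock();
	/* caller: down_read(&anon_vma->sem), work, up_read(), put_anon_vma() */
	return anon_vma;
}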

However:
- Atomic overhead increases in situations where a new reference
  to the anon_vma has to be established or removed. Overhead also increases
  when a speculative reference is used (try_to_unmap,
  page_mkclean, page migration).
- There is the potential for more frequent processor change due to up_xxx
  letting waiting tasks run first. This results in f.e. the Aim9 brk
  performance test going down by 10-15%.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -25,7 +25,8 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-   spinlock_t lock;/* Serialize access to vma list */
+   atomic_t refcount;  /* vmas on the list */
+   struct rw_semaphore sem;/* Serialize access to vma list */
struct list_head head;  /* List of private related vmas */
 };
 
@@ -43,18 +44,31 @@ static inline void anon_vma_free(struct 
kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+   atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+   if (atomic_dec_and_test(&anon_vma->refcount))
+   anon_vma_free(anon_vma);
+}
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
-   spin_lock(&anon_vma->lock);
+   down_write(&anon_vma->sem);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
-   spin_unlock(&anon_vma->lock);
+   up_write(&anon_vma->sem);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s
return;
 
/*
-* We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+* We hold either the mmap_sem lock or a reference on the
+* anon_vma. So no need to call page_lock_anon_vma.
 */
anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-   spin_lock(&anon_vma->lock);
+   down_read(&anon_vma->sem);
 
list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
remove_migration_pte(vma, old, new);
 
-   spin_unlock(&anon_vma->lock);
+   up_read(&anon_vma->sem);
 }
 
 /*
@@ -630,7 +631,7 @@ static int unmap_and_move(new_page_t get
int rc = 0;
int *result = NULL;
struct page *newpage = get_new_page(page, private, result);
-   int rcu_locked = 0;
+   struct anon_vma *anon_vma = NULL;
int charge = 0;
 
if (!newpage)
@@ -654,16 +655,14 @@ static int unmap_and_move(new_page_t get
}
/*
 * By try_to_unmap(), page-mapcount goes down to 0 here. In this case,
-* we cannot notice that anon_vma is freed while we migrates a page.
+* we cannot notice that anon_vma is freed while we migrate a page.
 * This rcu_read_lock() delays freeing anon_vma pointer until the end
 * of migration. File cache pages are no problem because of page_lock()
 * File Caches may use write_page() or lock_page

[kvm-devel] [PATCH 09 of 11] mm_lock-rwsem

2008-05-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1209740226 -7200
# Node ID 721c3787cd42043734331e54a42eb20c51766f71
# Parent  0be678c52e540d5f5d5fd9af549b57b9bb018d32
mm_lock-rwsem

Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
conversion.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1084,10 +1084,10 @@ extern int install_special_mapping(struc
   unsigned long flags, struct page **pages);
 
 struct mm_lock_data {
-   spinlock_t **i_mmap_locks;
-   spinlock_t **anon_vma_locks;
-   size_t nr_i_mmap_locks;
-   size_t nr_anon_vma_locks;
+   struct rw_semaphore **i_mmap_sems;
+   struct rw_semaphore **anon_vma_sems;
+   size_t nr_i_mmap_sems;
+   size_t nr_anon_vma_sems;
 };
 extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
 extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2255,8 +2255,8 @@ int install_special_mapping(struct mm_st
 
 static int mm_lock_cmp(const void *a, const void *b)
 {
-   unsigned long _a = (unsigned long)*(spinlock_t **)a;
-   unsigned long _b = (unsigned long)*(spinlock_t **)b;
+   unsigned long _a = (unsigned long)*(struct rw_semaphore **)a;
+   unsigned long _b = (unsigned long)*(struct rw_semaphore **)b;
 
cond_resched();
if (_a  _b)
@@ -2266,7 +2266,7 @@ static int mm_lock_cmp(const void *a, co
return 0;
 }
 
-static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore 
**sems,
  int anon)
 {
struct vm_area_struct *vma;
@@ -2275,59 +2275,59 @@ static unsigned long mm_lock_sort(struct
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (anon) {
if (vma->anon_vma)
-   locks[i++] = &vma->anon_vma->lock;
+   sems[i++] = &vma->anon_vma->sem;
} else {
if (vma->vm_file && vma->vm_file->f_mapping)
-   locks[i++] = 
vma->vm_file->f_mapping->i_mmap_lock;
+   sems[i++] = 
vma->vm_file->f_mapping->i_mmap_sem;
}
}
 
if (!i)
goto out;
 
-   sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL);
+   sort(sems, i, sizeof(struct rw_semaphore *), mm_lock_cmp, NULL);
 
 out:
return i;
 }
 
 static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm,
- spinlock_t **locks)
+ struct rw_semaphore **sems)
 {
-   return mm_lock_sort(mm, locks, 1);
+   return mm_lock_sort(mm, sems, 1);
 }
 
 static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm,
-   spinlock_t **locks)
+   struct rw_semaphore **sems)
 {
-   return mm_lock_sort(mm, locks, 0);
+   return mm_lock_sort(mm, sems, 0);
 }
 
-static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock)
+static void mm_lock_unlock(struct rw_semaphore **sems, size_t nr, int lock)
 {
-   spinlock_t *last = NULL;
+   struct rw_semaphore *last = NULL;
size_t i;
 
for (i = 0; i < nr; i++)
/*  Multiple vmas may use the same lock. */
-   if (locks[i] != last) {
-   BUG_ON((unsigned long) last > (unsigned long) locks[i]);
-   last = locks[i];
+   if (sems[i] != last) {
+   BUG_ON((unsigned long) last > (unsigned long) sems[i]);
+   last = sems[i];
if (lock)
-   spin_lock(last);
+   down_write(last);
else
-   spin_unlock(last);
+   up_write(last);
}
 }
 
-static inline void __mm_lock(spinlock_t **locks, size_t nr)
+static inline void __mm_lock(struct rw_semaphore **sems, size_t nr)
 {
-   mm_lock_unlock(locks, nr, 1);
+   mm_lock_unlock(sems, nr, 1);
 }
 
-static inline void __mm_unlock(spinlock_t **locks, size_t nr)
+static inline void __mm_unlock(struct rw_semaphore **sems, size_t nr)
 {
-   mm_lock_unlock(locks, nr, 0);
+   mm_lock_unlock(sems, nr, 0);
 }
 
 /*
@@ -2351,10 +2351,10 @@ static inline void __mm_unlock(spinlock_
  * of vmas is defined in /proc/sys/vm/max_map_count.
  *
  * mm_lock() can fail if memory allocation fails. The worst case
- * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *),
- * so around 1Mbyte, but in practice it'll be much less

[kvm-devel] [PATCH 10 of 11] export zap_page_range for XPMEM

2008-05-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1209740229 -7200
# Node ID 4f462fb3dff614cd7d971219c3feaef0b43359c1
# Parent  721c3787cd42043734331e54a42eb20c51766f71
export zap_page_range for XPMEM

XPMEM would have used sys_madvise() except that madvise_dontneed()
returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages
XPMEM imports from other partitions and is also true for uncached pages
allocated locally via the mspec allocator.  XPMEM needs zap_page_range()
functionality for these types of pages as well as 'normal' pages.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -954,6 +954,7 @@ unsigned long zap_page_range(struct vm_a
 
return unmap_vmas(vma, address, end, nr_accounted, details);
 }
+EXPORT_SYMBOL_GPL(zap_page_range);
 
 /*
  * Do a quick page-table lookup for a single page.



[kvm-devel] [PATCH 11 of 11] mmap sems

2008-05-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1209740229 -7200
# Node ID b4bf6df98bc00bfbef9423b0dd31cfdba63a5eeb
# Parent  4f462fb3dff614cd7d971219c3feaef0b43359c1
mmap sems

This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -79,6 +79,9 @@ generic_file_direct_IO(int rw, struct ki
  *
  *  ->i_mutex  (generic_file_buffered_write)
  *->mmap_sem   (fault_in_pages_readable->do_page_fault)
+ *
+ *When taking multiple mmap_sems, one should lock the lowest-addressed
+ *one first proceeding on up to the highest-addressed one.
  *
  *  ->i_mutex
  *->i_alloc_sem (various)
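
A hedged sketch of what the rule means in practice for a hypothetical
user that needs two mmap_sems at once (helper names invented here, not
part of the patch):

#include <linux/mm_types.h>
#include <linux/rwsem.h>
#include <linux/lockdep.h>

/* Lock two mmap_sems, lowest-addressed semaphore first, as documented. */
static void lock_two_mmap_sems(struct mm_struct *mm1, struct mm_struct *mm2)
{
	struct mm_struct *lo = mm1, *hi = mm2;

	if (mm1 == mm2) {
		down_write(&mm1->mmap_sem);
		return;
	}
	if ((unsigned long)&mm1->mmap_sem > (unsigned long)&mm2->mmap_sem) {
		lo = mm2;
		hi = mm1;
	}
	down_write(&lo->mmap_sem);
	down_write_nested(&hi->mmap_sem, SINGLE_DEPTH_NESTING);
}

static void unlock_two_mmap_sems(struct mm_struct *mm1, struct mm_struct *mm2)
{
	up_write(&mm1->mmap_sem);
	if (mm2 != mm1)
		up_write(&mm2->mmap_sem);
}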



Re: [kvm-devel] fx_init schedule in atomic

2008-05-02 Thread Andrea Arcangeli
On Fri, May 02, 2008 at 12:28:32PM +0300, Avi Kivity wrote:
 Applied, thanks. Dynamic allocation for the fpu state was introduced in 
 2.6.26-rc, right?

It seems very recent, hit mainline on 30 Apr.

Also we may want to think if there's something cheaper than fx_save to
trigger a math exception that doesn't alter the fpu state, I didn't
think much about it given it's such a slow path that's probably not
worth changing with something more complicated anyway. And bringing in
a few l1 exclusive cachelines in the cpu should allow the second
instruction to repeat faster than the first.



[kvm-devel] mmu notifier-core v14-v15 diff for review

2008-05-01 Thread Andrea Arcangeli
Hello everyone,

this is the v14 to v15 difference to the mmu-notifier-core patch. This
is just for review of the difference, I'll post full v15 soon, please
review the diff in the meantime. Lots of those cleanups are thanks to
Andrew review on mmu-notifier-core in v14. He also spotted the
GFP_KERNEL allocation under spin_lock where DEBUG_SPINLOCK_SLEEP
failed to catch it until I enabled PREEMPT (GFP_KERNEL there was
perfectly safe with all patchset applied but not ok if only
mmu-notifier-core was applied). As usual that bug couldn't hurt
anybody unless the mmu notifiers were armed.

I also wrote a proper changelog to the mmu-notifier-core patch that I
will append before the v14-v15 diff:

Subject: mmu-notifier-core

With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean secondary pte. In GRU case
there's no actual secondary pte and there's only a secondary tlb
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case of KVM where a
secondary tlb miss will walk sptes in hardware and it will refill the
secondary tlb transparently to software if the corresponding spte is
present). The same way zap_page_range has to invalidate the pte before
freeing the page, the spte (and secondary tlb) must also be
invalidated before any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte because they're part of the guest working set. Furthermore a spte
unmap event can immediately lead to a page to be freed when the pin is
released (so requiring the same complex and relatively slow tlb_gather
smp safe logic we have in zap_page_range and that can be avoided
completely if the spte unmap event doesn't require an unpin of the
page previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk-mm and
know when the VM is swapping or freeing or doing anything on the
primary MMU so that the secondary MMU code can drop sptes before the
pages are freed, avoiding all page pinning and allowing 100% reliable
swapping of guest physical address space. Furthermore it saves the
code that tears down the mappings of the secondary MMU from having to
implement a logic like tlb_gather in zap_page_range, which would require
many IPIs to flush other cpu tlbs for each fixed number of sptes unmapped.

To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings
will be invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an
updated spte or secondary-tlb-mapping on the copied page. Or it will
setup a readonly spte or readonly tlb mapping if it's a guest-read, if
it calls get_user_pages with write=0. This is just an example.

This allows to map any page pointed by any pte (and in turn visible in
the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU,
or a full MMU with both sptes and secondary-tlb like the
shadow-pagetable layer with kvm), or a remote DMA in software like
XPMEM (hence needing of schedule in XPMEM code to send the invalidate
to the remote node, while no need to schedule in kvm/gru as it's an
immediate event like invalidating primary-mmu pte).

At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for
   list_del_rcu too)

2) mm_lock() to register the mmu notifier when the whole VM isn't
   doing anything with mm. This allows mmu notifier users to keep
   track if the VM is in the middle of the invalidate_range_begin/end
   critical section with an atomic counter increase in range_begin and
   decreased in range_end. No secondary MMU page fault is allowed to
   map any spte or secondary tlb reference, while the VM is in the
   middle of range_begin/end as any page returned by get_user_pages in
   that critical section could later immediately be freed without any
   further ->invalidate_page notification (invalidate_range_begin/end
   works on ranges and ->invalidate_page isn't called immediately
   before freeing the page). To stop all page freeing and pagetable
   overwrites the mmap_sem must be taken in write mode and all other
   anon_vma/i_mmap locks must be taken in virtual address order. The
   order is critical to avoid mm_lock(mm1) and mm_lock(mm2) running
   concurrently to trigger lock inversion deadlocks.

3) It'd be a 

Re: [kvm-devel] [PATCH] Handle vma regions with no backing page (v2)

2008-04-30 Thread Andrea Arcangeli
On Wed, Apr 30, 2008 at 11:59:47AM +0300, Avi Kivity wrote:
 The code is not trying to find a vma for the address, but a vma for the 
 address which also has VM_PFNMAP set. The cases for vma not found, or vma 
 found, but not VM_PFNMAP, are folded together.

Muli's saying the comparison is reversed, change = to .



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-29 Thread Andrea Arcangeli
Hi Hugh!!

On Tue, Apr 29, 2008 at 11:49:11AM +0100, Hugh Dickins wrote:
 [I'm scarcely following the mmu notifiers to-and-fro, which seems
 to be in good hands, amongst faster thinkers than me: who actually
 need and can test this stuff.  Don't let me slow you down; but I
 can quickly clarify on this history.]

Still I think it'd be great if you could review mmu-notifier-core v14.
You and Nick are the core VM maintainers so it'd be great to hear any
feedback about it. I think it's fairly easy to classify the patch as
obviously safe as long as mmu notifiers are disarmed. Here a link for
your convenience.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/mmu-notifier-core

 No, the locking was different as you had it, Andrea: there was an extra
 bitspin lock, carried over from the pte_chains days (maybe we changed
 the name, maybe we disagreed over the name, I forget), which mainly
 guarded the page-mapcount.  I thought that was one lock more than we
 needed, and eliminated it in favour of atomic page-mapcount in 2.6.9.

Thanks a lot for the explanation!



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-29 Thread Andrea Arcangeli
On Mon, Apr 28, 2008 at 06:28:06PM -0700, Christoph Lameter wrote:
 On Tue, 29 Apr 2008, Andrea Arcangeli wrote:
 
  Frankly I've absolutely no idea why rcu is needed in all rmap code
  when walking the page->mapping. Definitely the PG_locked is taken so
  there's no way page->mapping could possibly go away under the rmap
  code, hence the anon_vma can't go away as it's queued in the vma, and
  the vma has to go away before the page is zapped out of the pte.
 
 zap_pte_range can race with the rmap code and it does not take the page 
 lock. The page may not go away since a refcount was taken but the mapping 
 can go away. Without RCU you have no guarantee that the anon_vma is 
 existing when you take the lock. 

There's some room for improvement, like using down_read_trylock, if
that succeeds we don't need to increase the refcount and we can keep
the rcu_read_lock held instead.

Secondly we don't need to increase the refcount in fork() when we
queue the vma-copy in the anon_vma. You should init the refcount to 1
when the anon_vma is allocated, remove the atomic_inc from all code
(except when down_read_trylock fails) and then change anon_vma_unlink
to:

up_write(&anon_vma->sem);
if (empty)
put_anon_vma(anon_vma);

While the down_read_trylock surely won't help in AIM, the second
change will reduce a bit the overhead in the VM core fast paths by
avoiding all refcounting changes by checking the list_empty the same
way the current code does. I really like how I designed the garbage
collection through list_empty and that's efficient and I'd like to
keep it.
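
A sketch of the proposed anon_vma_unlink() tail (based purely on the
description above; the helper names follow the patchset, the body here
is only illustrative):

/*
 * anon_vma starts life with refcount == 1; linking/unlinking vmas does
 * not touch the refcount and the initial reference is dropped here once
 * the list goes empty. Only a failed down_read_trylock() in the
 * speculative rmap path would need to take an extra reference.
 */
void anon_vma_unlink(struct vm_area_struct *vma)
{
	struct anon_vma *anon_vma = vma->anon_vma;
	int empty;

	if (!anon_vma)
		return;

	down_write(&anon_vma->sem);
	list_del(&vma->anon_vma_node);
	/* garbage collect the anon_vma through list_empty, as today */
	empty = list_empty(&anon_vma->head);
	up_write(&anon_vma->sem);

	if (empty)
		put_anon_vma(anon_vma);	/* drops the initial reference */
}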

I however doubt this will bring us back to the same performance of the
current spinlock version, as the real overhead should come out of
overscheduling in down_write in anon_vma_link. Here an initially
spinning lock would help but that's gray area, it greatly depends on
timings, and on very large systems, where a cacheline wait with many
cpus forking at the same time takes more than scheduling, a semaphore
may not slow down performance that much. So I think the only way is a
configuration option to switch the locking at compile time, then XPMEM
will depend on that option to be on, I don't see a big deal and this
guarantees embedded isn't screwed up by totally unnecessary locks on UP.



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-29 Thread Andrea Arcangeli
On Tue, Apr 29, 2008 at 10:50:30AM -0500, Robin Holt wrote:
 You have said this continually about a CONFIG option.  I am unsure how
 that could be achieved.  Could you provide a patch?

I'm busy with the reserved ram patch against 2.6.25 and latest kvm.git
that is moving from pages to pfn for pci passthrough (that change will
also remove the page pin with mmu notifiers).

Unfortunately reserved-ram bugs out again in the blk-settings.c on
real hardware. The fix I pushed in .25 for it, works when booting kvm
(that's how I tested it) but on real hardware sata b_pfn happens to be
1 page less than the result of the min comparison and I'll have to
figure out what happens (only .24 code works on real hardware..., at
least my fix is surely better than the previous .25-pre code).

I've other people waiting on that reserved-ram to be working, so once
I've finished, I'll do the optimization to anon-vma (at least the
removal of the unnecessary atomic_inc from fork) and add the config
option.

Christoph if you've interest in evolving anon-vma-sem and i_mmap_sem
yourself in this direction, you're very welcome to go ahead while I
finish sorting out reserved-ram. If you do, please let me know so we
don't duplicate effort, and it'd be absolutely great if the patches
could be incremental with #v14 so I can merge them trivially later and
upload a new patchset once you're finished (the only outstanding fix
you have to apply on top of #v14 that is already integrated in my
patchset, is the i_mmap_sem deadlock fix I posted and that I'm sure
you've already applied on top of #v14 before doing any more
development on it).

Thanks!



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-28 Thread Andrea Arcangeli
On Mon, Apr 28, 2008 at 01:34:11PM -0700, Christoph Lameter wrote:
 On Sun, 27 Apr 2008, Andrea Arcangeli wrote:
 
  Talking about post 2.6.26: the refcount with rcu in the anon-vma
  conversion seems unnecessary and may explain part of the AIM slowdown
  too. The rest looks ok and probably we should switch the code to a
  compile-time decision between rwlock and rwsem (so obsoleting the
  current spinlock).
 
 You are going to take a semphore in an rcu section? Guess you did not 
 activate all debugging options while testing? I was not aware that you can 
 take a sleeping lock from a non preemptible context.

I'd hoped to discuss this topic after mmu-notifier-core was already
merged, but let's do it anyway.

My point of view is that there was no rcu when I wrote that code, yet
there was no reference count and yet all locking looks still exactly
the same as I wrote it. There's even still the page_table_lock to
serialize threads taking the mmap_sem in read mode against the first
vma->anon_vma = anon_vma during the page fault.

Frankly I've absolutely no idea why rcu is needed in all rmap code
when walking the page->mapping. Definitely the PG_locked is taken so
there's no way page->mapping could possibly go away under the rmap
code, hence the anon_vma can't go away as it's queued in the vma, and
the vma has to go away before the page is zapped out of the pte.

So there are some possible scenarios:

1) my original anon_vma code was buggy not taking the rcu_read_lock()
and somebody fixed it (I tend to exclude it)

2) somebody has seen a race that doesn't exist and didn't bother to
document it other than with this obscure comment

 * Getting a lock on a stable anon_vma from a page off the LRU is
 * tricky: page_lock_anon_vma rely on RCU to guard against the races.

I tend to exclude it too as VM folks are too smart for this to be the case.

3) somebody did some microoptimization using rcu and we surely can
undo that microoptimization to get the code back to my original code
that didn't need rcu despite it worked exactly the same, and that is
going to be cheaper to use with semaphores than doubling the number of
locked ops for every lock instruction.

Now the double atomic op may not be horrible when not contented, as it
works on the same cacheline but with cacheline bouncing with
contention it sounds doubly horrible than a single cacheline bounce
and I don't see the point of it as you can't use rcu anyways, so you
can't possibly take advantage of whatever microoptimization done over
the original locking.



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-27 Thread Andrea Arcangeli
On Sat, Apr 26, 2008 at 08:17:34AM -0500, Robin Holt wrote:
 the first four sets.  The fifth is the oversubscription test which trips
 my xpmem bug.  This is as good as the v12 runs from before.

Now that mmu-notifier-core #v14 seems finished and hopefully will
appear in 2.6.26 ;), I started exercising more the kvm-mmu-notifier
code with the full patchset applied and not only with
mmu-notifier-core. I soon found the full patchset has a swap deadlock
bug. Then I tried without using kvm (so with mmu notifier disarmed)
and I could still reproduce the crashes. After grabbing a few stack
traces I tracked it down to a bug in the i_mmap_lock-i_mmap_sem
conversion. If you oversubscription means swapping, you should retest
with this applied on #v14 i_mmap_sem patch as it would eventually
deadlock with all tasks allocating memory in D state without this. Now
the full patchset is as rock solid as with only mmu-notifier-core
applied. It's swapping 2G memhog on top of a 3G VM with 2G of ram for
the last hours without a problem. Everything is working great with KVM
at least.

Talking about post 2.6.26: the refcount with rcu in the anon-vma
conversion seems unnecessary and may explain part of the AIM slowdown
too. The rest looks ok and probably we should switch the code to a
compile-time decision between rwlock and rwsem (so obsoleting the
current spinlock).

diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1008,7 +1008,7 @@ static int try_to_unmap_file(struct page
list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
vma->vm_private_data = NULL;
 out:
-   up_write(&mapping->i_mmap_sem);
+   up_read(&mapping->i_mmap_sem);
return ret;
 }
 



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-26 Thread Andrea Arcangeli
On Sat, Apr 26, 2008 at 08:17:34AM -0500, Robin Holt wrote:
 Since this include and the one for mm_types.h both are build breakages
 for ia64, I think you need to apply your ia64_cpumask and the following
 (possibly as a single patch) first or in your patch 1.  Without that,
 ia64 doing a git-bisect could hit a build failure.

Agreed, so it doesn't risk to break ia64 compilation, thanks for the
great XPMEM feedback!

Also note, I figured out that mmu_notifier_release can actually run
concurrently against other mmu notifiers in case there's a vmtruncate
(->release could already run concurrently if invoked by _unregister,
the only guarantee is that ->release will be called one time and only
one time and that no mmu notifier will ever run after _unregister
returns).

In short I can't keep the list_del_init in _release and I need a
list_del_init_rcu instead to fix this minor issue. So this won't
really make much difference after all.

I'll release #v14 with all this after a bit of kvm testing with it...

diff --git a/include/linux/list.h b/include/linux/list.h
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -755,6 +755,14 @@ static inline void hlist_del_init(struct
}
 }
 
+static inline void hlist_del_init_rcu(struct hlist_node *n)
+{
+   if (!hlist_unhashed(n)) {
+   __hlist_del(n);
+   n->pprev = NULL;
+   }
+}
+
 /**
  * hlist_replace_rcu - replace old entry by new one
  * @old : the element to be replaced
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -22,7 +22,10 @@ struct mmu_notifier_ops {
/*
 * Called either by mmu_notifier_unregister or when the mm is
 * being destroyed by exit_mmap, always before all pages are
-* freed. It's mandatory to implement this method.
+* freed. It's mandatory to implement this method. This can
+* run concurrently to other mmu notifier methods and it
+* should teardown all secondary mmu mappings and freeze the
+* secondary mmu.
 */
void (*release)(struct mmu_notifier *mn,
struct mm_struct *mm);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -19,12 +19,13 @@
 
 /*
  * This function can't run concurrently against mmu_notifier_register
- * or any other mmu notifier method. mmu_notifier_register can only
- * run with mm-mm_users  0 (and exit_mmap runs only when mm_users is
- * zero). All other tasks of this mm already quit so they can't invoke
- * mmu notifiers anymore. This can run concurrently only against
- * mmu_notifier_unregister and it serializes against it with the
- * mmu_notifier_mm-lock in addition to RCU. struct mmu_notifier_mm
+ * because mm-mm_users  0 during mmu_notifier_register and exit_mmap
+ * runs with mm_users == 0. Other tasks may still invoke mmu notifiers
+ * in parallel despite there's no task using this mm anymore, through
+ * the vmas outside of the exit_mmap context, like with
+ * vmtruncate. This serializes against mmu_notifier_unregister with
+ * the mmu_notifier_mm-lock in addition to SRCU and it serializes
+ * against the other mmu notifiers with SRCU. struct mmu_notifier_mm
  * can't go away from under us as exit_mmap holds a mm_count pin
  * itself.
  */
@@ -44,7 +45,7 @@ void __mmu_notifier_release(struct mm_st
 * to wait ->release to finish and
 * mmu_notifier_unregister to return.
 */
-   hlist_del_init(&mn->hlist);
+   hlist_del_init_rcu(&mn->hlist);
/*
 * SRCU here will block mmu_notifier_unregister until
 * ->release returns.
@@ -185,6 +186,8 @@ int mmu_notifier_register(struct mmu_not
 * side note: mmu_notifier_release can't run concurrently with
 * us because we hold the mm_users pin (either implicitly as
 * current->mm or explicitly with get_task_mm() or similar).
+* We can't race against any other mmu notifiers either thanks
+* to mm_lock().
 */
spin_lock(&mm->mmu_notifier_mm->lock);
hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);



[kvm-devel] fix external module compile

2008-04-26 Thread Andrea Arcangeli
Hello,

after updating kvm-userland.git, kvm.git and linux-2.6-hg, and after
make distclean and rebuild with slightly reduced .config, I can't
compile the external module anymore. Looking into it with V=1, $(src)
defines to  and including /external-module-compat.h clearly fails. I
fixed it like below, because it seems more consistent to enforce the
ordering of the special includes in the same place, notably
$(src)/include is already included as $LINUX at point 1 of the
comment, so this looks a cleanup of superflous line in Kconfig besides
fixing my compile by moving the external-module-compat in the same
place with the other includes where `pwd` works instead of $(src) that
doesn't work anymore for whatever reason.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/kernel/Kbuild b/kernel/Kbuild
index cabfc75..d9245eb 100644
--- a/kernel/Kbuild
+++ b/kernel/Kbuild
@@ -1,4 +1,3 @@
-EXTRA_CFLAGS := -I$(src)/include -include $(src)/external-module-compat.h
 obj-m := kvm.o kvm-intel.o kvm-amd.o
 kvm-objs := kvm_main.o x86.o mmu.o x86_emulate.o anon_inodes.o irq.o i8259.o \
 lapic.o ioapic.o preempt.o i8254.o external-module-compat.o
diff --git a/kernel/Makefile b/kernel/Makefile
index 78ff923..e3fccbe 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -27,7 +27,8 @@ all::
 #  include header priority 1) $LINUX 2) $KERNELDIR 3) include-compat
$(MAKE) -C $(KERNELDIR) M=`pwd` \
LINUXINCLUDE=-I`pwd`/include -Iinclude -I`pwd`/include-compat \
-   -include include/linux/autoconf.h \
+   -include include/linux/autoconf.h \
+   -include `pwd`/external-module-compat.h
$$@
 
 sync: header-sync source-sync



Re: [kvm-devel] mmu notifier #v14

2008-04-26 Thread Andrea Arcangeli
On Sat, Apr 26, 2008 at 01:59:23PM -0500, Anthony Liguori wrote:
 +static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
 +{
 +struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> 
 PAGE_SHIFT);
 +get_page(page);
   

 You should not assume a struct page exists for any given spte. Instead, use 
 kvm_get_pfn() and kvm_release_pfn_clean().

Last email from [EMAIL PROTECTED] in my inbox argues it's useless to build rmap
on mmio regions, so the above is more efficient: put_page runs
directly on the page without going back and forth between spte -> pfn
-> page -> pfn -> page in a single function.

Certainly if we start building rmap on mmio regions we'll have to
change that.
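To make the trade-off concrete, here is a sketch of the two options being discussed; it is illustrative only, not code from the patch, and it borrows the mask/shift from the quoted hunk and the pfn helpers Anthony named:

/* (a) as in the quoted hunk: assume a struct page backs the spte */
static void spte_get_ref_direct(u64 *spte)
{
	struct page *page;

	page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
	get_page(page);
}

/* (b) the suggested alternative: stay in pfn space, so pfns without a
 * struct page (e.g. mmio) would be handled too, at the cost of the
 * extra pfn/page conversions discussed above */
static void spte_get_ref_pfn(u64 *spte)
{
	pfn_t pfn = (*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;

	kvm_get_pfn(pfn);
}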

 Perhaps I just have a weak stomach but I am uneasy having a function that 
 takes a lock on exit. I walked through the logic and it doesn't appear to 
 be wrong but it also is pretty clear that you could defer the acquisition 
 of the lock to the caller (in this case, kvm_mmu_pte_write) by moving the 
 update_pte assignment into kvm_mmu_pte_write.

I agree out_lock is an uncommon exit path, the problem is that the
code was buggy, and I tried to fix it with the smallest possible
change, and that resulted in an out_lock. That section likely needs a
refactoring; all those update_pte fields should at least be returned
by the guess_ function, but I tried to reduce the changes to make
the issue more readable. I didn't want to rewrite certain functions
just to take a spinlock a few instructions ahead.

 Worst case, you pass 4 more pointer arguments here and, take the spin lock, 
 and then depending on the result of mmu_guess_page_from_pte_write, update 
 vcpu->arch.update_pte.

Yes that was my same idea, but that's left for a later patch. Fixing
this bug mixed with the mmu notifier patch was perhaps excessive
already ;).

 Why move the destruction of the vm to the MMU notifier unregister hook? 
 Does anything else ever call mmu_notifier_unregister that would implicitly 
 destroy the VM?

The mmu notifier ->release can run at any time before the filehandle is
closed. ->release has to zap all sptes and freeze the mmu (hence all
vcpus) to prevent any further page fault. After ->release returns all
pages are freed (we'll no longer rely on the page pin, so the
rmap_remove put_page is not a relevant unpin event). So the idea is
that I wanted to maintain the same ordering of the current code in the
vm destroy event; I didn't want to leave a partially shut down VM on
the vmlist. If the ordering is entirely irrelevant and
kvm_arch_destroy_vm can run well before kvm_destroy_vm is called, then
I can avoid changes to kvm_main.c, but I doubt it.

I've done it in a way that archs not needing mmu notifiers, like s390,
can simply add kvm_destroy_common_vm at the top of their
kvm_arch_destroy_vm. All others using mmu notifiers have to invoke
kvm_destroy_common_vm in the ->release of the mmu notifiers.

This will ensure that everything will be ok regardless of whether
exit_mmap is called before or after exit_files, and it won't make a
whole lot of difference anymore whether the driver fd is pinned through
the vma's vm_file released in exit_mmap or through the task file
descriptors released in exit_files etc... In fact this allows calling
mmu_notifier_unregister at any time later, after the task has already
been killed, without any trouble (like if the mmu notifier owner isn't
registering in current->mm but in some other task's mm).



Re: [kvm-devel] mmu notifier #v14

2008-04-26 Thread Andrea Arcangeli
 and
reused by the kernel. Keep in mind that there's no page pin on the
memory pointed to by the sptes.

The ioctl of the qemu userland could run in any other task with an mm
different from the one of the guest, and ->release allows this to work
fine without memory corruption and without requiring page pinning.

As far as I can tell your example explains why we need this fix ;).

Here is an updated patch that passes my swap test (the only missing
thing is the out_lock cleanup).

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 8d45fab..ce3251c 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
tristate "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM
select PREEMPT_NOTIFIERS
+   select MMU_NOTIFIER
select ANON_INODES
---help---
  Support hosting fully virtualized guest machines using hardware
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2ad6f54..330eaed 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -663,6 +663,101 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
account_shadowed(kvm, gfn);
 }
 
+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+   u64 *spte, *curr_spte;
+   int need_tlb_flush = 0;
+
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+   BUG_ON(!(*spte & PT_PRESENT_MASK));
+   rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
+   curr_spte = spte;
+   spte = rmap_next(kvm, rmapp, spte);
+   rmap_remove(kvm, curr_spte);
+   set_shadow_pte(curr_spte, shadow_trap_nonpresent_pte);
+   need_tlb_flush = 1;
+   }
+   return need_tlb_flush;
+}
+
+int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
+{
+   int i;
+   int need_tlb_flush = 0;
+
+   /*
+* If mmap_sem isn't taken, we can look the memslots with only
+* the mmu_lock by skipping over the slots with userspace_addr == 0.
+*/
+   for (i = 0; i < kvm->nmemslots; i++) {
+   struct kvm_memory_slot *memslot = &kvm->memslots[i];
+   unsigned long start = memslot->userspace_addr;
+   unsigned long end;
+
+   /* mmu_lock protects userspace_addr */
+   if (!start)
+   continue;
+
+   end = start + (memslot->npages << PAGE_SHIFT);
+   if (hva >= start && hva < end) {
+   gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+   need_tlb_flush |= kvm_unmap_rmapp(kvm,
+   &memslot->rmap[gfn_offset]);
+   }
+   }
+
+   return need_tlb_flush;
+}
+
+static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+   u64 *spte;
+   int young = 0;
+
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+   int _young;
+   u64 _spte = *spte;
+   BUG_ON(!(_spte & PT_PRESENT_MASK));
+   _young = _spte & PT_ACCESSED_MASK;
+   if (_young) {
+   young = !!_young;
+   set_shadow_pte(spte, _spte & ~PT_ACCESSED_MASK);
+   }
+   spte = rmap_next(kvm, rmapp, spte);
+   }
+   return young;
+}
+
+int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+{
+   int i;
+   int young = 0;
+
+   /*
+* If mmap_sem isn't taken, we can look the memslots with only
+* the mmu_lock by skipping over the slots with userspace_addr == 0.
+*/
+   for (i = 0; i < kvm->nmemslots; i++) {
+   struct kvm_memory_slot *memslot = &kvm->memslots[i];
+   unsigned long start = memslot->userspace_addr;
+   unsigned long end;
+
+   /* mmu_lock protects userspace_addr */
+   if (!start)
+   continue;
+
+   end = start + (memslot->npages << PAGE_SHIFT);
+   if (hva >= start && hva < end) {
+   gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+   young |= kvm_age_rmapp(kvm, &memslot->rmap[gfn_offset]);
+   }
+   }
+
+   return young;
+}
+
 #ifdef MMU_DEBUG
 static int is_empty_shadow_page(u64 *spt)
 {
@@ -1200,6 +1295,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, 
int write, gfn_t gfn)
int r;
int largepage = 0;
pfn_t pfn;
+   int mmu_seq;
 
down_read(&current->mm->mmap_sem);
if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1))) {
@@ -1207,6 +1303,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, 
int write, gfn_t gfn)
largepage = 1;
}
 
+   mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq);
+   /* implicit mb(), we'll read before PT lock is unlocked */
pfn = gfn_to_pfn(vcpu->kvm
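The sequence count sampled above is meant to be rechecked under mmu_lock before the spte is installed, so a fault racing with an invalidate_range_start()/end() pair backs off and replays. A minimal sketch of that recheck follows; the helper name and the count field are illustrative assumptions, only mmu_notifier_seq comes from the hunk above:

/*
 * Sketch only: return nonzero if the pfn the fault path just looked up
 * may be stale. Must run with kvm->mmu_lock held.
 */
static int mmu_notifier_retry(struct kvm_vcpu *vcpu, int mmu_seq)
{
	/* an invalidate_range_start()/end() section is still in flight */
	if (atomic_read(&vcpu->kvm->arch.mmu_notifier_count))
		return 1;
	/* an invalidation completed after mmu_seq was sampled */
	if (atomic_read(&vcpu->kvm->arch.mmu_notifier_seq) != mmu_seq)
		return 1;
	return 0;
}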

Re: [kvm-devel] [PATCH 1 of 9] Lock the entire mm to prevent any mmu related operation to happen

2008-04-25 Thread Andrea Arcangeli
I somehow missed this email in my inbox, found it now because it
was strangely still unread... Sorry for the late reply!

On Tue, Apr 22, 2008 at 03:06:24PM +1000, Rusty Russell wrote:
 On Wednesday 09 April 2008 01:44:04 Andrea Arcangeli wrote:
  --- a/include/linux/mm.h
  +++ b/include/linux/mm.h
  @@ -1050,6 +1050,15 @@
 unsigned long addr, unsigned long len,
 unsigned long flags, struct page **pages);
 
  +struct mm_lock_data {
  +   spinlock_t **i_mmap_locks;
  +   spinlock_t **anon_vma_locks;
  +   unsigned long nr_i_mmap_locks;
  +   unsigned long nr_anon_vma_locks;
  +};
  +extern struct mm_lock_data *mm_lock(struct mm_struct * mm);
  +extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
 
 As far as I can tell you don't actually need to expose this struct at all?

Yes, it should be possible to only expose 'struct mm_lock_data;'.

  +   data->i_mmap_locks = vmalloc(nr_i_mmap_locks *
  +sizeof(spinlock_t));
 
 This is why non-typesafe allocators suck.  You want 'sizeof(spinlock_t *)' 
 here.
 
  +   data->anon_vma_locks = vmalloc(nr_anon_vma_locks *
  +  sizeof(spinlock_t));
 
 and here.

Great catch! (it was temporarily wasting some ram which isn't nice at all)

  +   err = -EINTR;
  +   i_mmap_lock_last = NULL;
  +   nr_i_mmap_locks = 0;
  +   for (;;) {
  +   spinlock_t *i_mmap_lock = (spinlock_t *) -1UL;
  +   for (vma = mm->mmap; vma; vma = vma->vm_next) {
 ...
  +   data->i_mmap_locks[nr_i_mmap_locks++] = i_mmap_lock;
  +   }
  +   data->nr_i_mmap_locks = nr_i_mmap_locks;
 
 How about you track your running counter in data->nr_i_mmap_locks, leave 
 nr_i_mmap_locks alone, and BUG_ON(data->nr_i_mmap_locks != nr_i_mmap_locks)?
 
 Even nicer would be to wrap this in a get_sorted_mmap_locks() function.

I'll try to clean this up further and I'll make a further update for review.

 Unfortunately, I just don't think we can fail locking like this.  In your 
 next 
 patch unregistering a notifier can fail because of it: that's not usable.

Fortunately I figured out we don't really need mm_lock in unregister,
because it's ok to unregister in the middle of the range_begin/end
critical section (that's definitely not ok for register, which is why
register needs mm_lock). And it's perfectly ok to fail in register().

Also it wasn't ok to unpin the module count in ->release, as ->release
needs to 'ret' to get back to the mmu notifier code. And without any
unregister at all, the module can't be unloaded at all, which
is quite unacceptable...

The logic is to prevent mmu_notifier_register from racing with
mmu_notifier_release because it takes the mm_users pin (implicit or
explicit, and then mmput just after mmu_notifier_register
returns). Then _register serializes against all the mmu notifier
methods (except ->release) with srcu (->release can't run thanks to
the mm_users pin). The mmu_notifier_mm->lock then serializes the
modification of the list (register vs unregister) and it ensures one
and only one of _unregister and _release calls ->release before
_unregister returns. All other methods run freely with srcu. Having
the guarantee that ->release is called just before all pages are freed
or inside _unregister allows the module to zap and freeze its
secondary mmu inside ->release, with the race condition of exit()
against mmu_notifier_unregister handled internally by the mmu notifier
code and without any dependency on exit_files/exit_mm ordering, no
matter whether the fd of the driver is open in the filetables or in
the vma only. The mmu_notifier_mm can be reset to 0 only after the
last mmdrop.

About the mm_count refcounting for _release and _unregister: no mmu
notifier, and not even mmu_notifier_unregister and _release, can cope
with the mmu_notifier_mm list and srcu structures going away out of
order. exit_mmap is safe as it holds an mm_count implicitly because
mmdrop is run after exit_mmap returns. mmu_notifier_unregister is safe
too as _register takes the mm_count pin. We can't prevent
mmu_notifier_mm from going away with mm_users as that would screw up
the vma file descriptor closure that only happens inside exit_mmap
(pinning mm_users prevents exit_mmap from running, so it can only be
taken temporarily until _register returns).
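A minimal usage sketch of the registration rule above, for a driver attaching to another task's mm; the driver structure and ops names are illustrative assumptions:

/*
 * Sketch: take a temporary mm_users pin so ->release can't run while
 * we register, then drop it; register itself takes the mm_count pin
 * that a later unregister (or the last mmdrop) releases.
 */
static int my_driver_attach(struct my_driver_ctx *ctx, struct task_struct *task)
{
	struct mm_struct *mm = get_task_mm(task);
	int err;

	if (!mm)
		return -ESRCH;
	ctx->mn.ops = &my_driver_mmu_notifier_ops;	/* ->release is mandatory */
	err = mmu_notifier_register(&ctx->mn, mm);
	mmput(mm);	/* the mm_users pin was only needed across register */
	return err;
}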



Re: [kvm-devel] [PATCH 1 of 9] Lock the entire mm to prevent any mmu related operation to happen

2008-04-25 Thread Andrea Arcangeli
On Fri, Apr 25, 2008 at 06:56:39PM +0200, Andrea Arcangeli wrote:
   + data->i_mmap_locks = vmalloc(nr_i_mmap_locks *
   +  sizeof(spinlock_t));
  
  This is why non-typesafe allocators suck.  You want 'sizeof(spinlock_t *)' 
  here.
  
   + data->anon_vma_locks = vmalloc(nr_anon_vma_locks *
   +sizeof(spinlock_t));
  
  and here.
 
 Great catch! (it was temporarily wasting some ram which isn't nice at all)

As I went into the editor I just found the above already fixed in
#v14-pre3. And I can't move the structure into the file anymore
without kmallocing it. Exposing that structure avoids the
ERR_PTR/PTR_ERR on the retvals and one kmalloc so I think it makes the
code simpler in the end to keep it as it is now. I'd rather avoid
further changes to the 1/N patch, as long as they don't make any
difference at runtime and as long as they involve more than
cut-and-pasting a structure from .h to .c file.



Re: [kvm-devel] [PATCH 1 of 9] Lock the entire mm to prevent any mmu related operation to happen

2008-04-25 Thread Andrea Arcangeli
On Fri, Apr 25, 2008 at 02:25:32PM -0500, Robin Holt wrote:
 I think you still need mm_lock (unless I miss something).  What happens
 when one callout is scanning mmu_notifier_invalidate_range_start() and
 you unlink?  That leaves the list next pointer as LIST_POISON1, which is
 a really bad address for the processor to track.

Ok, the list_del_init in _release can't race with that because it happens
in exit_mmap when no other mmu notifier can trigger anymore.

_unregister can run concurrently but it does list_del_rcu, which only
overwrites the pprev pointer with LIST_POISON2. The
mmu_notifier_invalidate_range_start won't crash on LIST_POISON1 thanks
to srcu.

Actually I did more changes than necessary; for example I noticed that
mmu_notifier_register can use hlist_add_head instead of
hlist_add_head_rcu. _register can't race against _release thanks to the
mm_users temporary or implicit pin. _register can't race against
_unregister thanks to the mmu_notifier_mm->lock. And register can't
race against all other mmu notifiers thanks to the mm_lock.

At this time I've no other pending patches on top of v14-pre3 other
than the below micro-optimizing cleanup. It'd be great to have
confirmation that v14-pre3 passes the GRU/XPMEM regression tests, as my
KVM testing already passed successfully on it. I'll forward
v14-pre3 mmu-notifier-core plus the below to Andrew tomorrow, I'm
trying to be optimistic here! ;)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -187,7 +187,7 @@ int mmu_notifier_register(struct mmu_not
 * current->mm or explicitly with get_task_mm() or similar).
 */
spin_lock(&mm->mmu_notifier_mm->lock);
-   hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);
+   hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);
spin_unlock(&mm->mmu_notifier_mm->lock);
 out_unlock:
mm_unlock(mm, &data);



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-24 Thread Andrea Arcangeli
On Thu, Apr 24, 2008 at 12:19:28AM +0200, Andrea Arcangeli wrote:
 /dev/kvm closure. Given this can be a considered a bugfix to
 mmu_notifier_unregister I'll apply it to 1/N and I'll release a new

I'm not sure anymore this can be considered a bugfix, given how large
a change this resulted in to the locking and register/unregister/release
behavior.

Here a full draft patch for review and testing. Works great with KVM
so far at least...

- mmu_notifier_register has to run on current->mm or on
  get_task_mm() (in the latter case it can mmput after
  mmu_notifier_register returns)

- mmu_notifier_register in turn can't race against
  mmu_notifier_release as that runs in exit_mmap after the last mmput

- mmu_notifier_unregister can run at any time, even after exit_mmap
  completed. No mm_count pin is required, it's taken automatically by
  register and released by unregister

- mmu_notifier_unregister serializes against all mmu notifiers with
  srcu, and it serializes especially against a concurrent
  mmu_notifier_unregister with a mix of a spinlock and SRCU

- the spinlock lets us keep track of who ran first between
  mmu_notifier_unregister and mmu_notifier_release; this makes life
  much easier for the driver to handle as the driver is then
  guaranteed that ->release will run.

- the first one that runs executes the ->release method as well, after
  dropping the spinlock but before releasing the srcu lock

- it was unsafe to unpin the module count from ->release, as release
  itself has to run the 'ret' instruction to return back to the mmu
  notifier code

- the ->release method is mandatory as it has to run before the pages
  are freed, to zap all existing sptes

- the one that arrives second between mmu_notifier_unregister and
  mmu_notifier_release waits for the first with srcu

As said this is a much larger change than I hoped, but as usual it can
only affect KVM/GRU/XPMEM if something is wrong with this. I don't
exclude that we'll have to back off to the previous mm_users model. The
main issue with taking a mm_users pin is that filehandles associated
with vmas aren't closed by exit() if mm_users is pinned (that simply
leaks ram with kvm). It looks more correct to rely on mm_users being
non-zero only in mmu_notifier_register. The other big change is that
->release is mandatory and always called by the first of
mmu_notifier_unregister and mmu_notifier_release to run. Both
mmu_notifier_unregister and mmu_notifier_release are slow paths so
taking a spinlock there is no big deal.

Impact when the mmu notifiers are disarmed is unchanged.

The interesting part of the kvm patch to test this change is
below. After this last bit the KVM patch status is almost final, if
this new mmu notifier update is remotely ok; I've another one that
does the locking change to remove the page pin.

+static void kvm_free_vcpus(struct kvm *kvm);
+/* This must zap all the sptes because all pages will be freed then */
+static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
+struct mm_struct *mm)
+{
+   struct kvm *kvm = mmu_notifier_to_kvm(mn);
+   BUG_ON(mm != kvm->mm);
+   kvm_free_pit(kvm);
+   kfree(kvm->arch.vpic);
+   kfree(kvm->arch.vioapic);
+   kvm_free_vcpus(kvm);
+   kvm_free_physmem(kvm);
+   if (kvm->arch.apic_access_page)
+   put_page(kvm->arch.apic_access_page);
+}
+
+static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
+   .release= kvm_mmu_notifier_release,
+   .invalidate_page= kvm_mmu_notifier_invalidate_page,
+   .invalidate_range_end   = kvm_mmu_notifier_invalidate_range_end,
+   .clear_flush_young  = kvm_mmu_notifier_clear_flush_young,
+};
+
 struct  kvm *kvm_arch_create_vm(void)
 {
struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+   int err;
 
if (!kvm)
return ERR_PTR(-ENOMEM);
 
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 
+   kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+   err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+   if (err) {
+   kfree(kvm);
+   return ERR_PTR(err);
+   }
+
return kvm;
 }
 
@@ -3899,13 +3967,12 @@ static void kvm_free_vcpus(struct kvm *kvm)
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
-   kvm_free_pit(kvm);
-   kfree(kvm->arch.vpic);
-   kfree(kvm->arch.vioapic);
-   kvm_free_vcpus(kvm);
-   kvm_free_physmem(kvm);
-   if (kvm->arch.apic_access_page)
-   put_page(kvm->arch.apic_access_page);
+   /*
+* kvm_mmu_notifier_release() will be called before
+* mmu_notifier_unregister returns, if it didn't run
+* already.
+*/
+   mmu_notifier_unregister(&kvm->arch.mmu_notifier, kvm->mm);
kfree(kvm);
 }


Let's call this mmu notifier #v14-test1.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED

Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-24 Thread Andrea Arcangeli
On Thu, Apr 24, 2008 at 04:51:12AM -0500, Robin Holt wrote:
 It seems to me the work done by mmu_notifier_mm_destroy should really
 be done inside the mm_lock()/mm_unlock area of mmu_unregister and

There's no mm_lock/unlock for mmu_unregister anymore. That's the whole
point of using srcu so it becomes reliable and quick.

 mm_notifier_release when we have removed the last entry.  That would
 give the users job the same performance after they are done using the
 special device that they had prior to its use.

That's not feasible. Otherwise mmu_notifier_mm will go away at any
time under both _release from exit_mmap and under _unregister
too. exit_mmap holds an mm_count implicit, so freeing mmu_notifier_mm
after the last mmdrop makes it safe. mmu_notifier_unregister also
holds the mm_count because mm_count was pinned by
mmu_notifier_register. That solves the issue with mmu_notifier_mm
going away from under mmu_notifier_unregister and _release and that's
why it can only be freed after mm_count == 0.

There's at least one small issue I noticed so far: while _release
doesn't need to care about _register, _unregister definitely needs to
care about _register. I have to take the mmap_sem in addition to or in
replacement of the unregister_lock. The srcu_read_lock can also likely
be moved to just before releasing the unregister_lock, but that's just
a minor optimization to make the code more strict.

 On Thu, Apr 24, 2008 at 08:49:40AM +0200, Andrea Arcangeli wrote:
 ...
  diff --git a/mm/memory.c b/mm/memory.c
  --- a/mm/memory.c
  +++ b/mm/memory.c
 ...
  @@ -603,25 +605,39 @@
   * readonly mappings. The tradeoff is that copy_page_range is more
   * efficient than faulting.
   */
  +   ret = 0;
  if (!(vma->vm_flags &
  (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
  if (!vma->anon_vma)
  -   return 0;
  +   goto out;
  }
   
  -   if (is_vm_hugetlb_page(vma))
  -   return copy_hugetlb_page_range(dst_mm, src_mm, vma);
  +   if (unlikely(is_vm_hugetlb_page(vma))) {
  +   ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
  +   goto out;
  +   }
   
  +   if (is_cow_mapping(vma->vm_flags))
  +   mmu_notifier_invalidate_range_start(src_mm, addr, end);
  +
  +   ret = 0;
 
 I don't think this is needed.

It's not needed, right, but I thought it was cleaner if they all used
ret after I had to change the code at the end of the
function. Anyway I'll delete this to make the patch shorter and only
change the minimum, agreed.

 ...
  +/* avoid memory allocations for mm_unlock to prevent deadlock */
  +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
  +{
  +   if (mm->map_count) {
  +   if (data->nr_anon_vma_locks)
  +   mm_unlock_vfree(data->anon_vma_locks,
  +   data->nr_anon_vma_locks);
  +   if (data->i_mmap_locks)
 
 I think you really want data->nr_i_mmap_locks.

Indeed. It never happens that there are zero vmas with filebacked
mappings, this is why this couldn't be triggered in practice, thanks!

 The second paragraph of this comment seems extraneous.

ok removed.

  +   /*
  +* Wait for ->release if mmu_notifier_unregister ran list_del_rcu.
  +* srcu can't go away from under us because one mm_count is
  +* held by exit_mmap.
  +*/
 
 These two sentences don't make any sense to me.

Well that was a short explanation of why the mmu_notifier_mm structure
can only be freed after the last mmdrop, which is what you asked at
the top. I'll try to rephrase.

  +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
  +{
  +   int before_release = 0, srcu;
  +
  +   BUG_ON(atomic_read(&mm->mm_count) <= 0);
  +
  +   srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
  +   spin_lock(&mm->mmu_notifier_mm->unregister_lock);
  +   if (!hlist_unhashed(&mn->hlist)) {
  +   hlist_del_rcu(&mn->hlist);
  +   before_release = 1;
  +   }
  +   spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
  +   if (before_release)
  +   /*
  +* exit_mmap will block in mmu_notifier_release to
  +* guarantee ->release is called before freeing the
  +* pages.
  +*/
  +   mn->ops->release(mn, mm);
 
 I am not certain about the need to do the release callout when the driver
 has already told this subsystem it is done.  For XPMEM, this callout
 would immediately return.  I would expect it to be the same for GRU.

The point is that you don't want to run it twice. And without this you
would have to serialize against ->release yourself in the driver. It's
much more convenient if you know that ->release will be called just
once, and before mmu_notifier_unregister returns. It could be called
by _release even after you're already inside _unregister; _release may
reach the spinlock before _unregister, and you won't notice the
difference. Solving this race in the driver looked too complex, I'd
rather solve it once

Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-24 Thread Andrea Arcangeli
On Thu, Apr 24, 2008 at 05:39:43PM +0200, Andrea Arcangeli wrote:
 There's at least one small issue I noticed so far, that while _release
 don't need to care about _register, but _unregister definitely need to
 care about _register. I've to take the mmap_sem in addition or in

In the end the best is to use the spinlock around those
list_add/list_del: they all run in O(1) with the hlist and they take a
few asm insns. This also avoids taking the mmap_sem in exit_mmap; at
exit_mmap time nobody should need to use mmap_sem anymore. It might
work, but this looks cleaner. The lock is dynamically allocated only
when the notifiers are registered, so the few bytes taken by it aren't
relevant.

A full new update will soon become visible here:


http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14-pre3/

Please have a close look again. Your help is extremely appreciated and
very helpful as usual! Thanks a lot.

diff -urN xxx/include/linux/mmu_notifier.h xx/include/linux/mmu_notifier.h
--- xxx/include/linux/mmu_notifier.h2008-04-24 19:41:15.0 +0200
+++ xx/include/linux/mmu_notifier.h 2008-04-24 19:38:37.0 +0200
@@ -15,7 +15,7 @@
struct hlist_head list;
struct srcu_struct srcu;
/* to serialize mmu_notifier_unregister against mmu_notifier_release */
-   spinlock_t unregister_lock;
+   spinlock_t lock;
 };
 
 struct mmu_notifier_ops {
diff -urN xxx/mm/memory.c xx/mm/memory.c
--- xxx/mm/memory.c 2008-04-24 19:41:15.0 +0200
+++ xx/mm/memory.c  2008-04-24 19:38:37.0 +0200
@@ -605,16 +605,13 @@
 * readonly mappings. The tradeoff is that copy_page_range is more
 * efficient than faulting.
 */
-   ret = 0;
if (!(vma->vm_flags &
(VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
if (!vma->anon_vma)
-   goto out;
+   return 0;
}
 
-   if (unlikely(is_vm_hugetlb_page(vma))) {
-   ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
-   goto out;
-   }
+   if (is_vm_hugetlb_page(vma))
+   return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
if (is_cow_mapping(vma->vm_flags))
mmu_notifier_invalidate_range_start(src_mm, addr, end);
@@ -636,7 +633,6 @@
if (is_cow_mapping(vma->vm_flags))
mmu_notifier_invalidate_range_end(src_mm,
  vma->vm_start, end);
-out:
return ret;
 }
 
diff -urN xxx/mm/mmap.c xx/mm/mmap.c
--- xxx/mm/mmap.c   2008-04-24 19:41:15.0 +0200
+++ xx/mm/mmap.c2008-04-24 19:38:37.0 +0200
@@ -2381,7 +2381,7 @@
if (data->nr_anon_vma_locks)
mm_unlock_vfree(data->anon_vma_locks,
data->nr_anon_vma_locks);
-   if (data->i_mmap_locks)
+   if (data->nr_i_mmap_locks)
mm_unlock_vfree(data->i_mmap_locks,
data->nr_i_mmap_locks);
}
diff -urN xxx/mm/mmu_notifier.c xx/mm/mmu_notifier.c
--- xxx/mm/mmu_notifier.c   2008-04-24 19:41:15.0 +0200
+++ xx/mm/mmu_notifier.c2008-04-24 19:31:23.0 +0200
@@ -24,22 +24,16 @@
  * zero). All other tasks of this mm already quit so they can't invoke
  * mmu notifiers anymore. This can run concurrently only against
  * mmu_notifier_unregister and it serializes against it with the
- * unregister_lock in addition to RCU. struct mmu_notifier_mm can't go
- * away from under us as the exit_mmap holds a mm_count pin itself.
- *
- * The ->release method can't allow the module to be unloaded, the
- * module can only be unloaded after mmu_notifier_unregister run. This
- * is because the release method has to run the ret instruction to
- * return back here, and so it can't allow the ret instruction to be
- * freed.
+ * mmu_notifier_mm->lock in addition to RCU. struct mmu_notifier_mm
+ * can't go away from under us as exit_mmap holds a mm_count pin
+ * itself.
  */
 void __mmu_notifier_release(struct mm_struct *mm)
 {
struct mmu_notifier *mn;
int srcu;
 
-   srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
-   spin_lock(&mm->mmu_notifier_mm->unregister_lock);
+   spin_lock(&mm->mmu_notifier_mm->lock);
while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
mn = hlist_entry(mm->mmu_notifier_mm->list.first,
 struct mmu_notifier,
@@ -52,23 +46,28 @@
 */
hlist_del_init(&mn->hlist);
/*
+* SRCU here will block mmu_notifier_unregister until
+* ->release returns.
+*/
+   srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+   spin_unlock(&mm->mmu_notifier_mm->lock);
+   /*
 * if ->release runs before mmu_notifier_unregister

Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-23 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 01:30:53PM -0700, Christoph Lameter wrote:
 One solution would be to separate the invalidate_page() callout into a
 patch at the very end that can be omitted. AFACIT There is no compelling 
 reason to have this callback and it complicates the API for the device 
 driver writers. Not having this callback makes the way that mmu notifiers 
 are called from the VM uniform which is a desirable goal.

I agree that the invalidate_page optimization can be moved to a
separate patch. That will be a patch that will definitely alter the
API in a non-backwards-compatible way (unlike 2-12 in my #v13, which
are all backwards compatible in terms of mmu notifier API).

invalidate_page is beneficial to both mmu notifier users, and a bit
beneficial to the do_wp_page users too. So there's no point in removing
it from my mmu-notifier-core: as long as the mmu-notifier-core is 1/N
in my patchset and N/N in your patchset, the differences caused by
that ordering difference are a bigger change than invalidate_page
existing or not.

As I expected, invalidate_page provided significant benefits (not just
to GRU but to KVM too) without altering the locking scheme at all;
this is because the page fault handler has to notice anyway whether a
begin/end pair ran after follow_page/get_user_pages. So it's a
no-brainer to keep it, and my approach will avoid a non-backwards-compatible
breakage of the API IMHO. Not a big deal, nobody can care if the API
will change, it will definitely change eventually, it's a kernel
internal one, but given I already have invalidate_page in my patch
there's no reason to remove it as long as mmu-notifier-core remains
N/N in your patchset.



Re: [kvm-devel] [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last

2008-04-23 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 04:14:26PM -0700, Christoph Lameter wrote:
 We want a full solution and this kind of patching makes the patches 
 difficuilt to review because later patches revert earlier ones.

I know you would rather see KVM development stalled for more months
than get a partial solution now that already covers KVM and GRU
with the same API that XPMEM will also use later. It's very unfair on
your side to try to stall other people's development if what you
need has stronger requirements and can't be merged immediately. This
is especially true given it was publicly stated that XPMEM never
passed all regression tests anyway, so you can't possibly be in such
a hurry like we are; we can't progress without this. In fact we can,
but it would be a huge effort, it would run _slower_, and it would
all need to be deleted once mmu notifiers are in.

Note that the only patch that you can avoid with your approach is
mm_lock-rwsem; given that's software developed and not human developed
I don't see a big deal of wasted effort. The main difference is the
ordering. Most of the code is orthogonal so there's not much to
revert.



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote:
  The only other change I did has been to move mmu_notifier_unregister
  at the end of the patchset after getting more questions about its
  reliability and I documented a bit the rmmod requirements for
 ->release. We'll think later if it makes sense to add it, nobody's
  using it anyway.
 
 XPMEM is using it.  GRU will be as well (probably already does).

XPMEM requires more patches anyway. Note that in a previous email you
told me you weren't using it. I think GRU can work fine on 2.6.26
without mmu_notifier_unregister, like KVM too. You simply have to unpin
the module count in ->release. The most important bit is that you have
to do that anyway in case mmu_notifier_unregister fails (and it can
fail because of vmalloc space shortage, because somebody loaded some
framebuffer driver or whatever).



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
 unregister but it's not reliable, only ->release is reliable.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -119,6 +119,8 @@
 
 extern int mmu_notifier_register(struct mmu_notifier *mn,
 struct mm_struct *mm);
+extern int mmu_notifier_unregister(struct mmu_notifier *mn,
+  struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
  unsigned long address);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -106,3 +106,29 @@
return ret;
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+/*
+ * mm_users can't go down to zero while mmu_notifier_unregister()
+ * runs or it can race with ->release. So a mm_users pin must
+ * be taken by the caller (if mm can be different from current->mm).
+ *
+ * This function can fail (for example during out of memory conditions
+ * or after vmalloc virtual range shortage), so the only reliable way
+ * to unregister is to wait for ->release() to be called.
+ */
+int mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+   struct mm_lock_data data;
+   int ret;
+
+   BUG_ON(!atomic_read(&mm->mm_users));
+
+   ret = mm_lock(mm, &data);
+   if (unlikely(ret))
+   goto out;
+   hlist_del(&mn->hlist);
+   mm_unlock(mm, &data);
+out:
+   return ret;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);




Re: [kvm-devel] [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 10:45:36AM -0500, Robin Holt wrote:
 XPMEM has passed all regression tests using your version 12 notifiers.

That's great news, thanks! I'd greatly appreciate it if you could test
#v13 too as I posted it. It already passed the GRU and KVM regression
tests and it should work fine for XPMEM too. You can ignore the purely
cosmetic error I managed to introduce in mm_lock_cmp (I implemented
a BUG_ON that would have triggered if that wasn't a purely cosmetic
issue, and it clearly doesn't trigger so you can be sure it's
only cosmetic ;).

Once I get confirmation that everyone is ok with #v13 I'll push a #v14
before Saturday with that cosmetic error cleaned up and
mmu_notifier_unregister moved to the end (XPMEM will have unregister,
don't worry). I expect the 1/13 of #v14 to go into -mm and then 2.6.26.

 I have a bug in xpmem which shows up on our 8x oversubscription tests,
 but that is clearly my bug to figure out.  Unfortunately it only shows

This is what I meant.

By contrast, we don't have any known bug left in this area; in fact we
need mmu notifiers to _fix_ issues I identified that can't be fixed
efficiently without mmu notifiers, and we need the mmu notifier to go
productive ASAP.

 up on a 128 processor machine so I have 1024 stack traces to sort
 through each time it fails.  Does take a bit of time and a lot of
 concentration.

Sure, hope you find it soon!

 SGI is under an equally strict timeline.  We really needed the sleeping
 version into 2.6.26.  We may still be able to get this accepted by
 vendor distros if we make 2.6.27.

I don't think vendor distros are less likely to take patches 2-12
if 1/N (aka mmu-notifier-core) is merged in 2.6.26, especially in
light of kABI.



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 04:20:35PM -0700, Christoph Lameter wrote:
 I guess I have to prepare another patchset then?

If you want to embarrass yourself three times in a row go ahead ;). I
thought two failed takeovers were enough.



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 07:28:49PM -0500, Jack Steiner wrote:
 The GRU driver unregisters the notifier when all GRU mappings
 are unmapped. I could make it work either way - either with or without
 an unregister function. However, unregister is the most logical
 action to take when all mappings have been destroyed.

This is true for KVM as well, unregister would be the most logical
action to take when the kvm device is closed and the vm
destroyed. However we can't implement mm_lock in O(N*log(N)) without
triggering RAM allocations. And the size of those ram allocations is
unknown at the time unregister runs (they also depend on the
max_nr_vmas sysctl). So on second thought not even passing the array
from register to unregister would solve it (unless we allocate
max_nr_vmas and block the sysctl from altering max_nr_vmas until all
unregisters have run). That's clearly unacceptable.

The only way to avoid failing because of vmalloc space shortage or
oom would be to provide an O(N*N) fallback. But one that can't be
interrupted by sigkill! sigkill interruption was ok in #v12 because we
didn't rely on mmu_notifier_unregister to succeed. So it avoided any
DoS but it still couldn't provide any reliable unregister.

So in the end unregistering with kill -9 leading to ->release in O(1)
sounds like the safer solution for the long term. You can't loop if
unregister fails and pretend your module has no deadlocks.

Yes, waiting for ->release adds a bit of complexity but I think it's
worth it, and there haven't been genius ideas yet on how to avoid
O(N*N) complexity and allocations in mmu_notifier_unregister. Until
that genius idea materializes we'll stick with ->release in O(1) as
the only safe unregister, so we guarantee the admin will be in control
of his hardware in O(1) with kill -9 no matter if /dev/kvm and
/dev/gru are owned by sillyuser.

I'm afraid if you don't want to worst-case unregister with ->release
you need to have a better idea than my mm_lock, and personally I can't
see any other way than mm_lock to ensure not to miss range_begin...

All the above is in the 2.6.27 context (for 2.6.26 ->release is the
way, even if the genius idea were to materialize).



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 06:26:29PM +0200, Andrea Arcangeli wrote:
 On Tue, Apr 22, 2008 at 04:20:35PM -0700, Christoph Lameter wrote:
  I guess I have to prepare another patchset then?

Apologies for my previous not too polite comment in answer to the
above, but I thought this double-patchset situation was over now that
you converged on #v12 and obsoleted EMM, and after the last private
discussions. There's nothing personal here on my side, just a bit of
general frustration on this matter. I appreciate all the great
contributions from you, most recently your idea to use sort(), but I
can't really see any possible benefit or justification anymore for
keeping two patchsets floating around given we already converged on
the mmu-notifier-core, and given it's almost certain mmu-notifier-core
will go into -mm in time for 2.6.26. Let's put it this way: if I fail
to merge mmu-notifier-core into 2.6.26 I'll voluntarily give up my
entire patchset and leave maintainership to you, so you can move 1/N
to N/N and remove the mm_lock-sem patch (everything else can remain
the same as it's all orthogonal, so changing the order is a matter of
minutes).



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 12:09:09PM -0500, Jack Steiner wrote:
 
 You may have spotted this already. If so, just ignore this.
 
 It looks like there is a bug in copy_page_range() around line 667.
 It's possible to do a mmu_notifier_invalidate_range_start(), then
 return -ENOMEM w/o doing a corresponding mmu_notifier_invalidate_range_end().

No I didn't spot it yet, great catch!! ;) Thanks a lot. I think we can
take an example from Jack and use our energy to spot any bug in the
mmu-notifier-core, like with his auditing effort above (I'm quite
certain you didn't reproduce this with a real oom ;), so we get a rock
solid mmu-notifier implementation in 2.6.26, so XPMEM will also benefit
later in 2.6.27, and I hope the last XPMEM internal bugs will also be
fixed by that time.

(For those who aren't going to become mmu-notifier users, there's
nothing to worry about: unless you used KVM or GRU actively with mmu
notifiers, this bug would be entirely harmless with both MMU_NOTIFIER=n
and =y, as previously guaranteed.)

Here the still untested fix for review.

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -597,6 +597,7 @@
unsigned long next;
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;
+   int ret;
 
/*
 * Don't copy ptes where a page fault will fill them correctly.
@@ -604,33 +605,39 @@
 * readonly mappings. The tradeoff is that copy_page_range is more
 * efficient than faulting.
 */
+   ret = 0;
if (!(vma->vm_flags &
(VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
if (!vma->anon_vma)
-   return 0;
+   goto out;
}
 
-   if (is_vm_hugetlb_page(vma))
-   return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+   if (unlikely(is_vm_hugetlb_page(vma))) {
+   ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
+   goto out;
+   }
 
if (is_cow_mapping(vma->vm_flags))
mmu_notifier_invalidate_range_start(src_mm, addr, end);
 
+   ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(src_pgd))
continue;
-   if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-   vma, addr, next))
-   return -ENOMEM;
+   if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+   vma, addr, next))) {
+   ret = -ENOMEM;
+   break;
+   }
} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
if (is_cow_mapping(vma->vm_flags))
mmu_notifier_invalidate_range_end(src_mm,
-   vma->vm_start, end);
-
-   return 0;
+ vma->vm_start, end);
+out:
+   return ret;
 }
 
 static unsigned long zap_pte_range(struct mmu_gather *tlb,



Re: [kvm-devel] [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 11:02:18AM -0700, Christoph Lameter wrote:
 We have had this workaround effort done years ago and have been 
 suffering the ill effects of pinning for years. Had to deal with 

Yes. In addition to the pinning, there's a lot of additional tlb
flushing work to do in kvm without mmu notifiers as the swapcache
could be freed by the vm the instruction after put_page unpins the
page for whatever reason.



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 11:09:35AM -0700, Christoph Lameter wrote:
 Why is there still the hlist stuff being used for the mmu notifier list? 
 And why is this still unsafe?

What's the problem with hlist? It saves 8 bytes for each mm_struct;
you should be using it too instead of list.

 There are cases in which you do not take the reverse map locks or mmap_sem
 while traversing the notifier list?

There aren't.

 This hope for inclusion without proper review (first for .25 now for .26) 
 seems to interfere with the patch cleanup work and cause delay after delay 
 for getting the patch ready. On what basis do you think that there is a 
 chance of any of these patches making it into 2.6.26 given that this 
 patchset has never been vetted in Andrew's tree?

Let's say I try to be optimistic and hope the right thing will happen,
given this is like a new driver that can't hurt anybody but KVM and
GRU if there's any bug. But in my view what interferes with proper
review for .26 are the endless discussions we're doing ;).



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 11:19:26AM -0700, Christoph Lameter wrote:
 If unregister fails then the driver should not detach from the address
 space immediately but wait until ->release is called. That may be
 a possible solution. It will be rare that the unregister fails.

This is the current idea, exactly. Unless we find a way to replace
mm_lock with something else, I don't see a way to make
mmu_notifier_unregister reliable without wasting ram.



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 11:21:49AM -0700, Christoph Lameter wrote:
 No I really want you to do this. I have no interest in a takeover in the 

Ok, if you want me to do this, I definitely prefer the core to go in
now. It's so much easier to concentrate on two problems at different
times than to attack both problems at the same time, given they're
mostly completely orthogonal problems. Given we already solved one
problem, I'd like to close it before concentrating on the second
problem. I already told you it was in my interest to support XPMEM
too. For example it was me who noticed we couldn't possibly remove the
can_sleep parameter from invalidate_range without altering the locking,
as vmas were unstable outside of one of the three core vm locks. That
finding resulted in much bigger patches than we hoped (like Andrew
previously sort of predicted) and you did all the great work to develop
those. For my part, once the converged part is in, it'll be a lot
easier to fully concentrate on the rest. My main focus right now is to
produce a mmu-notifier-core that is entirely bug free for .26.



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 11:27:21AM -0700, Christoph Lameter wrote:
 There is a potential issue in move_ptes where you call 
 invalidate_range_end after dropping i_mmap_sem whereas my patches did the 
 opposite. Mmap_sem saves you there?

Yes, there's really no risk of races in this area after introducing
mm_lock; any place that mangles ptes and doesn't hold any of the
three locks is buggy anyway. I appreciate the audit work (I also did
it and couldn't find bugs, but the more eyes the better).



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 06:37:13PM +0200, Andrea Arcangeli wrote:
 I'm afraid if you don't want to worst-case unregister with ->release
 you need to have a better idea than my mm_lock and personally I can't
 see any other way than mm_lock to ensure not to miss range_begin...

But wait, mmu_notifier_register absolutely requires mm_lock to ensure
that when kvm->arch.mmu_notifier_invalidate_range_count is zero
(large variable name, it'll get shorter, but this is to explain),
really no cpu is in the middle of a range_begin/end critical
section. That's why we have to take all the mm locks.

But we couldn't care less if we unregister in the middle; unregister
only needs to be sure that no cpu could possibly still be using the ram
of the notifier allocated by the driver before returning. So I'll
implement unregister in O(1) and without ram allocations using srcu,
and that'll fix all issues with unregister. It'll return void to
make it crystal clear it can't fail. It turns out unregister will make
life easier for kvm as well, mostly to simplify the teardown of the
/dev/kvm closure. Given this can be considered a bugfix to
mmu_notifier_unregister I'll apply it to 1/N and I'll release a new
mmu-notifier-core patch for you to review before I resend to Andrew
before Saturday. Thanks!



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
 invalidate_range_start {
   spin_lock(&kvm->mmu_lock);
 
   kvm->invalidate_range_count++;
   rmap-invalidate of sptes in range
 

write_seqlock; write_sequnlock;

   spin_unlock(&kvm->mmu_lock)
 }
 
 invalidate_range_end {
   spin_lock(&kvm->mmu_lock);
 
   kvm->invalidate_range_count--;


write_seqlock; write_sequnlock;

 
   spin_unlock(&kvm->mmu_lock)
 }

Robin correctly pointed out by PM there should be a seqlock in
range_begin/end too like corrected above.

I guess it's better to use an explicit sequence counter so we avoid a
useless spinlock in the write_seqlock (mmu_lock is already enough in
all places) and so we can increase it with a single += 2 op in
range_begin/end. The above is a lower-perf version of the final
locking but simpler for reading purposes.



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 08:01:20AM -0500, Robin Holt wrote:
 On Tue, Apr 22, 2008 at 02:00:56PM +0200, Andrea Arcangeli wrote:
  On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
   invalidate_range_start {
spin_lock(&kvm->mmu_lock);
  
kvm->invalidate_range_count++;
 rmap-invalidate of sptes in range
   
  
  write_seqlock; write_sequnlock;
 
 I don't think you need it here since invalidate_range_count is already
 elevated which will accomplish the same effect.

Agreed, the seqlock only in range_end should be enough. BTW, the fact
that the seqlock is needed regardless of whether invalidate_page exists
really makes invalidate_page a no-brainer, not just from the core VM
point of view but from the driver point of view too. The
kvm_page_fault logic would be the same even if I removed
invalidate_page from the mmu notifier patch, but it'd run slower both
when armed and disarmed.
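Putting the two refinements together (a plain counter instead of the seqlock, bumped only where it's needed), the shape being converged on looks roughly like the sketch below; this is pseudo-C for reading, not the final patch:

/* Sketch only: everything already serialized by kvm->mmu_lock.
 * range_start elevates the in-flight count; range_end bumps the
 * sequence that the kvm page fault path rechecks before installing
 * an spte. */
invalidate_range_start(kvm, start, end)
{
	spin_lock(&kvm->mmu_lock);
	kvm->invalidate_range_count++;
	/* rmap-invalidate of sptes in [start, end) */
	spin_unlock(&kvm->mmu_lock);
}

invalidate_range_end(kvm, start, end)
{
	spin_lock(&kvm->mmu_lock);
	kvm->invalidate_range_count--;
	kvm->mmu_notifier_seq++;
	spin_unlock(&kvm->mmu_lock);
}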



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 08:36:04AM -0500, Robin Holt wrote:
 I am a little confused about the value of the seq_lock versus a simple
 atomic, but I assumed there is a reason and left it at that.

There's no value for anything but get_user_pages (get_user_pages takes
its own lock internally though). I preferred to explain it as a
seqlock because it was simpler to read, but I totally agree that in the
final implementation it shouldn't be a seqlock. My code was meant to
be pseudo-code only. It doesn't even need to be atomic ;).

 I don't know what you mean by it'd run slower and what you mean by
 armed and disarmed.

1) when armed the time-window where the kvm-page-fault would be
blocked would be a bit larger without invalidate_page for no good
reason

2) if you were to remove invalidate_page, when disarmed the VM would
need two branches instead of one in various places

I don't want to waste cycles if not wasting them improves performance
both when armed and disarmed.

 For the sake of this discussion, I will assume "it'd" means the kernel in
 general and not KVM.  With the two call sites for range_begin/range_end,

I actually meant for both.

 By disarmed, I will assume you mean no notifiers registered for a
 particular mm.  In that case, the cache will make the second call
 effectively free.  So, for the disarmed case, I see no measurable
 difference.

For rmap it's surely effectively free; for do_wp_page it costs one branch
for no good reason.

 For the case where there is a notifier registered, I certainly can see
 a difference.  I am not certain how to quantify the difference as it

Agreed.

 When I was discussing this difference with Jack, he reminded me that
 the GRU, due to its hardware, does not have any race issues with the
 invalidate_page callout simply doing the tlb shootdown and not modifying
 any of its internal structures.  He then put a caveat on the discussion
 that _either_ method was acceptable as far as he was concerned.  The real
 issue is getting a patch in that satisfies all needs and not whether
 there is a separate invalidate_page callout.

Sure, we have that patch now; I'll send it out in a minute. I was just
trying to explain why it makes sense to have an invalidate_page too
(which remains the only difference by now); removing it would be a
regression on all sides, even if a minor one.



[kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208870142 -7200
# Node ID ea87c15371b1bd49380c40c3f15f1c7ca4438af5
# Parent  fb3bc9942fb78629d096bd07564f435d51d86e5f
Core of mmu notifiers.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1050,6 +1050,27 @@
   unsigned long addr, unsigned long len,
   unsigned long flags, struct page **pages);
 
+/*
+ * mm_lock will take mmap_sem writably (to prevent all modifications
+ * and scanning of vmas) and then also take the mapping locks for
+ * each of the vmas to lock out any scans of pagetables of this address
+ * space. This can be used to effectively hold off reclaim from the
+ * address space.
+ *
+ * mm_lock can fail if there is not enough memory to store a pointer
+ * array to all vmas.
+ *
+ * mm_lock and mm_unlock are expensive operations that may take a long time.
+ */
+struct mm_lock_data {
+   spinlock_t **i_mmap_locks;
+   spinlock_t **anon_vma_locks;
+   size_t nr_i_mmap_locks;
+   size_t nr_anon_vma_locks;
+};
+extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
+extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
+
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
 extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -225,6 +225,9 @@
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct mem_cgroup *mem_cgroup;
 #endif
+#ifdef CONFIG_MMU_NOTIFIER
+   struct hlist_head mmu_notifier_list;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,229 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier;
+struct mmu_notifier_ops;
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier_ops {
+   /*
+* Called after all other threads have terminated and the executing
+* thread is the only remaining execution thread. There are no
+* users of the mm_struct remaining.
+*/
+   void (*release)(struct mmu_notifier *mn,
+   struct mm_struct *mm);
+
+   /*
+* clear_flush_young is called after the VM is
+* test-and-clearing the young/accessed bitflag in the
+* pte. This way the VM will provide proper aging to the
+* accesses to the page through the secondary MMUs and not
+* only to the ones through the Linux pte.
+*/
+   int (*clear_flush_young)(struct mmu_notifier *mn,
+struct mm_struct *mm,
+unsigned long address);
+
+   /*
+* Before this is invoked any secondary MMU is still ok to
+* read/write to the page previously pointed by the Linux pte
+* because the old page hasn't been freed yet.  If required
+* set_page_dirty has to be called internally to this method.
+*/
+   void (*invalidate_page)(struct mmu_notifier *mn,
+   struct mm_struct *mm,
+   unsigned long address);
+
+   /*
+* invalidate_range_start() and invalidate_range_end() must be
+* paired and are called only when the mmap_sem is held and/or
+* the semaphores protecting the reverse maps. Both functions
+* may sleep. The subsystem must guarantee that no additional
+* references to the pages in the range are established between
+* the call to invalidate_range_start() and the matching call
+* to invalidate_range_end().
+*
+* Invalidation of multiple concurrent ranges may be permitted
+* by the driver, or the driver may exclude other invalidation
+* from proceeding by blocking on new invalidate_range_start()
+* callbacks that overlap invalidates that are already in
+* progress. Either way the establishment of sptes to the
+* range can only be allowed if all invalidate_range_end()
+* functions have been called.
+*
+* invalidate_range_start() is called when all pages in the
+* range are still mapped and have at least a refcount of one.
+*
+* invalidate_range_end() is called when all pages in the
+* range have been unmapped and the pages have been freed by
+* the VM.
+*
+* The VM will remove the page table entries and potentially

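Since the mm.h hunk above declares the mm_lock()/mm_unlock() interface, here
is a hypothetical registration sketch built on it; the helper name and the
hlist field in struct mmu_notifier are assumptions for illustration, and the
real mmu_notifier_register() in the patch hides these details.

/* Illustrative sketch only; register_notifier and mn->hlist are assumed names. */
static int register_notifier(struct mm_struct *mm, struct mmu_notifier *mn)
{
        struct mm_lock_data data;
        int ret;

        /* Stops all rmap scans and mmap_sem writers; may sleep and can fail. */
        ret = mm_lock(mm, &data);
        if (ret)
                return ret;     /* cannot allocate the vmalloc'd lock arrays */

        /*
         * No invalidate_range_start/end can be running here, so the new
         * notifier cannot see an end without the matching start.
         */
        hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_list);

        mm_unlock(mm, &data);
        return 0;
}
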
[kvm-devel] [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872186 -7200
# Node ID 3c804dca25b15017b22008647783d6f5f3801fa9
# Parent  ea87c15371b1bd49380c40c3f15f1c7ca4438af5
Fix ia64 compilation failure because of common code include bug.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
 #include <linux/rbtree.h>
 #include <linux/rwsem.h>
 #include <linux/completion.h>
+#include <linux/cpumask.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 



[kvm-devel] [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872186 -7200
# Node ID ac9bb1fb3de2aa5d27210a28edf24f6577094076
# Parent  a6672bdeead0d41b2ebd6846f731d43a611645b7
Moves all mmu notifier methods outside the PT lock (first and not last
step to make them sleep capable).

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -169,27 +169,6 @@
 INIT_HLIST_HEAD(&mm->mmu_notifier_list);
 }
 
-#define ptep_clear_flush_notify(__vma, __address, __ptep)  \
-({ \
-   pte_t __pte;\
-   struct vm_area_struct *___vma = __vma;  \
-   unsigned long ___address = __address;   \
-   __pte = ptep_clear_flush(___vma, ___address, __ptep);   \
-   mmu_notifier_invalidate_page(___vma->vm_mm, ___address);\
-   __pte;  \
-})
-
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep)
\
-({ \
-   int __young;\
-   struct vm_area_struct *___vma = __vma;  \
-   unsigned long ___address = __address;   \
-   __young = ptep_clear_flush_young(___vma, ___address, __ptep);   \
-   __young |= mmu_notifier_clear_flush_young(___vma->vm_mm,\
- ___address);  \
-   __young;\
-})
-
 #else /* CONFIG_MMU_NOTIFIER */
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -221,9 +200,6 @@
 {
 }
 
-#define ptep_clear_flush_young_notify ptep_clear_flush_young
-#define ptep_clear_flush_notify ptep_clear_flush
-
 #endif /* CONFIG_MMU_NOTIFIER */
 
 #endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -194,11 +194,13 @@
if (pte) {
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
-   pteval = ptep_clear_flush_notify(vma, address, pte);
+   pteval = ptep_clear_flush(vma, address, pte);
page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
+   /* must invalidate_page _before_ freeing the page */
+   mmu_notifier_invalidate_page(mm, address);
page_cache_release(page);
}
}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1627,9 +1627,10 @@
 */
page_table = pte_offset_map_lock(mm, pmd, address,
 &ptl);
-   page_cache_release(old_page);
+   new_page = NULL;
if (!pte_same(*page_table, orig_pte))
goto unlock;
+   page_cache_release(old_page);
 
page_mkwrite = 1;
}
@@ -1645,6 +1646,7 @@
if (ptep_set_access_flags(vma, address, page_table, entry,1))
update_mmu_cache(vma, address, entry);
ret |= VM_FAULT_WRITE;
+   old_page = new_page = NULL;
goto unlock;
}
 
@@ -1689,7 +1691,7 @@
 * seen in the presence of one thread doing SMC and another
 * thread doing COW.
 */
-   ptep_clear_flush_notify(vma, address, page_table);
+   ptep_clear_flush(vma, address, page_table);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
lru_cache_add_active(new_page);
@@ -1701,12 +1703,18 @@
} else
mem_cgroup_uncharge_page(new_page);
 
-   if (new_page)
+unlock:
+   pte_unmap_unlock(page_table, ptl);
+
+   if (new_page) {
+   if (new_page == old_page)
+   /* cow happened, notify before releasing old_page */
+   mmu_notifier_invalidate_page(mm, address);
page_cache_release(new_page);
+   }
if (old_page)
page_cache_release(old_page);
-unlock:
-   pte_unmap_unlock(page_table, ptl);
+
if (dirty_page) {
 if (vma->vm_file)
 file_update_time(vma->vm_file);
diff --git

[kvm-devel] [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872187 -7200
# Node ID f8210c45f1c6f8b38d15e5dfebbc5f7c1f890c93
# Parent  bdb3d928a0ba91cdce2b61bd40a2f80bddbe4ff2
Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
conversion.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1062,10 +1062,10 @@
  * mm_lock and mm_unlock are expensive operations that may take a long time.
  */
 struct mm_lock_data {
-   spinlock_t **i_mmap_locks;
-   spinlock_t **anon_vma_locks;
-   size_t nr_i_mmap_locks;
-   size_t nr_anon_vma_locks;
+   struct rw_semaphore **i_mmap_sems;
+   struct rw_semaphore **anon_vma_sems;
+   size_t nr_i_mmap_sems;
+   size_t nr_anon_vma_sems;
 };
 extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
 extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2243,8 +2243,8 @@
 static int mm_lock_cmp(const void *a, const void *b)
 {
cond_resched();
-   if ((unsigned long)*(spinlock_t **)a <
-   (unsigned long)*(spinlock_t **)b)
+   if ((unsigned long)*(struct rw_semaphore **)a <
+   (unsigned long)*(struct rw_semaphore **)b)
return -1;
else if (a == b)
return 0;
@@ -2252,7 +2252,7 @@
return 1;
 }
 
-static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore 
**sems,
  int anon)
 {
struct vm_area_struct *vma;
@@ -2261,59 +2261,59 @@
 for (vma = mm->mmap; vma; vma = vma->vm_next) {
 if (anon) {
 if (vma->anon_vma)
-   locks[i++] = &vma->anon_vma->lock;
+   sems[i++] = &vma->anon_vma->sem;
 } else {
 if (vma->vm_file && vma->vm_file->f_mapping)
-   locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock;
+   sems[i++] = &vma->vm_file->f_mapping->i_mmap_sem;
}
}
 
if (!i)
goto out;
 
-   sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL);
+   sort(sems, i, sizeof(struct rw_semaphore *), mm_lock_cmp, NULL);
 
 out:
return i;
 }
 
 static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm,
- spinlock_t **locks)
+ struct rw_semaphore **sems)
 {
-   return mm_lock_sort(mm, locks, 1);
+   return mm_lock_sort(mm, sems, 1);
 }
 
 static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm,
-   spinlock_t **locks)
+   struct rw_semaphore **sems)
 {
-   return mm_lock_sort(mm, locks, 0);
+   return mm_lock_sort(mm, sems, 0);
 }
 
-static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock)
+static void mm_lock_unlock(struct rw_semaphore **sems, size_t nr, int lock)
 {
-   spinlock_t *last = NULL;
+   struct rw_semaphore *last = NULL;
size_t i;
 
for (i = 0; i  nr; i++)
/*  Multiple vmas may use the same lock. */
-   if (locks[i] != last) {
-   BUG_ON((unsigned long) last > (unsigned long) locks[i]);
-   last = locks[i];
+   if (sems[i] != last) {
+   BUG_ON((unsigned long) last > (unsigned long) sems[i]);
+   last = sems[i];
if (lock)
-   spin_lock(last);
+   down_write(last);
else
-   spin_unlock(last);
+   up_write(last);
}
 }
 
-static inline void __mm_lock(spinlock_t **locks, size_t nr)
+static inline void __mm_lock(struct rw_semaphore **sems, size_t nr)
 {
-   mm_lock_unlock(locks, nr, 1);
+   mm_lock_unlock(sems, nr, 1);
 }
 
-static inline void __mm_unlock(spinlock_t **locks, size_t nr)
+static inline void __mm_unlock(struct rw_semaphore **sems, size_t nr)
 {
-   mm_lock_unlock(locks, nr, 0);
+   mm_lock_unlock(sems, nr, 0);
 }
 
 /*
@@ -2325,57 +2325,57 @@
  */
 int mm_lock(struct mm_struct *mm, struct mm_lock_data *data)
 {
-   spinlock_t **anon_vma_locks, **i_mmap_locks;
+   struct rw_semaphore **anon_vma_sems, **i_mmap_sems;
 
 down_write(&mm->mmap_sem);
 if (mm->map_count) {
-   anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
-   if (unlikely(!anon_vma_locks)) {
+   anon_vma_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count

[kvm-devel] [PATCH 05 of 12] Move the tlb flushing into free_pgtables. The conversion of the locks

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872186 -7200
# Node ID ee8c0644d5f67c1ef59142cce91b0bb6f34a53e0
# Parent  ac9bb1fb3de2aa5d27210a28edf24f6577094076
Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables() and we cannot sleep while gathering pages for a tlb
flush.

Move the tlb_gather/tlb_finish call to free_pgtables() to be done
for each vma. This may add a number of tlb flushes depending on the
number of vmas that cannot be coalesced into one.

The first pointer argument to free_pgtables() can then be dropped.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -751,8 +751,8 @@
void *private);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
-   unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+   unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -272,9 +272,11 @@
} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
-   unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+   unsigned long ceiling)
 {
+   struct mmu_gather *tlb;
+
while (vma) {
 struct vm_area_struct *next = vma->vm_next;
 unsigned long addr = vma->vm_start;
@@ -286,7 +288,8 @@
unlink_file_vma(vma);
 
if (is_vm_hugetlb_page(vma)) {
-   hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+   tlb = tlb_gather_mmu(vma->vm_mm, 0);
+   hugetlb_free_pgd_range(&tlb, addr, vma->vm_end,
 floor, next? next->vm_start: ceiling);
} else {
/*
@@ -299,9 +302,11 @@
anon_vma_unlink(vma);
unlink_file_vma(vma);
}
-   free_pgd_range(tlb, addr, vma->vm_end,
+   tlb = tlb_gather_mmu(vma->vm_mm, 0);
+   free_pgd_range(&tlb, addr, vma->vm_end,
 floor, next? next->vm_start: ceiling);
}
+   tlb_finish_mmu(tlb, addr, vma->vm_end);
vma = next;
}
 }
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1752,9 +1752,9 @@
update_hiwater_rss(mm);
 unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 vm_unacct_memory(nr_accounted);
-   free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+   tlb_finish_mmu(tlb, start, end);
+   free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 next? next->vm_start: 0);
-   tlb_finish_mmu(tlb, start, end);
 }
 
 /*
@@ -2050,8 +2050,8 @@
/* Use -1 here to ensure all VMAs in the mm are unmapped */
 end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
-   free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+   free_pgtables(vma, FIRST_USER_ADDRESS, 0);
 
/*
 * Walk the list again, actually closing and freeing it,



[kvm-devel] [PATCH 12 of 12] This patch adds a lock ordering rule to avoid a potential deadlock when

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872187 -7200
# Node ID e847039ee2e815088661933b7195584847dc7540
# Parent  128d705f38c8a774ac11559db445787ce6e91c77
This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -79,6 +79,9 @@
  *
  *  ->i_mutex  (generic_file_buffered_write)
  *    ->mmap_sem   (fault_in_pages_readable->do_page_fault)
+ *
+ *    When taking multiple mmap_sems, one should lock the lowest-addressed
+ *    one first proceeding on up to the highest-addressed one.
  *
  *  ->i_mutex
  *    ->i_alloc_sem (various)
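
As an aside, a hypothetical helper obeying the ordering rule added above
could look like the sketch below; it is not part of the patch, and the
lockdep nesting annotation is only one possible way to express it.

/* Illustrative sketch only, not part of the patch. */
static void lock_two_mmap_sems(struct mm_struct *mm1, struct mm_struct *mm2)
{
        if (mm1 == mm2) {
                down_write(&mm1->mmap_sem);
                return;
        }
        /* Honor the ordering rule: lowest-addressed semaphore first. */
        if ((unsigned long)mm1 > (unsigned long)mm2) {
                struct mm_struct *tmp = mm1;
                mm1 = mm2;
                mm2 = tmp;
        }
        down_write(&mm1->mmap_sem);
        down_write_nested(&mm2->mmap_sem, SINGLE_DEPTH_NESTING);
}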



[kvm-devel] [PATCH 11 of 12] XPMEM would have used sys_madvise() except that madvise_dontneed()

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872187 -7200
# Node ID 128d705f38c8a774ac11559db445787ce6e91c77
# Parent  f8210c45f1c6f8b38d15e5dfebbc5f7c1f890c93
XPMEM would have used sys_madvise() except that madvise_dontneed()
returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages
XPMEM imports from other partitions and is also true for uncached pages
allocated locally via the mspec allocator.  XPMEM needs zap_page_range()
functionality for these types of pages as well as 'normal' pages.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -909,6 +909,7 @@
 
 return unmap_vmas(vma, address, end, &nr_accounted, details);
 }
+EXPORT_SYMBOL_GPL(zap_page_range);
 
 /*
  * Do a quick page-table lookup for a single page.



[kvm-devel] [PATCH 06 of 12] Move the tlb flushing inside of unmap vmas. This saves us from passing

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872186 -7200
# Node ID fbce3fecb033eb3fba1d9c2398ac74401ce0ecb5
# Parent  ee8c0644d5f67c1ef59142cce91b0bb6f34a53e0
Move the tlb flushing inside of unmap vmas. This saves us from passing
a pointer to the TLB structure around and simplifies the callers.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -723,8 +723,7 @@
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
-   struct vm_area_struct *start_vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *);
 
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -804,7 +804,6 @@
 
 /**
  * unmap_vmas - unmap a range of memory covered by a list of vma's
- * @tlbp: address of the caller's struct mmu_gather
  * @vma: the starting vma
  * @start_addr: virtual address at which to start unmapping
  * @end_addr: virtual address at which to end unmapping
@@ -816,20 +815,13 @@
  * Unmap all pages in the vma list.
  *
  * We aim to not hold locks for too long (for scheduling latency reasons).
- * So zap pages in ZAP_BLOCK_SIZE bytecounts.  This means we need to
- * return the ending mmu_gather to the caller.
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts.
  *
  * Only addresses between `start' and `end' will be unmapped.
  *
  * The VMA list must be sorted in ascending virtual address order.
- *
- * unmap_vmas() assumes that the caller will flush the whole unmapped address
- * range after unmap_vmas() returns.  So the only responsibility here is to
- * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
- * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
-   struct vm_area_struct *vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *details)
 {
@@ -838,9 +830,14 @@
int tlb_start_valid = 0;
unsigned long start = start_addr;
 spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
-   int fullmm = (*tlbp)->fullmm;
+   int fullmm;
+   struct mmu_gather *tlb;
 struct mm_struct *mm = vma->vm_mm;
 
+   lru_add_drain();
+   tlb = tlb_gather_mmu(mm, 0);
+   update_hiwater_rss(mm);
+   fullmm = tlb->fullmm;
mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
 for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;
@@ -867,7 +864,7 @@
(HPAGE_SIZE / PAGE_SIZE);
start = end;
} else
-   start = unmap_page_range(*tlbp, vma,
+   start = unmap_page_range(tlb, vma,
 start, end, &zap_work, details);
 
 if (zap_work > 0) {
@@ -875,22 +872,23 @@
break;
}
 
-   tlb_finish_mmu(*tlbp, tlb_start, start);
+   tlb_finish_mmu(tlb, tlb_start, start);
 
 if (need_resched() ||
 (i_mmap_lock && spin_needbreak(i_mmap_lock))) {
if (i_mmap_lock) {
-   *tlbp = NULL;
+   tlb = NULL;
goto out;
}
cond_resched();
}
 
-   *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
+   tlb = tlb_gather_mmu(vma->vm_mm, fullmm);
tlb_start_valid = 0;
zap_work = ZAP_BLOCK_SIZE;
}
}
+   tlb_finish_mmu(tlb, start_addr, end_addr);
 out:
mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
return start;   /* which is now the end (or restart) address */
@@ -906,18 +904,10 @@
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details)
 {
-   struct mm_struct *mm = vma->vm_mm;
-   struct mmu_gather *tlb;
unsigned long end = address + size;
unsigned long nr_accounted = 0;
 
-   lru_add_drain();
-   tlb = tlb_gather_mmu(mm, 0

[kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Andrea Arcangeli
Hello,

This is the latest and greatest version of the mmu notifier patch #v13.

Changes are mainly in the mm_lock that uses sort() suggested by Christoph.
This reduces the complexity from O(N**2) to O(N*log(N)).

I folded the mm_lock functionality together with the mmu-notifier-core 1/12
patch to make it self-contained. I recommend merging 1/12 into -mm/mainline
ASAP. Lack of mmu notifiers is holding off KVM development. We are going to
rework the way the pages are mapped and unmapped to work with pure pfn for pci
passthrough without the use of page pinning, and we can't without mmu
notifiers. This is not just a performance matter.

KVM/GRU and AFAICT Quadrics are all covered by applying the single 1/12 patch
that shall be shipped with 2.6.26. The risk of breakage by applying 1/12 is
zero, both when MMU_NOTIFIER=y and when it's =n, so it shouldn't be delayed
further.

XPMEM support comes with the later patches 2-12; risk for those patches is > 0
and this is why the mmu-notifier-core is numbered 1/12 and not 12/12. Some are
simple and can go in immediately but not all are so simple.

2-12/12 are posted as usual for review by the VM developers and so Robin can
keep testing them on XPMEM and they can be merged later without any downside
(they're mostly orthogonal with 1/12).



[kvm-devel] [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872186 -7200
# Node ID a6672bdeead0d41b2ebd6846f731d43a611645b7
# Parent  3c804dca25b15017b22008647783d6f5f3801fa9
get_task_mm should not succeed if mmput() is running and has reduced
the mm_users count to zero. This can occur if a processor follows
a task's pointer to an mm struct because that pointer is only cleared
after the mmput().

If get_task_mm() succeeds after mmput() reduced the mm_users to zero then
we have the lovely situation that one portion of the kernel is doing
all the teardown work for an mm while another portion is happily using
it.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -442,7 +442,8 @@
 if (task->flags & PF_BORROWED_MM)
 mm = NULL;
 else
-   atomic_inc(&mm->mm_users);
+   if (!atomic_inc_not_zero(&mm->mm_users))
+   mm = NULL;
}
task_unlock(task);
return mm;



[kvm-devel] [PATCH 09 of 12] Convert the anon_vma spinlock to a rw semaphore. This allows concurrent

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872187 -7200
# Node ID bdb3d928a0ba91cdce2b61bd40a2f80bddbe4ff2
# Parent  6e04df1f4284689b1c46e57a67559abe49ecf292
Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows the calling of sleeping functions from reverse map traversal as
needed for the notifier callbacks. It includes possible concurrency.

Rcu is used in some context to guarantee the presence of the anon_vma
(try_to_unmap) while we acquire the anon_vma lock. We cannot take a
semaphore within an rcu critical section. Add a refcount to the anon_vma
structure which allows us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma. I think this is a bug
because the anon_vma may become empty and get scheduled to be freed
but then we increase the refcount again when the migration entries are
removed.

The refcount in general allows a shortening of RCU critical sections since
we can do a rcu_unlock after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.

However:
- Atomic overhead increases in situations where a new reference
  to the anon_vma has to be established or removed. Overhead also increases
  when a speculative reference is used (try_to_unmap,
  page_mkclean, page migration).
- There is the potential for more frequent processor change due to up_xxx
  letting waiting tasks run first. This results in f.e. the Aim9 brk
  performance test going down by 10-15%.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
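
To illustrate the existence-guarantee pattern described above, a hedged
sketch follows; it is not the patch's grab_anon_vma(), just the general shape
of taking a speculative reference under RCU before sleeping on the semaphore
(the helper name below is made up).

/* Illustrative sketch only; not the patch's grab_anon_vma(). */
static struct anon_vma *anon_vma_speculative_get(struct page *page)
{
        struct anon_vma *anon_vma = NULL;
        unsigned long mapping;

        rcu_read_lock();
        mapping = (unsigned long)page->mapping;
        if ((mapping & PAGE_MAPPING_ANON) && page_mapped(page)) {
                anon_vma = (struct anon_vma *)(mapping - PAGE_MAPPING_ANON);
                /*
                 * RCU keeps the anon_vma memory around; a zero refcount
                 * means it is already being freed and must not be reused.
                 */
                if (!atomic_inc_not_zero(&anon_vma->refcount))
                        anon_vma = NULL;
        }
        rcu_read_unlock();

        /* The caller can now sleep, e.g. down_read(&anon_vma->sem), and
         * must drop the reference with put_anon_vma() when done. */
        return anon_vma;
}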

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -25,7 +25,8 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-   spinlock_t lock;/* Serialize access to vma list */
+   atomic_t refcount;  /* vmas on the list */
+   struct rw_semaphore sem;/* Serialize access to vma list */
struct list_head head;  /* List of private related vmas */
 };
 
@@ -43,18 +44,31 @@
kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+   atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+   if (atomic_dec_and_test(&anon_vma->refcount))
+   anon_vma_free(anon_vma);
+}
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 struct anon_vma *anon_vma = vma->anon_vma;
 if (anon_vma)
-   spin_lock(&anon_vma->lock);
+   down_write(&anon_vma->sem);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 struct anon_vma *anon_vma = vma->anon_vma;
 if (anon_vma)
-   spin_unlock(&anon_vma->lock);
+   up_write(&anon_vma->sem);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -235,15 +235,16 @@
return;
 
/*
-* We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+* We hold either the mmap_sem lock or a reference on the
+* anon_vma. So no need to call page_lock_anon_vma.
 */
anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-   spin_lock(&anon_vma->lock);
+   down_read(&anon_vma->sem);
 
 list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
 remove_migration_pte(vma, old, new);
 
-   spin_unlock(&anon_vma->lock);
+   up_read(&anon_vma->sem);
 }
 
 /*
@@ -623,7 +624,7 @@
int rc = 0;
int *result = NULL;
struct page *newpage = get_new_page(page, private, result);
-   int rcu_locked = 0;
+   struct anon_vma *anon_vma = NULL;
int charge = 0;
 
if (!newpage)
@@ -647,16 +648,14 @@
}
/*
 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
-* we cannot notice that anon_vma is freed while we migrates a page.
+* we cannot notice that anon_vma is freed while we migrate a page.
 * This rcu_read_lock() delays freeing anon_vma pointer until the end
 * of migration. File cache pages are no problem because of page_lock()
 * File Caches may use write_page() or lock_page() in migration, then,
 * just care Anon page here.
 */
-   if (PageAnon(page)) {
-   rcu_read_lock();
-   rcu_locked = 1;
-   }
+   if (PageAnon(page))
+   anon_vma
