Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-05-07 Thread Andrea Arcangeli
On Tue, Apr 29, 2008 at 06:03:40PM +0200, Andrea Arcangeli wrote:
 Christoph, if you're interested in evolving anon-vma-sem and i_mmap_sem
 yourself in this direction, you're very welcome to go ahead while I

In case you didn't notice this already, for a further explanation of
why semaphores run slower for small critical sections, and why the
conversion from spinlock to rwsem should happen under a config option,
see the "AIM7 40% regression with 2.6.26-rc1" thread.

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-29 Thread Hugh Dickins
On Tue, 29 Apr 2008, Andrea Arcangeli wrote:
 
 My point of view is that there was no rcu when I wrote that code, yet
 there was no reference count and yet all locking looks still exactly
 the same as I wrote it. There's even still the page_table_lock to
 serialize threads taking the mmap_sem in read mode against the first
 vma->anon_vma = anon_vma during the page fault.
 
 Frankly I've absolutely no idea why rcu is needed in all rmap code
 when walking the page->mapping. Definitely the PG_locked is taken so
 there's no way page->mapping could possibly go away under the rmap
 code, hence the anon_vma can't go away as it's queued in the vma, and
 the vma has to go away before the page is zapped out of the pte.

[I'm scarcely following the mmu notifiers to-and-fro, which seems
to be in good hands, amongst faster thinkers than me: who actually
need and can test this stuff.  Don't let me slow you down; but I
can quickly clarify on this history.]

No, the locking was different as you had it, Andrea: there was an extra
bitspin lock, carried over from the pte_chains days (maybe we changed
the name, maybe we disagreed over the name, I forget), which mainly
guarded the page->mapcount.  I thought that was one lock more than we
needed, and eliminated it in favour of atomic page->mapcount in 2.6.9.

Here's the relevant extracts from ChangeLog-2.6.9:

[PATCH] rmaplock: PageAnon in mapping

First of a batch of five patches to eliminate rmap's page_map_lock, replace
its trylocking by spinlocking, and use anon_vma to speed up swapoff.

Patches updated from the originals against 2.6.7-mm7: nothing new so I won't
spam the list, but including Manfred's SLAB_DESTROY_BY_RCU fixes, and omitting
the unuse_process mmap_sem fix already in 2.6.8-rc3.

This patch:

Replace the PG_anon page-flags bit by setting the lower bit of the pointer in
page->mapping when it's anon_vma: PAGE_MAPPING_ANON bit.

We're about to eliminate the locking which kept the flags and mapping in
synch: it's much easier to work on a local copy of page->mapping, than worry
about whether flags and mapping are in synch (though I imagine it could be
done, at greater cost, with some barriers).

[PATCH] rmaplock: kill page_map_lock

The pte_chains rmap used pte_chain_lock (bit_spin_lock on PG_chainlock) to
lock its pte_chains.  We kept this (as page_map_lock: bit_spin_lock on
PG_maplock) when we moved to objrmap.  But the file objrmap locks its vma tree
with mapping->i_mmap_lock, and the anon objrmap locks its vma list with
anon_vma->lock: so isn't the page_map_lock superfluous?

Pretty much, yes.  The mapcount was protected by it, and needs to become an
atomic: starting at -1 like page->_count, so nr_mapped can be tracked precisely
up and down.  The last page_remove_rmap can't clear anon page mapping any
more, because of races with page_add_rmap; from which some BUG_ONs must go for
the same reason, but they've served their purpose.

vmscan decisions are naturally racy, little change there beyond removing
page_map_lock/unlock.  But to stabilize the file-backed page->mapping against
truncation while acquiring i_mmap_lock, page_referenced_file now needs page
lock to be held even for refill_inactive_zone.  There's a similar issue in
acquiring anon_vma->lock, where page lock doesn't help: which this patch
pretends to handle, but actually it needs the next.

Roughly 10% cut off lmbench fork numbers on my 2*HT*P4.  Must confess my
testing failed to show the races even while they were knowingly exposed: would
benefit from testing on racier equipment.

[PATCH] rmaplock: SLAB_DESTROY_BY_RCU

With page_map_lock gone, how to stabilize page->mapping's anon_vma while
acquiring anon_vma->lock in page_referenced_anon and try_to_unmap_anon?

The page cannot actually be freed (vmscan holds reference), but however much
we check page_mapped (which guarantees that anon_vma is in use - or would
guarantee that if we added suitable barriers), there's no locking against page
becoming unmapped the instant after, then anon_vma freed.

It's okay to take anon_vma->lock after it's freed, so long as it remains a
struct anon_vma (its list would become empty, or perhaps reused for an
unrelated anon_vma: but no problem since we always check that the page located
is the right one); but corruption if that memory gets reused for some other
purpose.

This is not unique: it's liable to be a problem whenever the kernel tries to
approach a structure obliquely.  It's generally solved with an atomic
reference count; but one advantage of anon_vma over anonmm is that it does not
have such a count, and it would be a backward step to add one.

Therefore...  implement SLAB_DESTROY_BY_RCU flag, to guarantee that such a
kmem_cache_alloc'ed structure cannot get freed to other use while the
rcu_read_lock is held i.e.  preempt disabled; and use that for anon_vma.

Fix concerns raised by Manfred: this flag is incompatible with poisoning and
destructor, and kmem_cache_destroy needs to synchronize_kernel.

I hope SLAB_DESTROY_BY_RCU 

Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-29 Thread Andrea Arcangeli
Hi Hugh!!

On Tue, Apr 29, 2008 at 11:49:11AM +0100, Hugh Dickins wrote:
 [I'm scarcely following the mmu notifiers to-and-fro, which seems
 to be in good hands, amongst faster thinkers than me: who actually
 need and can test this stuff.  Don't let me slow you down; but I
 can quickly clarify on this history.]

Still I think it'd be great if you could review mmu-notifier-core v14.
You and Nick are the core VM maintainers so it'd be great to hear any
feedback about it. I think it's fairly easy to classify the patch as
obviously safe as long as mmu notifiers are disarmed. Here's a link for
your convenience.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/mmu-notifier-core

 No, the locking was different as you had it, Andrea: there was an extra
 bitspin lock, carried over from the pte_chains days (maybe we changed
 the name, maybe we disagreed over the name, I forget), which mainly
 guarded the page->mapcount.  I thought that was one lock more than we
 needed, and eliminated it in favour of atomic page->mapcount in 2.6.9.

Thanks a lot for the explanation!



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-29 Thread Andrea Arcangeli
On Mon, Apr 28, 2008 at 06:28:06PM -0700, Christoph Lameter wrote:
 On Tue, 29 Apr 2008, Andrea Arcangeli wrote:
 
  Frankly I've absolutely no idea why rcu is needed in all rmap code
  when walking the page->mapping. Definitely the PG_locked is taken so
  there's no way page->mapping could possibly go away under the rmap
  code, hence the anon_vma can't go away as it's queued in the vma, and
  the vma has to go away before the page is zapped out of the pte.
 
 zap_pte_range can race with the rmap code and it does not take the page 
 lock. The page may not go away since a refcount was taken but the mapping 
 can go away. Without RCU you have no guarantee that the anon_vma is 
 existing when you take the lock. 

There's some room for improvement, like using down_read_trylock: if
that succeeds, we don't need to increase the refcount and we can keep
the rcu_read_lock held instead.

Secondly, we don't need to increase the refcount in fork() when we
queue the vma copy in the anon_vma. You should init the refcount to 1
when the anon_vma is allocated, remove the atomic_inc from all code
(except when down_read_trylock fails) and then change anon_vma_unlink
to:

up_write(&anon_vma->sem);
if (empty)
	put_anon_vma(anon_vma);

While the down_read_trylock surely won't help in AIM, the second
change will reduce the overhead a bit in the VM core fast paths by
avoiding all refcounting changes by checking the list_empty the same
way the current code does. I really like how I designed the garbage
collection through list_empty and that's efficient and I'd like to
keep it.

I however doubt this will bring us back to the same performance of the
current spinlock version, as the real overhead should come out of
overscheduling in down_write in anon_vma_link. Here an initially
spinning lock would help, but that's a gray area: it greatly depends on
timings, and on very large systems, where a cacheline wait with many
cpus forking at the same time takes longer than scheduling, a semaphore
may not slow down performance that much. So I think the only way is a
configuration option to switch the locking at compile time, then XPMEM
will depend on that option to be on, I don't see a big deal and this
guarantees embedded isn't screwed up by totally unnecessary locks on UP.



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-29 Thread Robin Holt
 I however doubt this will bring us back to the same performance of the
 current spinlock version, as the real overhead should come out of
 overscheduling in down_write in anon_vma_link. Here an initially
 spinning lock would help, but that's a gray area: it greatly depends on
 timings, and on very large systems, where a cacheline wait with many
 cpus forking at the same time takes longer than scheduling, a semaphore
 may not slow down performance that much. So I think the only way is a
 configuration option to switch the locking at compile time, then XPMEM
 will depend on that option to be on, I don't see a big deal and this
 guarantees embedded isn't screwed up by totally unnecessary locks on UP.

You have said this continually about a CONFIG option.  I am unsure how
that could be achieved.  Could you provide a patch?

Thanks,
Robin



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-29 Thread Andrea Arcangeli
On Tue, Apr 29, 2008 at 10:50:30AM -0500, Robin Holt wrote:
 You have said this continually about a CONFIG option.  I am unsure how
 that could be achieved.  Could you provide a patch?

I'm busy with the reserved ram patch against 2.6.25 and latest kvm.git
that is moving from pages to pfn for pci passthrough (that change will
also remove the page pin with mmu notifiers).

Unfortunately reserved-ram bugs out again in blk-settings.c on
real hardware. The fix I pushed in .25 for it works when booting kvm
(that's how I tested it) but on real hardware sata b_pfn happens to be
1 page less than the result of the min comparison and I'll have to
figure out what happens (only .24 code works on real hardware..., at
least my fix is surely better than the previous .25-pre code).

I've other people waiting on that reserved-ram to be working, so once
I've finished, I'll do the optimization to anon-vma (at least the
removal of the unnecessary atomic_inc from fork) and add the config
option.

Christoph, if you're interested in evolving anon-vma-sem and i_mmap_sem
yourself in this direction, you're very welcome to go ahead while I
finish sorting out reserved-ram. If you do, please let me know so we
don't duplicate effort, and it'd be absolutely great if the patches
could be incremental with #v14 so I can merge them trivially later and
upload a new patchset once you're finished (the only outstanding fix
you have to apply on top of #v14 that is already integrated in my
patchset, is the i_mmap_sem deadlock fix I posted and that I'm sure
you've already applied on top of #v14 before doing any more
development on it).

Thanks!



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-28 Thread Christoph Lameter
On Sun, 27 Apr 2008, Andrea Arcangeli wrote:

 Talking about post 2.6.26: the refcount with rcu in the anon-vma
 conversion seems unnecessary and may explain part of the AIM slowdown
 too. The rest looks ok and probably we should switch the code to a
 compile-time decision between rwlock and rwsem (so obsoleting the
 current spinlock).

You are going to take a semaphore in an rcu section? Guess you did not 
activate all debugging options while testing? I was not aware that you can 
take a sleeping lock from a non preemptible context.





Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-28 Thread Andrea Arcangeli
On Mon, Apr 28, 2008 at 01:34:11PM -0700, Christoph Lameter wrote:
 On Sun, 27 Apr 2008, Andrea Arcangeli wrote:
 
  Talking about post 2.6.26: the refcount with rcu in the anon-vma
  conversion seems unnecessary and may explain part of the AIM slowdown
  too. The rest looks ok and probably we should switch the code to a
  compile-time decision between rwlock and rwsem (so obsoleting the
  current spinlock).
 
 You are going to take a semaphore in an rcu section? Guess you did not 
 activate all debugging options while testing? I was not aware that you can 
 take a sleeping lock from a non preemptible context.

I'd hoped to discuss this topic after mmu-notifier-core was already
merged, but let's do it anyway.

My point of view is that there was no rcu when I wrote that code, yet
there was no reference count and yet all locking looks still exactly
the same as I wrote it. There's even still the page_table_lock to
serialize threads taking the mmap_sem in read mode against the first
vma->anon_vma = anon_vma during the page fault.

Frankly I've absolutely no idea why rcu is needed in all rmap code
when walking the page->mapping. Definitely the PG_locked is taken so
there's no way page->mapping could possibly go away under the rmap
code, hence the anon_vma can't go away as it's queued in the vma, and
the vma has to go away before the page is zapped out of the pte.

So there are some possible scenarios:

1) my original anon_vma code was buggy not taking the rcu_read_lock()
and somebody fixed it (I tend to exclude it)

2) somebody has seen a race that doesn't exist and didn't bother to
document it other than with this obscure comment

 * Getting a lock on a stable anon_vma from a page off the LRU is
 * tricky: page_lock_anon_vma rely on RCU to guard against the races.

I tend to exclude it too as VM folks are too smart for this to be the case.

3) somebody did some microoptimization using rcu, and we surely can
undo that microoptimization to get the code back to my original code,
which didn't need rcu despite working exactly the same, and which is
going to be cheaper to use with semaphores than doubling the number of
locked ops for every lock instruction.

Now the double atomic op may not be horrible when not contended, as it
works on the same cacheline, but with cacheline bouncing under
contention it sounds doubly horrible compared to a single cacheline
bounce, and I don't see the point of it: you can't use rcu anyway, so
you can't possibly take advantage of whatever microoptimization was
done over the original locking.



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-27 Thread Andrea Arcangeli
On Sat, Apr 26, 2008 at 08:17:34AM -0500, Robin Holt wrote:
 the first four sets.  The fifth is the oversubscription test which trips
 my xpmem bug.  This is as good as the v12 runs from before.

Now that mmu-notifier-core #v14 seems finished and hopefully will
appear in 2.6.26 ;), I started exercising more the kvm-mmu-notifier
code with the full patchset applied and not only with
mmu-notifier-core. I soon found the full patchset has a swap deadlock
bug. Then I tried without using kvm (so with mmu notifier disarmed)
and I could still reproduce the crashes. After grabbing a few stack
traces I tracked it down to a bug in the i_mmap_lock to i_mmap_sem
conversion. If your oversubscription test means swapping, you should retest
with this applied on #v14 i_mmap_sem patch as it would eventually
deadlock with all tasks allocating memory in D state without this. Now
the full patchset is as rock solid as with only mmu-notifier-core
applied. It's swapping 2G memhog on top of a 3G VM with 2G of ram for
the last hours without a problem. Everything is working great with KVM
at least.

Talking about post 2.6.26: the refcount with rcu in the anon-vma
conversion seems unnecessary and may explain part of the AIM slowdown
too. The rest looks ok and probably we should switch the code to a
compile-time decision between rwlock and rwsem (so obsoleting the
current spinlock).

diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1008,7 +1008,7 @@ static int try_to_unmap_file(struct page
list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
vma->vm_private_data = NULL;
 out:
-   up_write(&mapping->i_mmap_sem);
+   up_read(&mapping->i_mmap_sem);
return ret;
 }
 



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-26 Thread Robin Holt
On Thu, Apr 24, 2008 at 07:41:45PM +0200, Andrea Arcangeli wrote:
 A full new update will soon become visible here:
 
   
 http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14-pre3/

I grabbed these and built them.  Only change needed was another include.
After that, everything built fine and xpmem regression tests ran through
the first four sets.  The fifth is the oversubscription test which trips
my xpmem bug.  This is as good as the v12 runs from before.

Since this include and the one for mm_types.h both are build breakages
for ia64, I think you need to apply your ia64_cpumask and the following
(possibly as a single patch) first or in your patch 1.  Without that,
ia64 doing a git-bisect could hit a build failure.


Index: mmu_v14_pre3_xpmem_v003_v1/include/linux/srcu.h
===
--- mmu_v14_pre3_xpmem_v003_v1.orig/include/linux/srcu.h2008-04-26 
06:41:54.0 -0500
+++ mmu_v14_pre3_xpmem_v003_v1/include/linux/srcu.h 2008-04-26 
07:01:17.292071827 -0500
@@ -27,6 +27,8 @@
 #ifndef _LINUX_SRCU_H
 #define _LINUX_SRCU_H
 
+#include <linux/mutex.h>
+
 struct srcu_struct_array {
int c[2];
 };



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-26 Thread Andrea Arcangeli
On Sat, Apr 26, 2008 at 08:17:34AM -0500, Robin Holt wrote:
 Since this include and the one for mm_types.h both are build breakages
 for ia64, I think you need to apply your ia64_cpumask and the following
 (possibly as a single patch) first or in your patch 1.  Without that,
 ia64 doing a git-bisect could hit a build failure.

Agreed, so we don't risk breaking ia64 compilation. Thanks for the
great XPMEM feedback!

Also note, I figured out that mmu_notifier_release can actually run
concurrently against other mmu notifiers in case there's a vmtruncate
(->release could already run concurrently if invoked by _unregister,
the only guarantee is that ->release will be called one time and only
one time and that no mmu notifier will ever run after _unregister
returns).

In short I can't keep the list_del_init in _release and I need a
list_del_init_rcu instead to fix this minor issue. So this won't
really make much difference after all.

I'll release #v14 with all this after a bit of kvm testing with it...

diff --git a/include/linux/list.h b/include/linux/list.h
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -755,6 +755,14 @@ static inline void hlist_del_init(struct
}
 }
 
+static inline void hlist_del_init_rcu(struct hlist_node *n)
+{
+   if (!hlist_unhashed(n)) {
+   __hlist_del(n);
+   n->pprev = NULL;
+   }
+}
+
 /**
  * hlist_replace_rcu - replace old entry by new one
  * @old : the element to be replaced
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -22,7 +22,10 @@ struct mmu_notifier_ops {
/*
 * Called either by mmu_notifier_unregister or when the mm is
 * being destroyed by exit_mmap, always before all pages are
-* freed. It's mandatory to implement this method.
+* freed. It's mandatory to implement this method. This can
+* run concurrently to other mmu notifier methods and it
+* should teardown all secondary mmu mappings and freeze the
+* secondary mmu.
 */
void (*release)(struct mmu_notifier *mn,
struct mm_struct *mm);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -19,12 +19,13 @@
 
 /*
  * This function can't run concurrently against mmu_notifier_register
- * or any other mmu notifier method. mmu_notifier_register can only
- * run with mm->mm_users > 0 (and exit_mmap runs only when mm_users is
- * zero). All other tasks of this mm already quit so they can't invoke
- * mmu notifiers anymore. This can run concurrently only against
- * mmu_notifier_unregister and it serializes against it with the
- * mmu_notifier_mm->lock in addition to RCU. struct mmu_notifier_mm
+ * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
+ * runs with mm_users == 0. Other tasks may still invoke mmu notifiers
+ * in parallel despite there's no task using this mm anymore, through
+ * the vmas outside of the exit_mmap context, like with
+ * vmtruncate. This serializes against mmu_notifier_unregister with
+ * the mmu_notifier_mm->lock in addition to SRCU and it serializes
+ * against the other mmu notifiers with SRCU. struct mmu_notifier_mm
  * can't go away from under us as exit_mmap holds a mm_count pin
  * itself.
  */
@@ -44,7 +45,7 @@ void __mmu_notifier_release(struct mm_st
 * to wait ->release to finish and
 * mmu_notifier_unregister to return.
 */
-   hlist_del_init(&mn->hlist);
+   hlist_del_init_rcu(&mn->hlist);
 /*
 * SRCU here will block mmu_notifier_unregister until
 * ->release returns.
@@ -185,6 +186,8 @@ int mmu_notifier_register(struct mmu_not
 * side note: mmu_notifier_release can't run concurrently with
 * us because we hold the mm_users pin (either implicitly as
 * current->mm or explicitly with get_task_mm() or similar).
+* We can't race against any other mmu notifiers either thanks
+* to mm_lock().
 */
spin_lock(&mm->mmu_notifier_mm->lock);
hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-24 Thread Andrea Arcangeli
On Thu, Apr 24, 2008 at 12:19:28AM +0200, Andrea Arcangeli wrote:
 /dev/kvm closure. Given this can be a considered a bugfix to
 mmu_notifier_unregister I'll apply it to 1/N and I'll release a new

I'm not sure anymore this can be considered a bugfix, given how large
a change this resulted in to the locking and register/unregister/release
behavior.

Here a full draft patch for review and testing. Works great with KVM
so far at least...

- mmu_notifier_register has to run on current->mm or on
  get_task_mm() (in the latter case it can mmput after
  mmu_notifier_register returns)

- mmu_notifier_register in turn can't race against
  mmu_notifier_release as that runs in exit_mmap after the last mmput

- mmu_notifier_unregister can run at any time, even after exit_mmap
  completed. No mm_count pin is required, it's taken automatically by
  register and released by unregister

- mmu_notifier_unregister serializes against all mmu notifiers with
  srcu, and it serializes especially against a concurrent
  mmu_notifier_unregister with a mix of a spinlock and SRCU

- the spinlock let us keep track who run first between
  mmu_notifier_unregister and mmu_notifier_release, this makes life
  much easier for the driver to handle as the driver is then
  guaranteed that ->release will run.

- The first that runs executes the ->release method as well after dropping
  the spinlock but before releasing the srcu lock

- it was unsafe to unpin the module count from ->release, as release
  itself has to run the 'ret' instruction to return back to the mmu
  notifier code

- the ->release method is mandatory as it has to run before the pages
  are freed to zap all existing sptes

- the one that arrives second between mmu_notifier_unregister and
  mmu_notifier_register waits for the first with srcu

As said this is a much larger change than I hoped, but as usual it can
only affect KVM/GRU/XPMEM if something is wrong with this. I don't
exclude we'll have to backoff to the previous mm_users model. The main
issue with taking a mm_users pin is that filehandles associated with
vmas aren't closed by exit() if the mm_users is pinned (that simply
leaks ram with kvm). It looks more correct not to rely on the
mm_users being 0 only in mmu_notifier_register. The other big change
is that ->release is mandatory and always called by the first between
mmu_notifier_unregister or mmu_notifier_release. Both
mmu_notifier_unregister and mmu_notifier_release are slow paths so
taking a spinlock there is no big deal.

Impact when the mmu notifiers are disarmed is unchanged.

The interesting part of the kvm patch to test this change is
below. After this last bit KVM patch status is almost final if this
new mmu notifier update is remotely ok, I've another one that does the
locking change to remove the page pin.

+static void kvm_free_vcpus(struct kvm *kvm);
+/* This must zap all the sptes because all pages will be freed then */
+static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
+struct mm_struct *mm)
+{
+   struct kvm *kvm = mmu_notifier_to_kvm(mn);
+   BUG_ON(mm != kvm->mm);
+   kvm_free_pit(kvm);
+   kfree(kvm->arch.vpic);
+   kfree(kvm->arch.vioapic);
+   kvm_free_vcpus(kvm);
+   kvm_free_physmem(kvm);
+   if (kvm->arch.apic_access_page)
+   put_page(kvm->arch.apic_access_page);
+}
+
+static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
+   .release= kvm_mmu_notifier_release,
+   .invalidate_page= kvm_mmu_notifier_invalidate_page,
+   .invalidate_range_end   = kvm_mmu_notifier_invalidate_range_end,
+   .clear_flush_young  = kvm_mmu_notifier_clear_flush_young,
+};
+
 struct  kvm *kvm_arch_create_vm(void)
 {
struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+   int err;
 
if (!kvm)
return ERR_PTR(-ENOMEM);
 
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 
+   kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+   err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+   if (err) {
+   kfree(kvm);
+   return ERR_PTR(err);
+   }
+
return kvm;
 }
 
@@ -3899,13 +3967,12 @@ static void kvm_free_vcpus(struct kvm *kvm)
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
-   kvm_free_pit(kvm);
-   kfree(kvm->arch.vpic);
-   kfree(kvm->arch.vioapic);
-   kvm_free_vcpus(kvm);
-   kvm_free_physmem(kvm);
-   if (kvm->arch.apic_access_page)
-   put_page(kvm->arch.apic_access_page);
+   /*
+* kvm_mmu_notifier_release() will be called before
+* mmu_notifier_unregister returns, if it didn't run
+* already.
+*/
+   mmu_notifier_unregister(&kvm->arch.mmu_notifier, kvm->mm);
kfree(kvm);
 }


Let's call this mmu notifier #v14-test1.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-24 Thread Robin Holt
I am not certain of this, but it seems like this patch leaves things in
a somewhat asymmetric state.  At the very least, I think that asymmetry
should be documented in the comments of either mmu_notifier.h or .c.

Before I do the first mmu_notifier_register, all places that test for
mm_has_notifiers(mm) will return false and take the fast path.

After I do some mmu_notifier_register()s and their corresponding
mmu_notifier_unregister()s, the mm_has_notifiers(mm) will return true
and the slow path will be taken.  This, despite all registered notifiers
having unregistered.

It seems to me the work done by mmu_notifier_mm_destroy should really
be done inside the mm_lock()/mm_unlock area of mmu_unregister and
mm_notifier_release when we have removed the last entry.  That would
give the user's job the same performance after they are done using the
special device that they had prior to its use.


On Thu, Apr 24, 2008 at 08:49:40AM +0200, Andrea Arcangeli wrote:
...
 diff --git a/mm/memory.c b/mm/memory.c
 --- a/mm/memory.c
 +++ b/mm/memory.c
...
 @@ -603,25 +605,39 @@
* readonly mappings. The tradeoff is that copy_page_range is more
* efficient than faulting.
*/
 + ret = 0;
   if (!(vma->vm_flags &
  (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
   if (!vma->anon_vma)
  - return 0;
  + goto out;
   }
  
 - if (is_vm_hugetlb_page(vma))
 - return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 + if (unlikely(is_vm_hugetlb_page(vma))) {
 + ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
 + goto out;
 + }
  
 + if (is_cow_mapping(vma-vm_flags))
 + mmu_notifier_invalidate_range_start(src_mm, addr, end);
 +
 + ret = 0;

I don't think this is needed.

...
 +/* avoid memory allocations for mm_unlock to prevent deadlock */
 +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
 +{
 + if (mm->map_count) {
 + if (data->nr_anon_vma_locks)
 + mm_unlock_vfree(data->anon_vma_locks,
 + data->nr_anon_vma_locks);
 + if (data->i_mmap_locks)

I think you really want data->nr_i_mmap_locks.

...
 diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
 new file mode 100644
 --- /dev/null
 +++ b/mm/mmu_notifier.c
...
 +/*
 + * This function can't run concurrently against mmu_notifier_register
 + * or any other mmu notifier method. mmu_notifier_register can only
 + * run with mm->mm_users > 0 (and exit_mmap runs only when mm_users is
 + * zero). All other tasks of this mm already quit so they can't invoke
 + * mmu notifiers anymore. This can run concurrently only against
 + * mmu_notifier_unregister and it serializes against it with the
 + * unregister_lock in addition to RCU. struct mmu_notifier_mm can't go
 + * away from under us as the exit_mmap holds a mm_count pin itself.
 + *
 + * The ->release method can't allow the module to be unloaded, the
 + * module can only be unloaded after mmu_notifier_unregister run. This
 + * is because the release method has to run the ret instruction to
 + * return back here, and so it can't allow the ret instruction to be
 + * freed.
 + */

The second paragraph of this comment seems extraneous.

...
 + /*
 +  * Wait ->release if mmu_notifier_unregister run list_del_rcu.
 +  * srcu can't go away from under us because one mm_count is
 +  * hold by exit_mmap.
 +  */

These two sentences don't make any sense to me.

...
 +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
 +{
 + int before_release = 0, srcu;
 +
 + BUG_ON(atomic_read(&mm->mm_count) <= 0);
 +
 + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
 + spin_lock(&mm->mmu_notifier_mm->unregister_lock);
 + if (!hlist_unhashed(&mn->hlist)) {
 + hlist_del_rcu(&mn->hlist);
 + before_release = 1;
 + }
 + spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
 + if (before_release)
 + /*
 +  * exit_mmap will block in mmu_notifier_release to
 +  * guarantee ->release is called before freeing the
 +  * pages.
 +  */
 + mn->ops->release(mn, mm);

I am not certain about the need to do the release callout when the driver
has already told this subsystem it is done.  For XPMEM, this callout
would immediately return.  I would expect it to be the same for GRU.

Thanks,
Robin

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-24 Thread Andrea Arcangeli
On Thu, Apr 24, 2008 at 04:51:12AM -0500, Robin Holt wrote:
 It seems to me the work done by mmu_notifier_mm_destroy should really
 be done inside the mm_lock()/mm_unlock area of mmu_unregister and

There's no mm_lock/unlock for mmu_unregister anymore. That's the whole
point of using srcu so it becomes reliable and quick.

 mm_notifier_release when we have removed the last entry.  That would
 give the users job the same performance after they are done using the
 special device that they had prior to its use.

That's not feasible. Otherwise mmu_notifier_mm will go away at any
time under both _release from exit_mmap and under _unregister
too. exit_mmap holds an mm_count implicit, so freeing mmu_notifier_mm
after the last mmdrop makes it safe. mmu_notifier_unregister also
holds the mm_count because mm_count was pinned by
mmu_notifier_register. That solves the issue with mmu_notifier_mm
going away from under mmu_notifier_unregister and _release and that's
why it can only be freed after mm_count == 0.

There's at least one small issue I noticed so far: while _release
doesn't need to care about _register, _unregister definitely needs to
care about _register. I've to take the mmap_sem in addition or in
replacement of the unregister_lock. The srcu_read_lock can also likely
be moved just before releasing the unregister_lock but that's just a
minor optimization to make the code more strict.

 On Thu, Apr 24, 2008 at 08:49:40AM +0200, Andrea Arcangeli wrote:
 ...
  diff --git a/mm/memory.c b/mm/memory.c
  --- a/mm/memory.c
  +++ b/mm/memory.c
 ...
  @@ -603,25 +605,39 @@
   * readonly mappings. The tradeoff is that copy_page_range is more
   * efficient than faulting.
   */
  +   ret = 0;
   if (!(vma->vm_flags &
   (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
   if (!vma->anon_vma)
   -   return 0;
   +   goto out;
   }
    
   -   if (is_vm_hugetlb_page(vma))
   -   return copy_hugetlb_page_range(dst_mm, src_mm, vma);
   +   if (unlikely(is_vm_hugetlb_page(vma))) {
   +   ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
   +   goto out;
   +   }
    
   +   if (is_cow_mapping(vma->vm_flags))
   +   mmu_notifier_invalidate_range_start(src_mm, addr, end);
  +
  +   ret = 0;
 
 I don't think this is needed.

It's not needed right, but I thought it was cleaner if they all use
ret after I had to change the code at the end of the
function. Anyway I'll delete this to make the patch shorter and only
change the minimum, agreed.

 ...
  +/* avoid memory allocations for mm_unlock to prevent deadlock */
  +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
  +{
   +   if (mm->map_count) {
   +   if (data->nr_anon_vma_locks)
   +   mm_unlock_vfree(data->anon_vma_locks,
   +   data->nr_anon_vma_locks);
   +   if (data->i_mmap_locks)
 
 I think you really want data->nr_i_mmap_locks.

Indeed. It never happens that there are zero vmas with filebacked
mappings, this is why this couldn't be triggered in practice, thanks!

 The second paragraph of this comment seems extraneous.

ok removed.

  +   /*
   +* Wait ->release if mmu_notifier_unregister run list_del_rcu.
  +* srcu can't go away from under us because one mm_count is
  +* hold by exit_mmap.
  +*/
 
 These two sentences don't make any sense to me.

Well that was a short explanation of why the mmu_notifier_mm structure
can only be freed after the last mmdrop, which is what you asked at
the top. I'll try to rephrase.

  +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
  +{
  +   int before_release = 0, srcu;
  +
   +   BUG_ON(atomic_read(&mm->mm_count) <= 0);
   +
   +   srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
   +   spin_lock(&mm->mmu_notifier_mm->unregister_lock);
   +   if (!hlist_unhashed(&mn->hlist)) {
   +   hlist_del_rcu(&mn->hlist);
   +   before_release = 1;
   +   }
   +   spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
   +   if (before_release)
   +   /*
   +* exit_mmap will block in mmu_notifier_release to
   +* guarantee ->release is called before freeing the
   +* pages.
   +*/
   +   mn->ops->release(mn, mm);
 
 I am not certain about the need to do the release callout when the driver
 has already told this subsystem it is done.  For XPMEM, this callout
 would immediately return.  I would expect it to be the same or GRU.

The point is that you don't want to run it twice. And without this you
will have to serialize against ->release yourself in the driver. It's
much more convenient if you know that ->release will be called just
once, and before mmu_notifier_unregister returns. It could be called
by _release even after you're already inside _unregister, _release may
reach the spinlock before _unregister, you won't notice the
difference. Solving this race in the driver looked too complex, I
rather solve it once 

Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-24 Thread Andrea Arcangeli
On Thu, Apr 24, 2008 at 05:39:43PM +0200, Andrea Arcangeli wrote:
 There's at least one small issue I noticed so far, that while _release
 don't need to care about _register, but _unregister definitely need to
 care about _register. I've to take the mmap_sem in addition or in

In the end the best is to use the spinlock around those
list_add/list_del they all run in O(1) with the hlist and they take a
few asm insn. This also avoids to take the mmap_sem in exit_mmap, at
exit_mmap time nobody should need to use mmap_sem anymore, it might
work but this looks cleaner. The lock is dynamically allocated only
when the notifiers are registered, so the few bytes taken by it aren't
relevant.

A full new update will soon become visible here:


http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14-pre3/

Please have a close look again. Your help is extremely appreciated and
very helpful as usual! Thanks a lot.

diff -urN xxx/include/linux/mmu_notifier.h xx/include/linux/mmu_notifier.h
--- xxx/include/linux/mmu_notifier.h2008-04-24 19:41:15.0 +0200
+++ xx/include/linux/mmu_notifier.h 2008-04-24 19:38:37.0 +0200
@@ -15,7 +15,7 @@
struct hlist_head list;
struct srcu_struct srcu;
/* to serialize mmu_notifier_unregister against mmu_notifier_release */
-   spinlock_t unregister_lock;
+   spinlock_t lock;
 };
 
 struct mmu_notifier_ops {
diff -urN xxx/mm/memory.c xx/mm/memory.c
--- xxx/mm/memory.c 2008-04-24 19:41:15.0 +0200
+++ xx/mm/memory.c  2008-04-24 19:38:37.0 +0200
@@ -605,16 +605,13 @@
 * readonly mappings. The tradeoff is that copy_page_range is more
 * efficient than faulting.
 */
-   ret = 0;
   if (!(vma->vm_flags &
   (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
   if (!vma->anon_vma)
-   goto out;
+   return 0;
}
 
-   if (unlikely(is_vm_hugetlb_page(vma))) {
-   ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
-   goto out;
-   }
+   if (is_vm_hugetlb_page(vma))
+   return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
if (is_cow_mapping(vma->vm_flags))
mmu_notifier_invalidate_range_start(src_mm, addr, end);
@@ -636,7 +633,6 @@
if (is_cow_mapping(vma->vm_flags))
mmu_notifier_invalidate_range_end(src_mm,
  vma->vm_start, end);
-out:
return ret;
 }
 
diff -urN xxx/mm/mmap.c xx/mm/mmap.c
--- xxx/mm/mmap.c   2008-04-24 19:41:15.0 +0200
+++ xx/mm/mmap.c2008-04-24 19:38:37.0 +0200
@@ -2381,7 +2381,7 @@
if (data->nr_anon_vma_locks)
mm_unlock_vfree(data->anon_vma_locks,
data->nr_anon_vma_locks);
-   if (data->i_mmap_locks)
+   if (data->nr_i_mmap_locks)
mm_unlock_vfree(data->i_mmap_locks,
data->nr_i_mmap_locks);
}
diff -urN xxx/mm/mmu_notifier.c xx/mm/mmu_notifier.c
--- xxx/mm/mmu_notifier.c   2008-04-24 19:41:15.0 +0200
+++ xx/mm/mmu_notifier.c2008-04-24 19:31:23.0 +0200
@@ -24,22 +24,16 @@
  * zero). All other tasks of this mm already quit so they can't invoke
  * mmu notifiers anymore. This can run concurrently only against
  * mmu_notifier_unregister and it serializes against it with the
- * unregister_lock in addition to RCU. struct mmu_notifier_mm can't go
- * away from under us as the exit_mmap holds a mm_count pin itself.
- *
- * The -release method can't allow the module to be unloaded, the
- * module can only be unloaded after mmu_notifier_unregister run. This
- * is because the release method has to run the ret instruction to
- * return back here, and so it can't allow the ret instruction to be
- * freed.
+ * mmu_notifier_mm->lock in addition to RCU. struct mmu_notifier_mm
+ * can't go away from under us as exit_mmap holds a mm_count pin
+ * itself.
  */
 void __mmu_notifier_release(struct mm_struct *mm)
 {
struct mmu_notifier *mn;
int srcu;
 
-   srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
-   spin_lock(&mm->mmu_notifier_mm->unregister_lock);
+   spin_lock(&mm->mmu_notifier_mm->lock);
	while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
		mn = hlist_entry(mm->mmu_notifier_mm->list.first,
 struct mmu_notifier,
@@ -52,23 +46,28 @@
 */
hlist_del_init(&mn->hlist);
/*
+* SRCU here will block mmu_notifier_unregister until
+* ->release returns.
+*/
+   srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+   spin_unlock(&mm->mmu_notifier_mm->lock);
+   /*
 * if ->release runs before mmu_notifier_unregister it
  

Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote:
  The only other change I did has been to move mmu_notifier_unregister
  at the end of the patchset after getting more questions about its
  reliability and I documented a bit the rmmod requirements for
 ->release. We'll think later if it makes sense to add it, nobody's
  using it anyway.
 
 XPMEM is using it.  GRU will be as well (probably already does).

XPMEM requires more patches anyway. Note that in previous email you
told me you weren't using it. I think GRU can work fine on 2.6.26
without mmu_notifier_unregister, like KVM too. You've simply to unpin
the module count in ->release. The most important bit is that you've
to do that anyway in case mmu_notifier_unregister fails (and it can
fail because of vmalloc space shortage because somebody loaded some
framebuffer driver or whatever).



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Robin Holt
On Wed, Apr 23, 2008 at 03:36:19PM +0200, Andrea Arcangeli wrote:
 On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote:
   The only other change I did has been to move mmu_notifier_unregister
   at the end of the patchset after getting more questions about its
   reliability and I documented a bit the rmmod requirements for
  ->release. We'll think later if it makes sense to add it, nobody's
   using it anyway.
  
  XPMEM is using it.  GRU will be as well (probably already does).
 
 XPMEM requires more patches anyway. Note that in previous email you
 told me you weren't using it. I think GRU can work fine on 2.6.26

I said I could test without it.  It is needed for the final version.
It also makes the API consistent.  What you are proposing is equivalent
to having a file you can open but never close.

This whole discussion seems ludicrous.  You could refactor the code to get
the sorted list of locks, pass that list into mm_lock to do the locking,
do the register/unregister, then pass the same list into mm_unlock.

If the allocation fails, you could fall back to the older slower method
of repeatedly scanning the lists and acquiring locks in ascending order.

 without mmu_notifier_unregister, like KVM too. You've simply to unpin
 the module count in ->release. The most important bit is that you've
 to do that anyway in case mmu_notifier_unregister fails (and it can

If you are not going to provide the _unregister callout you need to change
the API so I can scan the list of notifiers to see if my structures are
already registered.

We register our notifier structure at device open time.  If we receive a
_release callout, we mark our structure as unregistered.  At device close
time, if we have not been unregistered, we call _unregister.  If you
take away _unregister, I have an xpmem kernel structure in use _AFTER_
the device is closed with no indication that the process is using it.
In that case, I need to get an extra reference to the module in my device
open method and hold that reference until the _release callout.

Additionally, if the user's program reopens the device, I need to scan the
mmu_notifiers list to see if this task's notifier is already registered.

I view _unregister as essential.  Did I miss something?

Thanks,
Robin



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 09:47:47AM -0500, Robin Holt wrote:
 It also makes the API consistent.  What you are proposing is equivalent
 to having a file you can open but never close.

That's not entirely true, you can close the file just fine by
killing the tasks, leading to an mmput. From a user perspective in KVM
terms, it won't make a difference as /dev/kvm will remain open and
it'll pin the module count until the kvm task is killed anyway; I
assume for GRU it's similar.

Until I had the idea of how to implement an mm_lock to ensure the
mmu_notifier_register could miss a running invalidate_range_begin, it
wasn't even possible to implement a mmu_notifier_unregister (see EMM
patches) and it looked like you were ok with that API that missed
_unregister...

 This whole discussion seems ludicrous.  You could refactor the code to get
 the sorted list of locks, pass that list into mm_lock to do the locking,
 do the register/unregister, then pass the same list into mm_unlock.

Correct, but it will keep the vmalloc ram pinned during the
runtime. There's no reason to keep that ram allocated per-VM while the
VM runs. We only need it during the startup and teardown.

 If the allocation fails, you could fall back to the older slower method
 of repeatedly scanning the lists and acquiring locks in ascending order.

Correct, I already thought about that. This is exactly why I'm
deferring this for later! Or those perfectionism not needed for
KVM/GRU will keep delaying indefinitely the part that is already
converged and that's enough for KVM and GRU (and for this specific
bit, actually enough for XPMEM as well).

We can make a second version of mm_lock_slow to use if mm_lock fails,
in mmu_notifier_unregister, with N^2 complexity later, after the
mmu-notifier-core is merged into mainline.

 If you are not going to provide the _unregister callout you need to change
 the API so I can scan the list of notifiers to see if my structures are
 already registered.

As said 1/N isn't enough for XPMEM anyway. 1/N has to only include the
absolute minimum and zero risk stuff, that is enough for both KVM and
GRU.

 We register our notifier structure at device open time.  If we receive a
 _release callout, we mark our structure as unregistered.  At device close
 time, if we have not been unregistered, we call _unregister.  If you
 take away _unregister, I have an xpmem kernel structure in use _AFTER_
 the device is closed with no indication that the process is using it.
 In that case, I need to get an extra reference to the module in my device
 open method and hold that reference until the _release callout.

Yes exactly, but you've to do that anyway, if mmu_notifier_unregister
fails because some driver already allocated all vmalloc space (even
x86-64 hasn't indefinite amount of vmalloc because of the vmalloc
being in the end of the address space) unless we've a N^2 fallback,
but the N^2 fallback will make the code more easily dosable and
unkillable, so if I would be an admin I'd prefer having to quickly
kill -9 a task in O(N) than having to wait some syscall that runs in
O(N^2) to complete before the task quits. So the fallback to a slower
algorithm isn't necessarily what will really happen after 2.6.26 is
released, we'll see. Relying on ->release for the module unpin sounds
preferable, and it's certainly the only reliable way to unregister
that we'll provide in 2.6.26.

 Additionally, if the users program reopens the device, I need to scan the
 mmu_notifiers list to see if this tasks notifier is already registered.

But you don't need to browse the list for this, keep a flag in your
structure after the mmu_notifier struct, set the bitflag after
mmu_notifier_register returns, and clear the bitflag after ->release
runs or after mmu_notifier_unregister returns success. What's the big
deal to track if you've to call mmu_notifier_register a second time or
not? Or you can create a new structure every time somebody asks to
reattach.

 I view _unregister as essential.  Did I miss something?

We can add it later, and we can keep discussing on what's the best
model to implement it as long as you want after 2.6.26 is released
with mmu-notifier-core so GRU/KVM are done. It's unlikely KVM will use
mmu_notifier_unregister anyway as we need it attached for the whole
lifetime of the task, and only for the lifetime of the task.

This is the patch to add it, as you can see it's entirely orthogonal,
backwards compatible with previous API and it doesn't duplicate or
rewrite any code.

Don't worry, any kernel after 2.6.26 will have unregister, but we can't
focus on this for 2.6.26. We can also consider making
mmu_notifier_register safe against double calls on the same structure
but again that's not something we should be doing in 1/N and it can be
done later in a backwards compatible way (plus we're perfectly fine
with the API having not backwards compatible changes as long as 2.6.26
can work for us).

-
Implement 

Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 04:20:35PM -0700, Christoph Lameter wrote:
 I guess I have to prepare another patchset then?

If you want to embarrass yourself three time in a row go ahead ;). I
thought two failed takeovers was enough.



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 07:28:49PM -0500, Jack Steiner wrote:
 The GRU driver unregisters the notifier when all GRU mappings
 are unmapped. I could make it work either way - either with or without
 an unregister function. However, unregister is the most logical
 action to take when all mappings have been destroyed.

This is true for KVM as well, unregister would be the most logical
action to take when the kvm device is closed and the vm
destroyed. However we can't implement mm_lock in O(N*log(N)) without
triggering RAM allocations. And the size of those ram allocations are
unknown at the time unregister runs (they also depend on the
max_nr_vmas sysctl). So on a second thought not even passing the array
from register to unregister would solve it (unless we allocate
max_nr_vmas and we block the sysctl to alter max_nr_vmas if not all
unregister run yet). That's clearly unacceptable.

The only way to avoid failing because of vmalloc space shortage or
oom, would be to provide a O(N*N) fallback. But one that can't be
interrupted by sigkill! sigkill interruption was ok in #v12 because we
didn't rely on mmu_notifier_unregister to succeed. So it avoided any
DoS but it still can't provide any reliable unregister.

So in the end unregistering with kill -9 leading to ->release in O(1)
sounds safer solution for the long term. You can't loop if unregister
fails and pretend your module not to have deadlocks.

Yes, waiting for ->release adds a bit of complexity but I think it's worth
it, and there weren't genial ideas on how to avoid O(N*N) complexity
and allocations too in mmu_notifier_unregister yet. Until that genius
idea will materialize we'll stick with ->release in O(1) as the only
safe unregister so we guarantee the admin will be in control of his
hardware in O(1) with kill -9 no matter if /dev/kvm and /dev/gru are
owned by sillyuser.

I'm afraid if you don't want to worst-case unregister with ->release
you need to have a better idea than my mm_lock and personally I can't
see any other way than mm_lock to ensure not to miss range_begin...

All the above is in 2.6.27 context (for 2.6.26 ->release is the way,
even if the genius idea would materialize).



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Jack Steiner

You may have spotted this already. If so, just ignore this.

It looks like there is a bug in copy_page_range() around line 667.
It's possible to do a mmu_notifier_invalidate_range_start(), then
return -ENOMEM w/o doing a corresponding mmu_notifier_invalidate_range_end().

--- jack





Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 06:26:29PM +0200, Andrea Arcangeli wrote:
 On Tue, Apr 22, 2008 at 04:20:35PM -0700, Christoph Lameter wrote:
  I guess I have to prepare another patchset then?

Apologies for my previous not too polite comment in answer to the
above, but I thought this double patchset was over now that you
converged in #v12 and obsoleted EMM and after the last private
discussions. There's nothing personal here on my side, just a bit of
general frustration on this matter. I appreciate all great
contribution from you, as last your idea to use sort(), but I can't
really see any possible benefit or justification anymore from keeping
two patchsets floating around given we already converged on the
mmu-notifier-core, and given it's almost certain mmu-notifier-core
will go in -mm in time for 2.6.26. Let's put it this way, if I fail to
merge mmu-notifier-core into 2.6.26 I'll voluntarily give up my entire
patchset and leave maintainership to you so you move 1/N to N/N and
remove mm_lock-sem patch (everything else can remain the same as it's
all orthogonal so changing the order is a matter of minutes).



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 12:09:09PM -0500, Jack Steiner wrote:
 
 You may have spotted this already. If so, just ignore this.
 
 It looks like there is a bug in copy_page_range() around line 667.
 It's possible to do a mmu_notifier_invalidate_range_start(), then
 return -ENOMEM w/o doing a corresponding mmu_notifier_invalidate_range_end().

No I didn't spot it yet, great catch!! ;) Thanks a lot. I think we can
take example by Jack and use our energy to spot any bug in the
mmu-notifier-core like with his above auditing effort (I'm quite
certain you didn't reproduce this with real oom ;) so we get a rock
solid mmu-notifier implementation in 2.6.26 so XPMEM will also benefit
later in 2.6.27 and I hope the last XPMEM internal bugs will also be
fixed by that time.

(for the not going to become mmu-notifier users, nothing to worry
about for you, unless you used KVM or GRU actively with mmu-notifiers
this bug would be entirely harmless with both MMU_NOTIFIER=n and =y,
as previously guaranteed)

Here the still untested fix for review.

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -597,6 +597,7 @@
unsigned long next;
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;
+   int ret;
 
/*
 * Don't copy ptes where a page fault will fill them correctly.
@@ -604,33 +605,39 @@
 * readonly mappings. The tradeoff is that copy_page_range is more
 * efficient than faulting.
 */
+   ret = 0;
   if (!(vma->vm_flags &
   (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
   if (!vma->anon_vma)
-   return 0;
+   goto out;
}
 
-   if (is_vm_hugetlb_page(vma))
-   return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+   if (unlikely(is_vm_hugetlb_page(vma))) {
+   ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
+   goto out;
+   }
 
if (is_cow_mapping(vma->vm_flags))
mmu_notifier_invalidate_range_start(src_mm, addr, end);
 
+   ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(src_pgd))
continue;
-   if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-   vma, addr, next))
-   return -ENOMEM;
+   if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+   vma, addr, next))) {
+   ret = -ENOMEM;
+   break;
+   }
} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
if (is_cow_mapping(vma->vm_flags))
mmu_notifier_invalidate_range_end(src_mm,
-   vma->vm_start, end);
-
-   return 0;
+ vma->vm_start, end);
+out:
+   return ret;
 }
 
 static unsigned long zap_pte_range(struct mmu_gather *tlb,



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Christoph Lameter
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

 Implement unregister but it's not reliable, only ->release is reliable.

Why is there still the hlist stuff being used for the mmu notifier list? 
And why is this still unsafe?

There are cases in which you do not take the reverse map locks or mmap_sem
while traversing the notifier list?

This hope for inclusion without proper review (first for .25 now for .26) 
seems to interfere with the patch cleanup work and cause delay after delay 
for getting the patch ready. On what basis do you think that there is a 
chance of any of these patches making it into 2.6.26 given that this 
patchset has never been vetted in Andrew's tree?






Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Christoph Lameter
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

 On Tue, Apr 22, 2008 at 04:20:35PM -0700, Christoph Lameter wrote:
  I guess I have to prepare another patchset then?
 
 If you want to embarrass yourself three time in a row go ahead ;). I
 thought two failed takeovers was enough.

Takeover? I'd be happy if I would not have to deal with this issue.

These patches were necessary because you were not listening to
feedback, plus there is the issue that your patchsets were not easy to
review or diff against. I had to merge several patches to get to a useful
patch. You have always picked up lots of stuff from my patchsets. Lots of
work that could have been avoided by proper patchsets in the first place.



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Christoph Lameter
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

 The only way to avoid failing because of vmalloc space shortage or
 oom, would be to provide a O(N*N) fallback. But one that can't be
 interrupted by sigkill! sigkill interruption was ok in #v12 because we
 didn't rely on mmu_notifier_unregister to succeed. So it avoided any
 DoS but it still can't provide any reliable unregister.

If unregister fails then the driver should not detach from the address
space immediately but wait until ->release is called. That may be
a possible solution. It will be rare that the unregister fails.



-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 11:09:35AM -0700, Christoph Lameter wrote:
 Why is there still the hlist stuff being used for the mmu notifier list? 
 And why is this still unsafe?

What's the problem with hlist, it saves 8 bytes for each mm_struct,
you should be using it too instead of list.

 There are cases in which you do not take the reverse map locks or mmap_sem
 while traversing the notifier list?

There aren't.

 This hope for inclusion without proper review (first for .25 now for .26) 
 seems to interfere with the patch cleanup work and cause delay after delay 
 for getting the patch ready. On what basis do you think that there is a 
 chance of any of these patches making it into 2.6.26 given that this 
 patchset has never been vetted in Andrew's tree?

Let's say I try to be optimistic and hope the right thing will happen
given this is like a new driver that can't hurt anybody but KVM and
GRU if there's any bug. But in my view what interferes with proper
review for .26 are the endless discussions we're doing ;).

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Christoph Lameter
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

 will go in -mm in time for 2.6.26. Let's put it this way, if I fail to
 merge mmu-notifier-core into 2.6.26 I'll voluntarily give up my entire
 patchset and leave maintainership to you so you move 1/N to N/N and
 remove mm_lock-sem patch (everything else can remain the same as it's
 all orthogonal so changing the order is a matter of minutes).

No I really want you to do this. I have no interest in a takeover in the 
future and have done the EMM stuff only because I saw no other way 
forward. I just want this be done the right way for all parties with 
patches that are nice and mergeable.
 

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 11:19:26AM -0700, Christoph Lameter wrote:
 If unregister fails then the driver should not detach from the address
 space immediately but wait until ->release is called. That may be
 a possible solution. It will be rare that the unregister fails.

This is the current idea, exactly. Unless we find a way to replace
mm_lock with something else, I don't see a way to make
mmu_notifier_unregister reliable without wasting ram.

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Christoph Lameter
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

 On Wed, Apr 23, 2008 at 11:09:35AM -0700, Christoph Lameter wrote:
  Why is there still the hlist stuff being used for the mmu notifier list? 
  And why is this still unsafe?
 
 What's the problem with hlist, it saves 8 bytes for each mm_struct,
 you should be using it too instead of list.

list heads in mm_struct and in the mmu_notifier struct seemed to 
be more consistent. We have no hash list after all.

 
  There are cases in which you do not take the reverse map locks or mmap_sem
  while traversing the notifier list?
 
 There aren't.

There is a potential issue in move_ptes where you call 
invalidate_range_end after dropping i_mmap_sem whereas my patches did the 
opposite. Mmap_sem saves you there?

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 11:21:49AM -0700, Christoph Lameter wrote:
 No I really want you to do this. I have no interest in a takeover in the 

Ok if you want me to do this, I definitely prefer the core to go in
now. It's so much easier to concentrate on two problems at different
times than to attack both problems at the same time given they're
mostly completely orthogonal problems. Given we already solved one
problem, I'd like to close it before concentrating on the second
problem. I already told you it was my interest to support XPMEM
too. For example it was me to notice we couldn't possibly remove
can_sleep parameter from invalidate_range without altering the locking
as vmas were unstable outside of one of the three core vm locks. That
finding resulted in much bigger patches than we hoped (like Andrew
previously sort of predicted) and you did all great work to develop
those. From my part, once the converged part is in, it'll be a lot
easier to fully concentrate on the rest. My main focus right now is to
produce a mmu-notifier-core that is entirely bug free for .26.

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 11:27:21AM -0700, Christoph Lameter wrote:
 There is a potential issue in move_ptes where you call 
 invalidate_range_end after dropping i_mmap_sem whereas my patches did the 
 opposite. Mmap_sem saves you there?

Yes, there's really no risk of races in this area after introducing
mm_lock, any place that mangles over ptes and doesn't hold any of the
three locks is buggy anyway. I appreciate the audit work (I also did
it and couldn't find bugs but the more eyes the better).

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Christoph Lameter
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

 Yes, there's really no risk of races in this area after introducing
 mm_lock, any place that mangles over ptes and doesn't hold any of the
 three locks is buggy anyway. I appreciate the audit work (I also did
 it and couldn't find bugs but the more eyes the better).

I guess I would need to merge some patches together somehow to be able 
to review them properly like I did before sigh. I have not reviewed the 
latest code completely.


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Andrea Arcangeli
On Wed, Apr 23, 2008 at 06:37:13PM +0200, Andrea Arcangeli wrote:
 I'm afraid if you don't want to worst-case unregister with ->release
 you need to have a better idea than my mm_lock and personally I can't
 see any other way than mm_lock to ensure not to miss range_begin...

But wait, mmu_notifier_register absolutely requires mm_lock to ensure
that when the kvm->arch.mmu_notifier_invalidate_range_count is zero
(large variable name, it'll get shorter but this is to explain),
really no cpu is in the middle of range_begin/end critical
section. That's why we've to take all the mm locks.

But we cannot care less if we unregister in the middle, unregister
only needs to be sure that no cpu could possibly still using the ram
of the notifier allocated by the driver before returning. So I'll
implement unregister in O(1) and without ram allocations using srcu
and that'll fix all issues with unregister. It'll return void to
make it crystal clear it can't fail. It turns out unregister will make
life easier to kvm as well, mostly to simplify the teardown of the
/dev/kvm closure. Given this can be considered a bugfix to
mmu_notifier_unregister I'll apply it to 1/N and I'll release a new
mmu-notifier-core patch for you to review before I resend to Andrew
before Saturday. Thanks!

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
 Andrea Arcangeli wrote:
 +
 +static int mm_lock_cmp(const void *a, const void *b)
 +{
 +cond_resched();
 +if ((unsigned long)*(spinlock_t **)a <
 +(unsigned long)*(spinlock_t **)b)
 +return -1;
 +else if (a == b)
 +return 0;
 +else
 +return 1;
 +}
 +
 This compare function looks unusual...
 It should work, but sort() could be faster if the
 if (a == b) test had a chance to be true eventually...

Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

 static int mm_lock_cmp(const void *a, const void *b)
 {
   unsigned long la = (unsigned long)*(spinlock_t **)a;
   unsigned long lb = (unsigned long)*(spinlock_t **)b;

   cond_resched();
   if (la < lb)
   return -1;
   if (la > lb)
   return 1;
   return 0;
 }

If your intent is to use the assumption that there are going to be few
equal entries, you should have used likely(la < lb) to signal it's
rarely going to return zero or gcc is likely free to do whatever it
wants with the above. Overall that function is such a slow path that
this is going to be lost in the noise. My suggestion would be to defer
microoptimizations like this after 1/12 will be applied to mainline.

Thanks!


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Avi Kivity
Andrea Arcangeli wrote:
 On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
   
  Andrea Arcangeli wrote:
 
 +
 +static int mm_lock_cmp(const void *a, const void *b)
 +{
 +   cond_resched();
 +   if ((unsigned long)*(spinlock_t **)a <
 +   (unsigned long)*(spinlock_t **)b)
 +   return -1;
 +   else if (a == b)
 +   return 0;
 +   else
 +   return 1;
 +}
 +
   
 This compare function looks unusual...
 It should work, but sort() could be faster if the
 if (a == b) test had a chance to be true eventually...
 

 Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

   

You need to compare *a to *b (at least, that's what you're doing for the
< case).

-- 
error compiling committee.c: too many arguments to function


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Eric Dumazet
Andrea Arcangeli wrote:
 On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
   
  Andrea Arcangeli wrote:
 
 +
 +static int mm_lock_cmp(const void *a, const void *b)
 +{
 +   cond_resched();
 +   if ((unsigned long)*(spinlock_t **)a <
 +   (unsigned long)*(spinlock_t **)b)
 +   return -1;
 +   else if (a == b)
 +   return 0;
 +   else
 +   return 1;
 +}
 +
   
 This compare function looks unusual...
 It should work, but sort() could be faster if the
 if (a == b) test had a chance to be true eventually...
 

 Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?
   
I am saying your intent was probably to test

else if ((unsigned long)*(spinlock_t **)a ==
(unsigned long)*(spinlock_t **)b)
return 0;


Because a and b are pointers to the data you want to compare. You need 
to dereference them.


 static int mm_lock_cmp(const void *a, const void *b)
 {
  unsigned long la = (unsigned long)*(spinlock_t **)a;
  unsigned long lb = (unsigned long)*(spinlock_t **)b;

  cond_resched();
   if (la < lb)
   return -1;
   if (la > lb)
   return 1;
  return 0;
 }
 

 If your intent is to use the assumption that there are going to be few
  equal entries, you should have used likely(la < lb) to signal it's
 rarely going to return zero or gcc is likely free to do whatever it
 wants with the above. Overall that function is such a slow path that
 this is going to be lost in the noise. My suggestion would be to defer
 microoptimizations like this after 1/12 will be applied to mainline.

 Thanks!

   
Hum, it's not a micro-optimization, but a bug fix. :)

Sorry if it was not clear





-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 05:37:38PM +0200, Eric Dumazet wrote:
 I am saying your intent was probably to test

 else if ((unsigned long)*(spinlock_t **)a ==
   (unsigned long)*(spinlock_t **)b)
   return 0;

Indeed...

 Hum, it's not a micro-optimization, but a bug fix. :)

The good thing is that even if this bug would lead to a system crash,
it would be still zero risk for everybody that isn't using KVM/GRU
actively with mmu notifiers. The important thing is that this patch
has zero risk to introduce regressions into the kernel, both when
enabled and disabled, it's like a new driver. I'll shortly resend 1/12
and likely 12/12 for theoretical correctness. For now you can go ahead
testing with this patch as it'll work fine despite of the bug (if it
wasn't the case I would have noticed already ;).

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Eric Dumazet
Andrea Arcangeli wrote:
 +
 +static int mm_lock_cmp(const void *a, const void *b)
 +{
 + cond_resched();
  + if ((unsigned long)*(spinlock_t **)a <
  + (unsigned long)*(spinlock_t **)b)
 + return -1;
 + else if (a == b)
 + return 0;
 + else
 + return 1;
 +}
 +
This compare function looks unusual...
It should work, but sort() could be faster if the
if (a == b) test had a chance to be true eventually...

static int mm_lock_cmp(const void *a, const void *b)
{
unsigned long la = (unsigned long)*(spinlock_t **)a;
unsigned long lb = (unsigned long)*(spinlock_t **)b;

cond_resched();
if (la < lb)
return -1;
if (la > lb)
return 1;
return 0;
}






-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Christoph Lameter
Thanks for adding most of my enhancements. But

1. There is no real need for invalidate_page(). Can be done with 
invalidate_start/end. Needlessly complicates the API. One
of the objections by Andrew was that there mere multiple
callbacks that perform similar functions.

2. The locks that are used are later changed to semaphores. This is
   f.e. true for mm_lock / mm_unlock. The diffs will be smaller if the
   lock conversion is done first and then mm_lock is introduced. The
   way the patches are structured means that reviewers cannot review the
   final version of mm_lock etc etc. The lock conversion needs to come 
   first.

3. As noted by Eric and also contained in private post from yesterday by 
   me: The cmp function needs to retrieve the value before
   doing comparisons which is not done for the == of a and b.

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Robin Holt
On Tue, Apr 22, 2008 at 01:19:29PM -0700, Christoph Lameter wrote:
 Thanks for adding most of my enhancements. But
 
 1. There is no real need for invalidate_page(). Can be done with 
   invalidate_start/end. Needlessly complicates the API. One
   of the objections by Andrew was that there were multiple
   callbacks that perform similar functions.

While I agree with that reading of Andrew's email about invalidate_page,
I think the GRU hardware makes a strong enough case to justify the two
separate callouts.

Due to the GRU hardware, we can assure that invalidate_page terminates all
pending GRU faults (that includes faults that are just beginning) and can
therefore be completed without needing any locking.  The invalidate_page()
callout gets turned into a GRU flush instruction and we return.

Because the invalidate_range_start() leaves the page table information
available, we cannot use a single page _start to mimic that
functionality.  Therefore, there is a documented case justifying the
separate callouts.

I agree the case is fairly weak, but it does exist.  Given Andrea's
unwillingness to move and Jack's documented case, it is my opinion the
most likely compromise is to leave in the invalidate_page() callout.

Thanks,
Robin

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 01:19:29PM -0700, Christoph Lameter wrote:
 3. As noted by Eric and also contained in private post from yesterday by 
me: The cmp function needs to retrieve the value before
doing comparisons which is not done for the == of a and b.

I retrieved the value, which is why mm_lock works perfectly on #v13 as
well as #v12. It's not mandatory to ever return 0, so it won't produce
any runtime error (there is a bugcheck for wrong sort ordering in my
patch just in case it would generate any runtime error and it never
did, or I would have noticed before submission), which is why I didn't
need to release any hotfix yet and I'm waiting more time to get more
comments before sending an update to clean up that bit.

Mentioning this as the third and last point I guess shows how strong
are your arguments against merging my mmu-notifier-core now, so in the
end making that cosmetic error paid off somehow.

I'll send an update in any case to Andrew way before Saturday so
hopefully we'll finally get mmu-notifiers-core merged before next
week. Also I'm not updating my mmu-notifier-core patch anymore except
for strict bugfixes so don't worry about any more cosmetic bugs
being introduced while optimizing the code like it happened this time.

The only other change I did has been to move mmu_notifier_unregister
at the end of the patchset after getting more questions about its
reliability and I documented a bit the rmmod requirements for
->release. We'll think later if it makes sense to add it, nobody's
using it anyway.

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Robin Holt
 The only other change I did has been to move mmu_notifier_unregister
 at the end of the patchset after getting more questions about its
 reliability and I documented a bit the rmmod requirements for
 ->release. We'll think later if it makes sense to add it, nobody's
 using it anyway.

XPMEM is using it.  GRU will be as well (probably already does).

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Christoph Lameter
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

 I'll send an update in any case to Andrew way before Saturday so
 hopefully we'll finally get mmu-notifiers-core merged before next
 week. Also I'm not updating my mmu-notifier-core patch anymore except
 for strict bugfixes so don't worry about any more cosmetical bugs
 being introduced while optimizing the code like it happened this time.

I guess I have to prepare another patchset then?


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Jack Steiner
On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote:
  The only other change I did has been to move mmu_notifier_unregister
  at the end of the patchset after getting more questions about its
  reliability and I documented a bit the rmmod requirements for
   ->release. We'll think later if it makes sense to add it, nobody's
  using it anyway.
 
 XPMEM is using it.  GRU will be as well (probably already does).

Yeppp.

The GRU driver unregisters the notifier when all GRU mappings
are unmapped. I could make it work either way - either with or without
an unregister function. However, unregister is the most logical
action to take when all mappings have been destroyed.


--- jack

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel