Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 29, 2008 at 06:03:40PM +0200, Andrea Arcangeli wrote:
> Christoph, if you've interest in evolving anon-vma-sem and i_mmap_sem
> yourself in this direction, you're very welcome to go ahead while I

In case you didn't notice this already: for a further explanation of why semaphores run slower for small critical sections, and why the conversion from spinlock to rwsem should happen under a config option, see the "AIM7 40% regression with 2.6.26-rc1" thread.

-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
Don't miss this year's exciting event. There's still time to save $100.
Use priority code J8TL2D2.
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, 29 Apr 2008, Andrea Arcangeli wrote:
> My point of view is that there was no rcu when I wrote that code, yet there was no reference count and yet all locking looks still exactly the same as I wrote it. There's even still the page_table_lock to serialize threads taking the mmap_sem in read mode against the first vma->anon_vma = anon_vma during the page fault. Frankly I've absolutely no idea why rcu is needed in all rmap code when walking the page->mapping. Definitely the PG_locked is taken so there's no way page->mapping could possibly go away under the rmap code, hence the anon_vma can't go away as it's queued in the vma, and the vma has to go away before the page is zapped out of the pte.

[I'm scarcely following the mmu notifiers to-and-fro, which seems to be in good hands, amongst faster thinkers than me: who actually need and can test this stuff. Don't let me slow you down; but I can quickly clarify on this history.]

No, the locking was different as you had it, Andrea: there was an extra bitspin lock, carried over from the pte_chains days (maybe we changed the name, maybe we disagreed over the name, I forget), which mainly guarded the page->mapcount. I thought that was one lock more than we needed, and eliminated it in favour of atomic page->mapcount in 2.6.9. Here are the relevant extracts from ChangeLog-2.6.9:

[PATCH] rmaplock: PageAnon in mapping

First of a batch of five patches to eliminate rmap's page_map_lock, replace its trylocking by spinlocking, and use anon_vma to speed up swapoff. Patches updated from the originals against 2.6.7-mm7: nothing new so I won't spam the list, but including Manfred's SLAB_DESTROY_BY_RCU fixes, and omitting the unuse_process mmap_sem fix already in 2.6.8-rc3.

This patch: Replace the PG_anon page->flags bit by setting the lower bit of the pointer in page->mapping when it's anon_vma: the PAGE_MAPPING_ANON bit.
We're about to eliminate the locking which kept the flags and mapping in synch: it's much easier to work on a local copy of page->mapping, than worry about whether flags and mapping are in synch (though I imagine it could be done, at greater cost, with some barriers).

[PATCH] rmaplock: kill page_map_lock

The pte_chains rmap used pte_chain_lock (bit_spin_lock on PG_chainlock) to lock its pte_chains. We kept this (as page_map_lock: bit_spin_lock on PG_maplock) when we moved to objrmap. But the file objrmap locks its vma tree with mapping->i_mmap_lock, and the anon objrmap locks its vma list with anon_vma->lock: so isn't the page_map_lock superfluous? Pretty much, yes.

The mapcount was protected by it, and needs to become an atomic: starting at -1 like page->_count, so nr_mapped can be tracked precisely up and down. The last page_remove_rmap can't clear anon page mapping any more, because of races with page_add_rmap; from which some BUG_ONs must go for the same reason, but they've served their purpose.

vmscan decisions are naturally racy, little change there beyond removing page_map_lock/unlock. But to stabilize the file-backed page->mapping against truncation while acquiring i_mmap_lock, page_referenced_file now needs page lock to be held even for refill_inactive_zone. There's a similar issue in acquiring anon_vma->lock, where page lock doesn't help: which this patch pretends to handle, but actually it needs the next.

Roughly 10% cut off lmbench fork numbers on my 2*HT*P4. Must confess my testing failed to show the races even while they were knowingly exposed: would benefit from testing on racier equipment.

[PATCH] rmaplock: SLAB_DESTROY_BY_RCU

With page_map_lock gone, how to stabilize page->mapping's anon_vma while acquiring anon_vma->lock in page_referenced_anon and try_to_unmap_anon?
The page cannot actually be freed (vmscan holds reference), but however much we check page_mapped (which guarantees that anon_vma is in use - or would guarantee that if we added suitable barriers), there's no locking against page becoming unmapped the instant after, then anon_vma freed.

It's okay to take anon_vma->lock after it's freed, so long as it remains a struct anon_vma (its list would become empty, or perhaps reused for an unrelated anon_vma: but no problem since we always check that the page located is the right one); but corruption if that memory gets reused for some other purpose.

This is not unique: it's liable to be a problem whenever the kernel tries to approach a structure obliquely. It's generally solved with an atomic reference count; but one advantage of anon_vma over anonmm is that it does not have such a count, and it would be a backward step to add one.

Therefore... implement the SLAB_DESTROY_BY_RCU flag, to guarantee that such a kmem_cache_alloc'ed structure cannot get freed to other use while the rcu_read_lock is held (i.e. preempt disabled); and use that for anon_vma. Fix concerns raised by Manfred: this flag is incompatible with poisoning and destructor, and kmem_cache_destroy needs to synchronize_kernel. I hope SLAB_DESTROY_BY_RCU
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
Hi Hugh!!

On Tue, Apr 29, 2008 at 11:49:11AM +0100, Hugh Dickins wrote:
> [I'm scarcely following the mmu notifiers to-and-fro, which seems to be in good hands, amongst faster thinkers than me: who actually need and can test this stuff. Don't let me slow you down; but I can quickly clarify on this history.]

Still I think it'd be great if you could review mmu-notifier-core v14. You and Nick are the core VM maintainers so it'd be great to hear any feedback about it. I think it's fairly easy to classify the patch as obviously safe as long as mmu notifiers are disarmed. Here's a link for your convenience:

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/mmu-notifier-core

> No, the locking was different as you had it, Andrea: there was an extra bitspin lock, carried over from the pte_chains days (maybe we changed the name, maybe we disagreed over the name, I forget), which mainly guarded the page->mapcount. I thought that was one lock more than we needed, and eliminated it in favour of atomic page->mapcount in 2.6.9.

Thanks a lot for the explanation!
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Mon, Apr 28, 2008 at 06:28:06PM -0700, Christoph Lameter wrote:
> On Tue, 29 Apr 2008, Andrea Arcangeli wrote:
> > Frankly I've absolutely no idea why rcu is needed in all rmap code when walking the page->mapping. Definitely the PG_locked is taken so there's no way page->mapping could possibly go away under the rmap code, hence the anon_vma can't go away as it's queued in the vma, and the vma has to go away before the page is zapped out of the pte.
>
> zap_pte_range can race with the rmap code and it does not take the page lock. The page may not go away since a refcount was taken but the mapping can go away. Without RCU you have no guarantee that the anon_vma is existing when you take the lock.

There's some room for improvement, like using down_read_trylock: if that succeeds we don't need to increase the refcount and we can keep the rcu_read_lock held instead. Secondly we don't need to increase the refcount in fork() when we queue the vma-copy in the anon_vma. You should init the refcount to 1 when the anon_vma is allocated, remove the atomic_inc from all code (except when down_read_trylock fails) and then change anon_vma_unlink to:

	up_write(&anon_vma->sem);
	if (empty)
		put_anon_vma(anon_vma);

While the down_read_trylock surely won't help in AIM, the second change will reduce a bit the overhead in the VM core fast paths by avoiding all refcounting changes, by checking the list_empty the same way the current code does. I really like how I designed the garbage collection through list_empty: it's efficient and I'd like to keep it.

I however doubt this will bring us back to the same performance as the current spinlock version, as the real overhead should come out of overscheduling in down_write at anon_vma_link. Here an initially spinning lock would help, but that's a gray area: it greatly depends on timings, and on very large systems where a cacheline wait with many cpus forking at the same time takes more than scheduling, a semaphore may not slow down performance that much.
So I think the only way is a configuration option to switch the locking at compile time, then XPMEM will depend on that option to be on. I don't see a big deal, and this guarantees embedded isn't screwed up by totally unnecessary locks on UP.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
> I however doubt this will bring us back to the same performance of the current spinlock version, as the real overhead should come out of overscheduling in down_write at anon_vma_link. Here an initially spinning lock would help but that's gray area, it greatly depends on timings, and on very large systems where a cacheline wait with many cpus forking at the same time takes more than scheduling, a semaphore may not slowdown performance that much.
>
> So I think the only way is a configuration option to switch the locking at compile time, then XPMEM will depend on that option to be on, I don't see a big deal and this guarantees embedded isn't screwed up by totally unnecessary locks on UP.

You have said this continually about a CONFIG option. I am unsure how that could be achieved. Could you provide a patch?

Thanks,
Robin
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 29, 2008 at 10:50:30AM -0500, Robin Holt wrote:
> You have said this continually about a CONFIG option. I am unsure how that could be achieved. Could you provide a patch?

I'm busy with the reserved ram patch against 2.6.25 and latest kvm.git that is moving from pages to pfn for pci passthrough (that change will also remove the page pin with mmu notifiers). Unfortunately reserved-ram bugs out again in blk-settings.c on real hardware. The fix I pushed in .25 for it works when booting kvm (that's how I tested it), but on real hardware the sata b_pfn happens to be 1 page less than the result of the min comparison, and I'll have to figure out what happens (only the .24 code works on real hardware...; at least my fix is surely better than the previous .25-pre code). I've other people waiting on that reserved-ram to be working, so once I've finished, I'll do the optimization to anon-vma (at least the removal of the unnecessary atomic_inc from fork) and add the config option.

Christoph, if you've interest in evolving anon-vma-sem and i_mmap_sem yourself in this direction, you're very welcome to go ahead while I finish sorting out reserved-ram. If you do, please let me know so we don't duplicate effort, and it'd be absolutely great if the patches could be incremental with #v14 so I can merge them trivially later and upload a new patchset once you're finished (the only outstanding fix you have to apply on top of #v14 that is already integrated in my patchset is the i_mmap_sem deadlock fix I posted, and I'm sure you've already applied it on top of #v14 before doing any more development on it). Thanks!
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Sun, 27 Apr 2008, Andrea Arcangeli wrote:
> Talking about post 2.6.26: the refcount with rcu in the anon-vma conversion seems unnecessary and may explain part of the AIM slowdown too. The rest looks ok and probably we should switch the code to a compile-time decision between rwlock and rwsem (so obsoleting the current spinlock).

You are going to take a semaphore in an rcu section? Guess you did not activate all debugging options while testing? I was not aware that you can take a sleeping lock from a non preemptible context.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Mon, Apr 28, 2008 at 01:34:11PM -0700, Christoph Lameter wrote:
> On Sun, 27 Apr 2008, Andrea Arcangeli wrote:
> > Talking about post 2.6.26: the refcount with rcu in the anon-vma conversion seems unnecessary and may explain part of the AIM slowdown too. The rest looks ok and probably we should switch the code to a compile-time decision between rwlock and rwsem (so obsoleting the current spinlock).
>
> You are going to take a semaphore in an rcu section? Guess you did not activate all debugging options while testing? I was not aware that you can take a sleeping lock from a non preemptible context.

I'd hoped to discuss this topic after mmu-notifier-core was already merged, but let's do it anyway.

My point of view is that there was no rcu when I wrote that code, yet there was no reference count and yet all locking looks still exactly the same as I wrote it. There's even still the page_table_lock to serialize threads taking the mmap_sem in read mode against the first vma->anon_vma = anon_vma during the page fault. Frankly I've absolutely no idea why rcu is needed in all rmap code when walking the page->mapping. Definitely the PG_locked is taken so there's no way page->mapping could possibly go away under the rmap code, hence the anon_vma can't go away as it's queued in the vma, and the vma has to go away before the page is zapped out of the pte.

So there are some possible scenarios:

1) my original anon_vma code was buggy not taking the rcu_read_lock() and somebody fixed it (I tend to exclude it)

2) somebody has seen a race that doesn't exist and didn't bother to document it other than with this obscure comment

	 * Getting a lock on a stable anon_vma from a page off the LRU is
	 * tricky: page_lock_anon_vma rely on RCU to guard against the races.

   I tend to exclude it too as VM folks are too smart for this to be the case.
3) somebody did some microoptimization using rcu and we surely can undo that microoptimization, to get the code back to my original code that didn't need rcu despite working exactly the same, and that is going to be cheaper to use with semaphores than doubling the number of locked ops for every lock instruction.

Now the double atomic op may not be horrible when not contended, as it works on the same cacheline, but with contention the cacheline bouncing sounds doubly horrible compared to a single cacheline bounce, and I don't see the point of it as you can't use rcu anyway, so you can't possibly take advantage of whatever microoptimization was done over the original locking.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Sat, Apr 26, 2008 at 08:17:34AM -0500, Robin Holt wrote:
> the first four sets. The fifth is the oversubscription test which trips my xpmem bug. This is as good as the v12 runs from before.

Now that mmu-notifier-core #v14 seems finished and hopefully will appear in 2.6.26 ;), I started exercising more the kvm-mmu-notifier code with the full patchset applied and not only with mmu-notifier-core. I soon found the full patchset has a swap deadlock bug. Then I tried without using kvm (so with mmu notifier disarmed) and I could still reproduce the crashes. After grabbing a few stack traces I tracked it down to a bug in the i_mmap_lock-to-i_mmap_sem conversion. If your oversubscription test means swapping, you should retest with this applied on the #v14 i_mmap_sem patch, as without this it would eventually deadlock with all tasks allocating memory in D state.

Now the full patchset is as rock solid as with only mmu-notifier-core applied. It's been swapping a 2G memhog on top of a 3G VM with 2G of ram for the last hours without a problem. Everything is working great with KVM at least.

Talking about post 2.6.26: the refcount with rcu in the anon-vma conversion seems unnecessary and may explain part of the AIM slowdown too. The rest looks ok and probably we should switch the code to a compile-time decision between rwlock and rwsem (so obsoleting the current spinlock).

diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1008,7 +1008,7 @@ static int try_to_unmap_file(struct page
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
 		vma->vm_private_data = NULL;
 out:
-	up_write(&mapping->i_mmap_sem);
+	up_read(&mapping->i_mmap_sem);
 	return ret;
 }
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Thu, Apr 24, 2008 at 07:41:45PM +0200, Andrea Arcangeli wrote:
> A full new update will soon become visible here:
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14-pre3/

I grabbed these and built them. Only change needed was another include. After that, everything built fine and xpmem regression tests ran through the first four sets. The fifth is the oversubscription test which trips my xpmem bug. This is as good as the v12 runs from before.

Since this include and the one for mm_types.h both are build breakages for ia64, I think you need to apply your ia64_cpumask and the following (possibly as a single patch) first or in your patch 1. Without that, ia64 doing a git-bisect could hit a build failure.

Index: mmu_v14_pre3_xpmem_v003_v1/include/linux/srcu.h
===================================================================
--- mmu_v14_pre3_xpmem_v003_v1.orig/include/linux/srcu.h	2008-04-26 06:41:54.0 -0500
+++ mmu_v14_pre3_xpmem_v003_v1/include/linux/srcu.h	2008-04-26 07:01:17.292071827 -0500
@@ -27,6 +27,8 @@
 #ifndef _LINUX_SRCU_H
 #define _LINUX_SRCU_H
 
+#include <linux/mutex.h>
+
 struct srcu_struct_array {
 	int c[2];
 };
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Sat, Apr 26, 2008 at 08:17:34AM -0500, Robin Holt wrote:
> Since this include and the one for mm_types.h both are build breakages for ia64, I think you need to apply your ia64_cpumask and the following (possibly as a single patch) first or in your patch 1. Without that, ia64 doing a git-bisect could hit a build failure.

Agreed, so it doesn't risk to break ia64 compilation, thanks for the great XPMEM feedback!

Also note, I figured out that mmu_notifier_release can actually run concurrently against other mmu notifiers in case there's a vmtruncate (->release could already run concurrently if invoked by _unregister; the only guarantee is that ->release will be called one time and only one time, and that no mmu notifier will ever run after _unregister returns). In short I can't keep the list_del_init in _release and I need a list_del_init_rcu instead to fix this minor issue. So this won't really make much difference after all. I'll release #v14 with all this after a bit of kvm testing with it...

diff --git a/include/linux/list.h b/include/linux/list.h
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -755,6 +755,14 @@ static inline void hlist_del_init(struct
 	}
 }
 
+static inline void hlist_del_init_rcu(struct hlist_node *n)
+{
+	if (!hlist_unhashed(n)) {
+		__hlist_del(n);
+		n->pprev = NULL;
+	}
+}
+
 /**
  * hlist_replace_rcu - replace old entry by new one
  * @old : the element to be replaced
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -22,7 +22,10 @@ struct mmu_notifier_ops {
 	/*
 	 * Called either by mmu_notifier_unregister or when the mm is
 	 * being destroyed by exit_mmap, always before all pages are
-	 * freed. It's mandatory to implement this method.
+	 * freed. It's mandatory to implement this method. This can
+	 * run concurrently to other mmu notifier methods and it
+	 * should teardown all secondary mmu mappings and freeze the
+	 * secondary mmu.
	 */
	void (*release)(struct mmu_notifier *mn, struct mm_struct *mm);

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -19,12 +19,13 @@
 /*
  * This function can't run concurrently against mmu_notifier_register
- * or any other mmu notifier method. mmu_notifier_register can only
- * run with mm->mm_users > 0 (and exit_mmap runs only when mm_users is
- * zero). All other tasks of this mm already quit so they can't invoke
- * mmu notifiers anymore. This can run concurrently only against
- * mmu_notifier_unregister and it serializes against it with the
- * mmu_notifier_mm->lock in addition to RCU. struct mmu_notifier_mm
+ * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
+ * runs with mm_users == 0. Other tasks may still invoke mmu notifiers
+ * in parallel despite there's no task using this mm anymore, through
+ * the vmas outside of the exit_mmap context, like with
+ * vmtruncate. This serializes against mmu_notifier_unregister with
+ * the mmu_notifier_mm->lock in addition to SRCU and it serializes
+ * against the other mmu notifiers with SRCU. struct mmu_notifier_mm
  * can't go away from under us as exit_mmap holds a mm_count pin
  * itself.
  */
@@ -44,7 +45,7 @@ void __mmu_notifier_release(struct mm_st
 	 * to wait ->release to finish and
 	 * mmu_notifier_unregister to return.
 	 */
-	hlist_del_init(&mn->hlist);
+	hlist_del_init_rcu(&mn->hlist);
 	/*
 	 * SRCU here will block mmu_notifier_unregister until
 	 * ->release returns.
@@ -185,6 +186,8 @@ int mmu_notifier_register(struct mmu_not
 	 * side note: mmu_notifier_release can't run concurrently with
 	 * us because we hold the mm_users pin (either implicitly as
 	 * current->mm or explicitly with get_task_mm() or similar).
+	 * We can't race against any other mmu notifiers either thanks
+	 * to mm_lock().
	 */
	spin_lock(&mm->mmu_notifier_mm->lock);
	hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Thu, Apr 24, 2008 at 12:19:28AM +0200, Andrea Arcangeli wrote:
> /dev/kvm closure. Given this can be a considered a bugfix to
> mmu_notifier_unregister I'll apply it to 1/N and I'll release a new

I'm not sure anymore this can be considered a bugfix, given how large a change this resulted in to the locking and register/unregister/release behavior. Here's a full draft patch for review and testing. Works great with KVM so far at least...

- mmu_notifier_register has to run on current->mm or on get_task_mm() (in the latter case it can mmput after mmu_notifier_register returns)

- mmu_notifier_register in turn can't race against mmu_notifier_release as that runs in exit_mmap after the last mmput

- mmu_notifier_unregister can run at any time, even after exit_mmap completed. No mm_count pin is required, it's taken automatically by register and released by unregister

- mmu_notifier_unregister serializes against all mmu notifiers with srcu, and it serializes especially against a concurrent mmu_notifier_unregister with a mix of a spinlock and SRCU

- the spinlock lets us keep track of who ran first between mmu_notifier_unregister and mmu_notifier_release; this makes life much easier for the driver to handle, as the driver is then guaranteed that ->release will run

- the first that runs executes the ->release method as well, after dropping the spinlock but before releasing the srcu lock

- it was unsafe to unpin the module count from ->release, as release itself has to run the 'ret' instruction to return back to the mmu notifier code

- the ->release method is mandatory as it has to run before the pages are freed, to zap all existing sptes

- the one that arrives second between mmu_notifier_unregister and mmu_notifier_release waits for the first with srcu

As said this is a much larger change than I hoped, but as usual it can only affect KVM/GRU/XPMEM if something is wrong with this. I don't exclude we'll have to backoff to the previous mm_users model.
The main issue with taking a mm_users pin is that filehandles associated with vmas aren't closed by exit() if the mm_users is pinned (that simply leaks ram with kvm). It looks more correct not to rely on the mm_users being 0 only in mmu_notifier_register.

The other big change is that ->release is mandatory and always called by the first between mmu_notifier_unregister or mmu_notifier_release. Both mmu_notifier_unregister and mmu_notifier_release are slow paths so taking a spinlock there is no big deal.

Impact when the mmu notifiers are disarmed is unchanged.

The interesting part of the kvm patch to test this change is below. After this last bit the KVM patch status is almost final, if this new mmu notifier update is remotely ok; I've another one that does the locking change to remove the page pin.

+static void kvm_free_vcpus(struct kvm *kvm);
+/* This must zap all the sptes because all pages will be freed then */
+static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
+				     struct mm_struct *mm)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	BUG_ON(mm != kvm->mm);
+	kvm_free_pit(kvm);
+	kfree(kvm->arch.vpic);
+	kfree(kvm->arch.vioapic);
+	kvm_free_vcpus(kvm);
+	kvm_free_physmem(kvm);
+	if (kvm->arch.apic_access_page)
+		put_page(kvm->arch.apic_access_page);
+}
+
+static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
+	.release		= kvm_mmu_notifier_release,
+	.invalidate_page	= kvm_mmu_notifier_invalidate_page,
+	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
+	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
+};
+
 struct kvm *kvm_arch_create_vm(void)
 {
 	struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+	int err;
 
 	if (!kvm)
 		return ERR_PTR(-ENOMEM);
 
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 
+	kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+	err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+	if (err) {
+		kfree(kvm);
+		return ERR_PTR(err);
+	}
+
 	return kvm;
 }
 
@@ -3899,13 +3967,12 @@ static void kvm_free_vcpus(struct kvm *kvm)
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
-	kvm_free_pit(kvm);
-	kfree(kvm->arch.vpic);
-	kfree(kvm->arch.vioapic);
-	kvm_free_vcpus(kvm);
-	kvm_free_physmem(kvm);
-	if (kvm->arch.apic_access_page)
-		put_page(kvm->arch.apic_access_page);
+	/*
+	 * kvm_mmu_notifier_release() will be called before
+	 * mmu_notifier_unregister returns, if it didn't run
+	 * already.
+	 */
+	mmu_notifier_unregister(&kvm->arch.mmu_notifier, kvm->mm);
 	kfree(kvm);
 }

Let's call this mmu notifier #v14-test1.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
I am not certain of this, but it seems like this patch leaves things in a somewhat asymmetric state. At the very least, I think that asymmetry should be documented in the comments of either mmu_notifier.h or .c.

Before I do the first mmu_notifier_register, all places that test for mm_has_notifiers(mm) will return false and take the fast path. After I do some mmu_notifier_register()s and their corresponding mmu_notifier_unregister()s, mm_has_notifiers(mm) will return true and the slow path will be taken, despite all registered notifiers having unregistered.

It seems to me the work done by mmu_notifier_mm_destroy should really be done inside the mm_lock()/mm_unlock area of mmu_unregister and mmu_notifier_release when we have removed the last entry. That would give the user's job the same performance after they are done using the special device that they had prior to its use.

On Thu, Apr 24, 2008 at 08:49:40AM +0200, Andrea Arcangeli wrote:
...
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
...
> @@ -603,25 +605,39 @@
>  	 * readonly mappings. The tradeoff is that copy_page_range is more
>  	 * efficient than faulting.
>  	 */
> +	ret = 0;
>  	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
>  		if (!vma->anon_vma)
> -			return 0;
> +			goto out;
>  	}
> -	if (is_vm_hugetlb_page(vma))
> -		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
> +	if (unlikely(is_vm_hugetlb_page(vma))) {
> +		ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
> +		goto out;
> +	}
>
> +	if (is_cow_mapping(vma->vm_flags))
> +		mmu_notifier_invalidate_range_start(src_mm, addr, end);
> +
> +	ret = 0;

I don't think this is needed.

...
> +/* avoid memory allocations for mm_unlock to prevent deadlock */
> +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
> +{
> +	if (mm->map_count) {
> +		if (data->nr_anon_vma_locks)
> +			mm_unlock_vfree(data->anon_vma_locks,
> +					data->nr_anon_vma_locks);
> +		if (data->i_mmap_locks)

I think you really want data->nr_i_mmap_locks.

...
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> new file mode 100644
> --- /dev/null
> +++ b/mm/mmu_notifier.c
...
> +/*
> + * This function can't run concurrently against mmu_notifier_register
> + * or any other mmu notifier method. mmu_notifier_register can only
> + * run with mm->mm_users > 0 (and exit_mmap runs only when mm_users is
> + * zero). All other tasks of this mm already quit so they can't invoke
> + * mmu notifiers anymore. This can run concurrently only against
> + * mmu_notifier_unregister and it serializes against it with the
> + * unregister_lock in addition to RCU. struct mmu_notifier_mm can't go
> + * away from under us as the exit_mmap holds a mm_count pin itself.
> + *
> + * The ->release method can't allow the module to be unloaded, the
> + * module can only be unloaded after mmu_notifier_unregister run. This
> + * is because the release method has to run the ret instruction to
> + * return back here, and so it can't allow the ret instruction to be
> + * freed.
> + */

The second paragraph of this comment seems extraneous.

...
> +	/*
> +	 * Wait ->release if mmu_notifier_unregister run list_del_rcu.
> +	 * srcu can't go away from under us because one mm_count is
> +	 * hold by exit_mmap.
> +	 */

These two sentences don't make any sense to me.

...
> +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	int before_release = 0, srcu;
> +
> +	BUG_ON(atomic_read(&mm->mm_count) <= 0);
> +
> +	srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
> +	spin_lock(&mm->mmu_notifier_mm->unregister_lock);
> +	if (!hlist_unhashed(&mn->hlist)) {
> +		hlist_del_rcu(&mn->hlist);
> +		before_release = 1;
> +	}
> +	spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
> +	if (before_release)
> +		/*
> +		 * exit_mmap will block in mmu_notifier_release to
> +		 * guarantee ->release is called before freeing the
> +		 * pages.
> +		 */
> +		mn->ops->release(mn, mm);

I am not certain about the need to do the release callout when the driver has already told this subsystem it is done. For XPMEM, this callout would immediately return.
I would expect it to be the same for GRU.

Thanks, Robin
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Thu, Apr 24, 2008 at 04:51:12AM -0500, Robin Holt wrote:

It seems to me the work done by mmu_notifier_mm_destroy should really be done inside the mm_lock()/mm_unlock() area of mmu_notifier_unregister and

There's no mm_lock/unlock for mmu_notifier_unregister anymore. That's the whole point of using srcu, so it becomes reliable and quick.

mmu_notifier_release when we have removed the last entry. That would give the user's job the same performance after it is done using the special device that it had prior to its use.

That's not feasible. Otherwise mmu_notifier_mm will go away at any time, under both _release from exit_mmap and under _unregister too. exit_mmap holds an implicit mm_count, so freeing mmu_notifier_mm after the last mmdrop makes it safe. mmu_notifier_unregister also holds the mm_count, because mm_count was pinned by mmu_notifier_register. That solves the issue with mmu_notifier_mm going away from under mmu_notifier_unregister and _release, and that's why it can only be freed after mm_count == 0.

There's at least one small issue I noticed so far: while _release doesn't need to care about _register, _unregister definitely needs to care about _register. I have to take the mmap_sem in addition to, or in replacement of, the unregister_lock. The srcu_read_lock can also likely be moved to just before releasing the unregister_lock, but that's just a minor optimization to make the code more strict.

On Thu, Apr 24, 2008 at 08:49:40AM +0200, Andrea Arcangeli wrote:
...
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
...
@@ -603,25 +605,39 @@
 	 * readonly mappings. The tradeoff is that copy_page_range is more
 	 * efficient than faulting.
 	 */
+	ret = 0;
 	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
 		if (!vma->anon_vma)
-			return 0;
+			goto out;
 	}
-	if (is_vm_hugetlb_page(vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+	if (unlikely(is_vm_hugetlb_page(vma))) {
+		ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
+		goto out;
+	}
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier_invalidate_range_start(src_mm, addr, end);
+
+	ret = 0;

I don't think this is needed.

It's not needed, right, but I thought it was cleaner if they all used ret after I had to change the code at the end of the function. Anyway I'll delete this to make the patch shorter and only change the minimum, agreed.

...

+/* avoid memory allocations for mm_unlock to prevent deadlock */
+void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
+{
+	if (mm->map_count) {
+		if (data->nr_anon_vma_locks)
+			mm_unlock_vfree(data->anon_vma_locks,
+					data->nr_anon_vma_locks);
+		if (data->i_mmap_locks)

I think you really want data->nr_i_mmap_locks.

Indeed. It never happens that there are zero vmas with filebacked mappings, which is why this couldn't be triggered in practice, thanks!

The second paragraph of this comment seems extraneous.

ok, removed.

+	/*
+	 * Wait ->release if mmu_notifier_unregister run list_del_rcu.
+	 * srcu can't go away from under us because one mm_count is
+	 * hold by exit_mmap.
+	 */

These two sentences don't make any sense to me.

Well, that was a short explanation of why the mmu_notifier_mm structure can only be freed after the last mmdrop, which is what you asked at the top. I'll try to rephrase.
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	int before_release = 0, srcu;
+
+	BUG_ON(atomic_read(&mm->mm_count) <= 0);
+
+	srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+	spin_lock(&mm->mmu_notifier_mm->unregister_lock);
+	if (!hlist_unhashed(&mn->hlist)) {
+		hlist_del_rcu(&mn->hlist);
+		before_release = 1;
+	}
+	spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
+	if (before_release)
+		/*
+		 * exit_mmap will block in mmu_notifier_release to
+		 * guarantee ->release is called before freeing the
+		 * pages.
+		 */
+		mn->ops->release(mn, mm);

I am not certain about the need to do the release callout when the driver has already told this subsystem it is done. For XPMEM, this callout would immediately return. I would expect it to be the same for GRU.

The point is that you don't want to run it twice. And without this you would have to serialize against ->release yourself in the driver. It's much more convenient if you know that ->release will be called just once, and before mmu_notifier_unregister returns. It could be called by _release even after you're already inside _unregister: _release may reach the spinlock before _unregister, and you won't notice the difference. Solving this race in the driver looked too complex, I'd rather solve it once
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Thu, Apr 24, 2008 at 05:39:43PM +0200, Andrea Arcangeli wrote:

There's at least one small issue I noticed so far: while _release doesn't need to care about _register, _unregister definitely needs to care about _register. I have to take the mmap_sem in addition or in

In the end the best is to use the spinlock around those list_add/list_del: they all run in O(1) with the hlist and they take a few asm insns. This also avoids taking the mmap_sem in exit_mmap; at exit_mmap time nobody should need to use mmap_sem anymore. It might work, but this looks cleaner. The lock is dynamically allocated only when the notifiers are registered, so the few bytes taken by it aren't relevant.

A full new update will soon become visible here:

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14-pre3/

Please have a close look again. Your help is extremely appreciated and very helpful as usual! Thanks a lot.

diff -urN xxx/include/linux/mmu_notifier.h xx/include/linux/mmu_notifier.h
--- xxx/include/linux/mmu_notifier.h	2008-04-24 19:41:15.000000000 +0200
+++ xx/include/linux/mmu_notifier.h	2008-04-24 19:38:37.000000000 +0200
@@ -15,7 +15,7 @@
 	struct hlist_head list;
 	struct srcu_struct srcu;
 	/* to serialize mmu_notifier_unregister against mmu_notifier_release */
-	spinlock_t unregister_lock;
+	spinlock_t lock;
 };
 
 struct mmu_notifier_ops {
diff -urN xxx/mm/memory.c xx/mm/memory.c
--- xxx/mm/memory.c	2008-04-24 19:41:15.000000000 +0200
+++ xx/mm/memory.c	2008-04-24 19:38:37.000000000 +0200
@@ -605,16 +605,13 @@
 	 * readonly mappings. The tradeoff is that copy_page_range is more
 	 * efficient than faulting.
 	 */
-	ret = 0;
 	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
 		if (!vma->anon_vma)
-			goto out;
+			return 0;
 	}
-	if (unlikely(is_vm_hugetlb_page(vma))) {
-		ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
-		goto out;
-	}
+	if (is_vm_hugetlb_page(vma))
+		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
 	if (is_cow_mapping(vma->vm_flags))
 		mmu_notifier_invalidate_range_start(src_mm, addr, end);
@@ -636,7 +633,6 @@
 	if (is_cow_mapping(vma->vm_flags))
 		mmu_notifier_invalidate_range_end(src_mm,
 						  vma->vm_start, end);
-out:
 	return ret;
 }
diff -urN xxx/mm/mmap.c xx/mm/mmap.c
--- xxx/mm/mmap.c	2008-04-24 19:41:15.000000000 +0200
+++ xx/mm/mmap.c	2008-04-24 19:38:37.000000000 +0200
@@ -2381,7 +2381,7 @@
 		if (data->nr_anon_vma_locks)
 			mm_unlock_vfree(data->anon_vma_locks,
 					data->nr_anon_vma_locks);
-		if (data->i_mmap_locks)
+		if (data->nr_i_mmap_locks)
 			mm_unlock_vfree(data->i_mmap_locks,
 					data->nr_i_mmap_locks);
 }
diff -urN xxx/mm/mmu_notifier.c xx/mm/mmu_notifier.c
--- xxx/mm/mmu_notifier.c	2008-04-24 19:41:15.000000000 +0200
+++ xx/mm/mmu_notifier.c	2008-04-24 19:31:23.000000000 +0200
@@ -24,22 +24,16 @@
  * zero). All other tasks of this mm already quit so they can't invoke
  * mmu notifiers anymore. This can run concurrently only against
  * mmu_notifier_unregister and it serializes against it with the
- * unregister_lock in addition to RCU. struct mmu_notifier_mm can't go
- * away from under us as the exit_mmap holds a mm_count pin itself.
- *
- * The ->release method can't allow the module to be unloaded, the
- * module can only be unloaded after mmu_notifier_unregister run. This
- * is because the release method has to run the ret instruction to
- * return back here, and so it can't allow the ret instruction to be
- * freed.
+ * mmu_notifier_mm->lock in addition to RCU. struct mmu_notifier_mm
+ * can't go away from under us as exit_mmap holds a mm_count pin
+ * itself.
  */
 void __mmu_notifier_release(struct mm_struct *mm)
 {
 	struct mmu_notifier *mn;
 	int srcu;
 
-	srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
-	spin_lock(&mm->mmu_notifier_mm->unregister_lock);
+	spin_lock(&mm->mmu_notifier_mm->lock);
 	while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
 		mn = hlist_entry(mm->mmu_notifier_mm->list.first,
 				 struct mmu_notifier,
@@ -52,23 +46,28 @@
 		 */
 		hlist_del_init(&mn->hlist);
 		/*
+		 * SRCU here will block mmu_notifier_unregister until
+		 * ->release returns.
+		 */
+		srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+		spin_unlock(&mm->mmu_notifier_mm->lock);
+		/*
 		 * if ->release runs before mmu_notifier_unregister it
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote:

The only other change I did has been to move mmu_notifier_unregister at the end of the patchset after getting more questions about its reliability and I documented a bit the rmmod requirements for ->release. we'll think later if it makes sense to add it, nobody's using it anyway.

XPMEM is using it. GRU will be as well (probably already does).

XPMEM requires more patches anyway. Note that in a previous email you told me you weren't using it. I think GRU can work fine on 2.6.26 without mmu_notifier_unregister, like KVM too. You've simply to unpin the module count in ->release. The most important bit is that you've to do that anyway in case mmu_notifier_unregister fails (and it can fail because of vmalloc space shortage, because somebody loaded some framebuffer driver or whatever).
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, Apr 23, 2008 at 03:36:19PM +0200, Andrea Arcangeli wrote:

On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote: The only other change I did has been to move mmu_notifier_unregister at the end of the patchset after getting more questions about its reliability and I documented a bit the rmmod requirements for ->release. we'll think later if it makes sense to add it, nobody's using it anyway.

XPMEM is using it. GRU will be as well (probably already does).

XPMEM requires more patches anyway. Note that in a previous email you told me you weren't using it. I think GRU can work fine on 2.6.26

I said I could test without it. It is needed for the final version. It also makes the API consistent. What you are proposing is equivalent to having a file you can open but never close.

This whole discussion seems ludicrous. You could refactor the code to get the sorted list of locks, pass that list into mm_lock to do the locking, do the register/unregister, then pass the same list into mm_unlock. If the allocation fails, you could fall back to the older, slower method of repeatedly scanning the lists and acquiring locks in ascending order.

without mmu_notifier_unregister, like KVM too. You've simply to unpin the module count in ->release. The most important bit is that you've to do that anyway in case mmu_notifier_unregister fails (and it can

If you are not going to provide the _unregister callout, you need to change the API so I can scan the list of notifiers to see if my structures are already registered.

We register our notifier structure at device open time. If we receive a _release callout, we mark our structure as unregistered. At device close time, if we have not been unregistered, we call _unregister. If you take away _unregister, I have an xpmem kernel structure in use _AFTER_ the device is closed with no indication that the process is using it.
In that case, I need to get an extra reference to the module in my device open method and hold that reference until the _release callout. Additionally, if the user's program reopens the device, I need to scan the mmu_notifiers list to see if this task's notifier is already registered.

I view _unregister as essential. Did I miss something?

Thanks, Robin
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, Apr 23, 2008 at 09:47:47AM -0500, Robin Holt wrote:

It also makes the API consistent. What you are proposing is equivalent to having a file you can open but never close.

That's not entirely true: you can close the file just fine by killing the task, leading to an mmput. From a user perspective in KVM terms it won't make a difference, as /dev/kvm will remain open and it'll pin the module count until the kvm task is killed anyway; I assume for GRU it's similar. Until I had the idea of how to implement an mm_lock to ensure mmu_notifier_register couldn't miss a running invalidate_range_begin, it wasn't even possible to implement a mmu_notifier_unregister (see the EMM patches), and it looked like you were ok with that API that missed _unregister...

This whole discussion seems ludicrous. You could refactor the code to get the sorted list of locks, pass that list into mm_lock to do the locking, do the register/unregister, then pass the same list into mm_unlock.

Correct, but it would keep the vmalloc ram pinned during the runtime. There's no reason to keep that ram allocated per-VM while the VM runs. We only need it during startup and teardown.

If the allocation fails, you could fall back to the older slower method of repeatedly scanning the lists and acquiring locks in ascending order.

Correct, I already thought about that. This is exactly why I'm deferring this for later! Otherwise perfectionism not needed for KVM/GRU will keep delaying indefinitely the part that has already converged and that's enough for KVM and GRU (and for this specific bit, actually enough for XPMEM as well). We can make a second version, mm_lock_slow, to use if mm_lock fails in mmu_notifier_unregister, with N^2 complexity, later, after mmu-notifier-core is merged into mainline.

If you are not going to provide the _unregister callout you need to change the API so I can scan the list of notifiers to see if my structures are already registered.

As said, 1/N isn't enough for XPMEM anyway.
1/N has to only include the absolute minimum and zero-risk stuff, and that is enough for both KVM and GRU.

We register our notifier structure at device open time. If we receive a _release callout, we mark our structure as unregistered. At device close time, if we have not been unregistered, we call _unregister. If you take away _unregister, I have an xpmem kernel structure in use _AFTER_ the device is closed with no indication that the process is using it. In that case, I need to get an extra reference to the module in my device open method and hold that reference until the _release callout.

Yes exactly, but you've to do that anyway if mmu_notifier_unregister fails because some driver already allocated all vmalloc space (even x86-64 hasn't an indefinite amount of vmalloc, because the vmalloc area is at the end of the address space), unless we've an N^2 fallback. But the N^2 fallback will make the code more easily DoSable and unkillable, so if I were an admin I'd prefer having to quickly kill -9 a task in O(N) than having to wait for some syscall that runs in O(N^2) to complete before the task quits. So the fallback to a slower algorithm isn't necessarily what will really happen after 2.6.26 is released, we'll see. Relying on ->release for the module unpin sounds preferable, and it's certainly the only reliable way to unregister that we'll provide in 2.6.26.

Additionally, if the user's program reopens the device, I need to scan the mmu_notifiers list to see if this task's notifier is already registered.

But you don't need to browse the list for this: keep a flag in your structure after the mmu_notifier struct, set the bitflag after mmu_notifier_register returns, and clear the bitflag after ->release runs or after mmu_notifier_unregister returns success. What's the big deal in tracking whether you've to call mmu_notifier_register a second time or not? Or you can create a new structure every time somebody asks to reattach.

I view _unregister as essential. Did I miss something?
We can add it later, and we can keep discussing what's the best model to implement it as long as you want after 2.6.26 is released with mmu-notifier-core, so GRU/KVM are done. It's unlikely KVM will use mmu_notifier_unregister anyway, as we need it attached for the whole lifetime of the task, and only for the lifetime of the task. This is the patch to add it; as you can see it's entirely orthogonal, backwards compatible with the previous API, and it doesn't duplicate or rewrite any code. Don't worry, any kernel after 2.6.26 will have unregister, but we can't focus on this for 2.6.26. We can also consider making mmu_notifier_register safe against double calls on the same structure, but again that's not something we should be doing in 1/N and it can be done later in a backwards compatible way (plus we're perfectly fine with the API having non-backwards-compatible changes as long as 2.6.26 can work for us).

- Implement
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 04:20:35PM -0700, Christoph Lameter wrote:

I guess I have to prepare another patchset then?

If you want to embarrass yourself three times in a row go ahead ;). I thought two failed takeovers were enough.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 07:28:49PM -0500, Jack Steiner wrote:

The GRU driver unregisters the notifier when all GRU mappings are unmapped. I could make it work either way - either with or without an unregister function. However, unregister is the most logical action to take when all mappings have been destroyed.

This is true for KVM as well: unregister would be the most logical action to take when the kvm device is closed and the vm destroyed. However we can't implement mm_lock in O(N*log(N)) without triggering RAM allocations. And the size of those ram allocations is unknown at the time unregister runs (they also depend on the max_nr_vmas sysctl). So on second thought, not even passing the array from register to unregister would solve it (unless we allocate max_nr_vmas entries and block the sysctl from altering max_nr_vmas until all unregisters have run). That's clearly unacceptable.

The only way to avoid failing because of vmalloc space shortage or oom would be to provide an O(N*N) fallback. But one that can't be interrupted by sigkill! sigkill interruption was ok in #v12 because we didn't rely on mmu_notifier_unregister to succeed. So it avoided any DoS but it still couldn't provide any reliable unregister.

So in the end, unregistering with kill -9 leading to ->release in O(1) sounds like the safer solution for the long term. You can't loop if unregister fails and pretend your module has no deadlocks. Yes, waiting for ->release adds a bit of complexity but I think it's worth it, and there weren't genial ideas on how to avoid the O(N*N) complexity and the allocations in mmu_notifier_unregister yet. Until that genius idea materializes we'll stick with ->release in O(1) as the only safe unregister, so we guarantee the admin will be in control of his hardware in O(1) with kill -9, no matter if /dev/kvm and /dev/gru are owned by sillyuser.
I'm afraid if you don't want to worst-case unregister with ->release, you need to have a better idea than my mm_lock, and personally I can't see any other way than mm_lock to ensure not to miss range_begin... All the above is in 2.6.27 context (for 2.6.26 ->release is the way, even if the genius idea were to materialize).
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
You may have spotted this already. If so, just ignore this.

It looks like there is a bug in copy_page_range() around line 667. It's possible to do a mmu_notifier_invalidate_range_start(), then return -ENOMEM w/o doing a corresponding mmu_notifier_invalidate_range_end().

--- jack
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, Apr 23, 2008 at 06:26:29PM +0200, Andrea Arcangeli wrote:

On Tue, Apr 22, 2008 at 04:20:35PM -0700, Christoph Lameter wrote: I guess I have to prepare another patchset then?

Apologies for my previous not too polite comment in answer to the above, but I thought this double-patchset business was over now that you converged on #v12 and obsoleted EMM, and after the last private discussions. There's nothing personal here on my side, just a bit of general frustration on this matter. I appreciate all the great contributions from you, most recently your idea to use sort(), but I can't really see any possible benefit or justification anymore for keeping two patchsets floating around, given we already converged on the mmu-notifier-core, and given it's almost certain mmu-notifier-core will go into -mm in time for 2.6.26. Let's put it this way: if I fail to merge mmu-notifier-core into 2.6.26 I'll voluntarily give up my entire patchset and leave maintainership to you, so you can move 1/N to N/N and remove the mm_lock-sem patch (everything else can remain the same as it's all orthogonal, so changing the order is a matter of minutes).
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, Apr 23, 2008 at 12:09:09PM -0500, Jack Steiner wrote:

You may have spotted this already. If so, just ignore this. It looks like there is a bug in copy_page_range() around line 667. It's possible to do a mmu_notifier_invalidate_range_start(), then return -ENOMEM w/o doing a corresponding mmu_notifier_invalidate_range_end().

No, I didn't spot it yet, great catch!! ;) Thanks a lot.

I think we can take example from Jack and use our energy to spot any bug in the mmu-notifier-core, as with his auditing effort above (I'm quite certain you didn't reproduce this with a real oom ;), so we get a rock solid mmu-notifier implementation in 2.6.26, so XPMEM will also benefit later in 2.6.27, and I hope the last XPMEM internal bugs will also be fixed by that time. (For those not going to become mmu-notifier users, there's nothing to worry about: unless you used KVM or GRU actively with mmu-notifiers, this bug would be entirely harmless, with both MMU_NOTIFIER=n and =y, as previously guaranteed.)

Here the still-untested fix for review.

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -597,6 +597,7 @@
 	unsigned long next;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
+	int ret;
 
 	/*
 	 * Don't copy ptes where a page fault will fill them correctly.
@@ -604,33 +605,39 @@
 	 * readonly mappings. The tradeoff is that copy_page_range is more
 	 * efficient than faulting.
 	 */
+	ret = 0;
 	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
 		if (!vma->anon_vma)
-			return 0;
+			goto out;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+	if (unlikely(is_vm_hugetlb_page(vma))) {
+		ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
+		goto out;
+	}
 
 	if (is_cow_mapping(vma->vm_flags))
 		mmu_notifier_invalidate_range_start(src_mm, addr, end);
 
+	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
-		if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-				   vma, addr, next))
-			return -ENOMEM;
+		if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+					    vma, addr, next))) {
+			ret = -ENOMEM;
+			break;
+		}
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow_mapping(vma->vm_flags))
 		mmu_notifier_invalidate_range_end(src_mm,
-						  vma->vm_start, end);
-
-	return 0;
+						  vma->vm_start, end);
+out:
+	return ret;
 }
 
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

Implement unregister, but it's not reliable, only ->release is reliable.

Why is there still the hlist stuff being used for the mmu notifier list? And why is this still unsafe? Are there cases in which you do not take the reverse map locks or mmap_sem while traversing the notifier list?

This hope for inclusion without proper review (first for .25, now for .26) seems to interfere with the patch cleanup work and causes delay after delay in getting the patch ready. On what basis do you think that there is a chance of any of these patches making it into 2.6.26, given that this patchset has never been vetted in Andrew's tree?
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

On Tue, Apr 22, 2008 at 04:20:35PM -0700, Christoph Lameter wrote: I guess I have to prepare another patchset then?

If you want to embarrass yourself three times in a row go ahead ;). I thought two failed takeovers was enough.

Takeover? I'd be happy if I did not have to deal with this issue. These patches were necessary because you were not listening to feedback, plus there is the issue that your patchsets were not easy to review or diff against. I had to merge several patches to get to a useful patch. You have always picked up lots of stuff from my patchsets. Lots of work that could have been avoided by proper patchsets in the first place.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

The only way to avoid failing because of vmalloc space shortage or oom would be to provide an O(N*N) fallback. But one that can't be interrupted by sigkill! sigkill interruption was ok in #v12 because we didn't rely on mmu_notifier_unregister to succeed. So it avoided any DoS but it still can't provide any reliable unregister.

If unregister fails then the driver should not detach from the address space immediately but wait until ->release is called. That may be a possible solution. It will be rare that the unregister fails.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, Apr 23, 2008 at 11:09:35AM -0700, Christoph Lameter wrote:

Why is there still the hlist stuff being used for the mmu notifier list? And why is this still unsafe?

What's the problem with hlist? It saves 8 bytes for each mm_struct; you should be using it too instead of list.

Are there cases in which you do not take the reverse map locks or mmap_sem while traversing the notifier list?

There aren't.

This hope for inclusion without proper review (first for .25, now for .26) seems to interfere with the patch cleanup work and causes delay after delay in getting the patch ready. On what basis do you think that there is a chance of any of these patches making it into 2.6.26, given that this patchset has never been vetted in Andrew's tree?

Let's say I try to be optimistic and hope the right thing will happen, given this is like a new driver that can't hurt anybody but KVM and GRU if there's any bug. But in my view what interferes with proper review for .26 are the endless discussions we're doing ;).
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

will go in -mm in time for 2.6.26. Let's put it this way, if I fail to merge mmu-notifier-core into 2.6.26 I'll voluntarily give up my entire patchset and leave maintainership to you so you move 1/N to N/N and remove mm_lock-sem patch (everything else can remain the same as it's all orthogonal so changing the order is a matter of minutes).

No, I really want you to do this. I have no interest in a takeover in the future and did the EMM stuff only because I saw no other way forward. I just want this done the right way for all parties, with patches that are nice and mergeable.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, Apr 23, 2008 at 11:19:26AM -0700, Christoph Lameter wrote:

> If unregister fails then the driver should not detach from the address
> space immediately but wait until ->release is called. That may be a
> possible solution. It will be rare that the unregister fails.

This is the current idea, exactly. Unless we find a way to replace
mm_lock with something else, I don't see a way to make
mmu_notifier_unregister reliable without wasting ram.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

> On Wed, Apr 23, 2008 at 11:09:35AM -0700, Christoph Lameter wrote:
> > Why is there still the hlist stuff being used for the mmu notifier
> > list? And why is this still unsafe?
>
> What's the problem with hlist? It saves 8 bytes for each mm_struct;
> you should be using it too instead of list.

List heads in mm_struct and in the mmu_notifier struct seemed to be
more consistent. We have no hash list after all.

> > There are cases in which you do not take the reverse map locks or
> > mmap_sem while traversing the notifier list?
>
> There aren't.

There is a potential issue in move_ptes where you call
invalidate_range_end after dropping i_mmap_sem, whereas my patches did
the opposite. Does mmap_sem save you there?
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, Apr 23, 2008 at 11:21:49AM -0700, Christoph Lameter wrote:

> No I really want you to do this. I have no interest in a takeover in
> the

Ok, if you want me to do this, I definitely prefer the core to go in
now. It's so much easier to concentrate on two problems at different
times than to attack both problems at the same time, given they're
mostly completely orthogonal. Given we already solved one problem, I'd
like to close it before concentrating on the second one.

I already told you it was my interest to support XPMEM too. For
example, it was me who noticed we couldn't possibly remove the
can_sleep parameter from invalidate_range without altering the locking,
as vmas were unstable outside of one of the three core vm locks. That
finding resulted in much bigger patches than we hoped (like Andrew had
previously sort of predicted), and you did all the great work to
develop those.

For my part, once the converged part is in, it'll be a lot easier to
fully concentrate on the rest. My main focus right now is to produce a
mmu-notifier-core that is entirely bug free for .26.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, Apr 23, 2008 at 11:27:21AM -0700, Christoph Lameter wrote:

> There is a potential issue in move_ptes where you call
> invalidate_range_end after dropping i_mmap_sem whereas my patches did
> the opposite. Mmap_sem saves you there?

Yes, there's really no risk of races in this area after introducing
mm_lock; any place that mangles ptes and doesn't hold any of the three
locks is buggy anyway. I appreciate the audit work (I also did it and
couldn't find bugs, but the more eyes the better).
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

> Yes, there's really no risk of races in this area after introducing
> mm_lock, any place that mangles ptes and doesn't hold any of the three
> locks is buggy anyway. I appreciate the audit work (I also did it and
> couldn't find bugs but the more eyes the better).

I guess I would need to merge some patches together somehow to be able
to review them properly like I did before, sigh. I have not reviewed
the latest code completely.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, Apr 23, 2008 at 06:37:13PM +0200, Andrea Arcangeli wrote:

> I'm afraid if you don't want to worst-case unregister with ->release
> you need to have a better idea than my mm_lock, and personally I can't
> see any other way than mm_lock to ensure not to miss range_begin...

But wait: mmu_notifier_register absolutely requires mm_lock to ensure
that when kvm->arch.mmu_notifier_invalidate_range_count is zero (large
variable name, it'll get shorter, but this is to explain), really no
cpu is in the middle of a range_begin/end critical section. That's why
we have to take all the mm locks. But we couldn't care less if we
unregister in the middle: unregister only needs to be sure that no cpu
could possibly still be using the ram of the notifier allocated by the
driver before returning.

So I'll implement unregister in O(1) and without ram allocations using
srcu, and that'll fix all issues with unregister. It'll return void to
make it crystal clear it can't fail.

It turns out unregister will make life easier for kvm as well, mostly
to simplify the teardown of the /dev/kvm closure. Given this can be
considered a bugfix to mmu_notifier_unregister, I'll apply it to 1/N
and I'll release a new mmu-notifier-core patch for you to review before
I resend to Andrew before Saturday. Thanks!
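[The SRCU idea described above can be sketched in kernel-style
pseudocode. This is a hypothetical sketch only: the
mmu_notifier_lock and mmu_notifier_srcu fields on mm_struct are
illustrative assumptions, not the layout that was eventually merged.]

```c
/* Hypothetical sketch -- field names are assumptions, not the
 * real mm_struct layout. */
void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
{
	spin_lock(&mm->mmu_notifier_lock);
	hlist_del_rcu(&mn->hlist);	/* O(1), no memory allocation */
	spin_unlock(&mm->mmu_notifier_lock);

	/*
	 * Wait for every cpu that entered the notifier list under
	 * srcu_read_lock() to leave its read-side critical section;
	 * after this returns the driver may free mn. Nothing here
	 * can fail, hence the void return.
	 */
	synchronize_srcu(&mm->mmu_notifier_srcu);
}
```

[Unlike mm_lock, this never needs to quiesce range_begin/end callers,
because unregister only has to guarantee the notifier memory is no
longer referenced, not that no invalidate is in flight.]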
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:

> Andrea Arcangeli a écrit :
> > +static int mm_lock_cmp(const void *a, const void *b)
> > +{
> > +	cond_resched();
> > +	if ((unsigned long)*(spinlock_t **)a <
> > +	    (unsigned long)*(spinlock_t **)b)
> > +		return -1;
> > +	else if (a == b)
> > +		return 0;
> > +	else
> > +		return 1;
> > +}
>
> This compare function looks unusual... It should work, but sort()
> could be faster if the "if (a == b)" test had a chance to be true
> eventually...

Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

> static int mm_lock_cmp(const void *a, const void *b)
> {
> 	unsigned long la = (unsigned long)*(spinlock_t **)a;
> 	unsigned long lb = (unsigned long)*(spinlock_t **)b;
>
> 	cond_resched();
> 	if (la < lb)
> 		return -1;
> 	if (la > lb)
> 		return 1;
> 	return 0;
> }

If your intent is to use the assumption that there are going to be few
equal entries, you should have used likely(la < lb) to signal that it's
rarely going to return zero, or gcc is likely free to do whatever it
wants with the above. Overall that function is such a slow path that
this is going to be lost in the noise. My suggestion would be to defer
microoptimizations like this until after 1/12 is applied to mainline.
Thanks!
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
Andrea Arcangeli wrote:

> On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
> > Andrea Arcangeli a écrit :
> > > +static int mm_lock_cmp(const void *a, const void *b)
> > > +{
> > > +	cond_resched();
> > > +	if ((unsigned long)*(spinlock_t **)a <
> > > +	    (unsigned long)*(spinlock_t **)b)
> > > +		return -1;
> > > +	else if (a == b)
> > > +		return 0;
> > > +	else
> > > +		return 1;
> > > +}
> >
> > This compare function looks unusual... It should work, but sort()
> > could be faster if the "if (a == b)" test had a chance to be true
> > eventually...
>
> Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

You need to compare *a to *b (at least, that's what you're doing for
the < case).

-- 
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
Andrea Arcangeli a écrit :

> On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
> > Andrea Arcangeli a écrit :
> > > +static int mm_lock_cmp(const void *a, const void *b)
> > > +{
> > > +	cond_resched();
> > > +	if ((unsigned long)*(spinlock_t **)a <
> > > +	    (unsigned long)*(spinlock_t **)b)
> > > +		return -1;
> > > +	else if (a == b)
> > > +		return 0;
> > > +	else
> > > +		return 1;
> > > +}
> >
> > This compare function looks unusual... It should work, but sort()
> > could be faster if the "if (a == b)" test had a chance to be true
> > eventually...
>
> Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

I am saying your intent was probably to test

	else if ((unsigned long)*(spinlock_t **)a ==
		 (unsigned long)*(spinlock_t **)b)
		return 0;

because a and b are pointers to the data you want to compare; you need
to dereference them.

> static int mm_lock_cmp(const void *a, const void *b)
> {
> 	unsigned long la = (unsigned long)*(spinlock_t **)a;
> 	unsigned long lb = (unsigned long)*(spinlock_t **)b;
>
> 	cond_resched();
> 	if (la < lb)
> 		return -1;
> 	if (la > lb)
> 		return 1;
> 	return 0;
> }
>
> If your intent is to use the assumption that there are going to be few
> equal entries, you should have used likely(la < lb) to signal it's
> rarely going to return zero or gcc is likely free to do whatever it
> wants with the above. Overall that function is such a slow path that
> this is going to be lost in the noise. My suggestion would be to defer
> microoptimizations like this until after 1/12 is applied to mainline.
> Thanks!

Hum, it's not a micro-optimization, but a bug fix. :) Sorry if it was
not clear.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 05:37:38PM +0200, Eric Dumazet wrote:

> I am saying your intent was probably to test
>
> 	else if ((unsigned long)*(spinlock_t **)a ==
> 		 (unsigned long)*(spinlock_t **)b)
> 		return 0;

Indeed...

> Hum, it's not a micro-optimization, but a bug fix. :)

The good thing is that even if this bug could lead to a system crash,
it would still be zero risk for everybody not actively using KVM/GRU
with mmu notifiers. The important thing is that this patch has zero
risk of introducing regressions into the kernel, both when enabled and
disabled; it's like a new driver.

I'll shortly resend 1/12 and likely 12/12 for theoretical correctness.
For now you can go ahead testing with this patch, as it'll work fine
despite the bug (if it wasn't the case I would have noticed already ;).
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
Andrea Arcangeli a écrit :

> +static int mm_lock_cmp(const void *a, const void *b)
> +{
> +	cond_resched();
> +	if ((unsigned long)*(spinlock_t **)a <
> +	    (unsigned long)*(spinlock_t **)b)
> +		return -1;
> +	else if (a == b)
> +		return 0;
> +	else
> +		return 1;
> +}

This compare function looks unusual... It should work, but sort()
could be faster if the "if (a == b)" test had a chance to be true
eventually...

static int mm_lock_cmp(const void *a, const void *b)
{
	unsigned long la = (unsigned long)*(spinlock_t **)a;
	unsigned long lb = (unsigned long)*(spinlock_t **)b;

	cond_resched();
	if (la < lb)
		return -1;
	if (la > lb)
		return 1;
	return 0;
}
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
Thanks for adding most of my enhancements. But:

1. There is no real need for invalidate_page(). It can be done with
   invalidate_start/end and needlessly complicates the API. One of the
   objections by Andrew was that there were multiple callbacks that
   perform similar functions.

2. The locks that are used are later changed to semaphores. This is
   f.e. true for mm_lock / mm_unlock. The diffs will be smaller if the
   lock conversion is done first and then mm_lock is introduced. The
   way the patches are structured means that reviewers cannot review
   the final version of mm_lock etc. The lock conversion needs to come
   first.

3. As noted by Eric, and also contained in a private post from
   yesterday by me: the cmp function needs to retrieve the values
   before doing comparisons, which is not done for the == of a and b.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 01:19:29PM -0700, Christoph Lameter wrote:

> 1. There is no real need for invalidate_page(). Can be done with
> invalidate_start/end. Needlessly complicates the API. One of the
> objections by Andrew was that there were multiple callbacks that
> perform similar functions.

While I agree with that reading of Andrew's email about
invalidate_page, I think the GRU hardware makes a strong enough case to
justify the two separate callouts.

Due to the GRU hardware, we can assure that invalidate_page terminates
all pending GRU faults (including faults that are just beginning) and
can therefore be completed without needing any locking. The
invalidate_page() callout gets turned into a GRU flush instruction and
we return. Because invalidate_range_start() leaves the page table
information available, we cannot use a single-page _start to mimic that
functionality. Therefore, there is a documented case justifying the
separate callouts.

I agree the case is fairly weak, but it does exist. Given Andrea's
unwillingness to move and Jack's documented case, it is my opinion the
most likely compromise is to leave in the invalidate_page() callout.

Thanks,
Robin
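[The distinction Robin draws can be sketched in kernel-style
pseudocode. The call sites are simplified and the hook names follow
the patchset under discussion; everything else is illustrative, not
the merged code.]

```c
/* Single-page case: the callout fires as one known pte is torn
 * down, so the GRU driver can translate it directly into one
 * hardware flush instruction and return without taking locks. */
ptep_clear_flush(vma, address, ptep);
mmu_notifier_invalidate_page(mm, address);

/* Range case: the driver learns "something in [start, end) is going
 * away" at _start time and "it is gone" at _end time; between the
 * two callouts it must block or shoot down its own mappings itself,
 * which is the extra work the single invalidate_page callout lets
 * GRU avoid. */
mmu_notifier_invalidate_range_start(mm, start, end);
/* ... zap the ptes in [start, end) ... */
mmu_notifier_invalidate_range_end(mm, start, end);
```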
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 01:19:29PM -0700, Christoph Lameter wrote:

> 3. As noted by Eric and also contained in a private post from
> yesterday by me: The cmp function needs to retrieve the value before
> doing comparisons which is not done for the == of a and b.

I retrieved the value, which is why mm_lock works perfectly on #v13 as
well as #v12. It's not mandatory to ever return 0, so it won't produce
any runtime error (there is a bugcheck for wrong sort ordering in my
patch just in case it would generate any runtime error, and it never
did, or I would have noticed before submission), which is why I didn't
need to release any hotfix yet and I'm waiting for more comments before
sending an update to clean up that bit. Mentioning this as the third
and last point I guess shows how strong your arguments against merging
my mmu-notifier-core now are, so in the end making that cosmetic error
paid off somehow.

I'll send an update in any case to Andrew way before Saturday, so
hopefully we'll finally get mmu-notifier-core merged before next week.
Also, I'm not updating my mmu-notifier-core patch anymore except for
strict bugfixes, so don't worry about any more cosmetic bugs being
introduced while optimizing the code like happened this time.

The only other change I did has been to move mmu_notifier_unregister to
the end of the patchset after getting more questions about its
reliability, and I documented a bit the rmmod requirements for
->release. We'll think later if it makes sense to add it; nobody's
using it anyway.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
> The only other change I did has been to move mmu_notifier_unregister
> to the end of the patchset after getting more questions about its
> reliability, and I documented a bit the rmmod requirements for
> ->release. We'll think later if it makes sense to add it; nobody's
> using it anyway.

XPMEM is using it. GRU will be as well (probably already does).
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

> I'll send an update in any case to Andrew way before Saturday so
> hopefully we'll finally get mmu-notifier-core merged before next week.
> Also I'm not updating my mmu-notifier-core patch anymore except for
> strict bugfixes so don't worry about any more cosmetic bugs being
> introduced while optimizing the code like it happened this time.

I guess I have to prepare another patchset then?
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote:

> > The only other change I did has been to move mmu_notifier_unregister
> > to the end of the patchset after getting more questions about its
> > reliability, and I documented a bit the rmmod requirements for
> > ->release. We'll think later if it makes sense to add it; nobody's
> > using it anyway.
>
> XPMEM is using it. GRU will be as well (probably already does).

Yeppp. The GRU driver unregisters the notifier when all GRU mappings
are unmapped. I could make it work either way, with or without an
unregister function. However, unregister is the most logical action to
take when all mappings have been destroyed.

--- jack