Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls
On Tue, Apr 01, 2008 at 01:55:32PM -0700, Christoph Lameter wrote:
> +/* Perform a callback */
> +int __emm_notify(struct mm_struct *mm, enum emm_operation op,
> +		unsigned long start, unsigned long end)
> +{
> +	struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
> +	int x;
> +
> +	while (e) {
> +
> +		if (e->callback) {
> +			x = e->callback(e, mm, op, start, end);
> +			if (x)
> +				return x;

There are much bigger issues besides the rcu safety in this patch.
Proper aging of the secondary mmu through access bits set by hardware
is unfixable with this model (you would need to do age |=
e->callback(...)), which is proof that this isn't flexible enough:
it forces the same parameters and retval semantics on all methods. I
see no reason to go for such an inferior solution that will never get
the aging right and will likely fall apart if we add more methods in
the future. For example the switch you have to add in
xpmem_emm_notifier_callback doesn't look good; at best gcc may be able
to optimize it with array indexing, simulating a proper pointer to
function like in #v9.

Most other patches will apply cleanly on top of my coming mmu
notifiers #v10 that I hope will go in -mm. For #v10 the only two open
issues left to discuss are:

1) The moment you remove rcu_read_lock from the methods (my #v9 had
rcu_read_lock, so synchronize_rcu() in Jack's patch was working with
my #v9), GRU has no way to ensure the methods will fire immediately
after registering. To fix this race after removing the rcu_read_lock
(to prepare for the later patches that allow the VM to schedule while
the mmu notifier methods are invoked) I can replace rcu_read_lock with
seqlock locking in the same way as I did in a previous patch posted
here (seqlock_write around the registration method, and seqlock_read
replaying all callbacks if the race happened). Then synchronize_rcu
becomes unnecessary and the methods will be correctly replayed,
allowing GRU not to corrupt memory after the registration method. EMM
would also need a fix like this for GRU to be safe on top of EMM.

Another less obviously safe approach is to allow the register method
to succeed only when mm_users == 1 and the task is single threaded. If
all the places where the mmu notifiers aren't invoked on the mm by the
current task only do invalidates after/before zapping ptes, and the
instantiation of new ptes is single threaded too, we shouldn't worry
if we miss an invalidate for a pte that is zero and doesn't point to
any physical page. In the places where current->mm != mm I'm using
invalidate_page 99% of the time, and that only follows the
ptep_clear_flush. The problem is the range_begin calls that happen
before zapping the pte in places where current->mm != mm.
Unfortunately in my incremental patch, where I move all
invalidate_page outside of the PT lock to prepare for allowing
sleeping inside the mmu notifiers, I used range_begin/end in places
like try_to_unmap_cluster where current->mm != mm. In general this
solution looks more fragile than the seqlock.

2) I'm uncertain how the driver can handle a range_end called before
range_begin. Also multiple range_begin can happen in parallel, later
followed by range_end, so if there's a global seqlock that serializes
the secondary mmu page fault, that will screw up (you can't
seqlock_write in range_begin and sequnlock_write in range_end). The
write side of the seqlock must be serialized, and calling
seqlock_write twice in a row before any sequnlock operation will
break.

A recursive rwsem taken in range_begin and released in range_end seems
to be the only way to stop the secondary mmu page faults. If I removed
all range_begin/end in places where current->mm != mm, then I could as
well bail out in mmu_notifier_register if mm_users != 1 to solve
problem 2 too.

My solution to this is that I believe the driver is safe if range_end
is missed, provided range_end is always followed by an invalidate
event like in invalidate_range_end. So the driver is ok to just keep a
static value that records whether range_begin has ever happened, and
to return from range_end without doing anything if no range_begin ever
happened. Notably I'll be trying to use range_begin in KVM too, so I
have to deal with 2) as well.

For Nick: the reason for using range_begin is supposedly an
optimization: to guarantee that the last free of the page will happen
outside the mmu_lock, so KVM internally to the mmu_lock is free to do:

	spin_lock(&kvm->mmu_lock)
	put_page()
	spte = nonpresent
	flush secondary tlb()
	spin_unlock(&kvm->mmu_lock)

The above ordering is unsafe if the page could ever reach the freelist
before the tlb flush happened. The range_begin will take the mmu_lock
and will hold off
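To make 1) concrete, here is a minimal sketch of that seqlock scheme
(illustrative names only, not code from either patchset): registration
bumps the write side of a per-mm seqlock, and the invalidate path
replays its walk if it raced with a registration, so a freshly
registered notifier cannot miss a flush and synchronize_rcu() becomes
unnecessary.

	#include <linux/seqlock.h>

	/* Illustrative structures; the real series hangs these off mm_struct. */
	struct mmu_notifier {
		struct mmu_notifier *next;
		void (*invalidate_page)(struct mmu_notifier *mn,
					unsigned long address);
	};

	struct notifier_state {
		seqlock_t lock;			/* write side: registration only */
		struct mmu_notifier *head;
	};

	/* Registration: bump the seqlock so a concurrent invalidate notices. */
	static void notifier_register(struct notifier_state *s,
				      struct mmu_notifier *mn)
	{
		write_seqlock(&s->lock);
		mn->next = s->head;
		s->head = mn;
		write_sequnlock(&s->lock);
	}

	/* Invalidate: if a registration slipped in while the list was
	 * walked, replay the walk so the new notifier gets the flush it
	 * raced with. */
	static void notifier_invalidate_page(struct notifier_state *s,
					     unsigned long address)
	{
		unsigned int seq;
		struct mmu_notifier *mn;

		do {
			seq = read_seqbegin(&s->lock);
			for (mn = s->head; mn; mn = mn->next)
				mn->invalidate_page(mn, address);
		} while (read_seqretry(&s->lock, seq));
	}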
Re: [kvm-devel] Ubuntu Gutsy host / XP guest / -smp 2
David Abrahams wrote:
> With the title combination, the guest takes nearly 100% of my real
> CPU time and still only sees one CPU. Is this a known problem, and
> does it have a known solution?

Can you send the output of 'kvm_stat -1'? Also, what does the guest
think it is doing (in task manager, etc.)? What HAL do you see in
device manager?

-- 
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [PATCH] Fix endianness for virtio-blk config space
Anthony Liguori wrote:
> The virtio config space is little endian. Make sure that in
> virtio-blk we store the values in little endian format.

Applied, thanks.

-- 
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
On Wed, Apr 02, 2008 at 07:32:35AM +0300, Avi Kivity wrote:
> It ought to work. gfn_to_hfn() (old gfn_to_page) will still need to
> take a refcount if possible.

This reminds me that with mmu notifiers we could implement gfn_to_hfn
with only follow_page and skip the refcounting on the struct page. I'm
not suggesting that though: the refcount makes the code more robust
IMHO, and notably it allows KVM to run on kernels without mmu
notifiers.

> So the trick is to do something like
>
>	if (pfn_valid(pfn))
>		put_page(pfn_to_page(pfn));

That's the trick I was suggesting to take care of mmio space. If the
reserved ram is used instead of VT-d to allow DMA, that will have to
change to:

	if (pfn_valid(pfn)) {
		page = pfn_to_page(pfn);
		if (page_count(page))
			put_page(page);
	}

so that it takes care of both the reserved ram and the mmio pfn. In
general I made it for the reserved-ram with only the page_count != 0
check, because there 'struct page' existed. I'm unsure if it's good to
add struct pages for non-ram: I find it slightly confusing and not the
right thing. It takes memory for stuff that can't happen (refcounting
only makes sense if the page finally goes in the freelist when the
count reaches 0, and the PG_dirty/referenced bits and the like don't
make sense either for non-ram or reserved-ram). So I'm not sure the
vmemmap trick is the best.

> This will increase the potential for type errors, so maybe we need to
> make gfn_t and hfn_t distinct structures, and use accessors to get
> the actual values.

That's also a possibility. If we go for this then hfn_t is probably a
better name as it's less likely to collide with core kernel VM code.
Otherwise perhaps pfn can be used instead of hfn, it's up to you ;).
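A sketch of the release helper this implies (kvm_release_pfn is a
hypothetical name; the pfn_valid/page_count checks are the ones from
the message, the rest is assumption):

	/* Drop whatever reference we may hold on a guest frame.  Pure
	 * mmio pfns have no struct page at all; reserved-ram pfns have
	 * one but are never refcounted (page_count == 0); only ordinary
	 * ram is put. */
	static void kvm_release_pfn(unsigned long pfn)
	{
		struct page *page;

		if (!pfn_valid(pfn))
			return;			/* mmio: no struct page */
		page = pfn_to_page(pfn);
		if (page_count(page))
			put_page(page);		/* ordinary ram */
	}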
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Andrea Arcangeli wrote:
> On Wed, Apr 02, 2008 at 07:32:35AM +0300, Avi Kivity wrote:
>> It ought to work. gfn_to_hfn() (old gfn_to_page) will still need to
>> take a refcount if possible.
>
> This reminds me that with mmu notifiers we could implement gfn_to_hfn
> with only follow_page and skip the refcounting on the struct page.
> I'm not suggesting that though: the refcount makes the code more
> robust IMHO, and notably it allows KVM to run on kernels without mmu
> notifiers.

Isn't it faster though? We don't need to pull in the cacheline
containing the struct page anymore. We could hack something to make
pre mmu notifier kernels work.

> I'm unsure if it's good to add struct pages for non-ram: I find it
> slightly confusing and not the right thing. It takes memory for stuff
> that can't happen (refcounting only makes sense if the page finally
> goes in the freelist when the count reaches 0, and the
> PG_dirty/referenced bits and the like don't make sense either for
> non-ram or reserved-ram). So I'm not sure the vmemmap trick is the
> best.

I thought we'd meet with resistance to that idea :) though I'd like to
point out that struct pages do exist for mmio on some machines (those
with >= 4GB).

>> This will increase the potential for type errors, so maybe we need
>> to make gfn_t and hfn_t distinct structures, and use accessors to
>> get the actual values.
>
> That's also a possibility. If we go for this then hfn_t is probably a
> better name as it's less likely to collide with core kernel VM code.
> Otherwise perhaps pfn can be used instead of hfn, it's up to you ;).

I guess we can move to pfn, they're unambiguous enough.

-- 
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
On Wed, Apr 02, 2008 at 12:50:50PM +0300, Avi Kivity wrote:
> Isn't it faster though? We don't need to pull in the cacheline
> containing the struct page anymore.

Exactly. Not only that: get_user_pages is likely a bit slower than we
need for just the kvm pte lookup. GRU uses follow_page directly
because it runs in the tlb miss handler; for us instead the tlb miss
handler doesn't invoke a page fault unless the spte is non present. I
expect for us that optimization will be mostly lost in the noise, but
it is likely measurable in heavy VM workloads, and in general it is
surely worth it in the long run (if nothing else as a
micro-optimization for whatever spte fault benchmark).

> We could hack something to make pre mmu notifier kernels work.

Actually I already removed the refcounting for the reserved ram, so
it's going to be very easy to do with a few CONFIG_MMU_NOTIFIER
ifdefs. But because it's low priority I would defer it for later; the
first patch can do the refcounting unconditionally to better test it.
(I'm assuming we're going to deal with kernels without mmu notifiers
[or without EMM ;)] for a long while.)

> I thought we'd meet with resistance to that idea :) though I'd like
> to point out that struct pages do exist for mmio on some machines
> (those with >= 4GB).

Sure, those should have page_count 0 like the reserved ram. I'm
uncertain about the PageReserved semantics; I thought Nick nuked it
long ago but perhaps it was something else. Now copy_page_range bails
out on PFNMAP vmas; in the old days it would treat PG_reserved pages
mapped with remap_page_range like VM_SHARED, by not wrprotecting and
not refcounting even if this was a private mapping. In general using
page_t for anything but ram is something that should be avoided, and
that also makes the PG_reserved semantics mostly irrelevant. The users
of remap_pfn_range (like /dev/mem) should never use pages, regardless
of whether it's ram or non-ram or whether page_t exists or not. The
fact page_t sometimes exists for non-ram is strictly a performance
optimization to make the pfn_to_page resolution quicker at runtime
(the speedup is significant and the memory loss is negligible).

> I guess we can move to pfn, they're unambiguous enough.

Ok! ;)
Re: [kvm-devel] [RFC/PATCH 05/15 v3] kvm-s390: s390 arch backend for the kvm kernel module
On Monday, 31 March 2008, Arnd Bergmann wrote:

Hello Arnd,
thanks for the review.

> On Tuesday 25 March 2008, Carsten Otte wrote:
>> +static inline void __user *__guestaddr_to_user(struct kvm_vcpu *vcpu,
>> +					       u64 guestaddr)
>> +{
>> +	u64 prefix  = vcpu->arch.sie_block->prefix;
>> +	u64 origin  = vcpu->kvm->arch.guest_origin;
>> +	u64 memsize = vcpu->kvm->arch.guest_memsize;
>> +
>> +	if (guestaddr < 2 * PAGE_SIZE)
>> +		guestaddr += prefix;
>> +	else if ((guestaddr >= prefix) && (guestaddr < prefix + 2 * PAGE_SIZE))
>> +		guestaddr -= prefix;
>
> What happens if prefix is set to 4096? Does this do the right thing
> according to the architecture definition?

The z/Architecture and the instructions (spx + sigp set prefix) don't
allow 4096 as a prefix address. When setting a prefix, bits 1-18 (IBM
numbering scheme) are used and appended with 13 zeros to the right.
That means the prefix address is always a multiple of 8192.
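As a concrete illustration of that rule (my reading of the
architecture, not code from the patch):

	/* Hypothetical helper: bits 1-18 (IBM numbering, bit 0 = MSB)
	 * of the 32-bit operand form the mask 0x7fffe000, so the low 13
	 * bits are always zero, any prefix is a multiple of 8192, and
	 * 4096 can never be a prefix address. */
	static inline u32 set_prefix_value(u32 operand)
	{
		return operand & 0x7fffe000;
	}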
Re: [kvm-devel] [PATCH 2/2] Move in-kernel PIT device to separate file
Anthony Liguori wrote:
> This patch is mostly code motion to move the newly refactored
> in-kernel PIT device to a separate file.

Applied both, thanks.

-- 
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls
On Wed, Apr 02, 2008 at 08:49:52AM +0200, Andrea Arcangeli wrote:
> Most other patches will apply cleanly on top of my coming mmu
> notifiers #v10 that I hope will go in -mm. For #v10 the only two open
> issues left to discuss are:

Does your v10 allow sleeping inside the callbacks?

Thanks,
Robin
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Andrea Arcangeli wrote:
> On Wed, Apr 02, 2008 at 12:50:50PM +0300, Avi Kivity wrote:
>> Isn't it faster though? We don't need to pull in the cacheline
>> containing the struct page anymore.
>
> Exactly. Not only that: get_user_pages is likely a bit slower than we
> need for just the kvm pte lookup. GRU uses follow_page directly
> because it runs in the tlb miss handler; for us instead the tlb miss
> handler doesn't invoke a page fault unless the spte is non present.

How about this plan?

0. Merge mmu notifiers

1. gfn_to_page() -> gfn_to_pfn()
   Still keeping the refcount. Change bad_page to kvm_bad_hfn.

2. Drop the refcounting from gfn_to_pfn() and from
   kvm_release_page_*(). Still using get_user_pages() (and dropping
   the refcount immediately). Simultaneously, change hack_module.awk
   to add the refcount back.

3. Export follow_page() or something based on fast_gup(), and use it

btw, if we change the method we use to read the Linux pte, I'd like to
get the writable bit out of it. This way, when we create an spte for a
gpte that is writable and dirty, we can set the spte writable iff the
Linux pte is writable. This avoids breaking COW unnecessarily.

-- 
error compiling committee.c: too many arguments to function
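A sketch of that writable-bit rule (illustrative names and a
hard-coded bit; the real kvm spte flags differ):

	/* Only make the shadow pte writable when the guest pte is
	 * writable and dirty *and* the host (Linux) pte is writable, so
	 * kvm never has to break COW just to install a writable spte. */
	static u64 make_spte(u64 base, int gpte_w, int gpte_d,
			     int host_pte_w)
	{
		u64 spte = base;

		if (gpte_w && gpte_d && host_pte_w)
			spte |= 1ULL << 1;	/* x86 pte bit 1: R/W */
		return spte;
	}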
Re: [kvm-devel] KVM hangs with windows XP guest - help with debugging?
On 01/04/2008, Avi Kivity <[EMAIL PROTECTED]> wrote:
> Please run 'kvm_stat -1' (found in the tarball) while the guest is
> hanging and report the output.

# /home/jgu/Desktop/kvm_stat -1
efer_reload      8757563     129
exits           17039786     362
halt_exits        384540     129
halt_wakeup            0       0
invlpg                 0       0
io_exits         3938052     100
irq_exits         113040       0
irq_window             0       0
light_exits      8419349     233
mmio_exits       4813794       0
pf_fixed       587157199
pf_guest          523123       0
request_irq            0       0
signal_exits      136078       0
tlb_flush       61846733

> Also report whether the qemu process is consuming cpu or not.

Yes, it's hammering the CPU flat out.

Let me know if I can provide more info.

J.
-- 
Any sufficiently difficult bug is indistinguishable from a feature.
Re: [kvm-devel] [PATCH] KVM: add kvm_get_kvm and kvm_put_kvm functions
Izik Eidus wrote:
> From ebb9fe4765f1572314d2249e29a7ef4d0de07273 Mon Sep 17 00:00:00 2001
> From: Izik Eidus <[EMAIL PROTECTED]>
> Date: Sun, 30 Mar 2008 15:48:35 +0300
> Subject: [PATCH] KVM: add kvm_get_kvm and kvm_put_kvm functions
>
> The main purpose of adding these functions is the ability to release
> the spinlock that protects the kvm list while still being able to do
> operations on a specific kvm in a safe way.

Applied, thanks.

-- 
error compiling committee.c: too many arguments to function
Re: [kvm-devel] KVM hangs with windows XP guest - help with debugging?
On 02/04/2008, Jonathan Underwood <[EMAIL PROTECTED]> wrote:
> On 01/04/2008, Avi Kivity <[EMAIL PROTECTED]> wrote:
>> Please run 'kvm_stat -1' (found in the tarball) while the guest is
>> hanging and report the output.
>
> # /home/jgu/Desktop/kvm_stat -1
> [kvm_stat output snipped]

Actually, that VM did eventually spring back into life, bizarrely.
However, later on I saw another hard hang:

# /home/jgu/Desktop/kvm_stat -1
efer_reload     28180156    1140
exits          339058958  312917
halt_exits       1254859       0
halt_wakeup            0       0
invlpg                 0       0
io_exits        18320867       0
irq_exits         826972       0
irq_window             0       0
light_exits    312098749  312898
mmio_exits       9583557       0
pf_fixed       110064034  312919
pf_guest         2235329       0
request_irq            0       0
signal_exits     1207511    1087
tlb_flush        2391783       0

And the VM was still consuming CPU.

J.
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
On Wed, Apr 02, 2008 at 02:16:41PM +0300, Avi Kivity wrote:
> Ugh, there's still mark_page_accessed() and SetPageDirty().

btw, just as PG_dirty is only set if the spte is writeable,
mark_page_accessed should only run if the accessed bit is set in the
spte. It doesn't matter now, as nobody could possibly clear it, but
with mmu notifiers ->clear_young will clear that accessed bit for the
first time, and if the guest didn't use the spte in the meantime we
don't need to mark the page accessed and we can let the VM swap out
the page at the next round of the lru.

But that's ok: we'll simply run mark_page_accessed and SetPageDirty if
pfn_valid is true. We're under mmu_lock, so the mmu notifiers will
prevent the page from going away under us.

The detail I'm uncertain about is why my reserved-ram patch had
pfn_valid == 1, page_count == 0 and PG_reserved == 0 for all reserved
ram. Perhaps I triggered a bug in some memmap initialization, as the
common code should keep all pages that aren't supposed to go in the
freelist when freed marked PG_reserved with page_count == 1. Anyway in
practice it worked fine, and there's zero risk since I'm using
reserved-ram with the page_count == 0 check ;).

The rmap_remove should look like this (this won't work with the
reserved ram, but that needs fixing somehow: reserved-ram must be like
an mmio region but with a pfn_valid and in turn with a page_t, and
then the reserved-ram patch will be able to cope with the reserved ram
not having an allocated memmap by luck of how page_alloc.c works).

	if (pfn_valid(pfn)) {
		page = pfn_to_page(pfn);
		if (!PageReserved(page)) {
			BUG_ON(page_count(page) != 1);
			if (is_writeable_pte(*spte))
				SetPageDirty(page);
			if (*spte & PT_ACCESSED_MASK)
				mark_page_accessed(page);
		}
	}

It still skips an atomic op.

Your plan still sounds just fine despite the above. In fact it sounds
too perfect: the awk hack to re-add the refcounting when building the
external module if CONFIG_MMU_NOTIFIER isn't defined is going to be
messy; a plain CONFIG_MMU_NOTIFIER in kvm.git would be simpler and
more robust IMHO, even if less perfect :).
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
On Wed, Apr 02, 2008 at 01:50:19PM +0200, Andrea Arcangeli wrote:
> 	if (pfn_valid(pfn)) {
> 		page = pfn_to_page(pfn);
> 		if (!PageReserved(page)) {
> 			BUG_ON(page_count(page) != 1);
> 			if (is_writeable_pte(*spte))
> 				SetPageDirty(page);
> 			if (*spte & PT_ACCESSED_MASK)
> 				mark_page_accessed(page);
> 		}
> 	}

Warning: the indenting of the above came out wrong in text mode...
read it like a C compiler would. Sorry ;)
[kvm-devel] How guest virtual address is translated to physical address?
Hello,

I have a question about how the guest page table and the shadow page
table work together, and more precisely about how the host is involved
when the guest accesses a page that is already in the page table.

The guest maintains its page table to translate guest virtual
addresses to guest physical addresses. It uses the MMU and the cr3
register for the translation. When there is a page fault, the host
gets involved: it catches the guest physical address (by walking
through the guest page table?) and fills its shadow page table. Thus
the host can make the translation from guest physical address to
physical address.

Now, if the guest reads the same page again, the PTE will point to a
guest physical address, but there will not be any page fault, so the
host will not be involved in the translation (not so sure). I don't
see how the guest virtual address will be translated to a physical
address. I'm missing something, but I don't see what.

Thanks for your help,
Guillaume
Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls
I must have missed v10. Could you repost so I can build xpmem against
it to see how it operates?

To help reduce confusion, you should probably commandeer the patches
from Christoph's set which you think are needed to make it sleep.

Thanks,
Robin

On Wed, Apr 02, 2008 at 01:16:51PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 02, 2008 at 05:59:25AM -0500, Robin Holt wrote:
>> On Wed, Apr 02, 2008 at 08:49:52AM +0200, Andrea Arcangeli wrote:
>>> Most other patches will apply cleanly on top of my coming mmu
>>> notifiers #v10 that I hope will go in -mm. For #v10 the only two
>>> open issues left to discuss are:
>>
>> Does your v10 allow sleeping inside the callbacks?
>
> Yes, if you apply all the patches. But not if you apply the first
> patch only. Most patches in the EMM series will apply cleanly or with
> minor rejects to #v10 too; Christoph's further work to make EMM sleep
> capable looks very good and it's going to be 100% shared, and it's
> also going to be a lot more controversial for merging than the two
> #v10 or EMM first patches. EMM also doesn't allow sleeping inside the
> callbacks if you only apply the first patch in the series.
>
> My priority is to get #v9 or the coming #v10 merged in -mm (the only
> difference will be the replacement of rcu_read_lock with the seqlock
> to avoid breaking the synchronize_rcu in GRU code). I will mix
> seqlock with rcu ordered writes. EMM indeed breaks GRU by making
> synchronize_rcu a noop and by not providing any alternative (I will
> obsolete synchronize_rcu by making it a noop instead). This assumes
> Jack used synchronize_rcu for whatever good reason. But this isn't
> the real strong point against EMM; adding seqlock to EMM is as easy
> as adding it to #v10 (admittedly with #v10 it's a bit easier because
> I didn't expand the hlist operations for zero gain like in EMM).
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Avi Kivity wrote:
> Andrea Arcangeli wrote:
>> On Wed, Apr 02, 2008 at 12:50:50PM +0300, Avi Kivity wrote:
>>> Isn't it faster though? We don't need to pull in the cacheline
>>> containing the struct page anymore.
>>
>> Exactly. Not only that: get_user_pages is likely a bit slower than
>> we need for just the kvm pte lookup. GRU uses follow_page directly
>> because it runs in the tlb miss handler; for us instead the tlb miss
>> handler doesn't invoke a page fault unless the spte is non present.
>
> How about this plan?
>
> 0. Merge mmu notifiers

Are mmu-notifiers likely to get pushed when the 2.6.26 window opens up?

> 1. gfn_to_page() -> gfn_to_pfn()
>    Still keeping the refcount. Change bad_page to kvm_bad_hfn.

kvm_bad_pfn.

> 2. Drop the refcounting from gfn_to_pfn() and from
>    kvm_release_page_*(). Still using get_user_pages() (and dropping
>    the refcount immediately). Simultaneously, change hack_module.awk
>    to add the refcount back.

This has the dependency on mmu notifiers, so step 1 can conceivably be
merged in the absence of mmu notifiers.

Regards,

Anthony Liguori

> 3. Export follow_page() or something based on fast_gup(), and use it
>
> btw, if we change the method we use to read the Linux pte, I'd like
> to get the writable bit out of it. This way, when we create an spte
> for a gpte that is writable and dirty, we can set the spte writable
> iff the Linux pte is writable. This avoids breaking COW unnecessarily.
Re: [kvm-devel] How guest virtual address is translated to physical address?
Guillaume Thouvenin wrote:
> Hello, I have a question about how the guest page table and the
> shadow page table work together, and more precisely about how the
> host is involved when the guest accesses a page that is already in
> the page table.
>
> The guest maintains its page table to translate guest virtual
> addresses to guest physical addresses. It uses the MMU and the cr3
> register for the translation. When there is a page fault, the host
> gets involved: it catches the guest physical address (by walking
> through the guest page table?) and fills its shadow page table.

When a page fault occurs, a vmexit is generated. The vmexit is
delivered before delivery to the interrupt handler but after CR2 is
populated with the faulting address. We can then obtain the faulting
virtual address from CR2 in the host and walk the guest's page table
to determine what guest physical address should be mapped at that
guest virtual address (if any at all). We then use that information to
add an entry to the shadow page table.

The guest runs with the shadow page table installed; the hardware
never actually sees the guest page table. This is because the hardware
has no knowledge of MMU virtualization (this is all assuming
pre-EPT/NPT hardware).

> Thus the host can make the translation from guest physical address to
> physical address.

And we use a combination of things to make that determination. First,
there is a set of slots that cover ranges of physical addresses. Slots
are necessary on x86 as there may be very large holes in physical
memory. It also makes certain optimizations easier.

Each slot contains a base virtual address. This is a QEMU userspace
address that is where we malloc()'d memory for that particular slot.
Normally, this corresponds to phys_ram_base within QEMU. We add the
guest physical address to the virtual address base in the slot (minus
the slot starting address) to obtain a QEMU userspace address for the
guest physical address. Once we have that, we can call
get_user_pages() to pin the userspace memory into physical memory. We
can then obtain a host physical address. That's what we use to
populate the shadow page table.

The flip side of this is that as long as we have a shadow page table
mapping referencing a host physical address, we must keep the page
pinned into memory. Once we drop the shadow page table entry (because
the guest process has gone away or we decide to evict it from the
cache) we can reduce the reference count via put_page().

This is where mmu notifiers come into play. If Linux really wants to
reclaim memory on the host, there's no way ATM for it to do so with
the memory we have pinned. mmu notifiers provide a mechanism for Linux
to ask KVM to explicitly unpin a particular guest page.

> Now, if the guest reads the same page, the PTE will point to a guest
> physical address, but there will not be any page fault, so the host
> will not be involved in the translation (not so sure). I don't see
> how the guest virtual address will be translated to a physical
> address. I miss something but I don't see what.

I think what you're missing is that while the guest is running, its
CR3 does not point to its own page table but rather to the shadow page
table. We intercept CR3 reads to pretend to the guest that its page
table is, in fact, installed, but it's really the shadow page table
that's in the hardware register.

Regards,

Anthony Liguori
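For reference, a sketch of the slot lookup Anthony describes (the
struct fields here follow the idea, not kvm's exact definitions): a
guest frame number is mapped to a QEMU userspace address, which
get_user_pages() can then pin.

	struct kvm_memory_slot {
		unsigned long base_gfn;		/* first guest frame in the slot */
		unsigned long npages;
		unsigned long userspace_addr;	/* where qemu malloc'd the ram */
	};

	/* Walk the slots; guest physical holes simply match no slot. */
	static unsigned long gfn_to_hva(struct kvm_memory_slot *slots,
					int nslots, unsigned long gfn)
	{
		int i;

		for (i = 0; i < nslots; i++) {
			struct kvm_memory_slot *s = &slots[i];

			if (gfn >= s->base_gfn &&
			    gfn < s->base_gfn + s->npages)
				return s->userspace_addr +
					((gfn - s->base_gfn) << PAGE_SHIFT);
		}
		return 0;	/* hole in guest physical memory */
	}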
Re: [kvm-devel] Ubuntu Gutsy host / XP guest / -smp 2
on Wed Apr 02 2008, Avi Kivity <avi-AT-qumranet.com> wrote:
> David Abrahams wrote:
>> With the title combination, the guest takes nearly 100% of my real
>> CPU time and still only sees one CPU. Is this a known problem, and
>> does it have a known solution?
>
> Can you send the output of 'kvm_stat -1'?

I don't appear to have a kvm_stat executable anywhere. How do I
acquire that?

> Also, what does the guest think it is doing (in task manager, etc.)?

System Idle Process :( Everything is calm in the guest, but top in the
host shows a kvm process at between 50 and 70 percent of the CPU.

> What HAL do you see in device manager?

In the guest? I don't see anything that obviously appears to be HAL in
device manager. Could you be more specific?

Sorry to be so clueless, and thanks again in advance,

-- 
Dave Abrahams
Boost Consulting
http://boost-consulting.com
Re: [kvm-devel] [patch 5/9] Convert anon_vma lock to rw_sem and refcount
On Tue, Apr 01, 2008 at 01:55:36PM -0700, Christoph Lameter wrote:
> This results in f.e. the Aim9 brk performance test going down by
> 10-15%.

I guess it's more likely because of overscheduling of small critical
sections. Did you count the total number of context switches? I guess
there will be a lot more with your patch applied.

That regression is a showstopper, and it is the reason why I've
suggested before to add a CONFIG_XPMEM or CONFIG_MMU_NOTIFIER_SLEEP
config option to make the VM locks sleep capable only when XPMEM=y
(PREEMPT_RT will enable it too).

Thanks for doing the benchmark work!
Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls
On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
> There are much bigger issues besides the rcu safety in this patch.
> Proper aging of the secondary mmu through access bits set by hardware
> is unfixable with this model (you would need to do age |=
> e->callback(...)), which is proof that this isn't flexible enough:
> it forces the same parameters and retval semantics on all methods. I
> see no reason to go for such an inferior solution that will never get
> the aging right and will likely fall apart if we add more methods in
> the future.

There is always the possibility to add special functions in the same
way as done in the mmu notifier series if it really becomes necessary.
EMM in no way precludes that. Here f.e. we can add a special emm_age()
function that iterates differently and does the | for you.

> For example the switch you have to add in xpmem_emm_notifier_callback
> doesn't look good; at best gcc may be able to optimize it with array
> indexing, simulating a proper pointer to function like in #v9.

Actually the switch looks really good because it allows code to run
for all callbacks, f.e. xpmem_tg_ref(). Otherwise the refcounting code
would have to be added to each callback.

> Most other patches will apply cleanly on top of my coming mmu
> notifiers #v10 that I hope will go in -mm. For #v10 the only two open
> issues left to discuss are:

Did I see #v10? Could you start a new subject when you post please? Do
not respond to some old message, otherwise the threading will be wrong.

> the methods will be correctly replayed, allowing GRU not to corrupt
> memory after the registration method. EMM would also need a fix like
> this for GRU to be safe on top of EMM.

How exactly does the GRU corrupt memory?

> Another less obviously safe approach is to allow the register method
> to succeed only when mm_users == 1 and the task is single threaded.
> [...] In general this solution looks more fragile than the seqlock.

Hmmm... Okay, that is one solution that would just require a BUG_ON in
the registration methods.

> 2) I'm uncertain how the driver can handle a range_end called before
> range_begin. Also multiple range_begin can happen in parallel, later
> followed by range_end, so if there's a global seqlock that serializes
> the secondary mmu page fault, that will screw up (you can't
> seqlock_write in range_begin and sequnlock_write in range_end). The
> write side of the seqlock must be serialized, and calling
> seqlock_write twice in a row before any sequnlock operation will
> break.

Well, doesn't the requirement of just one execution thread also deal
with that issue?
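A sketch of the emm_age() idea mentioned above (hypothetical code;
emm_referenced stands in for whatever op name the series would use):

	/* Hypothetical special-case walk: unlike __emm_notify(), every
	 * subscribed notifier is asked and the hardware access bits are
	 * OR-ed together, which is what page aging needs. */
	static int emm_age(struct mm_struct *mm, unsigned long start,
			   unsigned long end)
	{
		struct emm_notifier *e;
		int young = 0;

		for (e = mm->emm_notifier; e; e = e->next)
			if (e->callback)
				young |= e->callback(e, mm, emm_referenced,
						     start, end);
		/* non-zero if any secondary MMU used the range */
		return young;
	}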
Re: [kvm-devel] [patch 5/9] Convert anon_vma lock to rw_sem and refcount
On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
> On Tue, Apr 01, 2008 at 01:55:36PM -0700, Christoph Lameter wrote:
>> This results in f.e. the Aim9 brk performance test going down by
>> 10-15%.
>
> I guess it's more likely because of overscheduling of small critical
> sections. Did you count the total number of context switches? I guess
> there will be a lot more with your patch applied. That regression is
> a showstopper, and it is the reason why I've suggested before to add
> a CONFIG_XPMEM or CONFIG_MMU_NOTIFIER_SLEEP config option to make the
> VM locks sleep capable only when XPMEM=y (PREEMPT_RT will enable it
> too). Thanks for doing the benchmark work!

There are more context switches if locks are contended. But that
actually also has some good aspects, because we avoid busy loops and
can potentially continue work in another process.
[kvm-devel] EMM: Fixup return value handling of emm_notify()
On Wed, 2 Apr 2008, Christoph Lameter wrote:
> Here f.e. we can add a special emm_age() function that iterates
> differently and does the | for you.

Well, maybe not really necessary. How about this fix? It's likely a
problem to stop callbacks if one callback returned an error.

Subject: EMM: Fixup return value handling of emm_notify()

Right now we stop calling additional subsystems if one callback
returned an error. That has the potential for causing additional
trouble with the subsystems that do not receive the callbacks they
expect if one has failed.

So change the handling of error codes to continue callbacks to other
subsystems but return the first error code encountered.

If a callback returns a positive return value then add up all the
values from all the calls. That can be used to establish how many
references exist (xpmem may want this feature at some point) or ensure
that the aging works the way Andrea wants it to (KVM and XPMEM so far
do not care too much).

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/rmap.c |   28 +++-
 1 file changed, 23 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-04-02 11:46:20.738342852 -0700
+++ linux-2.6/mm/rmap.c	2008-04-02 12:03:57.672494320 -0700
@@ -299,27 +299,45 @@ void emm_notifier_register(struct emm_no
 }
 EXPORT_SYMBOL_GPL(emm_notifier_register);
 
-/* Perform a callback */
+/*
+ * Perform a callback
+ *
+ * The return of this function is either a negative error of the first
+ * callback that failed or a consolidated count of all the positive
+ * values that were returned by the callbacks.
+ */
 int __emm_notify(struct mm_struct *mm, enum emm_operation op,
 		unsigned long start, unsigned long end)
 {
 	struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
 	int x;
+	int result = 0;
 
 	while (e) {
-
 		if (e->callback) {
 			x = e->callback(e, mm, op, start, end);
-			if (x)
-				return x;
+
+			/*
+			 * Callback may return a positive value to indicate a count
+			 * or a negative error code. We keep the first error code
+			 * but continue to perform callbacks to other subscribed
+			 * subsystems.
+			 */
+			if (x && result >= 0) {
+				if (x >= 0)
+					result += x;
+				else
+					result = x;
+			}
 		}
+
 		/*
 		 * emm_notifier contents (e) must be fetched after
 		 * the retrival of the pointer to the notifier.
		 */
 		e = rcu_dereference(e->next);
 	}
-	return 0;
+	return result;
 }
 EXPORT_SYMBOL_GPL(__emm_notify);
 #endif
[kvm-devel] [PATCH] Don't assume struct page for x86 MMU
This patch introduces a gfn_to_pfn() function and corresponding
functions like kvm_release_pfn_dirty(). Using these new functions, we
can modify the x86 MMU to no longer assume that it can always get a
struct page for any given gfn.

We don't want to eliminate gfn_to_page() entirely because a number of
places assume they can do gfn_to_page() and then kmap() the results.
When we support IO memory, gfn_to_page() will fail for IO pages
although gfn_to_pfn() will succeed.

This does not implement support for avoiding reference counting for
reserved RAM or for IO memory. However, it should make those things
pretty straightforward.

Since we're only introducing new common symbols, I don't think it will
break the non-x86 architectures, but I haven't tested those. I've
tested Intel, AMD, NPT, and hugetlbfs with Windows and Linux guests.

Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]>

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1594ee0..867920f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -240,11 +240,9 @@ static int is_rmap_pte(u64 pte)
 	return is_shadow_present_pte(pte);
 }
 
-static struct page *spte_to_page(u64 pte)
+static pfn_t spte_to_pfn(u64 pte)
 {
-	hfn_t hfn = (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
-
-	return pfn_to_page(hfn);
+	return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
 }
 
 static gfn_t pse36_gfn_delta(u32 gpte)
@@ -541,19 +539,19 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
 	struct kvm_rmap_desc *desc;
 	struct kvm_rmap_desc *prev_desc;
 	struct kvm_mmu_page *sp;
-	struct page *page;
+	pfn_t pfn;
 	unsigned long *rmapp;
 	int i;
 
 	if (!is_rmap_pte(*spte))
 		return;
 	sp = page_header(__pa(spte));
-	page = spte_to_page(*spte);
-	mark_page_accessed(page);
+	pfn = spte_to_pfn(*spte);
+	kvm_set_pfn_accessed(pfn);
 	if (is_writeble_pte(*spte))
-		kvm_release_page_dirty(page);
+		kvm_release_pfn_dirty(pfn);
 	else
-		kvm_release_page_clean(page);
+		kvm_release_pfn_clean(pfn);
 	rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt], is_large_pte(*spte));
 	if (!*rmapp) {
 		printk(KERN_ERR "rmap_remove: %p %llx 0->BUG\n", spte, *spte);
@@ -634,11 +632,11 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
 		spte = rmap_next(kvm, rmapp, spte);
 	}
 	if (write_protected) {
-		struct page *page;
+		pfn_t pfn;
 
 		spte = rmap_next(kvm, rmapp, NULL);
-		page = spte_to_page(*spte);
-		SetPageDirty(page);
+		pfn = spte_to_pfn(*spte);
+		kvm_set_pfn_dirty(pfn);
 	}
 
 	/* check for huge page mappings */
@@ -1035,7 +1033,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *shadow_pte,
 			 unsigned pt_access, unsigned pte_access,
 			 int user_fault, int write_fault, int dirty,
 			 int *ptwrite, int largepage, gfn_t gfn,
-			 struct page *page, bool speculative)
+			 pfn_t pfn, bool speculative)
 {
 	u64 spte;
 	int was_rmapped = 0;
@@ -1057,10 +1055,9 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *shadow_pte,
 			child = page_header(pte & PT64_BASE_ADDR_MASK);
 			mmu_page_remove_parent_pte(child, shadow_pte);
-		} else if (page != spte_to_page(*shadow_pte)) {
+		} else if (pfn != spte_to_pfn(*shadow_pte)) {
 			pgprintk("hfn old %lx new %lx\n",
-				 page_to_pfn(spte_to_page(*shadow_pte)),
-				 page_to_pfn(page));
+				 spte_to_pfn(*shadow_pte), pfn);
 			rmap_remove(vcpu->kvm, shadow_pte);
 		} else {
 			if (largepage)
@@ -1089,7 +1086,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *shadow_pte,
 	if (largepage)
 		spte |= PT_PAGE_SIZE_MASK;
 
-	spte |= page_to_phys(page);
+	spte |= (pfn << PAGE_SHIFT);
 
 	if ((pte_access & ACC_WRITE_MASK)
 	    || (write_fault && !is_write_protection(vcpu) && !user_fault)) {
@@ -1134,12 +1131,12 @@ unshadowed:
 	if (!was_rmapped) {
 		rmap_add(vcpu, shadow_pte, gfn, largepage);
 		if (!is_rmap_pte(*shadow_pte))
-			kvm_release_page_clean(page);
+			kvm_release_pfn_clean(pfn);
 	} else {
 		if (was_writeble)
-			kvm_release_page_dirty(page);
+			kvm_release_pfn_dirty(pfn);
 		else
-			kvm_release_page_clean(page);
+			kvm_release_pfn_clean(pfn);
 	}
 	if (!ptwrite || !*ptwrite)
Re: [kvm-devel] EMM: Fixup return value handling of emm_notify()
On Wed, Apr 02, 2008 at 12:03:50PM -0700, Christoph Lameter wrote:
> +			/*
> +			 * Callback may return a positive value to indicate a count
> +			 * or a negative error code. We keep the first error code
> +			 * but continue to perform callbacks to other subscribed
> +			 * subsystems.
> +			 */
> +			if (x && result >= 0) {
> +				if (x >= 0)
> +					result += x;
> +				else
> +					result = x;
> +			}
> 		}
> +

Now think of when one of the kernel janitors will micro-optimize
PG_dirty to be returned by invalidate_page so a single set_page_dirty
will be invoked...

Keep in mind this is a kernel-internal API; ask Greg if we can change
it in order to optimize later in the future. I think my #v9 is optimal
enough while being simple at the same time, but anyway it's silly to
be hardwired to such an interface that, worst of all, requires switch
statements instead of proper pointers to functions, and a fixed set of
parameters and retval semantics for all methods.
[kvm-devel] EMM: Require single threadedness for registration.
Here is a patch to require single threaded execution during
emm_register. This also allows an easy implementation of an unregister
function and gets rid of the races that Andrea worried about. The
approach here is similar to what was used in selinux for security
context changes (see selinux_setprocattr).

Is it okay for the users of emm to require single threadedness for
registration?

Subject: EMM: Require single threaded execution for register and unregister

We can avoid the concurrency issues arising at registration if we only
allow registration of notifiers when the process has only a single
thread. That even allows to avoid the use of rcu.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/rmap.c |   46 +-
 1 file changed, 37 insertions(+), 9 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-04-02 13:53:46.002473685 -0700
+++ linux-2.6/mm/rmap.c	2008-04-02 14:03:05.872199896 -0700
@@ -286,20 +286,48 @@ void emm_notifier_release(struct mm_stru
 	}
 }
 
-/* Register a notifier */
+/*
+ * Register a notifier
+ *
+ * mmap_sem is held writably.
+ *
+ * Process must be single threaded.
+ */
 void emm_notifier_register(struct emm_notifier *e, struct mm_struct *mm)
 {
+	BUG_ON(atomic_read(&mm->mm_users) != 1);
+
 	e->next = mm->emm_notifier;
-	/*
-	 * The update to emm_notifier (e->next) must be visible
-	 * before the pointer becomes visible.
-	 * rcu_assign_pointer() does exactly what we need.
-	 */
-	rcu_assign_pointer(mm->emm_notifier, e);
+	mm->emm_notifier = e;
 }
 EXPORT_SYMBOL_GPL(emm_notifier_register);
 
 /*
+ * Unregister a notifier
+ *
+ * mmap_sem is held writably
+ *
+ * Process must be single threaded
+ */
+void emm_notifier_unregister(struct emm_notifier *e, struct mm_struct *mm)
+{
+	struct emm_notifier *p = mm->emm_notifier;
+
+	BUG_ON(atomic_read(&mm->mm_users) != 1);
+
+	if (e == p)
+		mm->emm_notifier = e->next;
+	else {
+		while (p->next != e)
+			p = p->next;
+
+		p->next = e->next;
+	}
+	e->callback(e, mm, emm_release, 0, TASK_SIZE);
+}
+EXPORT_SYMBOL_GPL(emm_notifier_unregister);
+
+/*
  * Perform a callback
  *
  * The return of this function is either a negative error of the first
@@ -309,7 +337,7 @@ EXPORT_SYMBOL_GPL(emm_notifier_register)
 int __emm_notify(struct mm_struct *mm, enum emm_operation op,
 		unsigned long start, unsigned long end)
 {
-	struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
+	struct emm_notifier *e = mm->emm_notifier;
 	int x;
 	int result = 0;
 
@@ -335,7 +363,7 @@ int __emm_notify(struct mm_struct *mm, e
 		 * emm_notifier contents (e) must be fetched after
 		 * the retrival of the pointer to the notifier.
 		 */
-		e = rcu_dereference(e->next);
+		e = e->next;
 	}
 	return result;
 }
Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls
On Wed, Apr 02, 2008 at 10:59:50AM -0700, Christoph Lameter wrote:
> Did I see #v10? Could you start a new subject when you post please?
> Do not respond to some old message, otherwise the threading will be
> wrong.

I wasn't clear enough: #v10 was in the works... I was thinking about
the last two issues before posting it.

> How exactly does the GRU corrupt memory?

Jack added synchronize_rcu, I assume for a reason.

> Hmmm... Okay, that is one solution that would just require a BUG_ON
> in the registration methods.

Perhaps you didn't notice that this solution can't work if you call
range_begin/end not in the current context, and try_to_unmap_cluster
does exactly that for both my patchset and yours. Missing an _end is
ok; missing a _begin is never ok.

> Well, doesn't the requirement of just one execution thread also deal
> with that issue?

Yes, except again it can't work for try_to_unmap_cluster. This
solution is only applicable to #v10 if I fix try_to_unmap_cluster to
only call invalidate_page (relying on the fact the VM holds a pin and
a lock on any page that is being mmu-notifier-invalidated).

You can't use the single threaded approach to solve either 1 or 2,
because your _begin call is called anywhere, and that's where you call
the secondary-tlb flush, so it's fatal to miss it. invalidate_page is
always called after, so it enforces the tlb flush to be called _after_
and is inherently safe.
Re: [kvm-devel] [patch 5/9] Convert anon_vma lock to rw_sem and refcount
On Wed, Apr 02, 2008 at 11:15:26AM -0700, Christoph Lameter wrote:
> On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
>> On Tue, Apr 01, 2008 at 01:55:36PM -0700, Christoph Lameter wrote:
>>> This results in f.e. the Aim9 brk performance test going down by
>>> 10-15%.
>>
>> I guess it's more likely because of overscheduling of small critical
>> sections. Did you count the total number of context switches? I
>> guess there will be a lot more with your patch applied. That
>> regression is a showstopper, and it is the reason why I've suggested
>> before to add a CONFIG_XPMEM or CONFIG_MMU_NOTIFIER_SLEEP config
>> option to make the VM locks sleep capable only when XPMEM=y
>> (PREEMPT_RT will enable it too). Thanks for doing the benchmark
>> work!
>
> There are more context switches if locks are contended. But that
> actually also has some good aspects, because we avoid busy loops and
> can potentially continue work in another process.

That would be the case if the wait time were longer than the
scheduling time. The whole point is that with anon_vma the write side
is so fast it's likely never worth scheduling (probably not even with
preempt-rt for the write side; the read side is an entirely different
matter, but the read side can run concurrently if the system is
heavily paging), hence the slowdown. What you benchmarked is the write
side, which is also the fast path when the system is heavily CPU
bound. I have to say aim is a great benchmark to test this regression.
But I think a config option will solve all of this.
Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls
On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
>> Hmmm... Okay, that is one solution that would just require a BUG_ON
>> in the registration methods.
>
> Perhaps you didn't notice that this solution can't work if you call
> range_begin/end not in the current context, and try_to_unmap_cluster
> does exactly that for both my patchset and yours. Missing an _end is
> ok; missing a _begin is never ok.

If you look at the patch you will see a requirement of holding a
writelock on mmap_sem, which will keep out get_user_pages().
[kvm-devel] [PATCH 2 of 8] Moves all mmu notifier methods outside the PT lock (first and not last step to make them sleep capable)
# HG changeset patch # User Andrea Arcangeli [EMAIL PROTECTED] # Date 1207159010 -7200 # Node ID fe00cb9deeb31467396370c835cb808f4b85209a # Parent a406c0cc686d0ca94a4d890d661cdfa48cfba09f Moves all mmu notifier methods outside the PT lock (first and not last step to make them sleep capable). Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED] diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -121,27 +121,6 @@ seqlock_init(mm-mmu_notifier_lock); } -#define ptep_clear_flush_notify(__vma, __address, __ptep) \ -({ \ - pte_t __pte;\ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - __pte = ptep_clear_flush(___vma, ___address, __ptep); \ - mmu_notifier_invalidate_page(___vma-vm_mm, ___address);\ - __pte; \ -}) - -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ -({ \ - int __young;\ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ - __young |= mmu_notifier_clear_flush_young(___vma-vm_mm,\ - ___address); \ - __young;\ -}) - #else /* CONFIG_MMU_NOTIFIER */ static inline void mmu_notifier_release(struct mm_struct *mm) @@ -173,9 +152,6 @@ { } -#define ptep_clear_flush_young_notify ptep_clear_flush_young -#define ptep_clear_flush_notify ptep_clear_flush - #endif /* CONFIG_MMU_NOTIFIER */ #endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -194,11 +194,13 @@ if (pte) { /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); pte_unmap_unlock(pte, ptl); + /* must invalidate_page _before_ freeing the page */ + mmu_notifier_invalidate_page(mm, address); page_cache_release(page); } } diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -1626,9 +1626,10 @@ */ page_table = pte_offset_map_lock(mm, pmd, address, ptl); - page_cache_release(old_page); + new_page = NULL; if (!pte_same(*page_table, orig_pte)) goto unlock; + page_cache_release(old_page); page_mkwrite = 1; } @@ -1644,6 +1645,7 @@ if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; + old_page = new_page = NULL; goto unlock; } @@ -1688,7 +1690,7 @@ * seen in the presence of one thread doing SMC and another * thread doing COW. */ - ptep_clear_flush_notify(vma, address, page_table); + ptep_clear_flush(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); @@ -1700,12 +1702,18 @@ } else mem_cgroup_uncharge_page(new_page); - if (new_page) +unlock: + pte_unmap_unlock(page_table, ptl); + + if (new_page) { + if (new_page == old_page) + /* cow happened, notify before releasing old_page */ + mmu_notifier_invalidate_page(mm, address); page_cache_release(new_page); + } if (old_page) page_cache_release(old_page); -unlock: - pte_unmap_unlock(page_table, ptl); + if (dirty_page) { if (vma-vm_file) file_update_time(vma-vm_file); diff --git
[kvm-devel] [PATCH 4 of 8] The conversion to a rwsem allows callbacks during rmap traversal
# HG changeset patch # User Andrea Arcangeli [EMAIL PROTECTED] # Date 1207159011 -7200 # Node ID 3c3787c496cab1fc590ba3f97e7904bdfaab5375 # Parent d880c227ddf345f5d577839d36d150c37b653bfd The conversion to a rwsem allows callbacks during rmap traversal for files in a non atomic context. A rw style lock also allows concurrent walking of the reverse map. This is fairly straightforward if one removes pieces of the resched checking. [Restarting unmapping is an issue to be discussed]. This slightly increases Aim9 performance results on an 8p. Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED] Signed-off-by: Christoph Lameter [EMAIL PROTECTED] diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c --- a/arch/x86/mm/hugetlbpage.c +++ b/arch/x86/mm/hugetlbpage.c @@ -69,7 +69,7 @@ if (!vma_shareable(vma, addr)) return; - spin_lock(mapping-i_mmap_lock); + down_read(mapping-i_mmap_sem); vma_prio_tree_foreach(svma, iter, mapping-i_mmap, idx, idx) { if (svma == vma) continue; @@ -94,7 +94,7 @@ put_page(virt_to_page(spte)); spin_unlock(mm-page_table_lock); out: - spin_unlock(mapping-i_mmap_lock); + up_read(mapping-i_mmap_sem); } /* diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -454,10 +454,10 @@ pgoff = offset PAGE_SHIFT; i_size_write(inode, offset); - spin_lock(mapping-i_mmap_lock); + down_read(mapping-i_mmap_sem); if (!prio_tree_empty(mapping-i_mmap)) hugetlb_vmtruncate_list(mapping-i_mmap, pgoff); - spin_unlock(mapping-i_mmap_lock); + up_read(mapping-i_mmap_sem); truncate_hugepages(inode, offset); return 0; } diff --git a/fs/inode.c b/fs/inode.c --- a/fs/inode.c +++ b/fs/inode.c @@ -210,7 +210,7 @@ INIT_LIST_HEAD(inode-i_devices); INIT_RADIX_TREE(inode-i_data.page_tree, GFP_ATOMIC); rwlock_init(inode-i_data.tree_lock); - spin_lock_init(inode-i_data.i_mmap_lock); + init_rwsem(inode-i_data.i_mmap_sem); INIT_LIST_HEAD(inode-i_data.private_list); spin_lock_init(inode-i_data.private_lock); INIT_RAW_PRIO_TREE_ROOT(inode-i_data.i_mmap); diff --git a/include/linux/fs.h b/include/linux/fs.h --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -503,7 +503,7 @@ unsigned inti_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_headi_mmap_nonlinear;/*list VM_NONLINEAR mappings */ - spinlock_t i_mmap_lock;/* protect tree, count, list */ + struct rw_semaphore i_mmap_sem; /* protect tree, count, list */ unsigned inttruncate_count; /* Cover race condition with truncate */ unsigned long nrpages;/* number of total pages */ pgoff_t writeback_index;/* writeback starts here */ diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -716,7 +716,7 @@ struct address_space *check_mapping;/* Check page-mapping if set */ pgoff_t first_index;/* Lowest page-index to unmap */ pgoff_t last_index; /* Highest page-index to unmap */ - spinlock_t *i_mmap_lock;/* For unmap_mapping_range: */ + struct rw_semaphore *i_mmap_sem;/* For unmap_mapping_range: */ unsigned long truncate_count; /* Compare vm_truncate_count */ }; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -274,12 +274,12 @@ atomic_dec(inode-i_writecount); /* insert tmp into the share list, just after mpnt */ - spin_lock(file-f_mapping-i_mmap_lock); + down_write(file-f_mapping-i_mmap_sem); tmp-vm_truncate_count = mpnt-vm_truncate_count; flush_dcache_mmap_lock(file-f_mapping); vma_prio_tree_add(tmp, mpnt); 
flush_dcache_mmap_unlock(file-f_mapping); - spin_unlock(file-f_mapping-i_mmap_lock); + up_write(file-f_mapping-i_mmap_sem); } /* diff --git a/mm/filemap.c b/mm/filemap.c --- a/mm/filemap.c +++ b/mm/filemap.c @@ -61,16 +61,16 @@ /* * Lock ordering: * - * -i_mmap_lock (vmtruncate) + * -i_mmap_sem (vmtruncate) *-private_lock (__free_pte-__set_page_dirty_buffers) * -swap_lock(exclusive_swap_page, others) *-mapping-tree_lock * * -i_mutex - *-i_mmap_lock(truncate-unmap_mapping_range) + *-i_mmap_sem
Re: [kvm-devel] [patch 5/9] Convert anon_vma lock to rw_sem and refcount
On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
> paging), hence the slowdown. What you benchmarked is the write side,
> which is also the fast path when the system is heavily CPU bound. I
> have to say aim is a great benchmark to test this regression.

I am a bit surprised that brk performance is that important. There may
be other measurements that have to be made to assess how this would
impact a real load.
[kvm-devel] [PATCH 0 of 8] mmu notifiers #v10
Hello, this is the mmu notifier #v10. Patches 1 and 2 are the only difference between this and EMM V2; the rest is the same as with Christoph's patches. I think maximum priority should be given to merging patches 1 and 2 into -mm and ASAP into mainline. Patches 3 to 8 can go in -mm for testing, but I'm not sure if we should support sleep-capable notifiers in mainline unless we make the VM locking conditional, to avoid overscheduling for extremely small critical sections in the common case. I only rediffed Christoph's patches on top of the mmu notifier patches.

KVM's current plans are to depend heavily on mmu notifiers for swapping, to optimize the spte faults, and we need them for smp guest ballooning with madvise(MADV_DONTNEED) and other optimizations and features. Patches 3 to 8 are Christoph's work ported on top of #v10 to make the #v10 mmu notifiers sleep capable (at least supposedly). I didn't test the scheduling, but I assume you'll quickly test XPMEM on top of this.
Re: [kvm-devel] EMM: Fixup return value handling of emm_notify()
On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
> but anyway it's silly to be hardwired to such an interface that, worst
> of all, requires switch statements instead of proper pointers to
> functions, and a fixed set of parameters and retval semantics for all
> methods.

The EMM API with a single callback is the simplest approach at this
point. A common callback for all operations allows the driver to
implement common entry and exit code, as seen in XPMem. I guess we can
complicate this more by switching to a different API or adding
additional emm_xxx() callbacks if need be, but I really want to have a
strong case for why this would be needed. There is the danger of adding
frills with special callbacks for this and that situation that could
make the notifier complicated and specific to a certain usage scenario.
Having this generic simple interface will hopefully avoid such things.
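For comparison, the two dispatch styles under discussion look roughly
like this (a sketch only; xpmem_emm_callback and the placeholder bodies
are illustrative, while the mmu_notifier_ops layout follows the #v10
posting elsewhere in this thread):

/* EMM style: one callback for everything, dispatch via a switch,
 * with one fixed parameter list and retval semantics for all ops */
static int xpmem_emm_callback(struct emm_notifier *e, struct mm_struct *mm,
			      enum emm_operation op,
			      unsigned long start, unsigned long end)
{
	int ret = 0;

	/* common entry code (locking, lookups) would go here */
	switch (op) {
	case emm_invalidate_start:
		/* block secondary faults, flush remote TLBs */
		break;
	case emm_invalidate_end:
		/* allow secondary faults again */
		break;
	default:
		break;
	}
	/* common exit code would go here */
	return ret;
}

/* #v10 style: one function pointer per method, no switch, and each
 * method free to have its own parameters and return value */
struct mmu_notifier_ops {
	void (*invalidate_range_start)(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end);
	void (*invalidate_range_end)(struct mmu_notifier *mn,
				     struct mm_struct *mm,
				     unsigned long start, unsigned long end);
	int (*clear_flush_young)(struct mmu_notifier *mn,
				 struct mm_struct *mm, unsigned long address);
};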
[kvm-devel] [PATCH 5 of 8] We no longer abort unmapping in unmap vmas because we can reschedule while unmapping
# HG changeset patch # User Andrea Arcangeli [EMAIL PROTECTED] # Date 1207159055 -7200 # Node ID 316e5b1e4bf388ef0198c91b3067ed1e4171d7f6 # Parent 3c3787c496cab1fc590ba3f97e7904bdfaab5375 We no longer abort unmapping in unmap vmas because we can reschedule while unmapping since we are holding a semaphore. This would allow moving more of the tlb flusing into unmap_vmas reducing code in various places. Signed-off-by: Christoph Lameter [EMAIL PROTECTED] diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -723,8 +723,7 @@ struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t); unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *); -unsigned long unmap_vmas(struct mmu_gather **tlb, - struct vm_area_struct *start_vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -805,7 +805,6 @@ /** * unmap_vmas - unmap a range of memory covered by a list of vma's - * @tlbp: address of the caller's struct mmu_gather * @vma: the starting vma * @start_addr: virtual address at which to start unmapping * @end_addr: virtual address at which to end unmapping @@ -817,20 +816,13 @@ * Unmap all pages in the vma list. * * We aim to not hold locks for too long (for scheduling latency reasons). - * So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to - * return the ending mmu_gather to the caller. + * So zap pages in ZAP_BLOCK_SIZE bytecounts. * * Only addresses between `start' and `end' will be unmapped. * * The VMA list must be sorted in ascending virtual address order. - * - * unmap_vmas() assumes that the caller will flush the whole unmapped address - * range after unmap_vmas() returns. So the only responsibility here is to - * ensure that any thus-far unmapped pages are flushed before unmap_vmas() - * drops the lock and schedules. 
*/ -unsigned long unmap_vmas(struct mmu_gather **tlbp, - struct vm_area_struct *vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *details) { @@ -838,7 +830,15 @@ unsigned long tlb_start = 0;/* For tlb_finish_mmu */ int tlb_start_valid = 0; unsigned long start = start_addr; - int fullmm = (*tlbp)-fullmm; + int fullmm; + struct mmu_gather *tlb; + struct mm_struct *mm = vma-vm_mm; + + mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); + lru_add_drain(); + tlb = tlb_gather_mmu(mm, 0); + update_hiwater_rss(mm); + fullmm = tlb-fullmm; for ( ; vma vma-vm_start end_addr; vma = vma-vm_next) { unsigned long end; @@ -865,7 +865,7 @@ (HPAGE_SIZE / PAGE_SIZE); start = end; } else - start = unmap_page_range(*tlbp, vma, + start = unmap_page_range(tlb, vma, start, end, zap_work, details); if (zap_work 0) { @@ -873,13 +873,15 @@ break; } - tlb_finish_mmu(*tlbp, tlb_start, start); + tlb_finish_mmu(tlb, tlb_start, start); cond_resched(); - *tlbp = tlb_gather_mmu(vma-vm_mm, fullmm); + tlb = tlb_gather_mmu(vma-vm_mm, fullmm); tlb_start_valid = 0; zap_work = ZAP_BLOCK_SIZE; } } + tlb_finish_mmu(tlb, start_addr, end_addr); + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -893,20 +895,10 @@ unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { - struct mm_struct *mm = vma-vm_mm; - struct mmu_gather *tlb; unsigned long end = address + size; unsigned long nr_accounted = 0; - lru_add_drain(); - tlb = tlb_gather_mmu(mm, 0); - update_hiwater_rss(mm); - mmu_notifier_invalidate_range_start(mm, address, end); - end = unmap_vmas(tlb, vma, address, end, nr_accounted, details); - mmu_notifier_invalidate_range_end(mm, address, end); - if (tlb) - tlb_finish_mmu(tlb, address, end); - return end; + return
[kvm-devel] [PATCH 8 of 8] This patch adds a lock ordering rule to avoid a potential deadlock when multiple mmap_sems need to be locked
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1207159059 -7200
# Node ID f3f119118b0abd9c4624263ef388dc7230d937fe
# Parent 31fc23193bd039cc595fba1ca149a9715f7d0fb2

This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -79,6 +79,9 @@
  *
  *  ->i_mutex			(generic_file_buffered_write)
  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
+ *
+ *  When taking multiple mmap_sems, one should lock the lowest-addressed
+ *  one first proceeding on up to the highest-addressed one.
  *
  *  ->i_mutex
  *    ->i_alloc_sem		(various)
[kvm-devel] [PATCH 3 of 8] Move the tlb flushing into free_pgtables. The conversion of the locks taken for reverse map scanning would require taking sleeping locks in free_pgtables()
# HG changeset patch # User Andrea Arcangeli [EMAIL PROTECTED] # Date 1207159010 -7200 # Node ID d880c227ddf345f5d577839d36d150c37b653bfd # Parent fe00cb9deeb31467396370c835cb808f4b85209a Move the tlb flushing into free_pgtables. The conversion of the locks taken for reverse map scanning would require taking sleeping locks in free_pgtables(). Moving the tlb flushing into free_pgtables allows sleeping in parts of free_pgtables(). This means that we do a tlb_finish_mmu() before freeing the page tables. Strictly speaking there may not be the need to do another tlb flush after freeing the tables. But its the only way to free a series of page table pages from the tlb list. And we do not want to call into the page allocator for performance reasons. Aim9 numbers look okay after this patch. Signed-off-by: Christoph Lameter [EMAIL PROTECTED] diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -751,8 +751,8 @@ void *private); void free_pgd_range(struct mmu_gather **tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); -void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma, - unsigned long floor, unsigned long ceiling); +void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor, + unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); void unmap_mapping_range(struct address_space *mapping, diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -272,9 +272,11 @@ } while (pgd++, addr = next, addr != end); } -void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma, - unsigned long floor, unsigned long ceiling) +void free_pgtables(struct vm_area_struct *vma, unsigned long floor, + unsigned long ceiling) { + struct mmu_gather *tlb; + while (vma) { struct vm_area_struct *next = vma-vm_next; unsigned long addr = vma-vm_start; @@ -286,8 +288,10 @@ unlink_file_vma(vma); if (is_vm_hugetlb_page(vma)) { - hugetlb_free_pgd_range(tlb, addr, vma-vm_end, + tlb = tlb_gather_mmu(vma-vm_mm, 0); + hugetlb_free_pgd_range(tlb, addr, vma-vm_end, floor, next? next-vm_start: ceiling); + tlb_finish_mmu(tlb, addr, vma-vm_end); } else { /* * Optimization: gather nearby vmas into one call down @@ -299,8 +303,10 @@ anon_vma_unlink(vma); unlink_file_vma(vma); } - free_pgd_range(tlb, addr, vma-vm_end, + tlb = tlb_gather_mmu(vma-vm_mm, 0); + free_pgd_range(tlb, addr, vma-vm_end, floor, next? next-vm_start: ceiling); + tlb_finish_mmu(tlb, addr, vma-vm_end); } vma = next; } diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1751,9 +1751,9 @@ mmu_notifier_invalidate_range_start(mm, start, end); unmap_vmas(tlb, vma, start, end, nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(tlb, vma, prev? prev-vm_end: FIRST_USER_ADDRESS, + tlb_finish_mmu(tlb, start, end); + free_pgtables(vma, prev? prev-vm_end: FIRST_USER_ADDRESS, next? next-vm_start: 0); - tlb_finish_mmu(tlb, start, end); mmu_notifier_invalidate_range_end(mm, start, end); } @@ -2050,8 +2050,8 @@ /* Use -1 here to ensure all VMAs in the mm are unmapped */ end = unmap_vmas(tlb, vma, 0, -1, nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); + free_pgtables(vma, FIRST_USER_ADDRESS, 0); /* * Walk the list again, actually closing and freeing it, - Check out the new SourceForge.net Marketplace. 
Re: [kvm-devel] [PATCH 2 of 8] Moves all mmu notifier methods outside the PT lock (first and not last step to make them sleep capable)
On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1626,9 +1626,10 @@
>  	 */
>  	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> -	page_cache_release(old_page);
> +	new_page = NULL;
>  	if (!pte_same(*page_table, orig_pte))
>  		goto unlock;
> +	page_cache_release(old_page);
>  	page_mkwrite = 1;
>  }

This is deferring frees and not moving the callouts. KVM specific? What
exactly is this doing? A significant portion of this seems to be undoing
what the first patch did.
Re: [kvm-devel] EMM: Require single threadedness for registration.
On Thu, 3 Apr 2008, Andrea Arcangeli wrote:
> That would work for #v10 if I remove the invalidate_range_start from
> try_to_unmap_cluster; it can't work for EMM because you have
> emm_invalidate_start firing anywhere outside the context of the current
> task (even regular rmap code, not just the nonlinear corner case, will
> trigger the race). In short the single threaded approach would be

But in that case it will be firing for a callback to another mm_struct.
The notifiers are bound to mm_structs and keep separate contexts.

> The requirement for invalidate_page is that the pte and linux tlb are
> flushed _before_ and the page is freed _after_ the invalidate_page
> method. That's not the case for _begin/_end: the page is freed well
> before _end runs, hence the need of _begin, and the need to block the
> secondary mmu page fault during the vma-mangling operations.

You could flush in _begin and free on _end? I thought you are taking a
refcount on the page? You can drop the refcount only on _end to ensure
that the page does not go away before.
Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls
On Wed, Apr 02, 2008 at 02:54:52PM -0700, Christoph Lameter wrote:
> On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
> > > Hmmm... Okay that is one solution that would just require a BUG_ON
> > > in the registration methods.
> >
> > Perhaps you didn't notice that this solution can't work if you call
> > range_begin/end not in the current context, and try_to_unmap_cluster
> > does exactly that for both my patchset and yours. Missing an _end is
> > ok, missing a _begin is never ok.
>
> If you look at the patch you will see a requirement of holding a
> write lock on mmap_sem which will keep out get_user_pages().

I said try_to_unmap_cluster, not get_user_pages.

	CPU0					CPU1
	try_to_unmap_cluster:
	  emm_invalidate_start in EMM (or
	  mmu_notifier_invalidate_range_start
	  in #v10)
	  walking the list by hand in EMM
	  (or with hlist, cleaner, in #v10)
	  xpmem method invoked
	  schedules for a long while inside
	  invalidate_range_start while skbs
	  are sent
						gru registers
						synchronize_rcu (sorry, useless now)
						single threaded, so taking a page fault
						secondary tlb instantiated
	  xpmem method returns
	  end of the list (didn't notice that
	  it has to restart to flush the gru)
	  zap pte
	  free the page
						gru corrupts memory

CPU1 was single threaded, and CPU0 doesn't hold any mmap_sem or any
other lock that could ever serialize against the GRU as far as I can
tell.

In general my #v10 solution mixing seqlock + rcu looks more robust and
allows multithreaded attachment of mmu notifiers as well. I could have
fixed it with the single threaded approach, thanks to the fact that for
me the only place outside mm->mmap_sem is try_to_unmap_cluster, but it
wasn't simple to convert, nor worth it, given nonlinear isn't worth
optimizing for (not even the core VM cares about try_to_unmap_cluster,
which is in fact the only place in the VM with an O(N) complexity for
each try_to_unmap call, where N is the size of the mapping divided by
the page size).
Re: [kvm-devel] [patch 5/9] Convert anon_vma lock to rw_sem and refcount
On Wed, Apr 02, 2008 at 02:56:25PM -0700, Christoph Lameter wrote:
> I am a bit surprised that brk performance is that important. There may
> be

I think it's not brk but fork that is being slowed down; did you
oprofile? AIM forks a lot... The write-side fast path generating the
overscheduling, I guess, is when the new vmas are created for the child
and queued in the parent anon_vma in O(1), so it is immediate; even
preempt-rt would be ok with it spinning and not scheduling, since it's
just a list_add (much faster than schedule() indeed). Every time there's
a collision, when multiple children fork simultaneously and they all try
to queue in the same anon_vma, things will slow down.
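For reference, the write-side critical section in question is essentially
this (simplified from 2.6.25-era mm/rmap.c):

/* The write side is just a list_add under the lock, so the critical
 * section is tiny.  With a spinlock a contended forker spins for a few
 * cycles; with a semaphore it may schedule, which costs far more than
 * the list_add itself. */
void anon_vma_link(struct vm_area_struct *vma)
{
	struct anon_vma *anon_vma = vma->anon_vma;

	if (anon_vma) {
		spin_lock(&anon_vma->lock);
		list_add_tail(&vma->anon_vma_node, &anon_vma->head);
		spin_unlock(&anon_vma->lock);
	}
}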
Re: [kvm-devel] EMM: Require single threadedness for registration.
On Wed, Apr 02, 2008 at 03:06:19PM -0700, Christoph Lameter wrote:
> On Thu, 3 Apr 2008, Andrea Arcangeli wrote:
> > That would work for #v10 if I remove the invalidate_range_start from
> > try_to_unmap_cluster; it can't work for EMM because you have
> > emm_invalidate_start firing anywhere outside the context of the
> > current task (even regular rmap code, not just the nonlinear corner
> > case, will trigger the race). In short the single threaded approach
> > would be
>
> But in that case it will be firing for a callback to another mm_struct.
> The notifiers are bound to mm_structs and keep separate contexts.

Why can't it fire on the mm_struct where GRU just registered? That
mm_struct existed way before GRU registered, and the VM is free to unmap
it w/o mmap_sem if there is any memory pressure.

> You could flush in _begin and free on _end? I thought you are taking a
> refcount on the page? You can drop the refcount only on _end to ensure
> that the page does not go away before.

We're going to lock + flush on _begin and unlock on _end, w/o
refcounting, to microoptimize. The free is done by
unmap_vmas/madvise/munmap at will. That's a very slow path; inflating
the balloon is not problematic. But invalidate_page allows avoiding
blocking page faults during swapping, so minor faults can happen and
refresh the pte young bits etc... When the VM unmaps the page while
holding the page pin, there's no race, and that's where invalidate_page
is being used, to generate lower invalidation overhead.
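In KVM terms, the two patterns described above might be sketched like
this (illustrative only; the embedding of the notifier in struct kvm,
the invalidate_sem rwsem, and kvm_zap_and_flush_range() are assumptions
of the sketch, not actual KVM code):

/* _begin/_end pattern: stop secondary-mmu faults for the whole
 * critical section instead of pinning pages.  The secondary fault
 * path would take invalidate_sem for read before installing sptes. */
static void kvm_mmu_notifier_range_start(struct mmu_notifier *mn,
					 struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end)
{
	struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);

	down_write(&kvm->invalidate_sem);	/* secondary faults blocked */
	kvm_zap_and_flush_range(kvm, start, end);
}

static void kvm_mmu_notifier_range_end(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start,
				       unsigned long end)
{
	struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);

	up_write(&kvm->invalidate_sem);		/* faults may resume */
}

/* invalidate_page pattern: runs after ptep_clear_flush while the VM
 * still holds a pin on the page, so faults need not be blocked. */
static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
					     struct mm_struct *mm,
					     unsigned long address)
{
	struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);

	kvm_zap_and_flush_range(kvm, address, address + PAGE_SIZE);
}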
Re: [kvm-devel] [PATCH 1 of 8] Core of mmu notifiers
On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

> +	void (*invalidate_page)(struct mmu_notifier *mn,
> +				struct mm_struct *mm,
> +				unsigned long address);
> +
> +	void (*invalidate_range_start)(struct mmu_notifier *mn,
> +				       struct mm_struct *mm,
> +				       unsigned long start, unsigned long end);
> +	void (*invalidate_range_end)(struct mmu_notifier *mn,
> +				     struct mm_struct *mm,
> +				     unsigned long start, unsigned long end);

Still two methods ...

> +void __mmu_notifier_release(struct mm_struct *mm)
> +{
> +	struct mmu_notifier *mn;
> +	unsigned seq;
> +
> +	seq = read_seqbegin(&mm->mmu_notifier_lock);
> +	while (unlikely(!hlist_empty(&mm->mmu_notifier_list))) {
> +		mn = hlist_entry(mm->mmu_notifier_list.first,
> +				 struct mmu_notifier,
> +				 hlist);
> +		hlist_del(&mn->hlist);
> +		if (mn->ops->release)
> +			mn->ops->release(mn, mm);
> +		BUG_ON(read_seqretry(&mm->mmu_notifier_lock, seq));
> +	}
> +}

seqlock just taken for checking if everything is ok?

> +/*
> + * If no young bitflag is supported by the hardware, ->clear_flush_young can
> + * unmap the address and return 1 or 0 depending if the mapping previously
> + * existed or not.
> + */
> +int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
> +				     unsigned long address)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n;
> +	int young = 0;
> +	unsigned seq;
> +
> +	seq = read_seqbegin(&mm->mmu_notifier_lock);
> +	do {
> +		hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_list, hlist) {
> +			if (mn->ops->clear_flush_young)
> +				young |= mn->ops->clear_flush_young(mn, mm,
> +								    address);
> +		}
> +	} while (read_seqretry(&mm->mmu_notifier_lock, seq));
> +

The critical section could be run multiple times for one callback, which
could result in multiple callbacks to clear the young bit. Guess not
that big of an issue?

> +void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> +				    unsigned long address)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n;
> +	unsigned seq;
> +
> +	seq = read_seqbegin(&mm->mmu_notifier_lock);
> +	do {
> +		hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_list, hlist) {
> +			if (mn->ops->invalidate_page)
> +				mn->ops->invalidate_page(mn, mm, address);
> +		}
> +	} while (read_seqretry(&mm->mmu_notifier_lock, seq));
> +}

Ok. A retry would try to invalidate the page a second time, which is not
a problem unless you drop the refcount or make other state changes that
require correspondence with the mapping. I guess this is the reason that
you stopped adding a refcount?

> +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +					   unsigned long start, unsigned long end)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n;
> +	unsigned seq;
> +
> +	seq = read_seqbegin(&mm->mmu_notifier_lock);
> +	do {
> +		hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_list, hlist) {
> +			if (mn->ops->invalidate_range_start)
> +				mn->ops->invalidate_range_start(mn, mm,
> +								start, end);
> +		}
> +	} while (read_seqretry(&mm->mmu_notifier_lock, seq));
> +}

Multiple invalidate_range_starts on the same range? This means the
driver needs to be able to deal with the situation and ignore the
repeated call?

> +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> +					 unsigned long start, unsigned long end)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n;
> +	unsigned seq;
> +
> +	seq = read_seqbegin(&mm->mmu_notifier_lock);
> +	do {
> +		hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_list, hlist) {
> +			if (mn->ops->invalidate_range_end)
> +				mn->ops->invalidate_range_end(mn, mm,
> +							      start, end);
> +		}
> +	} while (read_seqretry(&mm->mmu_notifier_lock, seq));
> +}

A retry can lead to multiple invalidate_range callbacks with the same
parameters? The driver needs to ignore it if the range is already clear?
Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls
On Thu, 3 Apr 2008, Andrea Arcangeli wrote:
> I said try_to_unmap_cluster, not get_user_pages.
>
> 	CPU0				CPU1
> 	try_to_unmap_cluster:
> 	  emm_invalidate_start in EMM (or
> 	  mmu_notifier_invalidate_range_start in #v10)
> 	  walking the list by hand in EMM
> 	  (or with hlist, cleaner, in #v10)
> 	  xpmem method invoked
> 	  schedules for a long while inside
> 	  invalidate_range_start while skbs are sent
> 					gru registers
> 					synchronize_rcu (sorry, useless now)

All of this would be much easier if you could stop the drivel. The sync
rcu was for an earlier release of the mmu notifier. Why the sniping?

> 					single threaded, so taking a page fault
> 					secondary tlb instantiated

The driver must not allow faults to occur between start and end. The
trouble here is that GRU and xpmem are mixed up. If CPU0 had been
running GRU instead of XPMEM, then the fault would not have occurred,
because the gru would have noticed that a range op is active. If both
systems had run xpmem then the same would have worked. I guess this
means that an address space cannot reliably be registered to multiple
subsystems if some of those do not take a refcount. If all drivers were
required to take a refcount then this would also not occur.

> In general my #v10 solution mixing seqlock + rcu looks more robust and
> allows multithreaded attachment of mmu notifiers as well. I could have

Well, it's easy to say that if no one else has looked at it yet. I
expressed some concerns in reply to your post of #v10.
[kvm-devel] [patch 0/3] separate thread for IO handling V3
This version fixes the vmdk problems found by the regression testing.

Dor, regarding the option to disable the IO thread: it would require
duplicating most of the changed code. For now I believe it's better to
get the patch into a state where it's considered stable enough for
inclusion.

Please rerun the regression tests. Thanks.
[kvm-devel] [patch 1/3] QEMU/KVM: separate thread for IO handling
Move IO processing from vcpu0 to a dedicated thread. This removes load on vcpu0 by allowing better cache locality and also improves latency. Signed-off-by: Marcelo Tosatti [EMAIL PROTECTED] Index: kvm-userspace.io/qemu/qemu-kvm.c === --- kvm-userspace.io.orig/qemu/qemu-kvm.c +++ kvm-userspace.io/qemu/qemu-kvm.c @@ -28,6 +28,8 @@ kvm_context_t kvm_context; extern int smp_cpus; +static int qemu_kvm_reset_requested; + pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER; pthread_cond_t qemu_aio_cond = PTHREAD_COND_INITIALIZER; __thread struct vcpu_info *vcpu; @@ -38,6 +40,7 @@ struct qemu_kvm_signal_table { }; static struct qemu_kvm_signal_table io_signal_table; +static struct qemu_kvm_signal_table vcpu_signal_table; #define SIG_IPI (SIGRTMIN+4) @@ -51,6 +54,8 @@ struct vcpu_info { int stopped; } vcpu_info[256]; +pthread_t io_thread; + static inline unsigned long kvm_get_thread_id(void) { return syscall(SYS_gettid); @@ -67,12 +72,19 @@ static void sig_ipi_handler(int n) void kvm_update_interrupt_request(CPUState *env) { -if (env vcpu env != vcpu-env) { - if (vcpu_info[env-cpu_index].signalled) - return; - vcpu_info[env-cpu_index].signalled = 1; - if (vcpu_info[env-cpu_index].thread) - pthread_kill(vcpu_info[env-cpu_index].thread, SIG_IPI); +int signal = 0; + +if (env) { +if (!vcpu) +signal = 1; +if (vcpu env != vcpu-env !vcpu_info[env-cpu_index].signalled) +signal = 1; + +if (signal) { +vcpu_info[env-cpu_index].signalled = 1; +if (vcpu_info[env-cpu_index].thread) +pthread_kill(vcpu_info[env-cpu_index].thread, SIG_IPI); +} } } @@ -105,7 +117,7 @@ static void post_kvm_run(void *opaque, i static int pre_kvm_run(void *opaque, int vcpu) { -CPUState *env = cpu_single_env; +CPUState *env = qemu_kvm_cpu_env(vcpu); kvm_arch_pre_kvm_run(opaque, vcpu); @@ -151,7 +163,8 @@ static int has_work(CPUState *env) return kvm_arch_has_work(env); } -static int kvm_eat_signal(CPUState *env, int timeout) +static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env, + int timeout) { struct timespec ts; int r, e, ret = 0; @@ -160,12 +173,12 @@ static int kvm_eat_signal(CPUState *env, ts.tv_sec = timeout / 1000; ts.tv_nsec = (timeout % 1000) * 100; -r = sigtimedwait(io_signal_table.sigset, siginfo, ts); +r = sigtimedwait(waitset-sigset, siginfo, ts); if (r == -1 (errno == EAGAIN || errno == EINTR) !timeout) return 0; e = errno; pthread_mutex_lock(qemu_mutex); -if (vcpu) +if (env vcpu) cpu_single_env = vcpu-env; if (r == -1 !(errno == EAGAIN || errno == EINTR)) { printf(sigtimedwait: %s\n, strerror(e)); @@ -181,7 +194,7 @@ static int kvm_eat_signal(CPUState *env, if (env vcpu_info[env-cpu_index].stop) { vcpu_info[env-cpu_index].stop = 0; vcpu_info[env-cpu_index].stopped = 1; - pthread_kill(vcpu_info[0].thread, SIG_IPI); + pthread_kill(io_thread, SIGUSR1); } pthread_mutex_unlock(qemu_mutex); @@ -192,24 +205,16 @@ static int kvm_eat_signal(CPUState *env, static void kvm_eat_signals(CPUState *env, int timeout) { int r = 0; +struct qemu_kvm_signal_table *waitset = vcpu_signal_table; -while (kvm_eat_signal(env, 0)) +while (kvm_eat_signal(waitset, env, 0)) r = 1; if (!r timeout) { - r = kvm_eat_signal(env, timeout); + r = kvm_eat_signal(waitset, env, timeout); if (r) - while (kvm_eat_signal(env, 0)) + while (kvm_eat_signal(waitset, env, 0)) ; } -/* - * we call select() even if no signal was received, to account for - * for which there is no signal handler installed. 
- */ -pthread_mutex_lock(qemu_mutex); -cpu_single_env = vcpu-env; -if (env-cpu_index == 0) - main_loop_wait(0); -pthread_mutex_unlock(qemu_mutex); } static void kvm_main_loop_wait(CPUState *env, int timeout) @@ -225,29 +230,29 @@ static int all_threads_paused(void) { int i; -for (i = 1; i smp_cpus; ++i) +for (i = 0; i smp_cpus; ++i) if (vcpu_info[i].stopped) return 0; return 1; } -static void pause_other_threads(void) +static void pause_all_threads(void) { int i; -for (i = 1; i smp_cpus; ++i) { +for (i = 0; i smp_cpus; ++i) { vcpu_info[i].stop = 1; pthread_kill(vcpu_info[i].thread, SIG_IPI); } while (!all_threads_paused()) - kvm_eat_signals(vcpu-env, 0); + kvm_eat_signal(io_signal_table, NULL, 1000); } -static void resume_other_threads(void) +static void resume_all_threads(void) { int i; -for (i = 1; i smp_cpus; ++i) { +for (i = 0; i smp_cpus; ++i) {
[kvm-devel] [patch 3/3] QEMU/KVM: notify IO thread of pending bhs
This fixes the slow vmdk issue found in the regression tests.

Signed-off-by: Marcelo Tosatti [EMAIL PROTECTED]

Index: kvm-userspace.io/qemu/qemu-kvm.c
===
--- kvm-userspace.io.orig/qemu/qemu-kvm.c
+++ kvm-userspace.io/qemu/qemu-kvm.c
@@ -411,6 +411,12 @@ int kvm_init_ap(void)
     return 0;
 }
 
+void qemu_kvm_notify_work(void)
+{
+    if (io_thread)
+        pthread_kill(io_thread, SIGUSR1);
+}
+
 /*
  * The IO thread has all signals that inform machine events
  * blocked (io_signal_table), so it won't get interrupted

Index: kvm-userspace.io/qemu/qemu-kvm.h
===
--- kvm-userspace.io.orig/qemu/qemu-kvm.h
+++ kvm-userspace.io/qemu/qemu-kvm.h
@@ -60,6 +60,8 @@ void qemu_kvm_aio_wait_start(void);
 void qemu_kvm_aio_wait(void);
 void qemu_kvm_aio_wait_end(void);
 
+void qemu_kvm_notify_work(void);
+
 void kvm_tpr_opt_setup();
 void kvm_tpr_access_report(CPUState *env, uint64_t rip, int is_write);
 int handle_tpr_access(void *opaque, int vcpu,

Index: kvm-userspace.io/qemu/vl.c
===
--- kvm-userspace.io.orig/qemu/vl.c
+++ kvm-userspace.io/qemu/vl.c
@@ -7588,6 +7588,8 @@ void qemu_bh_schedule(QEMUBH *bh)
     if (env) {
         cpu_interrupt(env, CPU_INTERRUPT_EXIT);
     }
+    if (kvm_enabled())
+        qemu_kvm_notify_work();
 }
 
 void qemu_bh_cancel(QEMUBH *bh)
[kvm-devel] [patch 2/3] QEMU/KVM: add function to handle signals
SIGUSR1 has no handler, and the SIGUSR2 one does nothing useful anymore
(that's also true for SIGIO on the tap fd, which runs host_alarm_handler
unnecessarily).

Signed-off-by: Marcelo Tosatti [EMAIL PROTECTED]

Index: kvm-userspace.io/qemu/qemu-kvm.c
===
--- kvm-userspace.io.orig/qemu/qemu-kvm.c
+++ kvm-userspace.io/qemu/qemu-kvm.c
@@ -163,13 +163,30 @@ static int has_work(CPUState *env)
     return kvm_arch_has_work(env);
 }
 
+static int kvm_process_signal(int si_signo)
+{
+    struct sigaction sa;
+
+    switch (si_signo) {
+    case SIGUSR2:
+        pthread_cond_signal(&qemu_aio_cond);
+        break;
+    case SIGALRM:
+    case SIGIO:
+        sigaction(si_signo, NULL, &sa);
+        sa.sa_handler(si_signo);
+        break;
+    }
+
+    return 1;
+}
+
 static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env,
                           int timeout)
 {
     struct timespec ts;
     int r, e, ret = 0;
     siginfo_t siginfo;
-    struct sigaction sa;
 
     ts.tv_sec = timeout / 1000;
     ts.tv_nsec = (timeout % 1000) * 1000000;
@@ -184,13 +201,9 @@ static int kvm_eat_signal(struct qemu_kv
         printf("sigtimedwait: %s\n", strerror(e));
         exit(1);
     }
-    if (r != -1) {
-        sigaction(siginfo.si_signo, NULL, &sa);
-        sa.sa_handler(siginfo.si_signo);
-        if (siginfo.si_signo == SIGUSR2)
-            pthread_cond_signal(&qemu_aio_cond);
-        ret = 1;
-    }
+    if (r != -1)
+        ret = kvm_process_signal(siginfo.si_signo);
+
     if (env && vcpu_info[env->cpu_index].stop) {
         vcpu_info[env->cpu_index].stop = 0;
         vcpu_info[env->cpu_index].stopped = 1;
Re: [kvm-devel] [02/17][PATCH] Implement smp_call_function_mask for ia64 - V8
> > Have you looked at Jens Axboe's patches to make all this stuff a lot
> > more arch-common?
>
> Nope, do you have a pointer?

Check your favourite archive for this Subject line:

    Generic smp_call_function(), improvements, and smp_call_function_single()

-Tony
Re: [kvm-devel] [PATCH 1 of 8] Core of mmu notifiers
On Wed, Apr 02, 2008 at 03:34:01PM -0700, Christoph Lameter wrote:
> Still two methods ...

Yes. invalidate_page is called with the core VM holding a reference on
the page, _after_ the tlb flush, while invalidate_range_end is called
after the page has been freed already and after the tlb flush. They have
different semantics, and with invalidate_page there's no need to block
the kvm fault handler. But invalidate_page is only the most efficient
for operations that aren't creating holes in the vma; for the rest,
invalidate_range_start/end provides the best performance by reducing the
number of tlb flushes.

> seqlock just taken for checking if everything is ok?

Exactly.

> The critical section could be run multiple times for one callback,
> which could result in multiple callbacks to clear the young bit. Guess
> not that big of an issue?

Yes, that's ok.

> Ok. A retry would try to invalidate the page a second time, which is
> not a problem unless you drop the refcount or make other state changes
> that require correspondence with the mapping. I guess this is the
> reason that you stopped adding a refcount?

The current patch using mmu notifiers is already robust against multiple
invalidates. The refcount represents an spte mapping: if we already
invalidated it, the spte will be nonpresent and there's no page to
unpin. The removal of the refcount is only a microoptimization.

> Multiple invalidate_range_starts on the same range? This means the
> driver needs to be able to deal with the situation and ignore the
> repeated call?

The driver would need to store current->pid in a list in range_start and
remove it in range_end, and range_end would need to do nothing at all if
the pid isn't found in the list. But thinking about it more, I'm not
convinced the driver is safe by ignoring a range_end that runs before
any range_begin (pid not found in the list), and I don't see a clear way
to fix it, neither internally to the device driver nor externally.

So the repeated call is easy to handle for the driver. What is not
trivial is to block the secondary page faults when mmu_notifier_register
happens in the middle of a range_start/end critical section. sptes can
be established in between range_start/_end and that shouldn't happen. So
the core problem returns to be how to handle mmu_notifier_register
happening in the middle of _range_start/_end; dismissing it as a job for
the driver seems not feasible (you have the same problem with EMM of
course).

> Retry can lead to multiple invalidate_range callbacks with the same
> parameters? Driver needs to ignore if the range is already clear?

Mostly covered above.
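The pid bookkeeping described above might look like the following
(purely illustrative; the open_range structure and its locking are
invented for the example, and allocation-failure handling is omitted):

/* Remember which tasks have an open range_start, so a range_end with
 * no matching range_start (pid not found) can simply be ignored. */
struct open_range {
	struct list_head list;
	pid_t pid;
};

static LIST_HEAD(open_ranges);
static DEFINE_SPINLOCK(open_ranges_lock);

static void my_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
			   unsigned long start, unsigned long end)
{
	struct open_range *r = kmalloc(sizeof(*r), GFP_ATOMIC);

	if (!r)
		return;		/* sketch only: real code must handle this */
	r->pid = current->pid;
	spin_lock(&open_ranges_lock);
	list_add(&r->list, &open_ranges);
	spin_unlock(&open_ranges_lock);
	/* ... block secondary faults, flush secondary tlb ... */
}

static void my_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
			 unsigned long start, unsigned long end)
{
	struct open_range *r, *found = NULL;

	spin_lock(&open_ranges_lock);
	list_for_each_entry(r, &open_ranges, list)
		if (r->pid == current->pid) {
			found = r;
			list_del(&found->list);
			break;
		}
	spin_unlock(&open_ranges_lock);
	if (!found)
		return;		/* _end without a matching _start: ignore */
	kfree(found);
	/* ... allow secondary faults again ... */
}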
Re: [kvm-devel] [PATCH 1 of 8] Core of mmu notifiers
Thinking about this adventurous locking some more: I think you are
misunderstanding what a seqlock is. It is *not* a spinlock. The critical
read section, with the reading of a version before and after, gives you
access to a certain version of memory as it is or was some time ago
(caching effect). It does not mean that the current state of memory is
fixed, and neither does it allow syncing when an item is added to the
list. So it could be that you are traversing a list that is missing one
item because it is not visible to this processor yet. You may just see a
state from the past. I would think that you will need a real lock in
order to get the desired effect.
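For reference, the canonical seqlock usage pattern is (a minimal sketch;
the reader never excludes the writer, it only detects that a writer ran
and retries):

#include <linux/seqlock.h>

static DEFINE_SEQLOCK(lock);

/* Writer: serialized against other writers, bumps the sequence count */
static void update(void)
{
	write_seqlock(&lock);
	/* ... modify the protected data ... */
	write_sequnlock(&lock);
}

/* Reader: retries if the sequence count changed underneath it, i.e.
 * it may have observed an inconsistent or stale snapshot */
static void read_data(void)
{
	unsigned seq;

	do {
		seq = read_seqbegin(&lock);
		/* ... read the protected data ... */
	} while (read_seqretry(&lock, seq));
}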
Re: [kvm-devel] Ubuntu Gutsy host / XP guest / -smp 2
David Abrahams wrote:
> on Wed Apr 02 2008, Avi Kivity <avi-AT-qumranet.com> wrote:
> > David Abrahams wrote:
> > > With the combination in the subject line, the guest takes nearly
> > > 100% of my real CPU time and still only sees one CPU. Is this a
> > > known problem, and does it have a known solution?
> >
> > Can you send the output of 'kvm_stat -1'?
>
> I don't appear to have a kvm_stat executable anywhere. How do I
> acquire that?

Hi David,

Do you have the kvm userspace code? If you do, you can find it at
kvm-userspace/kvm_stat. You can also get the code with
git clone git://git.kernel.org/pub/scm/virt/kvm/kvm-userspace.git.
[kvm-devel] EMM: disable other notifiers before register and unregister
Ok, let's forget about the single threaded thing to solve the
registration races. As Andrea pointed out, this still has issues with
other subscribed subsystems (and also try_to_unmap). We could do
something like what stop_machine_run does: first disable all running
subsystems before registering a new one. Maybe this is a possible
solution.

Subject: EMM: disable other notifiers before register and unregister

As Andrea has pointed out: there are races during registration if other
subsystem notifiers are active while we register a callback. Solve that
issue by adding two new notifiers:

emm_stop
	Stops the notifier operations. The notifier must block on
	invalidate_start and emm_referenced from this point on. If an
	invalidate_start has not been completed by a call to
	invalidate_end then the driver must wait until the operation is
	complete before returning.

emm_start
	Restart notifier operations.

Before registration all other subscribed subsystems are stopped. Then
the new subsystem is subscribed and things can get running without
consistency issues. Subsystems are restarted after the lists have been
updated.

This also works for unregistering. If we can get all subsystems to stop
then we can also reliably unregister a subsystem. So provide that
callback.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/rmap.h |   10 +++---
 mm/rmap.c            |   30 ++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/rmap.h
===
--- linux-2.6.orig/include/linux/rmap.h	2008-04-02 18:16:07.906032549 -0700
+++ linux-2.6/include/linux/rmap.h	2008-04-02 18:17:10.291070009 -0700
@@ -94,7 +94,9 @@ enum emm_operation {
 	emm_release,		/* Process exiting, */
 	emm_invalidate_start,	/* Before the VM unmaps pages */
 	emm_invalidate_end,	/* After the VM unmapped pages */
-	emm_referenced		/* Check if a range was referenced */
+	emm_referenced,		/* Check if a range was referenced */
+	emm_stop,		/* Halt all faults/invalidate_starts */
+	emm_start,		/* Restart operations */
 };
 
 struct emm_notifier {
@@ -126,13 +128,15 @@ static inline int emm_notify(struct mm_s
 
 /*
  * Register a notifier with an mm struct. Release occurs when the process
- * terminates by calling the notifier function with emm_release.
+ * terminates by calling the notifier function with emm_release or when
+ * emm_notifier_unregister is called.
  *
  * Must hold the mmap_sem for write.
  */
 extern void emm_notifier_register(struct emm_notifier *e,
 					struct mm_struct *mm);
-
+extern void emm_notifier_unregister(struct emm_notifier *e,
+					struct mm_struct *mm);
 
 /*
  * Called from mm/vmscan.c to handle paging out

Index: linux-2.6/mm/rmap.c
===
--- linux-2.6.orig/mm/rmap.c	2008-04-02 18:16:09.378057062 -0700
+++ linux-2.6/mm/rmap.c	2008-04-02 18:16:10.710079201 -0700
@@ -289,16 +289,46 @@ void emm_notifier_release(struct mm_stru
 /* Register a notifier */
 void emm_notifier_register(struct emm_notifier *e, struct mm_struct *mm)
 {
+	/* Bring all other notifiers into a quiescent state */
+	emm_notify(mm, emm_stop, 0, TASK_SIZE);
+
 	e->next = mm->emm_notifier;
+
 	/*
 	 * The update to emm_notifier (e->next) must be visible
 	 * before the pointer becomes visible.
 	 * rcu_assign_pointer() does exactly what we need.
 	 */
 	rcu_assign_pointer(mm->emm_notifier, e);
+
+	/* Continue notifiers */
+	emm_notify(mm, emm_start, 0, TASK_SIZE);
 }
 EXPORT_SYMBOL_GPL(emm_notifier_register);
 
+/* Unregister a notifier */
+void emm_notifier_unregister(struct emm_notifier *e, struct mm_struct *mm)
+{
+	struct emm_notifier *p;
+
+	emm_notify(mm, emm_stop, 0, TASK_SIZE);
+
+	p = mm->emm_notifier;
+	if (e == p)
+		mm->emm_notifier = e->next;
+	else {
+		while (p->next != e)
+			p = p->next;
+
+		p->next = e->next;
+	}
+	e->next = mm->emm_notifier;
+
+	emm_notify(mm, emm_start, 0, TASK_SIZE);
+	e->callback(e, mm, emm_release, 0, TASK_SIZE);
+}
+EXPORT_SYMBOL_GPL(emm_notifier_unregister);
+
 /*
  * Perform a callback
  *
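From the driver's side, honoring the new pair of operations could look
roughly like this (an illustrative sketch, not part of the patch;
my_notifier_state and its rwsem are invented for the example):

/* Hypothetical driver state for emm_stop/emm_start: the driver's
 * secondary-mmu faults and invalidate_starts take stop_sem for read,
 * so holding it for write both blocks new operations and waits for
 * in-flight ones to drain. */
struct my_notifier_state {
	struct emm_notifier notifier;
	struct rw_semaphore stop_sem;	/* write-held while stopped */
};

static int my_callback(struct emm_notifier *e, struct mm_struct *mm,
		       enum emm_operation op,
		       unsigned long start, unsigned long end)
{
	struct my_notifier_state *s =
		container_of(e, struct my_notifier_state, notifier);

	switch (op) {
	case emm_stop:
		/* blocks until pending invalidate_start/end pairs and
		 * faults (read holders of stop_sem) have completed */
		down_write(&s->stop_sem);
		break;
	case emm_start:
		up_write(&s->stop_sem);
		break;
	default:
		break;
	}
	return 0;
}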
Re: [kvm-devel] Ubuntu Gutsy host / XP guest / -smp 2
David Abrahams wrote:
> > What HAL do you see in device manager?
>
> In the guest? I don't see anything that obviously appears to be HAL in
> device manager. Could you be more specific?

Hi,

He's referring to the Windows computer type -- the entry under
"Computer" in device manager. See
http://kvm.qumranet.com/kvmwiki/Windows_ACPI_Workaround

-jim
[kvm-devel] [Patch][00/18] kvm-ia64 for kernel V10
Compared with V9, this just fixes indentation issues in patch 12. I put
the patchset in
git://git.kernel.org/pub/scm/linux/kernel/git/xiantao/kvm-ia64.git kvm-ia64-mc10.
Please help to review it. In particular, the first two patches (the TR
management patch and the smp_call_function_mask patch) need Tony's
review and ack. Thanks :-)
Xiantao