Re: [PATCH v4 1/6] KVM: MMU: introduce gfn_to_pfn_atomic() function
On Fri, Jul 02, 2010 at 01:47:56PM -0300, Marcelo Tosatti wrote:
> On Thu, Jul 01, 2010 at 09:53:04PM +0800, Xiao Guangrong wrote:
> > Introduce gfn_to_pfn_atomic(). It is the fast path and can be used in
> > atomic context; a later patch will use it.
> >
> > Signed-off-by: Xiao Guangrong <xiaoguangr...@cn.fujitsu.com>
> > ---
> >  arch/x86/mm/gup.c        |  2 ++
> >  include/linux/kvm_host.h |  1 +
> >  virt/kvm/kvm_main.c      | 32 +---
> >  3 files changed, 28 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> > index 738e659..0c9034b 100644
> > --- a/arch/x86/mm/gup.c
> > +++ b/arch/x86/mm/gup.c
> > @@ -6,6 +6,7 @@
> >   */
> >  #include <linux/sched.h>
> >  #include <linux/mm.h>
> > +#include <linux/module.h>
> >  #include <linux/vmstat.h>
> >  #include <linux/highmem.h>
> >
> > @@ -274,6 +275,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> >  	return nr;
> >  }
> > +EXPORT_SYMBOL_GPL(__get_user_pages_fast);
> >
> >  /**
> >   * get_user_pages_fast() - pin user pages in memory
>
> This should be a separate patch (overlooked that before). Ingo, Nick,
> can this go in through the kvm tree?

I'm happy for it to be exported. Yes, put it in a separate patch and add
an Acked-by: me on it, please. Beyond that I don't care what tree it
goes through :) although Ingo might have an opinion.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Slow vmalloc in 2.6.35-rc3
On Sun, Jun 27, 2010 at 12:17:42PM +0300, Avi Kivity wrote:
> On 06/24/2010 06:14 PM, Nick Piggin wrote:
> > On Thu, Jun 24, 2010 at 12:19:32PM +0300, Avi Kivity wrote:
> > > I see really slow vmalloc performance on 2.6.35-rc3:
> >
> > Can you try this patch?
> > http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-vmap-area-cache.patch
>
> The patch completely eliminates the problem.

Thanks for testing. Andrew, the patch changelog can be updated. Avi and
Steven didn't give me numbers, but it solves both their performance
regressions, so I would say it is ready for merging.

Thanks,
Nick
Re: Slow vmalloc in 2.6.35-rc3
On Thu, Jun 24, 2010 at 12:19:32PM +0300, Avi Kivity wrote:
> I see really slow vmalloc performance on 2.6.35-rc3:

Can you try this patch?
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-vmap-area-cache.patch

> # tracer: function_graph
> #
> # CPU  DURATION                  FUNCTION CALLS
> # |     |   |                     |   |   |   |
>  3)   3.581 us   |  vfree();
>  3)              |  msr_io() {
>  3) ! 523.880 us |    vmalloc();
>  3)   1.702 us   |    vfree();
>  3) ! 529.960 us |  }
>  3)              |  msr_io() {
>  3) ! 564.200 us |    vmalloc();
>  3)   1.429 us   |    vfree();
>  3) ! 568.080 us |  }
>  3)              |  msr_io() {
>  3) ! 578.560 us |    vmalloc();
>  3)   1.697 us   |    vfree();
>  3) ! 584.791 us |  }
>  3)              |  msr_io() {
>  3) ! 559.657 us |    vmalloc();
>  3)   1.566 us   |    vfree();
>  3) ! 575.948 us |  }
>  3)              |  msr_io() {
>  3) ! 536.558 us |    vmalloc();
>  3)   1.553 us   |    vfree();
>  3) ! 542.243 us |  }
>  3)              |  msr_io() {
>  3) ! 560.086 us |    vmalloc();
>  3)   1.448 us   |    vfree();
>  3) ! 569.387 us |  }
>
> msr_io() is from arch/x86/kvm/x86.c, allocating at most 4K (yes, it
> should use kmalloc()). The memory is immediately vfree()ed. There are
> 96 entries in /proc/vmallocinfo, and the whole thing is single
> threaded, so there should be no contention.

Yep, it should use kmalloc.

> Here's the perf report:
>
>     63.97%  qemu  [kernel]  [k] rb_next
>             |
>             --- rb_next
>                |
>                |--70.75%-- alloc_vmap_area
>                |          __get_vm_area_node
>                |          __vmalloc_node
>                |          vmalloc
>                |          |
>                |          |--99.15%-- msr_io
>                |          |          kvm_arch_vcpu_ioctl
>                |          |          kvm_vcpu_ioctl
>                |          |          vfs_ioctl
>                |          |          do_vfs_ioctl
>                |          |          sys_ioctl
>                |          |          system_call
>                |          |          __GI_ioctl
>                |          |          |
>                |          |          --100.00%-- 0x1dfc4a8878e71362
>                |          |
>                |           --0.85%-- __kvm_set_memory_region
>                |                     kvm_set_memory_region
>                |                     kvm_vm_ioctl_set_memory_region
>                |                     kvm_vm_ioctl
>                |                     vfs_ioctl
>                |                     do_vfs_ioctl
>                |                     sys_ioctl
>                |                     system_call
>                |                     __GI_ioctl
>                |
>                 --29.25%-- __get_vm_area_node
>                           __vmalloc_node
>                           vmalloc
>                           |
>                           |--98.89%-- msr_io
>                           |          kvm_arch_vcpu_ioctl
>                           |          kvm_vcpu_ioctl
>                           |          vfs_ioctl
>                           |          do_vfs_ioctl
>                           |          sys_ioctl
>                           |          system_call
>                           |          __GI_ioctl
>                           |          |
>                           |          --100.00%-- 0x1dfc4a8878e71362
>
> It seems completely wrong - iterating 8 levels of a binary tree
> shouldn't take half a millisecond.

It's not iterating down the tree, it's iterating through the nodes to
find a free area. It slows down because lazy vunmap means that quite a
lot of little areas build up right at the start of our search start
address. The vmap cache should hopefully fix it up.

Thanks,
Nick
Re: [PATCH 0/4] Really lazy fpu
On Wed, Jun 16, 2010 at 10:39:41AM +0200, Ingo Molnar wrote:
> (Cc:-ed various performance/optimization folks)
>
> * Avi Kivity <a...@redhat.com> wrote:
>
> > On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
> > > On 06/16/2010 12:24 AM, Avi Kivity wrote:
> > > > Ingo, Peter, any feedback on this?
> > >
> > > Conceptually, this makes sense to me. However, I have a concern
> > > about what happens when a task is scheduled on another CPU while its
> > > FPU state is still in registers on the original CPU. That would seem
> > > to require expensive IPIs to spill the state in order for the
> > > rescheduling to proceed, and this could really damage performance.
> >
> > Right, this optimization isn't free. I think the tradeoff is
> > favourable since task migrations are much less frequent than context
> > switches within the same cpu; can the scheduler experts comment?
>
> This cannot be stated categorically without precise measurements of
> known-good, known-bad, average FPU usage and average CPU usage
> scenarios. All these workloads have different characteristics.
>
> I can imagine bad effects across all sorts of workloads: tcpbench,
> AIM7, various lmbench components, X benchmarks, tiobench - you name it.
> Combined with the fact that most micro-benchmarks won't be using the
> FPU, while in the long run most processes will be using the FPU due to
> SIMD instructions. So even a positive result might be skewed in
> practice. Has to be measured carefully IMO - and I haven't seen a
> _single_ performance measurement in the submission mail. This is really
> essential.

It can be nice to code an absolute worst-case microbenchmark too.

Task migration can actually be very important, to the point of being
almost a fastpath in some workloads where threads are oversubscribed to
CPUs and blocking on some contended resource (IO or mutex or whatever).
I suspect the main issues in that case are the actual context switching
and contention, but it would be nice to see just how much slower it
could get.
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 10:52:51AM +0200, Andi Kleen wrote:
> On Thu, Jun 03, 2010 at 09:50:51AM +0530, Srivatsa Vaddagiri wrote:
> > On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
> > > There are two separate problems: the more general problem is that
> > > the hypervisor can put a vcpu to sleep while holding a lock, causing
> > > other vcpus to spin until the end of their time slice. This can only
> > > be addressed with hypervisor help.
> >
> > Fyi - I have an early patch ready to address this issue. Basically I
> > am using host-kernel memory (mmap'ed into guest as io-memory via
> > ivshmem driver) to hint host whenever guest is in spin-lock'ed
> > section, which is read by host scheduler to defer preemption.
>
> Looks like a nice simple way to handle this for the kernel. However I
> suspect user space will hit the same issue sooner or later. I assume
> your way is not easily extensible to futexes?

Well, userspace has always had the problem, hypervisor or not. So
sleeping locks obviously help a lot with that.

But we do hit the problem at times. The MySQL sysbench scalability
problem was a fine example: http://ozlabs.org/~anton/linux/sysbench/

Performance would tank when threads oversubscribed CPUs, because lock
holders would start getting preempted. This was due to nasty locking in
MySQL as well, mind you.

There are some ways to improve it. glibc I believe has an option to
increase thread priority when taking a mutex, which is similar to what
we have here. But it's a fairly broad problem for userspace. The
resource may not be just a lock but it could be IO as well.
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 09:50:51AM +0530, Srivatsa Vaddagiri wrote:
> On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
> > There are two separate problems: the more general problem is that the
> > hypervisor can put a vcpu to sleep while holding a lock, causing other
> > vcpus to spin until the end of their time slice. This can only be
> > addressed with hypervisor help.
>
> Fyi - I have an early patch ready to address this issue. Basically I am
> using host-kernel memory (mmap'ed into guest as io-memory via ivshmem
> driver) to hint host whenever guest is in spin-lock'ed section, which
> is read by host scheduler to defer preemption.
>
> Guest side:
>
> static inline void spin_lock(spinlock_t *lock)
> {
> 	raw_spin_lock(&lock->rlock);
> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
> }
>
> static inline void spin_unlock(spinlock_t *lock)
> {
> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
> 	raw_spin_unlock(&lock->rlock);
> }
>
> [similar changes to other spinlock variants]

Great, this is a nice way to improve it.

You might want to consider playing with first taking a ticket, and then
if we fail to acquire the lock immediately, incrementing defer_preempt
before we start spinning.

The downside of this would be if we waste all our slice on spinning and
then get preempted in the critical section. But with ticket locks you
can easily see how many entries in the queue are in front of you. So you
could experiment with starting to defer preemption when we notice we are
getting toward the head of the queue.

Have you also looked at how s390 handles this? It checks if the owning
vcpu is running and if so it spins, otherwise it yields to the
hypervisor. Something like turning it into an adaptive lock. This could
be applicable as well.
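The ticket-plus-defer idea suggested above can be sketched in portable
userspace C11. This is purely a hypothetical illustration, not the actual
patch: defer_preempt here is a plain thread-local counter standing in for
the per-vcpu hint word mmap'ed from the host, and DEFER_QUEUE_DEPTH is an
invented tunable for "getting toward the head of the queue".

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative sketch: raise the preemption-defer hint only once we are
 * near the head of the ticket queue, rather than for the whole wait. */

#define DEFER_QUEUE_DEPTH 4	/* invented tunable */

struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

/* Stand-in for the per-vcpu hint word shared with the host scheduler. */
static _Thread_local unsigned int defer_preempt;

static void ticket_lock_acquire(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);
	unsigned int cur = atomic_load(&l->owner);

	if (cur != me) {
		/* Far back in the queue: spin without the hint, since
		 * preempting us here costs the other waiters little. */
		while ((cur = atomic_load(&l->owner)) != me) {
			if (me - cur <= DEFER_QUEUE_DEPTH)
				break;
		}
		/* Near the head (or already owner): ask the host not to
		 * preempt us, then wait out the remaining waiters. */
		defer_preempt++;
		while (atomic_load(&l->owner) != me)
			;	/* cpu_relax() in kernel code */
	} else {
		defer_preempt++;	/* uncontended fast path */
	}
	/* The hint stays raised across the critical section, as in the
	 * guest-side patch under discussion. */
}

static void ticket_lock_release(struct ticket_lock *l)
{
	defer_preempt--;
	atomic_fetch_add(&l->owner, 1);
}
```

The trade-off discussed above is visible here: a waiter deep in the queue
can still be preempted cheaply, while a near-head waiter or holder asks
for deferral.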
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 05:34:50PM +0530, Srivatsa Vaddagiri wrote:
> On Thu, Jun 03, 2010 at 08:38:55PM +1000, Nick Piggin wrote:
> > > Guest side:
> > >
> > > static inline void spin_lock(spinlock_t *lock)
> > > {
> > > 	raw_spin_lock(&lock->rlock);
> > > +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
> > > }
> > >
> > > static inline void spin_unlock(spinlock_t *lock)
> > > {
> > > +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
> > > 	raw_spin_unlock(&lock->rlock);
> > > }
> > >
> > > [similar changes to other spinlock variants]
> >
> > Great, this is a nice way to improve it. You might want to consider
> > playing with first taking a ticket, and then if we fail to acquire
> > the lock immediately, then incrementing defer_preempt before we start
> > spinning.
> >
> > The downside of this would be if we waste all our slice on spinning
> > and then get preempted in the critical section. But with ticket locks
> > you can easily see how many entries in the queue are in front of you.
> > So you could experiment with starting to defer preemption when we
> > notice we are getting toward the head of the queue.
>
> Mm - my goal is to avoid long spin times in the first place (because
> the owning vcpu was descheduled at an unfortunate time, i.e. while it
> was holding a lock). From that sense, I am targetting preemption-defer
> of the lock *holder* rather than of the lock acquirer. So ideally,
> whenever somebody tries to grab a lock, it should be free most of the
> time; it can be held only if the owner is currently running - which
> means we won't have to spin too long for the lock.

Holding a ticket in the queue is effectively the same as holding the
lock, from the pov of processes waiting behind. The difference of course
is that CPU cycles do not directly reduce latency of ticket holders
(only the owner). Spinlock critical sections should tend to be several
orders of magnitude shorter than context switch times. So if you preempt
the guy waiting at the head of the queue, then it's almost as bad as
preempting the lock holder.

> > Have you also looked at how s390 checks if the owning vcpu is running
> > and if so it spins, if not yields to the hypervisor. Something like
> > turning it into an adaptive lock. This could be applicable as well.
>
> I don't think even s390 does adaptive spinlocks. Also afaik s390 zVM
> does gang scheduling of vcpus, which reduces the severity of this
> problem very much - essentially lock acquirer/holder are run
> simultaneously on different cpus all the time. Gang scheduling is on
> my list of things to look at much later (although I have been warned
> that it's a scalability nightmare!).

It effectively is pretty well an adaptive lock. The spinlock itself
doesn't sleep of course, but it yields to the hypervisor if the owner
has been preempted. This is pretty close to analogous with Linux
adaptive mutexes.

s390 also has the diag9c instruction, which I suppose somehow boosts the
priority of a preempted contended lock holder. In spite of any other
possible optimizations in their hypervisor like gang scheduling, diag9c
apparently provides quite a large improvement in some cases. And they
aren't even using ticket spinlocks!!

So I think these things are fairly important to look at.
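The spin-or-yield behaviour described here - spin while the owner's vcpu
is running, give the cycles back to the hypervisor otherwise - can be
sketched as follows. Everything named in this sketch is illustrative:
vcpu_is_running() would really be backed by run-state the hypervisor
shares with the guest, and sched_yield() stands in for a directed-yield
hypercall; none of it is s390's or KVM's actual interface.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of an "adaptive" spinlock for a paravirtualized guest:
 * spinning is cheapest while the holder is actually running, but if the
 * holder's vcpu is preempted, burning the rest of our slice is waste. */

struct owner_lock {
	atomic_int locked;	/* 0 = free, 1 = held */
	atomic_int owner;	/* vcpu id of the holder */
};

/* Toy run-state table so the sketch is self-contained. */
static bool vcpu_running[8] = {
	true, true, true, true, true, true, true, true
};

static bool vcpu_is_running(int vcpu)
{
	return vcpu_running[vcpu];
}

static void adaptive_spin_lock(struct owner_lock *l, int self)
{
	for (;;) {
		int expected = 0;

		if (atomic_compare_exchange_weak(&l->locked, &expected, 1)) {
			atomic_store(&l->owner, self);
			return;
		}
		/* Holder preempted? Yield to the hypervisor so it can run
		 * the holder; otherwise keep spinning, since critical
		 * sections are much shorter than context switches. */
		if (!vcpu_is_running(atomic_load(&l->owner)))
			sched_yield();
	}
}

static void adaptive_spin_unlock(struct owner_lock *l)
{
	atomic_store(&l->locked, 0);
}
```

As noted above, this is close in spirit to Linux adaptive mutexes: the
lock never sleeps, it just stops wasting cycles once spinning cannot make
progress.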
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 06:28:21PM +0530, Srivatsa Vaddagiri wrote:
> On Thu, Jun 03, 2010 at 10:38:32PM +1000, Nick Piggin wrote:
> > Holding a ticket in the queue is effectively the same as holding the
> > lock, from the pov of processes waiting behind. The difference of
> > course is that CPU cycles do not directly reduce latency of ticket
> > holders (only the owner). Spinlock critical sections should tend to
> > be several orders of magnitude shorter than context switch times. So
> > if you preempt the guy waiting at the head of the queue, then it's
> > almost as bad as preempting the lock holder.
>
> Ok, got it - although that approach is not advisable in some cases, for
> ex: when the lock holder vcpu and lock acquirer vcpu are scheduled on
> the same pcpu by the hypervisor (which was experimented with in [1],
> where they found a huge hit in perf).

Sure, but if you had adaptive yielding, that solves that problem.

> I agree that in general we should look at deferring preemption of the
> lock acquirer, esp. when it's at the head, as you suggest - I will
> consider that approach as the next step (want to progress incrementally
> basically!).
>
> > > > Have you also looked at how s390 checks if the owning vcpu is
> > > > running and if so it spins, if not yields to the hypervisor.
> > > > Something like turning it into an adaptive lock. This could be
> > > > applicable as well.
> > >
> > > I don't think even s390 does adaptive spinlocks. Also afaik s390
> > > zVM does gang scheduling of vcpus, which reduces the severity of
> > > this problem very much - essentially lock acquirer/holder are run
> > > simultaneously on different cpus all the time. Gang scheduling is
> > > on my list of things to look at much later (although I have been
> > > warned that it's a scalability nightmare!).
> >
> > It effectively is pretty well an adaptive lock. The spinlock itself
> > doesn't sleep of course, but it yields to the hypervisor if the
> > owner has been preempted. This is pretty close to analogous with
> > Linux adaptive mutexes.
>
> Oops, you are right - sorry, should have checked more closely earlier.
>
> Given that we may not be able to always guarantee that locked critical
> sections will not be preempted (ex: when a real-time task takes over),
> we will need a combination of both approaches (i.e. request preemption
> defer on the lock hold path + yield on the lock acquire path if owner
> !scheduled). The advantage of the former approach is that it could
> reduce job turnaround times in most cases (as the lock is available
> when we want it, or we don't have to wait too long for it).

Both I think would be good. It might be interesting to talk with the
s390 guys and see if they can look at ticket locks and preempt defer
techniques too (considering they already do the other half of the
equation well).

> > s390 also has the diag9c instruction, which I suppose somehow boosts
> > the priority of a preempted contended lock holder. In spite of any
> > other possible optimizations in their hypervisor like gang
> > scheduling, diag9c apparently provides quite a large improvement in
> > some cases.
>
> Ok - thx for that pointer - will have a look at diag9c.
>
> > So I think these things are fairly important to look at.
>
> I agree ..
>
> - vatsa
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 05:17:30PM +0200, Andi Kleen wrote:
> On Thu, Jun 03, 2010 at 10:38:32PM +1000, Nick Piggin wrote:
> > And they aren't even using ticket spinlocks!!
>
> I suppose they simply don't have unfair memory. Makes things easier.

That would certainly be a part of it. I'm sure they have stronger
fairness and guarantees at the expense of some performance. We saw the
spinlock starvation first on 8-16 core Opterons I think, whereas Altix
had been over 1024 cores, and POWER7 is at 1024 threads now, apparently
without reported problems.

However, I think more is needed than simply fair memory at the cache
coherency level, considering that, for example, s390 implements its
spinlocks simply by retrying cas until it succeeds. You could perfectly
round-robin all cache requests for the lock word, but one core could
quite easily always find it is granted the cacheline when the lock is
already taken.

So I think actively enforcing fairness at the lock level would be
required. Something like: if it is detected that a core is not making
progress on a tight cas loop, then it will need to enter a queue of
cores where the head of the queue is always granted the cacheline first
after it has been dirtied. Interrupts will need to be ignored from this
logic. This still doesn't solve the problem of an owner unfairly
releasing and grabbing the lock again; they could have more detection to
handle that.

I don't know how far hardware goes. Maybe it is enough to statistically
avoid starvation if memory is pretty fair. But it does seem a lot easier
to enforce fairness in software.
Re: [Qemu-devel] [PATCH] Inter-VM shared memory PCI device
On Thu, Mar 11, 2010 at 03:10:47AM +0000, Jamie Lokier wrote:
> Paul Brook wrote:
> > > In a cross environment that becomes extremely hairy. For example,
> > > the x86 architecture effectively has an implicit write barrier
> > > before every store, and an implicit read barrier before every load.
> >
> > Btw, x86 doesn't have any implicit barriers due to ordinary loads.
> > Only stores and atomics have implicit barriers, afaik.
>
> As of March 2009[1] Intel guarantees that memory reads occur in order
> (they may only be reordered relative to writes). It appears AMD do not
> provide this guarantee, which could be an interesting problem for
> heterogeneous migration..
>
> (Summary: At least on AMD64, it does too, for normal accesses to
> naturally aligned addresses in write-back cacheable memory.)

Oh, that's interesting. Way back when, I guess we knew writes were in
order and it wasn't explicit that reads were, hence smp_rmb() using a
locked atomic.

> Here is a post by Nick Piggin from 2007 with links to Intel _and_ AMD
> documents asserting that reads to cacheable memory are in program
> order:
>
> http://lkml.org/lkml/2007/9/28/212
> Subject: [patch] x86: improved memory barrier implementation
>
> Links to documents:
>
> http://developer.intel.com/products/processor/manuals/318147.pdf
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
>
> The Intel link doesn't work any more, but the AMD one does. It might
> have been merged into their development manual now.
>
> Nick asserts both manufacturers are committed to in-order loads from
> cacheable memory for the x86 architecture.

At the time we did ask Intel and AMD engineers. We talked with Andy Glew
from Intel, I believe, but I can't recall the AMD contact. Linus was
involved in the discussions as well. We tried to do the right thing with
this.

> I have just read the AMD document, and it is in there (but not
> completely obviously), in section 7.2. The implicit load-load and
> store-store barriers are only guaranteed for normal cacheable accesses
> on naturally aligned boundaries to WB [write-back cacheable] memory.
> There are also implicit load-store barriers, but not store-load.
>
> Note that the document covers AMD64; it does not say anything about
> their (now old) 32-bit processors.

Hmm. Well, it couldn't hurt to ask again. We've never seen any problems
yet, so I'm rather sure we're in the clear.

> [*] The most recent docs I have handy. Up to and including Core-2 Duo.
>
> Are you sure the read ordering applies to 32-bit Intel and AMD CPUs
> too? Many years ago, before 64-bit x86 existed, I recall discussions
> on LKML where it was made clear that stores were performed in program
> order. If it were known at the time that loads were performed in
> program order on 32-bit x86s, I would have expected that to have been
> mentioned by someone.

The way it was explained to us by the Intel engineer is that they had
implemented only visibly in-order loads, but they wanted to keep their
options open in future, so they did not want to commit to in-order loads
as an ISA feature. So when the whitepaper was released, we got their
blessing to retroactively apply the rules to previous CPUs.
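To make the ordering question concrete, here is the classic
message-passing pattern that the in-order-loads guarantee matters for,
written as a sketch with explicit C11 fences. The fences make it correct
on any architecture; the point of the documents discussed above is that
on x86 both fences can reduce to compiler-only barriers, which is what
smp_wmb()/smp_rmb() compile to there.

```c
#include <assert.h>
#include <stdatomic.h>

/* Message passing: producer publishes data, then sets a flag; consumer
 * checks the flag, then reads the data. Correct only if the two stores
 * are not reordered with each other and the two loads are not reordered
 * with each other -- which x86 guarantees for cacheable memory. */

static int payload;
static atomic_int ready;

static void producer(void)
{
	payload = 42;
	/* order the data store before the flag store: smp_wmb() */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

static int consumer(void)
{
	if (atomic_load_explicit(&ready, memory_order_relaxed)) {
		/* order the flag load before the data load: smp_rmb() */
		atomic_thread_fence(memory_order_acquire);
		return payload;
	}
	return -1;	/* flag not set yet */
}
```

If loads could pass earlier loads, the consumer could see ready == 1 yet
read a stale payload; that is exactly the reordering the cited Intel and
AMD documents rule out for write-back cacheable memory.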
Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux v3
On Friday 17 April 2009 17:08:07 Jared Hulbert wrote:
> > As everyone knows, my favourite thing is to say nasty things about
> > any new feature that adds complexity to common code. I feel like
> > crying to hear about how many more instances of MS Office we can all
> > run, if only we apply this patch. And the poorly written HPC app just
> > sounds like scrapings from the bottom of the justification barrel.
> >
> > I'm sorry, maybe I'm way off with my understanding of how important
> > this is. There isn't too much help in the changelog. A discussion of
> > where the memory savings comes from, and how far things like sharing
> > of fs images, or ballooning go, and how much extra savings we get
> > from this... with people from other hypervisors involved as well.
> > Have I missed this kind of discussion?
>
> Nick, I don't know about other hypervisors, fs and ballooning, but I
> have tried this out. It works. It works on apps I don't consider
> poorly written. I'm very excited about this. I got 10% saving in a
> roughly off-the-shelf embedded system. No user-noticeable performance
> impact.

OK, well that's what I want to hear. Thanks, that means a lot to me.
Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux v3
On Wednesday 15 April 2009 08:09:03 Andrew Morton wrote:
> On Thu, 9 Apr 2009 06:58:37 +0300 Izik Eidus <iei...@redhat.com> wrote:
> > KSM is a linux driver that allows dynamically sharing identical
> > memory pages between one or more processes.
>
> Generally looks OK to me. But that doesn't mean much. We should rub
> bottles with words like "hugh" and "nick" on them to be sure.

I haven't looked too closely at it yet, sorry. Hugh has a great eye for
these details, though, hint hint :)

As everyone knows, my favourite thing is to say nasty things about any
new feature that adds complexity to common code. I feel like crying to
hear about how many more instances of MS Office we can all run, if only
we apply this patch. And the poorly written HPC app just sounds like
scrapings from the bottom of the justification barrel.

I'm sorry, maybe I'm way off with my understanding of how important this
is. There isn't too much help in the changelog. A discussion of where
the memory savings comes from, and how far things like sharing of fs
images, or ballooning go, and how much extra savings we get from this...
with people from other hypervisors involved as well. Have I missed this
kind of discussion?

Careful what you wish for, ay? :)
Re: __purge_vmap_area_lazy crash with CONFIG_PREEMPT_RCU=y
On Wednesday 31 December 2008 02:32:50 Avi Kivity wrote:
> Marcelo Tosatti wrote:
> > On Tue, Dec 30, 2008 at 02:53:36PM +1100, Nick Piggin wrote:
> >
> > RSP 88011f4c7be8 ---[ end trace 31811279a2e983e8 ]---
> > note: qemu-system-x86[4440] exited with preempt_count 2
> >
> > (gdb) l *(__purge_vmap_area_lazy + 0x12c)
> > 0x80289ca2 is in __purge_vmap_area_lazy (mm/vmalloc.c:516).
> > 511		if (nr || force_flush)
> > 512			flush_tlb_kernel_range(*start, *end);
> > 513
> > 514		if (nr) {
> > 515			spin_lock(&vmap_area_lock);
> > 516			list_for_each_entry(va, &valist, purge_list)
> > 517				__free_vmap_area(va);
> > 518			spin_unlock(&vmap_area_lock);
> > 519		}
> > 520		spin_unlock(&purge_lock);
> >
> > 0x80289c9a <__purge_vmap_area_lazy+292>:	mov 0x40(%rbx),%rax
> > 0x80289c9e <__purge_vmap_area_lazy+296>:	lea -0x40(%rax),%rbx
> > 0x80289ca2 <__purge_vmap_area_lazy+300>:	mov 0x40(%rbx),%rax
> >                                          	    ^^^
> > Note:
> > RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b2b
> >
> > Good old POISON_FREE.
>
> Right, it seems like it has been kfreed while it is still accessed via
> RCU.

But I just can't see how the vmap_area can be freed while there is a
concurrent process traversing the vmap_area_list... __free_vmap_area
removes the entry from the list first, then does a call_rcu to kfree it.

Hmm...
Re: __purge_vmap_area_lazy crash with CONFIG_PREEMPT_RCU=y
On Tuesday 30 December 2008 01:58:21 Marcelo Tosatti wrote:
> On Wed, Dec 24, 2008 at 04:28:44PM +0100, Andrea Arcangeli wrote:
> > On Wed, Dec 24, 2008 at 02:50:57PM +0200, Avi Kivity wrote:
> > > Marcelo Tosatti wrote:
> > > > The destructor for huge pages uses the backing inode for
> > > > adjusting hugetlbfs accounting. Hugepage mappings are destroyed
> > > > by exit_mmap, after mmu_notifier_release, so there are no
> > > > notifications through unmap_hugepage_range at this point. The
> > > > hugetlbfs inode can be freed with pages backed by it referenced
> > > > by the shadow. When the shadow releases its reference, the huge
> > > > page destructor will access a now freed inode. Implement the
> > > > release operation for kvm mmu notifiers to release page refs
> > > > before the hugetlbfs inode is gone.
> > >
> > > I see this isn't it. Andrea, comments?
> >
> > Yeah, the patch looks good. I talked a bit with Marcelo about this by
> > PM. The issue is that it's not as straightforward as it seems.
> > Basically, when I implemented the ->release handlers and had sptes
> > teardown running before the files were closed (instead of waiting for
> > the kvm anon inode release handler to fire), I was getting bugchecks
> > from debug options including preempt=y (certain debug checks only
> > become functional with preempt enabled, unfortunately), so eventually
> > I removed ->release, because for kvm ->release wasn't useful - no
> > guest mode can run any more by the time mmu notifier ->release is
> > invoked - and that avoided the issues with the bugchecks.
> >
> > We'll be using the mmu notifiers ->release because it's always called
> > just before the filehandles are destroyed; it's not really about the
> > guest mode or secondary mmu but just an ordering issue with hugetlbfs
> > internals. So in short, if no bugcheck triggers, this is fine (at
> > least until hugetlbfs provides a way to register some callback to
> > invoke at the start of the hugetlbfs ->release handler).
>
> The only bugcheck I see, which triggers on vanilla kvm upstream with
> CONFIG_PREEMPT_DEBUG=y and CONFIG_PREEMPT_RCU=y, is:
>
> general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC
> 4ttyS1: 1 input overrun(s)
> last sysfs file: /sys/class/net/tap0/address
> CPU 0
> Modules linked in: tun ipt_MASQUERADE iptable_nat nf_nat bridge stp
>  llc nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp
>  ipt_REJECT iptable_filter ip_tables x_tables dm_multipath kvm_intel
>  kvm scsi_wait_scan ata_piix libata dm_snapshot dm_zero dm_mirror
>  dm_region_hash dm_log dm_mod shpchp pci_hotplug mptsas mptscsih
>  mptbase scsi_transport_sas uhci_hcd ohci_hcd ehci_hcd
> Pid: 4768, comm: qemu-system-x86 Not tainted 2.6.28-00165-g4f27e3e-dirty #164
> RIP: 0010:[8028a5b6] [8028a5b6] __purge_vmap_area_lazy+0x12c/0x163
> RSP: 0018:88021e1f9a38 EFLAGS: 00010202
> RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b2b RCX: 0003
> RDX: 80a1dae0 RSI: 880028083980 RDI: 0001
> RBP: 88021e1f9a78 R08: 0286 R09: 80a1bf50
> R10: 880119c270f8 R11: 88021e1f99b8 R12: 88021e1f9a38
> R13: 88021e1f9a90 R14: 88021e1f9a98 R15: 813a
> FS: () GS:8080d900() knlGS:
> CS: 0010 DS: 002b ES: 002b CR0: 8005003b
> CR2: 008d9828 CR3: 00201000 CR4: 26e0
> DR0: DR1: DR2:
> DR3: DR6: 0ff0 DR7: 0400
> Process qemu-system-x86 (pid: 4768, threadinfo 88021e1f8000, task 880119c270f8)
> Stack:
>  88022bdfd840 880119da11b8 c20011c3 813a
>  0001 88022ec11c18 88022f061838 88021e1f9aa8
>  8028ab1d 88021e1f9aa8 c20021976000
> Call Trace:
>  [8028ab1d] free_unmap_vmap_area_noflush+0x69/0x70
>  [8028ab49] remove_vm_area+0x25/0x71
>  [8028ac54] __vunmap+0x3a/0xca
>  [8028ad35] vfree+0x29/0x2b
>  [a00f98a3] kvm_free_physmem_slot+0x25/0x7c [kvm]
>  [a00f9d75] kvm_free_physmem+0x27/0x36 [kvm]
>  [a00fccb4] kvm_arch_destroy_vm+0xa6/0xda [kvm]
>  [a00f9e11] kvm_put_kvm+0x8d/0xa7 [kvm]
>  [a00fa0e2] kvm_vcpu_release+0x13/0x17 [kvm]
>  [802a1c07] __fput+0xeb/0x1a3
>  [802a1cd4] fput+0x15/0x17
>  [8029f26c] filp_close+0x67/0x72
>  [802378a8] put_files_struct+0x74/0xc8
>  [80237943] exit_files+0x47/0x4f
>  [80238fe5] do_exit+0x1eb/0x7a7
>  [80587edf] ? _spin_unlock_irq+0x2b/0x51
>  [80239614] do_group_exit+0x73/0xa0
>  [80242b10] get_signal_to_deliver+0x30c/0x32c
>  [8020b4d5] ? sysret_signal+0x19/0x29
>  [8020a80f] do_notify_resume+0x8c/0x851
>  [8025b811] ? do_futex+0x90/0x92a
>  [80256bd7] ? trace_hardirqs_on_caller+0xf0/0x114
>  [80587f51] ? _spin_unlock_irqrestore+0x4c/0x68
>  [8026be5c] ? __rcu_read_unlock+0x92/0x9e
>  [80256bd7] ? trace_hardirqs_on_caller+0xf0/0x114
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Thursday 13 November 2008 13:31, Andrea Arcangeli wrote: On Thu, Nov 13, 2008 at 03:00:59AM +0100, Andrea Arcangeli wrote: CPU0 migrate.c CPU1 filemap.c --- -- find_get_page radix_tree_lookup_slot returns the oldpage page_count still = expected_count freeze_ref (oldpage-count = 0) radix_tree_replace (too late, other side already got the oldpage) unfreeze_ref (oldpage-count = 2) page_cache_get_speculative(old_page) set count to 3 and succeeds After reading more of this lockless radix tree code, I realized this below check is the one that was intended to restart find_get_page and prevent it to return the oldpage: if (unlikely(page != *pagep)) { But there's no barrier there, atomic_add_unless would need to provide an atomic smp_mb() _after_ atomic_add_unless executed. In the old days the atomic_* routines had no implied memory barriers, you had to use smp_mb__after_atomic_add_unless if you wanted to avoid the race. I don't see much in the ppc implementation of atomic_add_unless that would provide an implicit smb_mb after the page_cache_get_speculative returns, so I can't see why the pagep can't be by find_get_page read before the other cpu executes radix_tree_replace in the above timeline. All atomic functions that both return a value and modify their memory are defined to provide full barriers before and after the operation. powerpc should be OK __asm__ __volatile__ ( LWSYNC_ON_SMP 1: lwarx %0,0,%1 # atomic_add_unless\n\ cmpw0,%0,%3 \n\ beq-2f \n\ add %0,%2,%0 \n PPC405_ERR77(0,%2) stwcx. %0,0,%1 \n\ bne-1b \n ISYNC_ON_SMP subf%0,%2,%0 \n\ 2: : =r (t) : r (v-counter), r (a), r (u) : cc, memory); lwsync instruction prevents all reorderings except allows loads to pass stores. But because it is followed by a LL/SC sequence, the store part of that sequence is prevented from passing stores, so therefore it will fail if the load had executed earlier and the value has changed. 
isync prevents speculative execution, so the branch has to be resolved before any subsequent instructions start. The branch depends on the result of the LL/SC, so that has to be complete; that prevents loads from passing the LL/SC. For the SC to complete, I think by definition the store has to be visible, because it has to check that the value has not changed (so it would have to own the cacheline). I think that's how it works. I'm not an expert at powerpc's weird barriers, but I'm pretty sure they work. I guess you intended to put an smp_mb() in between page_cache_get_speculative and the *pagep check to make the code safe on ppc too, but there isn't, and I think it must be fixed; either that or I don't understand ppc assembly right. The other side has an smp_wmb implicit inside radix_tree_replace_slot, so it should already be OK to ensure we see the refcount going to 0 before we see pagep changed (the fact that the other side has a memory barrier further confirms this side needs it too). I think it is OK. It should have a comment there however to say it relies on a barrier in get_page_unless_zero (I thought I had one..) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
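The pattern being debated above (speculatively grab a reference, then re-check the radix tree slot) only works if the reference grab implies a full barrier. Below is a minimal userspace sketch of that pattern, with C11 seq_cst atomics standing in for the full-barrier semantics the kernel defines for value-returning atomic RMW operations; all names here are illustrative, not the kernel's.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model object; in the kernel this would be struct page. */
struct obj { _Atomic int refcount; };

/* Like get_page_unless_zero(): take a reference unless the object is
 * already being freed (count == 0).  The seq_cst CAS is a full barrier,
 * so the caller's subsequent re-read of the slot cannot be hoisted
 * above the refcount increment. */
static int get_unless_zero(struct obj *o)
{
    int c = atomic_load(&o->refcount);
    while (c != 0)
        if (atomic_compare_exchange_weak(&o->refcount, &c, c + 1))
            return 1;
    return 0;
}

/* Like find_get_page(): the "page != *pagep" re-check relies on the
 * barrier above.  Without it, the second load of the slot could be
 * reordered before the refcount increment and miss a concurrent
 * replacement of the object in the slot. */
static struct obj *lookup(struct obj *_Atomic *slot)
{
    struct obj *o = atomic_load(slot);
    if (!o || !get_unless_zero(o))
        return NULL;
    if (o != atomic_load(slot)) {       /* slot was replaced under us */
        atomic_fetch_sub(&o->refcount, 1);
        return NULL;
    }
    return o;
}
```

The re-check-and-drop path models the restart that Andrea describes: if the slot changed between the speculative get and the re-read, the reference is released and the lookup fails, so the caller retries against the new page.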
Re: [PATCH 0/4] Fix vmalloc regression
On Fri, Nov 07, 2008 at 08:35:50PM -0200, Glauber Costa wrote: Nick, This is the whole set of patches I was talking about. Patch 3 is the one that in fact fixes the problem Patches 1 and 2 are debugging aids I made use of, and could be possibly useful to others Patch 4 removes guard pages entirely for non-debug kernels, as we have already previously discussed. Hope it's all fine. OK, these all look good, but I may only push 3/4 for Linus in this round, along with some of the changes from my patch that you tested as well. With the DEBUG_PAGEALLOC case, I have been thinking that we perhaps should turn off the lazy unmapping optimisation as well, so it catches use after free similarly to the page allocator... but probably it is a good idea at least to avoid the double-guard page for 2.6.28? Anyway thanks for these, I'll send them up to Andrew/Linus and cc you. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Fix vmalloc regression
On Saturday 08 November 2008 13:13, Glauber Costa wrote: On Sat, Nov 08, 2008 at 01:58:32AM +0100, Nick Piggin wrote: On Fri, Nov 07, 2008 at 08:35:50PM -0200, Glauber Costa wrote: Nick, This is the whole set of patches I was talking about. Patch 3 is the one that in fact fixes the problem Patches 1 and 2 are debugging aids I made use of, and could be possibly useful to others Patch 4 removes guard pages entirely for non-debug kernels, as we have already previously discussed. Hope it's all fine. OK, these all look good, but I may only push 3/4 for Linus in this round, along with some of the changes from my patch that you tested as well. Makes total sense. OK, sent. Thanks again. With the DEBUG_PAGEALLOC case, I have been thinking that we perhaps should turn off the lazy unmapping optimisation as well, so it catches use after free similarly to the page allocator... but probably it is a good idea at least to avoid the double-guard page for 2.6.28? Makes sense. Maybe poisoning after free would also be useful? It's a problem because we're only dealing with virtual address, rather than real memory. So we don't really have anything to poison (we don't know what the caller will do with the memory). I guess it would be possible to poison in the page allocator or in vfree, but probably not worthwhile (after the immediate-unmap debug option). -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] restart search at beginning of vmalloc address
Excellent, thank you! Good catch. If you agree my previous patch was also good in combination with this one, then I'll send them all to be merged. Thanks, Nick

On Thu, Nov 06, 2008 at 06:58:26PM -0200, Glauber Costa wrote: Currently vmalloc restarts the search for a free area in case we can't find one. The reason is that there are areas which are lazily freed, and could possibly be freed now. However, the current implementation restarts the search from the last failing address, which is pretty much by definition at the end of the address space. So we fail. The proposal of this patch is to restart the search from the beginning of the requested vstart address. This fixes the regression in running KVM virtual machines for me, described in http://lkml.org/lkml/2008/10/28/349, caused by commit db64fe02258f1507e13fe5212a989922323685ce.

Signed-off-by: Glauber Costa [EMAIL PROTECTED]
CC: Nick Piggin [EMAIL PROTECTED]
---
 mm/vmalloc.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 7db493d..6fe2003 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -378,6 +378,7 @@ found:
 	if (!purged) {
 		purge_vmap_area_lazy();
 		purged = 1;
+		addr = ALIGN(vstart, align);
 		goto retry;
 	}
 	if (printk_ratelimit())
--
1.5.6.5
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
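The one-line fix above can be illustrated with a toy model of the allocator's retry loop. This is entirely hypothetical userspace code, not the kernel's: a bitmap of single pages stands in for the vmalloc address space, and `lazy` marks areas freed but not yet purged. The point is the cursor: the buggy retry rescans from the failing address (the end of the space), while the fixed retry rescans from vstart, as in `addr = ALIGN(vstart, align)`.

```c
#define NPAGES 16
static int used[NPAGES];                /* page currently allocated */
static int lazy[NPAGES];                /* freed, but purge is pending */

static long scan(long start)            /* first free page >= start */
{
    for (long i = start; i < NPAGES; i++)
        if (!used[i])
            return i;
    return -1;
}

static void purge(void)                 /* flush the lazily freed areas */
{
    for (long i = 0; i < NPAGES; i++)
        if (lazy[i]) {
            used[i] = 0;
            lazy[i] = 0;
        }
}

/* restart=1 models the patched code, restart=0 the buggy retry. */
static long alloc_one(long vstart, int restart)
{
    long addr = vstart, found;
    int purged = 0;
retry:
    found = scan(addr);
    if (found >= 0) {
        used[found] = 1;
        return found;
    }
    addr = NPAGES;                      /* a failed scan leaves the cursor at the end */
    if (!purged) {
        purge();
        purged = 1;
        if (restart)
            addr = vstart;              /* the one-line fix */
        goto retry;
    }
    return -1;                          /* buggy path never sees the purged pages */
}
```

With a full space and one lazily freed page, the buggy retry still fails even though the purge just freed usable space behind the cursor; resetting the cursor makes the retry succeed.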
Re: [PATCH] regression: vmalloc easily fail.
On Thu, Oct 30, 2008 at 09:28:54AM -0200, Glauber Costa wrote: On Thu, Oct 30, 2008 at 05:49:41AM +0100, Nick Piggin wrote: On Wed, Oct 29, 2008 at 08:07:37PM -0200, Glauber Costa wrote: On Wed, Oct 29, 2008 at 11:43:33AM +0100, Nick Piggin wrote: On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote: Nick Piggin wrote: Hmm, spanning 30MB of memory... how much vmalloc space do you have? From the original report: VmallocTotal: 122880 kB VmallocUsed: 15184 kB VmallocChunk: 83764 kB So it seems there's quite a bit of free space. Chunk is the largest free contiguous region, right? Yes. If so, it seems the problem is unrelated to guard pages; instead the search isn't finding a 1-page area (with two guard pages) for some reason, even though lots of free space is available. Hmm. The free area search could be buggy... Do you want me to grab any specific info on it? Or should I just hack myself randomly into it? I'll probably have some time for that tomorrow. I took a bit of a look. Does this help you at all? I still think we should get rid of the guard pages in non-debug kernels completely, but hopefully this will fix your problems? Unfortunately, it doesn't. The problem still happens in a kernel with this patch. That's weird. Any chance you could dump a list of all the vmap area start and end addresses and their flags before returning failure? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] regression: vmalloc easily fail.
On Wed, Oct 29, 2008 at 07:48:56AM -0200, Glauber Costa wrote:

0xf7bfe000-0xf7c00000    8192 hpet_enable+0x2d/0x279 phys=fed00000 ioremap
0xf7c02000-0xf7c04000    8192 acpi_os_map_memory+0x11/0x1a phys=7fed1000 ioremap
0xf7c06000-0xf7c08000    8192 acpi_os_map_memory+0x11/0x1a phys=7fef2000 ioremap
0xf7c0a000-0xf7c0c000    8192 acpi_os_map_memory+0x11/0x1a phys=7fef2000 ioremap
0xf7c0d000-0xf7c0f000    8192 __kvm_set_memory_region+0x164/0x355 [kvm] pages=1 vmalloc
0xf7c10000-0xf7c1f000   61440 acpi_os_map_memory+0x11/0x1a phys=7fed1000 ioremap
0xf7c20000-0xf7c22000    8192 acpi_os_map_memory+0x11/0x1a phys=7fef2000 ioremap
0xf7c24000-0xf7c27000   12288 acpi_os_map_memory+0x11/0x1a phys=7fef2000 ioremap
0xf7c28000-0xf7c2a000    8192 acpi_os_map_memory+0x11/0x1a phys=7fede000 ioremap
0xf7c2c000-0xf7c2e000    8192 acpi_os_map_memory+0x11/0x1a phys=7fef4000 ioremap
0xf7c30000-0xf7c32000    8192 acpi_os_map_memory+0x11/0x1a phys=7fede000 ioremap
0xf7c34000-0xf7c36000    8192 acpi_os_map_memory+0x11/0x1a phys=7fef4000 ioremap
0xf7c38000-0xf7c3a000    8192 acpi_os_map_memory+0x11/0x1a phys=7fed1000 ioremap
0xf7c3c000-0xf7c3e000    8192 acpi_os_map_memory+0x11/0x1a phys=7fede000 ioremap
0xf7c40000-0xf7c42000    8192 acpi_os_map_memory+0x11/0x1a phys=7fede000 ioremap
0xf7c44000-0xf7c46000    8192 acpi_os_map_memory+0x11/0x1a phys=7fede000 ioremap
0xf7c48000-0xf7c4a000    8192 acpi_os_map_memory+0x11/0x1a phys=7fede000 ioremap
0xf7c4b000-0xf7c57000   49152 zisofs_init+0xd/0x1c pages=11 vmalloc
0xf7c58000-0xf7c5a000    8192 yenta_probe+0x123/0x63f phys=e4300000 ioremap
0xf7c5b000-0xf7c5e000   12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
0xf7c61000-0xf7c65000   16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
0xf7c67000-0xf7c71000   40960 vmalloc_exec+0x12/0x14 pages=9 vmalloc
0xf7c72000-0xf7c74000    8192 usb_hcd_pci_probe+0x132/0x277 phys=ee404000 ioremap
0xf7c75000-0xf7c7c000   28672 vmalloc_exec+0x12/0x14 pages=6 vmalloc
0xf7c7d000-0xf7c86000   36864 vmalloc_exec+0x12/0x14 pages=8 vmalloc
0xf7c87000-0xf7c89000    8192 __kvm_set_memory_region+0x1df/0x355 [kvm] pages=1 vmalloc
0xf7c8a000-0xf7c91000   28672 vmalloc_exec+0x12/0x14 pages=6 vmalloc
0xf7c92000-0xf7c97000   20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
0xf7c9b000-0xf7c9f000   16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
0xf7ca0000-0xf7ca5000   20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
0xf7ca6000-0xf7cab000   20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
0xf7cac000-0xf7caf000   12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
0xf7cb0000-0xf7cb2000    8192 iTCO_wdt_probe+0xbe/0x281 [iTCO_wdt] phys=fed1f000 ioremap
0xf7cb3000-0xf7cbf000   49152 vmalloc_exec+0x12/0x14 pages=11 vmalloc
0xf7cc0000-0xf7cce000   57344 vmalloc_exec+0x12/0x14 pages=13 vmalloc
0xf7ccf000-0xf7cd2000   12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
0xf7cd3000-0xf7cd5000    8192 __kvm_set_memory_region+0x164/0x355 [kvm] pages=1 vmalloc
0xf7cd6000-0xf7cdc000   24576 vmalloc_exec+0x12/0x14 pages=5 vmalloc
0xf7cdd000-0xf7ce0000   12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
0xf7ce1000-0xf7ce4000   12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
0xf7ce6000-0xf7d02000  114688 vmalloc_exec+0x12/0x14 pages=27 vmalloc
0xf7d03000-0xf7d08000   20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
0xf7d09000-0xf7d0b000    8192 __kvm_set_memory_region+0x1df/0x355 [kvm] pages=1 vmalloc
0xf7d0c000-0xf7d12000   24576 vmalloc_exec+0x12/0x14 pages=5 vmalloc
0xf7d13000-0xf7d24000   69632 snd_malloc_sgbuf_pages+0x190/0x1ca [snd_page_alloc] vmap
0xf7d25000-0xf7d2a000   20480 drm_ht_create+0x69/0xa5 [drm] pages=4 vmalloc
0xf7d2d000-0xf7d2f000    8192 dm_vcalloc+0x24/0x4c [dm_mod] pages=1 vmalloc
0xf7d30000-0xf7d50000  131072 vmalloc_exec+0x12/0x14 pages=31 vmalloc
0xf7d51000-0xf7d54000   12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
0xf7d56000-0xf7d5a000   16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
0xf7d5b000-0xf7d5e000   12288 vmalloc_user+0x13/0x38 pages=2 vmalloc user
0xf7d5f000-0xf7d67000   32768 vmalloc_exec+0x12/0x14 pages=7 vmalloc
0xf7d68000-0xf7d79000   69632 snd_malloc_sgbuf_pages+0x190/0x1ca [snd_page_alloc] vmap
0xf7d7a000-0xf7d86000   49152 vmalloc_exec+0x12/0x14 pages=11 vmalloc
0xf7d87000-0xf7d8e000   28672 vmalloc_exec+0x12/0x14 pages=6 vmalloc
0xf7d8f000-0xf7d91000    8192 __kvm_set_memory_region+0x164/0x355 [kvm] pages=1 vmalloc
0xf7d92000-0xf7d94000    8192 dm_vcalloc+0x24/0x4c [dm_mod] pages=1 vmalloc
0xf7d96000-0xf7dba000  147456 vmalloc_exec+0x12/0x14 pages=35 vmalloc
0xf7dbb000-0xf7dc0000   20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
0xf7dc1000-0xf7dc4000   12288 vmalloc_32+0x12/0x14 pages=2 vmalloc
0xf7dc7000-0xf7dc9000    8192 dm_vcalloc+0x24/0x4c [dm_mod] pages=1 vmalloc
0xf7dcb000-0xf7dd4000   36864 vmalloc_exec+0x12/0x14 pages=8 vmalloc
0xf7dd5000-0xf7dd7000    8192 __kvm_set_memory_region+0x1df/0x355 [kvm] pages=1 vmalloc
0xf7dd8000-0xf7dda000    8192 pci_iomap+0x72/0x7b
Re: [PATCH] regression: vmalloc easily fail.
On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote: Nick Piggin wrote: Hmm, spanning 30MB of memory... how much vmalloc space do you have? From the original report: VmallocTotal: 122880 kB VmallocUsed: 15184 kB VmallocChunk: 83764 kB So it seems there's quite a bit of free space. Chunk is the largest free contiguous region, right? Yes. If so, it seems the problem is unrelated to guard pages; instead the search isn't finding a 1-page area (with two guard pages) for some reason, even though lots of free space is available. Hmm. The free area search could be buggy... -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] regression: vmalloc easily fail.
On Wed, Oct 29, 2008 at 08:07:37PM -0200, Glauber Costa wrote: On Wed, Oct 29, 2008 at 11:43:33AM +0100, Nick Piggin wrote: On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote: Nick Piggin wrote: Hmm, spanning 30MB of memory... how much vmalloc space do you have? From the original report: VmallocTotal: 122880 kB VmallocUsed: 15184 kB VmallocChunk: 83764 kB So it seems there's quite a bit of free space. Chunk is the largest free contiguous region, right? Yes. If so, it seems the problem is unrelated to guard pages; instead the search isn't finding a 1-page area (with two guard pages) for some reason, even though lots of free space is available. Hmm. The free area search could be buggy... Do you want me to grab any specific info on it? Or should I just hack myself randomly into it? I'll probably have some time for that tomorrow. I took a bit of a look. Does this help you at all? I still think we should get rid of the guard pages in non-debug kernels completely, but hopefully this will fix your problems?

--
- Fix off-by-one bug in the KVA allocator that can leave gaps.
- An initial vmalloc failure should start off a synchronous flush of lazy areas, in case someone is already in the process of flushing them.
- The purge lock can be a mutex so we can sleep while that's going on.
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/mm/vmalloc.c
===================================================================
--- linux-2.6.orig/mm/vmalloc.c
+++ linux-2.6/mm/vmalloc.c
@@ -14,6 +14,7 @@
 #include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
+#include <linux/mutex.h>
 #include <linux/interrupt.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
@@ -362,7 +363,7 @@ retry:
 		goto found;
 	}
 
-	while (addr + size >= first->va_start && addr + size <= vend) {
+	while (addr + size > first->va_start && addr + size <= vend) {
 		addr = ALIGN(first->va_end + PAGE_SIZE, align);
 
 		n = rb_next(&first->rb_node);
@@ -472,7 +473,7 @@ static atomic_t vmap_lazy_nr = ATOMIC_IN
 static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
 					int sync, int force_flush)
 {
-	static DEFINE_SPINLOCK(purge_lock);
+	static DEFINE_MUTEX(purge_lock);
 	LIST_HEAD(valist);
 	struct vmap_area *va;
 	int nr = 0;
@@ -483,10 +484,10 @@ static void __purge_vmap_area_lazy(unsig
 	 * the case that isn't actually used at the moment anyway.
 	 */
 	if (!sync && !force_flush) {
-		if (!spin_trylock(&purge_lock))
+		if (!mutex_trylock(&purge_lock))
 			return;
 	} else
-		spin_lock(&purge_lock);
+		mutex_lock(&purge_lock);
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(va, &vmap_area_list, list) {
@@ -518,7 +519,18 @@ static void __purge_vmap_area_lazy(unsig
 		__free_vmap_area(va);
 		spin_unlock(&vmap_area_lock);
 	}
-	spin_unlock(&purge_lock);
+	mutex_unlock(&purge_lock);
+}
+
+/*
+ * Kick off a purge of the outstanding lazy areas. Don't bother if somebody
+ * is already purging.
+ */
+static void try_purge_vmap_area_lazy(void)
+{
+	unsigned long start = ULONG_MAX, end = 0;
+
+	__purge_vmap_area_lazy(&start, &end, 0, 0);
 }
 
 /*
@@ -528,7 +540,7 @@ static void purge_vmap_area_lazy(void)
 {
 	unsigned long start = ULONG_MAX, end = 0;
 
-	__purge_vmap_area_lazy(&start, &end, 0, 0);
+	__purge_vmap_area_lazy(&start, &end, 1, 0);
 }
 
 /*
@@ -539,7 +551,7 @@ static void free_unmap_vmap_area(struct
 	va->flags |= VM_LAZY_FREE;
 	atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr);
 	if (unlikely(atomic_read(&vmap_lazy_nr) > lazy_max_pages()))
-		purge_vmap_area_lazy();
+		try_purge_vmap_area_lazy();
 }
 
 static struct vmap_area *find_vmap_area(unsigned long addr)
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
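The try-vs-sync purge split in the patch above can be sketched in userspace. This is a hypothetical model with invented names, not the kernel implementation: a C11 atomic flag stands in for purge_lock and a counter for the purge work. Try mode gives up if somebody else is already purging, as try_purge_vmap_area_lazy() does; sync mode waits for the lock, as purge_vmap_area_lazy() now does.

```c
#include <stdatomic.h>

static atomic_flag purge_lock = ATOMIC_FLAG_INIT;
static int purge_count;                 /* stand-in for the purge work done */

/* Returns 1 if this caller performed a purge, 0 if it backed off. */
static int do_purge(int sync)
{
    if (!sync) {
        /* trylock: somebody is already purging, don't bother */
        if (atomic_flag_test_and_set(&purge_lock))
            return 0;
    } else {
        /* lock: wait for the current purger (the mutex lets this sleep) */
        while (atomic_flag_test_and_set(&purge_lock))
            ;
    }
    purge_count++;                      /* the actual purge work */
    atomic_flag_clear(&purge_lock);
    return 1;
}
```

The design point is the same as in the patch: opportunistic callers on the allocation path must never block behind a long purge, while a caller that has already failed an allocation needs the synchronous variant so it observes a completed flush before retrying.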
Re: [PATCH] regression: vmalloc easily fail.
On Tue, Oct 28, 2008 at 08:55:13PM -0200, Glauber Costa wrote: Commit db64fe02258f1507e13fe5212a989922323685ce broke KVM (the symptom) for me. The cause is that vmalloc allocations fail, despite the fact that /proc/meminfo shows plenty of vmalloc space available. After some investigation, it seems to me that the current way to compute the next addr in the rb-tree traversal leaves a spare page between each allocation. After a few allocations, regardless of their size, we run out of vmalloc space.

Right... that was to add a guard page like the old vmalloc allocator. vmallocs still add their extra page too, so most of them will have a 2-page guard area, but I didn't think this would hurt significantly. I'm not against the patch, but I wonder exactly what is filling it up and how? (Can you look at the vmalloc proc function to find out?)

Signed-off-by: Glauber Costa [EMAIL PROTECTED]
Cc: Jeremy Fitzhardinge [EMAIL PROTECTED]
Cc: Krzysztof Helt [EMAIL PROTECTED]
---
 mm/vmalloc.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0365369..a33b0d1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -363,7 +363,7 @@ retry:
 	}
 
 	while (addr + size >= first->va_start && addr + size <= vend) {
-		addr = ALIGN(first->va_end + PAGE_SIZE, align);
+		addr = ALIGN(first->va_end, align);
 		n = rb_next(&first->rb_node);
 		if (n)
--
1.5.6.5
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
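A quick back-of-envelope model of why the spare page eats the space: each vmalloc already carries its own guard page, and the KVA allocator's extra page makes every 1-page allocation consume 3 pages of address space. The numbers below are illustrative (122880 kB of vmalloc space is 30720 4k pages), not measured from the report.

```c
/* How many allocations fit in a space of space_pages pages, if each
 * allocation occupies alloc_pages pages (including vmalloc's own guard
 * page) plus gap_pages of extra allocator guard.  A deliberately crude
 * model: it ignores fragmentation, which only makes things worse. */
static long fit(long space_pages, long alloc_pages, long gap_pages)
{
    return space_pages / (alloc_pages + gap_pages);
}
```

For a stream of 1-page vmallocs (2 pages each with vmalloc's guard), 30720 pages hold 15360 allocations without the extra page but only 10240 with it, i.e. a third of the address space goes to guard gaps, and proportionally the overhead is worst for exactly the small allocations KVM was making.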