Re: [PATCH v4 1/6] KVM: MMU: introduce gfn_to_pfn_atomic() function

2010-07-02 Thread Nick Piggin
On Fri, Jul 02, 2010 at 01:47:56PM -0300, Marcelo Tosatti wrote:
 On Thu, Jul 01, 2010 at 09:53:04PM +0800, Xiao Guangrong wrote:
  Introduce gfn_to_pfn_atomic(); it is the fast path and can be used in
  atomic context. A later patch will use it.
  
  Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
  ---
   arch/x86/mm/gup.c|2 ++
   include/linux/kvm_host.h |1 +
   virt/kvm/kvm_main.c  |   32 +---
   3 files changed, 28 insertions(+), 7 deletions(-)
  
  diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
  index 738e659..0c9034b 100644
  --- a/arch/x86/mm/gup.c
  +++ b/arch/x86/mm/gup.c
  @@ -6,6 +6,7 @@
*/
   #include <linux/sched.h>
   #include <linux/mm.h>
  +#include <linux/module.h>
   #include <linux/vmstat.h>
   #include <linux/highmem.h>
   
  @@ -274,6 +275,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
   
  return nr;
   }
  +EXPORT_SYMBOL_GPL(__get_user_pages_fast);
   
   /**
* get_user_pages_fast() - pin user pages in memory
 
 This should be a separate patch (overlooked that before). Ingo, Nick,
 can this go in through the kvm tree?

I'm happy for it to be exported. Yes, put it in a separate patch and
add an acked-by: me on it please. Beyond that I don't care what tree
it goes through :) although Ingo might have an opinion.
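
For reference, a minimal sketch (not the actual KVM patch) of how an
atomic-context gfn->pfn lookup can sit on top of the exported
__get_user_pages_fast(), which never takes mmap_sem and never sleeps.
The helper name and the -1UL failure value are made up for illustration:

/*
 * Sketch only: resolve a single user page without sleeping.
 * Kernel context; relies on <linux/mm.h>.
 */
static unsigned long hva_to_pfn_atomic_sketch(unsigned long addr)
{
	struct page *page;

	/* Fast GUP: returns 1 if the one writable page was resolved. */
	if (__get_user_pages_fast(addr, 1, 1, &page) == 1)
		return page_to_pfn(page);

	return -1UL;	/* couldn't do it atomically; caller takes the slow path */
}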




Re: Slow vmalloc in 2.6.35-rc3

2010-06-27 Thread Nick Piggin
On Sun, Jun 27, 2010 at 12:17:42PM +0300, Avi Kivity wrote:
 On 06/24/2010 06:14 PM, Nick Piggin wrote:
 On Thu, Jun 24, 2010 at 12:19:32PM +0300, Avi Kivity wrote:
 I see really slow vmalloc performance on 2.6.35-rc3:
 Can you try this patch?
 http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-vmap-area-cache.patch
 
 The patch completely eliminates the problem.

Thanks for testing. Andrew, the patch changelog can be updated.  Avi and
Steven didn't give me numbers, but it solves both their performance
regressions, so I would say it is ready for merging.

Thanks,
Nick




Re: Slow vmalloc in 2.6.35-rc3

2010-06-24 Thread Nick Piggin
On Thu, Jun 24, 2010 at 12:19:32PM +0300, Avi Kivity wrote:
 I see really slow vmalloc performance on 2.6.35-rc3:

Can you try this patch?
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-vmap-area-cache.patch

 
 # tracer: function_graph
 #
 # CPU  DURATION  FUNCTION CALLS
 # | |   | |   |   |   |
  3)   3.581 us|  vfree();
  3)   |  msr_io() {
  3) ! 523.880 us  |vmalloc();
  3)   1.702 us|vfree();
  3) ! 529.960 us  |  }
  3)   |  msr_io() {
  3) ! 564.200 us  |vmalloc();
  3)   1.429 us|vfree();
  3) ! 568.080 us  |  }
  3)   |  msr_io() {
  3) ! 578.560 us  |vmalloc();
  3)   1.697 us|vfree();
  3) ! 584.791 us  |  }
  3)   |  msr_io() {
  3) ! 559.657 us  |vmalloc();
  3)   1.566 us|vfree();
  3) ! 575.948 us  |  }
  3)   |  msr_io() {
  3) ! 536.558 us  |vmalloc();
  3)   1.553 us|vfree();
  3) ! 542.243 us  |  }
  3)   |  msr_io() {
  3) ! 560.086 us  |vmalloc();
  3)   1.448 us|vfree();
  3) ! 569.387 us  |  }
 
 msr_io() is from arch/x86/kvm/x86.c, allocating at most 4K (yes it
 should use kmalloc()).  The memory is immediately vfree()ed.  There
 are 96 entries in /proc/vmallocinfo, and the whole thing is single
 threaded so there should be no contention.

Yep, it should use kmalloc.
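
As a hedged illustration of that (not the actual msr_io() code): a buffer
that is bounded to one page and freed straight away is better served by
kmalloc, which avoids the vmap area search on allocation and the lazy TLB
flushing on free entirely.

/* Sketch only: small, short-lived buffers should come from kmalloc. */
static void *msr_buffer_alloc_sketch(size_t size)
{
	if (size > PAGE_SIZE)		/* msr_io() is bounded to well under a page */
		return NULL;
	return kmalloc(size, GFP_KERNEL);
}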

 
 Here's the perf report:
 
 63.97% qemu  [kernel]
 [k] rb_next
|
--- rb_next
   |
   |--70.75%-- alloc_vmap_area
   |  __get_vm_area_node
   |  __vmalloc_node
   |  vmalloc
   |  |
   |  |--99.15%-- msr_io
   |  |  kvm_arch_vcpu_ioctl
   |  |  kvm_vcpu_ioctl
   |  |  vfs_ioctl
   |  |  do_vfs_ioctl
   |  |  sys_ioctl
   |  |  system_call
   |  |  __GI_ioctl
   |  |  |
   |  |   --100.00%--
 0x1dfc4a8878e71362
   |  |
   |   --0.85%-- __kvm_set_memory_region
   | kvm_set_memory_region
   |
 kvm_vm_ioctl_set_memory_region
   | kvm_vm_ioctl
   | vfs_ioctl
   | do_vfs_ioctl
   | sys_ioctl
   | system_call
   | __GI_ioctl
   |
--29.25%-- __get_vm_area_node
  __vmalloc_node
  vmalloc
  |
  |--98.89%-- msr_io
  |  kvm_arch_vcpu_ioctl
  |  kvm_vcpu_ioctl
  |  vfs_ioctl
  |  do_vfs_ioctl
  |  sys_ioctl
  |  system_call
  |  __GI_ioctl
  |  |
  |   --100.00%--
 0x1dfc4a8878e71362
 
 
 It seems completely wrong - iterating 8 levels of a binary tree
 shouldn't take half a millisecond.

It's not iterating down the tree, it's iterating through the
nodes to find a free area. It slows down because lazy vunmap means
that quite a lot of little areas build up right at our search
start address. The vmap cache should hopefully fix it up.
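
For anyone who hasn't read the patch: the rough idea (a simplified sketch
of the caching, not mm-vmap-area-cache.patch itself) is to remember where
the previous search ended and start the next one there, instead of
rescanning the pile of lazily-freed holes at the bottom of the range
every time.

/* Simplified sketch of the search-hint idea, not the real patch. */
static unsigned long free_area_hint_sketch;	/* reset when an area below it is freed */

static unsigned long pick_search_start(unsigned long vstart, unsigned long align)
{
	unsigned long addr = vstart;

	/* Skip the known-crowded low range left behind by lazy vunmap. */
	if (free_area_hint_sketch > vstart)
		addr = free_area_hint_sketch;

	/* The caller scans upward from here and, on success, advances
	 * free_area_hint_sketch to the end of the new area. */
	return ALIGN(addr, align);
}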

Thanks,
Nick



Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Nick Piggin
On Wed, Jun 16, 2010 at 10:39:41AM +0200, Ingo Molnar wrote:
 
 (Cc:-ed various performance/optimization folks)
 
 * Avi Kivity a...@redhat.com wrote:
 
  On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
  On 06/16/2010 12:24 AM, Avi Kivity wrote:
  Ingo, Peter, any feedback on this?
   Conceptually, this makes sense to me.  However, I have a concern about what
   happens when a task is scheduled on another CPU while its FPU state is
   still in registers in the original CPU.  That would seem to require
   expensive IPIs to spill the state in order for the rescheduling to
   proceed, and this could really damage performance.
  
  Right, this optimization isn't free.
  
  I think the tradeoff is favourable since task migrations are much
  less frequent than context switches within the same cpu; can the
  scheduler experts comment?
 
 This cannot be stated categorically without precise measurements of 
 known-good, known-bad, average FPU usage and average CPU usage scenarios. All 
 these workloads have different characteristics.
 
 I can imagine bad effects across all sorts of workloads: tcpbench, AIM7, 
 various lmbench components, X benchmarks, tiobench - you name it. Combined 
 with the fact that most micro-benchmarks won't be using the FPU, while in the
 long run most processes will be using the FPU due to SIMD instructions. So
 even a positive result might be skewed in practice. It has to be measured
 carefully IMO - and I haven't seen a _single_ performance measurement in the
 submission mail. This is really essential.

It can be nice to code an absolute worst-case microbenchmark too.

Task migration can actually be very important to the point of being
almost a fastpath in some workloads where threads are oversubscribed to
CPUs and blocking on some contended resource (IO or mutex or whatever).
I suspect the main issue in that case is the actual context switching
and contention, but it would be nice to see just how much slower it
could get.
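
Something like the following would do as an absolute worst case (a
userspace sketch, build with -lpthread; the thread count and loop sizes
are arbitrary): far more FPU-using threads than CPUs, all serialised on
one mutex, so they block, wake and migrate constantly while carrying
live FPU state.

/* Worst-case sketch: oversubscribed FPU users bouncing off one mutex. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 64			/* assumed to be >> number of CPUs */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static volatile double sink;

static void *worker(void *arg)
{
	double x = 1.0;
	int i, j;

	(void)arg;
	for (i = 0; i < 10000; i++) {
		pthread_mutex_lock(&lock);
		for (j = 0; j < 1000; j++)	/* touch the FPU inside the critical section */
			x = x * 1.000001 + 0.5;
		sink = x;
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	printf("%f\n", sink);
	return 0;
}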



Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-03 Thread Nick Piggin
On Thu, Jun 03, 2010 at 10:52:51AM +0200, Andi Kleen wrote:
 On Thu, Jun 03, 2010 at 09:50:51AM +0530, Srivatsa Vaddagiri wrote:
  On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
   
   There are two separate problems: the more general problem is that
   the hypervisor can put a vcpu to sleep while holding a lock, causing
   other vcpus to spin until the end of their time slice.  This can
   only be addressed with hypervisor help.
  
  Fyi - I have an early patch ready to address this issue. Basically I am using
  host-kernel memory (mmap'ed into guest as io-memory via ivshmem driver) to
  hint the host whenever the guest is in a spin-locked section, which is read
  by the host scheduler to defer preemption.
 
 Looks like a nice simple way to handle this for the kernel.
 
 However I suspect user space will hit the same issue sooner
 or later. I assume your way is not easily extensible to futexes?

Well userspace has always had the problem, hypervisor or not. So
sleeping locks obviously help a lot with that.

But we do hit the problem at times. The MySQL sysbench scalability
problem was a fine example

http://ozlabs.org/~anton/linux/sysbench/

Performance would tank when threads oversubscribe CPUs because lock
holders would start getting preempted.

This was due to nasty locking in MySQL as well, mind you.

There are some ways to improve it. glibc I believe has an option to
increase thread priority when taking a mutex, which is similar to
what we have here.
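
For the record, the closest standard interface to that kind of boost is
the POSIX priority-ceiling protocol: the holder runs at the ceiling
priority for as long as it holds the mutex. A minimal sketch (this is the
portable pthread API, not a claim about what glibc does for ordinary
mutexes, and it needs a realtime scheduling class to be useful):

#include <pthread.h>

/* Sketch: a mutex whose holder is raised to ceiling_prio while inside
 * the critical section, making preemption of the holder less likely. */
static int init_ceiling_mutex(pthread_mutex_t *m, int ceiling_prio)
{
	pthread_mutexattr_t attr;
	int err;

	pthread_mutexattr_init(&attr);
	err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
	if (!err)
		err = pthread_mutexattr_setprioceiling(&attr, ceiling_prio);
	if (!err)
		err = pthread_mutex_init(m, &attr);
	pthread_mutexattr_destroy(&attr);
	return err;
}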

But it's a fairly broad problem for userspace. The resource may not
be just a lock but it could be IO as well.

 


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-03 Thread Nick Piggin
On Thu, Jun 03, 2010 at 09:50:51AM +0530, Srivatsa Vaddagiri wrote:
 On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
  
  There are two separate problems: the more general problem is that
  the hypervisor can put a vcpu to sleep while holding a lock, causing
  other vcpus to spin until the end of their time slice.  This can
  only be addressed with hypervisor help.
 
 Fyi - I have an early patch ready to address this issue. Basically I am using
 host-kernel memory (mmap'ed into guest as io-memory via ivshmem driver) to
 hint the host whenever the guest is in a spin-locked section, which is read
 by the host scheduler to defer preemption.
 
 Guest side:
 
 static inline void spin_lock(spinlock_t *lock)
 {
   raw_spin_lock(&lock->rlock);
 +   __get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
 }
 
 static inline void spin_unlock(spinlock_t *lock)
 {
 + __get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
 raw_spin_unlock(&lock->rlock);
 }
 
 [similar changes to other spinlock variants]

Great, this is a nice way to improve it.

You might want to consider playing with first taking a ticket, and
then if we fail to acquire the lock immediately, then increment
defer_preempt before we start spinning.

The downside of this would be if we waste all our slice on spinning
and then preempted in the critical section. But with ticket locks
you can easily see how many entries in the queue in front of you.
So you could experiment with starting to defer preempt when we
notice we are getting toward the head of the queue.
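
Roughly what I mean, as a sketch on top of your snippet (the struct and
its fields are illustrative, not the real arch_spinlock_t): take the
ticket unconditionally, but only raise the defer hint once we are next
in line or have acquired the lock.

/* Illustrative only; the fetch-and-add stands in for the arch's locked xadd. */
struct ticket_lock_sketch {
	unsigned short head;	/* ticket currently being served */
	unsigned short tail;	/* next ticket to hand out */
};

static void spin_lock_ticket_sketch(struct ticket_lock_sketch *lock)
{
	unsigned short me = __sync_fetch_and_add(&lock->tail, 1);
	int hinted = 0;

	while (ACCESS_ONCE(lock->head) != me) {
		/* Only ask the host to defer preemption when we're next in line. */
		if (!hinted && (unsigned short)(me - ACCESS_ONCE(lock->head)) == 1) {
			__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
			hinted = 1;
		}
		cpu_relax();
	}
	if (!hinted)
		__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
	/* the matching decrement stays in spin_unlock(), as in your patch */
}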

Have you also looked at how s390 checks if the owning vcpu is running
and if so it spins, if not yields to the hypervisor. Something like
turning it into an adaptive lock. This could be applicable as well.



Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-03 Thread Nick Piggin
On Thu, Jun 03, 2010 at 05:34:50PM +0530, Srivatsa Vaddagiri wrote:
 On Thu, Jun 03, 2010 at 08:38:55PM +1000, Nick Piggin wrote:
   Guest side:
   
   static inline void spin_lock(spinlock_t *lock)
   {
  raw_spin_lock(&lock->rlock);
    +   __get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
    }
    
    static inline void spin_unlock(spinlock_t *lock)
    {
    + __get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
    raw_spin_unlock(&lock->rlock);
   }
   
   [similar changes to other spinlock variants]
  
  Great, this is a nice way to improve it.
  
  You might want to consider playing with first taking a ticket, and
  then if we fail to acquire the lock immediately, then increment
  defer_preempt before we start spinning.
 
  The downside of this would be if we waste all our slice on spinning
  and then preempted in the critical section. But with ticket locks
  you can easily see how many entries in the queue in front of you.
  So you could experiment with starting to defer preempt when we
  notice we are getting toward the head of the queue.
 
 Mm - my goal is to avoid long spin times in the first place (because the
 owning vcpu was descheduled at an unfortunate time, i.e. while it was holding
 a lock). In that sense, I am targeting preemption-defer of the lock *holder*
 rather than of the lock acquirer. So ideally, whenever somebody tries to grab
 a lock it should be free most of the time; it can be held only if the owner is
 currently running - which means we won't have to spin too long for the lock.

Holding a ticket in the queue is effectively the same as holding the
lock, from the pov of processes waiting behind.

The difference of course is that CPU cycles do not directly reduce
latency of ticket holders (only the owner). Spinlock critical sections
should tend to be several orders of magnitude shorter than context
switch times. So if you preempt the guy waiting at the head of the
queue, then it's almost as bad as preempting the lock holder.

 
  Have you also looked at how s390 checks if the owning vcpu is running
  and if so it spins, if not yields to the hypervisor. Something like
  turning it into an adaptive lock. This could be applicable as well.
 
 I don't think even s390 does adaptive spinlocks. Also afaik s390 zVM does gang
 scheduling of vcpus, which reduces the severity of this problem very much -
 essentially lock acquirer/holder are run simultaneously on different cpus all
 the time. Gang scheduling is on my list of things to look at much later
 (although I have been warned that it's a scalability nightmare!).

It effectively is pretty well an adaptive lock. The spinlock itself
doesn't sleep of course, but it yields to the hypervisor if the owner
has been preempted. This is pretty close to analogous with Linux
adaptive mutexes.
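
In other words, something like this (a sketch; vcpu_is_running() and
yield_to_hypervisor() are stand-ins for whatever the platform provides,
not real interfaces):

struct lock_sketch {
	int locked;
	int owner;			/* vcpu id of the current holder */
};

extern int vcpu_is_running(int vcpu);	/* hypothetical query to the hypervisor */
extern void yield_to_hypervisor(void);	/* hypothetical yield/diagnose call */

static int try_lock_sketch(struct lock_sketch *l, int me)
{
	if (__sync_lock_test_and_set(&l->locked, 1) == 0) {
		l->owner = me;
		return 1;
	}
	return 0;
}

static void adaptive_lock_sketch(struct lock_sketch *l, int me)
{
	while (!try_lock_sketch(l, me)) {
		if (vcpu_is_running(l->owner))
			cpu_relax();		/* holder is running: keep spinning */
		else
			yield_to_hypervisor();	/* holder preempted: give our slice back */
	}
}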

s390 also has the diag9c instruction which I suppose somehow boosts
priority of a preempted contended lock holder. In spite of any other
possible optimizations in their hypervisor like gang scheduling,
diag9c apparently provides quite a large improvement in some cases.

And they aren't even using ticket spinlocks!!

So I think these things are fairly important to look at.



Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-03 Thread Nick Piggin
On Thu, Jun 03, 2010 at 06:28:21PM +0530, Srivatsa Vaddagiri wrote:
 On Thu, Jun 03, 2010 at 10:38:32PM +1000, Nick Piggin wrote:
  Holding a ticket in the queue is effectively the same as holding the
  lock, from the pov of processes waiting behind.
  
  The difference of course is that CPU cycles do not directly reduce
  latency of ticket holders (only the owner). Spinlock critical sections
  should tend to be several orders of magnitude shorter than context
  switch times. So if you preempt the guy waiting at the head of the
  queue, then it's almost as bad as preempting the lock holder.
 
 Ok got it - although that approach is not advisable in some cases, for ex: when
 the lock holder vcpu and lock acquirer vcpu are scheduled on the same pcpu by
 the hypervisor (which was experimented with in [1], where they found a huge
 hit in perf).

Sure but if you had adaptive yielding, that solves that problem.

 
 I agree that in general we should look at deferring preemption of the lock
 acquirer, esp. when it's at the head as you suggest - I will consider that
 approach as the next step (want to progress incrementally, basically!).

Have you also looked at how s390 checks if the owning vcpu is running
and if so it spins, if not yields to the hypervisor. Something like
turning it into an adaptive lock. This could be applicable as well.
   
    I don't think even s390 does adaptive spinlocks. Also afaik s390 zVM does
    gang scheduling of vcpus, which reduces the severity of this problem very
    much - essentially lock acquirer/holder are run simultaneously on different
    cpus all the time. Gang scheduling is on my list of things to look at much
    later (although I have been warned that it's a scalability nightmare!).
  
  It effectively is pretty well an adaptive lock. The spinlock itself
  doesn't sleep of course, but it yields to the hypervisor if the owner
  has been preempted. This is pretty close to analogous with Linux adaptive 
  mutexes.
 
 Oops you are right - sorry, should have checked more closely earlier. Given
 that we may not be able to always guarantee that locked critical sections will
 not be preempted (ex: when a real-time task takes over), we will need a
 combination of both approaches (i.e. request preemption defer on the lock hold
 path + yield on the lock acquire path if owner !scheduled). The advantage of
 the former approach is that it could reduce job turnaround times in most cases
 (as the lock is available when we want it, or we don't have to wait too long
 for it).

Both I think would be good. It might be interesting to talk with the
s390 guys and see if they can look at ticket locks and preempt defer
techniques too (considering they already do the other half of the
equation well).

 
  s390 also has the diag9c instruction which I suppose somehow boosts
  priority of a preempted contended lock holder. In spite of any other
  possible optimizations in their hypervisor like gang scheduling,
  diag9c apparently provides quite a large improvement in some cases.
 
 Ok - thx for that pointer - will have a look at diag9c.
 
  So I think these things are fairly important to look at.
 
 I agree ..
 
 - vatsa


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-03 Thread Nick Piggin
On Thu, Jun 03, 2010 at 05:17:30PM +0200, Andi Kleen wrote:
 On Thu, Jun 03, 2010 at 10:38:32PM +1000, Nick Piggin wrote:
  And they aren't even using ticket spinlocks!!
 
 I suppose they simply don't have unfair memory. Makes things easier.

That would certainly be a part of it; I'm sure they have stronger
fairness and guarantees at the expense of some performance. We saw the
spinlock starvation first on 8-16 core Opterons I think, whereas Altix
had been over 1024 cores, and POWER7 is at 1024 threads now, apparently
without reported problems.

However I think more is needed than simply fair memory at the cache
coherency level, considering that for example s390 implements it simply
by retrying cas until it succeeds. So you could perfectly round-robin
all cache requests for the lock word, but one core could quite easily
always find it is granted the cacheline when the lock is already taken.

So I think actively enforcing fairness at the lock level would be
required. Something like: if it is detected that a core is not making
progress on a tight cas loop, then it will need to enter a queue of
cores where the head of the queue is always granted the cacheline first
after it has been dirtied. Interrupts would need to be ignored by this
logic. This still doesn't solve the problem of an owner unfairly
releasing and grabbing the lock again; they could add more detection to
handle that.

I don't know how far hardware goes. Maybe it is enough to statistically
avoid starvation if memory is pretty fair. But it does seem a lot easier
to enforce fairness in software.



Re: [Qemu-devel] [PATCH] Inter-VM shared memory PCI device

2010-03-10 Thread Nick Piggin
On Thu, Mar 11, 2010 at 03:10:47AM +, Jamie Lokier wrote:
 Paul Brook wrote:
   In a cross environment that becomes extremely hairy.  For example the x86
   architecture effectively has an implicit write barrier before every
   store, and an implicit read barrier before every load.
   
   Btw, x86 doesn't have any implicit barriers due to ordinary loads.
   Only stores and atomics have implicit barriers, afaik.
  
  As of March 2009[1] Intel guarantees that memory reads occur in
  order (they may only be reordered relative to writes). It appears
  AMD do not provide this guarantee, which could be an interesting
  problem for heterogeneous migration..
 
 (Summary: At least on AMD64, it does too, for normal accesses to
 naturally aligned addresses in write-back cacheable memory.)
 
 Oh, that's interesting.  Way back when I guess we knew writes were in
 order and it wasn't explicit that reads were, hence smp_rmb() using a
 locked atomic.
 
 Here is a post by Nick Piggin from 2007 with links to Intel _and_ AMD
 documents asserting that reads to cacheable memory are in program order:
 
 http://lkml.org/lkml/2007/9/28/212
 Subject: [patch] x86: improved memory barrier implementation
 
 Links to documents:
 
 http://developer.intel.com/products/processor/manuals/318147.pdf
 
 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
 
 The Intel link doesn't work any more, but the AMD one does.

It might have been merged into their development manual now.

 
 Nick asserts both manufacturers are committed to in-order loads from
 cacheable memory for the x86 architecture.

At the time we did ask Intel and AMD engineers. We talked with Andy
Glew from Intel I believe, but I can't recall the AMD contact.
Linus was involved in the discussions as well. We tried to do the
right thing with this.

 I have just read the AMD document, and it is in there (but not
 completely obviously), in section 7.2.  The implicit load-load and
 store-store barriers are only guaranteed for normal cacheable
 accesses on naturally aligned boundaries to WB [write-back cacheable]
 memory.  There are also implicit load-store barriers but not
 store-load.
 
 Note that the document covers AMD64; it does not say anything about
 their (now old) 32-bit processors.

Hmm. Well it couldn't hurt to ask again. We've never seen any
problems yet, so I'm rather sure we're in the clear.

 
  [*] The most recent docs I have handy. Up to and including Core-2 Duo.
 
 Are you sure the read ordering applies to 32-bit Intel and AMD CPUs too?
 
 Many years ago, before 64-bit x86 existed, I recall discussions on
 LKML where it was made clear that stores were performed in program
 order.  If it were known at the time that loads were performed in
 program order on 32-bit x86s, I would have expected that to have been
 mentioned by someone.

The way it was explained to us by the Intel engineer is that they
had implemented only visibly in-order loads, but they wanted to keep
their options open in the future, so they did not want to commit to
in-order loads as an ISA feature.

So when the whitepaper was released we got their blessing to
retroactively apply the rules to previous CPUs.
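
The practical upshot for people writing the barrier pairs doesn't change:
you still write the publish/consume pairing, it's just that on x86 both
sides can compile down to compiler barriers for WB memory. A generic
sketch of the pairing in question:

/* Generic publish/consume sketch; on x86, smp_wmb()/smp_rmb() here can
 * be plain compiler barriers given the ordering rules discussed above. */
static int data_sketch;
static int ready_sketch;

static void publisher(void)
{
	data_sketch = 42;
	smp_wmb();			/* order the data store before the flag store */
	ready_sketch = 1;
}

static int consumer(void)
{
	if (ACCESS_ONCE(ready_sketch)) {
		smp_rmb();		/* order the flag load before the data load */
		return data_sketch;
	}
	return -1;			/* not published yet */
}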



Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux v3

2009-04-20 Thread Nick Piggin
On Friday 17 April 2009 17:08:07 Jared Hulbert wrote:
  As everyone knows, my favourite thing is to say nasty things about any
  new feature that adds complexity to common code. I feel like crying to
  hear about how many more instances of MS Office we can all run, if only
  we apply this patch. And the poorly written HPC app just sounds like
  scrapings from the bottom of justification barrel.
 
  I'm sorry, maybe I'm way off with my understanding of how important
  this is. There isn't too much help in the changelog. A discussion of
  where the memory savings come from, and how far things like sharing of
  fs images or ballooning go, and how much extra savings we get from
  this... with people from other hypervisors involved as well.
  Have I missed this kind of discussion?
 
 Nick,
 
 I don't know about other hypervisors, fs and ballooning, but I have
 tried this out.  It works.  It works on apps I don't consider poorly
 written.  I'm very excited about this.  I got a 10% saving in a
 roughly off-the-shelf embedded system.  No user-noticeable performance
 impact.

OK well that's what I want to hear. Thanks, that means a lot to me.


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux v3

2009-04-16 Thread Nick Piggin
On Wednesday 15 April 2009 08:09:03 Andrew Morton wrote:
 On Thu,  9 Apr 2009 06:58:37 +0300
 Izik Eidus iei...@redhat.com wrote:
 
  KSM is a Linux driver that allows dynamically sharing identical memory
  pages between one or more processes.
 
 Generally looks OK to me.  But that doesn't mean much.  We should rub
 bottles with words like hugh and nick on them to be sure.

I haven't looked too closely at it yet sorry. Hugh has a great eye for
these details, though, hint hint :)

As everyone knows, my favourite thing is to say nasty things about any
new feature that adds complexity to common code. I feel like crying to
hear about how many more instances of MS Office we can all run, if only
we apply this patch. And the poorly written HPC app just sounds like
scrapings from the bottom of justification barrel.

I'm sorry, maybe I'm way off with my understanding of how important
this is. There isn't too much help in the changelog. A discussion of
where the memory savings come from, and how far things like sharing of
fs images or ballooning go, and how much extra savings we get from
this... with people from other hypervisors involved as well.
Have I missed this kind of discussion?

Careful what you wish for, ay? :)


Re: __purge_vmap_area_lazy crash with CONFIG_PREEMPT_RCU=y

2008-12-30 Thread Nick Piggin
On Wednesday 31 December 2008 02:32:50 Avi Kivity wrote:
 Marcelo Tosatti wrote:
  On Tue, Dec 30, 2008 at 02:53:36PM +1100, Nick Piggin wrote:
   RSP 88011f4c7be8
  ---[ end trace 31811279a2e983e8 ]---
  note: qemu-system-x86[4440] exited with preempt_count 2
 
 
  (gdb) l *(__purge_vmap_area_lazy + 0x12c)
  0x80289ca2 is in __purge_vmap_area_lazy (mm/vmalloc.c:516).
  511                 if (nr || force_flush)
  512                         flush_tlb_kernel_range(*start, *end);
  513
  514                 if (nr) {
  515                         spin_lock(&vmap_area_lock);
  516                         list_for_each_entry(va, &valist, purge_list)
  517                                 __free_vmap_area(va);
  518                         spin_unlock(&vmap_area_lock);
  519                 }
  520                 spin_unlock(&purge_lock);
 
  0x80289c9a <__purge_vmap_area_lazy+292>:    mov    0x40(%rbx),%rax
  0x80289c9e <__purge_vmap_area_lazy+296>:    lea    -0x40(%rax),%rbx
  0x80289ca2 <__purge_vmap_area_lazy+300>:    mov    0x40(%rbx),%rax
                                                     ^^^
 
  Note:
 
  RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b2b

 Good old POISON_FREE.

Right, it seems like it has been kfreed while it is still accessed via RCU.
But I just can't see how the vmap_area can be freed while there is a
concurrent process traversing the vmap_area_list... __free_vmap_area removes
the entry from the list first, then does a call_rcu to kfree it.

Hmm...
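
For context, the idiom in question is the usual RCU list teardown: unlink
first, then defer the kfree past a grace period, so a traversal inside
rcu_read_lock() should never touch freed memory. A sketch of the pattern
(not mm/vmalloc.c verbatim):

struct area_sketch {
	struct list_head list;
	struct rcu_head rcu;
};

static void area_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct area_sketch, rcu));
}

/* Caller holds the lock protecting writers. */
static void free_area_sketch(struct area_sketch *a)
{
	list_del_rcu(&a->list);			/* new readers can no longer find it... */
	call_rcu(&a->rcu, area_free_rcu);	/* ...and it is freed only after a grace period */
}

static void walk_areas_sketch(struct list_head *head)
{
	struct area_sketch *a;

	rcu_read_lock();
	list_for_each_entry_rcu(a, head, list)
		;	/* readers inside this section keep the memory valid */
	rcu_read_unlock();
}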



Re: __purge_vmap_area_lazy crash with CONFIG_PREEMPT_RCU=y

2008-12-29 Thread Nick Piggin
On Tuesday 30 December 2008 01:58:21 Marcelo Tosatti wrote:
 On Wed, Dec 24, 2008 at 04:28:44PM +0100, Andrea Arcangeli wrote:
  On Wed, Dec 24, 2008 at 02:50:57PM +0200, Avi Kivity wrote:
   Marcelo Tosatti wrote:
   The destructor for huge pages uses the backing inode for adjusting
   hugetlbfs accounting.
  
   Hugepage mappings are destroyed by exit_mmap, after
   mmu_notifier_release, so there are no notifications through
   unmap_hugepage_range at this point.
  
   The hugetlbfs inode can be freed with pages backed by it referenced
   by the shadow. When the shadow releases its reference, the huge page
   destructor will access a now freed inode.
  
   Implement the release operation for kvm mmu notifiers to release page
   refs before the hugetlbfs inode is gone.
  
   I see this isn't it.  Andrea, comments?
 
  Yeah, the patch looks good, I talked a bit with Marcelo about this by
  PM. The issue is that it's not as straightforward as it seems:
  basically, when I implemented the ->release handlers and had sptes
  teardown running before the files were closed (instead of waiting for the
  kvm anon inode release handler to fire) I was getting bugchecks from
  debug options including preempt=y (certain debug checks only become
  functional with preempt enabled, unfortunately), so eventually I
  removed ->release, because for kvm ->release wasn't useful: no
  guest mode can run any more by the time mmu notifier ->release is
  invoked, and that avoided the issues with the bugchecks.
 
  We'll be using the mmu notifiers' ->release because it's always called
  just before the filehandles are destroyed; it's not really about the
  guest mode or secondary mmu but just an ordering issue with hugetlbfs
  internals.
 
  So in short, if no bugcheck triggers this is fine (at least until
  hugetlbfs provides a way to register some callback to invoke at the
  start of the hugetlbfs ->release handler).

 The only bugcheck I see, which triggers on vanilla kvm upstream with
 CONFIG_PREEMPT_DEBUG=y and CONFIG_PREEMPT_RCU=y is:

 general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC4ttyS1: 1
 input overrun(s)

 last sysfs file: /sys/class/net/tap0/address
 CPU 0
 Modules linked in: tun ipt_MASQUERADE iptable_nat nf_nat bridge stp llc
 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT
 iptable_filter ip_tables x_tables dm_multipath kvm_intel kvm scsi_wait_scan
 ata_piix libata dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod
 shpchp pci_hotplug mptsas mptscsih mptbase scsi_transport_sas uhci_hcd
 ohci_hcd ehci_hcd Pid: 4768, comm: qemu-system-x86 Not tainted
 2.6.28-00165-g4f27e3e-dirty #164 RIP: 0010:[8028a5b6] 
 [8028a5b6] __purge_vmap_area_lazy+0x12c/0x163 RSP:
 0018:88021e1f9a38  EFLAGS: 00010202
 RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b2b RCX: 0003
 RDX: 80a1dae0 RSI: 880028083980 RDI: 0001
 RBP: 88021e1f9a78 R08: 0286 R09: 80a1bf50
 R10: 880119c270f8 R11: 88021e1f99b8 R12: 88021e1f9a38
 R13: 88021e1f9a90 R14: 88021e1f9a98 R15: 813a
 FS:  () GS:8080d900()
 knlGS: CS:  0010 DS: 002b ES: 002b CR0: 8005003b
 CR2: 008d9828 CR3: 00201000 CR4: 26e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: 0ff0 DR7: 0400
 Process qemu-system-x86 (pid: 4768, threadinfo 88021e1f8000, task
 880119c270f8) Stack:
  88022bdfd840 880119da11b8 c20011c3 813a
   0001 88022ec11c18 88022f061838
  88021e1f9aa8 8028ab1d 88021e1f9aa8 c20021976000
 Call Trace:
  [8028ab1d] free_unmap_vmap_area_noflush+0x69/0x70
  [8028ab49] remove_vm_area+0x25/0x71
  [8028ac54] __vunmap+0x3a/0xca
  [8028ad35] vfree+0x29/0x2b
  [a00f98a3] kvm_free_physmem_slot+0x25/0x7c [kvm]
  [a00f9d75] kvm_free_physmem+0x27/0x36 [kvm]
  [a00fccb4] kvm_arch_destroy_vm+0xa6/0xda [kvm]
  [a00f9e11] kvm_put_kvm+0x8d/0xa7 [kvm]
  [a00fa0e2] kvm_vcpu_release+0x13/0x17 [kvm]
  [802a1c07] __fput+0xeb/0x1a3
  [802a1cd4] fput+0x15/0x17
  [8029f26c] filp_close+0x67/0x72
  [802378a8] put_files_struct+0x74/0xc8
  [80237943] exit_files+0x47/0x4f
  [80238fe5] do_exit+0x1eb/0x7a7
  [80587edf] ? _spin_unlock_irq+0x2b/0x51
  [80239614] do_group_exit+0x73/0xa0
  [80242b10] get_signal_to_deliver+0x30c/0x32c
  [8020b4d5] ? sysret_signal+0x19/0x29
  [8020a80f] do_notify_resume+0x8c/0x851
  [8025b811] ? do_futex+0x90/0x92a
  [80256bd7] ? trace_hardirqs_on_caller+0xf0/0x114
  [80587f51] ? _spin_unlock_irqrestore+0x4c/0x68
  [8026be5c] ? __rcu_read_unlock+0x92/0x9e
  [80256bd7] ? trace_hardirqs_on_caller+0xf0/0x114
  

Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-12 Thread Nick Piggin
On Thursday 13 November 2008 13:31, Andrea Arcangeli wrote:
 On Thu, Nov 13, 2008 at 03:00:59AM +0100, Andrea Arcangeli wrote:
  CPU0 migrate.c  CPU1 filemap.c
  --- --
  find_get_page
  radix_tree_lookup_slot returns the oldpage
  page_count still = expected_count
  freeze_ref (oldpage->count = 0)
  radix_tree_replace (too late, other side already got the oldpage)
  unfreeze_ref (oldpage->count = 2)
  page_cache_get_speculative(old_page)
  set count to 3 and succeeds

 After reading more of this lockless radix tree code, I realized this
 below check is the one that was intended to restart find_get_page and
 prevent it from returning the oldpage:

   if (unlikely(page != *pagep)) {

 But there's no barrier there; atomic_add_unless would need to provide
 an atomic smp_mb() _after_ atomic_add_unless executed. In the old days
 the atomic_* routines had no implied memory barriers; you had to use
 smp_mb__after_atomic_add_unless if you wanted to avoid the race. I
 don't see much in the ppc implementation of atomic_add_unless that
 would provide an implicit smp_mb after page_cache_get_speculative
 returns, so I can't see why the pagep can't be read by find_get_page
 before the other cpu executes radix_tree_replace in the above
 timeline.

All atomic functions that both return a value and modify their memory
are defined to provide full barriers before and after the operation.

powerpc should be OK

	__asm__ __volatile__ (
	LWSYNC_ON_SMP
"1:	lwarx	%0,0,%1		# atomic_add_unless\n\
	cmpw	0,%0,%3 \n\
	beq-	2f \n\
	add	%0,%2,%0 \n"
	PPC405_ERR77(0,%2)
"	stwcx.	%0,0,%1 \n\
	bne-	1b \n"
	ISYNC_ON_SMP
"	subf	%0,%2,%0 \n\
2:"
	: "=&r" (t)
	: "r" (&v->counter), "r" (a), "r" (u)
	: "cc", "memory");

The lwsync instruction prevents all reorderings except that it allows
loads to pass stores. But because it is followed by an LL/SC sequence,
the store part of that sequence is prevented from passing stores, and
therefore it will fail if the load had executed earlier and the
value has changed.

isync prevents speculative execution, so the branch has to be
resolved before any subsequent instructions start. The branch
depends on the result of the LL/SC, so that has to be complete.

So that prevents loads from passing the LL/SC.

For the SC to complete, I think by definition the store has to be
visible, because it has to check the value has not changed (so it
would have to own the cacheline).

I think that's how it works. I'm not an expert at powerpc's weird
barriers, but I'm pretty sure they work.


 I guess you intended to put an smp_mb() in between the
 page_cache_get_speculative and the *pagep to make the code safe on ppc
 too, but there isn't, and I think it must be fixed, either that or I
 don't understand ppc assembly right. The other side has a smp_wmb
 implicit inside radix_tree_replace_slot so it should be ok already to
 ensure we see the refcount going to 0 before we see the pagep changed
 (the fact the other side has a memory barrier, further confirms this
 side needs it too).

I think it is OK. It should have a comment there however to say it
relies on a barrier in get_page_unless_zero (I thought I had one..)
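
Something along these lines is what the comment should say (a sketch of
the find_get_page() speculative lookup under discussion, not a patch):

/* Sketch of find_get_page()'s speculative lookup with the comment added. */
static struct page *lookup_sketch(void **pagep)
{
	struct page *page;

repeat:
	page = radix_tree_deref_slot(pagep);
	if (page) {
		/*
		 * page_cache_get_speculative() boils down to
		 * get_page_unless_zero(), i.e. atomic_add_unless(), and
		 * atomic ops that return a value imply a full smp_mb(),
		 * so the recheck of *pagep below cannot be reordered
		 * before the reference is taken.
		 */
		if (!page_cache_get_speculative(page))
			goto repeat;
		/* Has the page moved?  Relies on the implied barrier above. */
		if (unlikely(page != *pagep)) {
			page_cache_release(page);
			goto repeat;
		}
	}
	return page;
}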


Re: [PATCH 0/4] Fix vmalloc regression

2008-11-07 Thread Nick Piggin
On Fri, Nov 07, 2008 at 08:35:50PM -0200, Glauber Costa wrote:
 Nick,
 
 This is the whole set of patches I was talking about.
 Patch 3 is the one that in fact fixes the problem.
 Patches 1 and 2 are debugging aids I made use of, and could possibly be
 useful to others.
 Patch 4 removes guard pages entirely for non-debug kernels, as we have
 already discussed.
 
 Hope it's all fine.

OK, these all look good, but I may only push 3/4 for Linus in this round,
along with some of the changes from my patch that you tested as well.

With the DEBUG_PAGEALLOC case, I have been thinking that we perhaps should
turn off the lazy unmapping optimisation as well, so it catches use
after free similarly to the page allocator... but probably it is a good
idea at least to avoid the double-guard page for 2.6.28?

Anyway thanks for these, I'll send them up to Andrew/Linus and cc you.



Re: [PATCH 0/4] Fix vmalloc regression

2008-11-07 Thread Nick Piggin
On Saturday 08 November 2008 13:13, Glauber Costa wrote:
 On Sat, Nov 08, 2008 at 01:58:32AM +0100, Nick Piggin wrote:
  On Fri, Nov 07, 2008 at 08:35:50PM -0200, Glauber Costa wrote:
   Nick,
  
   This is the whole set of patches I was talking about.
   Patch 3 is the one that in fact fixes the problem
   Patches 1 and 2 are debugging aids I made use of, and could be possibly
   useful to others
   Patch 4 removes guard pages entirely for non-debug kernels, as we have
   already previously discussed.
  
   Hope it's all fine.
 
  OK, these all look good, but I may only push 3/4 for Linus in this round,
  along with some of the changes from my patch that you tested as well.

 Makes total sense.

OK, sent. Thanks again.


  With the DEBUG_PAGEALLOC case, I have been thinking that we perhaps
  should turn off the lazy unmapping optimisation as well, so it catches
  use after free similarly to the page allocator... but probably it is a
  good idea at least to avoid the double-guard page for 2.6.28?

 Makes sense. Maybe poisoning after free would also be useful?

It's a problem because we're only dealing with virtual addresses, rather
than real memory. So we don't really have anything to poison (we don't
know what the caller will do with the memory). I guess it would be
possible to poison in the page allocator or in vfree, but probably
not worthwhile (after the immediate-unmap debug option).


Re: [PATCH] restart search at beggining of vmalloc address

2008-11-06 Thread Nick Piggin
Excellent, thank you! Good catch.

If you agree my previous patch was also good in combination with this one,
then I'll send them all to be merged.

Thanks,
Nick

On Thu, Nov 06, 2008 at 06:58:26PM -0200, Glauber Costa wrote:
 Currently, vmalloc restarts the search for a free area in case we
 can't find one. The reason is that there are areas which are lazily
 freed, and could possibly be freed now. However, the current
 implementation starts searching the tree from the last failing address,
 which is pretty much by definition at the end of the address space. So,
 we fail.
 
 The proposal of this patch is to restart the search from the beginning
 of the requested vstart address. This fixes the regression in running
 KVM virtual machines for me, described in
 http://lkml.org/lkml/2008/10/28/349, caused by commit
 db64fe02258f1507e13fe5212a989922323685ce.
 
 Signed-off-by: Glauber Costa [EMAIL PROTECTED]
 CC: Nick Piggin [EMAIL PROTECTED]
 ---
  mm/vmalloc.c |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)
 
 diff --git a/mm/vmalloc.c b/mm/vmalloc.c
 index 7db493d..6fe2003 100644
 --- a/mm/vmalloc.c
 +++ b/mm/vmalloc.c
 @@ -378,6 +378,7 @@ found:
   if (!purged) {
   purge_vmap_area_lazy();
   purged = 1;
 + addr = ALIGN(vstart, align);
   goto retry;
   }
   if (printk_ratelimit())
 -- 
 1.5.6.5


Re: [PATCH] regression: vmalloc easily fail.

2008-10-31 Thread Nick Piggin
On Thu, Oct 30, 2008 at 09:28:54AM -0200, Glauber Costa wrote:
 On Thu, Oct 30, 2008 at 05:49:41AM +0100, Nick Piggin wrote:
  On Wed, Oct 29, 2008 at 08:07:37PM -0200, Glauber Costa wrote:
   On Wed, Oct 29, 2008 at 11:43:33AM +0100, Nick Piggin wrote:
On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote:
 Nick Piggin wrote:
 Hmm, spanning 30MB of memory... how much vmalloc space do you have?
 
   
 
 From the original report:
 
 VmallocTotal: 122880 kB
 VmallocUsed:   15184 kB
 VmallocChunk:  83764 kB
 
 So it seems there's quite a bit of free space.
 
 Chunk is the largest free contiguous region, right?  If so, it seems 
 the 

Yes.


 problem is unrelated to guard pages, instead the search isn't finding 
 a 
 1-page area (with two guard pages) for some reason, even though lots 
 of 
 free space is available.

Hmm. The free area search could be buggy...
    Do you want me to grab any specific info of it? Or should I just hack
    myself randomly into it? I'll probably have some time for that tomorrow.
  
  I took a bit of a look. Does this help you at all?
  
  I still think we should get rid of the guard pages in non-debug kernels
  completely, but hopefully this will fix your problems?
 unfortunately, it doesn't.
 problem still happen in a kernel with this patch.

That's weird. Any chance you could dump a list of all the vmap area start
and end addresses and their flags before returning failure?
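
Something like this untested debug hunk would do (against the
vmap_area_list in that kernel, printed just before returning the failure):

/* Untested debug sketch: dump every vmap_area before failing. */
static void dump_vmap_areas_sketch(void)
{
	struct vmap_area *va;

	rcu_read_lock();
	list_for_each_entry_rcu(va, &vmap_area_list, list)
		printk(KERN_ERR "vmap_area %08lx-%08lx flags=%lx\n",
		       va->va_start, va->va_end, va->flags);
	rcu_read_unlock();
}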





Re: [PATCH] regression: vmalloc easily fail.

2008-10-29 Thread Nick Piggin
On Wed, Oct 29, 2008 at 07:48:56AM -0200, Glauber Costa wrote:

 0xf7bfe000-0xf7c08192 hpet_enable+0x2d/0x279 phys=fed0 ioremap
 0xf7c02000-0xf7c040008192 acpi_os_map_memory+0x11/0x1a phys=7fed1000 
 ioremap
 0xf7c06000-0xf7c080008192 acpi_os_map_memory+0x11/0x1a phys=7fef2000 
 ioremap
 0xf7c0a000-0xf7c0c0008192 acpi_os_map_memory+0x11/0x1a phys=7fef2000 
 ioremap
 0xf7c0d000-0xf7c0f0008192 __kvm_set_memory_region+0x164/0x355 [kvm] 
 pages=1 vmalloc
 0xf7c1-0xf7c1f000   61440 acpi_os_map_memory+0x11/0x1a phys=7fed1000 
 ioremap
 0xf7c2-0xf7c220008192 acpi_os_map_memory+0x11/0x1a phys=7fef2000 
 ioremap
 0xf7c24000-0xf7c27000   12288 acpi_os_map_memory+0x11/0x1a phys=7fef2000 
 ioremap
 0xf7c28000-0xf7c2a0008192 acpi_os_map_memory+0x11/0x1a phys=7fede000 
 ioremap
 0xf7c2c000-0xf7c2e0008192 acpi_os_map_memory+0x11/0x1a phys=7fef4000 
 ioremap
 0xf7c3-0xf7c320008192 acpi_os_map_memory+0x11/0x1a phys=7fede000 
 ioremap
 0xf7c34000-0xf7c360008192 acpi_os_map_memory+0x11/0x1a phys=7fef4000 
 ioremap
 0xf7c38000-0xf7c3a0008192 acpi_os_map_memory+0x11/0x1a phys=7fed1000 
 ioremap
 0xf7c3c000-0xf7c3e0008192 acpi_os_map_memory+0x11/0x1a phys=7fede000 
 ioremap
 0xf7c4-0xf7c420008192 acpi_os_map_memory+0x11/0x1a phys=7fede000 
 ioremap
 0xf7c44000-0xf7c460008192 acpi_os_map_memory+0x11/0x1a phys=7fede000 
 ioremap
 0xf7c48000-0xf7c4a0008192 acpi_os_map_memory+0x11/0x1a phys=7fede000 
 ioremap
 0xf7c4b000-0xf7c57000   49152 zisofs_init+0xd/0x1c pages=11 vmalloc
 0xf7c58000-0xf7c5a0008192 yenta_probe+0x123/0x63f phys=e430 ioremap
 0xf7c5b000-0xf7c5e000   12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
 0xf7c61000-0xf7c65000   16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
 0xf7c67000-0xf7c71000   40960 vmalloc_exec+0x12/0x14 pages=9 vmalloc
 0xf7c72000-0xf7c740008192 usb_hcd_pci_probe+0x132/0x277 phys=ee404000 
 ioremap
 0xf7c75000-0xf7c7c000   28672 vmalloc_exec+0x12/0x14 pages=6 vmalloc
 0xf7c7d000-0xf7c86000   36864 vmalloc_exec+0x12/0x14 pages=8 vmalloc
 0xf7c87000-0xf7c890008192 __kvm_set_memory_region+0x1df/0x355 [kvm] 
 pages=1 vmalloc
 0xf7c8a000-0xf7c91000   28672 vmalloc_exec+0x12/0x14 pages=6 vmalloc
 0xf7c92000-0xf7c97000   20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
 0xf7c9b000-0xf7c9f000   16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
 0xf7ca-0xf7ca5000   20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
 0xf7ca6000-0xf7cab000   20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
 0xf7cac000-0xf7caf000   12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
 0xf7cb-0xf7cb20008192 iTCO_wdt_probe+0xbe/0x281 [iTCO_wdt] 
 phys=fed1f000 ioremap
 0xf7cb3000-0xf7cbf000   49152 vmalloc_exec+0x12/0x14 pages=11 vmalloc
 0xf7cc-0xf7cce000   57344 vmalloc_exec+0x12/0x14 pages=13 vmalloc
 0xf7ccf000-0xf7cd2000   12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
 0xf7cd3000-0xf7cd50008192 __kvm_set_memory_region+0x164/0x355 [kvm] 
 pages=1 vmalloc
 0xf7cd6000-0xf7cdc000   24576 vmalloc_exec+0x12/0x14 pages=5 vmalloc
 0xf7cdd000-0xf7ce   12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
 0xf7ce1000-0xf7ce4000   12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
 0xf7ce6000-0xf7d02000  114688 vmalloc_exec+0x12/0x14 pages=27 vmalloc
 0xf7d03000-0xf7d08000   20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
 0xf7d09000-0xf7d0b0008192 __kvm_set_memory_region+0x1df/0x355 [kvm] 
 pages=1 vmalloc
 0xf7d0c000-0xf7d12000   24576 vmalloc_exec+0x12/0x14 pages=5 vmalloc
 0xf7d13000-0xf7d24000   69632 snd_malloc_sgbuf_pages+0x190/0x1ca 
 [snd_page_alloc] vmap
 0xf7d25000-0xf7d2a000   20480 drm_ht_create+0x69/0xa5 [drm] pages=4 vmalloc
 0xf7d2d000-0xf7d2f0008192 dm_vcalloc+0x24/0x4c [dm_mod] pages=1 vmalloc
 0xf7d3-0xf7d5  131072 vmalloc_exec+0x12/0x14 pages=31 vmalloc
 0xf7d51000-0xf7d54000   12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
 0xf7d56000-0xf7d5a000   16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
 0xf7d5b000-0xf7d5e000   12288 vmalloc_user+0x13/0x38 pages=2 vmalloc user
 0xf7d5f000-0xf7d67000   32768 vmalloc_exec+0x12/0x14 pages=7 vmalloc
 0xf7d68000-0xf7d79000   69632 snd_malloc_sgbuf_pages+0x190/0x1ca 
 [snd_page_alloc] vmap
 0xf7d7a000-0xf7d86000   49152 vmalloc_exec+0x12/0x14 pages=11 vmalloc
 0xf7d87000-0xf7d8e000   28672 vmalloc_exec+0x12/0x14 pages=6 vmalloc
 0xf7d8f000-0xf7d910008192 __kvm_set_memory_region+0x164/0x355 [kvm] 
 pages=1 vmalloc
 0xf7d92000-0xf7d940008192 dm_vcalloc+0x24/0x4c [dm_mod] pages=1 vmalloc
 0xf7d96000-0xf7dba000  147456 vmalloc_exec+0x12/0x14 pages=35 vmalloc
 0xf7dbb000-0xf7dc   20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
 0xf7dc1000-0xf7dc4000   12288 vmalloc_32+0x12/0x14 pages=2 vmalloc
 0xf7dc7000-0xf7dc90008192 dm_vcalloc+0x24/0x4c [dm_mod] pages=1 vmalloc
 0xf7dcb000-0xf7dd4000   36864 vmalloc_exec+0x12/0x14 pages=8 vmalloc
 0xf7dd5000-0xf7dd70008192 __kvm_set_memory_region+0x1df/0x355 [kvm] 
 pages=1 vmalloc
 0xf7dd8000-0xf7dda0008192 pci_iomap+0x72/0x7b 

Re: [PATCH] regression: vmalloc easily fail.

2008-10-29 Thread Nick Piggin
On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote:
 Nick Piggin wrote:
 Hmm, spanning 30MB of memory... how much vmalloc space do you have?
 
   
 
 From the original report:
 
 VmallocTotal: 122880 kB
 VmallocUsed:   15184 kB
 VmallocChunk:  83764 kB
 
 So it seems there's quite a bit of free space.
 
 Chunk is the largest free contiguous region, right?  If so, it seems the 

Yes.


 problem is unrelated to guard pages, instead the search isn't finding a 
 1-page area (with two guard pages) for some reason, even though lots of 
 free space is available.

Hmm. The free area search could be buggy...


Re: [PATCH] regression: vmalloc easily fail.

2008-10-29 Thread Nick Piggin
On Wed, Oct 29, 2008 at 08:07:37PM -0200, Glauber Costa wrote:
 On Wed, Oct 29, 2008 at 11:43:33AM +0100, Nick Piggin wrote:
  On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote:
   Nick Piggin wrote:
   Hmm, spanning 30MB of memory... how much vmalloc space do you have?
   
 
   
   From the original report:
   
   VmallocTotal: 122880 kB
   VmallocUsed:   15184 kB
   VmallocChunk:  83764 kB
   
   So it seems there's quite a bit of free space.
   
   Chunk is the largest free contiguous region, right?  If so, it seems the 
  
  Yes.
  
  
   problem is unrelated to guard pages, instead the search isn't finding a 
   1-page area (with two guard pages) for some reason, even though lots of 
   free space is available.
  
  Hmm. The free area search could be buggy...
 Do you want me to grab any specific info of it? Or should I just hack myself
 randomly into it? I'll probably have some time for that tomorrow.

I took a bit of a look. Does this help you at all?

I still think we should get rid of the guard pages in non-debug kernels
completely, but hopefully this will fix your problems?
--

- Fix off-by-one bug in the KVA allocator that can leave gaps 
- An initial vmalloc failure should start off a synchronous flush of lazy
  areas, in case someone is in progress flushing them already.
- Purge lock can be a mutex so we can sleep while that's going on.
 
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/mm/vmalloc.c
===
--- linux-2.6.orig/mm/vmalloc.c
+++ linux-2.6/mm/vmalloc.c
@@ -14,6 +14,7 @@
 #include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
+#include <linux/mutex.h>
 #include <linux/interrupt.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
@@ -362,7 +363,7 @@ retry:
goto found;
}
 
-	while (addr + size >= first->va_start && addr + size <= vend) {
+	while (addr + size > first->va_start && addr + size <= vend) {
 		addr = ALIGN(first->va_end + PAGE_SIZE, align);
 
 		n = rb_next(&first->rb_node);
@@ -472,7 +473,7 @@ static atomic_t vmap_lazy_nr = ATOMIC_IN
 static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
int sync, int force_flush)
 {
-   static DEFINE_SPINLOCK(purge_lock);
+   static DEFINE_MUTEX(purge_lock);
LIST_HEAD(valist);
struct vmap_area *va;
int nr = 0;
@@ -483,10 +484,10 @@ static void __purge_vmap_area_lazy(unsig
 * the case that isn't actually used at the moment anyway.
 */
 	if (!sync && !force_flush) {
-		if (!spin_trylock(&purge_lock))
+		if (!mutex_trylock(&purge_lock))
 			return;
 	} else
-		spin_lock(&purge_lock);
+		mutex_lock(&purge_lock);
 
rcu_read_lock();
 	list_for_each_entry_rcu(va, &vmap_area_list, list) {
@@ -518,7 +519,18 @@ static void __purge_vmap_area_lazy(unsig
__free_vmap_area(va);
 		spin_unlock(&vmap_area_lock);
 	}
-	spin_unlock(&purge_lock);
+	mutex_unlock(&purge_lock);
+}
+
+/*
+ * Kick off a purge of the outstanding lazy areas. Don't bother if somebody
+ * is already purging.
+ */
+static void try_purge_vmap_area_lazy(void)
+{
+   unsigned long start = ULONG_MAX, end = 0;
+
+	__purge_vmap_area_lazy(&start, &end, 0, 0);
 }
 
 /*
@@ -528,7 +540,7 @@ static void purge_vmap_area_lazy(void)
 {
unsigned long start = ULONG_MAX, end = 0;
 
-	__purge_vmap_area_lazy(&start, &end, 0, 0);
+	__purge_vmap_area_lazy(&start, &end, 1, 0);
 }
 
 /*
@@ -539,7 +551,7 @@ static void free_unmap_vmap_area(struct 
 	va->flags |= VM_LAZY_FREE;
 	atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr);
 	if (unlikely(atomic_read(&vmap_lazy_nr) > lazy_max_pages()))
-   purge_vmap_area_lazy();
+   try_purge_vmap_area_lazy();
 }
 
 static struct vmap_area *find_vmap_area(unsigned long addr)


Re: [PATCH] regression: vmalloc easily fail.

2008-10-28 Thread Nick Piggin
On Tue, Oct 28, 2008 at 08:55:13PM -0200, Glauber Costa wrote:
 Commit db64fe02258f1507e13fe5212a989922323685ce broke
 KVM (the symptom) for me. The cause is that vmalloc
 allocations fail, despite the fact that /proc/meminfo
 shows plenty of vmalloc space available.
 
 After some investigation, it seems to me that the current
 way to compute the next addr in the rb-tree traversal
 leaves a spare page between each allocation. After a few
 allocations, regardless of their size, we run out of vmalloc
 space.

Right... that was to add a guard page like the old vmalloc allocator.
vmallocs still add their extra page too, so most of them will have
a 2-page guard area, but I didn't think this would hurt significantly.

I'm not against the patch, but I wonder exactly what is filling it up
and how? (can you look at the vmalloc proc function to find out?)

 
 Signed-off-by: Glauber Costa [EMAIL PROTECTED]
 Cc: Jeremy Fitzhardinge [EMAIL PROTECTED]
 Cc: Krzysztof Helt [EMAIL PROTECTED]
 ---
  mm/vmalloc.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/mm/vmalloc.c b/mm/vmalloc.c
 index 0365369..a33b0d1 100644
 --- a/mm/vmalloc.c
 +++ b/mm/vmalloc.c
 @@ -363,7 +363,7 @@ retry:
   }
  
   while (addr + size >= first->va_start && addr + size <= vend) {
 - addr = ALIGN(first->va_end + PAGE_SIZE, align);
 + addr = ALIGN(first->va_end, align);
  
   n = rb_next(&first->rb_node);
   if (n)
 -- 
 1.5.6.5