Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-16 Thread Robin Holt
On Fri, May 16, 2008 at 01:52:03AM +0200, Nick Piggin wrote:
 On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote:
  On Thu, 15 May 2008, Nick Piggin wrote:
  
   Oh, I get that confused because of the mixed up naming conventions
   there: unmap_page_range should actually be called zap_page_range. But
   at any rate, yes we can easily zap pagetables without holding mmap_sem.
  
  How is that synchronized with code that walks the same pagetable. These 
  walks may not hold mmap_sem either. I would expect that one could only 
  remove a portion of the pagetable where we have some sort of guarantee 
  that no accesses occur. So the removal of the vma prior ensures that?
  
 I don't really understand the question. If you remove the pte and invalidate
 the TLBs on the remote image's process (importing the page), then it can
 of course try to refault the page in because its vma is still there. But
 you catch that refault in your driver, which can prevent the page from
 being faulted back in.

I think Christoph's question has more to do with faults that are
in flight.  A recently requested fault could have just released the
last lock that was holding up the invalidate callout.  It would then
begin messaging back the response PFN, which could still be in flight.
The invalidate callout would then fire and do the interrupt shoot-down
while that response was still active (essentially beating the in-flight
response).  The invalidate would clear up nothing, and the response
would then insert the PFN after it is no longer the correct PFN.

Thanks,
Robin



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-16 Thread Robin Holt
On Fri, May 16, 2008 at 06:23:06AM -0500, Robin Holt wrote:
 On Fri, May 16, 2008 at 01:52:03AM +0200, Nick Piggin wrote:
  On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote:
   On Thu, 15 May 2008, Nick Piggin wrote:
   
Oh, I get that confused because of the mixed up naming conventions
there: unmap_page_range should actually be called zap_page_range. But
at any rate, yes we can easily zap pagetables without holding mmap_sem.
   
   How is that synchronized with code that walks the same pagetable. These 
   walks may not hold mmap_sem either. I would expect that one could only 
   remove a portion of the pagetable where we have some sort of guarantee 
   that no accesses occur. So the removal of the vma prior ensures that?
   
  I don't really understand the question. If you remove the pte and invalidate
  the TLBs on the remote image's process (importing the page), then it can
  of course try to refault the page in because its vma is still there. But
  you catch that refault in your driver, which can prevent the page from
  being faulted back in.
 
 I think Christoph's question has more to do with faults that are
 in flight.  A recently requested fault could have just released the
 last lock that was holding up the invalidate callout.  It would then
 begin messaging back the response PFN, which could still be in flight.
 The invalidate callout would then fire and do the interrupt shoot-down
 while that response was still active (essentially beating the in-flight
 response).  The invalidate would clear up nothing, and the response
 would then insert the PFN after it is no longer the correct PFN.

I just looked over XPMEM.  I think we could make this work.  We already
have a list of active faults which is protected by a simple spinlock.
I would need to nest this lock within another lock protecting our PFN
table (currently it is a mutex), and then the invalidate interrupt handler
would need to mark the fault as invalid (a mechanism that already exists).
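
Roughly, the shape I am thinking of is something like this (illustrative
names only, not the actual XPMEM structures or locking):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/errno.h>

struct active_fault {
    struct list_head list;
    unsigned long vaddr;
    int invalid;                        /* set by the invalidate interrupt */
};

static LIST_HEAD(active_faults);
static DEFINE_SPINLOCK(fault_lock);     /* nests inside the PFN-table lock */

/* hypothetical helper that installs the PFN in the importer's table */
extern int insert_pfn(unsigned long vaddr, unsigned long pfn);

/* Interrupt context: poison any fault in flight for the range, then
 * clear the PFN-table entries for [start, end). */
static void invalidate_interrupt(unsigned long start, unsigned long end)
{
    struct active_fault *f;

    spin_lock(&fault_lock);
    list_for_each_entry(f, &active_faults, list)
        if (f->vaddr >= start && f->vaddr < end)
            f->invalid = 1;
    spin_unlock(&fault_lock);
}

/* Fault-response path: drop the PFN if the fault was invalidated while
 * the response was still in flight. */
static int fault_complete(struct active_fault *f, unsigned long pfn)
{
    int stale;

    spin_lock(&fault_lock);
    stale = f->invalid;
    list_del(&f->list);
    spin_unlock(&fault_lock);

    return stale ? -EAGAIN : insert_pfn(f->vaddr, pfn);
}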

I think my sticking points with the interrupt method remain fault
containment and timeout.  The inability of the ia64 processor to
provide predictive failures for the read/write of memory on other
partitions prevents us from being able to contain the failure.  I don't
think we can get the information we would need to do the invalidate
without introducing fault containment issues, which have been a continuous
area of concern for our customers.


Thanks,
Robin



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-15 Thread Nick Piggin
On Wed, May 14, 2008 at 06:26:25AM -0500, Robin Holt wrote:
 On Wed, May 14, 2008 at 06:11:22AM +0200, Nick Piggin wrote:
  
  I guess that you have found a way to perform TLB flushing within coherent
  domains over the numalink interconnect without sleeping. I'm sure it would
  be possible to send similar messages between non coherent domains.
 
 I assume by coherent domains, you are actually talking about system
 images.

Yes

  Our memory coherence domain on the 3700 family is 512 processors
 on 128 nodes.  On the 4700 family, it is 16,384 processors on 4096 nodes.
 We extend a Read-Exclusive mode beyond the coherence domain so any
 processor is able to read any cacheline on the system.  We also provide
 uncached access for certain types of memory beyond the coherence domain.

Yes, I understand the basics.

 
 For the other partitions, the exporting partition does not know what
 virtual address the imported pages are mapped.  The pages are frequently
 mapped in a different order by the MPI library to help with MPI collective
 operations.
 
 For the exporting side to do those TLB flushes, we would need to replicate
 all that importing information back to the exporting side.

Right. Or the exporting side could be passed tokens that it tracks itself,
rather than virtual addresses.

 
 Additionally, the hardware that does the TLB flushing is protected
 by a spinlock on each system image.  We would need to change that
 simple spinlock into a type of hardware lock that would work (on 3700)
 outside the processors coherence domain.  The only way to do that is to
 use uncached addresses with our Atomic Memory Operations which do the
 cmpxchg at the memory controller.  The uncached accesses are an order
 of magnitude or more slower.

I'm not sure if you're thinking about what I'm thinking of. With the
scheme I'm imagining, all you will need is some way to raise an IPI-like
interrupt on the target domain. The IPI target will have a driver to
handle the interrupt, which will determine the mm and virtual addresses
which are to be invalidated, and will then tear down those page tables
and issue hardware TLB flushes within its domain. On the Linux side,
I don't see why this can't be done.
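
As a very rough sketch of the receive side (the mailbox layout and the
xp_* helpers are assumptions about your messaging layer, not existing
interfaces):

#include <linux/interrupt.h>
#include <linux/mm_types.h>
#include <asm/tlbflush.h>

struct xp_invalidate_msg {
    unsigned long mm_token;     /* token agreed between partitions at import time */
    unsigned long start, end;   /* importer-side virtual range to drop */
};

/* assumed helpers: token -> mm lookup, a non-sleeping zap, and the ack */
extern struct mm_struct *xp_lookup_mm(unsigned long token);
extern void xp_zap_range_atomic(struct mm_struct *mm,
                                unsigned long start, unsigned long end);
extern void xp_ack_invalidate(struct xp_invalidate_msg *msg);

static irqreturn_t xp_invalidate_irq(int irq, void *dev_id)
{
    struct xp_invalidate_msg *msg = dev_id;             /* pre-agreed buffer */
    struct mm_struct *mm = xp_lookup_mm(msg->mm_token); /* no sleeping locks */

    if (mm) {
        /* tear down the ptes and flush the hardware TLBs locally,
         * all without sleeping */
        xp_zap_range_atomic(mm, msg->start, msg->end);
        flush_tlb_mm(mm);
    }
    xp_ack_invalidate(msg);     /* lets the sending side stop spinning */
    return IRQ_HANDLED;
}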

 
  So yes, I'd much rather rework such highly specialized system to fit in
  closer with Linux than rework Linux to fit with these machines (and
  apparently slow everyone else down).
 
 But it isn't that we are having a problem adapting to just the hardware.
 One of the limiting factors is Linux on the other partition.

In what way is the Linux limiting? 


   Additionally, the call to zap_page_range expects to have the mmap_sem
   held.  I suppose we could use something other than zap_page_range and
   atomically clear the process page tables.
  
  zap_page_range does not expect to have mmap_sem held. I think for anon
  pages it is always called with mmap_sem, however try_to_unmap_anon is
  not (although it expects page lock to be held, I think we should be able
  to avoid that).
 
 zap_page_range calls unmap_vmas, which walks vma->next.  Are you saying
 that can be walked without grabbing the mmap_sem at least for read?

Oh, I get that confused because of the mixed up naming conventions
there: unmap_page_range should actually be called zap_page_range. But
at any rate, yes we can easily zap pagetables without holding mmap_sem.


 I feel my understanding of list management and locking completely
 shifting.

FWIW, mmap_sem isn't held to protect vma->next there anyway, because at
that point the vmas are detached from the mm's rbtree and linked list.
But sure, in that particular path it is held for other reasons.

 
Doing that will not alleviate
   the need to sleep for the messaging to the other partitions.
  
  No, but I'd venture to guess that is not impossible to implement even
  on your current hardware (maybe a firmware update is needed)?
 
 Are you suggesting the sending side would not need to sleep or the
 receiving side?  Assuming you meant the sender, it spins waiting for the
 remote side to acknowledge the invalidate request?  We place the data
 into a previously agreed upon buffer and send an interrupt.  At this
 point, we would need to start spinning and waiting for completion.
 Let's assume we never run out of buffer space.

How would you run out of buffer space if it is synchronous?

 
 The receiving side receives an interrupt.  The interrupt currently wakes
 an XPC thread to do the work of transferring and delivering the message
 to XPMEM.  The transfer of the data which XPC does uses the BTE engine,
 which takes up to 28 seconds to timeout (a hardware timeout before raising
 an error), and the BTE code automatically does a retry for certain
 types of failure.  We currently need to grab semaphores which _MAY_
 be able to be reworked into other types of locks.

Sure, you obviously would need to rework your code because it's been
written with the assumption that it can sleep.

What is XPMEM exactly anyway? I'd assumed it is a Linux driver.

Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-15 Thread Robin Holt
We are pursuing Linus' suggestion currently.  This discussion is
completely unrelated to that work.

On Thu, May 15, 2008 at 09:57:47AM +0200, Nick Piggin wrote:
 I'm not sure if you're thinking about what I'm thinking of. With the
 scheme I'm imagining, all you will need is some way to raise an IPI-like
 interrupt on the target domain. The IPI target will have a driver to
 handle the interrupt, which will determine the mm and virtual addresses
 which are to be invalidated, and will then tear down those page tables
 and issue hardware TLB flushes within its domain. On the Linux side,
 I don't see why this can't be done.

We would need to deposit the payload into a central location to do the
invalidate, correct?  That central location would either need to be
indexed by physical cpuid (65536 possible currently, UV will push that
up much higher) or some sort of global id which is difficult because
remote partitions can reboot giving you a different view of the machine
and running partitions would need to be updated.  Alternatively, that
central location would need to be protected by a global lock or atomic
type operation, but a majority of the machine does not have coherent
access to other partitions so they would need to use uncached operations.
Essentially, take away from this paragraph that it is going to be really
slow or really large.

Then we need to deposit the information needed to do the invalidate.

Lastly, we would need to interrupt.  Unfortunately, here we have a
thundering herd.  There could be up to 16256 processors interrupting the
same processor.  That will be a lot of work.  It will need to look up the
mm (without grabbing any sleeping locks in either xpmem or the kernel)
and do the tlb invalidates.

Unfortunately, the sending side is not free to continue (in most cases)
until it knows that the invalidate is completed.  So it will need to spin
waiting for a completion signal, which could be as simple as an uncached
word.  But how will it handle the possible failure of the other partition?
How will it detect that failure and recover?  A timeout value could be
difficult to gauge because the other side may be off doing a considerable
amount of work and may just be backed up.

 Sure, you obviously would need to rework your code because it's been
 written with the assumption that it can sleep.

It is an assumption based upon some of the kernel functions we call
doing things like grabbing mutexes or rw_sems.  That pushes back to us.
I think the kernel's locking is perfectly reasonable.  The problem we run
into is that we are trying to get from one context in one kernel to a
different context in another, and the in-between piece needs to be sleepable.

 What is XPMEM exactly anyway? I'd assumed it is a Linux driver.

XPMEM allows one process to make a portion of its virtual address range
directly addressable by another process with the appropriate access.
The other process can be on other partitions.  As long as Numa-link
allows access to the memory, we can make it available.  Userland has an
advantage in that the kernel entrance/exit code contains memory errors,
so we can limit hardware failures (in most cases) to terminating a user
program rather than losing the partition.  The kernel enjoys no such fault
containment, so it cannot safely reference that memory directly.


Thanks,
Robin



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-15 Thread Avi Kivity
Robin Holt wrote:
 Then we need to deposit the information needed to do the invalidate.

 Lastly, we would need to interrupt.  Unfortunately, here we have a
 thundering herd.  There could be up to 16256 processors interrupting the
 same processor.  That will be a lot of work.  It will need to look up the
 mm (without grabbing any sleeping locks in either xpmem or the kernel)
 and do the tlb invalidates.

   

You don't need to interrupt every time.  Place your data in a queue (you 
do support rmw operations, right?) and interrupt.  Invalidates from 
other processors will see that the queue hasn't been processed yet and 
skip the interrupt.
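
For instance, roughly (the xp_* queue helpers here are assumptions, not
an existing API):

#include <asm/atomic.h>

struct xp_invalidate_msg;               /* whatever the queued payload is */

/* assumed cross-partition queue primitives (uncached rmw on the target) */
extern void xp_queue_append(struct xp_invalidate_msg *msg);
extern int xp_queue_pop_and_process(void);
extern void xp_send_irq(void);

static atomic_t invalidate_pending = ATOMIC_INIT(0);

/* Producer: append the request; interrupt only on the empty->non-empty
 * transition, later producers see pending work and skip the interrupt. */
static void queue_invalidate(struct xp_invalidate_msg *msg)
{
    xp_queue_append(msg);
    if (atomic_xchg(&invalidate_pending, 1) == 0)
        xp_send_irq();
}

/* Consumer, running on the target partition. */
static void drain_invalidates(void)
{
    atomic_set(&invalidate_pending, 0); /* new producers will interrupt again */
    while (xp_queue_pop_and_process())
        ;
}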

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-15 Thread Christoph Lameter
On Thu, 15 May 2008, Nick Piggin wrote:

 Oh, I get that confused because of the mixed up naming conventions
 there: unmap_page_range should actually be called zap_page_range. But
 at any rate, yes we can easily zap pagetables without holding mmap_sem.

How is that synchronized with code that walks the same pagetable. These 
walks may not hold mmap_sem either. I would expect that one could only 
remove a portion of the pagetable where we have some sort of guarantee 
that no accesses occur. So the removal of the vma prior ensures that?





Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-15 Thread Nick Piggin
On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote:
 On Thu, 15 May 2008, Nick Piggin wrote:
 
  Oh, I get that confused because of the mixed up naming conventions
  there: unmap_page_range should actually be called zap_page_range. But
  at any rate, yes we can easily zap pagetables without holding mmap_sem.
 
 How is that synchronized with code that walks the same pagetable. These 
 walks may not hold mmap_sem either. I would expect that one could only 
 remove a portion of the pagetable where we have some sort of guarantee 
 that no accesses occur. So the removal of the vma prior ensures that?
 
I don't really understand the question. If you remove the pte and invalidate
the TLBs on the remote image's process (importing the page), then it can
of course try to refault the page in because its vma is still there. But
you catch that refault in your driver, which can prevent the page from
being faulted back in.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-14 Thread Robin Holt
On Wed, May 14, 2008 at 06:11:22AM +0200, Nick Piggin wrote:
 On Tue, May 13, 2008 at 10:32:38AM -0500, Robin Holt wrote:
  On Tue, May 13, 2008 at 10:06:44PM +1000, Nick Piggin wrote:
   On Thursday 08 May 2008 10:38, Robin Holt wrote:
In order to invalidate the remote page table entries, we need to message
(uses XPC) to the remote side.  The remote side needs to acquire the
importing process's mmap_sem and call zap_page_range().  Between the
messaging and the acquiring a sleeping lock, I would argue this will
require sleeping locks in the path prior to the mmu_notifier 
invalidate_*
callouts().
   
   Why do you need to take mmap_sem in order to shoot down pagetables of
   the process? It would be nice if this can just be done without
   sleeping.
  
  We are trying to shoot down page tables of a different process running
  on a different instance of Linux running on Numa-link connected portions
  of the same machine.
 
 Right. You can zap page tables without sleeping, if you're careful. I
 don't know that we quite do that for anonymous pages at the moment, but it
 should be possible with a bit of thought, I believe.
 
  
  The messaging is clearly going to require sleeping.  Are you suggesting
  we need to rework XPC communications to not require sleeping?  I think
  that is going to be impossible since the transfer engine requires a
  sleeping context.
 
 I guess that you have found a way to perform TLB flushing within coherent
 domains over the numalink interconnect without sleeping. I'm sure it would
 be possible to send similar messages between non coherent domains.

I assume by coherent domains, you are actually talking about system
images.  Our memory coherence domain on the 3700 family is 512 processors
on 128 nodes.  On the 4700 family, it is 16,384 processors on 4096 nodes.
We extend a Read-Exclusive mode beyond the coherence domain so any
processor is able to read any cacheline on the system.  We also provide
uncached access for certain types of memory beyond the coherence domain.

For the other partitions, the exporting partition does not know at what
virtual address the imported pages are mapped.  The pages are frequently
mapped in a different order by the MPI library to help with MPI collective
operations.

For the exporting side to do those TLB flushes, we would need to replicate
all that importing information back to the exporting side.

Additionally, the hardware that does the TLB flushing is protected
by a spinlock on each system image.  We would need to change that
simple spinlock into a type of hardware lock that would work (on 3700)
outside the processor's coherence domain.  The only way to do that is to
use uncached addresses with our Atomic Memory Operations which do the
cmpxchg at the memory controller.  The uncached accesses are an order
of magnitude or more slower.

 So yes, I'd much rather rework such highly specialized system to fit in
 closer with Linux than rework Linux to fit with these machines (and
 apparently slow everyone else down).

But it isn't that we are having a problem adapting to just the hardware.
One of the limiting factors is Linux on the other partition.

  Additionally, the call to zap_page_range expects to have the mmap_sem
  held.  I suppose we could use something other than zap_page_range and
  atomically clear the process page tables.
 
 zap_page_range does not expect to have mmap_sem held. I think for anon
 pages it is always called with mmap_sem, however try_to_unmap_anon is
 not (although it expects page lock to be held, I think we should be able
 to avoid that).

zap_page_range calls unmap_vmas, which walks vma->next.  Are you saying
that can be walked without grabbing the mmap_sem at least for read?
I feel my understanding of list management and locking completely
shifting.

   Doing that will not alleviate
  the need to sleep for the messaging to the other partitions.
 
 No, but I'd venture to guess that is not impossible to implement even
 on your current hardware (maybe a firmware update is needed)?

Are you suggesting the sending side would not need to sleep or the
receiving side?  Assuming you meant the sender, it spins waiting for the
remote side to acknowledge the invalidate request?  We place the data
into a previously agreed upon buffer and send an interrupt.  At this
point, we would need to start spinning and waiting for completion.
Let's assume we never run out of buffer space.

The receiving side receives an interrupt.  The interrupt currently wakes
an XPC thread to do the work of transferring and delivering the message
to XPMEM.  The transfer of the data which XPC does uses the BTE engine,
which takes up to 28 seconds to timeout (a hardware timeout before raising
an error), and the BTE code automatically does a retry for certain
types of failure.  We currently need to grab semaphores which _MAY_
be able to be reworked into other types of locks.


Thanks,
Robin


Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-14 Thread Linus Torvalds


On Wed, 14 May 2008, Robin Holt wrote:
 
 Are you suggesting the sending side would not need to sleep or the
 receiving side?

One thing to realize is that most of the time (read: pretty much *always*) 
when we have the problem of wanting to sleep inside a spinlock, the 
solution is actually to just move the sleeping to outside the lock, and 
then have something else that serializes things.

That way, the core code (protected by the spinlock, and in all the hot 
paths) doesn't sleep, but the special case code (that wants to sleep) can 
have some other model of serialization that allows sleeping, and that 
includes as a small part the spinlocked region.
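
Ie. roughly this shape (all names purely illustrative):

#include <linux/mutex.h>
#include <linux/spinlock.h>

struct item;                             /* stand-in for whatever the lock protects */
extern void __update_item(struct item *it);       /* assumed short, non-sleeping update */
extern void wait_for_remote_ack(struct item *it); /* assumed slow, sleeping work */

static DEFINE_SPINLOCK(core_lock);      /* hot-path lock: never sleep under it */
static DEFINE_MUTEX(slow_serializer);   /* extra serialization for the rare case */

/* Hot path: short, non-sleeping critical section. */
static void hot_path_update(struct item *it)
{
    spin_lock(&core_lock);
    __update_item(it);
    spin_unlock(&core_lock);
}

/* Rare special case: sleep outside the spinlock, and only dip into the
 * spinlocked region for the short non-sleeping part. */
static void rare_sleeping_case(struct item *it)
{
    mutex_lock(&slow_serializer);
    wait_for_remote_ack(it);            /* sleeping is fine out here */
    spin_lock(&core_lock);
    __update_item(it);
    spin_unlock(&core_lock);
    mutex_unlock(&slow_serializer);
}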

I do not know how XPMEM actually works, or how you use it, but it 
seriously sounds like that is how things *should* work. And yes, that 
probably means that the mmu-notifiers as they are now are simply not 
workable: they'd need to be moved up so that they are inside the mmap 
semaphore but not the spinlocks.

Can it be done? I don't know. But I do know that I'm unlikely to accept a 
noticeable slowdown in some very core code for a case that affects about 
0.1% of the population. In other words, I think you *have* to do it.

Linus



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-14 Thread Robin Holt
On Wed, May 14, 2008 at 08:18:21AM -0700, Linus Torvalds wrote:


 On Wed, 14 May 2008, Robin Holt wrote:
 
  Are you suggesting the sending side would not need to sleep or the
  receiving side?

 One thing to realize is that most of the time (read: pretty much *always*)
 when we have the problem of wanting to sleep inside a spinlock, the
 solution is actually to just move the sleeping to outside the lock, and
 then have something else that serializes things.

 That way, the core code (protected by the spinlock, and in all the hot
 paths) doesn't sleep, but the special case code (that wants to sleep) can
 have some other model of serialization that allows sleeping, and that
 includes as a small part the spinlocked region.

 I do not know how XPMEM actually works, or how you use it, but it
 seriously sounds like that is how things *should* work. And yes, that
 probably means that the mmu-notifiers as they are now are simply not
 workable: they'd need to be moved up so that they are inside the mmap
 semaphore but not the spinlocks.

We are in the process of attempting this now.  Unfortunately for SGI,
Christoph is on vacation right now so we have been trying to work it
internally.

We are looking at two possible methods: in one, we add a callout to the
tlb flush paths for both the mmu_gather and flush_tlb_page locations;
in the other, we place a specific callout separate from the gather callouts
in the paths we are concerned with.  We will look at both more carefully
before posting.


In either implementation, not all call paths would require the stall
to ensure data integrity.  Would it be acceptable to always put a
sleepable stall in, even if the code path did not require the pages be
unwritable prior to continuing?  If we did that, I would be freed from
having a pool of invalidate threads ready for XPMEM to use for that work.
Maybe there is a better way, but the sleeping requirement we would have
on the threads makes most options seem unworkable.


Thanks,
Robin



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-14 Thread Linus Torvalds


On Wed, 14 May 2008, Robin Holt wrote:
 
 Would it be acceptable to always put a sleepable stall in even if the 
 code path did not require the pages be unwritable prior to continuing?  
 If we did that, I would be freed from having a pool of invalidate 
 threads ready for XPMEM to use for that work. Maybe there is a better 
 way, but the sleeping requirement we would have on the threads make most 
 options seem unworkable.

I'm not understanding the question. If you can do your management outside
of the spinlocks, then you can obviously do whatever you want, including
sleeping. It's changing the existing spinlocks to be sleepable that is not
acceptable, because it's such a performance problem.

Linus



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-14 Thread Christoph Lameter
On Wed, 14 May 2008, Linus Torvalds wrote:

 One thing to realize is that most of the time (read: pretty much *always*) 
 when we have the problem of wanting to sleep inside a spinlock, the 
 solution is actually to just move the sleeping to outside the lock, and 
 then have something else that serializes things.

The problem is that the code in rmap.c try_to_unmap() and friends loops
over reverse maps after taking a spinlock. The mm_struct is only known
after the rmap has been accessed. This means *inside* the spinlock.

That is why I tried to convert the locks used to scan the reverse maps to
semaphores. If that is done then one can indeed do the callouts outside of
atomic contexts.

 Can it be done? I don't know. But I do know that I'm unlikely to accept a 
 noticeable slowdown in some very core code for a case that affects about 
 0.1% of the population. In other words, I think you *have* to do it.

With larger numbers of processors, semaphores make a lot of sense since the
holdoff times on spinlocks will increase. If we go to sleep then the
processor can do something useful instead of hogging a cacheline.

A rw lock there can also increase concurrency during reclaim, especially if
the anon_vma chains are long and the number of address spaces mapping a page
is high.




Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-14 Thread Linus Torvalds


On Wed, 14 May 2008, Christoph Lameter wrote:
 
 The problem is that the code in rmap.c try_to_unmap() and friends loops
 over reverse maps after taking a spinlock. The mm_struct is only known
 after the rmap has been accessed. This means *inside* the spinlock.

So you queue them. That's what we do with things like the dirty bit. We 
need to hold various spinlocks to look up pages, but then we can't 
actually call the filesystem with the spinlock held.

Converting a spinlock to a waiting lock for things like that is simply not 
acceptable. You have to work with the system.

Yeah, there's only a single bit worth of information on whether a page is 
dirty or not, so queueing that information is trivial (it's just the 
return value from page_mkclean_file()). Some things are harder than 
others, and I suspect you need some kind of gather structure to queue up 
all the vma's that can be affected.

But it sounds like for the case of rmap, the approach of:

 - the page lock is the higher-level sleeping lock (which makes sense, 
   since this is very close to an IO event, and that is what the page lock 
   is generally used for)

   But hey, it could be anything else - maybe you have some other even 
   bigger lock to allow you to handle lots of pages in one go.

 - with that lock held, you do the whole rmap dance (which requires 
   spinlocks) and gather up the vma's and the struct mm's involved. 

 - outside the spinlocks you then do whatever it is you need to do.

This doesn't sound all that different from TLB shoot-down in SMP, and the 
mmu_gather structure. Now, admittedly we can do the TLB shoot-down while 
holding the spinlocks, but if we couldn't that's how we'd still do it: 
it would get more involved (because we'd need to guarantee that the gather 
can hold *all* the pages - right now we can just flush in the middle if we 
need to), but it wouldn't be all that fundamentally different.
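
So very roughly something like this (illustrative only - not the real
rmap code, and the callout at the end is a stand-in for whatever the
sleeping work is):

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/rmap.h>
#include <linux/spinlock.h>

#define MM_GATHER_MAX 64

struct mm_gather {
    int nr;
    struct mm_struct *mms[MM_GATHER_MAX];   /* bounded batch; flush if full */
};

/* stand-in for the sleepable per-mm work (e.g. messaging a remote partition) */
extern void notify_mm_sleepable(struct mm_struct *mm, struct page *page);

/* Caller holds the page lock (the higher-level sleeping lock). */
static void gather_and_notify(struct page *page, struct anon_vma *anon_vma)
{
    struct mm_gather gather = { .nr = 0 };
    struct vm_area_struct *vma;

    /* spinlocked part: just collect the mm's, no sleeping */
    spin_lock(&anon_vma->lock);
    list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
        if (gather.nr < MM_GATHER_MAX)
            gather.mms[gather.nr++] = vma->vm_mm;
    spin_unlock(&anon_vma->lock);

    /* outside the spinlock we can sleep as needed */
    while (gather.nr--)
        notify_mm_sleepable(gather.mms[gather.nr], page);
}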

And no, I really haven't even wanted to look at what XPMEM really needs to 
do, so maybe the above thing doesn't work for you, and you have other 
issues. I'm just pointing you in a general direction, not trying to say 
this is exactly how to get there. 

Linus



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-13 Thread Nick Piggin
On Thursday 08 May 2008 10:38, Robin Holt wrote:
 On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote:
  On Wed, 7 May 2008, Andrea Arcangeli wrote:
   I think the spinlock->rwsem conversion is ok under config option, as
   you can see I complained myself to various of those patches and I'll
   take care they're in a mergeable state the moment I submit them. What
   XPMEM requires are different semantics for the methods, and we never
   had to do any blocking I/O during vmtruncate before, now we have to.
 
  I really suspect we don't really have to, and that it would be better to
  just fix the code that does that.

 That fix is going to be fairly difficult.  I will argue impossible.

 First, a little background.  SGI allows one large numa-link connected
 machine to be broken into seperate single-system images which we call
 partitions.

 XPMEM allows, at its most extreme, one process on one partition to
 grant access to a portion of its virtual address range to processes on
 another partition.  Those processes can then fault pages and directly
 share the memory.

 In order to invalidate the remote page table entries, we need to message
 (uses XPC) to the remote side.  The remote side needs to acquire the
 importing process's mmap_sem and call zap_page_range().  Between the
 messaging and the acquiring a sleeping lock, I would argue this will
 require sleeping locks in the path prior to the mmu_notifier invalidate_*
 callouts().

Why do you need to take mmap_sem in order to shoot down pagetables of
the process? It would be nice if this can just be done without
sleeping.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-13 Thread Nick Piggin
On Thursday 08 May 2008 11:34, Andrea Arcangeli wrote:
 Sorry for not having completely answered to this. I initially thought
 stop_machine could work when you mentioned it, but I don't think it
 can even removing xpmem block-inside-mmu-notifier-method requirements.

 For stop_machine to solve this (besides being slower and potentially
 not more safe as running stop_machine in a loop isn't nice), we'd need
 to prevent preemption in between invalidate_range_start/end.

 I think there are two ways:

 1) add global lock around mm_lock to remove the sorting

 2) remove invalidate_range_start/end, nuke mm_lock as consequence of
it, and replace all three with invalidate_pages issued inside the
PT lock, one invalidation for each 512 pte_t modified, so
serialization against get_user_pages becomes trivial but this will
be not ok at all for SGI as it increases a lot their invalidation
frequency

This is what I suggested to begin with before this crazy locking was
developed to handle these corner cases... because I wanted the locking
to match with the tried and tested Linux core mm/ locking rather than
introducing this new idea.

I don't see why you're bending over so far backwards to accommodate
this GRU thing that we don't even have numbers for and could actually
potentially be batched up in other ways (eg. using mmu_gather or
mmu_gather-like idea).

The bare-essential, matches-with-Linux-mm mmu notifiers that I first
saw from you were pretty elegant and nice. The idea that only one
solution must go in and handle everything perfectly is stupid because
it is quite obvious that the sleeping invalidate idea is just an order
of magnitude or two more complex than the simple atomic invalidates
needed by you. We should and could easily have had that code upstream
long ago :(

I'm not saying we ignore the sleeping or batching cases, but we should
introduce the ideas slowly and carefully and assess the pros and cons
of each step along the way.




 For KVM both ways are almost the same.

 I'll implement 1 now then we'll see...




Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-13 Thread Robin Holt
On Tue, May 13, 2008 at 10:06:44PM +1000, Nick Piggin wrote:
 On Thursday 08 May 2008 10:38, Robin Holt wrote:
  In order to invalidate the remote page table entries, we need to message
  (uses XPC) to the remote side.  The remote side needs to acquire the
  importing process's mmap_sem and call zap_page_range().  Between the
  messaging and the acquiring a sleeping lock, I would argue this will
  require sleeping locks in the path prior to the mmu_notifier invalidate_*
  callouts().
 
 Why do you need to take mmap_sem in order to shoot down pagetables of
 the process? It would be nice if this can just be done without
 sleeping.

We are trying to shoot down page tables of a different process running
on a different instance of Linux running on Numa-link connected portions
of the same machine.

The messaging is clearly going to require sleeping.  Are you suggesting
we need to rework XPC communications to not require sleeping?  I think
that is going to be impossible since the transfer engine requires a
sleeping context.

Additionally, the call to zap_page_range expects to have the mmap_sem
held.  I suppose we could use something other than zap_page_range and
atomically clear the process page tables.  Doing that will not alleviate
the need to sleep for the messaging to the other partitions.

Thanks,
Robin



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-13 Thread Nick Piggin
On Tue, May 13, 2008 at 10:32:38AM -0500, Robin Holt wrote:
 On Tue, May 13, 2008 at 10:06:44PM +1000, Nick Piggin wrote:
  On Thursday 08 May 2008 10:38, Robin Holt wrote:
   In order to invalidate the remote page table entries, we need to message
   (uses XPC) to the remote side.  The remote side needs to acquire the
   importing process's mmap_sem and call zap_page_range().  Between the
   messaging and the acquiring a sleeping lock, I would argue this will
   require sleeping locks in the path prior to the mmu_notifier invalidate_*
   callouts().
  
  Why do you need to take mmap_sem in order to shoot down pagetables of
  the process? It would be nice if this can just be done without
  sleeping.
 
 We are trying to shoot down page tables of a different process running
 on a different instance of Linux running on Numa-link connected portions
 of the same machine.

Right. You can zap page tables without sleeping, if you're careful. I
don't know that we quite do that for anonymous pages at the moment, but it
should be possible with a bit of thought, I believe.

 
 The messaging is clearly going to require sleeping.  Are you suggesting
 we need to rework XPC communications to not require sleeping?  I think
 that is going to be impossible since the transfer engine requires a
 sleeping context.

I guess that you have found a way to perform TLB flushing within coherent
domains over the numalink interconnect without sleeping. I'm sure it would
be possible to send similar messages between non coherent domains.

So yes, I'd much rather rework such highly specialized system to fit in
closer with Linux than rework Linux to fit with these machines (and
apparently slow everyone else down).

 
 Additionally, the call to zap_page_range expects to have the mmap_sem
 held.  I suppose we could use something other than zap_page_range and
 atomically clear the process page tables.

zap_page_range does not expect to have mmap_sem held. I think for anon
pages it is always called with mmap_sem, however try_to_unmap_anon is
not (although it expects page lock to be held, I think we should be able
to avoid that).


  Doing that will not alleviate
 the need to sleep for the messaging to the other partitions.

No, but I'd venture to guess that is not impossible to implement even
on your current hardware (maybe a firmware update is needed)?




Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-13 Thread Benjamin Herrenschmidt

On Tue, 2008-05-13 at 22:14 +1000, Nick Piggin wrote:
 
 I don't see why you're bending over so far backwards to accommodate
 this GRU thing that we don't even have numbers for and could actually
 potentially be batched up in other ways (eg. using mmu_gather or
 mmu_gather-like idea).

I agree, we're better off generalizing the mmu_gather batching
instead...

I had some never-finished patches to use the mmu_gather for pretty much
everything except single page faults, tho various subtle differences
between archs and lack of time caused me to let them take the dust and
not finish them...

I can try to dig some of that out when I'm back from my current travel,
though it's probably worth re-doing from scratch now.

Ben.





Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-09 Thread Andrea Arcangeli
On Fri, May 09, 2008 at 08:37:29PM +0200, Peter Zijlstra wrote:
 Another possibility, would something like this work?
 
  
  /*
   * null out the begin function, no new begin calls can be made
   */
   rcu_assign_pointer(my_notifier.invalidate_start_begin, NULL); 
 
  /*
   * lock/unlock all rmap locks in any order - this ensures that any
   * pending start() will have its end() function called.
   */
  mm_barrier(mm);
 
  /*
   * now that no new start() call can be made and all start()/end() pairs
   * are complete we can remove the notifier.
   */
  mmu_notifier_remove(mm, my_notifier);
 
 
 This requires a mmu_notifier instance per attached mm and that
 __mmu_notifier_invalidate_range_start() uses rcu_dereference() to obtain
 the function.
 
 But I think its enough to ensure that:
 
   for each start an end will be called

We don't need that, it's perfectly ok if start is called but end is
not, it's ok to unregister in the middle as I guarantee ->release is
called before mmu_notifier_unregister returns (if ->release is needed
at all, not the case for KVM/GRU).

Unregister is already solved with srcu/rcu without any additional
complication as we don't need the guarantee that for each start an end
will be called.

 It can however happen that end is called without start - but we could
 handle that I think.

The only reason mm_lock() was introduced is to solve register, to
guarantee that for each end there was a start. We can't handle end
called without start in the driver.

The reason the driver must be prevented from registering in the middle of
start/end is that if it ever happens, the driver has no way to know it
must stop the secondary mmu page faults from calling get_user_pages and
instantiating sptes/secondary TLBs on pages that will be freed as soon as
zap_page_range starts.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-09 Thread Peter Zijlstra
On Fri, 2008-05-09 at 20:55 +0200, Andrea Arcangeli wrote:
 On Fri, May 09, 2008 at 08:37:29PM +0200, Peter Zijlstra wrote:
  Another possibility, would something like this work?
  
   
   /*
* null out the begin function, no new begin calls can be made
*/
    rcu_assign_pointer(my_notifier.invalidate_start_begin, NULL); 
  
   /*
* lock/unlock all rmap locks in any order - this ensures that any
* pending start() will have its end() function called.
*/
   mm_barrier(mm);
  
   /*
* now that no new start() call can be made and all start()/end() pairs
* are complete we can remove the notifier.
*/
   mmu_notifier_remove(mm, my_notifier);
  
  
  This requires a mmu_notifier instance per attached mm and that
  __mmu_notifier_invalidate_range_start() uses rcu_dereference() to obtain
  the function.
  
  But I think its enough to ensure that:
  
for each start an end will be called
 
 We don't need that, it's perfectly ok if start is called but end is
 not, it's ok to unregister in the middle as I guarantee ->release is
 called before mmu_notifier_unregister returns (if ->release is needed
 at all, not the case for KVM/GRU).
 
 Unregister is already solved with srcu/rcu without any additional
 complication as we don't need the guarantee that for each start an end
 will be called.
 
  It can however happen that end is called without start - but we could
  handle that I think.
 
 The only reason mm_lock() was introduced is to solve register, to
 guarantee that for each end there was a start. We can't handle end
 called without start in the driver.
 
 The reason the driver must be prevented to register in the middle of
 start/end, if that if it ever happens the driver has no way to know it
 must stop the secondary mmu page faults to call get_user_pages and
 instantiate sptes/secondarytlbs on pages that will be freed as soon as
 zap_page_range starts.

Right - then I got it backwards. Never mind me then..




Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-08 Thread Linus Torvalds


On Thu, 8 May 2008, Andrea Arcangeli wrote:
 
 Actually I looked both at the struct and at the slab alignment just in
 case it was changed recently. Now after reading your mail I also
 compiled it just in case.

Put the flag after the spinlock, not after the list_head.

Also, we'd need to make it 

unsigned short flag:1;

_and_ change spinlock_types.h to make the spinlock size actually match the 
required size (right now we make it an unsigned int slock even when we 
actually only use 16 bits). See the 

#if (NR_CPUS < 256)

code in asm-x86/spinlock.h.
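
In other words, roughly this layout (sketch only) on 32-bit with
NR_CPUS < 256:

struct anon_vma {
    spinlock_t lock;        /* 16-bit slock once spinlock_types.h is fixed */
    unsigned short flag:1;  /* fits in the space freed next to the lock */
    struct list_head head;  /* list of private "related" vmas */
};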

Linus



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-08 Thread Andrea Arcangeli
On Thu, May 08, 2008 at 09:11:33AM -0700, Linus Torvalds wrote:
 Btw, this is an issue only on 32-bit x86, because on 64-bit one we already 
 have the padding due to the alignment of the 64-bit pointers in the 
 list_head (so there's already empty space there).
 
 On 32-bit, the alignment of list_head is obviously just 32 bits, so right 
 now the structure is perfectly packed and doesn't have any empty space. 
 But that's just because the spinlock is unnecessarily big.
 
 (Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the 
 structure really will grow. That's a very odd configuration, though, and 
 not one I feel we really need to care about).

I see two ways to implement it:

1) use #ifdef and make it zero overhead for 64bit only without playing
any non obvious trick.

struct anon_vma {
   spinlock_t lock;
#ifdef CONFIG_MMU_NOTIFIER
   int global_mm_lock:1;
#endif

struct address_space {
   spinlock_t   private_lock;
#ifdef CONFIG_MMU_NOTIFIER
   int global_mm_lock:1;
#endif

2) add a:

#define AS_GLOBAL_MM_LOCK   (__GFP_BITS_SHIFT + 2)  /* global_mm_locked */

and use address_space->flags with bitops

And as Andrew pointed out to me by PM, for the anon_vma we can use the
LSB of the list.next/prev because the list can't be browsed when the
lock is taken, so taking the lock and then setting the bit and
clearing the bit before unlocking is safe. The LSB will always read 0
even if it's under list_add modification when the global spinlock isn't
taken. And after taking the anon_vma lock we can switch the LSB
from 0 to 1 without races, and the 1 will be protected by the
global spinlock.

The above solution is zero cost for 32bit too, so I prefer it.
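
Something like this, in other words (hypothetical helper names, not the
final patch):

#include <linux/list.h>
#include <linux/rmap.h>
#include <linux/spinlock.h>

/* The list can't be walked while the anon_vma lock is held by mm_lock(),
 * so the LSB of head.next can double as the "global mm lock held" bit.
 * Readers that don't hold the global lock always see the LSB as 0. */
static inline int anon_vma_global_locked(struct anon_vma *anon_vma)
{
    return (unsigned long)anon_vma->head.next & 1UL;
}

static inline void anon_vma_global_lock_one(struct anon_vma *anon_vma)
{
    spin_lock(&anon_vma->lock);
    anon_vma->head.next = (struct list_head *)
        ((unsigned long)anon_vma->head.next | 1UL);
}

static inline void anon_vma_global_unlock_one(struct anon_vma *anon_vma)
{
    anon_vma->head.next = (struct list_head *)
        ((unsigned long)anon_vma->head.next & ~1UL);
    spin_unlock(&anon_vma->lock);
}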

So I now agree with you this is a great idea on how to remove sort()
and vmalloc and especially vfree without increasing the VM footprint.

I'll send an update with this for review very shortly and I hope this
goes in so KVM will be able to swap and do many other things very well
starting in 2.6.26.

Thanks a lot,
Andrea



[kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1210115136 -7200
# Node ID 6b384bb988786aa78ef07440180e4b2948c4c6a2
# Parent  58f716ad4d067afb6bdd1b5f7042e19d854aae0d
anon-vma-rwsem

Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows the calling of sleeping functions from reverse map traversal as
needed for the notifier callbacks. It includes possible concurrency.

Rcu is used in some context to guarantee the presence of the anon_vma
(try_to_unmap) while we acquire the anon_vma lock. We cannot take a
semaphore within an rcu critical section. Add a refcount to the anon_vma
structure which allow us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma. I think this is a bug
because the anon_vma may become empty and get scheduled to be freed
but then we increase the refcount again when the migration entries are
removed.

The refcount in general allows a shortening of RCU critical sections since
we can do a rcu_unlock after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.

However:
- Atomic overhead increases in situations where a new reference
  to the anon_vma has to be established or removed. Overhead also increases
  when a speculative reference is used (try_to_unmap,
  page_mkclean, page migration).
- There is the potential for more frequent processor change due to up_xxx
  letting waiting tasks run first. This results in f.e. the Aim9 brk
  performance test to got down by 10-15%.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -25,7 +25,8 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-   spinlock_t lock;   /* Serialize access to vma list */
+   atomic_t refcount;  /* vmas on the list */
+   struct rw_semaphore sem;/* Serialize access to vma list */
struct list_head head;  /* List of private related vmas */
 };
 
@@ -43,18 +44,31 @@ static inline void anon_vma_free(struct 
kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+   atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+   if (atomic_dec_and_test(&anon_vma->refcount))
+   anon_vma_free(anon_vma);
+}
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 struct anon_vma *anon_vma = vma->anon_vma;
 if (anon_vma)
-   spin_lock(&anon_vma->lock);
+   down_write(&anon_vma->sem);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 struct anon_vma *anon_vma = vma->anon_vma;
 if (anon_vma)
-   spin_unlock(&anon_vma->lock);
+   up_write(&anon_vma->sem);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s
return;
 
/*
-* We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+* We hold either the mmap_sem lock or a reference on the
+* anon_vma. So no need to call page_lock_anon_vma.
 */
anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-   spin_lock(&anon_vma->lock);
+   down_read(&anon_vma->sem);

 list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
remove_migration_pte(vma, old, new);
 
-   spin_unlock(&anon_vma->lock);
+   up_read(&anon_vma->sem);
 }
 
 /*
@@ -630,7 +631,7 @@ static int unmap_and_move(new_page_t get
int rc = 0;
int *result = NULL;
struct page *newpage = get_new_page(page, private, result);
-   int rcu_locked = 0;
+   struct anon_vma *anon_vma = NULL;
int charge = 0;
 
if (!newpage)
@@ -654,16 +655,14 @@ static int unmap_and_move(new_page_t get
}
/*
 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
-* we cannot notice that anon_vma is freed while we migrates a page.
+* we cannot notice that anon_vma is freed while we migrate a page.
 * This rcu_read_lock() delays freeing anon_vma pointer until the end
 * of migration. File cache pages are no problem because of page_lock()
 * File Caches may use write_page() or lock_page() in 

Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Linus Torvalds


On Wed, 7 May 2008, Andrea Arcangeli wrote:
 
 Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
 traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
 allows the calling of sleeping functions from reverse map traversal as
 needed for the notifier callbacks. It includes possible concurrency.

This also looks very debatable indeed. The only performance numbers quoted 
are:

   This results in f.e. the Aim9 brk performance test to got down by 10-15%.

which just seems like a total disaster.

The whole series looks bad, in fact. Lack of authorship, bad single-line 
description, and the code itself sucks so badly that it's not even funny.

NAK NAK NAK. All of it. It stinks.

Linus



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 01:56:23PM -0700, Linus Torvalds wrote:
 This also looks very debatable indeed. The only performance numbers quoted 
 are:
 
This results in f.e. the Aim9 brk performance test to got down by 10-15%.
 
 which just seems like a total disaster.
 
 The whole series looks bad, in fact. Lack of authorship, bad single-line 

Glad you agree. Note that the fact the whole series looks bad is
_exactly_ why I couldn't let Christoph keep going with
mmu-notifier-core at the very end of his patchset. I had to move it at
the top to have a chance to get the KVM and GRU requirements merged
in 2.6.26.

I think the spinlock->rwsem conversion is ok under config option, as
you can see I complained myself to various of those patches and I'll
take care they're in a mergeable state the moment I submit them. What
XPMEM requires are different semantics for the methods, and we never
had to do any blocking I/O during vmtruncate before, now we have to.
And I don't see a problem in making the conversion from
spinlock->rwsem only if CONFIG_XPMEM=y as I doubt XPMEM works on
anything but ia64.

Please ignore all patches but mmu-notifier-core. I regularly forward
_only_ mmu-notifier-core to Andrew, that's the only one that is in
merge-ready status, everything else is just so XPMEM can test and we
can keep discussing it to bring it in a mergeable state like
mmu-notifier-core already is.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Linus Torvalds


On Wed, 7 May 2008, Andrea Arcangeli wrote:
 
 I think the spinlock->rwsem conversion is ok under config option, as
 you can see I complained myself to various of those patches and I'll
 take care they're in a mergeable state the moment I submit them. What
 XPMEM requires are different semantics for the methods, and we never
 had to do any blocking I/O during vmtruncate before, now we have to.

I really suspect we don't really have to, and that it would be better to 
just fix the code that does that.

 Please ignore all patches but mmu-notifier-core. I regularly forward
 _only_ mmu-notifier-core to Andrew, that's the only one that is in
 merge-ready status, everything else is just so XPMEM can test and we
 can keep discussing it to bring it in a mergeable state like
 mmu-notifier-core already is.

The thing is, I didn't like that one *either*. I thought it was the 
biggest turd in the series (and by biggest, I literally mean most lines 
of turd-ness rather than necessarily ugliest per se).

I literally think that mm_lock() is an unbelievable piece of utter and 
horrible CRAP.

There's simply no excuse for code like that.

If you want to avoid the deadlock from taking multiple locks in order, but 
there is really just a single operation that needs it, there's a really 
really simple solution.

And that solution is *not* to sort the whole damn f*cking list in a 
vmalloc'ed data structure prior to locking!

Damn.

No, the simple solution is to just make up a whole new upper-level lock, 
and get that lock *first*. You can then take all the multiple locks at a 
lower level in any order you damn well please. 

And yes, it's one more lock, and yes, it serializes stuff, but:

 - that code had better not be critical anyway, because if it was, then 
   the whole vmalloc+sort+lock+vunmap sh*t was wrong _anyway_

 - parallelism is overrated: it doesn't matter one effing _whit_ if 
   something is a hundred times more parallel, if it's also a hundred 
   times *SLOWER*.

So dang it, flush the whole damn series down the toilet and either forget 
the thing entirely, or re-do it sanely.

And here's an admission that I lied: it wasn't *all* clearly crap. I did 
like one part, namely list_del_init_rcu(), but that one should have been 
in a separate patch. I'll happily apply that one.

Linus



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote:
  had to do any blocking I/O during vmtruncate before, now we have to.
 
 I really suspect we don't really have to, and that it would be better to 
 just fix the code that does that.

I'll let you discuss that with Christoph and Robin. The moment I
heard the requirement to schedule inside ->invalidate_page() I reacted
the same way you did. But I don't see any other real solution for XPMEM
other than spin-looping, halting the scheduler for ages, while
the ack is received from the network device.

But mm_lock is required even without XPMEM. And srcu is also required
without XPMEM to allow ->release to schedule (however downgrading srcu
to rcu will result in a very small patch, srcu and rcu are about the
same with a kernel supporting preempt=y like 2.6.26).
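
[Illustrative aside, not part of the original mail: the reason srcu matters
here is that srcu read-side sections may sleep, so notifier methods called
under them are allowed to schedule. A minimal sketch of such a traversal
follows; the field names mmu_notifier_srcu and mmu_notifier_list are made up
for the example and do not match the real patchset.]

  struct mmu_notifier *mn;
  int idx;

  /* readers enter an srcu section, so ->invalidate_page may sleep */
  idx = srcu_read_lock(&mm->mmu_notifier_srcu);
  list_for_each_entry_rcu(mn, &mm->mmu_notifier_list, list)
          if (mn->ops->invalidate_page)
                  mn->ops->invalidate_page(mn, mm, address);
  srcu_read_unlock(&mm->mmu_notifier_srcu, idx);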

 I literally think that mm_lock() is an unbelievable piece of utter and 
 horrible CRAP.
 
 There's simply no excuse for code like that.

I think it's a great smp scalability optimization over the global lock
you're proposing below.

 No, the simple solution is to just make up a whole new upper-level lock, 
 and get that lock *first*. You can then take all the multiple locks at a 
 lower level in any order you damn well please. 

Unfortunately the lock you're talking about would be:

static spinlock_t global_lock = ...

There's no way to make it more granular.

So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock
we'd need to take that extremely wide spinlock first (and even worse,
later it would become a rwsem when XPMEM is selected making the VM
even slower than it already becomes when XPMEM support is selected at
compile time).

 And yes, it's one more lock, and yes, it serializes stuff, but:
 
  - that code had better not be critical anyway, because if it was, then 
the whole vmalloc+sort+lock+vunmap sh*t was wrong _anyway_

mmu_notifier_register can take ages. No problem.

  - parallelism is overrated: it doesn't matter one effing _whit_ if 
something is a hundred times more parallel, if it's also a hundred 
times *SLOWER*.

mmu_notifier_register is fine to be a hundred times slower (preempt-rt
will turn all locks into spinlocks so no problem).

 And here's an admission that I lied: it wasn't *all* clearly crap. I did 
 like one part, namely list_del_init_rcu(), but that one should have been 
 in a separate patch. I'll happily apply that one.

Sure, I'll split it from the rest if the mmu-notifier-core isn't merged.

My objective has been:

1) add zero overhead to the VM before anybody starts a VM with kvm and
   still zero overhead for all other tasks except the task where the
   VM runs.  The only exception is the unlikely(!mm->mmu_notifier_mm)
   check that is optimized away too when CONFIG_KVM=n (see the sketch
   after this list). And even for that check my invalidate_page reduces
   the number of branches to the absolute minimum possible.

2) avoid any new cacheline collision in the fast paths to allow numa
   systems not to nearly-crash (mm->mmu_notifier_mm will be shared and
   never written, except during the first mmu_notifier_register)

3) avoid any risk of introducing regressions in 2.6.26 (the patch must
   be obviously safe). Even if mm_lock were a bad idea like you
   say, it's orders of magnitude safer, even if entirely broken, than
   messing with the VM core locking in 2.6.26.
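
[Illustrative aside, not part of the original mail: a sketch of the shape of
the check described in point 1. mm_has_notifiers() and
__mmu_notifier_invalidate_page() are names chosen for the illustration, not
necessarily the literal patch.]

  static inline int mm_has_notifiers(struct mm_struct *mm)
  {
          /* one normally-false branch; mm->mmu_notifier_mm is written only
           * once, by the first mmu_notifier_register() */
          return unlikely(mm->mmu_notifier_mm != NULL);
  }

  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
                                                  unsigned long address)
  {
          if (mm_has_notifiers(mm))
                  __mmu_notifier_invalidate_page(mm, address);  /* slow path */
  }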

mm_lock (or whatever name you like to give it; I admit mm_lock may not
sound worrisome enough to stop people from getting the idea of calling
it in a fast path) is going to be the real deal for the long term to
allow mmu_notifier_register to serialize against
invalidate_page_start/end. If I fail in 2.6.26 I'll offer
maintainership to Christoph as promised, and you'll find him pushing
for mm_lock to be merged (as XPMEM/GRU aren't technologies running on
cellphones where your global wide spinlock is optimized away at
compile time, and he also has to deal with XPMEM where such a spinlock
would need to become a rwsem as the anon_vma->sem has to be taken
after it), but let's assume you're entirely right here that
mm_lock is going to be dropped and there's a better way: it's still a
fine solution for 2.6.26.

And if you prefer I can move the whole mm_lock() from mmap.c/mm.h to
mmu_notifier.[ch] so you don't get any pollution in the core VM, and
mm_lock will be invisible to everything but anybody calling
mmu_notifier_register() then and it will be trivial to remove later if
you really want to add a global spinlock as there's no way to be more
granular than a _global_ numa-wide spinlock taken before any
i_mmap_lock/anon_vma->lock, without my mm_lock.


Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrew Morton
On Thu, 8 May 2008 00:22:05 +0200
Andrea Arcangeli [EMAIL PROTECTED] wrote:

  No, the simple solution is to just make up a whole new upper-level lock, 
  and get that lock *first*. You can then take all the multiple locks at a 
  lower level in any order you damn well please. 
 
 Unfortunately the lock you're talking about would be:
 
 static spinlock_t global_lock = ...
 
 There's no way to make it more granular.
 
 So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock
 we'd need to take that extremely wide spinlock first (and even worse,
 later it would become a rwsem when XPMEM is selected making the VM
 even slower than it already becomes when XPMEM support is selected at
 compile time).

Nope.  We only need to take the global lock before taking *two or more* of
the per-vma locks.

I really wish I'd thought of that.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Jack Steiner
 And I don't see a problem in making the conversion from
 spinlock->rwsem only if CONFIG_XPMEM=y as I doubt XPMEM works on
 anything but ia64.
 
That is currently true but we are also working on XPMEM for x86_64.
The new XPMEM code should be posted within a few weeks.

--- jack



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 03:31:03PM -0700, Andrew Morton wrote:
 Nope.  We only need to take the global lock before taking *two or more* of
 the per-vma locks.
 
 I really wish I'd thought of that.

I don't see how you can avoid taking the system-wide-global lock
before every single anon_vma->lock/i_mmap_lock out there without
mm_lock.

Please note, we can't allow a thread to be in the middle of
zap_page_range while mmu_notifier_register runs.

vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more
than one lock, and we still have to take the global-system-wide lock
_before_ this single i_mmap_lock and no other lock at all.

Please elaborate, thanks!



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrew Morton
On Thu, 8 May 2008 00:44:06 +0200
Andrea Arcangeli [EMAIL PROTECTED] wrote:

 On Wed, May 07, 2008 at 03:31:03PM -0700, Andrew Morton wrote:
  Nope.  We only need to take the global lock before taking *two or more* of
  the per-vma locks.
  
  I really wish I'd thought of that.
 
 I don't see how you can avoid taking the system-wide-global lock
 before every single anon_vma->lock/i_mmap_lock out there without
 mm_lock.
 
 Please note, we can't allow a thread to be in the middle of
 zap_page_range while mmu_notifier_register runs.
 
 vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more
 than one lock and we've to still take the global-system-wide lock
 _before_ this single i_mmap_lock and no other lock at all.
 
 Please elaborate, thanks!


umm...


CPU0:                           CPU1:

spin_lock(a->lock);             spin_lock(b->lock);
spin_lock(b->lock);             spin_lock(a->lock);

bad.

CPU0:                           CPU1:

spin_lock(global_lock)          spin_lock(global_lock);
spin_lock(a->lock);             spin_lock(b->lock);
spin_lock(b->lock);             spin_lock(a->lock);

Is OK.


CPU0:                           CPU1:

spin_lock(global_lock)
spin_lock(a->lock);             spin_lock(b->lock);
spin_lock(b->lock);             spin_unlock(b->lock);
                                spin_lock(a->lock);
                                spin_unlock(a->lock);

also OK.

As long as all code paths which can take two-or-more locks are all covered
by the global lock there is no deadlock scenario.  If a thread takes just a
single instance of one of these locks without taking the global_lock then
there is also no deadlock.


Now, if we need to take both anon_vma->lock AND i_mmap_lock in the newly
added mm_lock() thing and we also take both those locks at the same time in
regular code, we're probably screwed.
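
[Illustrative aside, not part of the original mail: the same rule can be
demonstrated with a self-contained userspace analogue. The program below
mirrors the diagrams above with made-up lock names; both threads take a and b
in opposite orders, but because each takes global_lock first whenever it
needs more than one lock, the ABBA deadlock cannot happen. Build with
gcc -pthread.]

  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

  static void *cpu0(void *arg)
  {
          pthread_mutex_lock(&global_lock);  /* covers the multi-lock acquisition */
          pthread_mutex_lock(&a);
          pthread_mutex_lock(&b);
          pthread_mutex_unlock(&global_lock);  /* safe to drop once both are held */
          /* ... work with a and b held ... */
          pthread_mutex_unlock(&b);
          pthread_mutex_unlock(&a);
          return NULL;
  }

  static void *cpu1(void *arg)
  {
          pthread_mutex_lock(&global_lock);
          pthread_mutex_lock(&b);              /* opposite order is now harmless */
          pthread_mutex_lock(&a);
          pthread_mutex_unlock(&global_lock);
          pthread_mutex_unlock(&a);
          pthread_mutex_unlock(&b);
          return NULL;
  }

  int main(void)
  {
          pthread_t t0, t1;

          pthread_create(&t0, NULL, cpu0, NULL);
          pthread_create(&t1, NULL, cpu1, NULL);
          pthread_join(t0, NULL);
          pthread_join(t1, NULL);
          puts("no deadlock");
          return 0;
  }

A thread that only ever needs a single one of the inner locks can skip
global_lock entirely, exactly as in the paragraph above.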




Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 03:44:24PM -0700, Linus Torvalds wrote:
 
 
 On Thu, 8 May 2008, Andrea Arcangeli wrote:
  
  Unfortunately the lock you're talking about would be:
  
  static spinlock_t global_lock = ...
  
  There's no way to make it more granular.
 
 Right. So what? 
 
 It's still about a million times faster than what the code does now.

mmu_notifier_register only runs when windows or linux or macosx
boots. Who could ever care about the msec spent in mm_lock compared to
the time it takes linux to boot?

What you're proposing is to slow down AIM and certain benchmarks 20% or
more for all users, just to save at most 1msec when starting a VM.

 Rewrite the code, or not. I don't care. I'll very happily not merge crap 
 for the rest of my life.

If you want the global lock I'll do it no problem, I just think it's
an obviously inferior solution for 99% of users out there (including kvm
users that will also have to take that lock while kvm userland runs).

In my view the most we should do in this area is to reduce further the
max number of locks to take if max_map_count already isn't enough.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
To remove mm_lock without adding a horrible system-wide lock before
every i_mmap_lock etc., we have to remove
invalidate_range_begin/end. Then we can return to an older approach of
doing only invalidate_page and serializing it with the PT lock against
get_user_pages. That works fine for KVM, but GRU will have to flush the
tlb every time we drop the PT lock, that means once per each 512
ptes on x86-64, instead of a single time for the whole range
regardless of how large the range is.
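
[Illustrative aside, not part of the original mail: a rough sketch of the two
invalidation models being contrasted. zap_ptes_in_pmd() and
mmu_notifier_invalidate_pages() are invented names for the illustration, and
ptl stands for the page-table lock of the pte page being zapped.]

  /* Range model: one start/end pair brackets the whole zap, so a driver
   * like GRU flushes its secondary TLB once per call. */
  mmu_notifier_invalidate_range_start(mm, start, end);
  zap_page_range(vma, start, end - start, NULL);
  mmu_notifier_invalidate_range_end(mm, start, end);

  /* PT-lock model: no start/end; one callout per pte page (512 ptes on
   * x86-64), issued while the page-table lock is held, so the secondary
   * mmu fault path serializes against it through that same lock. */
  for (addr = start; addr < end; addr = next) {
          next = pmd_addr_end(addr, end);
          spin_lock(ptl);
          zap_ptes_in_pmd(mm, pmd, addr, next);           /* invented helper */
          mmu_notifier_invalidate_pages(mm, addr, next);  /* invented hook */
          spin_unlock(ptl);
  }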



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Linus Torvalds


On Thu, 8 May 2008, Andrea Arcangeli wrote:

 mmu_notifier_register only runs when windows or linux or macosx
 boots. Who could ever care of the msec spent in mm_lock compared to
 the time it takes to linux to boot?

Andrea, you're *this* close to going to my list of people who it is not 
worth reading email from, and where it's better for everybody involved if 
I just teach my spam-filter about it.

That code was CRAP.

That code was crap whether it's used once, or whether it's used a million 
times. Stop making excuses for it just because it's not performance- 
critical.

So give it up already. I told you what the non-crap solution was. It's 
simpler, faster, and is about two lines of code compared to the crappy 
version (which was what - 200 lines of crap with a big comment on top of 
it just to explain the idiocy?).

So until you can understand the better solution, don't even bother 
emailing me, ok? Because the next email I get from you that shows the 
intelligence level of a gnat, I'll just give up and put you in a 
spam-filter.

Because my IQ goes down just from reading your mails. I can't afford to 
continue.

Linus



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Benjamin Herrenschmidt

On Thu, 2008-05-08 at 00:44 +0200, Andrea Arcangeli wrote:
 
 Please note, we can't allow a thread to be in the middle of
 zap_page_range while mmu_notifier_register runs.

You said yourself that mmu_notifier_register can be as slow as you
want ... what about using stop_machine for it? I'm not even joking
here :-)

 vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more
 than one lock and we've to still take the global-system-wide lock
 _before_ this single i_mmap_lock and no other lock at all.

Ben.





Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Christoph Lameter
On Wed, 7 May 2008, Linus Torvalds wrote:

 The code that can take many locks, will have to get the global lock *and* 
 order the types, but that's still trivial. It's something like
 
 	spin_lock(&global_lock);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (vma->anon_vma)
 			spin_lock(&vma->anon_vma->lock);
 	}
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (!vma->anon_vma && vma->vm_file && vma->vm_file->f_mapping)
 			spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
 	}
 	spin_unlock(&global_lock);

Multiple vmas may share the same mapping or refer to the same anonymous 
vma. The above code will deadlock since we may take some locks multiple 
times.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
Hi Andrew,

On Wed, May 07, 2008 at 03:59:14PM -0700, Andrew Morton wrote:
   CPU0:                                CPU1:

   spin_lock(global_lock)
   spin_lock(a->lock);                  spin_lock(b->lock);
                                         <== mmu_notifier_register()
   spin_lock(b->lock);                  spin_unlock(b->lock);
                                        spin_lock(a->lock);
                                        spin_unlock(a->lock);

 also OK.

But the problem is that we have to stop the critical section at the
place I marked with "<==" while mmu_notifier_register
runs. Otherwise the driver calling mmu_notifier_register won't know if
it's safe to start establishing secondary sptes/tlbs. If the driver
will establish sptes/tlbs with get_user_pages/follow_page the page
could be freed immediately later when zap_page_range starts.

So if CPU1 doesn't take the global_lock before proceeding in
zap_page_range (the i_mmap_lock inside vmtruncate, represented as
b->lock above) we're in trouble.

What we can do is to replace the mm_lock with a
spin_lock(&global_lock) only if all places that take i_mmap_lock
take the global lock first, and that hurts scalability of the fast
paths that are performance critical like vmtruncate and
anon_vma->lock. Perhaps they're not so performance critical, but
surely much more performance critical than mmu_notifier_register ;).

The idea of polluting various scalable paths like the truncate()
syscall in the VM with a global spinlock frightens me; I'd rather
return to invalidate_page() inside the PT lock, removing both
invalidate_range_start/end. Then all serialization against the mmu
notifiers will be provided by the PT lock that the secondary mmu page
fault also has to take in get_user_pages (or follow_page). In any case
that is a better solution that won't slow down the VM when
MMU_NOTIFIER=y, even if it's a bit slower for GRU; for KVM performance
is about the same with or without invalidate_range_start/end. I didn't
think anybody could care about how long mmu_notifier_register takes
until it returns, compared to all the heavyweight operations that
happen when starting a VM (not only in the kernel but in the guest too).

In fact, if it's security that we worry about here, we can put a cap
on the _time_ that mmu_notifier_register can take before it fails, and
we fail to start a VM if it takes more than 5sec; that's still fine as
the failure could happen for other reasons too, like a vmalloc
shortage, and we already handle it just fine. This 5sec delay can't
possibly happen in practice anyway in the only interesting scenario,
just like the vmalloc shortage. This is obviously a superior solution
to polluting the VM with a useless global spinlock that will destroy
truncate/AIM on numa.

Anyway Christoph, I uploaded my last version here:


http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v16

(applies and runs fine on 26-rc1)

You're more than welcome to takeover from it, I kind of feel my time
now may be better spent to emulate the mmu-notifier-core with kprobes.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Thu, May 08, 2008 at 09:28:38AM +1000, Benjamin Herrenschmidt wrote:
 
 On Thu, 2008-05-08 at 00:44 +0200, Andrea Arcangeli wrote:
  
  Please note, we can't allow a thread to be in the middle of
  zap_page_range while mmu_notifier_register runs.
 
 You said yourself that mmu_notifier_register can be as slow as you
 want ... what about you use stop_machine for it ? I'm not even joking
 here :-)

We can put a cap on the time plus a cap on the number of vmas. It's
not important if it fails; in the only useful case we know of, it
won't be slow at all. The failure can happen because the time cap or
the vma cap triggers, or because there's a vmalloc shortage. We handle
the failure in userland of course. There are zillions of allocations
needed anyway, any one of them can fail, so this isn't a new fail
path, it's the same fail path that always existed before mmu_notifiers
existed.

I can't possibly see how adding a new global wide lock that forces all
truncates to be serialized against each other, practically eliminating
the need for the i_mmap_lock, could be superior to an approach that
doesn't cause any overhead to the VM at all, and only requires kvm to
pay an additional cost at startup.

Furthermore, the only reason I had to implement mm_lock was to fix the
invalidate_range_start/end model. If we go with only invalidate_page
and invalidate_pages called inside the PT lock, and we use the PT lock
to serialize, we don't need a mm_lock anymore and no new lock from the
VM either. I tried to push for that, but everyone else wanted
invalidate_range_start/end. I only did the only possible thing to do:
make invalidate_range_start safe, to make everyone happy without
slowing down the VM.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Linus Torvalds


On Wed, 7 May 2008, Christoph Lameter wrote:
 
 Multiple vmas may share the same mapping or refer to the same anonymous 
 vma. The above code will deadlock since we may take some locks multiple 
 times.

Ok, so that actually _is_ a problem. It would be easy enough to also add 
just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing 
a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's 
suggestion of just using stop-machine is actually the right one just 
because it's _so_ simple.

(That said, we're not running out of vm flags yet, and if we were, we 
could just add another word. We're already wasting that space right now on 
64-bit by calling it unsigned long).

Linus



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Robin Holt
On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote:
 On Wed, 7 May 2008, Andrea Arcangeli wrote:
  
  I think the spinlock-rwsem conversion is ok under config option, as
  you can see I complained myself to various of those patches and I'll
  take care they're in a mergeable state the moment I submit them. What
  XPMEM requires are different semantics for the methods, and we never
  had to do any blocking I/O during vmtruncate before, now we have to.
 
 I really suspect we don't really have to, and that it would be better to 
 just fix the code that does that.

That fix is going to be fairly difficult.  I will argue impossible.

First, a little background.  SGI allows one large numa-link connected
machine to be broken into separate single-system images which we call
partitions.

XPMEM allows, at its most extreme, one process on one partition to
grant access to a portion of its virtual address range to processes on
another partition.  Those processes can then fault pages and directly
share the memory.

In order to invalidate the remote page table entries, we need to message
(uses XPC) to the remote side.  The remote side needs to acquire the
importing process's mmap_sem and call zap_page_range().  Between the
messaging and the acquiring a sleeping lock, I would argue this will
require sleeping locks in the path prior to the mmu_notifier invalidate_*
callouts().
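
[Illustrative aside, not part of the original mail: a very rough sketch of
the flow Robin describes. struct xpmem_segment, xpc_send_invalidate() and
xpc_wait_for_acks() are made-up names, not the real XPMEM/XPC API; only
mmap_sem and zap_page_range() are real kernel symbols.]

  static void xpmem_invalidate_range(struct xpmem_segment *seg,
                                     unsigned long start, unsigned long end)
  {
          /* 1. Message every partition importing this segment (may sleep). */
          xpc_send_invalidate(seg, start, end);

          /* 2. Each importing partition, in its handler, roughly does:
           *        down_read(&importer_mm->mmap_sem);     (a sleeping lock)
           *        zap_page_range(importer_vma, start, end - start, NULL);
           *        up_read(&importer_mm->mmap_sem);
           */

          /* 3. Wait for every importer to ack before returning, because the
           *    mmu_notifier callout that triggered this must not complete
           *    while remote ptes may still be live. */
          xpc_wait_for_acks(seg);
  }

This is why Robin argues the callouts have to be able to sleep: steps 1 and 3
block on the interconnect, and step 2 takes a sleeping lock on the remote side.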

On a side note, we currently have XPMEM working on x86_64 SSI, and
ia64 cross-partition.  We are in the process of getting XPMEM working
on x86_64 cross-partition in support of UV.

Thanks,
Robin Holt



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Robin Holt
On Wed, May 07, 2008 at 05:03:30PM -0700, Linus Torvalds wrote:
 
 
 On Wed, 7 May 2008, Christoph Lameter wrote:
  
  Multiple vmas may share the same mapping or refer to the same anonymous 
  vma. The above code will deadlock since we may take some locks multiple 
  times.
 
 Ok, so that actually _is_ a problem. It would be easy enough to also add 
 just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing 
 a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's 
 suggestion of just using stop-machine is actually the right one just 
 because it's _so_ simple.

Also, stop-machine will not work if we come to the conclusion that
i_mmap_lock and anon_vma-lock need to be sleepable locks.

Thanks,
Robin Holt



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Christoph Lameter
On Wed, 7 May 2008, Linus Torvalds wrote:

 On Wed, 7 May 2008, Christoph Lameter wrote:
  
  Multiple vmas may share the same mapping or refer to the same anonymous 
  vma. The above code will deadlock since we may take some locks multiple 
  times.
 
 Ok, so that actually _is_ a problem. It would be easy enough to also add 
 just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing 
 a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's 
 suggestion of just using stop-machine is actually the right one just 
 because it's _so_ simple.

Set the vma flag when we locked it and then skip when we find it locked 
right? This would be in addition to the global lock?

stop-machine would work for KVM since it's a once-in-a-guest-OS-lifetime kind
of thing. But GRU, KVM and eventually Infiniband need the ability to attach 
in a reasonable timeframe without causing major hiccups for other 
processes.

 (That said, we're not running out of vm flags yet, and if we were, we 
 could just add another word. We're already wasting that space right now on 
 64-bit by calling it unsigned long).

We sure have enough flags.




Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Linus Torvalds


On Wed, 7 May 2008, Robin Holt wrote:
 
 In order to invalidate the remote page table entries, we need to message
 (uses XPC) to the remote side.  The remote side needs to acquire the
 importing process's mmap_sem and call zap_page_range().  Between the
 messaging and the acquiring a sleeping lock, I would argue this will
 require sleeping locks in the path prior to the mmu_notifier invalidate_*
 callouts().

You simply will *have* to do it without locally holding all the MM 
spinlocks. Because quite frankly, slowing down all the normal VM stuff for 
some really esoteric hardware simply isn't acceptable. We just don't do 
it.

So what is it that actually triggers one of these events?

The most obvious solution is to just queue the affected pages while 
holding the spinlocks (perhaps locking them locally), and then handling 
all the stuff that can block after releasing things. That's how we 
normally do these things, and it works beautifully, without making 
everything slower.
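
[Illustrative aside, not part of the original mail: a sketch of the "queue
while holding the spinlocks, handle the blocking part afterwards" pattern
Linus describes. struct deferred_invalidate and send_remote_invalidate() are
made-up names; the caller is assumed to own a LIST_HEAD-initialized queue.]

  struct deferred_invalidate {
          struct list_head list;
          unsigned long start, end;
  };

  /* Called with the VM spinlocks held: just record what to invalidate.
   * A real implementation would need a fallback on allocation failure. */
  static void queue_invalidate(struct list_head *q, unsigned long start,
                               unsigned long end)
  {
          struct deferred_invalidate *d = kmalloc(sizeof(*d), GFP_ATOMIC);

          if (d) {
                  d->start = start;
                  d->end = end;
                  list_add_tail(&d->list, q);
          }
  }

  /* Called after the spinlocks are dropped: do the blocking messaging. */
  static void flush_invalidate_queue(struct list_head *q)
  {
          struct deferred_invalidate *d, *next;

          list_for_each_entry_safe(d, next, q, list) {
                  send_remote_invalidate(d->start, d->end);  /* may sleep */
                  list_del(&d->list);
                  kfree(d);
          }
  }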

Sometimes we go to extremes, and actually break the locks and restart 
(ugh), and it gets ugly, but even that tends to be preferable to using the 
wrong locking.

The thing is, spinlocks really kick ass. Yes, they restrict what you can 
do within them, but if 99.99% of all work is non-blocking, then the really 
odd rare blocking case is the one that needs to accommodate, not the rest.

Linus



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Linus Torvalds


On Thu, 8 May 2008, Andrea Arcangeli wrote:

 Hi Andrew,
 
 On Wed, May 07, 2008 at 03:59:14PM -0700, Andrew Morton wrote:
  CPU0:                                 CPU1:

  spin_lock(global_lock)
  spin_lock(a->lock);                   spin_lock(b->lock);
                                         <== mmu_notifier_register()

If mmu_notifier_register() takes the global lock, it cannot happen here. 
It will be blocked (by CPU0), so there's no way it can then cause an ABBA 
deadlock. It will be released when CPU0 has taken *all* the locks it 
needed to take.

 What we can do is to replace the mm_lock with a
 spin_lock(global_lock) only if all places that takes i_mmap_lock

NO!

You replace mm_lock() with the sequence that Andrew gave you (and I 
described):

spin_lock(global_lock)
.. get all locks UNORDERED ..
spin_unlock(global_lock)

and you're now done. You have your mm_lock() (which still needs to be 
renamed - it should be a mmu_notifier_lock() or something like that), 
but you don't need the insane sorting. At most you apparently need a way 
to recognize duplicates (so that you don't deadlock on yourself), which 
looks like a simple bit-per-vma.

The global lock doesn't protect any data structures itself - it just 
protects two of these mm_lock() functions from ABBA'ing on each other!

Linus




Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Linus Torvalds


On Wed, 7 May 2008, Christoph Lameter wrote:
 
 Set the vma flag when we locked it and then skip when we find it locked 
 right? This would be in addition to the global lock?

Yes. And clear it before unlocking (and again, testing if it's already 
clear - you mustn't unlock twice, so you must only unlock when the bit 
was set).

You also (obviously) need to have something that guarantees that the lists 
themselves are stable over the whole sequence, but I assume you already 
have mmap_sem for reading (since you'd need it anyway just to follow the 
list).

And if you have it for writing, it can obviously *act* as the global lock, 
since it would already guarantee mutual exclusion on that mm->mmap list.

Linus



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Christoph Lameter
On Wed, 7 May 2008, Linus Torvalds wrote:

 and you're now done. You have your mm_lock() (which still needs to be 
 renamed - it should be a mmu_notifier_lock() or something like that), 
 but you don't need the insane sorting. At most you apparently need a way 
 to recognize duplicates (so that you don't deadlock on yourself), which 
 looks like a simple bit-per-vma.

Andrea's mm_lock could have wider impact. It is the first effective 
way that I have seen of temporarily holding off reclaim from an address 
space. It sure is a brute force approach.




Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 06:02:49PM -0700, Linus Torvalds wrote:
 You replace mm_lock() with the sequence that Andrew gave you (and I 
 described):
 
   spin_lock(global_lock)
   .. get all locks UNORDERED ..
   spin_unlock(global_lock)
 
 and you're now done. You have your mm_lock() (which still needs to be 
 renamed - it should be a mmu_notifier_lock() or something like that), 
 but you don't need the insane sorting. At most you apparently need a way 
 to recognize duplicates (so that you don't deadlock on yourself), which 
 looks like a simple bit-per-vma.
 
 The global lock doesn't protect any data structures itself - it just 
 protects two of these mm_lock() functions from ABBA'ing on each other!

I thought the thing to remove was the "get all locks" part. I didn't
realize the major problem was only the sorting of the array.

I'll add the global lock, it's worth it as it drops the worst case
number of steps by log(65536) times. Furthermore two concurrent
mmu_notifier_lock calls will surely run faster as it'll decrease the
cacheline collisions. Since you ask to call it mmu_notifier_lock I'll
also move it to mmu_notifier.[ch] as a consequence.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Linus Torvalds


On Wed, 7 May 2008, Christoph Lameter wrote:
 On Wed, 7 May 2008, Linus Torvalds wrote:
  and you're now done. You have your mm_lock() (which still needs to be 
  renamed - it should be a mmu_notifier_lock() or something like that), 
  but you don't need the insane sorting. At most you apparently need a way 
  to recognize duplicates (so that you don't deadlock on yourself), which 
  looks like a simple bit-per-vma.
 
 Andrea's mm_lock could have wider impact. It is the first effective 
 way that I have seen of temporarily holding off reclaim from an address 
 space. It sure is a brute force approach.

Well, I don't think the naming necessarily has to be about notifiers, but 
it should be at least a *bit* more scary than mm_lock(), to make it 
clear that it's pretty dang expensive. 

Even without the vmalloc and sorting, if it would be used by normal 
things it would still be very expensive for some cases - running thngs 
like ElectricFence, for example, will easily generate thousands and 
thousands of vma's in a process. 

Linus



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
Sorry for not having completely answered this. I initially thought
stop_machine could work when you mentioned it, but I don't think it
can, even after removing xpmem's block-inside-mmu-notifier-method
requirement.

For stop_machine to solve this (besides being slower and potentially
no safer, as running stop_machine in a loop isn't nice), we'd need
to prevent preemption in between invalidate_range_start/end.

I think there are two ways:

1) add a global lock around mm_lock to remove the sorting

2) remove invalidate_range_start/end, nuke mm_lock as a consequence of
   it, and replace all three with invalidate_pages issued inside the
   PT lock, one invalidation for each 512 pte_t modified, so
   serialization against get_user_pages becomes trivial, but this will
   not be ok at all for SGI as it increases their invalidation
   frequency a lot

For KVM both ways are almost the same.

I'll implement 1 now then we'll see...



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Linus Torvalds


On Wed, 7 May 2008, Christoph Lameter wrote:
 
  (That said, we're not running out of vm flags yet, and if we were, we 
  could just add another word. We're already wasting that space right now on 
  64-bit by calling it unsigned long).
 
 We sure have enough flags.

Oh, btw, I was wrong - we wouldn't want to mark the vma's (they are 
unique), we need to mark the address spaces/anonvma's. So the flag would 
need to be in the struct anon_vma (and struct address_space), not in the 
vma itself. My bad. So the flag wouldn't be one of the VM_xyzzy flags, and 
would require adding a new field to struct anon_vma()

And related to that brain-fart of mine, that obviously also means that 
yes, the locking has to be stronger than mm->mmap_sem held for writing, 
so yeah, it would have be a separate global spinlock (or perhaps a 
blocking lock if you have some reason to protect anything else with this 
too).

Linus
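
[Illustrative aside, not part of the original mail and not the code that was
eventually merged: a sketch stitched together from the suggestions in this
sub-thread. A global mutex keeps two mm_lock() callers from ABBA'ing, and a
flag in each anon_vma (which could live in the padding discussed later in the
thread) keeps a lock shared by several vmas from being taken twice.
locked_by_mm_lock is a made-up field name, and the unlock side would have to
clear it again before releasing each lock.]

  static DEFINE_MUTEX(mm_lock_mutex);        /* the upper-level lock */

  static void mm_lock(struct mm_struct *mm)
  {
          struct vm_area_struct *vma;

          mutex_lock(&mm_lock_mutex);
          for (vma = mm->mmap; vma; vma = vma->vm_next) {
                  struct anon_vma *anon_vma = vma->anon_vma;

                  if (anon_vma && !anon_vma->locked_by_mm_lock) {
                          spin_lock(&anon_vma->lock);
                          anon_vma->locked_by_mm_lock = 1;
                  }
          }
          /* ... second walk doing the same for
           *     vma->vm_file->f_mapping->i_mmap_lock ... */
          mutex_unlock(&mm_lock_mutex);
  }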



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 06:39:48PM -0700, Linus Torvalds wrote:
 
 
 On Wed, 7 May 2008, Christoph Lameter wrote:
  
   (That said, we're not running out of vm flags yet, and if we were, we 
   could just add another word. We're already wasting that space right now 
   on 
   64-bit by calling it unsigned long).
  
  We sure have enough flags.
 
 Oh, btw, I was wrong - we wouldn't want to mark the vma's (they are 
 unique), we need to mark the address spaces/anonvma's. So the flag would 
 need to be in the struct anon_vma (and struct address_space), not in the 
 vma itself. My bad. So the flag wouldn't be one of the VM_xyzzy flags, and 
 would require adding a new field to struct anon_vma()
 
 And related to that brain-fart of mine, that obviously also means that 
 yes, the locking has to be stronger than mm->mmap_sem held for writing, 
 so yeah, it would have be a separate global spinlock (or perhaps a 
 blocking lock if you have some reason to protect anything else with this 

So because the bitflag can't prevent taking the same lock twice on two
different vmas in the same mm, we still can't remove the sorting, and
the global lock won't buy much other than reducing the collisions. I
can add that though.

I think it's more interesting to put a cap on the number of vmas to
min(1024,max_map_count). The sort time on an 8k array runs in constant
time. kvm runs with 127 vmas allocated...
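
[Illustrative aside, not part of the original mail: a rough sketch of the
sort-based variant under discussion, including the signal_pending back-off
mentioned later in the thread. mm_unlock_partial() is a made-up helper; the
array of spinlock pointers is assumed to have been collected from the vma
list beforehand.]

  #include <linux/sort.h>

  static int ptr_cmp(const void *a, const void *b)
  {
          unsigned long pa = (unsigned long)*(spinlock_t * const *)a;
          unsigned long pb = (unsigned long)*(spinlock_t * const *)b;

          if (pa == pb)
                  return 0;
          return pa < pb ? -1 : 1;
  }

  static int lock_sorted(spinlock_t **locks, int nlocks)
  {
          int i;

          sort(locks, nlocks, sizeof(*locks), ptr_cmp, NULL);
          for (i = 0; i < nlocks; i++) {
                  if (signal_pending(current)) {
                          mm_unlock_partial(locks, i);    /* drop what we took */
                          return -EINTR;
                  }
                  if (i && locks[i] == locks[i - 1])      /* duplicate: already held */
                          continue;
                  spin_lock(locks[i]);
          }
          return 0;
  }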



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Linus Torvalds


On Thu, 8 May 2008, Andrea Arcangeli wrote:
 
 So because the bitflag can't prevent taking the same lock twice on two
 different vmas in the same mm, we still can't remove the sorting

Andrea. 

Take five minutes. Take a deep breadth. And *think* about actually reading 
what I wrote.

The bitflag *can* prevent taking the same lock twice. It just needs to be 
in the right place.

Let me quote it for you:

 So the flag wouldn't be one of the VM_xyzzy flags, and would require 
 adding a new field to struct anon_vma()

IOW, just make it be in that anon_vma (and the address_space). No sorting 
required.

 I think it's more interesting to put a cap on the number of vmas to
 min(1024,max_map_count). The sort time on an 8k array runs in constant
 time.

Shut up already. It's not constant time just because you can cap the 
overhead. We're not in a university, and we care about performance, not 
your made-up big-O notation.

Linus



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 06:57:05PM -0700, Linus Torvalds wrote:
 Take five minutes. Take a deep breadth. And *think* about actually reading 
 what I wrote.
 
 The bitflag *can* prevent taking the same lock twice. It just needs to be 
 in the right place.

It's not that I didn't read it, but to do it I'd have to grow every
anon_vma by 8 bytes. I thought it was implicit that the conclusion of
your email was that it couldn't possibly make sense to grow the size of
each anon_vma by 33% when nobody has loaded the kvm or gru or xpmem
kernel modules. It surely isn't my preferred solution, while capping
the number of vmas at 1024 means sort() will make around 10240 steps;
Matt can tell the exact number. The big cost shouldn't be in
sort. Even 512 vmas will be more than enough for us in fact.

Note that I've a cond_resched in the sort compare function and I can
re-add the signal_pending check. I had the signal_pending check in the
original version that didn't use sort() but was doing an inner loop; I
thought signal_pending wasn't necessary after speeding it up with
sort(). But I can add it again, so then we'll only fail to abort inside
sort() and we'll be able to break the loop while taking all the
spinlocks, but with such a small array that can't be an issue, and the
result will surely run faster than stop_machine, with zero ram and cpu
overhead for the VM (besides, stop_machine can't work or we'd need to
disable preemption between invalidate_range_start/end, even removing
the xpmem schedule-inside-mmu-notifier requirement).



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Linus Torvalds


Andrea, I'm not interested. I've stated my standpoint: the code being 
discussed is crap. We're not doing that. Not in the core VM. 

I gave solutions that I think aren't crap, but I already also stated that 
I have no problems not merging it _ever_ if no solution can be found. The 
whole issue simply isn't even worth the pain, imnsho. 

Linus



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 06:12:32PM -0700, Christoph Lameter wrote:
 Andrea's mm_lock could have wider impact. It is the first effective 
 way that I have seen of temporarily holding off reclaim from an address 
 space. It sure is a brute force approach.

The only improvement I can imagine on mm_lock, after changing the
name to global_mm_lock(), is to reestablish the signal_pending check
in the loop that takes the spinlocks so it can back off, and to put
the cap at 512 vmas, so the ram wasted on anon_vmas wouldn't save more
than 10-100usec at most (plus the vfree, that may be a bigger cost,
but we're ok to pay it and it surely isn't security related).

Then in the long term we need to talk to Matt about returning a
parameter to the sort function to break the loop. After that we remove
the 512 vma cap and mm_lock is free to run as long as it wants, like
/dev/urandom; nobody could care less how long it runs before
returning, as long as it reacts to signals.

This is the right way if we want to support XPMEM/GRU efficiently and
without introducing unnecessary regressions in the VM fastpaths and VM
footprint.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Christoph Lameter
On Thu, 8 May 2008, Andrea Arcangeli wrote:

 to the sort function to break the loop. After that we remove the 512
 vma cap and mm_lock is free to run as long as it wants like
 /dev/urandom, nobody can care less how long it will run before
 returning as long as it reacts to signals.

Look Linus has told you what to do. Why not simply do it?



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 08:10:33PM -0700, Christoph Lameter wrote:
 On Thu, 8 May 2008, Andrea Arcangeli wrote:
 
  to the sort function to break the loop. After that we remove the 512
  vma cap and mm_lock is free to run as long as it wants like
  /dev/urandom, nobody can care less how long it will run before
  returning as long as it reacts to signals.
 
 Look Linus has told you what to do. Why not simply do it?

When it looked like we could use vm_flags to remove sort, that looked
like an ok optimization; no problem with optimizations, I'm all for
optimizations if they cost nothing to the VM fast paths and VM footprint.

But removing sort isn't worth it if it takes away ram from the VM even
when global_mm_lock will never be called.

sort is like /dev/urandom so after sort is fixed to handle signals
(and I expect Matt will help with this) we'll remove the 512 vmas cap.
In the meantime we can live with the 512 vmas cap that guarantees sort
won't take more than a few dozen usec. Removing sort() is the only
thing that the anon vma bitflag can achieve and it's clearly not worth
it and it would go in the wrong direction (fixing sort to handle
signals is clearly the right direction, if sort is a concern at all).

Adding the global lock around global_mm_lock to avoid one
global_mm_lock colliding against another global_mm_lock is surely ok
with me, if that's still wanted now that it's clear removing sort
isn't worth it; I'm neutral on this.

Christoph please go ahead and add the bitflag to anon-vma yourself if
you want. If something isn't technically right I don't do it no matter
who asks it.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Linus Torvalds


On Thu, 8 May 2008, Andrea Arcangeli wrote:
 
 But removing sort isn't worth it if it takes away ram from the VM even
 when global_mm_lock will never be called.

Andrea, you really are a piece of work. Your arguments have been bogus 
crap that didn't even understand what was going on from the beginning, and 
now you continue to do that.

What exactly takes away ram from the VM?

The notion of adding a flag to struct anon_vma?

The one that already has a 4 byte padding thing on x86-64 just after the 
spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two 
bytes of padding if we didn't just make the spinlock type unconditionally 
32 bits rather than the 16 bits we actually _use_?

IOW, you didn't even look at it, did you?

But whatever. I clearly don't want a patch from you anyway, so ..

Linus



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Wed, May 07, 2008 at 09:14:45PM -0700, Linus Torvalds wrote:
 IOW, you didn't even look at it, did you?

Actually I looked both at the struct and at the slab alignment just in
case it was changed recently. Now after reading your mail I also
compiled it just in case.

2.6.26-rc1

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
anon_vma               260        576         24        144            1          : tunables   120       60            8        : slabdata        4           4            0
                                              ^^        ^^^

2.6.26-rc1 + below patch

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -27,6 +27,7 @@ struct anon_vma {
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
+	int flag:1;
 };
 
 #ifdef CONFIG_MMU

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
anon_vma               250        560         32        112            1          : tunables   120       60            8        : slabdata        5           5            0
                                              ^^        ^^^

Not a big deal, sure, to grow it 33%, it's so small anyway, but I don't
see the point in growing it. sort() can be made interruptible by
signals, and until it can be, we can cap it at 512 vmas, making the
worst case take a dozen usecs; I fail to see what you have against sort().

Again: if a vma bitflag + global lock could have avoided sort and run
O(N) instead of the current O(N*log(N)) I would have done that
immediately, in fact I was in the process of doing it when you posted
the followup. Nothing personal here, just staying technical. Hope you
too.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Pekka Enberg
On Thu, May 8, 2008 at 8:20 AM, Andrea Arcangeli [EMAIL PROTECTED] wrote:
  Actually I looked both at the struct and at the slab alignment just in
  case it was changed recently. Now after reading your mail I also
  compiled it just in case.

  @@ -27,6 +27,7 @@ struct anon_vma {
   struct anon_vma {

 spinlock_t lock;/* Serialize access to vma list */

 struct list_head head;  /* List of private related vmas */
  +   int flag:1;
   };

You might want to read carefully what Linus wrote:

 The one that already has a 4 byte padding thing on x86-64 just after the
 spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two
 bytes of padding if we didn't just make the spinlock type unconditionally
 32 bits rather than the 16 bits we actually _use_?

So you need to add the flag _after_ ->lock and _before_ ->head

Pekka



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Pekka Enberg
On Thu, May 8, 2008 at 8:27 AM, Pekka Enberg [EMAIL PROTECTED] wrote:
 You might want to read carefully what Linus wrote:

   The one that already has a 4 byte padding thing on x86-64 just after the
   spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two
   bytes of padding if we didn't just make the spinlock type unconditionally
   32 bits rather than the 16 bits we actually _use_?

  So you need to add the flag _after_ ->lock and _before_ ->head.

Oh, I should have taken my morning coffee first; before ->lock
obviously works as well.



Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Andrea Arcangeli
On Thu, May 08, 2008 at 08:30:20AM +0300, Pekka Enberg wrote:
 On Thu, May 8, 2008 at 8:27 AM, Pekka Enberg [EMAIL PROTECTED] wrote:
  You might want to read carefully what Linus wrote:
 
The one that already has a 4 byte padding thing on x86-64 just after the
spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have 
  two
bytes of padding if we didn't just make the spinlock type unconditionally
32 bits rather than the 16 bits we actually _use_?
 
    So you need to add the flag _after_ ->lock and _before_ ->head.
 
 Oh, I should have taken my morning coffee first; before ->lock
 obviously works as well.

Sorry, Linus is right: I didn't realize the "after the spinlock" was
literally after the spinlock; I didn't see the 4 byte padding when I
read the code and put the flag:1 in. If put between ->lock and ->head
it doesn't take more memory on x86-64, as described literally. So the
next step would be to find another place like that in struct
address_space. Perhaps after the private_lock using the same trick, or
perhaps the slab alignment won't actually alter the number of objects
per slab regardless.

I leave that to Christoph; he's surely better than me at doing this. I
give it up entirely, and I consider my attempt to merge a total
failure, which I strongly regret.

On a side note the anon_vma will change to this when XPMEM support is
compiled in:

 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	atomic_t refcount;	/* vmas on the list */
+	struct rw_semaphore sem;	/* Serialize access to vma list */
 	struct list_head head;	/* List of private related vmas */
 };

I'm not sure if it'll grow in size after that, but let's say it's not a
big deal.



[kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1209740186 -7200
# Node ID 0be678c52e540d5f5d5fd9af549b57b9bb018d32
# Parent  de28c85baef11b90c993047ca851a2f52c85a5be
anon-vma-rwsem

Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows sleeping functions to be called from reverse map traversal, as
needed for the notifier callbacks, and it introduces the possibility of
such traversals running concurrently.

RCU is used in some contexts to guarantee the presence of the anon_vma
(try_to_unmap) while we acquire the anon_vma lock. We cannot take a
semaphore within an RCU critical section. Add a refcount to the anon_vma
structure which allows us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma. I think this is a bug
because the anon_vma may become empty and get scheduled to be freed
but then we increase the refcount again when the migration entries are
removed.

The refcount in general allows a shortening of RCU critical sections since
we can do an rcu_read_unlock() after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.
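
As a sketch of how such a speculative reference could be taken
(grab_anon_vma() is only declared in the hunks below, so this body is an
assumption about its shape, not the patch's actual code):

struct anon_vma *grab_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma = NULL;
	unsigned long mapping;

	rcu_read_lock();
	mapping = (unsigned long)rcu_dereference(page->mapping);
	if ((mapping & PAGE_MAPPING_ANON) && page_mapped(page)) {
		anon_vma = (struct anon_vma *)(mapping - PAGE_MAPPING_ANON);
		/* refcount 0 means the last vma is gone and the anon_vma
		 * may already be queued for freeing */
		if (!atomic_inc_not_zero(&anon_vma->refcount))
			anon_vma = NULL;
	}
	rcu_read_unlock();
	return anon_vma;
}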

However:
- Atomic overhead increases in situations where a new reference
  to the anon_vma has to be established or removed. Overhead also increases
  when a speculative reference is used (try_to_unmap,
  page_mkclean, page migration).
- There is the potential for more frequent task switching due to up_xxx()
  letting waiting tasks run first. This results in e.g. the AIM9 brk
  performance test going down by 10-15%.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -25,7 +25,8 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	atomic_t refcount;	/* vmas on the list */
+	struct rw_semaphore sem;	/* Serialize access to vma list */
 	struct list_head head;	/* List of private related vmas */
 };
 
@@ -43,18 +44,31 @@ static inline void anon_vma_free(struct 
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+	atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+	if (atomic_dec_and_test(&anon_vma->refcount))
+		anon_vma_free(anon_vma);
+}
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+		down_write(&anon_vma->sem);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		up_write(&anon_vma->sem);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s
 		return;
 
 	/*
-	 * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+	 * We hold either the mmap_sem lock or a reference on the
+	 * anon_vma. So no need to call page_lock_anon_vma.
 	 */
 	anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
+	down_read(&anon_vma->sem);
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
 		remove_migration_pte(vma, old, new);
 
-	spin_unlock(&anon_vma->lock);
+	up_read(&anon_vma->sem);
 }
 
 /*
@@ -630,7 +631,7 @@ static int unmap_and_move(new_page_t get
 	int rc = 0;
 	int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
-	int rcu_locked = 0;
+	struct anon_vma *anon_vma = NULL;
 	int charge = 0;
 
 	if (!newpage)
@@ -654,16 +655,14 @@ static int unmap_and_move(new_page_t get
 	}
 	/*
 	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
-	 * we cannot notice that anon_vma is freed while we migrates a page.
+	 * we cannot notice that anon_vma is freed while we migrate a page.
 	 * This rcu_read_lock() delays freeing anon_vma pointer until the end
 	 * of migration. File cache pages are no problem because of page_lock()
 	 * File Caches may use write_page() or lock_page() in