Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

2005-02-03 Thread Nick Piggin
On Wed, 2005-02-02 at 14:09 +1100, Nick Piggin wrote:
> On Tue, 2005-02-01 at 18:49 -0800, Christoph Lameter wrote:
> > On Wed, 2 Feb 2005, Nick Piggin wrote:

> > I mean we could just speculatively copy, risk copying crap and
> > discard that later when we find that the pte has changed. This would
> > simplify the function:
> > 
> 
> I think this may be the better approach. Anyone else?
> 

Not to say it is perfect either. Normal semantics say not to touch
a page if it is not somehow pinned. So this may cause problems in
corner cases (DEBUG_PAGEALLOC comes to mind... hopefully nothing else).

But I think a plain read of the page when it isn't pinned is less
yucky than writing into the non-pinned struct page.






Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

2005-02-01 Thread Christoph Lameter
On Wed, 2 Feb 2005, Nick Piggin wrote:

> Well yeah, but the interesting case is when that isn't a lock ;)
>
> I'm not saying what you've got is no good. I'm sure it would be fine
> for testing. And if it happens that we can do the "page_count doesn't
> mean anything after it has reached zero and been freed. Nor will it
> necessarily be zero when a new page is allocated" thing without many
> problems, then this may be a fine way to do it.
>
> I was just pointing out this could be a problem without putting a
> lot of thought into it...

Surely we need to do this the right way. Do we really need to
use page_cache_get()? Is anything relying on page_count == 2 of
the old_page?

I mean we could just speculatively copy, risk copying crap and
discard that later when we find that the pte has changed. This would
simplify the function:

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c   2005-02-01 18:10:46.0 -0800
+++ linux-2.6.10/mm/memory.c2005-02-01 18:43:08.0 -0800
@@ -1323,9 +1323,6 @@ static int do_wp_page(struct mm_struct *
/*
 * Ok, we need to copy. Oh, well..
 */
-   if (!PageReserved(old_page))
-   page_cache_get(old_page);
-
if (unlikely(anon_vma_prepare(vma)))
goto no_new_page;
if (old_page == ZERO_PAGE(address)) {
@@ -1336,6 +1333,10 @@ static int do_wp_page(struct mm_struct *
new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
if (!new_page)
goto no_new_page;
+   /*
+* No page_cache_get so we may copy some crap
+* that is later discarded if the pte has changed
+*/
copy_user_highpage(new_page, old_page, address);
}
/*
@@ -1352,7 +1353,6 @@ static int do_wp_page(struct mm_struct *
acct_update_integrals();
update_mem_hiwater();
} else
-
page_remove_rmap(old_page);
break_cow(vma, new_page, address, page_table);
lru_cache_add_active(new_page);
@@ -1363,7 +1363,6 @@ static int do_wp_page(struct mm_struct *
}
pte_unmap(page_table);
page_cache_release(new_page);
-   page_cache_release(old_page);
spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
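
A rough userspace sketch of the copy-then-revalidate idea above may make the
control flow clearer. The single-word "pte", the helper name and the flat
buffers are illustrative stand-ins, not the kernel code; in the kernel a
speculative read of an already-freed page merely returns junk that the
recheck throws away (it does not fault, except under DEBUG_PAGEALLOC):

/*
 * Speculative COW sketch: copy without holding a reference, then
 * install the copy only if the "pte" is still what we saw before.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

static _Atomic uintptr_t pte;           /* stand-in pte: points at the page */
static char old_page[PAGE_SIZE];
static char new_page[PAGE_SIZE];

static int speculative_cow(void)
{
        uintptr_t snap = atomic_load(&pte);

        /* Copy without taking a reference; the data may be junk by now. */
        memcpy(new_page, (void *)snap, PAGE_SIZE);

        /* Revalidate: only install the copy if the pte is unchanged. */
        if (!atomic_compare_exchange_strong(&pte, &snap,
                                            (uintptr_t)new_page))
                return -1;              /* pte changed: discard the copy */
        return 0;
}

int main(void)
{
        memset(old_page, 'x', sizeof(old_page));
        atomic_store(&pte, (uintptr_t)old_page);
        printf("cow %s\n", speculative_cow() ? "retried" : "done");
        return 0;
}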



Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

2005-02-01 Thread Nick Piggin
On Tue, 2005-02-01 at 18:49 -0800, Christoph Lameter wrote:
> On Wed, 2 Feb 2005, Nick Piggin wrote:
> 
> > Well yeah, but the interesting case is when that isn't a lock ;)
> >
> > I'm not saying what you've got is no good. I'm sure it would be fine
> > for testing. And if it happens that we can do the "page_count doesn't
> > mean anything after it has reached zero and been freed. Nor will it
> > necessarily be zero when a new page is allocated" thing without many
> > problems, then this may be a fine way to do it.
> >
> > I was just pointing out this could be a problem without putting a
> > lot of thought into it...
> 
> Surely we need to do this the right way. Do we really need to
> use page_cache_get()? Is anything relying on page_count == 2 of
> the old_page?
> 
> I mean we could just speculatively copy, risk copying crap and
> discard that later when we find that the pte has changed. This would
> simplify the function:
> 

I think this may be the better approach. Anyone else?




Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

2005-02-01 Thread Nick Piggin
On Tue, 2005-02-01 at 17:20 -0800, Christoph Lameter wrote:
> On Wed, 2 Feb 2005, Nick Piggin wrote:
> 
> > > The unmapping in rmap.c would change the pte. This would be discovered
> > > after acquiring the spinlock later in do_wp_page. Which would then lead to
> > > the operation being abandoned.
> > Oh yes, but suppose your page_cache_get is happening at the same time
> > as free_pages_check, after the page gets freed by the scanner? I can't
> > actually think of anything that would cause a real problem (ie. not a
> > debug check), off the top of my head. But can you say there _isn't_
> > anything?
> >
> > Regardless, it seems pretty dirty to me. But could possibly be made
> > workable, of course.
> 
> Would it make you feel better if we would move the spin_unlock back to the
> prior position? This would ensure that the fallback case is exactly the
> same.
> 

Well yeah, but the interesting case is when that isn't a lock ;)

I'm not saying what you've got is no good. I'm sure it would be fine
for testing. And if it happens that we can do the "page_count doesn't
mean anything after it has reached zero and been freed. Nor will it
necessarily be zero when a new page is allocated" thing without many
problems, then this may be a fine way to do it.

I was just pointing out this could be a problem without putting a
lot of thought into it...




Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

2005-02-01 Thread Christoph Lameter
On Wed, 2 Feb 2005, Nick Piggin wrote:

> > The unmapping in rmap.c would change the pte. This would be discovered
> > after acquiring the spinlock later in do_wp_page. Which would then lead to
> > the operation being abandoned.
> Oh yes, but suppose your page_cache_get is happening at the same time
> as free_pages_check, after the page gets freed by the scanner? I can't
> actually think of anything that would cause a real problem (ie. not a
> debug check), off the top of my head. But can you say there _isn't_
> anything?
>
> Regardless, it seems pretty dirty to me. But could possibly be made
> workable, of course.

Would it make you feel better if we would move the spin_unlock back to the
prior position? This would ensure that the fallback case is exactly the
same.

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c   2005-01-31 08:59:07.0 -0800
+++ linux-2.6.10/mm/memory.c2005-02-01 10:55:30.0 -0800
@@ -1318,7 +1318,6 @@ static int do_wp_page(struct mm_struct *
}
}
pte_unmap(page_table);
-   page_table_atomic_stop(mm);

/*
 * Ok, we need to copy. Oh, well..
@@ -1326,6 +1325,7 @@ static int do_wp_page(struct mm_struct *
if (!PageReserved(old_page))
page_cache_get(old_page);

+   page_table_atomic_stop(mm);
if (unlikely(anon_vma_prepare(vma)))
goto no_new_page;
if (old_page == ZERO_PAGE(address)) {


Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

2005-02-01 Thread Nick Piggin
On Tue, 2005-02-01 at 11:01 -0800, Christoph Lameter wrote:
> On Tue, 1 Feb 2005, Nick Piggin wrote:

> > A per-pte lock is sufficient for this case, of course, which is why the
> > pte-locked system is completely free of the page table lock.
> 
> Introducing pte locking would allow us to go further with parallelizing
> this, but it's another invasive procedure. I think parallelizing COW is only
> possible to do reliably with some pte locking scheme. But then the
> question is whether the pte locking is really faster than obtaining a
> spinlock. I suspect this may not be the case.
> 

Well, most likely not, although I haven't been able to detect much
difference. But in your case you would probably be happy to live
with that if it meant better parallelising of an important
function... but we'll leave future discussion to another thread ;)

> > Although I may have some fact fundamentally wrong?
> 
> The unmapping in rmap.c would change the pte. This would be discovered
> after acquiring the spinlock later in do_wp_page. Which would then lead to
> the operation being abandoned.
> 

Oh yes, but suppose your page_cache_get is happening at the same time
as free_pages_check, after the page gets freed by the scanner? I can't
actually think of anything that would cause a real problem (ie. not a
debug check), off the top of my head. But can you say there _isn't_
anything?

Regardless, it seems pretty dirty to me. But could possibly be made
workable, of course.






Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

2005-02-01 Thread Christoph Lameter
On Tue, 1 Feb 2005, Nick Piggin wrote:

> > pte_unmap(page_table);
> > +   page_table_atomic_stop(mm);
> >
> > /*
> >  * Ok, we need to copy. Oh, well..
> >  */
> > if (!PageReserved(old_page))
> > page_cache_get(old_page);
> > -   spin_unlock(&mm->page_table_lock);
> >
>
> I don't think you can do this unless you have done something funky that I
> missed. And that kind of shoots down your lockless COW too, although it
> looks like you can safely have the second part of do_wp_page without the
> lock. Basically - your lockless COW patch itself seems like it should be
> OK, but this hunk does not.

See my comment at the end of this message.

> I would be very interested if you are seeing performance gains with your
> lockless COW patches, BTW.

So far I have not had time to focus on benchmarking that.

> Basically, getting a reference on a struct page was the only thing I found
> I wasn't able to do lockless with pte cmpxchg. Because it can race with
> unmapping in rmap.c and reclaim and reuse, which probably isn't too good.
> That means: the only operations you are able to do lockless is when there
> is no backing page (ie. the anonymous unpopulated->populated case).
>
> A per-pte lock is sufficient for this case, of course, which is why the
> pte-locked system is completely free of the page table lock.

Introducing pte locking would allow us to go further with parallelizing
this, but it's another invasive procedure. I think parallelizing COW is only
possible to do reliably with some pte locking scheme. But then the
question is whether the pte locking is really faster than obtaining a
spinlock. I suspect this may not be the case.

> Although I may have some fact fundamentally wrong?

The unmapping in rmap.c would change the pte. This would be discovered
after acquiring the spinlock later in do_wp_page. Which would then lead to
the operation being abandoned.



Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

2005-02-01 Thread Christoph Lameter
On Tue, 1 Feb 2005, Nick Piggin wrote:

> Slightly OT: are you still planning to move the update_mem_hiwater and
> friends crud out of these fastpaths? It looks like at least that function
> is unsafe to be lockless.

Yes. I have a patch pending and the author of the CSA patches is a
cowoerker of mine. The patch will be resubmitted once certain aspects
of the timer subsystem are stabilized and/or when he gets back from his
vacation. The statistics are not critical to system operation.


Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

2005-02-01 Thread Christoph Lameter
On Tue, 1 Feb 2005, Nick Piggin wrote:

> Hmm, this is moving toward the direction my patches take.

You are right. But I am still wary of the transactional idea in your
patchset (which is really not comparable with a database transaction
after all...).

I think moving to cmpxchg and xchg operations will give this more
transparency and make it easier to understand and handle.

> Naturally I prefer the complete replacement that is made with
> my patch - however this of course means one has to move
> *everything* over to be pte_cmpxchg safe, which runs against
> your goal of getting the low hanging fruit with as little fuss
> as possible for the moment.

I would also prefer a replacement but there are certain cost-benefit
tradeoffs with atomic operations vs. spinlock that may better be tackled
on a case by case basis. Also this is pretty much at the core of the Linux
VM and thus highly sensitive. Given its history and the danger of breaking
things it may be best to preserve it intact as much as possible and move
in small steps.


Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

2005-01-31 Thread Nick Piggin
Christoph Lameter wrote:
> The page fault handler attempts to use the page_table_lock only for short
> time periods. It repeatedly drops and reacquires the lock. When the lock
> is reacquired, checks are made if the underlying pte has changed before
> replacing the pte value. These locations are a good fit for the use of
> ptep_cmpxchg.
> The following patch removes the first acquisition of the page_table_lock
> and uses atomic operations on the page table instead. A section
> using atomic pte operations is begun with
> page_table_atomic_start(struct mm_struct *)
> and ends with
> page_table_atomic_stop(struct mm_struct *)
Hmm, this is moving toward the direction my patches take.
I think it may be the right way to go if you're lifting the ptl
from some core things, because some architectures won't want to
audit and stuff, and some may need the lock.
Naturally I prefer the complete replacement that is made with
my patch - however this of course means one has to move
*everything* over to be pte_cmpxchg safe, which runs against
your goal of getting the low hanging fruit with as little fuss
as possible for the moment.



Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

2005-01-31 Thread Nick Piggin
Christoph Lameter wrote:
Slightly OT: are you still planning to move the update_mem_hiwater and
friends crud out of these fastpaths? It looks like at least that function
is unsafe to be lockless.
> @@ -1316,21 +1318,27 @@ static int do_wp_page(struct mm_struct *
> flush_cache_page(vma, address);
> entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
>   vma);
> -   ptep_set_access_flags(vma, address, page_table, entry, 1);
> -   update_mmu_cache(vma, address, entry);
> +   /*
> +* If the bits are not updated then another fault
> +* will be generated with another chance of updating.
> +*/
> +   if (ptep_cmpxchg(page_table, pte, entry))
> +   update_mmu_cache(vma, address, entry);
> +   else
> +   inc_page_state(cmpxchg_fail_flag_reuse);
> pte_unmap(page_table);
> -   spin_unlock(&mm->page_table_lock);
> +   page_table_atomic_stop(mm);
> return VM_FAULT_MINOR;
> }
> }
> pte_unmap(page_table);
> +   page_table_atomic_stop(mm);
> /*
>  * Ok, we need to copy. Oh, well..
>  */
> if (!PageReserved(old_page))
> page_cache_get(old_page);
> -   spin_unlock(&mm->page_table_lock);
I don't think you can do this unless you have done something funky that I
missed. And that kind of shoots down your lockless COW too, although it
looks like you can safely have the second part of do_wp_page without the
lock. Basically - your lockless COW patch itself seems like it should be
OK, but this hunk does not.
I would be very interested if you are seeing performance gains with your
lockless COW patches, BTW.
Basically, getting a reference on a struct page was the only thing I found
I wasn't able to do lockless with pte cmpxchg. Because it can race with
unmapping in rmap.c and reclaim and reuse, which probably isn't too good.
That means: the only operations you are able to do lockless is when there
is no backing page (ie. the anonymous unpopulated->populated case).
A per-pte lock is sufficient for this case, of course, which is why the
pte-locked system is completely free of the page table lock.
Although I may have some fact fundamentally wrong?


page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

2005-01-28 Thread Christoph Lameter
The page fault handler attempts to use the page_table_lock only for short
time periods. It repeatedly drops and reacquires the lock. When the lock
is reacquired, checks are made if the underlying pte has changed before
replacing the pte value. These locations are a good fit for the use of
ptep_cmpxchg.

The following patch removes the first acquisition of the page_table_lock
and uses atomic operations on the page table instead. A section
using atomic pte operations is begun with

page_table_atomic_start(struct mm_struct *)

and ends with

page_table_atomic_stop(struct mm_struct *)

Both of these become spin_lock(page_table_lock) and
spin_unlock(page_table_lock) if atomic page table operations are not
configured (CONFIG_ATOMIC_TABLE_OPS undefined).
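
A compile-standalone sketch of that configuration switch, using a pthread
lock as a stand-in for the kernel spinlock. The spin_lock/spin_unlock
expansion for the unconfigured case follows the description above; the stub
mm_struct and the no-op expansion for the configured case are assumptions
made only for illustration:

#include <pthread.h>

struct mm_struct {                      /* stub: only the field used here */
        pthread_mutex_t page_table_lock;
};

#ifdef CONFIG_ATOMIC_TABLE_OPS
/* Atomic pte ops configured: assumed here to need no locking at all. */
#define page_table_atomic_start(mm)     do { } while (0)
#define page_table_atomic_stop(mm)      do { } while (0)
#else
/* Fallback: the sections behave exactly like page_table_lock sections. */
#define page_table_atomic_start(mm)     pthread_mutex_lock(&(mm)->page_table_lock)
#define page_table_atomic_stop(mm)      pthread_mutex_unlock(&(mm)->page_table_lock)
#endif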

Atomic operations with pte_xchg and pte_cmpxchg only work for the lowest
layer of the page table. Higher layers may also be populated in an atomic
way by defining pmd_test_and_populate() etc. The generic versions of these
functions fall back to the page_table_lock (populating higher level page
table entries is rare and therefore this is not likely to be performance
critical). For ia64, the definitions for the higher level atomic operations are
included, and these may easily be added for other architectures.
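
In the same spirit, a userspace sketch of the two kinds of primitive just
mentioned: a cmpxchg on the lowest-level pte word, and a generic
higher-level populate helper that simply falls back to a lock. The
simplified pte_t/pmd_t types and the exact signatures are assumptions for
illustration only:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef _Atomic unsigned long pte_t;    /* stand-in for a real pte */
typedef unsigned long pmd_t;            /* stand-in for a real pmd */

static pthread_mutex_t page_table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Succeeds only if *ptep still holds the value we looked at earlier. */
static bool ptep_cmpxchg(pte_t *ptep, unsigned long old, unsigned long new)
{
        return atomic_compare_exchange_strong(ptep, &old, new);
}

/* Generic fallback: install a lower-level table under the lock. */
static bool pmd_test_and_populate(pmd_t *pmd, unsigned long table)
{
        bool done = false;

        pthread_mutex_lock(&page_table_lock);
        if (*pmd == 0) {                /* i.e. "pmd_none" */
                *pmd = table;
                done = true;
        }
        pthread_mutex_unlock(&page_table_lock);
        return done;
}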

This patch depends on the pte_cmpxchg patch being applied first and will
only remove the first use of the page_table_lock in the page fault handler.
This will allow the following page table operations without acquiring
the page_table_lock:

1. Updating of access bits (handle_mm_faults)
2. Anonymous read faults (do_anonymous_page)

The page_table_lock is still acquired for creating a new pte for an anonymous
write fault and therefore the problems with rss that were addressed by splitting
rss into the task structure do not yet occur.

The patch also adds some diagnostic features by counting the number of cmpxchg
failures (useful for verifying that this patch works right) and the number of
faults received that led to no change in the page table. These statistics may
be viewed via /proc/meminfo.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c   2005-01-27 16:27:59.0 -0800
+++ linux-2.6.10/mm/memory.c2005-01-27 16:28:54.0 -0800
@@ -36,6 +36,8 @@
  * ([EMAIL PROTECTED])
  *
  * Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ * Jan 2005    Scalability improvement by reducing the use and the length of time
+ * the page table lock is held (Christoph Lameter)
  */

 #include <linux/kernel_stat.h>
@@ -1285,8 +1287,8 @@ static inline void break_cow(struct vm_a
  * change only once the write actually happens. This avoids a few races,
  * and potentially makes it more efficient.
  *
- * We hold the mm semaphore and the page_table_lock on entry and exit
- * with the page_table_lock released.
+ * We hold the mm semaphore and have started atomic pte operations,
+ * exit with pte ops completed.
  */
 static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte)
@@ -1304,7 +1306,7 @@ static int do_wp_page(struct mm_struct *
pte_unmap(page_table);
printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
address);
-   spin_unlock(&mm->page_table_lock);
+   page_table_atomic_stop(mm);
return VM_FAULT_OOM;
}
old_page = pfn_to_page(pfn);
@@ -1316,21 +1318,27 @@ static int do_wp_page(struct mm_struct *
flush_cache_page(vma, address);
entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
  vma);
-   ptep_set_access_flags(vma, address, page_table, entry, 1);
-   update_mmu_cache(vma, address, entry);
+   /*
+* If the bits are not updated then another fault
+* will be generated with another chance of updating.
+*/
+   if (ptep_cmpxchg(page_table, pte, entry))
+   update_mmu_cache(vma, address, entry);
+   else
+   inc_page_state(cmpxchg_fail_flag_reuse);
pte_unmap(page_table);
-   spin_unlock(&mm->page_table_lock);
+   page_table_atomic_stop(mm);
return VM_FAULT_MINOR;
}
}
pte_unmap(page_table);
+   page_table_atomic_stop(mm);

/*
 * Ok, we need to copy. Oh, well..
 */
if (!PageReserved(old_page))
page_cache_get(old_page);
-   spin_unlock(&mm->page_table_lock);

if (unlikely(anon_vma_prepare(vma)))
 
