Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault
On Wed, 2005-02-02 at 14:09 +1100, Nick Piggin wrote:
> On Tue, 2005-02-01 at 18:49 -0800, Christoph Lameter wrote:
> > On Wed, 2 Feb 2005, Nick Piggin wrote:
> > I mean we could just speculatively copy, risk copying crap and
> > discard that later when we find that the pte has changed. This would
> > simplify the function:
>
> I think this may be the better approach. Anyone else?

Not to say it is perfect either. Normal semantics say not to touch a
page if it is not somehow pinned. So this may cause problems in corner
cases (DEBUG_PAGEALLOC comes to mind... hopefully nothing else).

But I think a plain read of the page when it isn't pinned is less yucky
than writing into the non-pinned struct page.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault
On Wed, 2 Feb 2005, Nick Piggin wrote:

> Well yeah, but the interesting case is when that isn't a lock ;)
>
> I'm not saying what you've got is no good. I'm sure it would be fine
> for testing. And if it happens that we can do the "page_count doesn't
> mean anything after it has reached zero and been freed. Nor will it
> necessarily be zero when a new page is allocated" thing without many
> problems, then this may be a fine way to do it.
>
> I was just pointing out this could be a problem without putting a
> lot of thought into it...

Surely we need to do this the right way. Do we really need to
use page_cache_get()? Is anything relying on page_count == 2 of
the old_page?

I mean we could just speculatively copy, risk copying crap and
discard that later when we find that the pte has changed. This would
simplify the function:

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c	2005-02-01 18:10:46.000000000 -0800
+++ linux-2.6.10/mm/memory.c	2005-02-01 18:43:08.000000000 -0800
@@ -1323,9 +1323,6 @@ static int do_wp_page(struct mm_struct *
 	/*
 	 * Ok, we need to copy. Oh, well..
 	 */
-	if (!PageReserved(old_page))
-		page_cache_get(old_page);
-
 	if (unlikely(anon_vma_prepare(vma)))
 		goto no_new_page;
 	if (old_page == ZERO_PAGE(address)) {
@@ -1336,6 +1333,10 @@ static int do_wp_page(struct mm_struct *
 		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
 		if (!new_page)
 			goto no_new_page;
+		/*
+		 * No page_cache_get so we may copy some crap
+		 * that is later discarded if the pte has changed
+		 */
 		copy_user_highpage(new_page, old_page, address);
 	}
 	/*
@@ -1352,7 +1353,6 @@ static int do_wp_page(struct mm_struct *
 		acct_update_integrals();
 		update_mem_hiwater();
 	} else
-		page_remove_rmap(old_page);
 
 	break_cow(vma, new_page, address, page_table);
 	lru_cache_add_active(new_page);
@@ -1363,7 +1363,6 @@ static int do_wp_page(struct mm_struct *
 	}
 	pte_unmap(page_table);
 	page_cache_release(new_page);
-	page_cache_release(old_page);
 	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_MINOR;
Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault
On Tue, 2005-02-01 at 18:49 -0800, Christoph Lameter wrote:
> On Wed, 2 Feb 2005, Nick Piggin wrote:
>
> > Well yeah, but the interesting case is when that isn't a lock ;)
> >
> > I'm not saying what you've got is no good. I'm sure it would be fine
> > for testing. And if it happens that we can do the "page_count doesn't
> > mean anything after it has reached zero and been freed. Nor will it
> > necessarily be zero when a new page is allocated" thing without many
> > problems, then this may be a fine way to do it.
> >
> > I was just pointing out this could be a problem without putting a
> > lot of thought into it...
>
> Surely we need to do this the right way. Do we really need to
> use page_cache_get()? Is anything relying on page_count == 2 of
> the old_page?
>
> I mean we could just speculatively copy, risk copying crap and
> discard that later when we find that the pte has changed. This would
> simplify the function:

I think this may be the better approach. Anyone else?
Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault
On Tue, 2005-02-01 at 17:20 -0800, Christoph Lameter wrote:
> On Wed, 2 Feb 2005, Nick Piggin wrote:
>
> > > The unmapping in rmap.c would change the pte. This would be discovered
> > > after acquiring the spinlock later in do_wp_page. Which would then lead to
> > > the operation being abandoned.
> > Oh yes, but suppose your page_cache_get is happening at the same time
> > as free_pages_check, after the page gets freed by the scanner? I can't
> > actually think of anything that would cause a real problem (ie. not a
> > debug check), off the top of my head. But can you say there _isn't_
> > anything?
> >
> > Regardless, it seems pretty dirty to me. But could possibly be made
> > workable, of course.
>
> Would it make you feel better if we would move the spin_unlock back to the
> prior position? This would ensure that the fallback case is exactly the
> same.

Well yeah, but the interesting case is when that isn't a lock ;)

I'm not saying what you've got is no good. I'm sure it would be fine
for testing. And if it happens that we can do the "page_count doesn't
mean anything after it has reached zero and been freed. Nor will it
necessarily be zero when a new page is allocated" thing without many
problems, then this may be a fine way to do it.

I was just pointing out this could be a problem without putting a
lot of thought into it...
Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault
On Wed, 2 Feb 2005, Nick Piggin wrote:

> > The unmapping in rmap.c would change the pte. This would be discovered
> > after acquiring the spinlock later in do_wp_page. Which would then lead to
> > the operation being abandoned.
>
> Oh yes, but suppose your page_cache_get is happening at the same time
> as free_pages_check, after the page gets freed by the scanner? I can't
> actually think of anything that would cause a real problem (ie. not a
> debug check), off the top of my head. But can you say there _isn't_
> anything?
>
> Regardless, it seems pretty dirty to me. But could possibly be made
> workable, of course.

Would it make you feel better if we would move the spin_unlock back to
the prior position? This would ensure that the fallback case is exactly
the same.

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c	2005-01-31 08:59:07.000000000 -0800
+++ linux-2.6.10/mm/memory.c	2005-02-01 10:55:30.000000000 -0800
@@ -1318,7 +1318,6 @@ static int do_wp_page(struct mm_struct *
 		}
 	}
 	pte_unmap(page_table);
-	page_table_atomic_stop(mm);
 
 	/*
 	 * Ok, we need to copy. Oh, well..
@@ -1326,6 +1325,7 @@ static int do_wp_page(struct mm_struct *
 	if (!PageReserved(old_page))
 		page_cache_get(old_page);
+	page_table_atomic_stop(mm);
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto no_new_page;
 	if (old_page == ZERO_PAGE(address)) {
Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault
On Tue, 2005-02-01 at 11:01 -0800, Christoph Lameter wrote:
> On Tue, 1 Feb 2005, Nick Piggin wrote:
>
> > A per-pte lock is sufficient for this case, of course, which is why the
> > pte-locked system is completely free of the page table lock.
>
> Introducing pte locking would allow us to go further with parallelizing
> this but its another invasive procedure. I think parallelizing COW is only
> possible to do reliable with some pte locking scheme. But then the
> question is if the pte locking is really faster than obtaining a spinlock.
> I suspect this may not be the case.

Well most likely not, although I haven't been able to detect much
difference. But in your case you would probably be happy to live with
that if it meant better parallelising of an important function... but
we'll leave future discussion to another thread ;)

> > Although I may have some fact fundamentally wrong?
>
> The unmapping in rmap.c would change the pte. This would be discovered
> after acquiring the spinlock later in do_wp_page. Which would then lead to
> the operation being abandoned.

Oh yes, but suppose your page_cache_get is happening at the same time
as free_pages_check, after the page gets freed by the scanner? I can't
actually think of anything that would cause a real problem (ie. not a
debug check), off the top of my head. But can you say there _isn't_
anything?

Regardless, it seems pretty dirty to me. But could possibly be made
workable, of course.
Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault
On Tue, 1 Feb 2005, Nick Piggin wrote:

> >  	pte_unmap(page_table);
> > +	page_table_atomic_stop(mm);
> >
> >  	/*
> >  	 * Ok, we need to copy. Oh, well..
> >  	 */
> >  	if (!PageReserved(old_page))
> >  		page_cache_get(old_page);
> > -	spin_unlock(&mm->page_table_lock);
>
> I don't think you can do this unless you have done something funky that I
> missed. And that kind of shoots down your lockless COW too, although it
> looks like you can safely have the second part of do_wp_page without the
> lock. Basically - your lockless COW patch itself seems like it should be
> OK, but this hunk does not. See my comment at the end of this message.
>
> I would be very interested if you are seeing performance gains with your
> lockless COW patches, BTW.

So far I have not had time to focus on benchmarking that.

> Basically, getting a reference on a struct page was the only thing I found
> I wasn't able to do lockless with pte cmpxchg. Because it can race with
> unmapping in rmap.c and reclaim and reuse, which probably isn't too good.
> That means: the only operations you are able to do lockless is when there
> is no backing page (ie. the anonymous unpopulated->populated case).
>
> A per-pte lock is sufficient for this case, of course, which is why the
> pte-locked system is completely free of the page table lock.

Introducing pte locking would allow us to go further with parallelizing
this, but it's another invasive procedure. I think parallelizing COW is
only possible to do reliably with some pte locking scheme. But then the
question is whether pte locking is really faster than obtaining a
spinlock. I suspect this may not be the case.

> Although I may have some fact fundamentally wrong?

The unmapping in rmap.c would change the pte. This would be discovered
after acquiring the spinlock later in do_wp_page, which would then lead
to the operation being abandoned.
Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault
On Tue, 1 Feb 2005, Nick Piggin wrote:

> Slightly OT: are you still planning to move the update_mem_hiwater and
> friends crud out of these fastpaths? It looks like at least that function
> is unsafe to be lockless.

Yes. I have a patch pending, and the author of the CSA patches is a
coworker of mine. The patch will be resubmitted once certain aspects of
the timer subsystem are stabilized and/or when he gets back from his
vacation. The statistics are not critical to system operation.
Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault
On Tue, 1 Feb 2005, Nick Piggin wrote:

> Hmm, this is moving toward the direction my patches take.

You are right. But I am still wary of the transactional idea in your
patchset (which is really not comparable with a database transaction
after all...). I think moving to cmpxchg and xchg operations will give
this more transparency and make it easier to understand and handle.

> Naturally I prefer the complete replacement that is made with
> my patch - however this of course means one has to move
> *everything* over to be pte_cmpxchg safe, which runs against
> your goal of getting the low hanging fruit with as little fuss
> as possible for the moment.

I would also prefer a replacement, but there are certain cost-benefit
tradeoffs with atomic operations vs. spinlocks that may better be
tackled on a case-by-case basis. Also, this is pretty much at the core
of the Linux VM and thus highly sensitive. Given its history and the
danger of breaking things, it may be best to preserve it intact as much
as possible and move in small steps.
Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault
Christoph Lameter wrote:

> The page fault handler attempts to use the page_table_lock only for
> short time periods. It repeatedly drops and reacquires the lock. When
> the lock is reacquired, checks are made if the underlying pte has
> changed before replacing the pte value. These locations are a good fit
> for the use of ptep_cmpxchg.
>
> The following patch allows to remove the first time the
> page_table_lock is acquired and uses atomic operations on the page
> table instead. A section using atomic pte operations is begun with
> page_table_atomic_start(struct mm_struct *) and ends with
> page_table_atomic_stop(struct mm_struct *)

Hmm, this is moving toward the direction my patches take.

I think it may be the right way to go if you're lifting the ptl from
some core things, because some architectures won't want to audit and
stuff, and some may need the lock.

Naturally I prefer the complete replacement that is made with my patch
- however this of course means one has to move *everything* over to be
pte_cmpxchg safe, which runs against your goal of getting the low
hanging fruit with as little fuss as possible for the moment.
Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault
Christoph Lameter wrote:

Slightly OT: are you still planning to move the update_mem_hiwater and
friends crud out of these fastpaths? It looks like at least that
function is unsafe to be lockless.

> @@ -1316,21 +1318,27 @@ static int do_wp_page(struct mm_struct *
>  		flush_cache_page(vma, address);
>  		entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)), vma);
> -		ptep_set_access_flags(vma, address, page_table, entry, 1);
> -		update_mmu_cache(vma, address, entry);
> +		/*
> +		 * If the bits are not updated then another fault
> +		 * will be generated with another chance of updating.
> +		 */
> +		if (ptep_cmpxchg(page_table, pte, entry))
> +			update_mmu_cache(vma, address, entry);
> +		else
> +			inc_page_state(cmpxchg_fail_flag_reuse);
>  		pte_unmap(page_table);
> -		spin_unlock(&mm->page_table_lock);
> +		page_table_atomic_stop(mm);
>  		return VM_FAULT_MINOR;
>  	}
>  }
>  pte_unmap(page_table);
> +	page_table_atomic_stop(mm);
>
>  /*
>   * Ok, we need to copy. Oh, well..
>   */
>  if (!PageReserved(old_page))
>  	page_cache_get(old_page);
> -	spin_unlock(&mm->page_table_lock);

I don't think you can do this unless you have done something funky that
I missed. And that kind of shoots down your lockless COW too, although
it looks like you can safely have the second part of do_wp_page without
the lock. Basically - your lockless COW patch itself seems like it
should be OK, but this hunk does not.

I would be very interested if you are seeing performance gains with
your lockless COW patches, BTW.

Basically, getting a reference on a struct page was the only thing I
found I wasn't able to do lockless with pte cmpxchg. Because it can
race with unmapping in rmap.c and reclaim and reuse, which probably
isn't too good. That means: the only operations you are able to do
lockless is when there is no backing page (ie. the anonymous
unpopulated->populated case).

A per-pte lock is sufficient for this case, of course, which is why the
pte-locked system is completely free of the page table lock.

Although I may have some fact fundamentally wrong?
Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault
Christoph Lameter wrote: Slightly OT: are you still planning to move the update_mem_hiwater and friends crud out of these fastpaths? It looks like at least that function is unsafe to be lockless. @@ -1316,21 +1318,27 @@ static int do_wp_page(struct mm_struct * flush_cache_page(vma, address); entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)), vma); - ptep_set_access_flags(vma, address, page_table, entry, 1); - update_mmu_cache(vma, address, entry); + /* +* If the bits are not updated then another fault +* will be generated with another chance of updating. +*/ + if (ptep_cmpxchg(page_table, pte, entry)) + update_mmu_cache(vma, address, entry); + else + inc_page_state(cmpxchg_fail_flag_reuse); pte_unmap(page_table); - spin_unlock(mm-page_table_lock); + page_table_atomic_stop(mm); return VM_FAULT_MINOR; } } pte_unmap(page_table); + page_table_atomic_stop(mm); /* * Ok, we need to copy. Oh, well.. */ if (!PageReserved(old_page)) page_cache_get(old_page); - spin_unlock(mm-page_table_lock); I don't think you can do this unless you have done something funky that I missed. And that kind of shoots down your lockless COW too, although it looks like you can safely have the second part of do_wp_page without the lock. Basically - your lockless COW patch itself seems like it should be OK, but this hunk does not. I would be very interested if you are seeing performance gains with your lockless COW patches, BTW. Basically, getting a reference on a struct page was the only thing I found I wasn't able to do lockless with pte cmpxchg. Because it can race with unmapping in rmap.c and reclaim and reuse, which probably isn't too good. That means: the only operations you are able to do lockless is when there is no backing page (ie. the anonymous unpopulated-populated case). A per-pte lock is sufficient for this case, of course, which is why the pte-locked system is completely free of the page table lock. Although I may have some fact fundamentally wrong? 
Re: page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault
Christoph Lameter wrote:
> The page fault handler attempts to use the page_table_lock only for
> short time periods. It repeatedly drops and reacquires the lock. When
> the lock is reacquired, checks are made if the underlying pte has
> changed before replacing the pte value. These locations are a good fit
> for the use of ptep_cmpxchg.
>
> The following patch removes the first acquisition of the
> page_table_lock and uses atomic operations on the page table instead.
> A section using atomic pte operations is begun with
> page_table_atomic_start(struct mm_struct *) and ends with
> page_table_atomic_stop(struct mm_struct *)

Hmm, this is moving toward the direction my patches take. I think it may
be the right way to go if you're lifting the ptl from some core things,
because some architectures won't want to audit and stuff, and some may
need the lock.

Naturally I prefer the complete replacement made with my patch - however
this of course means one has to move *everything* over to be pte_cmpxchg
safe, which runs against your goal of getting the low hanging fruit with
as little fuss as possible for the moment.
page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault
The page fault handler attempts to hold the page_table_lock only for
short time periods. It repeatedly drops and reacquires the lock. When
the lock is reacquired, it checks whether the underlying pte has changed
before replacing the pte value. These locations are a good fit for the
use of ptep_cmpxchg.

The following patch removes the first acquisition of the
page_table_lock and uses atomic operations on the page table instead. A
section using atomic pte operations is begun with

	page_table_atomic_start(struct mm_struct *)

and ends with

	page_table_atomic_stop(struct mm_struct *)

Both of these become spin_lock(page_table_lock) and
spin_unlock(page_table_lock) if atomic page table operations are not
configured (CONFIG_ATOMIC_TABLE_OPS undefined).

Atomic operations with pte_xchg and pte_cmpxchg only work for the lowest
layer of the page table. Higher layers may also be populated atomically
by defining pmd_test_and_populate() etc. The generic versions of these
functions fall back to the page_table_lock (populating higher level page
table entries is rare and therefore not likely to be performance
critical). For ia64 the definitions for the higher level atomic
operations are included; these may easily be added for other
architectures.

This patch depends on the pte_cmpxchg patch being applied first and only
removes the first use of the page_table_lock in the page fault handler.
This allows the following page table operations without acquiring the
page_table_lock:

1. Updating of access bits (handle_mm_fault)
2. Anonymous read faults (do_anonymous_page)

The page_table_lock is still acquired when creating a new pte for an
anonymous write fault, and therefore the problems with rss that were
addressed by splitting rss into the task structure do not yet occur.
The patch also adds some diagnostic features by counting the number of
cmpxchg failures (useful to verify that this patch works right) and the
number of faults received that led to no change in the page table. These
statistics may be viewed via /proc/meminfo.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c	2005-01-27 16:27:59.0 -0800
+++ linux-2.6.10/mm/memory.c	2005-01-27 16:28:54.0 -0800
@@ -36,6 +36,8 @@
  *		([EMAIL PROTECTED])
  *
  * Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ * Jan 2005	Scalability improvement by reducing the use and the length
+ *		of time the page table lock is held (Christoph Lameter)
  */
 
 #include <linux/kernel_stat.h>
@@ -1285,8 +1287,8 @@ static inline void break_cow(struct vm_a
  * change only once the write actually happens. This avoids a few races,
  * and potentially makes it more efficient.
  *
- * We hold the mm semaphore and the page_table_lock on entry and exit
- * with the page_table_lock released.
+ * We hold the mm semaphore and have started atomic pte operations,
+ * exit with pte ops completed.
  */
 static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
 	unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte)
@@ -1304,7 +1306,7 @@ static int do_wp_page(struct mm_struct *
 		pte_unmap(page_table);
 		printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
 				address);
-		spin_unlock(&mm->page_table_lock);
+		page_table_atomic_stop(mm);
 		return VM_FAULT_OOM;
 	}
 	old_page = pfn_to_page(pfn);
@@ -1316,21 +1318,27 @@ static int do_wp_page(struct mm_struct *
 		flush_cache_page(vma, address);
 		entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)), vma);
-		ptep_set_access_flags(vma, address, page_table, entry, 1);
-		update_mmu_cache(vma, address, entry);
+		/*
+		 * If the bits are not updated then another fault
+		 * will be generated with another chance of updating.
+		 */
+		if (ptep_cmpxchg(page_table, pte, entry))
+			update_mmu_cache(vma, address, entry);
+		else
+			inc_page_state(cmpxchg_fail_flag_reuse);
 		pte_unmap(page_table);
-		spin_unlock(&mm->page_table_lock);
+		page_table_atomic_stop(mm);
 		return VM_FAULT_MINOR;
 	}
 }
 pte_unmap(page_table);
+page_table_atomic_stop(mm);
 
 /*
  * Ok, we need to copy. Oh, well..
  */
 if (!PageReserved(old_page))
 	page_cache_get(old_page);
-spin_unlock(&mm->page_table_lock);
 
 if (unlikely(anon_vma_prepare(vma)))