Re: [PATCH] lazy freeing of memory through MADV_FREE
Paul Mackerras wrote:
> Rik van Riel writes:
>> I guess we'll need to call tlb_remove_tlb_entry() inside the
>> MADV_FREE code to keep powerpc happy.
>
> I don't see why; once ptep_test_and_clear_young has returned, the
> entry in the hash table has already been removed.

OK, so this one won't be necessary. Good to know that.

Andrew, it looks like things won't be that bad :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel writes:
> I guess we'll need to call tlb_remove_tlb_entry() inside the
> MADV_FREE code to keep powerpc happy.

I don't see why; once ptep_test_and_clear_young has returned, the
entry in the hash table has already been removed. Adding the
tlb_remove_tlb_entry call certainly won't do anything on 64-bit
powerpc, since it expands to do {} while (0) there, and in fact it
won't do anything on 32-bit powerpc either.

Paul.
Re: [PATCH] lazy freeing of memory through MADV_FREE
On Mon, 23 Apr 2007 22:53:49 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
> I don't see why we need the attached, but in case you find
> a good reason, here's my signed-off-by line for Andrew :)

Andrew is in a defensive crouch trying to work his way through all the
bugs he's been sent. After I've managed to release 2.6.21-rc7-mm1
(say, December) I expect I'll drop the MADV_FREE stuff, give you a run
at creating a new patch series.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote:
> What the tlb flush used to be able to assume is that the page
> has been removed from the pagetables when they are put in the
> tlb flush batch.

I think this is still the case, to a degree. There should be no harm
in removing the TLB entries after the page table has been unlocked,
right?

Or is something like the attached really needed?

From what I can see, the page table lock should be enough
synchronization between unmap_mapping_range, MADV_FREE and
MADV_DONTNEED.

I don't see why we need the attached, but in case you find a good
reason, here's my signed-off-by line for Andrew :)

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

--- linux-2.6.20.x86_64/mm/memory.c.flushme	2007-04-23 22:26:06.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 22:42:06.0 -0400
@@ -628,6 +628,7 @@ static unsigned long zap_pte_range(struc
 				long *zap_work, struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
+	unsigned long start_addr = addr;
 	pte_t *pte;
 	spinlock_t *ptl;
 	int file_rss = 0;
@@ -726,6 +727,11 @@ static unsigned long zap_pte_range(struc
 	add_mm_rss(mm, file_rss, anon_rss);
 	arch_leave_lazy_mmu_mode();
+	if (details && details->madv_free) {
+		/* Protect against MADV_DONTNEED or unmap_mapping_range */
+		tlb_finish_mmu(tlb, start_addr, addr);
+		tlb = tlb_gather_mmu(mm, 0);
+	}
 	pte_unmap_unlock(pte - 1, ptl);
 	return addr;
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> This should fix the MADV_FREE code for PPC's hashed tlb.
>
> Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
> ---
>
> Nick Piggin wrote:
>> Nick Piggin wrote:
>>>>> 3) because of this, we can treat any such accesses as
>>>>> happening simultaneously with the MADV_FREE and as illegal,
>>>>> aka undefined behaviour territory and we do not need to
>>>>> worry about them
>>>>
>>>> Yes, but I'm wondering if it is legal in all architectures.
>>>
>>> It's similar to trying to access memory during an munmap.
>>> You may be able to for a short time, but it'll come back to
>>> haunt you.
>>
>> The question is whether the architecture specific tlb flushing
>> code will break or not.
>
> I guess we'll need to call tlb_remove_tlb_entry() inside the
> MADV_FREE code to keep powerpc happy.
>
> Thanks for pointing this one out.
>
>>> Even then we do. Each invocation of zap_pte_range() only touches
>>> one page table page, and it flushes the TLB before releasing the
>>> page table lock.
>>
>> What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.
>
> Oh dear. I see it now...
>
> The tlb end things inside zap_pte_range() are actually noops and
> the actual tlb flush only happens inside zap_page_range().
>
> I guess the fact that munmap gets the mmap_sem for writing should
> save us, though...

What about an unmap_mapping_range, or another MADV_FREE or
MADV_DONTNEED?

> --- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.0 -0400
> +++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.0 -0400
> @@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
>  				}
>  				ptep_test_and_clear_dirty(vma, addr, pte);
>  				ptep_test_and_clear_young(vma, addr, pte);
> +				tlb_remove_tlb_entry(tlb, pte, addr);
>  				SetPageLazyFree(page);
>  				if (PageActive(page))
>  					deactivate_tail_page(page);

--
SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
This should fix the MADV_FREE code for PPC's hashed tlb.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
---

Nick Piggin wrote:
> Nick Piggin wrote:
>>>> 3) because of this, we can treat any such accesses as
>>>> happening simultaneously with the MADV_FREE and as illegal,
>>>> aka undefined behaviour territory and we do not need to
>>>> worry about them
>>>
>>> Yes, but I'm wondering if it is legal in all architectures.
>>
>> It's similar to trying to access memory during an munmap.
>> You may be able to for a short time, but it'll come back to
>> haunt you.
>
> The question is whether the architecture specific tlb flushing
> code will break or not.

I guess we'll need to call tlb_remove_tlb_entry() inside the
MADV_FREE code to keep powerpc happy.

Thanks for pointing this one out.

>> Even then we do. Each invocation of zap_pte_range() only touches
>> one page table page, and it flushes the TLB before releasing the
>> page table lock.
>
> What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.

Oh dear. I see it now...

The tlb end things inside zap_pte_range() are actually noops and
the actual tlb flush only happens inside zap_page_range().

I guess the fact that munmap gets the mmap_sem for writing should
save us, though...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

--- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.0 -0400
@@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
 				}
 				ptep_test_and_clear_dirty(vma, addr, pte);
 				ptep_test_and_clear_young(vma, addr, pte);
+				tlb_remove_tlb_entry(tlb, pte, addr);
 				SetPageLazyFree(page);
 				if (PageActive(page))
 					deactivate_tail_page(page);
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> Use TLB batching for MADV_FREE. Adds another 10-15% extra
> performance to the MySQL sysbench results on my quad core system.
>
> Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
> ---
>
> Nick Piggin wrote:
>>> 3) because of this, we can treat any such accesses as
>>> happening simultaneously with the MADV_FREE and as illegal,
>>> aka undefined behaviour territory and we do not need to
>>> worry about them
>>
>> Yes, but I'm wondering if it is legal in all architectures.
>
> It's similar to trying to access memory during an munmap.
> You may be able to for a short time, but it'll come back to
> haunt you.

The question is whether the architecture specific tlb flushing
code will break or not.

>>> 4) because we flush the tlb before releasing the page table
>>> lock, other CPUs cannot remove this page from the address
>>> space - they will block on the page table lock before looking
>>> at this pte
>>
>> We don't when the ptl is split.
>
> Even then we do. Each invocation of zap_pte_range() only touches
> one page table page, and it flushes the TLB before releasing the
> page table lock.

What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.

--
SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> First some ebizzy runs...

This is interesting. Ginormous speedups in ebizzy[1] on my quad core
test system. The following numbers are the average of 10 runs, since
ebizzy shows some variability.

You can see a big influence from the tlb batching and from Nick's
madv_sem patch. The reduction in system time from 100 seconds to 3
seconds is way more than I had expected, but I'm not complaining.
The 4 fold reduction in wall clock time is a nice bonus.

According to Val, ebizzy shows the weaknesses of Linux with a real
workload, so this could be a useful result.

kernel            user  system  wall clock  %CPU
vanilla           186s    101s        123s  230%
madv_free (madv)  175s     96s        120s  230%
mmap_sem (sem)    100s     40s         40s  370%
madv+sem          200s    140s        100s  393%
madv+sem+tlb      118s      3s         30s  395%
madv+tlb          150s     10s         50s  310%

[1] http://www.ussg.iu.edu/hypermail/linux/kernel/0604.2/1699.html

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Use TLB batching for MADV_FREE. Adds another 10-15% extra
performance to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
---

Nick Piggin wrote:
>> 3) because of this, we can treat any such accesses as
>> happening simultaneously with the MADV_FREE and as illegal,
>> aka undefined behaviour territory and we do not need to
>> worry about them
>
> Yes, but I'm wondering if it is legal in all architectures.

It's similar to trying to access memory during an munmap.
You may be able to for a short time, but it'll come back to
haunt you.

>> 4) because we flush the tlb before releasing the page table
>> lock, other CPUs cannot remove this page from the address
>> space - they will block on the page table lock before looking
>> at this pte
>
> We don't when the ptl is split.

Even then we do. Each invocation of zap_pte_range() only touches
one page table page, and it flushes the TLB before releasing the
page table lock.

> What the tlb flush used to be able to assume is that the page
> has been removed from the pagetables when they are put in the
> tlb flush batch.

All the tlb flush code seems to assume is that the tlb entries
should be invalidated.

> I'm not saying there is any bugs, but just suggesting there
> might be.

Jakub found a potential bug, in that I did not use an atomic
operation to clear the page table entries.

I've attached a new patch which simply uses
ptep_test_and_clear_dirty/young to get rid of the dirty and
accessed bits. It uses the same atomic accesses we use elsewhere
in the VM and the code is a line shorter than before.

Andrew, please use this one.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

--- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.0 -0400
@@ -677,11 +677,14 @@ static unsigned long zap_pte_range(struc
 					remove_exclusive_swap_page(page);
 					unlock_page(page);
 				}
-				ptep_clear_flush_dirty(vma, addr, pte);
-				ptep_clear_flush_young(vma, addr, pte);
+				ptep_test_and_clear_dirty(vma, addr, pte);
+				ptep_test_and_clear_young(vma, addr, pte);
 				SetPageLazyFree(page);
 				if (PageActive(page))
 					deactivate_tail_page(page);
+				/* tlb_remove_page frees it again */
+				get_page(page);
+				tlb_remove_page(tlb, page);
 				continue;
 			}
 		}
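For readers following the batching argument, here is a toy userspace model of the idea, not kernel code. It only counts how many flushes a batched scheme would issue compared with flushing once per page. The names demo_tlb_batch, demo_queue_page, demo_flush and demo_zap_range, and the batch capacity of 8, are hypothetical and chosen purely for illustration; the real mmu_gather machinery is considerably more involved.

```c
#include <stddef.h>

#define DEMO_BATCH_MAX 8	/* hypothetical batch capacity */

struct demo_tlb_batch {
	unsigned long addrs[DEMO_BATCH_MAX];	/* pages queued for flushing */
	size_t nr;				/* entries currently queued */
	unsigned long flushes;			/* simulated flushes issued */
};

/* Stand-in for an expensive cross-CPU TLB flush; here it only counts. */
static void demo_flush(struct demo_tlb_batch *b)
{
	if (b->nr == 0)
		return;
	b->flushes++;
	b->nr = 0;
}

/* Queue one page and flush only when the batch is full. */
static void demo_queue_page(struct demo_tlb_batch *b, unsigned long addr)
{
	b->addrs[b->nr++] = addr;
	if (b->nr == DEMO_BATCH_MAX)
		demo_flush(b);
}

/*
 * Walk npages pages starting at start, then flush whatever is still
 * queued before returning, so nothing stays pending once the
 * (simulated) call completes. Returns the total flush count.
 */
static unsigned long demo_zap_range(struct demo_tlb_batch *b,
				    unsigned long start, size_t npages,
				    unsigned long page_size)
{
	for (size_t i = 0; i < npages; i++)
		demo_queue_page(b, start + i * page_size);
	demo_flush(b);
	return b->flushes;
}
```

With this capacity, walking 20 pages issues 3 simulated flushes (after pages 8 and 16, plus the final one) where a flush-per-page scheme would issue 20.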
Re: [PATCH] lazy freeing of memory through MADV_FREE
On Mon, Apr 23, 2007 at 08:21:37PM +1000, Nick Piggin wrote:
> I guess it is a good idea to batch these things. But can you
> do that on all architectures? What happens if your tlb flush
> happens after another thread already accesses it again, or
> after it subsequently gets removed from the address space via
> another CPU?

Accessing the page by another thread before madvise (MADV_FREE)
returns is undefined behavior, it can act as if that access happened
right before the madvise (MADV_FREE) call or right after it. That's ok
for glibc and supposedly any other malloc implementation: madvise
(MADV_FREE) is called while holding the containing arena's lock, and
for whatever malloc implementation, madvise (MADV_FREE) would be part
of free operations and you definitely need some synchronization
between one thread freeing some memory and another thread deciding to
reuse that memory and return it from malloc/realloc/calloc/etc.

My only concern is whether using a non-atomic update of the pte is ok
or not. The ptep_test_and_clear_young/ptep_test_and_clear_dirty calls
Rik's patch was doing before are done using atomic instructions, at
least on x86_64. The operation we want for MADV_FREE is: clear the
young/dirty bits if they have been set on entry to the MADV_FREE
madvise call, with undefined values for these 2 bits if some other
task modifies the young/dirty bits concurrently with this MADV_FREE
zap_page_range, but I'd say other bits need to be unmodified. Now, is
there some kernel code which, while either not holding the
corresponding mmap_sem at all or holding it just down_read, modifies
other bits in the pte? If yes, we need to do this clearing atomically,
basically do a cmpxchg loop until we succeed in clearing the 2 bits
and then flush the tlb if any of them was set before
(ptep_test_and_clear_dirty_and_young?); if not, set_pte_at is ok and
faster than a lock prefixed insn.

	Jakub
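The cmpxchg loop Jakub describes can be sketched in userspace with C11 atomics. This is only an illustration of the technique, not the kernel's pte accessors: the function name demo_test_and_clear_dirty_and_young, the use of a plain 64-bit word as a stand-in for a pte, and the bit positions are all assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only; real pte layouts
 * are architecture specific. */
#define DEMO_PTE_YOUNG ((uint64_t)1 << 5)
#define DEMO_PTE_DIRTY ((uint64_t)1 << 6)

/*
 * Atomically clear both bits in *pte and leave every other bit alone,
 * even if another thread changes the word concurrently.
 * Returns true if either bit was set beforehand, i.e. the caller
 * would then need to flush the corresponding TLB entry.
 */
static bool demo_test_and_clear_dirty_and_young(_Atomic uint64_t *pte)
{
	const uint64_t mask = DEMO_PTE_YOUNG | DEMO_PTE_DIRTY;
	uint64_t old = atomic_load(pte);

	/* On failure the compare-exchange reloads 'old' with the current
	 * value, so each retry works from fresh contents. */
	while (!atomic_compare_exchange_weak(pte, &old, old & ~mask))
		;

	return (old & mask) != 0;
}
```

The non-atomic alternative (read the word, mask it, write it back) is cheaper but can overwrite a concurrent change to some other bit, which is exactly the case the question about mmap_sem is probing.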
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> Nick Piggin wrote:
>>> It looks like the tlb flushes (and IPIs) from zap_pte_range()
>>> could have been the problem. They're gone now.
>>
>> I guess it is a good idea to batch these things. But can you
>> do that on all architectures? What happens if your tlb flush
>> happens after another thread already accesses it again, or
>> after it subsequently gets removed from the address space via
>> another CPU?
>
> I have thought about this a lot tonight, and have come to the
> conclusion that they are ok.
>
> The reason is simple:
>
> 1) we do the TLB flush before we return from the
>    madvise(MADV_FREE) syscall.
>
> 2) anything that accesses the pages between the start
>    and end of the MADV_FREE procedure does not know in
>    which order we go through the pages, so it could hit
>    a page either before or after we get to processing it
>
> 3) because of this, we can treat any such accesses as
>    happening simultaneously with the MADV_FREE and
>    as illegal, aka undefined behaviour territory and
>    we do not need to worry about them

Yes, but I'm wondering if it is legal in all architectures.

> 4) because we flush the tlb before releasing the page
>    table lock, other CPUs cannot remove this page from
>    the address space - they will block on the page
>    table lock before looking at this pte

We don't when the ptl is split.

What the tlb flush used to be able to assume is that the page
has been removed from the pagetables when they are put in the
tlb flush batch.

I'm not saying there is any bugs, but just suggesting there
might be.

--
SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote:
>> It looks like the tlb flushes (and IPIs) from zap_pte_range()
>> could have been the problem. They're gone now.
>
> I guess it is a good idea to batch these things. But can you
> do that on all architectures? What happens if your tlb flush
> happens after another thread already accesses it again, or
> after it subsequently gets removed from the address space via
> another CPU?

I have thought about this a lot tonight, and have come to the
conclusion that they are ok.

The reason is simple:

1) we do the TLB flush before we return from the
   madvise(MADV_FREE) syscall.

2) anything that accesses the pages between the start
   and end of the MADV_FREE procedure does not know in
   which order we go through the pages, so it could hit
   a page either before or after we get to processing it

3) because of this, we can treat any such accesses as
   happening simultaneously with the MADV_FREE and
   as illegal, aka undefined behaviour territory and
   we do not need to worry about them

4) because we flush the tlb before releasing the page
   table lock, other CPUs cannot remove this page from
   the address space - they will block on the page
   table lock before looking at this pte

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> Use TLB batching for MADV_FREE. Adds another 10-15% extra
> performance to the MySQL sysbench results on my quad core system.
>
> Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
> ---
>
> Rik van Riel wrote:
>> I've added a 5th column, with just your mmap_sem patch and
>> without my madv_free patch. It is run with the glibc patch,
>> which should make it fall back to MADV_DONTNEED after the first
>> MADV_FREE call fails.
>
> With the attached patch to make MADV_FREE use tlb batching, not
> only do we gain an additional 10-15% performance but Nick's
> mmap_sem patch also shows the performance increase that we
> expected to see.
>
> It looks like the tlb flushes (and IPIs) from zap_pte_range()
> could have been the problem. They're gone now.

I guess it is a good idea to batch these things. But can you
do that on all architectures? What happens if your tlb flush
happens after another thread already accesses it again, or
after it subsequently gets removed from the address space via
another CPU?

> The second column from the right has Nick's patch and my own
> two patches. Performance with 16 threads is almost triple what
> it used to be...
>
>          vanilla  glibc  glibc      glibc      glibc     glibc      glibc
>                          madv_free  madv_free            madv_free  madv_free
>                                     mmap_sem   mmap_sem  mmap_sem
>                                                          tlb batch  tlb_batch
> threads
> 1            610    609        596        545       534        547        537
> 2           1032   1136       1196       1200      1180       1293       1194
> 4           1070   1128       2014       2024      2027       2248       2040
> 8           1000   1088       1665       2087      2089       2314       1869
> 16           779   1073       1310       1999      2012       2214       1557
>
> Now that I think about it - this is all with the rawhide kernel
> configuration, which has an ungodly number of debug config
> options enabled.
>
> I should try this with a more normal kernel, on various different
> systems. This is for another day. :)
>
> First some ebizzy runs...
>
> --- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.0 -0400
> +++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.0 -0400
> @@ -677,11 +677,15 @@ static unsigned long zap_pte_range(struc
>  					remove_exclusive_swap_page(page);
>  					unlock_page(page);
>  				}
> -				ptep_clear_flush_dirty(vma, addr, pte);
> -				ptep_clear_flush_young(vma, addr, pte);
>  				SetPageLazyFree(page);
>  				if (PageActive(page))
>  					deactivate_tail_page(page);
> +				ptent = *pte;
> +				set_pte_at(mm, addr, pte,
> +					pte_mkclean(pte_mkold(ptent)));
> +				/* tlb_remove_page frees it again */
> +				get_page(page);
> +				tlb_remove_page(tlb, page);
>  				continue;
>  			}
>  		}

--
SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote:
> I haven't tested your MADV_FREE patch yet.

Good. It turned out that one behaved a bit strange without tlb
batching anyway.

I'm now running ebizzy across the whole set of kernels I tested
before, and will post the results in a bit.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote:
> Rik van Riel wrote:
>>>> I've added a 5th column, with just your mmap_sem patch and
>>>> without my madv_free patch. It is run with the glibc patch,
>>>> which should make it fall back to MADV_DONTNEED after the first
>>>> MADV_FREE call fails.
>>>
>>> Thanks! (I edited slightly so it doesn't wrap)
>>>
>>>          vanilla  new glibc  madv_free  mmap_sem  both
>>> threads
>>> 1            610        609        596       534   545
>>> 2           1032       1136       1196      1180  1200
>>> 4           1070       1128       2014      2027  2024
>>> 8           1000       1088       1665      2089  2087
>>> 16           779       1073       1310      2012  1999
>>>
>>> Not doing the mprotect calls is the big one I guess, especially
>>> the fact that we don't need to take the mmap_sem for writing.
>>
>> Yes.
>>
>> With both our patches, single and two thread performance with
>> MySQL sysbench is somewhat better than with just your patch,
>> 4 and 8 thread performance are basically the same and just
>> your patch gives a slight benefit with 16 threads.
>>
>> I guess I should benchmark up to 64 or 128 threads tomorrow,
>> to see if this is just luck or if the cache benefit of doing
>> the page faults and reusing hot pages is faster than not
>> having page faults at all.
>>
>> I should run some benchmarks on other systems, too. Some of
>> these results could be an artifact of my quad core CPU. The
>> results could be very different on other systems...
>
> I'm getting the 16 core box out of retirement as we speak :)

OK, 10 runs at 1 client, 2.6.21-rc6, MySQL version 5.33, and Jakub's
new glibc gives a 99.9% confidence of:

vanilla:  467.2 +/- 7.9 (tps)
mmap_sem: 470.5 +/- 9.3 (tps)

However, it seems those means jump around a bit from boot to boot, so
there could be some memory placement luck for cache and/or NUMA
goodness involved.

So I think it is safe to say that the mmap_sem patch doesn't hurt
single threaded performance (from looking at the numbers and the
patch). And that's the most important thing for that patch.

I'll post some scalability results tomorrow. From my first round of
tests, after new glibc and the mmap_sem patch, it doesn't seem like
rwsem improvements, private futexes, or avoiding zero_page make any
significant differences.

I haven't tested your MADV_FREE patch yet.

--
SUSE Labs, Novell Inc.
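The fallback behaviour attributed to the glibc patch in this thread (try MADV_FREE, and switch to MADV_DONTNEED once it has failed) can be sketched from userspace roughly as below. This is not the glibc patch itself, which is not shown in the thread. The helper name demo_release_pages and the choice to treat EINVAL as "kernel does not know MADV_FREE" are assumptions for illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for madvise() and the MADV_* constants */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Tell the kernel the pages in [addr, addr + len) are no longer needed.
 * Prefers the lazy MADV_FREE and falls back to MADV_DONTNEED for good
 * after the first EINVAL. Returns 0 on success, -1 with errno set.
 *
 * Assumption: the caller serialises calls (for example by holding the
 * allocator's arena lock), so the plain static flag is not raced on.
 */
static int demo_release_pages(void *addr, size_t len)
{
#ifdef MADV_FREE
	static int madv_free_unsupported;	/* sticky after first EINVAL */

	if (!madv_free_unsupported) {
		if (madvise(addr, len, MADV_FREE) == 0)
			return 0;
		if (errno != EINVAL)
			return -1;
		madv_free_unsupported = 1;
	}
#endif
	return madvise(addr, len, MADV_DONTNEED);
}
```

Remembering the failure means an older kernel pays for the unsupported call only once, which matters when the call sits on the free() path.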
Re: [PATCH] lazy freeing of memory through MADV_FREE
Use TLB batching for MADV_FREE. Adds another 10-15% extra
performance to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
---

Rik van Riel wrote:
> I've added a 5th column, with just your mmap_sem patch and
> without my madv_free patch. It is run with the glibc patch,
> which should make it fall back to MADV_DONTNEED after the first
> MADV_FREE call fails.

With the attached patch to make MADV_FREE use tlb batching, not
only do we gain an additional 10-15% performance but Nick's
mmap_sem patch also shows the performance increase that we
expected to see.

It looks like the tlb flushes (and IPIs) from zap_pte_range()
could have been the problem. They're gone now.

The second column from the right has Nick's patch and my own
two patches. Performance with 16 threads is almost triple what
it used to be...

         vanilla  glibc  glibc      glibc      glibc     glibc      glibc
                         madv_free  madv_free            madv_free  madv_free
                                    mmap_sem   mmap_sem  mmap_sem
                                                         tlb batch  tlb_batch
threads
1            610    609        596        545       534        547        537
2           1032   1136       1196       1200      1180       1293       1194
4           1070   1128       2014       2024      2027       2248       2040
8           1000   1088       1665       2087      2089       2314       1869
16           779   1073       1310       1999      2012       2214       1557

Now that I think about it - this is all with the rawhide kernel
configuration, which has an ungodly number of debug config
options enabled.

I should try this with a more normal kernel, on various different
systems. This is for another day. :)

First some ebizzy runs...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

--- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.0 -0400
@@ -677,11 +677,15 @@ static unsigned long zap_pte_range(struc
 					remove_exclusive_swap_page(page);
 					unlock_page(page);
 				}
-				ptep_clear_flush_dirty(vma, addr, pte);
-				ptep_clear_flush_young(vma, addr, pte);
 				SetPageLazyFree(page);
 				if (PageActive(page))
 					deactivate_tail_page(page);
+				ptent = *pte;
+				set_pte_at(mm, addr, pte,
+					pte_mkclean(pte_mkold(ptent)));
+				/* tlb_remove_page frees it again */
+				get_page(page);
+				tlb_remove_page(tlb, page);
 				continue;
 			}
 		}
Re: [PATCH] lazy freeing of memory through MADV_FREE
Use TLB batching for MADV_FREE. Adds another 10-15% extra performance to the MySQL sysbench results on my quad core system. Signed-off-by: Rik van Riel [EMAIL PROTECTED] --- Rik van Riel wrote: I've added a 5th column, with just your mmap_sem patch and without my madv_free patch. It is run with the glibc patch, which should make it fall back to MADV_DONTNEED after the first MADV_FREE call fails. With the attached patch to make MADV_FREE use tlb batching, not only do we gain an additional 10-15% performance but Nick's mmap_sem patch also shows the performance increase that we expected to see. It looks like the tlb flushes (and IPIs) from zap_pte_range() could have been the problem. They're gone now. The second column from the right has Nick's patch and my own two patches. Performance with 16 threads is almost triple what it used to be... vanilla glibc glibc glibcglibc glibc glibc madv_free madv_free madv_free madv_free mmap_sem mmap_sem mmap_sem tlb batch tlb_batch threads 1 610 609 596 545 534 547 537 21032113611961200118012931194 41070112820142024202722482040 81000108816652087208923141869 16779107313101999201222141557 Now that I think about it - this is all with the rawhide kernel configuration, which has an ungodly number of debug config options enabled. I should try this with a more normal kernel, on various different systems. This is for another day. :) First some ebizzy runs... -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. 
--- linux-2.6.20.x86_64/mm/memory.c.orig 2007-04-23 02:48:36.0 -0400 +++ linux-2.6.20.x86_64/mm/memory.c 2007-04-23 02:54:42.0 -0400 @@ -677,11 +677,15 @@ static unsigned long zap_pte_range(struc remove_exclusive_swap_page(page); unlock_page(page); } - ptep_clear_flush_dirty(vma, addr, pte); - ptep_clear_flush_young(vma, addr, pte); SetPageLazyFree(page); if (PageActive(page)) deactivate_tail_page(page); + ptent = *pte; + set_pte_at(mm, addr, pte, + pte_mkclean(pte_mkold(ptent))); + /* tlb_remove_page frees it again */ + get_page(page); + tlb_remove_page(tlb, page); continue; } }
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote: Rik van Riel wrote: I've added a 5th column, with just your mmap_sem patch and without my madv_free patch. It is run with the glibc patch, which should make it fall back to MADV_DONTNEED after the first MADV_FREE call fails. Thanks! (I edited slightly so it doesn't wrap) vanilla new glibc madv_freemmap_semboth threads 1 610 609 596 534 545 210321136119611801200 410701128201420272024 810001088166520892087 167791073131020121999 Not doing the mprotect calls is the big one I guess, especially the fact that we don't need to take the mmap_sem for writing. Yes. With both our patches, single and two thread performance with MySQL sysbench is somewhat better than with just your patch, 4 and 8 thread performance are basically the same and just your patch gives a slight benefit with 16 threads. I guess I should benchmark up to 64 or 128 threads tomorrow, to see if this is just luck or if the cache benefit of doing the page faults and reusing hot pages is faster than not having page faults at all. I should run some benchmarks on other systems, too. Some of these results could be an artifact of my quad core CPU. The results could be very different on other systems... I'm getting the 16 core box out of retirement as we speak :) OK, 10 runs at 1 client, 2.6.21-rc6, MySQL version 5.33, and new Jakub's glibc gives a 99.9% confidence of: vanilla: 467.2 +/- 7.9 (tps) mmap_sem: 470.5 +/- 9.3 (tps) However, it seems those means jump around a bit from boot to boot, so there could be some some memory placement luck for cache and/or NUMA goodness involved. So I think it is safe to say that the mmap_sem patch doesn't hurt single threaded performance (from looking at the numbers and the patch). And that's the most important thing for that patch. I'll post some scalability results tomorrow. From my first round of tests, after new glibc and the mmap_sem patch, it doesn't seem like rwsem improvements, private futexes, or avoiding zero_page make any significant differences. 
I haven't tested your MADV_FREE patch yet. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote: I haven't tested your MADV_FREE patch yet. Good. It turned out that one behaved a bit strange without tlb batching anyway. I'm now running ebizzy across the whole set of kernels I tested before, and will post the results in a bit. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote: Use TLB batching for MADV_FREE. Adds another 10-15% extra performance to the MySQL sysbench results on my quad core system. Signed-off-by: Rik van Riel [EMAIL PROTECTED] --- Rik van Riel wrote: I've added a 5th column, with just your mmap_sem patch and without my madv_free patch. It is run with the glibc patch, which should make it fall back to MADV_DONTNEED after the first MADV_FREE call fails. With the attached patch to make MADV_FREE use tlb batching, not only do we gain an additional 10-15% performance but Nick's mmap_sem patch also shows the performance increase that we expected to see. It looks like the tlb flushes (and IPIs) from zap_pte_range() could have been the problem. They're gone now. I guess it is a good idea to batch these things. But can you do that on all architectures? What happens if your tlb flush happens after another thread already accesses it again, or after it subsequently gets removed from the address space via another CPU? The second column from the right has Nick's patch and my own two patches. Performance with 16 threads is almost triple what it used to be... vanilla glibc glibc glibcglibc glibc glibc madv_free madv_free madv_free madv_free mmap_sem mmap_sem mmap_sem tlb batch tlb_batch threads 1 610 609 596 545 534 547 537 21032113611961200118012931194 41070112820142024202722482040 81000108816652087208923141869 16779107313101999201222141557 Now that I think about it - this is all with the rawhide kernel configuration, which has an ungodly number of debug config options enabled. I should try this with a more normal kernel, on various different systems. This is for another day. :) First some ebizzy runs... 
> --- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.0 -0400
> +++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.0 -0400
> @@ -677,11 +677,15 @@ static unsigned long zap_pte_range(struc
>  				remove_exclusive_swap_page(page);
>  				unlock_page(page);
>  			}
> -			ptep_clear_flush_dirty(vma, addr, pte);
> -			ptep_clear_flush_young(vma, addr, pte);
>  			SetPageLazyFree(page);
>  			if (PageActive(page))
>  				deactivate_tail_page(page);
> +			ptent = *pte;
> +			set_pte_at(mm, addr, pte,
> +				pte_mkclean(pte_mkold(ptent)));
> +			/* tlb_remove_page frees it again */
> +			get_page(page);
> +			tlb_remove_page(tlb, page);
>  			continue;
>  		}
>  	}

-- SUSE Labs, Novell Inc.
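For readers following the batching argument, here is a rough userspace model of why gathering pages and flushing once per batch cuts down the number of TLB flushes (and hence IPIs). It is only a counting toy, not kernel code: struct toy_gather, the toy_* helpers and BATCH_CAPACITY are invented for illustration, and the real mmu_gather machinery is sized and structured differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical batch capacity; not the kernel's real mmu_gather size. */
#define BATCH_CAPACITY 64

struct toy_gather {
    size_t queued;   /* pages waiting for the next flush */
    size_t flushes;  /* simulated TLB flushes issued so far */
};

/* Issue one simulated flush covering everything queued so far. */
static void toy_flush(struct toy_gather *g)
{
    if (g->queued == 0)
        return;
    g->flushes++;
    g->queued = 0;
}

/* Queue one page and flush only when the batch is full. */
static void toy_remove_page(struct toy_gather *g)
{
    if (++g->queued == BATCH_CAPACITY)
        toy_flush(g);
}

/* Flush count when every page triggers its own flush. */
static size_t flushes_unbatched(size_t npages)
{
    return npages;
}

/* Flush count when pages are gathered and flushed in batches. */
static size_t flushes_batched(size_t npages)
{
    struct toy_gather g = { 0, 0 };

    for (size_t i = 0; i < npages; i++)
        toy_remove_page(&g);
    toy_flush(&g);   /* final flush before returning to the caller */
    return g.flushes;
}
```

With these made-up numbers, a 1000 page range costs 1000 flushes unbatched and 16 batched. The model says nothing about correctness of deferring the flush, which is the question raised in this message.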
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote:
> > It looks like the tlb flushes (and IPIs) from zap_pte_range() could have been the problem. They're gone now.
>
> I guess it is a good idea to batch these things. But can you do that on all architectures? What happens if your tlb flush happens after another thread already accesses it again, or after it subsequently gets removed from the address space via another CPU?

I have thought about this a lot tonight, and have come to the conclusion that they are ok.

The reason is simple:

1) we do the TLB flush before we return from the madvise(MADV_FREE) syscall.

2) anything that accesses the pages between the start and end of the MADV_FREE procedure does not know in which order we go through the pages, so it could hit a page either before or after we get to processing it

3) because of this, we can treat any such accesses as happening simultaneously with the MADV_FREE and as illegal, aka undefined behaviour territory and we do not need to worry about them

4) because we flush the tlb before releasing the page table lock, other CPUs cannot remove this page from the address space - they will block on the page table lock before looking at this pte
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> Nick Piggin wrote:
> > > It looks like the tlb flushes (and IPIs) from zap_pte_range() could have been the problem. They're gone now.
> >
> > I guess it is a good idea to batch these things. But can you do that on all architectures? What happens if your tlb flush happens after another thread already accesses it again, or after it subsequently gets removed from the address space via another CPU?
>
> I have thought about this a lot tonight, and have come to the conclusion that they are ok.
>
> The reason is simple:
>
> 1) we do the TLB flush before we return from the madvise(MADV_FREE) syscall.
>
> 2) anything that accesses the pages between the start and end of the MADV_FREE procedure does not know in which order we go through the pages, so it could hit a page either before or after we get to processing it
>
> 3) because of this, we can treat any such accesses as happening simultaneously with the MADV_FREE and as illegal, aka undefined behaviour territory and we do not need to worry about them

Yes, but I'm wondering if it is legal in all architectures.

> 4) because we flush the tlb before releasing the page table lock, other CPUs cannot remove this page from the address space - they will block on the page table lock before looking at this pte

We don't when the ptl is split.

What the tlb flush used to be able to assume is that the page has been removed from the pagetables when they are put in the tlb flush batch.

I'm not saying there are any bugs, but just suggesting there might be.

-- SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
On Mon, Apr 23, 2007 at 08:21:37PM +1000, Nick Piggin wrote:
> I guess it is a good idea to batch these things. But can you do that on all architectures? What happens if your tlb flush happens after another thread already accesses it again, or after it subsequently gets removed from the address space via another CPU?

Accessing the page by another thread before madvise (MADV_FREE) returns is undefined behavior, it can act as if that access happened right before the madvise (MADV_FREE) call or right after it. That's ok for glibc and supposedly any other malloc implementation: madvise (MADV_FREE) is called while holding the containing arena's lock, and for whatever malloc implementation, madvise (MADV_FREE) would be part of free operations and you definitely need some synchronization between one thread freeing some memory and another thread deciding to reuse that memory and return it from malloc/realloc/calloc/etc.

My only concern is whether using non-atomic update of the pte is ok or not. ptep_test_and_clear_young/ptep_test_and_clear_dirty, which Rik's patch was doing before, are done using atomic instructions, at least on x86_64. The operation we want for MADV_FREE is: clear young/dirty bits if they have been set on entry to the MADV_FREE madvise call, undefined values for these 2 bits if some other task modifies the young/dirty bits concurrently with this MADV_FREE zap_page_range, but I'd say other bits need to be unmodified.

Now, is there some kernel code which, while either not holding the corresponding mmap_sem at all or holding it just down_read, modifies other bits in the pte? If yes, we need to do this clearing atomically, basically do a cmpxchg loop until we succeed in clearing the 2 bits and then flush the tlb if any of them was set before (ptep_test_and_clear_dirty_and_young?); if not, set_pte_at is ok and faster than a lock prefixed insn.

	Jakub
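The cmpxchg loop suggested above could look roughly like the following userspace sketch. It uses C11 atomics on a plain 64-bit word rather than the kernel's pte helpers; the TOY_PTE_* bit positions and the name toy_test_and_clear_dirty_and_young are assumptions made for this example, not an existing kernel interface.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this example. */
#define TOY_PTE_YOUNG (UINT64_C(1) << 5)
#define TOY_PTE_DIRTY (UINT64_C(1) << 6)

/*
 * Atomically clear the young and dirty bits of *pte and leave every
 * other bit untouched. Returns true if either bit was set beforehand,
 * which is the case where the caller would need a TLB flush.
 */
static bool toy_test_and_clear_dirty_and_young(_Atomic uint64_t *pte)
{
    const uint64_t mask = TOY_PTE_YOUNG | TOY_PTE_DIRTY;
    uint64_t old = atomic_load(pte);

    /* On failure 'old' is refreshed with the current value; retry. */
    while (!atomic_compare_exchange_weak(pte, &old, old & ~mask))
        ;

    return (old & mask) != 0;
}
```

Because the new value is always computed from the value that the compare-exchange actually observed, a concurrent change to any other bit is preserved rather than overwritten, which is the property a plain read followed by set_pte_at cannot guarantee.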
Re: [PATCH] lazy freeing of memory through MADV_FREE
Use TLB batching for MADV_FREE. Adds another 10-15% extra performance to the MySQL sysbench results on my quad core system.

Signed-off-by: Rik van Riel [EMAIL PROTECTED]
---
Nick Piggin wrote:
> > 3) because of this, we can treat any such accesses as happening simultaneously with the MADV_FREE and as illegal, aka undefined behaviour territory and we do not need to worry about them
>
> Yes, but I'm wondering if it is legal in all architectures.

It's similar to trying to access memory during an munmap. You may be able to for a short time, but it'll come back to haunt you.

> > 4) because we flush the tlb before releasing the page table lock, other CPUs cannot remove this page from the address space - they will block on the page table lock before looking at this pte
>
> We don't when the ptl is split.

Even then we do. Each invocation of zap_pte_range() only touches one page table page, and it flushes the TLB before releasing the page table lock.

> What the tlb flush used to be able to assume is that the page has been removed from the pagetables when they are put in the tlb flush batch.

All the tlb flush code seems to assume is that the tlb entries should be invalidated.

> I'm not saying there are any bugs, but just suggesting there might be.

Jakub found a potential bug, in that I did not use an atomic operation to clear the page table entries.

I've attached a new patch which simply uses ptep_test_and_clear_dirty/young to get rid of the dirty and accessed bits. It uses the same atomic accesses we use elsewhere in the VM and the code is a line shorter than before.

Andrew, please use this one.
--- linux-2.6.20.x86_64/mm/memory.c.orig	2007-04-23 02:48:36.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 02:54:42.0 -0400
@@ -677,11 +677,14 @@ static unsigned long zap_pte_range(struc
 				remove_exclusive_swap_page(page);
 				unlock_page(page);
 			}
-			ptep_clear_flush_dirty(vma, addr, pte);
-			ptep_clear_flush_young(vma, addr, pte);
+			ptep_test_and_clear_dirty(vma, addr, pte);
+			ptep_test_and_clear_young(vma, addr, pte);
 			SetPageLazyFree(page);
 			if (PageActive(page))
 				deactivate_tail_page(page);
+			/* tlb_remove_page frees it again */
+			get_page(page);
+			tlb_remove_page(tlb, page);
 			continue;
 		}
 	}
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> First some ebizzy runs...

This is interesting. Ginormous speedups in ebizzy[1] on my quad core test system. The following numbers are the average of 10 runs, since ebizzy shows some variability.

You can see a big influence from the tlb batching and from Nick's mmap_sem patch. The reduction in system time from 100 seconds to 3 seconds is way more than I had expected, but I'm not complaining. The 4 fold reduction in wall clock time is a nice bonus.

According to Val, ebizzy shows the weaknesses of Linux with a real workload, so this could be a useful result.

kernel              user    system  wall clock  %CPU
vanilla             186s    101s    123s        230%
madv_free (madv)    175s     96s    120s        230%
mmap_sem (sem)      100s     40s     40s        370%
madv+sem            200s    140s    100s        393%
madv+sem+tlb        118s      3s     30s        395%
madv+tlb            150s     10s     50s        310%

[1] http://www.ussg.iu.edu/hypermail/linux/kernel/0604.2/1699.html
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> Use TLB batching for MADV_FREE. Adds another 10-15% extra performance to the MySQL sysbench results on my quad core system.
>
> Signed-off-by: Rik van Riel [EMAIL PROTECTED]
> ---
> Nick Piggin wrote:
> > > 3) because of this, we can treat any such accesses as happening simultaneously with the MADV_FREE and as illegal, aka undefined behaviour territory and we do not need to worry about them
> >
> > Yes, but I'm wondering if it is legal in all architectures.
>
> It's similar to trying to access memory during an munmap. You may be able to for a short time, but it'll come back to haunt you.

The question is whether the architecture specific tlb flushing code will break or not.

> > > 4) because we flush the tlb before releasing the page table lock, other CPUs cannot remove this page from the address space - they will block on the page table lock before looking at this pte
> >
> > We don't when the ptl is split.
>
> Even then we do. Each invocation of zap_pte_range() only touches one page table page, and it flushes the TLB before releasing the page table lock.

What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.

-- SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
This should fix the MADV_FREE code for PPC's hashed tlb.

Signed-off-by: Rik van Riel [EMAIL PROTECTED]
---
Nick Piggin wrote:
> > Nick Piggin wrote:
> > > > 3) because of this, we can treat any such accesses as happening simultaneously with the MADV_FREE and as illegal, aka undefined behaviour territory and we do not need to worry about them
> > >
> > > Yes, but I'm wondering if it is legal in all architectures.
> >
> > It's similar to trying to access memory during an munmap. You may be able to for a short time, but it'll come back to haunt you.
>
> The question is whether the architecture specific tlb flushing code will break or not.

I guess we'll need to call tlb_remove_tlb_entry() inside the MADV_FREE code to keep powerpc happy.

Thanks for pointing this one out.

> > Even then we do. Each invocation of zap_pte_range() only touches one page table page, and it flushes the TLB before releasing the page table lock.
>
> What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.

Oh dear. I see it now...

The tlb end things inside zap_pte_range() are actually noops and the actual tlb flush only happens inside zap_page_range().

I guess the fact that munmap gets the mmap_sem for writing should save us, though...

--- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.0 -0400
@@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
 			}
 			ptep_test_and_clear_dirty(vma, addr, pte);
 			ptep_test_and_clear_young(vma, addr, pte);
+			tlb_remove_tlb_entry(tlb, pte, addr);
 			SetPageLazyFree(page);
 			if (PageActive(page))
 				deactivate_tail_page(page);
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> This should fix the MADV_FREE code for PPC's hashed tlb.
>
> Signed-off-by: Rik van Riel [EMAIL PROTECTED]
> ---
> Nick Piggin wrote:
> > > Nick Piggin wrote:
> > > > > 3) because of this, we can treat any such accesses as happening simultaneously with the MADV_FREE and as illegal, aka undefined behaviour territory and we do not need to worry about them
> > > >
> > > > Yes, but I'm wondering if it is legal in all architectures.
> > >
> > > It's similar to trying to access memory during an munmap. You may be able to for a short time, but it'll come back to haunt you.
> >
> > The question is whether the architecture specific tlb flushing code will break or not.
>
> I guess we'll need to call tlb_remove_tlb_entry() inside the MADV_FREE code to keep powerpc happy.
>
> Thanks for pointing this one out.
>
> > > Even then we do. Each invocation of zap_pte_range() only touches one page table page, and it flushes the TLB before releasing the page table lock.
> >
> > What kernel are you looking at? -rc7 and rc6-mm1 don't, AFAIKS.
>
> Oh dear. I see it now...
>
> The tlb end things inside zap_pte_range() are actually noops and the actual tlb flush only happens inside zap_page_range().
>
> I guess the fact that munmap gets the mmap_sem for writing should save us, though...

What about an unmap_mapping_range, or another MADV_FREE or MADV_DONTNEED?

> --- linux-2.6.20.x86_64/mm/memory.c.noppc	2007-04-23 21:50:09.0 -0400
> +++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 21:48:59.0 -0400
> @@ -679,6 +679,7 @@ static unsigned long zap_pte_range(struc
>  			}
>  			ptep_test_and_clear_dirty(vma, addr, pte);
>  			ptep_test_and_clear_young(vma, addr, pte);
> +			tlb_remove_tlb_entry(tlb, pte, addr);
>  			SetPageLazyFree(page);
>  			if (PageActive(page))
>  				deactivate_tail_page(page);

-- SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote:
> What the tlb flush used to be able to assume is that the page has been removed from the pagetables when they are put in the tlb flush batch.

I think this is still the case, to a degree. There should be no harm in removing the TLB entries after the page table has been unlocked, right?

Or is something like the attached really needed?

From what I can see, the page table lock should be enough synchronization between unmap_mapping_range, MADV_FREE and MADV_DONTNEED.

I don't see why we need the attached, but in case you find a good reason, here's my signed-off-by line for Andrew :)

Signed-off-by: Rik van Riel [EMAIL PROTECTED]

--- linux-2.6.20.x86_64/mm/memory.c.flushme	2007-04-23 22:26:06.0 -0400
+++ linux-2.6.20.x86_64/mm/memory.c	2007-04-23 22:42:06.0 -0400
@@ -628,6 +628,7 @@ static unsigned long zap_pte_range(struc
 				long *zap_work, struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
+	unsigned long start_addr = addr;
 	pte_t *pte;
 	spinlock_t *ptl;
 	int file_rss = 0;
@@ -726,6 +727,11 @@ static unsigned long zap_pte_range(struc
 
 	add_mm_rss(mm, file_rss, anon_rss);
 	arch_leave_lazy_mmu_mode();
+	if (details && details->madv_free) {
+		/* Protect against MADV_DONTNEED or unmap_mapping_range */
+		tlb_finish_mmu(tlb, start_addr, addr);
+		tlb = tlb_gather_mmu(mm, 0);
+	}
 	pte_unmap_unlock(pte - 1, ptl);
 
 	return addr;
Re: [PATCH] lazy freeing of memory through MADV_FREE
Jakub Jelinek wrote:
> On Fri, Apr 20, 2007 at 07:52:44PM -0400, Rik van Riel wrote:
> > It turns out that Nick's patch does not improve peak performance much, but it does prevent the decline when running with 16 threads on my quad core CPU!
> >
> > We _definitely_ want both patches, there's a huge benefit in having them both.
> >
> > Here are the transactions/seconds for each combination:
> >
> > threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
> > 1         610      609        596               545
> > 2        1032     1136       1196              1200
> > 4        1070     1128       2014              2024
> > 8        1000     1088       1665              2087
> > 16        779     1073       1310              1999
>
> FYI, I have uploaded a testing glibc that uses MADV_FREE and falls back to MADV_DONTNEED if MADV_FREE is not available, to http://people.redhat.com/jakub/glibc/2.5.90-21.1/

Hmm, I wonder how glibc malloc stacks up to tcmalloc on this test (after the mmap_sem patch as well).

I'll try running that as well!

-- SUSE Labs, Novell Inc.
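The fallback described for the testing glibc can be pictured with a small userspace sketch. This is not the glibc change itself, only an illustration under assumptions: release_range, demo_release and the madv_free_failed flag are names invented here, and the code presumes headers that may or may not define MADV_FREE.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Set once MADV_FREE has failed, so that later calls skip it. */
static int madv_free_failed;

/*
 * Tell the kernel that [addr, addr + len) is no longer needed.
 * addr must be page aligned. Returns 0 on success, -1 with errno set.
 */
static int release_range(void *addr, size_t len)
{
#ifdef MADV_FREE
    if (!madv_free_failed) {
        if (madvise(addr, len, MADV_FREE) == 0)
            return 0;
        madv_free_failed = 1;   /* fall back from now on */
    }
#endif
    return madvise(addr, len, MADV_DONTNEED);
}

/* Map, touch and release one anonymous region; returns 0 on success. */
static int demo_release(size_t len)
{
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int rc;

    if (p == MAP_FAILED)
        return -1;
    p[0] = 1;                   /* fault the first page in */
    rc = release_range(p, len);
    munmap(p, len);
    return rc;
}
```

Remembering the first failure keeps a kernel without MADV_FREE from paying for a rejected system call on every free.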
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote:
> So where is the down_write coming from in this workload, I wonder? Heap management? What syscalls?

Trying to answer this question, I straced the mysql threads that showed up in top when running a single threaded sysbench workload. There were no mmap, munmap, brk, mprotect or madvise system calls in the trace.

MySQL has me puzzled, but it seems to have some other people interested too.

I think I'll go play a bit with ebizzy now, to see how other workloads are affected by our kernel changes.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> I've added a 5th column, with just your mmap_sem patch and without my madv_free patch. It is run with the glibc patch, which should make it fall back to MADV_DONTNEED after the first MADV_FREE call fails.

Thanks! (I edited slightly so it doesn't wrap)

> threads  vanilla  new glibc  madv_free  mmap_sem  both
> 1         610      609        596        534       545
> 2        1032     1136       1196       1180      1200
> 4        1070     1128       2014       2027      2024
> 8        1000     1088       1665       2089      2087
> 16        779     1073       1310       2012      1999
>
> Not doing the mprotect calls is the big one I guess, especially the fact that we don't need to take the mmap_sem for writing.

Yes.

> With both our patches, single and two thread performance with MySQL sysbench is somewhat better than with just your patch, 4 and 8 thread performance are basically the same and just your patch gives a slight benefit with 16 threads.
>
> I guess I should benchmark up to 64 or 128 threads tomorrow, to see if this is just luck or if the cache benefit of doing the page faults and reusing hot pages is faster than not having page faults at all.
>
> I should run some benchmarks on other systems, too. Some of these results could be an artifact of my quad core CPU. The results could be very different on other systems...

I'm getting the 16 core box out of retirement as we speak :)

-- SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> Nick Piggin wrote:
> > Rik van Riel wrote:
> > > Nick Piggin wrote:
> > > > Rik van Riel wrote:
> > > > > Here are the transactions/seconds for each combination:
>
> I've added a 5th column, with just your mmap_sem patch and without my madv_free patch. It is run with the glibc patch, which should make it fall back to MADV_DONTNEED after the first MADV_FREE call fails.
>
> > > > > threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem  mmap_sem
> > > > > 1         610      609        596               545                   534
> > > > > 2        1032     1136       1196              1200                  1180
> > > > > 4        1070     1128       2014              2024                  2027
> > > > > 8        1000     1088       1665              2087                  2089
> > > > > 16        779     1073       1310              1999                  2012

Now that I think about it - this is all with the rawhide kernel configuration, which has an ungodly number of debug config options enabled.

I should try this with a more normal kernel, on various different systems.

It would also be helpful if other people tried this same benchmark, and others, on their systems.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote:
> Rik van Riel wrote:
> > Nick Piggin wrote:
> > > Rik van Riel wrote:
> > > > Here are the transactions/seconds for each combination:

I've added a 5th column, with just your mmap_sem patch and without my madv_free patch. It is run with the glibc patch, which should make it fall back to MADV_DONTNEED after the first MADV_FREE call fails.

> > > > threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem  mmap_sem
> > > > 1         610      609        596               545                   534
> > > > 2        1032     1136       1196              1200                  1180
> > > > 4        1070     1128       2014              2024                  2027
> > > > 8        1000     1088       1665              2087                  2089
> > > > 16        779     1073       1310              1999                  2012

Not doing the mprotect calls is the big one I guess, especially the fact that we don't need to take the mmap_sem for writing.

With both our patches, single and two thread performance with MySQL sysbench is somewhat better than with just your patch, 4 and 8 thread performance are basically the same and just your patch gives a slight benefit with 16 threads.

I guess I should benchmark up to 64 or 128 threads tomorrow, to see if this is just luck or if the cache benefit of doing the page faults and reusing hot pages is faster than not having page faults at all.

I should run some benchmarks on other systems, too. Some of these results could be an artifact of my quad core CPU. The results could be very different on other systems...

> Yeah. That's funny, because it means either there is some contention on the mmap_sem (or ptl) at 1 thread, or that my patch alters the uncontended performance.

Maybe MySQL has various different threads to do different tasks. Something to look into...
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> Nick Piggin wrote:
> > Rik van Riel wrote:
> > > Here are the transactions/seconds for each combination:
> > >
> > > threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
> > > 1         610      609        596               545
> > > 2        1032     1136       1196              1200
> > > 4        1070     1128       2014              2024
> > > 8        1000     1088       1665              2087
> > > 16        779     1073       1310              1999
> >
> > Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?
>
> No, that's just the glibc change, with a vanilla kernel.

OK. That would be interesting to see with the mmap_sem change, because that should increase scalability.

> The third column is glibc change + mmap_sem patch. The fourth column has your patch in it, too.
>
> > The strange thing with your madv_free kernel is that it doesn't help single-threaded performance at all. So that work to avoid zeroing the new page is not a win at all there (maybe due to the cache effects I was worried about?).
>
> Well, your patch causes the performance to drop from 596 transactions/second to 545. Your patch is the only difference between the third and the fourth column.

Yeah. That's funny, because it means either there is some contention on the mmap_sem (or ptl) at 1 thread, or that my patch alters the uncontended performance.

> > However MADV_FREE does improve scalability, which is interesting. The most likely reason I can see why that may be the case is that it avoids mmap_sem when faulting pages back in (I doubt it is due to avoiding the page allocator, but maybe?).
> >
> > So where is the down_write coming from in this workload, I wonder? Heap management? What syscalls?
>
> I wonder if the increased parallelism simply caused more cache line bouncing, with bounces happening in some inner loop instead of an outer loop.
>
> Btw, it is quite possible that the MySQL sysbench thing gives different results on your system. It would be good to know what it does on a real SMP system, vs. a single quad-core chip :)
>
> Other architectures would be interesting to know, too.

I don't see why parallelism should come into it at 1 thread, unless MySQL is parallelising individual transactions.

Anyway, I'll try to do some more digging.

-- SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
On 4/22/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> Why isn't MADV_FREE defined to 5 for linux? It's our first free madv value?
>
> Also the behaviour should better match the one in solaris or BSD, the last thing we need is slightly different behaviour from operating systems supporting this for ages.

The behavior should indeed be identical. Both implementations restrict MADV_FREE to work on anonymous memory and it is unspecified whether a renewed access results in a zeroed page being created or whether the old content is still there.

So, just use 0x5 for both the Linux and Solaris version on sparc.
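The semantics stated above (anonymous memory only, with either the old content or a zeroed page visible on the next access) can be probed from userspace along these lines. This is a sketch under assumptions: read_after_lazy_free is a name made up here, and where MADV_FREE is missing or rejected the code substitutes MADV_DONTNEED, which for private anonymous memory on Linux yields a zero page.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write 'marker' into a fresh anonymous page, mark the page lazily
 * freeable, then read the byte back. Returns the byte that was read,
 * or -1 if a system call failed.
 */
static int read_after_lazy_free(unsigned char marker)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *p;
    int rc = -1;
    int seen;

    if (page <= 0)
        return -1;
    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = marker;
#ifdef MADV_FREE
    rc = madvise(p, (size_t)page, MADV_FREE);
#endif
    if (rc != 0)   /* MADV_FREE unavailable: use the eager variant */
        rc = madvise(p, (size_t)page, MADV_DONTNEED);

    seen = (rc == 0) ? p[0] : -1;
    munmap(p, (size_t)page);
    return seen;
}
```

A caller can rely only on the result being either the marker or 0; which of the two shows up depends on whether the kernel reclaimed the page in the meantime.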
Re: [PATCH] lazy freeing of memory through MADV_FREE
On Sun, Apr 22, 2007 at 01:18:10AM -0700, Andrew Morton wrote:
> On Tue, 17 Apr 2007 03:15:51 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
>
> > Make it possible for applications to have the kernel free memory
> > lazily. This reduces a repeated free/malloc cycle from freeing
> > pages and allocating them, to just marking them freeable. If the
> > application wants to reuse them before the kernel needs the memory,
> > not even a page fault will happen.
> >
> > This patch, together with Ulrich's glibc change, increases
> > MySQL sysbench performance by a factor of 2 on my quad core
> > test system.
> >
>
> In file included from include/linux/mman.h:4,
>                  from arch/sparc64/kernel/sys_sparc.c:19:
> include/asm/mman.h:36:1: "MADV_FREE" redefined
> In file included from include/asm/mman.h:5,
>                  from include/linux/mman.h:4,
>                  from arch/sparc64/kernel/sys_sparc.c:19:
> include/asm-generic/mman.h:32:1: this is the location of the previous
> definition
>
> sparc32 and sparc64 already defined MADV_FREE:
>
> #define MADV_FREE 0x5 /* (Solaris) contents can be freed */
>
> I'll remove the sparc definitions for now, but we need to work out what
> we're going to do here. Your patch changes the values of MADV_FREE on
> sparc.
>
> Perhaps this should be renamed to MADV_FREE_LINUX and given a different
> number. It depends on how close your proposed behaviour is to Solaris's.

Why isn't MADV_FREE defined to 5 for linux? It's our first free madv value?

Also the behaviour should better match the one in solaris or BSD, the last thing we need is slightly different behaviour from operating systems supporting this for ages.
Re: [PATCH] lazy freeing of memory through MADV_FREE
On Tue, 17 Apr 2007 03:15:51 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Make it possible for applications to have the kernel free memory
> lazily. This reduces a repeated free/malloc cycle from freeing
> pages and allocating them, to just marking them freeable. If the
> application wants to reuse them before the kernel needs the memory,
> not even a page fault will happen.
>
> This patch, together with Ulrich's glibc change, increases
> MySQL sysbench performance by a factor of 2 on my quad core
> test system.
>

In file included from include/linux/mman.h:4,
                 from arch/sparc64/kernel/sys_sparc.c:19:
include/asm/mman.h:36:1: "MADV_FREE" redefined
In file included from include/asm/mman.h:5,
                 from include/linux/mman.h:4,
                 from arch/sparc64/kernel/sys_sparc.c:19:
include/asm-generic/mman.h:32:1: this is the location of the previous definition

sparc32 and sparc64 already defined MADV_FREE:

#define MADV_FREE 0x5 /* (Solaris) contents can be freed */

I'll remove the sparc definitions for now, but we need to work out what we're going to do here. Your patch changes the values of MADV_FREE on sparc.

Perhaps this should be renamed to MADV_FREE_LINUX and given a different number. It depends on how close your proposed behaviour is to Solaris's.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote:
> Rik van Riel wrote:
>> Andrew Morton wrote:
>>> On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
>>>> Andrew Morton wrote:
>>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>>
>>>>> - Nick's patch also will help this problem.  It could be that your
>>>>>   patch no longer offers a 2x speedup when combined with Nick's patch.
>>>>>
>>>>>   It could well be that the combination of the two is even better,
>>>>>   but it would be nice to firm that up a bit.
>>>> I'll test that.
>>> Thanks.
>> Well, good news.
>>
>> It turns out that Nick's patch does not improve peak
>> performance much, but it does prevent the decline when
>> running with 16 threads on my quad core CPU!
>>
>> We _definitely_ want both patches, there's a huge benefit
>> in having them both.
>>
>> Here are the transactions/seconds for each combination:
>>
>> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>>    1       610       609          596                 545
>>    2      1032      1136         1196                1200
>>    4      1070      1128         2014                2024
>>    8      1000      1088         1665                2087
>>   16       779      1073         1310                1999
>
> Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?

No, that's just the glibc change, with a vanilla kernel.

The third column is glibc change + mmap_sem patch.  The fourth
column has your patch in it, too.

> The strange thing with your madv_free kernel is that it doesn't
> help single-threaded performance at all.  So that work to avoid
> zeroing the new page is not a win at all there (maybe due to the
> cache effects I was worried about?).

Well, your patch causes the performance to drop from
596 transactions/second to 545.  Your patch is the only
difference between the third and the fourth column.

> However MADV_FREE does improve scalability, which is interesting.
> The most likely reason I can see why that may be the case is that
> it avoids mmap_sem when faulting pages back in (I doubt it is due
> to avoiding the page allocator, but maybe?).
>
> So where is the down_write coming from in this workload, I wonder?
> Heap management?  What syscalls?

I wonder if the increased parallelism simply caused
more cache line bouncing, with bounces happening in
some inner loop instead of an outer loop.

Btw, it is quite possible that the MySQL sysbench
thing gives different results on your system.  It
would be good to know what it does on a real SMP
system, vs. a single quad-core chip :)

Other architectures would be interesting to know, too.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
Re: [PATCH] lazy freeing of memory through MADV_FREE
On 4/22/07, Christoph Hellwig [EMAIL PROTECTED] wrote:
> Why isn't MADV_FREE defined to 5 for linux?  It's our first free madv
> value?  Also the behaviour should better match the one in solaris or BSD,
> the last thing we need is slightly different behaviour from operating
> systems supporting this for ages.

The behavior should indeed be identical.  Both implementations
restrict MADV_FREE to work on anonymous memory and it is unspecified
whether a renewed access yields a newly created zeroed page or
whether the old content is still there.

So, just use 0x5 for both the Linux and Solaris version on sparc.
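A minimal sketch of the contract described above, assuming a Linux system whose headers define MADV_FREE (mainline kernels only gained it much later than this thread, so the `#ifdef` guard matters). The function name `madv_free_contract_holds` is hypothetical. It fills an anonymous private mapping with a pattern, marks it MADV_FREE, and checks that each byte read back is either the old value or zero, which is all a caller may rely on.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* for madvise() and MAP_ANONYMOUS */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper, not taken from the patch under discussion.
   Returns  1 if every byte seen after MADV_FREE is the old value or zero,
            0 if any other value shows up,
           -1 if the mapping fails or MADV_FREE is unavailable/rejected. */
int madv_free_contract_holds(void)
{
#ifdef MADV_FREE
    size_t len = (size_t)sysconf(_SC_PAGESIZE) * 4;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);                 /* dirty every page */

    if (madvise(p, len, MADV_FREE) != 0) {/* kernel may not know the advice */
        munmap(p, len);
        return -1;
    }

    int ok = 1;
    for (size_t i = 0; i < len; i++)      /* old content or zero, nothing else */
        if (p[i] != 0xAB && p[i] != 0)
            ok = 0;

    munmap(p, len);
    return ok;
#else
    return -1;                            /* headers do not define MADV_FREE */
#endif
}
```

Which of the two values is observed depends on whether the kernel reclaimed the page in the meantime, so the check deliberately accepts both.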
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> Nick Piggin wrote:
>> Rik van Riel wrote:
>>> Here are the transactions/seconds for each combination:
>>>
>>> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>>>    1       610       609          596                 545
>>>    2      1032      1136         1196                1200
>>>    4      1070      1128         2014                2024
>>>    8      1000      1088         1665                2087
>>>   16       779      1073         1310                1999
>>
>> Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?
>
> No, that's just the glibc change, with a vanilla kernel.

OK. That would be interesting to see with the mmap_sem change, because
that should increase scalability.

> The third column is glibc change + mmap_sem patch.  The fourth
> column has your patch in it, too.
>
>> The strange thing with your madv_free kernel is that it doesn't
>> help single-threaded performance at all.  So that work to avoid
>> zeroing the new page is not a win at all there (maybe due to the
>> cache effects I was worried about?).
>
> Well, your patch causes the performance to drop from
> 596 transactions/second to 545.  Your patch is the only
> difference between the third and the fourth column.

Yeah. That's funny, because it means either there is some
contention on the mmap_sem (or ptl) at 1 thread, or that my
patch alters the uncontended performance.

>> However MADV_FREE does improve scalability, which is interesting.
>> The most likely reason I can see why that may be the case is that
>> it avoids mmap_sem when faulting pages back in (I doubt it is due
>> to avoiding the page allocator, but maybe?).
>>
>> So where is the down_write coming from in this workload, I wonder?
>> Heap management?  What syscalls?
>
> I wonder if the increased parallelism simply caused
> more cache line bouncing, with bounces happening in
> some inner loop instead of an outer loop.
>
> Btw, it is quite possible that the MySQL sysbench
> thing gives different results on your system.  It
> would be good to know what it does on a real SMP
> system, vs. a single quad-core chip :)
>
> Other architectures would be interesting to know, too.

I don't see why parallelism should come into it at 1 thread, unless
MySQL is parallelising individual transactions.  Anyway, I'll try to
do some more digging.

--
SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote:
> Rik van Riel wrote:
>> Nick Piggin wrote:
>>> Rik van Riel wrote:
>>>> Here are the transactions/seconds for each combination:

I've added a 5th column, with just your mmap_sem patch and
without my madv_free patch.  It is run with the glibc patch,
which should make it fall back to MADV_DONTNEED after the
first MADV_FREE call fails.

threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem  mmap_sem
   1       610       609          596                 545               534
   2      1032      1136         1196                1200              1180
   4      1070      1128         2014                2024              2027
   8      1000      1088         1665                2087              2089
  16       779      1073         1310                1999              2012

Not doing the mprotect calls is the big one I guess, especially
the fact that we don't need to take the mmap_sem for writing.

With both our patches, single and two thread performance with
MySQL sysbench is somewhat better than with just your patch,
4 and 8 thread performance are basically the same and just
your patch gives a slight benefit with 16 threads.

I guess I should benchmark up to 64 or 128 threads tomorrow,
to see if this is just luck or if the cache benefit of doing
the page faults and reusing hot pages is faster than not
having page faults at all.

I should run some benchmarks on other systems, too.  Some of
these results could be an artifact of my quad core CPU.  The
results could be very different on other systems...

> Yeah. That's funny, because it means either there is some
> contention on the mmap_sem (or ptl) at 1 thread, or that my
> patch alters the uncontended performance.

Maybe MySQL has various different threads to do different
tasks.  Something to look into...
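The fallback mentioned above ("fall back to MADV_DONTNEED after the first MADV_FREE call fails") can be sketched roughly as follows. This is an illustration only, not the glibc code: the flag name `no_madv_free` is borrowed from the ChangeLog posted in this thread, while `lazy_free` and `lazy_free_selftest` are hypothetical names, and the sketch treats any MADV_FREE failure as "not supported".

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* for madvise() and MAP_ANONYMOUS */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sticky flag: once MADV_FREE has failed, stop trying it. */
static int no_madv_free;

/* Hypothetical helper: hand a page-aligned range back to the kernel,
   lazily if possible, eagerly otherwise.  Returns madvise()'s result. */
int lazy_free(void *addr, size_t len)
{
#ifdef MADV_FREE
    if (!no_madv_free) {
        if (madvise(addr, len, MADV_FREE) == 0)
            return 0;
        no_madv_free = 1;        /* assumption: failure means "unsupported" */
    }
#endif
    return madvise(addr, len, MADV_DONTNEED);
}

/* Small self-check: map four pages, release them, report the result. */
int lazy_free_selftest(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE) * 4;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    int rc = lazy_free(p, len);
    munmap(p, len);
    return rc;
}
```

Keeping the flag sticky means a kernel without MADV_FREE costs one failed system call in total rather than one per free.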
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> Nick Piggin wrote:
>> Rik van Riel wrote:
>>> Nick Piggin wrote:
>>>> Rik van Riel wrote:
>>>>> Here are the transactions/seconds for each combination:
>
> I've added a 5th column, with just your mmap_sem patch and
> without my madv_free patch.  It is run with the glibc patch,
> which should make it fall back to MADV_DONTNEED after the
> first MADV_FREE call fails.
>
> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem  mmap_sem
>    1       610       609          596                 545               534
>    2      1032      1136         1196                1200              1180
>    4      1070      1128         2014                2024              2027
>    8      1000      1088         1665                2087              2089
>   16       779      1073         1310                1999              2012

Now that I think about it - this is all with the rawhide kernel
configuration, which has an ungodly number of debug config
options enabled.

I should try this with a more normal kernel, on various different
systems.

It would also be helpful if other people tried this same benchmark,
and others, on their systems.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> I've added a 5th column, with just your mmap_sem patch and
> without my madv_free patch.  It is run with the glibc patch,
> which should make it fall back to MADV_DONTNEED after the
> first MADV_FREE call fails.

Thanks! (I edited slightly so it doesn't wrap)

> threads  vanilla  new glibc  madv_free  mmap_sem  both
>    1       610       609        596       534      545
>    2      1032      1136       1196      1180     1200
>    4      1070      1128       2014      2027     2024
>    8      1000      1088       1665      2089     2087
>   16       779      1073       1310      2012     1999
>
> Not doing the mprotect calls is the big one I guess, especially
> the fact that we don't need to take the mmap_sem for writing.

Yes.

> With both our patches, single and two thread performance with
> MySQL sysbench is somewhat better than with just your patch,
> 4 and 8 thread performance are basically the same and just
> your patch gives a slight benefit with 16 threads.
>
> I guess I should benchmark up to 64 or 128 threads tomorrow,
> to see if this is just luck or if the cache benefit of doing
> the page faults and reusing hot pages is faster than not
> having page faults at all.
>
> I should run some benchmarks on other systems, too.  Some of
> these results could be an artifact of my quad core CPU.  The
> results could be very different on other systems...

I'm getting the 16 core box out of retirement as we speak :)

--
SUSE Labs, Novell Inc.
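One rough way to put a number on "prevents the decline at 16 threads" is to express each column's 16-thread throughput as a percentage of that column's own peak. The sketch below does this with the figures from the table above; `retained_percent` and the array names are hypothetical, and integer division truncates the result.

```c
#include <stddef.h>

/* Transactions/second at 1, 2, 4, 8 and 16 threads, copied from the table. */
static const int tps_vanilla[]   = { 610, 1032, 1070, 1000,  779 };
static const int tps_new_glibc[] = { 609, 1136, 1128, 1088, 1073 };
static const int tps_madv_free[] = { 596, 1196, 2014, 1665, 1310 };
static const int tps_mmap_sem[]  = { 534, 1180, 2027, 2089, 2012 };
static const int tps_both[]      = { 545, 1200, 2024, 2087, 1999 };

/* Throughput at the highest thread count as a whole percentage of the
   best throughput seen anywhere in the column (truncated, not rounded). */
int retained_percent(const int *tps, size_t n)
{
    int peak = tps[0];
    for (size_t i = 1; i < n; i++)
        if (tps[i] > peak)
            peak = tps[i];
    return tps[n - 1] * 100 / peak;
}
```

By this measure the vanilla kernel keeps about 72% of its peak at 16 threads and the madv_free-only kernel about 65%, while the mmap_sem-only and combined kernels keep about 96% and 95%. It is a single benchmark on a single machine, so the ratios are indicative at best.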
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote:
> So where is the down_write coming from in this workload, I wonder?
> Heap management?  What syscalls?

Trying to answer this question, I straced the mysql threads that
showed up in top when running a single threaded sysbench workload.

There were no mmap, munmap, brk, mprotect or madvise system calls
in the trace.

MySQL has me puzzled, but it seems to have some other people
interested too.  I think I'll go play a bit with ebizzy now, to
see how other workloads are affected by our kernel changes.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Jakub Jelinek wrote:
> On Fri, Apr 20, 2007 at 07:52:44PM -0400, Rik van Riel wrote:
>> It turns out that Nick's patch does not improve peak
>> performance much, but it does prevent the decline when
>> running with 16 threads on my quad core CPU!
>>
>> We _definitely_ want both patches, there's a huge benefit
>> in having them both.
>>
>> Here are the transactions/seconds for each combination:
>>
>> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>>    1       610       609          596                 545
>>    2      1032      1136         1196                1200
>>    4      1070      1128         2014                2024
>>    8      1000      1088         1665                2087
>>   16       779      1073         1310                1999
>
> FYI, I have uploaded a testing glibc that uses MADV_FREE and falls
> back to MADV_DONTNEED if MADV_FREE is not available, to
> http://people.redhat.com/jakub/glibc/2.5.90-21.1/

Hmm, I wonder how glibc malloc stacks up to tcmalloc on this test
(after the mmap_sem patch as well).

I'll try running that as well!

--
SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote:
> Rik van Riel wrote:
>> Andrew Morton wrote:
>>> On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
>>>> Andrew Morton wrote:
>>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>>
>>>>> - Nick's patch also will help this problem.  It could be that your
>>>>>   patch no longer offers a 2x speedup when combined with Nick's patch.
>>>>>
>>>>>   It could well be that the combination of the two is even better,
>>>>>   but it would be nice to firm that up a bit.
>>>> I'll test that.
>>> Thanks.
>> Well, good news.
>>
>> It turns out that Nick's patch does not improve peak
>> performance much, but it does prevent the decline when
>> running with 16 threads on my quad core CPU!
>>
>> We _definitely_ want both patches, there's a huge benefit
>> in having them both.
>>
>> Here are the transactions/seconds for each combination:
>>
>> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>>    1       610       609          596                 545
>>    2      1032      1136         1196                1200
>>    4      1070      1128         2014                2024
>>    8      1000      1088         1665                2087
>>   16       779      1073         1310                1999
>
> Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?
>
> The strange thing with your madv_free kernel is that it doesn't
> help single-threaded performance at all.  So that work to avoid
> zeroing the new page is not a win at all there (maybe due to the
> cache effects I was worried about?).
>
> However MADV_FREE does improve scalability, which is interesting.
> The most likely reason I can see why that may be the case is that
> it avoids mmap_sem when faulting pages back in (I doubt it is due
> to avoiding the page allocator, but maybe?).
>
> So where is the down_write coming from in this workload, I wonder?
> Heap management?  What syscalls?
>
> x86_64's rwsems are crap under heavy parallelism (even read-only),
> as I fixed in my recent generic rwsems patch.  I don't expect MySQL
> to be such a mmap_sem microbenchmark, but I wonder how much this
> would help?
>
> What if we ran the private futexes patch to further cut down
> mmap_sem contention?

Hmm, without the MADV_FREE patch, I wonder if it isn't doing something
silly like read-faulting in a ZERO_PAGE then write faulting a new page
straight afterwards.  I'll have to try a few tests.

--
SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> Andrew Morton wrote:
>> On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
>>> Andrew Morton wrote:
>>>> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>>>>
>>>> - Nick's patch also will help this problem.  It could be that your
>>>>   patch no longer offers a 2x speedup when combined with Nick's patch.
>>>>
>>>>   It could well be that the combination of the two is even better,
>>>>   but it would be nice to firm that up a bit.
>>> I'll test that.
>> Thanks.
> Well, good news.
>
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
>
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
>
> Here are the transactions/seconds for each combination:
>
> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>    1       610       609          596                 545
>    2      1032      1136         1196                1200
>    4      1070      1128         2014                2024
>    8      1000      1088         1665                2087
>   16       779      1073         1310                1999

Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?

The strange thing with your madv_free kernel is that it doesn't
help single-threaded performance at all.  So that work to avoid
zeroing the new page is not a win at all there (maybe due to the
cache effects I was worried about?).

However MADV_FREE does improve scalability, which is interesting.
The most likely reason I can see why that may be the case is that
it avoids mmap_sem when faulting pages back in (I doubt it is due
to avoiding the page allocator, but maybe?).

So where is the down_write coming from in this workload, I wonder?
Heap management?  What syscalls?

x86_64's rwsems are crap under heavy parallelism (even read-only),
as I fixed in my recent generic rwsems patch.  I don't expect MySQL
to be such a mmap_sem microbenchmark, but I wonder how much this
would help?

What if we ran the private futexes patch to further cut down
mmap_sem contention?

--
SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Hugh Dickins wrote:
> On Fri, 20 Apr 2007, Rik van Riel wrote:
>> Andrew Morton wrote:
>>
>>> I do go on about that.  But we're adding page flags at about one per
>>> year, and when we run out we're screwed - we'll need to grow the
>>> pageframe.
>>
>> If you want, I can take a look at folding this into the
>> ->mapping pointer.  I can guarantee you it won't be
>> pretty, though :)
>
> Please don't.  If we're going to stuff another pageflag into there,
> let it be PageSwapCache the natural partner of PageAnon, rather than
> whatever our latest pageflag happens to be.

I looked at doing what Andrew wanted, and it did indeed not look
like the right thing to do.  The locking on page->mapping is the
kind of locking we want to avoid during zap_page_range and in the
pageout code.

I like your suggestion better.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
On 4/21/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> But the Linux MADV_DONTNEED does throw away data from a PROT_WRITE,MAP_PRIVATE
> mapping (or brk or stack) - those changes are discarded, and a subsequent
> access will revert to zeroes or the underlying mapped file.  Been like that
> since before 2.4.0.

I didn't say it changed.  I'm just saying that there is a hole in the
current implementation: it does not allow POSIX_MADV_DONTNEED to be
implemented as anything but a no-op.  The POSIX_MADV_DONTNEED behavior
is useful, and IMO something should be added to allow implementing it.
Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
On Fri, 20 Apr 2007, Ulrich Drepper wrote:
>
> Just for reference: the MADV_CURRENT behavior is to throw away data in
> the range.

Not exactly.  The Linux MADV_DONTNEED never throws away data from a
PROT_WRITE,MAP_SHARED mapping (or shm) - it propagates the dirty bit,
the page will eventually get written out to file, and can be retrieved
later by subsequent access.

But the Linux MADV_DONTNEED does throw away data from a PROT_WRITE,MAP_PRIVATE
mapping (or brk or stack) - those changes are discarded, and a subsequent
access will revert to zeroes or the underlying mapped file.  Been like that
since before 2.4.0.

> The POSIX_MADV_DONTNEED behavior is to never lose data.
> I.e., file backed data is written back, anon data is at most swapped
> out.
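The difference between the two "don't need" flavours is easy to observe on a private anonymous page. The sketch below is illustrative only and assumes Linux with a libc whose posix_madvise() treats POSIX_MADV_DONTNEED as a no-op (as the thread says is the only option); the function names are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* for madvise() and MAP_ANONYMOUS */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one private anonymous page and store a marker byte in it. */
static unsigned char *map_marked_page(size_t *len)
{
    *len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, *len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    p[0] = 0x5A;
    return p;
}

/* Byte read back after Linux MADV_DONTNEED, or -1 on failure.
   On Linux the private change is discarded and the page reads as zero. */
int byte_after_madv_dontneed(void)
{
    size_t len;
    unsigned char *p = map_marked_page(&len);
    if (p == NULL)
        return -1;
    int seen = (madvise(p, len, MADV_DONTNEED) == 0) ? p[0] : -1;
    munmap(p, len);
    return seen;
}

/* Byte read back after posix_madvise(POSIX_MADV_DONTNEED), or -1 on
   failure.  POSIX forbids losing data, so the marker must survive. */
int byte_after_posix_dontneed(void)
{
    size_t len;
    unsigned char *p = map_marked_page(&len);
    if (p == NULL)
        return -1;
    int seen = (posix_madvise(p, len, POSIX_MADV_DONTNEED) == 0) ? p[0] : -1;
    munmap(p, len);
    return seen;
}
```

The first function shows the discard-on-private-mapping behaviour Hugh describes; the second shows why the POSIX advice cannot simply be mapped onto the Linux one.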
Re: [PATCH] lazy freeing of memory through MADV_FREE
On Fri, 20 Apr 2007, Rik van Riel wrote:
> Andrew Morton wrote:
>
> > I do go on about that.  But we're adding page flags at about one per
> > year, and when we run out we're screwed - we'll need to grow the
> > pageframe.
>
> If you want, I can take a look at folding this into the
> ->mapping pointer.  I can guarantee you it won't be
> pretty, though :)

Please don't.  If we're going to stuff another pageflag into there,
let it be PageSwapCache the natural partner of PageAnon, rather than
whatever our latest pageflag happens to be.

I'll look into it - but do keep an eye on me, I've developed a
dubious track record of obstructing other people's attempts to
save pageflags.

Hugh
Re: [PATCH] lazy freeing of memory through MADV_FREE
On Fri, Apr 20, 2007 at 07:52:44PM -0400, Rik van Riel wrote:
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
>
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
>
> Here are the transactions/seconds for each combination:
>
> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
>    1       610       609          596                 545
>    2      1032      1136         1196                1200
>    4      1070      1128         2014                2024
>    8      1000      1088         1665                2087
>   16       779      1073         1310                1999

FYI, I have uploaded a testing glibc that uses MADV_FREE and falls back
to MADV_DONTNEED if MADV_FREE is not available, to
http://people.redhat.com/jakub/glibc/2.5.90-21.1/
and I'm also attaching the glibc patch for those who want to build it
themselves:

2007-04-19  Ulrich Drepper  <[EMAIL PROTECTED]>
            Jakub Jelinek  <[EMAIL PROTECTED]>

        * malloc/arena.c (heap_info): Add mprotect_size field, adjust pad.
        (new_heap): Initialize mprotect_size.
        (no_madv_free): New variable.
        (grow_heap): When growing, only mprotect from mprotect_size till
        new_size if mprotect_size is smaller.  When shrinking, use PROT_NONE
        MMAP for __libc_enable_secure only, otherwise if MADV_FREE is
        available use it and fall back to MADV_DONTNEED.
        * sysdeps/unix/sysv/linux/alpha/bits/mman.h (MADV_FREE): Define.
        * sysdeps/unix/sysv/linux/ia64/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/i386/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/s390/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/powerpc/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/x86_64/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/sparc/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/sh/bits/mman.h (MADV_FREE): Likewise.

--- libc/malloc/arena.c.jj 2006-10-31 23:05:31.0 +0100
+++ libc/malloc/arena.c 2007-04-19 18:54:20.0 +0200
@@ -1,5 +1,6 @@
 /* Malloc implementation for multiple threads without lock contention.
-   Copyright (C) 2001,2002,2003,2004,2005,2006 Free Software Foundation, Inc.
+   Copyright (C) 2001,2002,2003,2004,2005,2006,2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
    Contributed by Wolfram Gloger <[EMAIL PROTECTED]>, 2001.
 
@@ -59,10 +60,12 @@ typedef struct _heap_info {
   mstate ar_ptr; /* Arena for this heap. */
   struct _heap_info *prev; /* Previous heap. */
   size_t size;   /* Current size in bytes. */
+  size_t mprotect_size; /* Size in bytes that has been mprotected
+                           PROT_READ|PROT_WRITE.  */
   /* Make sure the following data is properly aligned, particularly
      that sizeof (heap_info) + 2 * SIZE_SZ is a multiple of
-     MALLOG_ALIGNMENT. */
-  char pad[-5 * SIZE_SZ & MALLOC_ALIGN_MASK];
+     MALLOC_ALIGNMENT. */
+  char pad[-6 * SIZE_SZ & MALLOC_ALIGN_MASK];
 } heap_info;
 
 /* Get a compile-time error if the heap_info padding is not correct
@@ -692,10 +695,15 @@ new_heap(size, top_pad) size_t size, top
   }
   h = (heap_info *)p2;
   h->size = size;
+  h->mprotect_size = size;
   THREAD_STAT(stat_n_heaps++);
   return h;
 }
 
+#if defined _LIBC && defined MADV_FREE
+static int no_madv_free;
+#endif
+
 /* Grow or shrink a heap.  size is automatically rounded up to a
    multiple of the page size if it is positive. */
 
@@ -714,17 +722,49 @@ grow_heap(h, diff) heap_info *h; long di
     new_size = (long)h->size + diff;
     if((unsigned long) new_size > (unsigned long) HEAP_MAX_SIZE)
       return -1;
-    if(mprotect((char *)h + h->size, diff, PROT_READ|PROT_WRITE) != 0)
-      return -2;
+    if((unsigned long) new_size > h->mprotect_size) {
+      if (mprotect((char *)h + h->mprotect_size,
+                   (unsigned long) new_size - h->mprotect_size,
+                   PROT_READ|PROT_WRITE) != 0)
+        return -2;
+      h->mprotect_size = new_size;
+    }
   } else {
     new_size = (long)h->size + diff;
     if(new_size < (long)sizeof(*h))
       return -1;
     /* Try to re-map the extra heap space freshly to save memory, and
        make it inaccessible. */
-    if((char *)MMAP((char *)h + new_size, -diff, PROT_NONE,
-                    MAP_PRIVATE|MAP_FIXED) == (char *) MAP_FAILED)
-      return -2;
+#ifdef _LIBC
+    if (__builtin_expect (__libc_enable_secure, 0))
+#else
+    if (1)
+#endif
+      {
+        if((char *)MMAP((char *)h + new_size, -diff, PROT_NONE,
+                        MAP_PRIVATE|MAP_FIXED) == (char *) MAP_FAILED)
+          return -2;
+        h->mprotect_size = new_size;
+      }
+#ifdef _LIBC
+    else
+      {
+# ifdef MADV_FREE
+        if (!__builtin_expect (no_madv_free, 0))
+          {
+            if
Re: [PATCH] lazy freeing of memory through MADV_FREE
On Fri, Apr 20, 2007 at 07:52:44PM -0400, Rik van Riel wrote:
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
>
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
>
> Here are the transactions/seconds for each combination:
>
> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
> 1        610      609        596               545
> 2        1032     1136       1196              1200
> 4        1070     1128       2014              2024
> 8        1000     1088       1665              2087
> 16       779      1073       1310              1999

FYI, I have uploaded a testing glibc that uses MADV_FREE and falls back
to MADV_DONTNEED if MADV_FREE is not available, to
http://people.redhat.com/jakub/glibc/2.5.90-21.1/
and I'm also attaching the glibc patch for those who want to build it
themselves:

2007-04-19  Ulrich Drepper  [EMAIL PROTECTED]
            Jakub Jelinek  [EMAIL PROTECTED]

        * malloc/arena.c (heap_info): Add mprotect_size field, adjust pad.
        (new_heap): Initialize mprotect_size.
        (no_madv_free): New variable.
        (grow_heap): When growing, only mprotect from mprotect_size till
        new_size if mprotect_size is smaller.  When shrinking, use PROT_NONE
        MMAP for __libc_enable_secure only, otherwise if MADV_FREE is
        available use it and fall back to MADV_DONTNEED.
        * sysdeps/unix/sysv/linux/alpha/bits/mman.h (MADV_FREE): Define.
        * sysdeps/unix/sysv/linux/ia64/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/i386/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/s390/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/powerpc/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/x86_64/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/sparc/bits/mman.h (MADV_FREE): Likewise.
        * sysdeps/unix/sysv/linux/sh/bits/mman.h (MADV_FREE): Likewise.

--- libc/malloc/arena.c.jj 2006-10-31 23:05:31.0 +0100
+++ libc/malloc/arena.c 2007-04-19 18:54:20.0 +0200
@@ -1,5 +1,6 @@
 /* Malloc implementation for multiple threads without lock contention.
-   Copyright (C) 2001,2002,2003,2004,2005,2006 Free Software Foundation, Inc.
+   Copyright (C) 2001,2002,2003,2004,2005,2006,2007
+   Free Software Foundation, Inc.
    This file is part of the GNU C Library.
    Contributed by Wolfram Gloger [EMAIL PROTECTED], 2001.
@@ -59,10 +60,12 @@ typedef struct _heap_info {
   mstate ar_ptr; /* Arena for this heap. */
   struct _heap_info *prev; /* Previous heap. */
   size_t size; /* Current size in bytes. */
+  size_t mprotect_size; /* Size in bytes that has been mprotected
+                           PROT_READ|PROT_WRITE. */
   /* Make sure the following data is properly aligned, particularly
      that sizeof (heap_info) + 2 * SIZE_SZ is a multiple of
-     MALLOG_ALIGNMENT. */
-  char pad[-5 * SIZE_SZ & MALLOC_ALIGN_MASK];
+     MALLOC_ALIGNMENT. */
+  char pad[-6 * SIZE_SZ & MALLOC_ALIGN_MASK];
 } heap_info;

 /* Get a compile-time error if the heap_info padding is not correct
@@ -692,10 +695,15 @@ new_heap(size, top_pad) size_t size, top
   }
   h = (heap_info *)p2;
   h->size = size;
+  h->mprotect_size = size;
   THREAD_STAT(stat_n_heaps++);
   return h;
 }

+#if defined _LIBC && defined MADV_FREE
+static int no_madv_free;
+#endif
+
 /* Grow or shrink a heap. size is automatically rounded up to a
    multiple of the page size if it is positive. */
@@ -714,17 +722,49 @@ grow_heap(h, diff) heap_info *h; long di
     new_size = (long)h->size + diff;
     if((unsigned long) new_size > (unsigned long) HEAP_MAX_SIZE)
       return -1;
-    if(mprotect((char *)h + h->size, diff, PROT_READ|PROT_WRITE) != 0)
-      return -2;
+    if((unsigned long) new_size > h->mprotect_size) {
+      if (mprotect((char *)h + h->mprotect_size,
+                   (unsigned long) new_size - h->mprotect_size,
+                   PROT_READ|PROT_WRITE) != 0)
+        return -2;
+      h->mprotect_size = new_size;
+    }
   } else {
     new_size = (long)h->size + diff;
     if(new_size < (long)sizeof(*h))
       return -1;
     /* Try to re-map the extra heap space freshly to save memory, and
        make it inaccessible. */
-    if((char *)MMAP((char *)h + new_size, -diff, PROT_NONE,
-                    MAP_PRIVATE|MAP_FIXED) == (char *) MAP_FAILED)
-      return -2;
+#ifdef _LIBC
+    if (__builtin_expect (__libc_enable_secure, 0))
+#else
+    if (1)
+#endif
+      {
+        if((char *)MMAP((char *)h + new_size, -diff, PROT_NONE,
+                        MAP_PRIVATE|MAP_FIXED) == (char *) MAP_FAILED)
+          return -2;
+        h->mprotect_size = new_size;
+      }
+#ifdef _LIBC
+    else
+      {
+# ifdef MADV_FREE
+        if (!__builtin_expect (no_madv_free, 0))
+          {
+            if
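
Editor's note: the attached patch is cut off in this archive before the madvise call itself. The fallback that the ChangeLog entry describes ("if MADV_FREE is available use it and fall back to MADV_DONTNEED") can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the patch: the helper name release_range is hypothetical, and treating EINVAL as the sign that the kernel lacks MADV_FREE is an assumption. Only the no_madv_free flag name is taken from the patch.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Set once the kernel has rejected MADV_FREE so later calls skip it.
   Same name as the variable in the patch; not thread-safe in this sketch. */
static int no_madv_free;

/* Hypothetical helper: mark [addr, addr + len) as no longer needed.
   Returns 0 on success, -1 on failure with errno set by madvise(). */
static int release_range(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
#ifdef MADV_FREE
    if (!no_madv_free) {
        if (madvise(addr, len, MADV_FREE) == 0)
            return 0;
        /* Assumption: EINVAL means this kernel does not know MADV_FREE. */
        if (errno != EINVAL)
            return -1;
        no_madv_free = 1;
    }
#endif
    /* Older kernels: discard the pages eagerly instead. */
    return madvise(addr, len, MADV_DONTNEED);
}
```

Remembering the failure in a flag means a kernel without MADV_FREE pays for the failed call only once, after which every shrink goes straight to MADV_DONTNEED.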
Re: [PATCH] lazy freeing of memory through MADV_FREE
On Fri, 20 Apr 2007, Rik van Riel wrote:
> Andrew Morton wrote:
> > I do go on about that. But we're adding page flags at about one per
> > year, and when we run out we're screwed - we'll need to grow the
> > pageframe.
>
> If you want, I can take a look at folding this into the
> ->mapping pointer. I can guarantee you it won't be
> pretty, though :)

Please don't. If we're going to stuff another pageflag into there,
let it be PageSwapCache, the natural partner of PageAnon, rather than
whatever our latest pageflag happens to be. I'll look into it - but
do keep an eye on me, I've developed a dubious track record of
obstructing other people's attempts to save pageflags.

Hugh
Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
On Fri, 20 Apr 2007, Ulrich Drepper wrote:
> Just for reference: the MADV_CURRENT behavior is to throw away data in
> the range.

Not exactly. The Linux MADV_DONTNEED never throws away data from a
PROT_WRITE,MAP_SHARED mapping (or shm) - it propagates the dirty bit,
the page will eventually get written out to file, and can be retrieved
later by subsequent access.

But the Linux MADV_DONTNEED does throw away data from a
PROT_WRITE,MAP_PRIVATE mapping (or brk or stack) - those changes are
discarded, and a subsequent access will revert to zeroes or the
underlying mapped file. Been like that since before 2.4.0.

> The POSIX_MADV_DONTNEED behavior is to never lose
> data. I.e., file backed data is written back, anon data is at most
> swapped out.
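
Editor's note: the private-mapping case described in this message can be checked from user space. The following is a minimal sketch, assuming a Linux system and an anonymous MAP_PRIVATE mapping; the function name byte_after_dontneed is hypothetical and error handling is kept to a minimum.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Fill a private anonymous mapping, apply MADV_DONTNEED, and return the
   first byte read back afterwards (or -1 if a system call fails). */
static int byte_after_dontneed(size_t len)
{
    assert(len > 0);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);              /* dirty every page */

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    int value = p[0];                  /* faults in a fresh page */
    munmap(p, len);
    return value;
}
```

On Linux the value read back is 0 rather than 0xAB, which is the "revert to zeroes" behaviour described above for private mappings.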
Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
On 4/21/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> But the Linux MADV_DONTNEED does throw away data from a
> PROT_WRITE,MAP_PRIVATE mapping (or brk or stack) - those changes are
> discarded, and a subsequent access will revert to zeroes or the
> underlying mapped file. Been like that since before 2.4.0.

I didn't say it changed. I just say that there is a hole in the
current implementation, as it does not allow implementing
POSIX_MADV_DONTNEED with anything but a no-op. The
POSIX_MADV_DONTNEED behavior is useful, and IMO something should be
added to allow implementing it.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Hugh Dickins wrote:
> On Fri, 20 Apr 2007, Rik van Riel wrote:
> > Andrew Morton wrote:
> > > I do go on about that. But we're adding page flags at about one per
> > > year, and when we run out we're screwed - we'll need to grow the
> > > pageframe.
> >
> > If you want, I can take a look at folding this into the
> > ->mapping pointer. I can guarantee you it won't be
> > pretty, though :)
>
> Please don't. If we're going to stuff another pageflag into there,
> let it be PageSwapCache the natural partner of PageAnon, rather than
> whatever our latest pageflag happens to be.

I looked at doing what Andrew wanted, and it did indeed not
look like the right thing to do. The locking on page->mapping
is the kind of locking we want to avoid during zap_page_range
and in the pageout code.

I like your suggestion better.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> Andrew Morton wrote:
> > On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
> > > Andrew Morton wrote:
> > > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > > >
> > > > - Nick's patch also will help this problem. It could be that your
> > > >   patch no longer offers a 2x speedup when combined with Nick's patch.
> > > >
> > > > It could well be that the combination of the two is even better,
> > > > but it would be nice to firm that up a bit.
> > >
> > > I'll test that.
> >
> > Thanks.
>
> Well, good news.
>
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
>
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
>
> Here are the transactions/seconds for each combination:
>
> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
> 1        610      609        596               545
> 2        1032     1136       1196              1200
> 4        1070     1128       2014              2024
> 8        1000     1088       1665              2087
> 16       779      1073       1310              1999

Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?

The strange thing with your madv_free kernel is that it doesn't
help single-threaded performance at all. So that work to avoid
zeroing the new page is not a win at all there (maybe due to the
cache effects I was worried about?).

However MADV_FREE does improve scalability, which is interesting.
The most likely reason I can see why that may be the case is that
it avoids mmap_sem when faulting pages back in (I doubt it is due
to avoiding the page allocator, but maybe?).

So where is the down_write coming from in this workload, I wonder?
Heap management? What syscalls?

x86_64's rwsems are crap under heavy parallelism (even read-only),
as I fixed in my recent generic rwsems patch. I don't expect MySQL
to be such a mmap_sem microbenchmark, but I wonder how much this
would help?

What if we ran the private futexes patch to further cut down
mmap_sem contention?

--
SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Nick Piggin wrote:
> Rik van Riel wrote:
> > Andrew Morton wrote:
> > > On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
> > > > Andrew Morton wrote:
> > > > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > > > >
> > > > > - Nick's patch also will help this problem. It could be that your
> > > > >   patch no longer offers a 2x speedup when combined with Nick's
> > > > >   patch.
> > > > >
> > > > > It could well be that the combination of the two is even better,
> > > > > but it would be nice to firm that up a bit.
> > > >
> > > > I'll test that.
> > >
> > > Thanks.
> >
> > Well, good news.
> >
> > It turns out that Nick's patch does not improve peak
> > performance much, but it does prevent the decline when
> > running with 16 threads on my quad core CPU!
> >
> > We _definitely_ want both patches, there's a huge benefit
> > in having them both.
> >
> > Here are the transactions/seconds for each combination:
> >
> > threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
> > 1        610      609        596               545
> > 2        1032     1136       1196              1200
> > 4        1070     1128       2014              2024
> > 8        1000     1088       1665              2087
> > 16       779      1073       1310              1999
>
> Is "new glibc" meaning MADV_DONTNEED + kernel with mmap_sem patch?
>
> The strange thing with your madv_free kernel is that it doesn't
> help single-threaded performance at all. So that work to avoid
> zeroing the new page is not a win at all there (maybe due to the
> cache effects I was worried about?).
>
> However MADV_FREE does improve scalability, which is interesting.
> The most likely reason I can see why that may be the case is that
> it avoids mmap_sem when faulting pages back in (I doubt it is due
> to avoiding the page allocator, but maybe?).
>
> So where is the down_write coming from in this workload, I wonder?
> Heap management? What syscalls?
>
> x86_64's rwsems are crap under heavy parallelism (even read-only),
> as I fixed in my recent generic rwsems patch. I don't expect MySQL
> to be such a mmap_sem microbenchmark, but I wonder how much this
> would help?
>
> What if we ran the private futexes patch to further cut down
> mmap_sem contention?

Hmm, without the MADV_FREE patch, I wonder if it isn't doing something
silly like read-faulting in a ZERO_PAGE then write-faulting a new page
straight afterwards... I'll have to try a few tests.

--
SUSE Labs, Novell Inc.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Eric Dumazet wrote:
> Rik van Riel wrote:
> > Andrew Morton wrote:
> > > On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
> > > > Andrew Morton wrote:
> > > > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > > > >
> > > > > - Nick's patch also will help this problem. It could be that your
> > > > >   patch no longer offers a 2x speedup when combined with Nick's
> > > > >   patch.
> > > > >
> > > > > It could well be that the combination of the two is even better,
> > > > > but it would be nice to firm that up a bit.
> > > >
> > > > I'll test that.
> > >
> > > Thanks.
> >
> > Well, good news.
> >
> > It turns out that Nick's patch does not improve peak
> > performance much, but it does prevent the decline when
> > running with 16 threads on my quad core CPU!
> >
> > We _definitely_ want both patches, there's a huge benefit
> > in having them both.
> >
> > Here are the transactions/seconds for each combination:
> >
> > threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
> > 1        610      609        596               545
>
> 545 tps versus 610 tps for one thread? It seems quite bad, no?
>
> Could you please find an explanation for this?

I have no idea why this happens.

Especially the last one, going from a write lock to a read lock on
the mmap_sem should not make ANY difference whatsoever since we're
running single threaded!

> > 2        1032     1136       1196              1200
> > 4        1070     1128       2014              2024
> > 8        1000     1088       1665              2087
> > 16       779      1073       1310              1999

Performance with 2 database threads is way better though, and
performance with 4 or more threads more than doubles...

If you have an explanation on why single threaded performance went
down a little on my quad core system, please let me know.

Does performance suffer at all on a real UP system?
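
Editor's note: the ratios behind the "more than doubles" remark can be worked out directly from the table. The sketch below hard-codes the five rows as they read once the flattened columns of the archive are split into four values per thread count; that split, the struct layout, and the helper name percent_of_vanilla are assumptions of this note, not part of the original message.

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark row: thread count and transactions/second per kernel. */
struct row {
    int threads;
    int vanilla;
    int new_glibc;
    int madv_free;
    int madv_free_mmap_sem;
};

static const struct row results[] = {
    {  1,  610,  609,  596,  545 },
    {  2, 1032, 1136, 1196, 1200 },
    {  4, 1070, 1128, 2014, 2024 },
    {  8, 1000, 1088, 1665, 2087 },
    { 16,  779, 1073, 1310, 1999 },
};

/* Throughput of "madv_free + mmap_sem" as a percentage of vanilla,
   truncated towards zero by integer division. */
static int percent_of_vanilla(const struct row *r)
{
    assert(r != NULL && r->vanilla > 0);
    return r->madv_free_mmap_sem * 100 / r->vanilla;
}
```

With these numbers the combined patches come out at 89% of vanilla for one thread, 116% for two, 189% for four, 208% for eight and 256% for sixteen, so the doubling holds at 8 and 16 threads and is just short of it at 4.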
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> Andrew Morton wrote:
> > On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
> > > Andrew Morton wrote:
> > > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > > >
> > > > - Nick's patch also will help this problem. It could be that your
> > > >   patch no longer offers a 2x speedup when combined with Nick's patch.
> > > >
> > > > It could well be that the combination of the two is even better,
> > > > but it would be nice to firm that up a bit.
> > >
> > > I'll test that.
> >
> > Thanks.
>
> Well, good news.
>
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
>
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
>
> Here are the transactions/seconds for each combination:
>
> threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
> 1        610      609        596               545

545 tps versus 610 tps for one thread? It seems quite bad, no?

Could you please find an explanation for this?

> 2        1032     1136       1196              1200
> 4        1070     1128       2014              2024
> 8        1000     1088       1665              2087
> 16       779      1073       1310              1999

Thank you
Re: [PATCH] lazy freeing of memory through MADV_FREE
Andrew Morton wrote:
> On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
> > Andrew Morton wrote:
> > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > >
> > > - Nick's patch also will help this problem. It could be that your
> > >   patch no longer offers a 2x speedup when combined with Nick's patch.
> > >
> > > It could well be that the combination of the two is even better,
> > > but it would be nice to firm that up a bit.
> >
> > I'll test that.
>
> Thanks.

Well, good news.

It turns out that Nick's patch does not improve peak
performance much, but it does prevent the decline when
running with 16 threads on my quad core CPU!

We _definitely_ want both patches, there's a huge benefit
in having them both.

Here are the transactions/seconds for each combination:

threads  vanilla  new glibc  madv_free kernel  madv_free + mmap_sem
1        610      609        596               545
2        1032     1136       1196              1200
4        1070     1128       2014              2024
8        1000     1088       1665              2087
16       779      1073       1310              1999
Re: [PATCH] lazy freeing of memory through MADV_FREE
On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
> Andrew Morton wrote:
> > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> >
> > - Nick's patch also will help this problem. It could be that your patch
> >   no longer offers a 2x speedup when combined with Nick's patch.
> >
> > It could well be that the combination of the two is even better, but it
> > would be nice to firm that up a bit.
>
> I'll test that.

Thanks.

> > I do go on about that. But we're adding page flags at about one per
> > year, and when we run out we're screwed - we'll need to grow the
> > pageframe.
>
> If you want, I can take a look at folding this into the
> ->mapping pointer. I can guarantee you it won't be
> pretty, though :)

Well, let's see how fugly it ends up looking?

> > - I need to update your patch for Nick's patch. Please confirm that
> >   down_read(mmap_sem) is sufficient for MADV_FREE.
>
> It is. MADV_FREE needs no more protection than MADV_DONTNEED.
>
> > Stylistic nit:
> >
> >> + if (PageLazyFree(page) && !migration) {
> >> +         /* There is new data in the page. Reinstate it. */
> >> +         if (unlikely(pte_dirty(pteval))) {
> >> +                 set_pte_at(mm, address, pte, pteval);
> >> +                 ret = SWAP_FAIL;
> >> +                 goto out_unmap;
> >> +         }
> >
> > The comment should be inside the second `if' statement. As it is, it
> > looks like we reinstate the page if (PageLazyFree(page) && !migration).
>
> Want me to move it?

I did that, thanks.
Re: [PATCH] lazy freeing of memory through MADV_FREE
Andrew Morton wrote:
> I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
>
> - Nick's patch also will help this problem. It could be that your patch
>   no longer offers a 2x speedup when combined with Nick's patch.
>
> It could well be that the combination of the two is even better, but it
> would be nice to firm that up a bit.

I'll test that.

> I do go on about that. But we're adding page flags at about one per
> year, and when we run out we're screwed - we'll need to grow the
> pageframe.

If you want, I can take a look at folding this into the
->mapping pointer. I can guarantee you it won't be
pretty, though :)

> - I need to update your patch for Nick's patch. Please confirm that
>   down_read(mmap_sem) is sufficient for MADV_FREE.

It is. MADV_FREE needs no more protection than MADV_DONTNEED.

> Stylistic nit:
>
>> + if (PageLazyFree(page) && !migration) {
>> +         /* There is new data in the page. Reinstate it. */
>> +         if (unlikely(pte_dirty(pteval))) {
>> +                 set_pte_at(mm, address, pte, pteval);
>> +                 ret = SWAP_FAIL;
>> +                 goto out_unmap;
>> +         }
>
> The comment should be inside the second `if' statement. As it is, it
> looks like we reinstate the page if (PageLazyFree(page) && !migration).

Want me to move it?
Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
On 4/20/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> OK, we need to flesh this out a lot please. People often get confused
> about what our MADV_DONTNEED behaviour is.

Well, there's not really much to flesh out. The current MADV_DONTNEED
is useful in some situations. The behavior cannot be changed, even
glibc will rely on it for the case when MADV_FREE is not supported.

What might be nice to have is to have a POSIX-compliant
POSIX_MADV_DONTNEED implementation. We currently do nothing, which is
OK since no test suite can detect that. But some code might want to
use the real behavior and we're missing an optimization possibility.

Just for reference: the MADV_CURRENT behavior is to throw away data in
the range. The POSIX_MADV_DONTNEED behavior is to never lose data.
I.e., file backed data is written back, anon data is at most swapped
out.
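
Editor's note: the "we currently do nothing" behaviour can be observed from user space. A minimal sketch, assuming a Linux system whose C library accepts POSIX_MADV_DONTNEED in posix_madvise() without discarding anything, as described in this message; the helper name survives_posix_dontneed is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Returns 1 if data written to a private anonymous mapping is still
   there after posix_madvise(POSIX_MADV_DONTNEED), 0 if it is not,
   and -1 if the mapping could not be created. */
static int survives_posix_dontneed(size_t len)
{
    assert(len > 0);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0x5A, len);

    /* posix_madvise() returns an error number directly, not -1/errno. */
    int rc = posix_madvise(p, len, POSIX_MADV_DONTNEED);
    int kept = (rc == 0 && p[0] == 0x5A && p[len - 1] == 0x5A);

    munmap(p, len);
    return kept;
}
```

Contrast this with the raw madvise(MADV_DONTNEED) call on the same kind of mapping, which per the discussion in this thread discards the written data.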
Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
On Thu, 19 Apr 2007 17:15:28 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
> Restore MADV_DONTNEED to its original Linux behaviour. This is still
> not the same behaviour as POSIX, but applications may be depending on
> the Linux behaviour already. Besides, glibc catches POSIX_MADV_DONTNEED
> and makes sure nothing is done...

OK, we need to flesh this out a lot please. People often get confused
about what our MADV_DONTNEED behaviour is. I regularly forget, then look
at the code, then get it wrong. That's for mainline, let alone older
kernels whose behaviour is gawd-knows-what.

So... For the changelog (and the manpage) could we please have a full
description of the 2.6.21 behaviour and the 2.6.21-post-rik behaviour (and
the 2.4 behaviour, if it differs at all)?

Also some code comments to demystify all of this once and for all?

Thanks.
Re: [PATCH] lazy freeing of memory through MADV_FREE
On Tue, 17 Apr 2007 03:15:51 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
> Make it possible for applications to have the kernel free memory
> lazily. This reduces a repeated free/malloc cycle from freeing
> pages and allocating them, to just marking them freeable. If the
> application wants to reuse them before the kernel needs the memory,
> not even a page fault will happen.
>
> This patch, together with Ulrich's glibc change, increases
> MySQL sysbench performance by a factor of 2 on my quad core
> test system.
>
> Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
>
> ---
> Ulrich Drepper has test glibc RPMS for this functionality at:
>
> http://people.redhat.com/drepper/rpms
>
> Andrew, I have stress tested this patch for a few days now and
> have not been able to find any more bugs. I believe it is ready
> to be merged in -mm, and upstream at the next merge window.
>
> When the patch goes upstream, I will submit a small follow-up
> patch to revert MADV_DONTNEED behaviour to what it did previously
> and have the new behaviour trigger only on MADV_FREE: at that
> point people will have to get new test RPMs of glibc.

I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".

- Nick's patch also will help this problem. It could be that your patch
  no longer offers a 2x speedup when combined with Nick's patch.

  It could well be that the combination of the two is even better, but it
  would be nice to firm that up a bit.

  Chewing a page flag is an expensive thing to do.

  I do go on about that. But we're adding page flags at about one per
  year, and when we run out we're screwed - we'll need to grow the
  pageframe.

- I need to update your patch for Nick's patch. Please confirm that
  down_read(mmap_sem) is sufficient for MADV_FREE.

Stylistic nit:

> + if (PageLazyFree(page) && !migration) {
> +         /* There is new data in the page. Reinstate it. */
> +         if (unlikely(pte_dirty(pteval))) {
> +                 set_pte_at(mm, address, pte, pteval);
> +                 ret = SWAP_FAIL;
> +                 goto out_unmap;
> +         }

The comment should be inside the second `if' statement. As it is, it
looks like we reinstate the page if (PageLazyFree(page) && !migration).
Re: [PATCH] lazy freeing of memory through MADV_FREE
On Tue, 17 Apr 2007 03:15:51 -0400 Rik van Riel [EMAIL PROTECTED] wrote: Make it possible for applications to have the kernel free memory lazily. This reduces a repeated free/malloc cycle from freeing pages and allocating them, to just marking them freeable. If the application wants to reuse them before the kernel needs the memory, not even a page fault will happen. This patch, together with Ulrich's glibc change, increases MySQL sysbench performance by a factor of 2 on my quad core test system. Signed-off-by: Rik van Riel [EMAIL PROTECTED] --- Ulrich Drepper has test glibc RPMS for this functionality at: http://people.redhat.com/drepper/rpms Andrew, I have stress tested this patch for a few days now and have not been able to find any more bugs. I believe it is ready to be merged in -mm, and upstream at the next merge window. When the patch goes upstream, I will submit a small follow-up patch to revert MADV_DONTNEED behaviour to what it did previously and have the new behaviour trigger only on MADV_FREE: at that point people will have to get new test RPMs of glibc. I've also merged Nick's mm: madvise avoid exclusive mmap_sem. - Nick's patch also will help this problem. It could be that your patch no longer offers a 2x speedup when combined with Nick's patch. It could well be that the combination of the two is even better, but it would be nice to firm that up a bit. Chewing a page flag is an expensive thing to do. I do go on about that. But we're adding page flags at about one per year, and when we run out we're screwed - we'll need to grow the pageframe. - I need to update your patch for Nick's patch. Please confirm that down_read(mmap_sem) is sufficient for MADV_FREE. Stylistic nit: + if (PageLazyFree(page) !migration) { + /* There is new data in the page. Reinstate it. */ + if (unlikely(pte_dirty(pteval))) { + set_pte_at(mm, address, pte, pteval); + ret = SWAP_FAIL; + goto out_unmap; + } The comment should be inside the second `if' statement. 
As it is, It looks like we reinstate the page if (PageLazyFree(page) !migration). - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
On Thu, 19 Apr 2007 17:15:28 -0400 Rik van Riel [EMAIL PROTECTED] wrote: Restore MADV_DONTNEED to its original Linux behaviour. This is still not the same behaviour as POSIX, but applications may be depending on the Linux behaviour already. Besides, glibc catches POSIX_MADV_DONTNEED and makes sure nothing is done... OK, we need to flesh this out a lot please. People often get confused about what our MADV_DONTNEED behaviour is. I regularly forget, then look at the code, then get it wrong. That's for mainline, let alone older kernels whose behaviour is gawd-knows-what. So... For the changelog (and the manpage) could we please have a full description of the 2.6.21 behaviour and the 2.6.21-post-rik behaviour (and the 2.4 behaviour, if it differs at all)? Also some code comments to demystify all of this once and for all? Thanks. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
On 4/20/07, Andrew Morton [EMAIL PROTECTED] wrote: OK, we need to flesh this out a lot please. People often get confused about what our MADV_DONTNEED behaviour is. Well, there's not really much to flesh out. The current MADV_DONTNEED is useful in some situations. The behavior cannot be changed, even glibc will rely on it for the case when MADV_FREE is not supported. What might be nice to have is to have a POSIX-compliant POSIX_MADV_DONTNEED implementation. We currently do nothing which is OK since no test suite can detect that. But some code might want to use the real behavior and we're missing an optimization possibility. Just for reference: the MADV_CURRENT behavior is to throw away data in the range. The POSIX_MADV_DONTNEED behavior is to never lose data. I.e., file backed data is written back, anon data is at most swapped out. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] lazy freeing of memory through MADV_FREE
Andrew Morton wrote: I've also merged Nick's mm: madvise avoid exclusive mmap_sem. - Nick's patch also will help this problem. It could be that your patch no longer offers a 2x speedup when combined with Nick's patch. It could well be that the combination of the two is even better, but it would be nice to firm that up a bit. I'll test that. I do go on about that. But we're adding page flags at about one per year, and when we run out we're screwed - we'll need to grow the pageframe. If you want, I can take a look at folding this into the -mapping pointer. I can guarantee you it won't be pretty, though :) - I need to update your patch for Nick's patch. Please confirm that down_read(mmap_sem) is sufficient for MADV_FREE. It is. MADV_FREE needs no more protection than MADV_DONTNEED. Stylistic nit: + if (PageLazyFree(page) !migration) { + /* There is new data in the page. Reinstate it. */ + if (unlikely(pte_dirty(pteval))) { + set_pte_at(mm, address, pte, pteval); + ret = SWAP_FAIL; + goto out_unmap; + } The comment should be inside the second `if' statement. As it is, It looks like we reinstate the page if (PageLazyFree(page) !migration). Want me to move it? -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] lazy freeing of memory through MADV_FREE
On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel [EMAIL PROTECTED] wrote: Andrew Morton wrote: I've also merged Nick's mm: madvise avoid exclusive mmap_sem. - Nick's patch also will help this problem. It could be that your patch no longer offers a 2x speedup when combined with Nick's patch. It could well be that the combination of the two is even better, but it would be nice to firm that up a bit. I'll test that. Thanks. I do go on about that. But we're adding page flags at about one per year, and when we run out we're screwed - we'll need to grow the pageframe. If you want, I can take a look at folding this into the -mapping pointer. I can guarantee you it won't be pretty, though :) Well, let's see how fugly it ends up looking? - I need to update your patch for Nick's patch. Please confirm that down_read(mmap_sem) is sufficient for MADV_FREE. It is. MADV_FREE needs no more protection than MADV_DONTNEED. Stylistic nit: + if (PageLazyFree(page) !migration) { + /* There is new data in the page. Reinstate it. */ + if (unlikely(pte_dirty(pteval))) { + set_pte_at(mm, address, pte, pteval); + ret = SWAP_FAIL; + goto out_unmap; + } The comment should be inside the second `if' statement. As it is, It looks like we reinstate the page if (PageLazyFree(page) !migration). Want me to move it? I did that, thanks. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] lazy freeing of memory through MADV_FREE
Andrew Morton wrote: On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel [EMAIL PROTECTED] wrote: Andrew Morton wrote: I've also merged Nick's mm: madvise avoid exclusive mmap_sem. - Nick's patch also will help this problem. It could be that your patch no longer offers a 2x speedup when combined with Nick's patch. It could well be that the combination of the two is even better, but it would be nice to firm that up a bit. I'll test that. Thanks. Well, good news. It turns out that Nick's patch does not improve peak performance much, but it does prevent the decline when running with 16 threads on my quad core CPU! We _definately_ want both patches, there's a huge benefit in having them both. Here are the transactions/seconds for each combination: vanilla new glibc madv_free kernel madv_free + mmap_sem threads 1 610 609 596545 2103211361196 1200 4107011282014 2024 8100010881665 2087 1677910731310 1999 -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] lazy freeing of memory through MADV_FREE
Rik van Riel wrote:
> Andrew Morton wrote:
> > On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
> >
> > > Andrew Morton wrote:
> > >
> > > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > > >
> > > > - Nick's patch also will help this problem. It could be that your
> > > >   patch no longer offers a 2x speedup when combined with Nick's
> > > >   patch.
> > > >
> > > >   It could well be that the combination of the two is even better,
> > > >   but it would be nice to firm that up a bit.
> > >
> > > I'll test that.
> >
> > Thanks.
>
> Well, good news.
>
> It turns out that Nick's patch does not improve peak
> performance much, but it does prevent the decline when
> running with 16 threads on my quad core CPU!
>
> We _definitely_ want both patches, there's a huge benefit
> in having them both.
>
> Here are the transactions/second for each combination:
>
> threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem
>    1        610        609            596                  545

545 tps versus 610 tps for one thread? It seems quite bad, no?

Could you please find an explanation for this?

>    2       1032       1136           1196                 1200
>    4       1070       1128           2014                 2024
>    8       1000       1088           1665                 2087
>   16        779       1073           1310                 1999

Thank you
Re: [PATCH] lazy freeing of memory through MADV_FREE
Eric Dumazet wrote:
> Rik van Riel wrote:
> > Andrew Morton wrote:
> > > On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
> > >
> > > > Andrew Morton wrote:
> > > >
> > > > > I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".
> > > > >
> > > > > - Nick's patch also will help this problem. It could be that your
> > > > >   patch no longer offers a 2x speedup when combined with Nick's
> > > > >   patch.
> > > > >
> > > > >   It could well be that the combination of the two is even
> > > > >   better, but it would be nice to firm that up a bit.
> > > >
> > > > I'll test that.
> > >
> > > Thanks.
> >
> > Well, good news.
> >
> > It turns out that Nick's patch does not improve peak
> > performance much, but it does prevent the decline when
> > running with 16 threads on my quad core CPU!
> >
> > We _definitely_ want both patches, there's a huge benefit
> > in having them both.
> >
> > Here are the transactions/second for each combination:
> >
> > threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem
> >    1        610        609            596                  545
>
> 545 tps versus 610 tps for one thread? It seems quite bad, no?
>
> Could you please find an explanation for this?

I have no idea why this happens.

Especially the last one, going from a write lock to a read lock on the mmap_sem should not make ANY difference whatsoever since we're running single threaded!

> >    2       1032       1136           1196                 1200
> >    4       1070       1128           2014                 2024
> >    8       1000       1088           1665                 2087
> >   16        779       1073           1310                 1999

Performance with 2 database threads is way better though, and performance with 4 or more threads more than doubles...

If you have an explanation on why single threaded performance went down a little on my quad core system, please let me know.

Does performance suffer at all on a real UP system?
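To put numbers on the comparison being argued about, the figures quoted above can be read as five columns (threads, vanilla, new glibc, madv_free kernel, madv_free + mmap_sem) and the last column expressed relative to vanilla. This is a small sketch; the struct, array and function names are made up for illustration, and only the throughput values come from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Transactions/second as posted in the thread, one row per thread count. */
struct bench_row {
	int threads;
	int vanilla;
	int new_glibc;
	int madv_free;
	int madv_free_mmap_sem;
};

static const struct bench_row bench[] = {
	{  1,  610,  609,  596,  545 },
	{  2, 1032, 1136, 1196, 1200 },
	{  4, 1070, 1128, 2014, 2024 },
	{  8, 1000, 1088, 1665, 2087 },
	{ 16,  779, 1073, 1310, 1999 },
};

#define BENCH_ROWS (sizeof(bench) / sizeof(bench[0]))

/*
 * Throughput with both patches as a percentage of the vanilla result
 * for the same thread count, rounded down to a whole percent.
 */
int combined_vs_vanilla_percent(size_t row)
{
	assert(row < BENCH_ROWS);
	return bench[row].madv_free_mmap_sem * 100 / bench[row].vanilla;
}
```

Row 0 (one thread) comes out at 89, which is the roughly 10% single-threaded drop Eric asks about; row 4 (16 threads) comes out at 256.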
Re: [PATCH] lazy freeing of memory through MADV_FREE 2/2
Restore MADV_DONTNEED to its original Linux behaviour. This is still not the same behaviour as POSIX, but applications may be depending on the Linux behaviour already. Besides, glibc catches POSIX_MADV_DONTNEED and makes sure nothing is done...

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
---
This is to be applied on top of the original MADV_FREE patch. It turns out that the current glibc patch already falls back to MADV_DONTNEED if it gets an -EINVAL.

--- linux-2.6.20.x86_64/mm/madvise.c.madv_free	2007-04-19 16:46:22.0 -0400
+++ linux-2.6.20.x86_64/mm/madvise.c	2007-04-19 16:52:19.0 -0400
@@ -130,7 +130,8 @@ static long madvise_willneed(struct vm_a
  */
 static long madvise_dontneed(struct vm_area_struct * vma,
 			     struct vm_area_struct ** prev,
-			     unsigned long start, unsigned long end)
+			     unsigned long start, unsigned long end,
+			     int behavior)
 {
 	*prev = vma;
 	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
@@ -142,12 +143,14 @@ static long madvise_dontneed(struct vm_a
 			.last_index = ULONG_MAX,
 		};
 		zap_page_range(vma, start, end - start, &details);
-	} else {
+	} else if (behavior == MADV_FREE) {
 		struct zap_details details = {
 			.madv_free = 1,
 		};
 		zap_page_range(vma, start, end - start, &details);
-	}
+	} else /* behavior == MADV_DONTNEED */
+		zap_page_range(vma, start, end - start, NULL);
+
 	return 0;
 }
 
@@ -219,10 +222,9 @@ madvise_vma(struct vm_area_struct *vma,
 		error = madvise_willneed(vma, prev, start, end);
 		break;
 
-	/* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
 	case MADV_DONTNEED:
 	case MADV_FREE:
-		error = madvise_dontneed(vma, prev, start, end);
+		error = madvise_dontneed(vma, prev, start, end, behavior);
 		break;
 
 	default:
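The "original Linux behaviour" that this patch restores can be observed from user space. The sketch below assumes a Linux system and a private anonymous mapping; the helper name is hypothetical and not part of the patch. It fills a mapping with a pattern, issues MADV_DONTNEED, and checks whether the old contents are gone.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: returns 1 if a private anonymous mapping reads
 * back as zero after MADV_DONTNEED, 0 if the old data survived, and
 * -1 if the mapping or the madvise() call itself failed.
 */
int dontneed_discards_data(size_t len)
{
	unsigned char *p;
	int discarded;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xAB, len);		/* dirty every page */
	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}

	/* On Linux the next access faults in fresh zero-filled pages. */
	discarded = (p[0] == 0 && p[len - 1] == 0);
	munmap(p, len);
	return discarded;
}
```

This is the data loss that the removed FIXME comment refers to: the call discards the contents immediately, which is what existing applications may already rely on.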
[PATCH] lazy freeing of memory through MADV_FREE
Make it possible for applications to have the kernel free memory lazily. This reduces a repeated free/malloc cycle from freeing pages and allocating them, to just marking them freeable. If the application wants to reuse them before the kernel needs the memory, not even a page fault will happen.

This patch, together with Ulrich's glibc change, increases MySQL sysbench performance by a factor of 2 on my quad core test system.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>
---
Ulrich Drepper has test glibc RPMS for this functionality at:

    http://people.redhat.com/drepper/rpms

Andrew, I have stress tested this patch for a few days now and have not been able to find any more bugs. I believe it is ready to be merged in -mm, and upstream at the next merge window.

When the patch goes upstream, I will submit a small follow-up patch to revert MADV_DONTNEED behaviour to what it did previously and have the new behaviour trigger only on MADV_FREE: at that point people will have to get new test RPMs of glibc.

--- linux-2.6.21-rc6-mm1/include/asm-parisc/mman.h.madv_free	2007-04-17 02:17:19.0 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-parisc/mman.h	2007-04-17 02:22:46.0 -0400
@@ -38,6 +38,7 @@
 #define MADV_SPACEAVAIL 5		/* insure that resources are reserved */
 #define MADV_VPS_PURGE  6		/* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7		/* Inherit parents page size */
+#define MADV_FREE	8		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/asm-mips/mman.h.madv_free	2007-04-17 02:17:19.0 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-mips/mman.h	2007-04-17 02:22:46.0 -0400
@@ -65,6 +65,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/asm-xtensa/mman.h.madv_free	2007-04-17 02:17:19.0 -0400
+++ linux-2.6.21-rc6-mm1/include/asm-xtensa/mman.h	2007-04-17 02:22:46.0 -0400
@@ -72,6 +72,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.21-rc6-mm1/include/linux/swap.h.madv_free	2007-04-17 02:17:43.0 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/swap.h	2007-04-17 02:22:46.0 -0400
@@ -182,6 +182,7 @@ extern void FASTCALL(lru_cache_add(struc
 extern void FASTCALL(lru_cache_add_active(struct page *));
 extern void FASTCALL(lru_cache_add_tail(struct page *));
 extern void FASTCALL(activate_page(struct page *));
+extern void FASTCALL(deactivate_tail_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
 extern int lru_add_drain_all(void);
--- linux-2.6.21-rc6-mm1/include/linux/mm.h.madv_free	2007-04-17 02:17:43.0 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/mm.h	2007-04-17 02:22:46.0 -0400
@@ -767,6 +767,7 @@ struct zap_details {
 	pgoff_t last_index;			/* Highest page->index to unmap */
 	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
+	short madv_free;			/* MADV_FREE anonymous memory */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
--- linux-2.6.21-rc6-mm1/include/linux/page-flags.h.madv_free	2007-04-17 02:17:43.0 -0400
+++ linux-2.6.21-rc6-mm1/include/linux/page-flags.h	2007-04-17 02:23:16.0 -0400
@@ -91,6 +91,7 @@
 #define PG_booked		20	/* Has blocks reserved on-disk */
 
 #define PG_readahead		21	/* Reminder to do read-ahead */
+#define PG_lazyfree		22	/* MADV_FREE potential throwaway */
 
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
@@ -216,6 +217,11 @@ static inline void SetPageUptodate(struc
 #define ClearPageReclaim(page)	clear_bit(PG_reclaim, &(page)->flags)
 #define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags)
 
+#define PageLazyFree(page)	test_bit(PG_lazyfree, &(page)->flags)
+#define SetPageLazyFree(page)	set_bit(PG_lazyfree, &(page)->flags)
+#define ClearPageLazyFree(page)	clear_bit(PG_lazyfree, &(page)->flags)
+#define __ClearPageLazyFree(page) __clear_bit(PG_lazyfree, &(page)->flags)
+
 #define PageCompound(page)	test_bit(PG_compound, &(page)->flags)
 #define
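From the application side, the lazy-free path described in the changelog might be used as sketched below. This is an illustration under assumptions, not the glibc change itself: the helper name is hypothetical, and the fallback to MADV_DONTNEED on EINVAL mirrors what this thread says the test glibc does when the kernel does not recognise MADV_FREE.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: tell the kernel that the contents of
 * [addr, addr + len) are no longer needed.  MADV_FREE is tried first;
 * if the kernel rejects it with EINVAL, fall back to MADV_DONTNEED.
 * Returns 0 on success, -1 with errno set on failure.
 */
int release_range(void *addr, size_t len)
{
	assert(addr != NULL && len > 0);
#ifdef MADV_FREE
	if (madvise(addr, len, MADV_FREE) == 0)
		return 0;
	if (errno != EINVAL)
		return -1;
#endif
	return madvise(addr, len, MADV_DONTNEED);
}
```

An allocator would call something like `release_range()` on a freed, page-aligned chunk and simply write to the range again when it is reused. After either call the old contents must be treated as gone.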