Re: prezeroing V6 [2/3]: ScrubD
On Tue, 8 Feb 2005 12:51:05 -0800 (PST), Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 8 Feb 2005, Andrew Morton wrote:
> > We also need to try to identify workloads which might experience a
> > regression and test them too.  It isn't very hard.
>
> I'd be glad if you could provide some instructions on how exactly to do
> that. I have run lmbench, aim9, aim7, unixbench, ubench for a couple of
> configurations. But which configurations do you want?

If we can run some tests for you on STP, let me know. (We do 1, 2, 4 and
8 CPU x86 boxes.)

cliffw

--
"I've always gone through periods where I bolt upright at four in the
morning; now at least there's a reason." -Michael Feldman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: prezeroing V6 [2/3]: ScrubD
On Tue, 8 Feb 2005, Andrew Morton wrote:
> We also need to try to identify workloads which might experience a
> regression and test them too.  It isn't very hard.

I'd be glad if you could provide some instructions on how exactly to do
that. I have run lmbench, aim9, aim7, unixbench, ubench for a couple of
configurations. But which configurations do you want?
Re: prezeroing V6 [2/3]: ScrubD
Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> On Mon, 7 Feb 2005, Andrew Morton wrote:
> > > No, it's a page fault benchmark. Dave Miller has done some kernel
> > > compiles and I have some benchmarks here that I never posted because
> > > they do not show any material change as far as I can see. I will be
> > > posting that soon when this is complete (also need to do the same
> > > for the atomic page fault ops and the prefaulting patch).
> >
> > OK, thanks.  That's important work.  After all, this patch is a
> > performance optimisation.
>
> Well, it's a bit complicated due to the various configurations: UP, and
> then more and more processors. Plus the NUMA stuff and the standard
> benchmarks that are basically not suited for SMP tests make this a bit
> difficult.

The patch is supposed to speed the kernel up with at least some
workloads.  We 100% need to see testing results with some such workloads
to verify that the patch is desirable.

We also need to try to identify workloads which might experience a
regression and test them too.  It isn't very hard.

> > > memory node is bound to a set of cpus. This may be controlled by the
> > > NUMA node configuration, e.g. for nodes without cpus.
> >
> > kthread_bind() should be able to do this.  From a quick read it
> > appears to have shortcomings in this department (it expects to be
> > bound to a single CPU).
>
> Sorry, but I still do not get what the problem is. kscrubd does exactly
> what kswapd does and can be handled in the same way. It works fine here
> on various multi-node configurations and correctly gets CPUs assigned.

We now have a standard API for starting, binding and stopping kernel
threads.  It's best to use it.
Re: prezeroing V6 [2/3]: ScrubD
On Mon, 7 Feb 2005, Andrew Morton wrote:
> > No, it's a page fault benchmark. Dave Miller has done some kernel
> > compiles and I have some benchmarks here that I never posted because
> > they do not show any material change as far as I can see. I will be
> > posting that soon when this is complete (also need to do the same for
> > the atomic page fault ops and the prefaulting patch).
>
> OK, thanks.  That's important work.  After all, this patch is a
> performance optimisation.

Well, it's a bit complicated due to the various configurations: UP, and
then more and more processors. Plus the NUMA stuff and the standard
benchmarks that are basically not suited for SMP tests make this a bit
difficult.

> > memory node is bound to a set of cpus. This may be controlled by the
> > NUMA node configuration, e.g. for nodes without cpus.
>
> kthread_bind() should be able to do this.  From a quick read it appears
> to have shortcomings in this department (it expects to be bound to a
> single CPU).

Sorry, but I still do not get what the problem is. kscrubd does exactly
what kswapd does and can be handled in the same way. It works fine here
on various multi-node configurations and correctly gets CPUs assigned.
Re: prezeroing V6 [2/3]: ScrubD
Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> On Mon, 7 Feb 2005, Andrew Morton wrote:
> > > Look at the early posts. I plan to put that up on the web. I have
> > > some stats attached to the end of this message from an earlier post.
> >
> > But that's a patch-specific microbenchmark, isn't it?  Has this work
> > been benchmarked against real-world stuff?
>
> No, it's a page fault benchmark. Dave Miller has done some kernel
> compiles and I have some benchmarks here that I never posted because
> they do not show any material change as far as I can see. I will be
> posting that soon when this is complete (also need to do the same for
> the atomic page fault ops and the prefaulting patch).

OK, thanks.  That's important work.  After all, this patch is a
performance optimisation.

> > > > Should we be managing the kernel threads with the kthread() API?
> > >
> > > What would you like to manage?
> >
> > Startup, perhaps binding the threads to their cpus too.
>
> That is all already controllable in the same way as the swapper.

kswapd uses an old API.

> Each memory node is bound to a set of cpus. This may be controlled by
> the NUMA node configuration, e.g. for nodes without cpus.

kthread_bind() should be able to do this.  From a quick read it appears
to have shortcomings in this department (it expects to be bound to a
single CPU).

We should fix kthread_bind() so that it can accommodate the
kscrubd/kswapd requirement.  That's one of the _reasons_ for using the
provided infrastructure rather than open-coding around it.
Re: prezeroing V6 [2/3]: ScrubD
On Mon, 7 Feb 2005, Andrew Morton wrote:
> > Look at the early posts. I plan to put that up on the web. I have some
> > stats attached to the end of this message from an earlier post.
>
> But that's a patch-specific microbenchmark, isn't it?  Has this work
> been benchmarked against real-world stuff?

No, it's a page fault benchmark. Dave Miller has done some kernel
compiles and I have some benchmarks here that I never posted because they
do not show any material change as far as I can see. I will be posting
that soon when this is complete (also need to do the same for the atomic
page fault ops and the prefaulting patch).

> > > Should we be managing the kernel threads with the kthread() API?
> >
> > What would you like to manage?
>
> Startup, perhaps binding the threads to their cpus too.

That is all already controllable in the same way as the swapper. Each
memory node is bound to a set of cpus. This may be controlled by the
NUMA node configuration, e.g. for nodes without cpus.
Re: prezeroing V6 [2/3]: ScrubD
Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> > What were the benchmarking results for this work?  I think you had
> > some, but this is pretty vital info, so it should be retained in the
> > changelogs.
>
> Look at the early posts. I plan to put that up on the web. I have some
> stats attached to the end of this message from an earlier post.

But that's a patch-specific microbenchmark, isn't it?  Has this work been
benchmarked against real-world stuff?

> > Should we be managing the kernel threads with the kthread() API?
>
> What would you like to manage?

Startup, perhaps binding the threads to their cpus too.
Re: prezeroing V6 [2/3]: ScrubD
On Mon, 7 Feb 2005, Andrew Morton wrote:
> Christoph Lameter <[EMAIL PROTECTED]> wrote:
> >
> > Adds management of ZEROED and NOT_ZEROED pages and a background
> > daemon called scrubd.
>
> What were the benchmarking results for this work?  I think you had
> some, but this is pretty vital info, so it should be retained in the
> changelogs.

Look at the early posts. I plan to put that up on the web. I have some
stats attached to the end of this message from an earlier post.

> Having one kscrubd per node seems like the right thing to do.

Yes, that is what is happening. Otherwise our NUMA stuff would not work
right ;-)

> Should we be managing the kernel threads with the kthread() API?

What would you like to manage?

-- Earlier post

The scrub daemon is invoked when an unzeroed page of a certain order has
been generated, so that it is worth running it. If no higher order pages
are present then the logic will favor hot zeroing rather than simply
shifting processing around. kscrubd typically runs only for a fraction
of a second and sleeps for long periods of time even under memory
benchmarking. kscrubd performs short bursts of zeroing when needed and
tries to stay off the processor as much as possible.

The result is a significant increase of the page fault performance even
for single threaded applications (i386 2x PIII-450, 384M RAM, allocating
256M in each run):

w/o patch:
 Gb Rep Threads   User    System    Wall    flt/cpu/s  fault/wsec
  0   1    1     0.006s   0.389s   0.039s  157455.320  157070.694
  0   1    2     0.007s   0.607s   0.032s  101476.689  190350.885

w/patch:
 Gb Rep Threads   User    System    Wall    flt/cpu/s  fault/wsec
  0   1    1     0.008s   0.083s   0.009s  672151.422  664045.899
  0   1    2     0.005s   0.129s   0.008s  459629.796  741857.373

The performance can only be upheld if enough zeroed pages are available.
In a heavy memory intensive benchmark the system may run out of these
very fast, but the efficient algorithm for page zeroing still makes this
a winner (2 way system with 384MB RAM, no hardware zeroing support).

In the following measurement the test is repeated 10 times, allocating
256M each time in rapid succession, which would deplete the pool of
zeroed pages quickly:

w/o patch:
 Gb Rep Threads   User    System    Wall    flt/cpu/s  fault/wsec
  0  10    1     0.058s   3.913s   3.097s  157335.774  157076.932
  0  10    2     0.063s   6.139s   3.027s  100756.788  190572.486

w/patch:
 Gb Rep Threads   User    System    Wall    flt/cpu/s  fault/wsec
  0  10    1     0.059s   1.828s   1.089s  330913.517  330225.515
  0  10    2     0.082s   1.951s   1.094s  307172.100  320680.232

Note that zeroing of pages makes no sense if the application touches all
cache lines of an allocated page (there is no influence of prezeroing on
benchmarks like lmbench for that reason): the extensive caching of
modern cpus means that the zeroes written to a hot zeroed page will be
overwritten by the application in the cpu cache, and thus the zeros will
never make it to memory! The test program used above only touches one
128 byte cache line of a 16k page (ia64). Sparsely populated and
accessed areas are typical for lots of applications.

Here is another test in order to gauge the influence of the number of
cache lines touched on the performance of the prezero enhancements:

 Gb Rep Thr CLine   User   System   Wall    flt/cpu/s  fault/wsec
  1   1   1    1    0.01s   0.12s  0.01s  500813.853  497925.891
  1   1   1    2    0.01s   0.11s  0.01s  493453.103  472877.725
  1   1   1    4    0.02s   0.10s  0.01s  479351.658  471507.415
  1   1   1    8    0.01s   0.13s  0.01s  424742.054  416725.013
  1   1   1   16    0.05s   0.12s  0.01s  347715.359  336983.834
  1   1   1   32    0.12s   0.13s  0.02s  258112.286  256246.731
  1   1   1   64    0.24s   0.14s  0.03s  169896.381  168189.283
  1   1   1  128    0.49s   0.14s  0.06s  102300.257  101674.435

The benefits of prezeroing are reduced to minimal quantities if all
cachelines of a page are touched. Prezeroing can only be effective if
the whole page is not immediately used after the page fault.
Re: prezeroing V6 [2/3]: ScrubD
Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> Adds management of ZEROED and NOT_ZEROED pages and a background daemon
> called scrubd.

What were the benchmarking results for this work?  I think you had some,
but this is pretty vital info, so it should be retained in the
changelogs.

Having one kscrubd per node seems like the right thing to do.

Should we be managing the kernel threads with the kthread() API?
prezeroing V6 [2/3]: ScrubD
Adds management of ZEROED and NOT_ZEROED pages and a background daemon
called scrubd. If a page of the order specified in
/proc/sys/vm/scrub_start or higher is coalesced, the scrub daemon will
start zeroing until all pages of order /proc/sys/vm/scrub_stop and
higher are zeroed, and then go back to sleep.

In an SMP environment the scrub daemon typically runs on the most idle
cpu. Thus a single threaded application running on one cpu may have the
other cpu zeroing pages for it, etc. The scrub daemon is hardly
noticeable and usually finishes zeroing quickly, since most processors
are optimized for linear memory filling.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c	2005-02-03 22:51:57.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-02-03 22:52:19.000000000 -0800
@@ -12,6 +12,8 @@
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
  *  Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
  *          (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Page zeroing by Christoph Lameter, SGI, Dec 2004 using
+ *  initial code for __GFP_ZERO support by Andrea Arcangeli, Oct 2004.
  */

 #include <linux/config.h>
@@ -33,6 +35,7 @@
 #include <linux/cpu.h>
 #include <linux/nodemask.h>
 #include <linux/vmalloc.h>
+#include <linux/scrub.h>

 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -175,16 +178,16 @@ static void destroy_compound_page(struct
  * zone->lock is already acquired when we use these.
  * So, we don't need atomic page->flags operations here.
  */
-static inline unsigned long page_order(struct page *page) {
+static inline unsigned long page_zorder(struct page *page) {
 	return page->private;
 }

-static inline void set_page_order(struct page *page, int order) {
-	page->private = order;
+static inline void set_page_zorder(struct page *page, int order, int zero) {
+	page->private = order + (zero << 10);
 	__SetPagePrivate(page);
 }

-static inline void rmv_page_order(struct page *page)
+static inline void rmv_page_zorder(struct page *page)
 {
 	__ClearPagePrivate(page);
 	page->private = 0;
@@ -195,14 +198,15 @@ static inline void rmv_page_order(struct
  * we can do coalesce a page and its buddy if
  * (a) the buddy is free &&
  * (b) the buddy is on the buddy system &&
- * (c) a page and its buddy have the same order.
+ * (c) a page and its buddy have the same order and the same
+ *     zeroing status.
  * for recording page's order, we use page->private and PG_private.
  *
  */
-static inline int page_is_buddy(struct page *page, int order)
+static inline int page_is_buddy(struct page *page, int order, int zero)
 {
 	if (PagePrivate(page)           &&
-	    (page_order(page) == order) &&
+	    (page_zorder(page) == order + (zero << 10)) &&
 	    !PageReserved(page)         &&
 	    page_count(page) == 0)
 		return 1;
@@ -233,22 +237,20 @@ static inline int page_is_buddy(struct p
  * -- wli
  */

-static inline void __free_pages_bulk (struct page *page, struct page *base,
-		struct zone *zone, unsigned int order)
+static inline int __free_pages_bulk (struct page *page, struct page *base,
+		struct zone *zone, unsigned int order, int zero)
 {
 	unsigned long page_idx;
 	struct page *coalesced;
-	int order_size = 1 << order;

 	if (unlikely(order))
 		destroy_compound_page(page, order);

 	page_idx = page - base;

-	BUG_ON(page_idx & (order_size - 1));
+	BUG_ON(page_idx & ((1 << order) - 1));
 	BUG_ON(bad_range(zone, page));

-	zone->free_pages += order_size;
 	while (order < MAX_ORDER-1) {
 		struct free_area *area;
 		struct page *buddy;
@@ -258,20 +260,21 @@ static inline void __free_pages_bulk (st
 		buddy = base + buddy_idx;
 		if (bad_range(zone, buddy))
 			break;
-		if (!page_is_buddy(buddy, order))
+		if (!page_is_buddy(buddy, order, zero))
 			break;
 		/* Move the buddy up one level. */
 		list_del(&buddy->lru);
-		area = zone->free_area + order;
+		area = zone->free_area[zero] + order;
 		area->nr_free--;
-		rmv_page_order(buddy);
+		rmv_page_zorder(buddy);
 		page_idx &= buddy_idx;
 		order++;
 	}
 	coalesced = base + page_idx;
-	set_page_order(coalesced, order);
-	list_add(&coalesced->lru, &zone->free_area[order].free_list);
-	zone->free_area[order].nr_free++;
+	set_page_zorder(coalesced, order, zero);
+	list_add(&coalesced->lru, &zone->free_area[zero][order].free_list);
+	zone->free_area[zero][order].nr_free++;
+	return order;
 }

 static inline void free_pages_check(const char *function, struct page *page)
@@ -320,8 +323,11 @@ free_pages_bulk(struct zone *zone,