Re: prezeroing V6 [2/3]: ScrubD

2005-02-08 Thread cliff white
On Tue, 8 Feb 2005 12:51:05 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Tue, 8 Feb 2005, Andrew Morton wrote:
> 
> > We also need to try to identify workloads which might experience a
> > regression and test them too.  It isn't very hard.
> 
> I'd be glad if you could provide some instructions on how exactly to do
> that. I have run lmbench, aim9, aim7, unixbench, ubench for a couple of
> configurations. But which configurations do you want?

If we can run some tests for you on STP, let me know.
(We do 1-, 2-, 4-, and 8-CPU x86 boxes.)
cliffw




-- 
"Ive always gone through periods where I bolt upright at four in the morning; 
now at least theres a reason." -Michael Feldman


Re: prezeroing V6 [2/3]: ScrubD

2005-02-08 Thread Christoph Lameter
On Tue, 8 Feb 2005, Andrew Morton wrote:

> We also need to try to identify workloads which might experience a
> regression and test them too.  It isn't very hard.

I'd be glad if you could provide some instructions on how exactly to do
that. I have run lmbench, aim9, aim7, unixbench, ubench for a couple of
configurations. But which configurations do you want?


Re: prezeroing V6 [2/3]: ScrubD

2005-02-08 Thread Andrew Morton
Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> On Mon, 7 Feb 2005, Andrew Morton wrote:
> 
> > > No, it's a page fault benchmark. Dave Miller has done some kernel compiles
> > > and I have some benchmarks here that I never posted because they do not
> > > show any material change as far as I can see. I will be posting that soon
> > > when this is complete (also need to do the same for the atomic page fault
> > > ops and the prefaulting patch).
> >
> > OK, thanks.  That's important work.  After all, this patch is a performance
> > optimisation.
> 
> Well, it's a bit complicated due to the various configurations: UP, and
> then more and more processors. Plus, the NUMA stuff and the standard
> benchmarks, which are basically not suited to SMP tests, make this a bit
> difficult.

The patch is supposed to speed the kernel up with at least some workloads. 
We 100% need to see testing results with some such workloads to verify that
the patch is desirable.

We also need to try to identify workloads which might experience a
regression and test them too.  It isn't very hard.

> > > memory node is bound to a set of CPUs. This may be controlled by the
> > > NUMA node configuration, e.g. for nodes without CPUs.
> >
> > kthread_bind() should be able to do this.  From a quick read it appears to
> > have shortcomings in this department (it expects to be bound to a single
> > CPU).
> 
> Sorry, but I still do not get what the problem is. kscrubd does exactly
> what kswapd does and can be handled in the same way. It works fine here
> on various multi-node configurations and correctly gets CPUs assigned.

We now have a standard API for starting, binding and stopping kernel
threads.  It's best to use it.


Re: prezeroing V6 [2/3]: ScrubD

2005-02-08 Thread Christoph Lameter
On Mon, 7 Feb 2005, Andrew Morton wrote:

> > No, it's a page fault benchmark. Dave Miller has done some kernel compiles
> > and I have some benchmarks here that I never posted because they do not
> > show any material change as far as I can see. I will be posting that soon
> > when this is complete (also need to do the same for the atomic page fault
> > ops and the prefaulting patch).
>
> OK, thanks.  That's important work.  After all, this patch is a performance
> optimisation.

Well, it's a bit complicated due to the various configurations: UP, and then
more and more processors. Plus, the NUMA stuff and the standard benchmarks,
which are basically not suited to SMP tests, make this a bit difficult.

> > memory node is bound to a set of CPUs. This may be controlled by the
> > NUMA node configuration, e.g. for nodes without CPUs.
>
> kthread_bind() should be able to do this.  From a quick read it appears to
> have shortcomings in this department (it expects to be bound to a single
> CPU).

Sorry, but I still do not get what the problem is. kscrubd does exactly
what kswapd does and can be handled in the same way. It works fine here
on various multi-node configurations and correctly gets CPUs assigned.



Re: prezeroing V6 [2/3]: ScrubD

2005-02-07 Thread Andrew Morton
Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> On Mon, 7 Feb 2005, Andrew Morton wrote:
> 
> > > Look at the early posts. I plan to put that up on the web. I have some
> > > stats attached to the end of this message from an earlier post.
> >
> > But that's a patch-specific microbenchmark, isn't it?  Has this work been
> > benchmarked against real-world stuff?
> 
> No, it's a page fault benchmark. Dave Miller has done some kernel compiles
> and I have some benchmarks here that I never posted because they do not
> show any material change as far as I can see. I will be posting that soon
> when this is complete (also need to do the same for the atomic page fault
> ops and the prefaulting patch).

OK, thanks.  That's important work.  After all, this patch is a performance
optimisation.

> > > > Should we be managing the kernel threads with the kthread() API?
> > >
> > > What would you like to manage?
> >
> > Startup, perhaps binding the threads to their cpus too.
> 
> That is all already controllable in the same way as the swapper.

kswapd uses an old API.

> Each
> memory node is bound to a set of CPUs. This may be controlled by the
> NUMA node configuration, e.g. for nodes without CPUs.

kthread_bind() should be able to do this.  From a quick read it appears to
have shortcomings in this department (it expects to be bound to a single
CPU).

We should fix kthread_bind() so that it can accommodate the kscrub/kswapd
requirement.  That's one of the _reasons_ for using the provided
infrastructure rather than open-coding around it.
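
For illustration, a minimal sketch of how a per-node kscrubd could be
started through the kthread API and then given the whole node's CPU set
(hypothetical code, not from the patch; it assumes set_cpus_allowed() and
node_to_cpumask() as the workaround for kthread_bind()'s single-CPU limit):

#include <linux/kthread.h>
#include <linux/err.h>
#include <linux/sched.h>
#include <linux/mmzone.h>
#include <linux/topology.h>

static int kscrubd(void *arg)
{
	pg_data_t *pgdat = arg;

	while (!kthread_should_stop()) {
		/* ... zero pending pages on this node, then sleep ... */
	}
	return 0;
}

static void start_kscrubd(pg_data_t *pgdat)
{
	struct task_struct *task;

	task = kthread_run(kscrubd, pgdat, "kscrubd%d", pgdat->node_id);
	if (!IS_ERR(task))
		/* allow the whole node's CPU set rather than one CPU */
		set_cpus_allowed(task, node_to_cpumask(pgdat->node_id));
}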


Re: prezeroing V6 [2/3]: ScrubD

2005-02-07 Thread Christoph Lameter
On Mon, 7 Feb 2005, Andrew Morton wrote:

> > Look at the early posts. I plan to put that up on the web. I have some
> > stats attached to the end of this message from an earlier post.
>
> But that's a patch-specific microbenchmark, isn't it?  Has this work been
> benchmarked against real-world stuff?

No, it's a page fault benchmark. Dave Miller has done some kernel compiles
and I have some benchmarks here that I never posted because they do not
show any material change as far as I can see. I will be posting that soon
when this is complete (also need to do the same for the atomic page fault
ops and the prefaulting patch).

> > > Should we be managing the kernel threads with the kthread() API?
> >
> > What would you like to manage?
>
> Startup, perhaps binding the threads to their cpus too.

That is all already controllable in the same way as the swapper. Each
memory node is bound to a set of CPUs. This may be controlled by the
NUMA node configuration, e.g. for nodes without CPUs.


Re: prezeroing V6 [2/3]: ScrubD

2005-02-07 Thread Andrew Morton
Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> > What were the benchmarking results for this work?  I think you had some,
> > but this is pretty vital info, so it should be retained in the changelogs.
> 
> Look at the early posts. I plan to put that up on the web. I have some
> stats attached to the end of this message from an earlier post.

But that's a patch-specific microbenchmark, isn't it?  Has this work been
benchmarked against real-world stuff?

> > Should we be managing the kernel threads with the kthread() API?
> 
> What would you like to manage?

Startup, perhaps binding the threads to their cpus too.


Re: prezeroing V6 [2/3]: ScrubD

2005-02-07 Thread Christoph Lameter
On Mon, 7 Feb 2005, Andrew Morton wrote:

> Christoph Lameter <[EMAIL PROTECTED]> wrote:
> >
> > Adds management of ZEROED and NOT_ZEROED pages and a background daemon
> > called scrubd.
>
> What were the benchmarking results for this work?  I think you had some,
> but this is pretty vital info, so it should be retained in the changelogs.

Look at the early posts. I plan to put that up on the web. I have some
stats attached to the end of this message from an earlier post.

> Having one kscrubd per node seems like the right thing to do.

Yes, that is what is happening. Otherwise our NUMA stuff would not work
right ;-)

> Should we be managing the kernel threads with the kthread() API?

What would you like to manage?

-- Earlier post
The scrub daemon is invoked when an unzeroed page of a certain order has
been generated, so that it's worth running it. If no higher-order pages are
present then the logic will favor hot zeroing rather than simply shifting
processing around. kscrubd typically runs only for a fraction of a second
and sleeps for long periods of time even under memory benchmarking. kscrubd
performs short bursts of zeroing when needed and tries to stay off the
processor as much as possible.

The result is a significant increase in page fault performance even for
single-threaded applications (i386, 2x PIII-450, 384M RAM, allocating 256M
in each run):

w/o patch:
 Gb Rep Threads    User   System    Wall   flt/cpu/s  fault/wsec
  0   1       1  0.006s  0.389s  0.039s  157455.320  157070.694
  0   1       2  0.007s  0.607s  0.032s  101476.689  190350.885

w/patch:
 Gb Rep Threads    User   System    Wall   flt/cpu/s  fault/wsec
  0   1       1  0.008s  0.083s  0.009s  672151.422  664045.899
  0   1       2  0.005s  0.129s  0.008s  459629.796  741857.373

The performance can only be upheld if enough zeroed pages are available.
In a heavy memory-intensive benchmark the system may run out of these very
fast, but the efficient algorithm for page zeroing still makes this a winner
(2-way system with 384MB RAM, no hardware zeroing support). In the following
measurement the test is repeated 10 times, allocating 256M in each run in
rapid succession (which would deplete the pool of zeroed pages quickly):

w/o patch:
 Gb Rep Threads    User   System    Wall   flt/cpu/s  fault/wsec
  0  10       1  0.058s  3.913s  3.097s  157335.774  157076.932
  0  10       2  0.063s  6.139s  3.027s  100756.788  190572.486

w/patch:
 Gb Rep Threads    User   System    Wall   flt/cpu/s  fault/wsec
  0  10       1  0.059s  1.828s  1.089s  330913.517  330225.515
  0  10       2  0.082s  1.951s  1.094s  307172.100  320680.232

Note that zeroing of pages makes no sense if the application touches all
cache lines of an allocated page (there is no influence of prezeroing on
benchmarks like lmbench for that reason), since the extensive caching of
modern cpus means that the zeroes written to a hot zeroed page will then be
overwritten by the application in the cpu cache, and thus the zeros will
never make it to memory! The test program used above only touches one
128-byte cache line of a 16k page (ia64). Sparsely populated and accessed
areas are typical for lots of applications.

Here is another test in order to gauge the influence of the number of cache
lines touched on the performance of the prezero enhancements:

 Gb Rep Thr CLine   User  System    Wall   flt/cpu/s  fault/wsec
  1   1   1     1  0.01s   0.12s  0.01s  500813.853  497925.891
  1   1   1     2  0.01s   0.11s  0.01s  493453.103  472877.725
  1   1   1     4  0.02s   0.10s  0.01s  479351.658  471507.415
  1   1   1     8  0.01s   0.13s  0.01s  424742.054  416725.013
  1   1   1    16  0.05s   0.12s  0.01s  347715.359  336983.834
  1   1   1    32  0.12s   0.13s  0.02s  258112.286  256246.731
  1   1   1    64  0.24s   0.14s  0.03s  169896.381  168189.283
  1   1   1   128  0.49s   0.14s  0.06s  102300.257  101674.435

The benefits of prezeroing are reduced to minimal quantities if all
cachelines of a page are touched. Prezeroing can only be effective
if the whole page is not immediately used after the page fault.
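
For reference, the essence of such a sparse-touch test can be sketched in a
few lines of userspace C (an illustrative reconstruction, not the actual
test program used for the numbers above):

#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	size_t npages = (256UL << 20) / pagesize;	/* 256M, as in each run */
	char *mem = malloc(npages * pagesize);
	size_t i;

	if (!mem)
		return 1;
	/* Touch only the first cache line of every page: each write takes
	   one page fault but leaves the rest of the page cold. */
	for (i = 0; i < npages; i++)
		mem[i * pagesize] = 1;
	free(mem);
	return 0;
}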




Re: prezeroing V6 [2/3]: ScrubD

2005-02-07 Thread Andrew Morton
Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> Adds management of ZEROED and NOT_ZEROED pages and a background daemon
> called scrubd.

What were the benchmarking results for this work?  I think you had some,
but this is pretty vital info, so it should be retained in the changelogs.

Having one kscrubd per node seems like the right thing to do.

Should we be managing the kernel threads with the kthread() API?


prezeroing V6 [2/3]: ScrubD

2005-02-07 Thread Christoph Lameter
Adds management of ZEROED and NOT_ZEROED pages and a background daemon
called scrubd. If a page of the order specified in /proc/sys/vm/scrub_start
or higher is coalesced, then the scrub daemon will start zeroing until all
pages of order /proc/sys/vm/scrub_stop and higher are zeroed, and then go
back to sleep.

In an SMP environment the scrub daemon typically runs on the most idle
cpu. Thus a single-threaded application running on one cpu may have the
other cpu zeroing pages for it, etc. The scrub daemon is hardly noticeable
and usually finishes zeroing quickly, since most processors are optimized
for linear memory filling.
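
The described behavior can be summarized in rough pseudo-C (a sketch only;
sysctl_scrub_start, sysctl_scrub_stop, kscrubd_wait and the two helpers are
stand-in names, not the patch's actual identifiers):

static int kscrubd(void *p)
{
	pg_data_t *pgdat = p;

	for ( ; ; ) {
		int order;

		/* Sleep until a free page of order >= scrub_start is coalesced. */
		wait_event(pgdat->kscrubd_wait,
			   max_unzeroed_order(pgdat) >= sysctl_scrub_start);

		/* Zero until no unzeroed pages of order >= scrub_stop remain. */
		for (order = MAX_ORDER - 1; order >= sysctl_scrub_stop; order--)
			zero_free_pages_of_order(pgdat, order);
	}
	return 0;
}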

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c   2005-02-03 22:51:57.0 -0800
+++ linux-2.6.10/mm/page_alloc.c2005-02-03 22:52:19.0 -0800
@@ -12,6 +12,8 @@
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
  *  Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
  *  (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Page zeroing by Christoph Lameter, SGI, Dec 2004 using
+ * initial code for __GFP_ZERO support by Andrea Arcangeli, Oct 2004.
  */

 #include <linux/config.h>
@@ -33,6 +35,7 @@
 #include <linux/cpu.h>
 #include <linux/nodemask.h>
 #include <linux/vmalloc.h>
+#include <linux/scrub.h>

 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -175,16 +178,16 @@ static void destroy_compound_page(struct
  * zone->lock is already acquired when we use these.
  * So, we don't need atomic page->flags operations here.
  */
-static inline unsigned long page_order(struct page *page) {
+static inline unsigned long page_zorder(struct page *page) {
return page->private;
 }

-static inline void set_page_order(struct page *page, int order) {
-   page->private = order;
+static inline void set_page_zorder(struct page *page, int order, int zero) {
+   page->private = order + (zero << 10);
__SetPagePrivate(page);
 }

-static inline void rmv_page_order(struct page *page)
+static inline void rmv_page_zorder(struct page *page)
 {
__ClearPagePrivate(page);
page->private = 0;
@@ -195,14 +198,15 @@ static inline void rmv_page_order(struct
  * we can do coalesce a page and its buddy if
  * (a) the buddy is free &&
  * (b) the buddy is on the buddy system &&
- * (c) a page and its buddy have the same order.
+ * (c) a page and its buddy have the same order and the same
+ * zeroing status.
  * for recording page's order, we use page->private and PG_private.
  *
  */
-static inline int page_is_buddy(struct page *page, int order)
+static inline int page_is_buddy(struct page *page, int order, int zero)
 {
if (PagePrivate(page)   &&
-   (page_order(page) == order) &&
+   (page_zorder(page) == order + (zero << 10)) &&
!PageReserved(page) &&
 page_count(page) == 0)
return 1;
@@ -233,22 +237,20 @@ static inline int page_is_buddy(struct p
  * -- wli
  */

-static inline void __free_pages_bulk (struct page *page, struct page *base,
-   struct zone *zone, unsigned int order)
+static inline int __free_pages_bulk (struct page *page, struct page *base,
+   struct zone *zone, unsigned int order, int zero)
 {
unsigned long page_idx;
struct page *coalesced;
-   int order_size = 1 << order;

if (unlikely(order))
destroy_compound_page(page, order);

page_idx = page - base;

-   BUG_ON(page_idx & (order_size - 1));
+   BUG_ON(page_idx & (( 1 << order) - 1));
BUG_ON(bad_range(zone, page));

-   zone->free_pages += order_size;
while (order < MAX_ORDER-1) {
struct free_area *area;
struct page *buddy;
@@ -258,20 +260,21 @@ static inline void __free_pages_bulk (st
buddy = base + buddy_idx;
if (bad_range(zone, buddy))
break;
-   if (!page_is_buddy(buddy, order))
+   if (!page_is_buddy(buddy, order, zero))
break;
/* Move the buddy up one level. */
list_del(&buddy->lru);
-   area = zone->free_area + order;
+   area = zone->free_area[zero] + order;
area->nr_free--;
-   rmv_page_order(buddy);
+   rmv_page_zorder(buddy);
page_idx &= buddy_idx;
order++;
}
coalesced = base + page_idx;
-   set_page_order(coalesced, order);
-   list_add(&coalesced->lru, &zone->free_area[order].free_list);
-   zone->free_area[order].nr_free++;
+   set_page_zorder(coalesced, order, zero);
+   list_add(&coalesced->lru, &zone->free_area[zero][order].free_list);
+   zone->free_area[zero][order].nr_free++;
+   return order;
 }

 static inline void free_pages_check(const char *function, struct page *page)
@@ -320,8 +323,11 @@ free_pages_bulk(struct zone *zone, 
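
For readers following the diff above: set_page_zorder() folds the zeroing
status into page->private together with the buddy order. A standalone
restatement of that encoding (illustrative only; ZORDER_SHIFT is a made-up
name, the patch open-codes the shift):

#define ZORDER_SHIFT 10	/* order occupies the low 10 bits */

static inline unsigned long make_zorder(int order, int zero)
{
	return order + (zero << ZORDER_SHIFT);
}

static inline int zorder_order(unsigned long zorder)
{
	return zorder & ((1 << ZORDER_SHIFT) - 1);
}

static inline int zorder_zero(unsigned long zorder)
{
	return (int)(zorder >> ZORDER_SHIFT);
}

This is why page_is_buddy() above can compare order and zeroing status in a
single test against order + (zero << 10).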
