Re: [PATCH] Prezeroing V8 + free_hot_zeroed_page + free_cold_zeroed_page

2005-03-17 Thread Nish Aravamudan
On Thu, 17 Mar 2005 18:09:11 -0800 (PST), Christoph Lameter
<[EMAIL PROTECTED]> wrote:
> On Thu, 17 Mar 2005, Jason Uhlenkott wrote:
> 
> > On Thu, Mar 17, 2005 at 05:36:50PM -0800, Christoph Lameter wrote:
> > > + while (avenrun[0] >= ((unsigned long)sysctl_scrub_load << FSHIFT)) {
> > > +         set_current_state(TASK_UNINTERRUPTIBLE);
> > > +         schedule_timeout(30*HZ);
> > > + }
> >
> > This should probably be TASK_INTERRUPTIBLE.  It'll never actually get
> > interrupted either way since kernel threads block all signals, but
> > sleeping uninterruptibly contributes to the load average.
> 
> Correct.  I just do not seem to be able to get this right.

I think msleep_interruptible() would be your best choice, then.
Maybe add a comment that you don't actually expect signals, but are
using TASK_INTERRUPTIBLE to avoid contributing to the load average
(that way, if the loadavg calculation changes someday, somebody will
know to change your sleep over appropriately).
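
A minimal sketch of the resulting loop (illustrative only; drawn from
the quoted code and the suggestion above, not from an actual revision
of the patch):

    /*
     * We do not expect any signals here: kernel threads block them
     * all. We sleep interruptibly anyway so that this wait does not
     * count toward the load average. If the loadavg calculation ever
     * changes, revisit this.
     */
    while (avenrun[0] >= ((unsigned long)sysctl_scrub_load << FSHIFT))
            msleep_interruptible(30 * 1000);    /* 30*HZ jiffies == 30s */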

Thanks,
Nish


Re: [PATCH] Prezeroing V8 + free_hot_zeroed_page + free_cold_zeroed_page

2005-03-17 Thread Christoph Lameter
On Thu, 17 Mar 2005, Jason Uhlenkott wrote:

> On Thu, Mar 17, 2005 at 05:36:50PM -0800, Christoph Lameter wrote:
> > + while (avenrun[0] >= ((unsigned long)sysctl_scrub_load << FSHIFT)) {
> > +         set_current_state(TASK_UNINTERRUPTIBLE);
> > +         schedule_timeout(30*HZ);
> > + }
>
> This should probably be TASK_INTERRUPTIBLE.  It'll never actually get
> interrupted either way since kernel threads block all signals, but
> sleeping uninterruptibly contributes to the load average.

Correct.  I just do not seem to be able to get this right.



Re: [PATCH] Prezeroing V8 + free_hot_zeroed_page + free_cold_zeroed_page

2005-03-17 Thread Jason Uhlenkott
On Thu, Mar 17, 2005 at 05:36:50PM -0800, Christoph Lameter wrote:
> +while (avenrun[0] >= ((unsigned long)sysctl_scrub_load << FSHIFT)) {
> + set_current_state(TASK_UNINTERRUPTIBLE);
> + schedule_timeout(30*HZ);
> + }

This should probably be TASK_INTERRUPTIBLE.  It'll never actually get
interrupted either way since kernel threads block all signals, but
sleeping uninterruptibly contributes to the load average.  


Re: [PATCH] Prezeroing V8 + free_hot_zeroed_page + free_cold_zeroed_page

2005-03-17 Thread Christoph Lameter
Here is the fixed up zeroing patch with management of hot/cold zeroed
pages.

If quicklists would like to use this, then they need to use

free_hot_zeroed_page(page)

and

get_zeroed_page(GFP)

for their management of hot zeroed pages. If the pool is empty then it
will be replenished either from the pool built up by kscrubd or by
zeroing a couple of pages on the fly.
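
As a hedged illustration (not code from the patch; quicklist_take() and
quicklist_give_back() are hypothetical names), such a user might cycle
a page through the zeroed pools like this:

    /* Take a page that is guaranteed to be zeroed already: it comes
     * from the pool kscrubd maintains if possible, otherwise it is
     * zeroed on the fly. */
    static unsigned long quicklist_take(void)
    {
            return get_zeroed_page(GFP_KERNEL);
    }

    /* Return a page that is known to still be fully zeroed, so it can
     * reenter the hot zeroed pool without being scrubbed again. */
    static void quicklist_give_back(unsigned long addr)
    {
            free_hot_zeroed_page(virt_to_page(addr));
    }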

The most expensive operation in the page fault handler is (apart from
SMP locking overhead) the touching of all cache lines of a page by
zeroing the page. This zeroing means that all cachelines of the faulted
page (on Altix that means all 128 cachelines of 128 bytes each, i.e.
the full 16KB page) must be handled and later written back. This patch
makes it possible to avoid touching all cachelines when only a part of
the cachelines of that page is needed immediately after the fault.
Doing so will only be effective for sparsely accessed memory, which is
typical for anonymous memory and pte maps.

The patch can make prezeroing more effective by also allowing the use
of hardware devices to offload zeroing from the cpu. This avoids
the invalidation of the cpu caches by extensive zeroing operations.
For that purpose a driver may register a zeroing driver via

register_zero_driver(z)
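
A hedged sketch of such a registration (struct zero_driver is defined
in linux/scrub.h, which is not part of this excerpt, so the field name
.start and its signature are assumptions made for illustration):

    /* Hypothetical offload hook. A real driver would program a
     * hardware engine (e.g. the Altix BTE) to clear the range without
     * polluting the cpu caches; the memset is only a stand-in that
     * keeps the sketch complete. */
    static int my_zero_start(void *dest, unsigned long length)
    {
            memset(dest, 0, length);
            return 0;
    }

    static struct zero_driver my_zero_driver = {
            .start = my_zero_start,
    };

    register_zero_driver(&my_zero_driver);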

When the number of zeroed pages falls below a lower threshold (defined
by setting /proc/sys/vm/scrub_start), kscrubd is invoked (similar
to the swapper). kscrubd then zeroes free pages until the upper
threshold is reached (set by /proc/sys/vm/scrub_stop). The zeroing
is performed on a percentage of the free pages at each order to
minimize fragmentation.

kscrubd performs short bursts of zeroing when needed and tries to stay
off the processor as much as possible. Kscrubd will only run when the
load is less than the value set in /proc/sys/vm/scrub_load (default 1).
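
Taken together, kscrubd's control flow as described reduces to roughly
the following hedged sketch (zeroed_fraction() and scrub_all_zones()
are placeholder names, not the patch's actual helpers; the sleep
follows the fix discussed in the replies above):

    for ( ; ; ) {
            /* Stay off the cpu while the one-minute load average is
             * at or above /proc/sys/vm/scrub_load. */
            while (avenrun[0] >= ((unsigned long)sysctl_scrub_load << FSHIFT))
                    msleep_interruptible(30 * 1000);

            /* Zero free pages, a percentage at each order, until the
             * upper threshold /proc/sys/vm/scrub_stop is reached. */
            while (zeroed_fraction() < sysctl_scrub_stop)
                    scrub_all_zones();

            /* Sleep until woken, which happens when the number of
             * zeroed pages falls below /proc/sys/vm/scrub_start. */
            set_current_state(TASK_INTERRUPTIBLE);
            schedule();
    }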

The patch also provides the management of hot and cold lists for
zeroed pages in the pageset structure.

Patch against 2.6.11.3-bk3. Performance data may be found at
http://oss.sgi.com/projects/page_fault_performance/

Changelog:
- Cleanup and document more clearly
- Add full support for hot/cold zeroed pages.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.11/mm/page_alloc.c
===================================================================
--- linux-2.6.11.orig/mm/page_alloc.c   2005-03-17 16:38:55.000000000 -0800
+++ linux-2.6.11/mm/page_alloc.c        2005-03-17 17:28:27.000000000 -0800
@@ -12,6 +12,8 @@
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
  *  Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
  *  (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Page zeroing by Christoph Lameter, SGI, Dec 2004 using
+ * initial code for __GFP_ZERO support by Andrea Arcangeli, Oct 2004.
  */

 #include <linux/config.h>
@@ -34,6 +36,7 @@
 #include <linux/cpuset.h>
 #include <linux/nodemask.h>
 #include <linux/vmalloc.h>
+#include <linux/scrub.h>

 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -180,16 +183,20 @@ static void destroy_compound_page(struct
  * zone->lock is already acquired when we use these.
  * So, we don't need atomic page->flags operations here.
  */
-static inline unsigned long page_order(struct page *page) {
+static inline unsigned long page_zorder(struct page *page) {
return page->private;
 }

-static inline void set_page_order(struct page *page, int order) {
-   page->private = order;
+/* We use bit PAGE_PRIVATE_ZERO_SHIFT in page->private to encode
+ * the zeroing status. This makes buddy pages with different zeroing
+ * status not match to avoid merging zeroed with unzeroed pages
+ */
+static inline void set_page_zorder(struct page *page, int order, int zero) {
+   page->private = order + (zero << PAGE_PRIVATE_ZERO_SHIFT);
__SetPagePrivate(page);
 }

-static inline void rmv_page_order(struct page *page)
+static inline void rmv_page_zorder(struct page *page)
 {
__ClearPagePrivate(page);
page->private = 0;
@@ -231,14 +238,15 @@ __find_combined_index(unsigned long page
  * we can do coalesce a page and its buddy if
  * (a) the buddy is free &&
  * (b) the buddy is on the buddy system &&
- * (c) a page and its buddy have the same order.
+ * (c) a page and its buddy have the same order and the same
+ * zeroing status.
  * for recording page's order, we use page->private and PG_private.
  *
  */
-static inline int page_is_buddy(struct page *page, int order)
+static inline int page_is_buddy(struct page *page, int order, int zero)
 {
if (PagePrivate(page)   &&
-   (page_order(page) == order) &&
+   (page_zorder(page) == order + (zero << PAGE_PRIVATE_ZERO_SHIFT)) &&
!PageReserved(page) &&
 page_count(page) == 0)
return 1;
@@ -270,7 +278,7 @@ static inline int page_is_buddy(struct p
  */

 static inline void __free_pages_bulk (struct page *page,
-   struct zone *zone, unsigned int order)
+   struct zone *zone, unsigned int order, unsigned int zero)
 {
unsigned long page_idx;
int 
