Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-04-06 Thread Grant Grundler
On Tue, Apr 05, 2005 at 10:15:18PM -0700, Gerrit Huizenga wrote:
> SpecSDET, Aim7 or ReAim from OSDL are probably what you are thinking of.

SDET isn't publicly available.
I hope by now osdl-reaim is called "osdl-aim7":
http://lkml.org/lkml/2003/8/1/172

grant
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-04-05 Thread Gerrit Huizenga

On Tue, 05 Apr 2005 21:48:22 PDT, David Mosberger wrote:
> > On Tue, 5 Apr 2005 17:33:59 -0700 (PDT), Christoph Lameter
> > <[EMAIL PROTECTED]> said:
> 
>   Christoph> Which benchmark would you recommend for this?
> 
> I don't know about "recommend", but I think SPECweb, SPECjbb,
> the-UNIX-multi-user-benchmark-whose-name-I-keep-forgetting, and in
> general anything that involves process-activity and/or large working
> sets might be interesting (in other words: anything but
> microbenchmarks; I'm afraid).

SpecSDET, Aim7 or ReAim from OSDL are probably what you are thinking
of.

gerrit


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-04-05 Thread David Mosberger
> On Tue, 5 Apr 2005 17:33:59 -0700 (PDT), Christoph Lameter
> <[EMAIL PROTECTED]> said:

  Christoph> Which benchmark would you recommend for this?

I don't know about "recommend", but I think SPECweb, SPECjbb,
the-UNIX-multi-user-benchmark-whose-name-I-keep-forgetting, and in
general anything that involves process-activity and/or large working
sets might be interesting (in other words: anything but
microbenchmarks; I'm afraid).

--david


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-04-05 Thread Christoph Lameter
On Tue, 5 Apr 2005, David Mosberger wrote:

> What LMbench test other than fork/exec would you have expected to be
> affected by this?  LMbench is not a good benchmark for this (remember:
> it's a _micro_ benchmark).

LMbench does a variety of things and I expected to see at least
something on the page fault test and hopefully also some variations for
other tests.

Which benchmark would you recommend for this?



Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-04-05 Thread David Mosberger
> On Tue, 5 Apr 2005 17:15:53 -0700 (PDT), Christoph Lameter
> <[EMAIL PROTECTED]> said:

  Christoph> On Thu, 24 Mar 2005, David Mosberger wrote:
  >> That's definitely the case.  See my earlier post on this topic:

  >> http://www.gelato.unsw.edu.au/linux-ia64/0409/11012.html

  >> Unfortunately, nobody reported any results for larger machines
  >> and/or more interesting workloads, so the patch is in limbo at
  >> this time.  Clearly, if the CPU that's clearing the page is
  >> likely to use that same page soon after, it'd be useful to use
  >> temporal stores.

  Christoph> Here are some numbers using lmbench of temporal writes
  Christoph> vs. non temporal writes on ia64 (8p machine but lmbench
  Christoph> run only for one load). There seems to be some benefit
  Christoph> for fork/exec but overall this does not seem to be a
  Christoph> clear win. I suspect that the distinction between
  Christoph> temporal vs. nontemporal writes is more beneficial on
  Christoph> machines with smaller pagesizes since the likelihood that
  Christoph> most cachelines of a page are used soon is increased and
  Christoph> therefore hot zeroing is more beneficial.

What LMbench test other than fork/exec would you have expected to be
affected by this?  LMbench is not a good benchmark for this (remember:
it's a _micro_ benchmark).

--david


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-04-05 Thread Christoph Lameter
On Thu, 24 Mar 2005, David Mosberger wrote:

> That's definitely the case.  See my earlier post on this topic:
>
>   http://www.gelato.unsw.edu.au/linux-ia64/0409/11012.html
>
> Unfortunately, nobody reported any results for larger machines and/or
> more interesting workloads, so the patch is in limbo at this time.
> Clearly, if the CPU that's clearing the page is likely to use that
> same page soon after, it'd be useful to use temporal stores.

Here are some numbers using lmbench of temporal writes vs. non temporal
writes on ia64 (8p machine but lmbench run only for one load). There seems
to be some benefit for fork/exec but overall this does not seem to be a
clear win. I suspect that the distinction between temporal vs. nontemporal
writes is more beneficial on machines with smaller pagesizes since
the likelihood that most cachelines of a page are used soon is increased
and therefore hot zeroing is more beneficial.
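
A rough way to put numbers on the page-size argument is to count the
cachelines in a page. The sketch below is only an illustration: the
function names are made up, and the page and line sizes a caller passes in
(for example 16 KB pages with 128-byte lines versus 4 KB pages with 64-byte
lines) are assumptions, not values measured on the test machine.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cachelines that make up one page. */
size_t lines_per_page(size_t page_size, size_t line_size)
{
    assert(line_size > 0 && page_size % line_size == 0);
    return page_size / line_size;
}

/*
 * Fraction of a freshly zeroed page that a single touched cacheline
 * represents. The smaller the page, the larger this fraction.
 */
double one_line_fraction(size_t page_size, size_t line_size)
{
    return 1.0 / (double)lines_per_page(page_size, line_size);
}
```

With those assumed sizes, one touched line is 1/128 of the larger page but
1/64 of the smaller one, so a hot-zeroed small page has less room to pollute
the cache with lines that are never used.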


                 L M B E N C H  3 . 0   S U M M A R Y
                 ------------------------------------
                 (Alpha software, do not distribute)

Basic system parameters
-----------------------
Host    OS                        Description     Mhz   tlb pages  cache line bytes  mem par  scal load
------  ------------------------  --------------  ----
margin  Linux 2.6.12-rc1-bk3      ia64-linux-gnu  1300  128  1
margin  Linux 2.6.12-rc1-bk3      ia64-linux-gnu  1300  128  1
margin  Linux 2.6.12-rc1-bk3      ia64-linux-gnu  1300  128  1
margin  Linux 2.6.12-rc1-bk3      ia64-linux-gnu  1300  128  1
margin  Linux 2.6.12-rc1-bk3      ia64-linux-gnu  1300  128  1
margin  Linux 2.6.12-rc1-bk3      ia64-linux-gnu  1300  128  1
margin  Linux 2.6.12-rc1-bk3      ia64-linux-gnu  1300  128  1
margin  Linux 2.6.12-rc1-bk3-dm   ia64-linux-gnu  1300  128  1
margin  Linux 2.6.12-rc1-bk3-dm   ia64-linux-gnu  1300  128  1
margin  Linux 2.6.12-rc1-bk3-dm   ia64-linux-gnu  1300  128  1

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host    OS                        Mhz   null  null  stat  open  slct  sig   sig   fork  exec  sh
                                        call  I/O         clos  TCP   inst  hndl  proc  proc  proc
------  ------------------------  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----
margin  Linux 2.6.12-rc1-bk3      1300  0.04  0.26  4.90  6.11  15.7  0.39  2.43  528.  1926  4853
margin  Linux 2.6.12-rc1-bk3      1300  0.04  0.27  4.86  6.10  15.7  0.39  2.45  522.  1910  4260
margin  Linux 2.6.12-rc1-bk3      1300  0.04  0.26  4.85  6.10  15.8  0.39  2.40  526.  1916  4429
margin  Linux 2.6.12-rc1-bk3      1300  0.04  0.26  4.84  6.11  15.7  0.39  2.40  531.  1838  4429
margin  Linux 2.6.12-rc1-bk3      1300  0.04  0.26  4.85  6.11  15.8  0.39  2.47  553.  1931  5118
margin  Linux 2.6.12-rc1-bk3      1300  0.04  0.26  5.09  6.37  15.7  0.39  2.40  537.  1934  5133
margin  Linux 2.6.12-rc1-bk3      1300  0.04  0.26  5.09  6.35  15.8  0.39  2.40  555.  1939  5389
margin  Linux 2.6.12-rc1-bk3-dm   1300  0.04  0.26  4.88  6.10  15.8  0.39  2.42  519.  1829  4787
margin  Linux 2.6.12-rc1-bk3-dm   1300  0.04  0.26  4.87  6.09  15.8  0.39  2.40  516.  1830  5057
margin  Linux 2.6.12-rc1-bk3-dm   1300  0.04  0.27  4.86  6.10  15.8  0.39  2.40  512.  1878  5166

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host    OS                        2p/0K   2p/16K  2p/64K  8p/16K  8p/64K  16p/16K  16p/64K
                                  ctxsw   ctxsw   ctxsw   ctxsw   ctxsw   ctxsw    ctxsw
------  ------------------------  ------  ------  ------  ------  ------  -------  -------
margin  Linux 2.6.12-rc1-bk3      7.3300  2.7400  7.0400  4.4600  6.6200  3.94000  8.38000
margin  Linux 2.6.12-rc1-bk3      7.6100  8.1000  7.3200  4.5900  7.1700  5.5      7.84000
margin  Linux 2.6.12-rc1-bk3      7.2400  8.      7.2100  4.3800  6.7500  4.77000  7.37000
margin  Linux 2.6.12-rc1-bk3      7.4100  8.0400  7.0500  4.5100  7.2500  4.11000  7.03000
margin  Linux 2.6.12-rc1-bk3      7.2600  8.2100  7.2400  4.6500  6.6500  4.08000  7.81000
margin  Linux 2.6.12-rc1-bk3      7.4600  7.9000  7.3800  4.3800  6.6200  4.83000  7.27000
margin  Linux 2.6.12-rc1-bk3      7.4400  8.2000  7.2000  5.8700  6.8000  4.86000  7.95000
margin  Linux 2.6.12-rc1-bk3-dm   7.4400  8.3100  7.1300  5.6900  6.6500  5.49000  7.49000
margin  Linux 2.6.12-rc1-bk3-dm   2.1300  8.0100  7.3800  4.6700  6.5500  4.22000
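
To make the process-creation columns easier to compare, here is a small
sketch that averages them per kernel. The numbers are copied from the
"fork proc", "exec proc" and "sh proc" columns above (7 runs of
2.6.12-rc1-bk3, 3 runs of 2.6.12-rc1-bk3-dm); the helper names are made up
for illustration, and with so few runs the averages are only a rough guide.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
double mean(const double *v, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Relative change in percent; negative means the patched kernel is faster. */
double pct_change(double base, double patched)
{
    return 100.0 * (patched - base) / base;
}

/* Times in microseconds, taken from the LMbench summary above. */
const double fork_bk3[] = { 528, 522, 526, 531, 553, 537, 555 };
const double fork_dm[]  = { 519, 516, 512 };
const double exec_bk3[] = { 1926, 1910, 1916, 1838, 1931, 1934, 1939 };
const double exec_dm[]  = { 1829, 1830, 1878 };
const double sh_bk3[]   = { 4853, 4260, 4429, 4429, 5118, 5133, 5389 };
const double sh_dm[]    = { 4787, 5057, 5166 };
```

Calling `pct_change(mean(fork_bk3, 7), mean(fork_dm, 3))` gives a negative
value (the -dm runs fork faster on average), and the same holds for exec,
while the "sh proc" average comes out higher for -dm. That mixed picture is
consistent with "not a clear win".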

Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-28 Thread Christoph Lameter
On Sun, 27 Mar 2005, Andi Kleen wrote:

> > Clearly, if the CPU that's clearing the page is likely to use that
> > same page soon after, it'd be useful to use temporal stores.
>
> That is always the case in the current code (without Christoph's
> pre cleaning daemon). The page fault handler clears and user space
> is guaranteed to need at least one cacheline from the fresh page
> because it just did a page fault on it. With non temporal stores
> you guarantee at least one hard cache miss directly after
> the return to user space.

It is not the case that *all* the cachelines of a page are going to be
used right after zeroing. For the page fault case it is only guaranteed that
*one* cacheline will be used. In the PTE/PMD/PUD page allocation cases it
is likely that only a single cacheline is used.

There are some cases in the code (apart from the fault handler)
where zeroed pages are allocated with no guarantee of use (f.e. the
allocations for buffers for shared memory or pipes).

> I suspect even with precleaning the average time from cleaning to use will be
> quite short.

If the time is short then hot cleaning is the right way to go and then
prezeroing is of no benefit. Prezeroing can only be of benefit if there is
sufficient time between the zeroing and the use of the data. It must be
sufficiently long to cause the cachelines to no longer be
in the caches. Then the loading of these cachelines may be avoided, which
yields the performance benefit.


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-27 Thread David S. Miller
On 27 Mar 2005 19:12:20 +0200
Andi Kleen <[EMAIL PROTECTED]> wrote:

> With non temporal stores
> you guarantee at least one hard cache miss directly after
> the return to user space.

This is true if the cacheline were not present already at
the time of the non-temporal store.

I know what you're trying to say, I'm just clarifying.

The real question is if a large enough ratio of those
cachelines in the page get similarly accessed.  I happen
to think the answer to that for any real example is yes.
Yet, I have no way to prove this.

It would be cool to do some hacks under Xen or user-mode
Linux to get some real statistics about this.  Actually,
this could be done also with hacks to valgrind or other
similar tools.  QEMU could also be used.
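
The statistic in question could be collected with something like the
following sketch: a per-page map that records which cachelines were touched
and reports the touched ratio. This is not an existing Xen, valgrind or QEMU
interface; the type and function names are hypothetical, and the 4 KB page
and 64-byte line sizes are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */
#define SKETCH_LINE_SIZE 64u    /* assumed cacheline size */
#define SKETCH_LINES     (SKETCH_PAGE_SIZE / SKETCH_LINE_SIZE)

/* One flag per cacheline of a page: nonzero once that line was accessed. */
struct page_touch_map {
    unsigned char touched[SKETCH_LINES];
};

/* Forget all recorded accesses, e.g. right after the page is zeroed. */
void touch_map_reset(struct page_touch_map *m)
{
    memset(m->touched, 0, sizeof m->touched);
}

/* Record an access at the given byte offset within the page. */
void touch_map_record(struct page_touch_map *m, size_t offset)
{
    assert(offset < SKETCH_PAGE_SIZE);
    m->touched[offset / SKETCH_LINE_SIZE] = 1;
}

/* Fraction of the page's cachelines that were accessed at least once. */
double touch_map_ratio(const struct page_touch_map *m)
{
    size_t n = 0;

    for (size_t i = 0; i < SKETCH_LINES; i++)
        n += m->touched[i];
    return (double)n / (double)SKETCH_LINES;
}
```

An instrumented emulator would reset the map when a page is cleared, feed
every later access into touch_map_record(), and sample the ratio after some
fixed interval.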


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-27 Thread Andi Kleen
> Clearly, if the CPU that's clearing the page is likely to use that
> same page soon after, it'd be useful to use temporal stores.

That is always the case in the current code (without Christoph's
pre cleaning daemon). The page fault handler clears and user space
is guaranteed to need at least one cacheline from the fresh page
because it just did a page fault on it. With non temporal stores
you guarantee at least one hard cache miss directly after
the return to user space.

I suspect even with precleaning the average time from cleaning to use will be 
quite short.

-Andi


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-24 Thread Christoph Lameter
On Thu, 24 Mar 2005, David S. Miller wrote:

> Erm... were any of your test builds done with the new CONFIG_CLEAR_COLD
> option enabled? :-)

These were all fixed but I failed to do a "quilt refresh" <sigh>... The
email issues are also fixed now <sigh>. What a day.

> Next, replace your arch/sparc64/lib/clear_page.S diff with this one and
> things would be working and we'll be using the proper temporal vs.
> non-temporal stores on that platform.

Thanks.

Here is the patch with your changes and a "quilt refresh" ;-)

-
Introduces a new function clear_cold(void *pageaddress, int order) to clear
pages of an arbitrary size with non temporal stores. Cold clearing is typically
faster than hot clearing. Hot clearing is beneficial when the data is to be 
used soon.
(Will also work well with the new hot and cold aware prezeroing daemon)

Use cold clearing for huge pages.

For ia64 also make clear_page use temporal stores.

Patch needs fixes to work properly on i386 and x86_64.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.11/mm/hugetlb.c
===
--- linux-2.6.11.orig/mm/hugetlb.c	2005-03-01 23:38:12.0 -0800
+++ linux-2.6.11/mm/hugetlb.c	2005-03-24 14:12:53.0 -0800
@@ -78,7 +78,6 @@ void free_huge_page(struct page *page)
 struct page *alloc_huge_page(void)
 {
 	struct page *page;
-	int i;

 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page();
@@ -89,8 +88,7 @@ struct page *alloc_huge_page(void)
 	spin_unlock(&hugetlb_lock);
 	set_page_count(page, 1);
 	page[1].mapping = (void *)free_huge_page;
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-		clear_highpage(&page[i]);
+	prep_zero_page(page, HUGETLB_PAGE_ORDER, GFP_HIGHUSER | __GFP_COLD);
 	return page;
 }

Index: linux-2.6.11/mm/page_alloc.c
===
--- linux-2.6.11.orig/mm/page_alloc.c	2005-03-24 13:15:40.0 -0800
+++ linux-2.6.11/mm/page_alloc.c	2005-03-24 18:39:22.0 -0800
@@ -633,11 +633,17 @@ void fastcall free_cold_page(struct page
 	free_hot_cold_page(page, 1);
 }

-static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
+void prep_zero_page(struct page *page, unsigned int order, int gfp_flags)
 {
 	int i;

 	BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) == __GFP_HIGHMEM);
+
+#ifdef CONFIG_CLEAR_COLD
+	if ((gfp_flags & __GFP_COLD) && !PageHighMem(page))
+		clear_cold(page_address(page), order);
+	else
+#endif
 	for(i = 0; i < (1 << order); i++)
 		clear_highpage(page + i);
 }
Index: linux-2.6.11/include/linux/gfp.h
===
--- linux-2.6.11.orig/include/linux/gfp.h	2005-03-01 23:37:50.0 -0800
+++ linux-2.6.11/include/linux/gfp.h	2005-03-24 14:16:44.0 -0800
@@ -131,4 +131,5 @@ extern void FASTCALL(free_cold_page(stru

 void page_alloc_init(void);

+void prep_zero_page(struct page *, unsigned int order, int gfp_flags);
 #endif /* __LINUX_GFP_H */
Index: linux-2.6.11/arch/ia64/Kconfig
===
--- linux-2.6.11.orig/arch/ia64/Kconfig	2005-03-01 23:38:26.0 -0800
+++ linux-2.6.11/arch/ia64/Kconfig	2005-03-24 14:12:53.0 -0800
@@ -46,6 +46,10 @@ config GENERIC_IOMAP
 	bool
 	default y

+config CLEAR_COLD
+	bool
+	default y
+
 choice
 	prompt "System type"
 	default IA64_GENERIC
Index: linux-2.6.11/include/asm-ia64/page.h
===
--- linux-2.6.11.orig/include/asm-ia64/page.h	2005-03-01 23:37:48.0 -0800
+++ linux-2.6.11/include/asm-ia64/page.h	2005-03-24 14:12:53.0 -0800
@@ -57,6 +57,8 @@
 #  define STRICT_MM_TYPECHECKS

 extern void clear_page (void *page);
+/* Clear arbitrary order page using nontemporal writes */
+extern void clear_cold (void *page, unsigned int order);
 extern void copy_page (void *to, void *from);

 /*
Index: linux-2.6.11/arch/ia64/kernel/ia64_ksyms.c
===
--- linux-2.6.11.orig/arch/ia64/kernel/ia64_ksyms.c	2005-03-01 23:38:08.0 -0800
+++ linux-2.6.11/arch/ia64/kernel/ia64_ksyms.c	2005-03-24 14:12:53.0 -0800
@@ -39,6 +39,7 @@ EXPORT_SYMBOL(__up);

 #include 
 EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(clear_cold);

 #ifdef CONFIG_VIRTUAL_MEM_MAP
 #include 
Index: linux-2.6.11/arch/ia64/lib/clear_page.S
===
--- linux-2.6.11.orig/arch/ia64/lib/clear_page.S	2005-03-01 23:37:47.0 -0800
+++ linux-2.6.11/arch/ia64/lib/clear_page.S	2005-03-24 14:24:29.0 -0800
@@ -7,6 +7,8 @@
  * 1/06/01 
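
The dispatch that the mm/page_alloc.c hunk adds to prep_zero_page() can be
modelled in user space roughly as follows. This is only a sketch of the
control flow: memset() stands in for both clear_cold() and clear_highpage()
and does not use nontemporal stores, and SKETCH_PAGE_SIZE, SKETCH_GFP_COLD
and the function names are hypothetical stand-ins rather than kernel
definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed; the real PAGE_SIZE is per-arch */
#define SKETCH_GFP_COLD  0x1u   /* hypothetical stand-in for __GFP_COLD */

enum clear_path { CLEAR_HOT_PER_PAGE, CLEAR_COLD_WHOLE };

/* Bytes covered by an allocation of the given order. */
size_t order_bytes(unsigned int order)
{
    assert(order < 20);
    return (size_t)SKETCH_PAGE_SIZE << order;
}

/* Zero a block of 2^order pages and report which path was taken. */
enum clear_path sketch_prep_zero(void *addr, unsigned int order,
                                 unsigned int flags, int is_highmem)
{
    if ((flags & SKETCH_GFP_COLD) && !is_highmem) {
        /* Stand-in for clear_cold(addr, order): one pass over the block. */
        memset(addr, 0, order_bytes(order));
        return CLEAR_COLD_WHOLE;
    }

    /* Stand-in for the clear_highpage() loop: one page at a time. */
    for (size_t i = 0; i < ((size_t)1 << order); i++)
        memset((char *)addr + i * SKETCH_PAGE_SIZE, 0, SKETCH_PAGE_SIZE);
    return CLEAR_HOT_PER_PAGE;
}
```

The point of the split is that only callers who pass the cold flag, such as
the huge page allocator in this patch, pay for bypassing the cache; highmem
pages always fall back to the per-page path.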

Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-24 Thread David S. Miller
On Thu, 24 Mar 2005 14:49:55 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> Could you help me fix up this patch replacing the old clear_pages patch?

Ok, first you need to mark the order and gfp arguments as unsigned
for mm/page_alloc.c:prep_zero_page() so that it matches the prototype
you added to include/linux/gfp.h else the compiler warns a lot.

Next, in the same function in mm/page_alloc.c, "PageHighmem()" is typo'd, it
should be "PageHighMem()".

The clear_cold() call on the next line needs a semicolon.

Erm... were any of your test builds done with the new CONFIG_CLEAR_COLD
option enabled? :-)

Next, replace your arch/sparc64/lib/clear_page.S diff with this one and
things would be working and we'll be using the proper temporal vs.
non-temporal stores on that platform.

= arch/sparc64/lib/clear_page.S 1.1 vs edited =
--- 1.1/arch/sparc64/lib/clear_page.S	2004-08-08 19:54:07 -07:00
+++ edited/arch/sparc64/lib/clear_page.S	2005-03-24 15:56:33 -08:00
@@ -72,26 +72,34 @@
 	mov	1, %o4
 
 clear_page_common:
-	VISEntryHalf
 	membar	#StoreLoad | #StoreStore | #LoadStore
-	fzero	%f0
 	sethi	%hi(PAGE_SIZE/64), %o1
 	mov	%o0, %g1		! remember vaddr for tlbflush
-	fzero	%f2
 	or	%o1, %lo(PAGE_SIZE/64), %o1
-	faddd	%f0, %f2, %f4
-	fmuld	%f0, %f2, %f6
-	faddd	%f0, %f2, %f8
-	fmuld	%f0, %f2, %f10
 
-	faddd	%f0, %f2, %f12
-	fmuld	%f0, %f2, %f14
-1:	stda	%f0, [%o0 + %g0] ASI_BLK_P
+#define PREFETCH(x, y)	prefetch x, y
+#define PREFETCH_CODE	2
+
+	PREFETCH([%o0 + 0x000], PREFETCH_CODE)
+	PREFETCH([%o0 + 0x040], PREFETCH_CODE)
+	PREFETCH([%o0 + 0x080], PREFETCH_CODE)
+	PREFETCH([%o0 + 0x0c0], PREFETCH_CODE)
+	PREFETCH([%o0 + 0x100], PREFETCH_CODE)
+	PREFETCH([%o0 + 0x140], PREFETCH_CODE)
+	PREFETCH([%o0 + 0x180], PREFETCH_CODE)
+1:
+	stx	%g0, [%o0 + 0x00]
+	stx	%g0, [%o0 + 0x08]
+	stx	%g0, [%o0 + 0x10]
+	stx	%g0, [%o0 + 0x18]
+	stx	%g0, [%o0 + 0x20]
+	stx	%g0, [%o0 + 0x28]
+	stx	%g0, [%o0 + 0x30]
+	stx	%g0, [%o0 + 0x38]
+	PREFETCH([%o0 + 0x1c0], PREFETCH_CODE)
 	subcc	%o1, 1, %o1
 	bne,pt	%icc, 1b
 	 add	%o0, 0x40, %o0
-	membar	#Sync
-	VISExitHalf
 
 	brz,pn	%o4, out
 	 nop
@@ -101,5 +109,32 @@
 	stw	%o2, [%g6 + TI_PRE_COUNT]
 
 out:	retl
+	 nop
+
+	.globl	clear_cold
+clear_cold:	/* %o0=dest, %o1=order */
+	sethi	%hi(PAGE_SIZE/64), %o2
+	clr	%o4
+	or	%o2, %lo(PAGE_SIZE/64), %o2
+	sllx	%o2, %o1, %o1
+	VISEntryHalf
+	membar	#StoreLoad | #StoreStore | #LoadStore
+	fzero	%f0
+	fzero	%f2
+	faddd	%f0, %f2, %f4
+	fmuld	%f0, %f2, %f6
+	faddd	%f0, %f2, %f8
+	fmuld	%f0, %f2, %f10
+
+	faddd	%f0, %f2, %f12
+	fmuld	%f0, %f2, %f14
+2:	stda	%f0, [%o0 + %g0] ASI_BLK_P
+	subcc	%o1, 1, %o1
+	bne,pt	%icc, 2b
+	 add	%o0, 0x40, %o0
+	membar	#Sync
+	VISExitHalf
+
+	retl
 	 nop
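
For readers who do not speak sparc64 assembly, the loop structure of the new
clear_cold routine can be mirrored in plain C. This is a sketch only:
memset() of 64 bytes stands in for the block store (stda ... ASI_BLK_P) and
is not a nontemporal store, the 8 KB SKETCH_PAGE_SIZE is an assumed base
page size, and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 8192u  /* assumed base page size */
#define SKETCH_BLOCK     64u    /* one block store covers 64 bytes (0x40) */

/* Loop count set up by: sethi/or (PAGE_SIZE/64), then sllx by the order. */
size_t clear_cold_blocks(unsigned int order)
{
    assert(order < 20);
    return (size_t)(SKETCH_PAGE_SIZE / SKETCH_BLOCK) << order;
}

/* Zero 2^order pages in 64-byte steps; returns the number of bytes cleared. */
size_t clear_cold_model(unsigned char *dest, unsigned int order)
{
    size_t blocks = clear_cold_blocks(order);
    size_t cleared = 0;

    while (blocks != 0) {
        memset(dest, 0, SKETCH_BLOCK);  /* 2: stda %f0, [%o0 + %g0] */
        blocks--;                       /*    subcc %o1, 1, %o1     */
        dest += SKETCH_BLOCK;           /*    add %o0, 0x40, %o0    */
        cleared += SKETCH_BLOCK;
    }
    return cleared;
}
```

Shifting the per-page block count left by the order is what lets one call
cover a whole higher-order allocation instead of looping page by page.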
 



Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-24 Thread David S. Miller
On Thu, 24 Mar 2005 14:49:55 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Thu, 24 Mar 2005, David S. Miller wrote:
> 
> > > prep_zero_page would use a temporal clear for an order 0 page but a
> > > nontemporal clear for higher order pages.
> >
> > That sounds about right to me.
> >
> > Hmmm, I'm inspired to experiment with this on sparc64 a bit.
> 
> Could you help me fix up this patch replacing the old clear_pages patch?

Sure, I'll play with it.

Meanwhile, here are some numbers.  I changed just the clear_page()
implementation on sparc64 so that it used prefetching and normal
temporal stores.  The machine is a uniprocessor 1.5GHz Ultra-IIIi,
64K write-through D-cache, 64K I-cache, 1MB L2 cache.  I did 4
timed 'vmlinux' builds after a fresh boot:

BEFORE:
real    9m8.720s
user    8m28.345s
sys     0m32.734s

real    9m2.034s
user    8m28.763s
sys     0m32.512s

real    9m1.848s
user    8m28.970s
sys     0m32.204s

real    9m1.701s
user    8m28.715s
sys     0m32.394s

AFTER:
real    9m2.241s
user    8m16.633s
sys     0m36.451s

real    8m53.739s
user    8m17.165s
sys     0m36.052s

real    8m54.089s
user    8m17.266s
sys     0m36.219s

real    8m54.071s
user    8m17.473s
sys     0m36.073s

So, at the very least, my results agree with D. Mosberger's on IA64.

At the cost of ~4 seconds of system time, we gain ~11 seconds of
user time.

I'm pretty much convinced this is a win.  I wonder if it matters to
do something similar for copy_page*() as well.
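
The "~4 seconds" and "~11 seconds" figures can be checked by averaging the four runs on each side. The sketch below does only that arithmetic; the helper names (mean, user_gain, sys_cost) are made up for illustration, and the arrays are the user and sys times from this message converted to seconds.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n samples, in seconds. */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* The four runs from the message, converted from minutes+seconds to seconds. */
static const double user_before[] = { 508.345, 508.763, 508.970, 508.715 };
static const double user_after[]  = { 496.633, 497.165, 497.266, 497.473 };
static const double sys_before[]  = { 32.734, 32.512, 32.204, 32.394 };
static const double sys_after[]   = { 36.451, 36.052, 36.219, 36.073 };

/* User time saved per build by the temporal clear_page(). */
static double user_gain(void)
{
    return mean(user_before, 4) - mean(user_after, 4);
}

/* System time added per build by the temporal clear_page(). */
static double sys_cost(void)
{
    return mean(sys_after, 4) - mean(sys_before, 4);
}
```

With these inputs the averages come out at roughly 11.6 seconds less user time and roughly 3.7 seconds more system time per build, which matches the rounded figures quoted in the message.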


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-24 Thread Christoph Lameter
On Thu, 24 Mar 2005, David S. Miller wrote:

> > prep_zero_page would use a temporal clear for an order 0 page but a
> > nontemporal clear for higher order pages.
>
> That sounds about right to me.
>
> Hmmm, I'm inspired to experiment with this on sparc64 a bit.

Could you help me fix up this patch replacing the old clear_pages patch?

Introduces a new function clear_cold(void *pageaddress, int order) to clear
pages of an arbitrary size with non-temporal stores. Cold clearing is typically
faster than hot clearing. Hot clearing is beneficial when the data is to be
used soon.
(The hot/cold distinction also works well with the new hot- and cold-aware
prezeroing daemon.)

- Use cold clearing for huge pages.
- For ia64, also make clear_page use temporal stores.
- Patch needs fixes to work properly on i386, x86_64 and sparc64.
- There may be other allocations that can benefit from the increased
  performance possible for cold zeroed pages if the pages are not to be
  used right away. Add __GFP_COLD to the gfp_flags for those.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.11/mm/hugetlb.c
===
--- linux-2.6.11.orig/mm/hugetlb.c  2005-03-01 23:38:12.0 -0800
+++ linux-2.6.11/mm/hugetlb.c   2005-03-24 14:12:53.0 -0800
@@ -78,7 +78,6 @@ void free_huge_page(struct page *page)
 struct page *alloc_huge_page(void)
 {
struct page *page;
-   int i;

spin_lock(&hugetlb_lock);
page = dequeue_huge_page();
@@ -89,8 +88,7 @@ struct page *alloc_huge_page(void)
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
page[1].mapping = (void *)free_huge_page;
-   for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-   clear_highpage(&page[i]);
+   prep_zero_page(page, HUGETLB_PAGE_ORDER, GFP_HIGHUSER | __GFP_COLD);
return page;
 }

Index: linux-2.6.11/mm/page_alloc.c
===
--- linux-2.6.11.orig/mm/page_alloc.c   2005-03-24 13:15:40.0 -0800
+++ linux-2.6.11/mm/page_alloc.c    2005-03-24 14:15:15.0 -0800
@@ -633,11 +633,17 @@ void fastcall free_cold_page(struct page
free_hot_cold_page(page, 1);
 }

-static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
+void prep_zero_page(struct page *page, int order, int gfp_flags)
 {
int i;

BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) == __GFP_HIGHMEM);
+
+#ifdef CONFIG_CLEAR_COLD
+   if ((gfp_flags & __GFP_COLD) && !PageHighmem(page))
+   clear_cold(page_address(page), order)
+   else
+#endif
for(i = 0; i < (1 << order); i++)
clear_highpage(page + i);
 }
Index: linux-2.6.11/include/linux/gfp.h
===
--- linux-2.6.11.orig/include/linux/gfp.h   2005-03-01 23:37:50.0 -0800
+++ linux-2.6.11/include/linux/gfp.h    2005-03-24 14:12:53.0 -0800
@@ -131,4 +131,5 @@ extern void FASTCALL(free_cold_page(stru

 void page_alloc_init(void);

+void prep_zero_page(struct page *, unsigned int order, unsigned int gfp_flags);
 #endif /* __LINUX_GFP_H */
Index: linux-2.6.11/arch/ia64/Kconfig
===
--- linux-2.6.11.orig/arch/ia64/Kconfig 2005-03-01 23:38:26.0 -0800
+++ linux-2.6.11/arch/ia64/Kconfig  2005-03-24 14:12:53.0 -0800
@@ -46,6 +46,10 @@ config GENERIC_IOMAP
bool
default y

+config CLEAR_COLD
+   bool
+   default y
+
 choice
prompt "System type"
default IA64_GENERIC
Index: linux-2.6.11/include/asm-ia64/page.h
===
--- linux-2.6.11.orig/include/asm-ia64/page.h   2005-03-01 23:37:48.0 -0800
+++ linux-2.6.11/include/asm-ia64/page.h    2005-03-24 14:12:53.0 -0800
@@ -57,6 +57,8 @@
 #  define STRICT_MM_TYPECHECKS

 extern void clear_page (void *page);
+/* Clear arbitrary order page using nontemporal writes */
+extern void clear_cold (void *page, unsigned int order);
 extern void copy_page (void *to, void *from);

 /*
Index: linux-2.6.11/arch/ia64/kernel/ia64_ksyms.c
===
--- linux-2.6.11.orig/arch/ia64/kernel/ia64_ksyms.c 2005-03-01 23:38:08.0 -0800
+++ linux-2.6.11/arch/ia64/kernel/ia64_ksyms.c  2005-03-24 14:12:53.0 -0800
@@ -39,6 +39,7 @@ EXPORT_SYMBOL(__up);

 #include <asm/page.h>
 EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(clear_cold);

 #ifdef CONFIG_VIRTUAL_MEM_MAP
 #include <linux/bootmem.h>
Index: linux-2.6.11/arch/ia64/lib/clear_page.S
===
--- linux-2.6.11.orig/arch/ia64/lib/clear_page.S    2005-03-01 23:37:47.0 -0800
+++ linux-2.6.11/arch/ia64/lib/clear_page.S 2005-03-24 14:12:53.0 -0800
@@ -7,6 +7,8 @@
  * 1/06/01 davidm  Tuned for Itanium.

Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-24 Thread David S. Miller
On Thu, 24 Mar 2005 10:41:06 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> So it would be useful to have
> 
> clear_page    -> Temporal. Only zaps one page
> 
>   and
> 
> clear_pages   -> Zaps arbitrary order of page non-temporal
> 
> 
> Rework the clear_pages patch to do just that? Maybe rename clear_pages
> clear_pages_nt?
> 
> prep_zero_page would use a temporal clear for an order 0 page but a
> nontemporal clear for higher order pages.

That sounds about right to me.

Hmmm, I'm inspired to experiment with this on sparc64 a bit.
:-)


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-24 Thread Christoph Lameter
On Thu, 24 Mar 2005, David Mosberger wrote:

> That's definitely the case.  See my earlier post on this topic:
>
>   http://www.gelato.unsw.edu.au/linux-ia64/0409/11012.html
>
> Unfortunately, nobody reported any results for larger machines and/or
> more interesting workloads, so the patch is in limbo at this time.
> Clearly, if the CPU that's clearing the page is likely to use that
> same page soon after, it'd be useful to use temporal stores.


So it would be useful to have

clear_page  -> Temporal. Only zaps one page

and

clear_pages -> Zaps arbitrary order of page non-temporal


Rework the clear_pages patch to do just that? Maybe rename clear_pages
clear_pages_nt?

prep_zero_page would use a temporal clear for an order 0 page but a
nontemporal clear for higher order pages.
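
The split proposed here can be sketched as a small user-space C model. It is a sketch under assumptions, not the kernel implementation: SKETCH_PAGE_SIZE is an arbitrary 4096, the functions ending in _sketch are hypothetical stand-ins, and both clears use memset because portable C has no non-temporal store. Only the dispatch on the order is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumption for the sketch; the real value is per-arch */

enum clear_path { CLEAR_TEMPORAL, CLEAR_NONTEMPORAL };

/* Stand-in for clear_page(): ordinary stores, the page stays in cache. */
static void clear_page_sketch(void *page)
{
    memset(page, 0, SKETCH_PAGE_SIZE);
}

/* Stand-in for clear_pages_nt(): a real version would use non-temporal stores. */
static void clear_pages_nt_sketch(void *page, unsigned int order)
{
    memset(page, 0, SKETCH_PAGE_SIZE << order);
}

/* Order 0 gets the temporal clear; higher orders get the non-temporal one. */
static enum clear_path prep_zero_sketch(void *page, unsigned int order)
{
    assert(page != NULL);
    if (order == 0) {
        clear_page_sketch(page);
        return CLEAR_TEMPORAL;
    }
    clear_pages_nt_sketch(page, order);
    return CLEAR_NONTEMPORAL;
}
```

The return value exists only so the chosen path can be observed in a test; a real prep_zero_page would return nothing.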


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-24 Thread David Mosberger
> On Fri, 18 Mar 2005 20:28:08 +0100, Andi Kleen <[EMAIL PROTECTED]> said:

  >> stores in general for clearing pages? I checked and Itanium has
  >> always used non-temporal stores. So there will be no benefit for
  >> us from this

  Andi> That is weird. I would actually try to switch to temporal
  Andi> stores, maybe it will improve some benchmarks.

That's definitely the case.  See my earlier post on this topic:

  http://www.gelato.unsw.edu.au/linux-ia64/0409/11012.html

Unfortunately, nobody reported any results for larger machines and/or
more interesting workloads, so the patch is in limbo at this time.
Clearly, if the CPU that's clearing the page is likely to use that
same page soon after, it'd be useful to use temporal stores.

--david


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-24 Thread Christoph Lameter
On Thu, 24 Mar 2005, David S. Miller wrote:

> Erm... were any of your test builds done with the new CONFIG_CLEAR_COLD
> option enabled? :-)

These were all fixed but I failed to do a quilt refresh, sigh... The
email issues are also fixed now, sigh. What a day.

> Next, replace your arch/sparc64/lib/clear_page.S diff with this one and
> things would be working and we'll be using the proper temporal vs.
> non-temporal stores on that platform.

Thanks.

Here is the patch with your changes and a quilt refresh ;-)

-
Introduces a new function clear_cold(void *pageaddress, int order) to clear
pages of an arbitrary size with non-temporal stores. Cold clearing is typically
faster than hot clearing. Hot clearing is beneficial when the data is to be
used soon.
(Will also work well with the new hot- and cold-aware prezeroing daemon.)

Use cold clearing for huge pages.

For ia64, also make clear_page use temporal stores.

Patch needs fixes to work properly on i386 and x86_64.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.11/mm/hugetlb.c
===
--- linux-2.6.11.orig/mm/hugetlb.c  2005-03-01 23:38:12.0 -0800
+++ linux-2.6.11/mm/hugetlb.c   2005-03-24 14:12:53.0 -0800
@@ -78,7 +78,6 @@ void free_huge_page(struct page *page)
 struct page *alloc_huge_page(void)
 {
struct page *page;
-   int i;

spin_lock(&hugetlb_lock);
page = dequeue_huge_page();
@@ -89,8 +88,7 @@ struct page *alloc_huge_page(void)
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
page[1].mapping = (void *)free_huge_page;
-   for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-   clear_highpage(&page[i]);
+   prep_zero_page(page, HUGETLB_PAGE_ORDER, GFP_HIGHUSER | __GFP_COLD);
return page;
 }

Index: linux-2.6.11/mm/page_alloc.c
===
--- linux-2.6.11.orig/mm/page_alloc.c   2005-03-24 13:15:40.0 -0800
+++ linux-2.6.11/mm/page_alloc.c    2005-03-24 18:39:22.0 -0800
@@ -633,11 +633,17 @@ void fastcall free_cold_page(struct page
free_hot_cold_page(page, 1);
 }

-static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
+void prep_zero_page(struct page *page, unsigned int order, int gfp_flags)
 {
int i;

BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) == __GFP_HIGHMEM);
+
+#ifdef CONFIG_CLEAR_COLD
+   if ((gfp_flags & __GFP_COLD) && !PageHighMem(page))
+   clear_cold(page_address(page), order);
+   else
+#endif
for(i = 0; i < (1 << order); i++)
clear_highpage(page + i);
 }
Index: linux-2.6.11/include/linux/gfp.h
===
--- linux-2.6.11.orig/include/linux/gfp.h   2005-03-01 23:37:50.0 -0800
+++ linux-2.6.11/include/linux/gfp.h    2005-03-24 14:16:44.0 -0800
@@ -131,4 +131,5 @@ extern void FASTCALL(free_cold_page(stru

 void page_alloc_init(void);

+void prep_zero_page(struct page *, unsigned int order, int gfp_flags);
 #endif /* __LINUX_GFP_H */
Index: linux-2.6.11/arch/ia64/Kconfig
===
--- linux-2.6.11.orig/arch/ia64/Kconfig 2005-03-01 23:38:26.0 -0800
+++ linux-2.6.11/arch/ia64/Kconfig  2005-03-24 14:12:53.0 -0800
@@ -46,6 +46,10 @@ config GENERIC_IOMAP
bool
default y

+config CLEAR_COLD
+   bool
+   default y
+
 choice
prompt "System type"
default IA64_GENERIC
Index: linux-2.6.11/include/asm-ia64/page.h
===
--- linux-2.6.11.orig/include/asm-ia64/page.h   2005-03-01 23:37:48.0 -0800
+++ linux-2.6.11/include/asm-ia64/page.h    2005-03-24 14:12:53.0 -0800
@@ -57,6 +57,8 @@
 #  define STRICT_MM_TYPECHECKS

 extern void clear_page (void *page);
+/* Clear arbitrary order page using nontemporal writes */
+extern void clear_cold (void *page, unsigned int order);
 extern void copy_page (void *to, void *from);

 /*
Index: linux-2.6.11/arch/ia64/kernel/ia64_ksyms.c
===
--- linux-2.6.11.orig/arch/ia64/kernel/ia64_ksyms.c 2005-03-01 23:38:08.0 -0800
+++ linux-2.6.11/arch/ia64/kernel/ia64_ksyms.c  2005-03-24 14:12:53.0 -0800
@@ -39,6 +39,7 @@ EXPORT_SYMBOL(__up);

 #include <asm/page.h>
 EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(clear_cold);

 #ifdef CONFIG_VIRTUAL_MEM_MAP
 #include <linux/bootmem.h>
Index: linux-2.6.11/arch/ia64/lib/clear_page.S
===
--- linux-2.6.11.orig/arch/ia64/lib/clear_page.S    2005-03-01 23:37:47.0 -0800
+++ linux-2.6.11/arch/ia64/lib/clear_page.S 2005-03-24 14:24:29.0 -0800
@@ -7,6 

Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-21 Thread Denis Vlasenko
On Friday 18 March 2005 21:28, Andi Kleen wrote:
> On Fri, Mar 18, 2005 at 07:00:06AM -0800, Christoph Lameter wrote:
> > On Fri, 18 Mar 2005, Denis Vlasenko wrote:
> > 
> > > NT stores are not about 5% increase. 200%-300%. Provided you are ok with
> > > the fact that zeroed page ends up evicted from cache. Luckily, this is 
> > > exactly
> > > what you want with prezeroing.
> > 
> > These are pretty significant results. Maybe its best to use non-temporal
> 
> The differences are actually less. I do not know what Denis benchmarked,
> but in my tests the difference was never more than ~10%.  He got a zero
> too much? 

No. See attached.

# gcc -O2 0main.c
# ./a.out
Page clear/copy benchmark program.
buffer size: 1 Mb
Each test tried 64 times, max and min CPU cycles per page are reported.
Please disregard max values. They are due to system interference only.
clear_page() tests:
   normal_clear_page - took 44214 max,12615 min cycles per page
   normal_clear_page - took 18969 max,12649 min cycles per page
 repstosl_clear_page - took 19897 max,12655 min cycles per page
 movq_clear_page - took 39391 max,10782 min cycles per page
   movntq_clear_page - took 21612 max, 4779 min cycles per page

copy_page() tests:


I'm basically saying that 'microbenchmark-visible'
performance of NT stores is 200-300% higher than 'normal' stores.

BTW: cache eviction is not an intrinsic property of non-temporal
stores. It's merely how they're implemented in current CPUs:
if NT stores hit cached line, invalidate it and
push stores to bus. Else just push stores to bus
without reading cacheline from RAM first.

It is possible that some future CPU won't evict cacheline
if NT stores happened to hit it: "if NT stores hit cached line,
MODIFY it and push stores to bus".
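
For readers who want to try a non-temporal clear in user space, here is a minimal sketch. It is not the code from the attached benchmark: that program's movntq_clear_page uses the MMX movntq instruction, whereas this sketch uses the SSE2 intrinsic _mm_stream_si128 (movntdq) as a swapped-in equivalent, and it falls back to memset when SSE2 is not available. The function name clear_nt_sketch is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Zero `bytes` bytes at `dst`, using non-temporal stores where SSE2 is
 * available. `dst` must be 16-byte aligned and `bytes` a multiple of 16.
 */
static void clear_nt_sketch(void *dst, size_t bytes)
{
    assert(((uintptr_t)dst & 15) == 0 && (bytes & 15) == 0);
#if defined(__SSE2__)
    __m128i zero = _mm_setzero_si128();
    __m128i *p = dst;

    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(p + i, zero); /* non-temporal 16-byte store (movntdq) */
    _mm_sfence();                      /* order the streamed stores before later ones */
#else
    memset(dst, 0, bytes);             /* fallback: ordinary temporal stores */
#endif
}
```

Whether such a clear is faster than a plain memset depends on the CPU and on whether the page is touched again soon, which is exactly what this thread is arguing about; the sketch only shows the mechanism.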
--
vda


page_asm.tar.bz2
Description: application/tbz


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-18 Thread Christoph Lameter
On Fri, 18 Mar 2005, Andi Kleen wrote:

> It does not make any sense if you think of it - the memory bus
> of the CPU cannot be that much faster than the cache.

The memory bus would be able to reach a higher rate if properly optimized
for sequential writes to memory. A cache typically does random writes.


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-18 Thread Andi Kleen
On Fri, Mar 18, 2005 at 07:00:06AM -0800, Christoph Lameter wrote:
> On Fri, 18 Mar 2005, Denis Vlasenko wrote:
> 
> > NT stores are not about 5% increase. 200%-300%. Provided you are ok with
> > the fact that zeroed page ends up evicted from cache. Luckily, this is 
> > exactly
> > what you want with prezeroing.
> 
> These are pretty significant results. Maybe its best to use non-temporal

The differences are actually less. I do not know what Denis benchmarked,
but in my tests the difference was never more than ~10%.  He got a zero
too much? 

It does not make any sense if you think of it - the memory bus
of the CPU cannot be that much faster than the cache.

And the drawback of eating the cache misses later is really very
significant.

> stores in general for clearing pages? I checked and Itanium has always
> used non-temporal stores. So there will be no benefit for us from this

That is weird. I would actually try to switch to temporal stores, maybe
it will improve some benchmarks. 

> approach (we have 16k and 64k page sizes which may make the situation a
> bit different). Try to update the i386 architectures to do the same?

Definitely not. 

You can experiment with using it for the cleaner daemon, but even
there I would use some heuristic to make sure you only use it 
on pages that are at the end of a pretty long queue.

e.g. if you can guarantee that the page allocator will go through
500k-1MB before going to the NT page that is cache cold it may
be a good idea. But that might be pretty complicated and I am not
sure it will be worth it.
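
The heuristic described above amounts to a distance check, which might look like the sketch below. Everything in it is hypothetical: the thread does not say how the allocator would know how many pages sit ahead of a given page in its queue, so pages_ahead is assumed bookkeeping, and the threshold simply takes the upper end of the 500k-1MB range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define COLD_DISTANCE_BYTES (1024UL * 1024UL) /* upper end of the 500k-1MB range */

/*
 * pages_ahead: pages the allocator will hand out before it reaches this one
 * (hypothetical bookkeeping, not an existing kernel field).
 * Returns true when the page is far enough back that a non-temporal clear
 * is unlikely to cost a cache miss soon after.
 */
static bool want_nt_clear(size_t pages_ahead, size_t page_size)
{
    assert(page_size > 0);
    /* Round the required page count up so the byte distance is never short. */
    return pages_ahead >= (COLD_DISTANCE_BYTES + page_size - 1) / page_size;
}
```

With 4 KB pages this asks for at least 256 pages of queue ahead of the candidate; with 16 KB pages, 64.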

But for the clear running in the page fault handler context it is 
definitely a bad idea.

-Andi


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-18 Thread Christoph Lameter
On Fri, 18 Mar 2005, Denis Vlasenko wrote:

> NT stores are not about 5% increase. 200%-300%. Provided you are ok with
> the fact that zeroed page ends up evicted from cache. Luckily, this is exactly
> what you want with prezeroing.

These are pretty significant results. Maybe it's best to use non-temporal
stores in general for clearing pages? I checked and Itanium has always
used non-temporal stores. So there will be no benefit for us from this
approach (we have 16k and 64k page sizes which may make the situation a
bit different). Try to update the i386 architectures to do the same?

Or for prezeroing, you could register a zeroing driver that would use the
non-temporal stores with V8 of the prezeroing patches. In any case the
clear_pages patch is not useful the way it was intended for us and I
have dropped it from the prezeroing patch.



Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-18 Thread Andi Kleen
> Andi Kleen (iirc) says that non-temporal stores seem to be
> big win in microbenchmarks (and I second that), but they are
> a net loss when we are going to use zeroed page just after
> zeroing. He recommends avoiding non-temporal stores.

The rule of thumb is to only use non-temporal stores when your
data set is bigger than the L2/L3 caches of the CPU. This means >1MB.
The kernel normally never works on data sets that big.
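
To make the rule of thumb concrete, here is a userspace sketch (not kernel
code) that zeroes a buffer with streaming stores only when it is larger than
an assumed last-level cache size. zero_buffer and ASSUMED_LLC_BYTES are
made-up names for illustration; the SSE2 intrinsics are the standard ones
from emmintrin.h, and the code falls back to memset() when SSE2 is not
available or the buffer is not suitably aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_setzero_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Assumed last-level cache size; real code would query the CPU. */
#define ASSUMED_LLC_BYTES (1024UL * 1024UL)

/*
 * Zero len bytes at buf. The streaming path is taken only for buffers
 * larger than the assumed cache size that are 16-byte aligned with a
 * length that is a multiple of 16; everything else uses memset().
 */
static void zero_buffer(void *buf, size_t len)
{
#if defined(__SSE2__)
	if (len > ASSUMED_LLC_BYTES &&
	    ((uintptr_t)buf % 16) == 0 && (len % 16) == 0) {
		__m128i zero = _mm_setzero_si128();
		char *p = buf;
		size_t off;

		for (off = 0; off < len; off += 16)
			_mm_stream_si128((__m128i *)(p + off), zero);
		/* Streaming stores are weakly ordered; fence before reuse. */
		_mm_sfence();
		return;
	}
#endif
	memset(buf, 0, len);
}
```

A single 4k page never reaches the streaming path here, which matches the
point that ordinary page clearing works on data sets far below that size.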

For Christoph's new background cleaner daemon it may be worth it
when the queue is a LILO. This means it is likely there is a relatively
long time between the clearing operation and a workload using it.
But even then it is a very close call and would need clear benchmark 
numbers in macrobenchmarks.

-Andi



Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-18 Thread Denis Vlasenko
On Thursday 17 March 2005 03:33, Christoph Lameter wrote:
> On Fri, 11 Mar 2005, Denis Vlasenko wrote:
> 
> > Andi Kleen (iirc) says that non-temporal stores seem to be
> > big win in microbenchmarks (and I second that), but they are
> > a net loss when we are going to use zeroed page just after
> > zeroing. He recommends avoiding non-temporal stores.
> >
> > With this new page prezeroing infrastructure, that argument
> > most likely is not right anymore. Especially clearing of
> > high-order pages definitely will benefit from NT stores
> > because they do not kill L1 data cache in the process.
> >
> > I don't have K8 and therefore cannot be 100% sure, but
> > I really doubt that K8 optimizes "rep stosq" into _NT_ stores.
> 
> Hmm. That would be interesting to know and may be necessary to justify
> the continued existence of this patch. I tried to get some numbers on
> the performance wins for zeroing larger pages with the patch as is (no
> NT stores) and came up with:
> 
> Processor                         Performance Increase
> 
> Itanium 2 1.3Ghz M1/R5            1.5%
> AMD Athlon 64 3200+ i386 mode     3%
> AMD Athlon 64 3200+ x86_64 mode   3.3%
> 
> (this is if the zeroing engine is the cpu of course. Prezeroing
> may be done through some DMA gizmo independent of the cpu)
> 
> Itanium has more extensive optimization capabilities and
> seems to be able to better cope with the loop logic for regular
> clear_page. Thus the improvement is even less on Itanium.
> 
> Numbers obtained with the following patch that allows to get performance
> data from /proc/meminfo on zeroing performance (just divide Cycles by
> Pages for clear_page and clear_pages):

Here is a patch that allows different page zeroing optimizations
to be tested at runtime via sysctl.
It was run-tested around 2.6.8 and has been rediffed to 2.6.11.
Feel free to adapt it to your patch and test.

Also attached is a tarball for microbenchmarking routines. There are two
result files. Duron:

   normal_clear_page - took  8644 max, 8400 min cycles per page
 repstosl_clear_page - took  8626 max, 8418 min cycles per page
     movq_clear_page - took  8647 max, 8300 min cycles per page
   movntq_clear_page - took  2777 max, 2720 min cycles per page

And amd64:

   normal_clear_page - took  9427 max, 5781 min cycles per page
 repstosl_clear_page - took  9305 max, 5680 min cycles per page
     movq_clear_page - took  6167 max, 5576 min cycles per page
   movntq_clear_page - took  5456 max, 2354 min cycles per page

NT stores are not about 5% increase. 200%-300%. Provided you are ok with
the fact that zeroed page ends up evicted from cache. Luckily, this is exactly
what you want with prezeroing.
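
For reference, the ratios behind that claim can be worked out from the
minimum cycle counts above. This is only arithmetic on the posted numbers;
the struct and function names are invented for the sketch. Relative to
normal_clear_page, movntq comes out roughly 3.1 times faster on the Duron
and roughly 2.5 times faster on the amd64 box.

```c
struct clear_result {
	const char *name;
	unsigned long min_cycles;	/* best case, cycles per page */
};

/* Minimum cycle counts copied from the two result files above. */
static const struct clear_result duron_results[] = {
	{ "normal_clear_page",   8400 },
	{ "repstosl_clear_page", 8418 },
	{ "movq_clear_page",     8300 },
	{ "movntq_clear_page",   2720 },
};

static const struct clear_result amd64_results[] = {
	{ "normal_clear_page",   5781 },
	{ "repstosl_clear_page", 5680 },
	{ "movq_clear_page",     5576 },
	{ "movntq_clear_page",   2354 },
};

/* How many times faster `fast` is than `base` (ratio of cycle counts). */
static double speedup(const struct clear_result *base,
		      const struct clear_result *fast)
{
	return (double)base->min_cycles / (double)fast->min_cycles;
}
```

Note that these are microbenchmark ratios for the clearing loop alone; they
say nothing about the cost of cache misses taken later when the page is used.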
--
vda
diff -urpN linux-2.6.11.src/arch/i386/lib/Makefile linux-2.6.11-nt.src/arch/i386/lib/Makefile
--- linux-2.6.11.src/arch/i386/lib/Makefile	Tue Oct 19 00:53:10 2004
+++ linux-2.6.11-nt.src/arch/i386/lib/Makefile	Fri Mar 18 11:30:51 2005
@@ -4,7 +4,7 @@
 
 
 lib-y = checksum.o delay.o usercopy.o getuser.o memcpy.o strstr.o \
-	bitops.o
+	bitops.o page_ops.o mmx_page.o sse_page.o
 
 lib-$(CONFIG_X86_USE_3DNOW) += mmx.o
 lib-$(CONFIG_HAVE_DEC_LOCK) += dec_and_lock.o
diff -urpN linux-2.6.11.src/arch/i386/lib/mmx.c linux-2.6.11-nt.src/arch/i386/lib/mmx.c
--- linux-2.6.11.src/arch/i386/lib/mmx.c	Tue Oct 19 00:54:23 2004
+++ linux-2.6.11-nt.src/arch/i386/lib/mmx.c	Fri Mar 18 11:30:51 2005
@@ -120,280 +120,3 @@ void *_mmx_memcpy(void *to, const void *
 	kernel_fpu_end();
 	return p;
 }
-
-#ifdef CONFIG_MK7
-
-/*
- *	The K7 has streaming cache bypass load/store. The Cyrix III, K6 and
- *	other MMX using processors do not.
- */
-
-static void fast_clear_page(void *page)
-{
-	int i;
-
-	kernel_fpu_begin();
-	
-	__asm__ __volatile__ (
-		"  pxor %%mm0, %%mm0\n" : :
-	);
-
-	for(i=0;i<4096/64;i++)
-	{
-		__asm__ __volatile__ (
-		"  movntq %%mm0, (%0)\n"
-		"  movntq %%mm0, 8(%0)\n"
-		"  movntq %%mm0, 16(%0)\n"
-		"  movntq %%mm0, 24(%0)\n"
-		"  movntq %%mm0, 32(%0)\n"
-		"  movntq %%mm0, 40(%0)\n"
-		"  movntq %%mm0, 48(%0)\n"
-		"  movntq %%mm0, 56(%0)\n"
-		: : "r" (page) : "memory");
-		page+=64;
-	}
-	/* since movntq is weakly-ordered, a "sfence" is needed to become
-	 * ordered again.
-	 */
-	__asm__ __volatile__ (
-		"  sfence \n" : :
-	);
-	kernel_fpu_end();
-}
-
-static void fast_copy_page(void *to, void *from)
-{
-	int i;
-
-	kernel_fpu_begin();
-
-	/* maybe the prefetch stuff can go before the expensive fnsave...
-	 * but that is for later. -AV
-	 */
-	__asm__ __volatile__ (
-		"1: prefetch (%0)\n"
-		"   prefetch 64(%0)\n"
-		"   prefetch 128(%0)\n"
-		"   prefetch 192(%0)\n"
-		"   prefetch 256(%0)\n"
-		"2:  \n"
-		".section .fixup, \"ax\"\n"
-		"3: movw $0x1AEB, 1b\n"	/* jmp on 26 bytes */
-		"   jmp 2b\n"
-		".previous\n"
-		".section __ex_table,\"a\"\n"
-		"	.align 4\n"
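-		"	.long 1b, 3b\n"
-		".previous"
-		: : "r" (from) );
-
-	for(i=0; i<(4096-320)/64; i++)
-	{
-		__asm__ 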

Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-16 Thread Christoph Lameter
On Fri, 11 Mar 2005, Denis Vlasenko wrote:

> Andi Kleen (iirc) says that non-temporal stores seem to be
> big win in microbenchmarks (and I second that), but they are
> a net loss when we are going to use zeroed page just after
> zeroing. He recommends avoiding non-temporal stores.
>
> With this new page prezeroing infrastructure, that argument
> most likely is not right anymore. Especially clearing of
> high-order pages definitely will benefit from NT stores
> because they do not kill L1 data cache in the process.
>
> I don't have K8 and therefore cannot be 100% sure, but
> I really doubt that K8 optimizes "rep stosq" into _NT_ stores.

Hmm. That would be interesting to know and may be necessary to justify
the continued existence of this patch. I tried to get some numbers on
the performance wins for zeroing larger pages with the patch as is (no
NT stores) and came up with:

Processor                         Performance Increase

Itanium 2 1.3Ghz M1/R5            1.5%
AMD Athlon 64 3200+ i386 mode     3%
AMD Athlon 64 3200+ x86_64 mode   3.3%

(this is if the zeroing engine is the cpu of course. Prezeroing
may be done through some DMA gizmo independent of the cpu)

Itanium has more extensive optimization capabilities and
seems to be able to better cope with the loop logic for regular
clear_page. Thus the improvement is even less on Itanium.

The numbers were obtained with the following patch, which exposes zeroing
performance data in /proc/meminfo (just divide Cycles by Pages for
clear_page and clear_pages):

Index: linux-2.6.11/mm/page_alloc.c
===
--- linux-2.6.11.orig/mm/page_alloc.c   2005-03-16 17:12:51.0 -0800
+++ linux-2.6.11/mm/page_alloc.c        2005-03-16 17:17:28.0 -0800
@@ -633,13 +633,33 @@ void fastcall free_cold_page(struct page
free_hot_cold_page(page, 1);
 }

-static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
+void prep_zero_page(struct page *page, unsigned int order, unsigned int gfp_flags)
 {
int i;
+   unsigned long t1;

BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) == __GFP_HIGHMEM);
+
+#ifdef CONFIG_CLEAR_PAGES
+   if (!PageHighMem(page) && order>4) {
+   unsigned long t;
+
+   t1=get_cycles();
+   clear_pages(page_address(page), order);
+   t = get_cycles() - t1;
+   add_page_state(clear_pages_cycles, t);
+   add_page_state(clear_pages_order, 1 << order);
+   inc_page_state(clear_pages_nr);
+   return;
+   }
+#endif
+
+   t1=get_cycles();
for(i = 0; i < (1 << order); i++)
clear_highpage(page + i);
+   add_page_state(clear_page_cycles, get_cycles() - t1);
+   add_page_state(clear_page_order, 1 << order);
+   inc_page_state(clear_page_nr);
 }

 /*
Index: linux-2.6.11/include/linux/page-flags.h
===
--- linux-2.6.11.orig/include/linux/page-flags.h    2005-03-16 17:12:51.0 -0800
+++ linux-2.6.11/include/linux/page-flags.h 2005-03-16 17:13:02.0 -0800
@@ -131,6 +131,13 @@ struct page_state {
unsigned long allocstall;   /* direct reclaim calls */

unsigned long pgrotated;/* pages rotated to tail of the LRU */
+
+   unsigned long clear_page_nr;/* Nr of clear_page request */
+   unsigned long clear_page_cycles; /* Cycles spent in clear_page */
+   unsigned long clear_page_order; /* Sum of orders */
+   unsigned long clear_pages_nr;   /* Nr of clear_pages requests */
+   unsigned long clear_pages_cycles;   /* Nr of cycles in clear_pages */
+   unsigned long clear_pages_order;/* Sum of orders */
 };

 extern void get_page_state(struct page_state *ret);
Index: linux-2.6.11/fs/proc/proc_misc.c
===
--- linux-2.6.11.orig/fs/proc/proc_misc.c   2005-03-16 17:12:50.0 -0800
+++ linux-2.6.11/fs/proc/proc_misc.c        2005-03-16 17:22:18.0 -0800
@@ -127,7 +127,7 @@ static int meminfo_read_proc(char *page,
unsigned long allowed;
struct vmalloc_info vmi;

-   get_page_state(&ps);
+   get_full_page_state(&ps);
get_zone_counts(&active, &inactive, &free);

 /*
@@ -168,7 +168,13 @@ static int meminfo_read_proc(char *page,
"PageTables:   %8lu kB\n"
"VmallocTotal: %8lu kB\n"
"VmallocUsed:  %8lu kB\n"
-   "VmallocChunk: %8lu kB\n",
+   "VmallocChunk: %8lu kB\n"
+   "ClearPage #   %8lu\n"
+   "ClearPage Pgs %8lu\n"
+   "ClearPage Cyc %8lu\n"
+   "ClearPages #  %8lu\n"
+   "ClearPages Pg %8lu\n"
+   "ClearPages Cy %8lu\n",
K(i.totalram),

Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-11 Thread Denis Vlasenko
On Friday 11 March 2005 03:03, Christoph Lameter wrote:
> Changelog:
> - use Kconfig and CONFIG_CLEAR_PAGES
> 
> The zeroing of a page of an arbitrary order in page_alloc.c and in hugetlb.c
> may benefit from a clear_page that is capable of zeroing multiple pages at
> once. The following patch adds a function "clear_pages" that is capable of
> clearing multiple contiguous pages at once.
> 
> Patch against 2.6.11-bk6
> 
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
[snip]
> -clear_page_end:
> +clear_pages_end:
> 
>   /* C stepping K8 run faster using the string instructions.
>  It is also a lot simpler. Use this when possible */

Andi Kleen (iirc) says that non-temporal stores seem to be
big win in microbenchmarks (and I second that), but they are
a net loss when we are going to use zeroed page just after
zeroing. He recommends avoiding non-temporal stores.

With this new page prezeroing infrastructure, that argument
most likely is not right anymore. Especially clearing of
high-order pages definitely will benefit from NT stores
because they do not kill L1 data cache in the process.

I don't have K8 and therefore cannot be 100% sure, but
I really doubt that K8 optimizes "rep stosq" into _NT_ stores.

Andi?
--
vda



Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-10 Thread Christoph Lameter
Changelog:
- use Kconfig and CONFIG_CLEAR_PAGES

The zeroing of a page of an arbitrary order in page_alloc.c and in hugetlb.c
may benefit from a clear_page that is capable of zeroing multiple pages at
once. The following patch adds a function "clear_pages" that is capable of
clearing multiple contiguous pages at once.
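
As a rough userspace model of the interface described above (not the ia64
assembly from the patch): clear_pages takes a start address and an order and
clears 2^order contiguous pages, and clear_page becomes the order-0 case.
SKETCH_PAGE_SIZE is an assumed stand-in for the kernel's PAGE_SIZE, and
memset() stands in for the architecture-specific clearing loop.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed; the kernel uses PAGE_SIZE */

/* Clear 2^order contiguous pages starting at page (userspace model). */
static void clear_pages(void *page, int order)
{
	memset(page, 0, SKETCH_PAGE_SIZE << order);
}

/* A single page is simply the order-0 case, as in the patch. */
#define clear_page(page) clear_pages((page), 0)
```

A caller clearing an order-2 allocation would then make one clear_pages call
instead of looping over four clear_page calls.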

Patch against 2.6.11-bk6

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.11/mm/page_alloc.c
===
--- linux-2.6.11.orig/mm/page_alloc.c   2005-03-10 14:42:43.0 -0800
+++ linux-2.6.11/mm/page_alloc.c        2005-03-10 15:01:53.0 -0800
@@ -628,11 +628,19 @@ void fastcall free_cold_page(struct page
free_hot_cold_page(page, 1);
 }

-static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
+void prep_zero_page(struct page *page, unsigned int order, unsigned int gfp_flags)
 {
int i;

BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) == __GFP_HIGHMEM);
+
+#ifdef CONFIG_CLEAR_PAGES
+   if (!PageHighMem(page)) {
+   clear_pages(page_address(page), order);
+   return;
+   }
+#endif
+
for(i = 0; i < (1 << order); i++)
clear_highpage(page + i);
 }
Index: linux-2.6.11/mm/hugetlb.c
===
--- linux-2.6.11.orig/mm/hugetlb.c  2005-03-01 23:38:12.0 -0800
+++ linux-2.6.11/mm/hugetlb.c   2005-03-10 15:01:53.0 -0800
@@ -78,7 +78,6 @@ void free_huge_page(struct page *page)
 struct page *alloc_huge_page(void)
 {
struct page *page;
-   int i;

spin_lock(&hugetlb_lock);
page = dequeue_huge_page();
@@ -89,8 +88,7 @@ struct page *alloc_huge_page(void)
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
page[1].mapping = (void *)free_huge_page;
-   for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-   clear_highpage(&page[i]);
+   prep_zero_page(page, HUGETLB_PAGE_ORDER, GFP_HIGHUSER);
return page;
 }

Index: linux-2.6.11/include/asm-ia64/page.h
===
--- linux-2.6.11.orig/include/asm-ia64/page.h   2005-03-01 23:37:48.0 -0800
+++ linux-2.6.11/include/asm-ia64/page.h        2005-03-10 15:02:47.0 -0800
@@ -56,8 +56,9 @@
 # ifdef __KERNEL__
 #  define STRICT_MM_TYPECHECKS

-extern void clear_page (void *page);
+extern void clear_pages (void *page, int order);
 extern void copy_page (void *to, void *from);
+#define clear_page(__page) clear_pages(__page, 0)

 /*
  * clear_user_page() and copy_user_page() can't be inline functions because
Index: linux-2.6.11/arch/ia64/kernel/ia64_ksyms.c
===
--- linux-2.6.11.orig/arch/ia64/kernel/ia64_ksyms.c 2005-03-01 23:38:08.0 -0800
+++ linux-2.6.11/arch/ia64/kernel/ia64_ksyms.c  2005-03-10 15:01:53.0 -0800
@@ -38,7 +38,7 @@ EXPORT_SYMBOL(__down_trylock);
 EXPORT_SYMBOL(__up);

 #include 
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(clear_pages);

 #ifdef CONFIG_VIRTUAL_MEM_MAP
 #include 
Index: linux-2.6.11/arch/ia64/lib/clear_page.S
===
--- linux-2.6.11.orig/arch/ia64/lib/clear_page.S    2005-03-01 23:37:47.0 -0800
+++ linux-2.6.11/arch/ia64/lib/clear_page.S 2005-03-10 15:01:53.0 -0800
@@ -7,6 +7,7 @@
  * 1/06/01 davidm  Tuned for Itanium.
  * 2/12/02 kchen   Tuned for both Itanium and McKinley
  * 3/08/02 davidm  Some more tweaking
+ * 12/10/04 clameter   Make it work on pages of order size
  */
 #include 

@@ -29,27 +30,33 @@
 #define dst4   r11

 #define dst_last   r31
+#define totsize   r14

-GLOBAL_ENTRY(clear_page)
+GLOBAL_ENTRY(clear_pages)
.prologue
-   .regstk 1,0,0,0
-   mov r16 = PAGE_SIZE/L3_LINE_SIZE-1  // main loop count, -1=repeat/until
+   .regstk 2,0,0,0
+   mov r16 = PAGE_SIZE/L3_LINE_SIZE    // main loop count
+   mov totsize = PAGE_SIZE
.save ar.lc, saved_lc
mov saved_lc = ar.lc
-
+   ;;
.body
+   adds dst1 = 16, in0
mov ar.lc = (PREFETCH_LINES - 1)
mov dst_fetch = in0
-   adds dst1 = 16, in0
adds dst2 = 32, in0
+   shl r16 = r16, in1
+   shl totsize = totsize, in1
;;
 .fetch: stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
adds dst3 = 48, in0 // executing this multiple times is harmless
br.cloop.sptk.few .fetch
+   add r16 = -1,r16
+   add dst_last = totsize, dst_fetch
+   adds dst4 = 64, in0
;;
-   addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
mov ar.lc = r16 // one L3 line per iteration
-   adds dst4 = 64, in0
+   adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
;;
 #ifdef CONFIG_ITANIUM
// 
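
For readers who do not speak ia64 assembly, the setup arithmetic in the clear_page.S hunk above can be restated in C. This is only a sketch of my reading of the patch, not the author's code: the constants (16 KB pages, 128-byte L3 lines, 9 prefetch lines) are assumed values for illustration, and `sk_loop_params` / `sk_clear_pages_params` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed illustrative values; the real ones come from kernel headers and config. */
#define SK_PAGE_SIZE      16384u
#define SK_L3_LINE_SIZE   128u
#define SK_PREFETCH_LINES 9u

struct sk_loop_params {
    uint64_t loop_count; /* value moved into ar.lc for the main loop */
    uint64_t totsize;    /* total bytes to clear */
    uint64_t dst_last;   /* limit later compared against dst_fetch */
};

/* in0 = start address, order = log2 of the number of pages (in1 in the patch). */
struct sk_loop_params sk_clear_pages_params(uint64_t in0, unsigned int order)
{
    struct sk_loop_params p;
    uint64_t r16 = SK_PAGE_SIZE / SK_L3_LINE_SIZE; /* mov r16 = PAGE_SIZE/L3_LINE_SIZE */
    uint64_t dst_fetch;

    p.totsize = SK_PAGE_SIZE;                      /* mov totsize = PAGE_SIZE */
    r16 <<= order;                                 /* shl r16 = r16, in1 */
    p.totsize <<= order;                           /* shl totsize = totsize, in1 */

    /* The .fetch loop runs PREFETCH_LINES times, post-incrementing
     * dst_fetch by one L3 line on each iteration. */
    dst_fetch = in0 + (uint64_t)SK_PREFETCH_LINES * SK_L3_LINE_SIZE;

    p.loop_count = r16 - 1;                        /* add r16 = -1,r16 */
    p.dst_last = p.totsize + dst_fetch;            /* add dst_last = totsize, dst_fetch */
    /* adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last */
    p.dst_last -= (uint64_t)SK_PREFETCH_LINES * SK_L3_LINE_SIZE;
    return p;
}
```

Under these assumptions, order 0 yields the same loop count and end address as the old single-page code, and each increment of the order doubles both the line count and the byte total.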

Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-10 Thread Christoph Lameter
On Thu, 10 Mar 2005, Dave Hansen wrote:

> > +extern void clear_pages (void *page, int order);
> >  extern void copy_page (void *to, void *from);
> > +#define clear_page(__page) clear_pages(__page, 0)
> > +#define __HAVE_ARCH_CLEAR_PAGES
>
> Although this is a simple instance, could this please be done in a
> Kconfig file?  If that #define happens inside of other #ifdefs, it can
> be quite hard to decipher the special .config incantation to get it set.
> On the other hand, if the dependencies are spelled out in a Kconfig
> entry...

Ok will do.

> BTW, I tried applying this to 2.6.11-bk6, and it rejected:
> ...
> patching file include/asm-i386/page.h
> Hunk #2 FAILED at 28.
> 1 out of 2 hunks FAILED -- saving rejects to file
> include/asm-i386/page.h.rej
> ...
>
> There were some more rejects as well.  Were there some other patches
> applied first?

Patches work fine here.


Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-10 Thread Dave Hansen
On Thu, 2005-03-10 at 12:35 -0800, Christoph Lameter wrote:
> +#ifdef __HAVE_ARCH_CLEAR_PAGES
> + if (!PageHighMem(page)) {
> + clear_pages(page_address(page), order);
> + return;
> + }
> +#endif
> +
>   for(i = 0; i < (1 << order); i++)
>   clear_highpage(page + i);
>  }
...
> --- linux-2.6.11.orig/include/asm-ia64/page.h 2005-03-01 23:37:48.0 -0800
> +++ linux-2.6.11/include/asm-ia64/page.h  2005-03-10 10:57:10.0 -0800
> @@ -56,8 +56,10 @@
>  # ifdef __KERNEL__
>  #  define STRICT_MM_TYPECHECKS
> 
> -extern void clear_page (void *page);
> +extern void clear_pages (void *page, int order);
>  extern void copy_page (void *to, void *from);
> +#define clear_page(__page) clear_pages(__page, 0)
> +#define __HAVE_ARCH_CLEAR_PAGES

Although this is a simple instance, could this please be done in a
Kconfig file?  If that #define happens inside of other #ifdefs, it can
be quite hard to decipher the special .config incantation to get it set.
On the other hand, if the dependencies are spelled out in a Kconfig
entry...
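
To make the suggestion concrete, here is a rough C sketch of how a Kconfig-driven symbol could gate the bulk path. Everything in it is an assumption for illustration: the Kconfig entry in the comment is hypothetical, `DEMO_PAGE_SIZE` is an assumed page size, and `demo_clear_pages` / `demo_prep_zero` are user-space stand-ins rather than kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical Kconfig entry an architecture could carry (not from the patch):
 *
 *   config CLEAR_PAGES
 *           bool
 *           default y
 *
 * The build system would then define CONFIG_CLEAR_PAGES for C code.
 * Compile this sketch with -DCONFIG_CLEAR_PAGES to take the bulk path.
 */

#define DEMO_PAGE_SIZE 4096u /* assumed page size, illustration only */

#ifdef CONFIG_CLEAR_PAGES
/* Stand-in for an arch-provided bulk clear of 2^order contiguous pages. */
void demo_clear_pages(void *addr, unsigned int order)
{
    memset(addr, 0, (size_t)DEMO_PAGE_SIZE << order);
}
#endif

/* Zero 2^order pages, using the bulk routine when the symbol is set. */
void demo_prep_zero(void *addr, unsigned int order)
{
#ifdef CONFIG_CLEAR_PAGES
    demo_clear_pages(addr, order);
#else
    unsigned char *p = addr;
    size_t i;

    /* Generic fallback: clear one page at a time. */
    for (i = 0; i < ((size_t)1 << order); i++)
        memset(p + i * DEMO_PAGE_SIZE, 0, DEMO_PAGE_SIZE);
#endif
}
```

With the dependency spelled out in one Kconfig entry, the C side only ever tests a single symbol, regardless of which other #ifdefs surround the header.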

BTW, I tried applying this to 2.6.11-bk6, and it rejected:
...
patching file include/asm-i386/page.h
Hunk #2 FAILED at 28.
1 out of 2 hunks FAILED -- saving rejects to file
include/asm-i386/page.h.rej
...

There were some more rejects as well.  Were there some other patches
applied first?

-- Dave



[PATCH] add a clear_pages function to clear pages of higher order

2005-03-10 Thread Christoph Lameter
Zeroing a page of arbitrary order in page_alloc.c and in hugetlb.c may benefit
from a clear_page variant that can zero multiple pages at once. The following
patch adds a function "clear_pages" that clears multiple contiguous pages in a
single call.

This used to be part of the prezeroing patchset but there may be benefits
to huge pages and regular kernel code as well. Also Mel Gorman's patchset
to reduce fragmentation and introduce prezeroing in a different way may
benefit from this patch. The patch only provides a clear_pages function
for ia32, ia64, x86_64 and sparc64 (all tested). Other platforms may
provide a clear_pages function by defining __HAVE_ARCH_CLEAR_PAGES.
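
As a plain-C illustration of the intended semantics (a sketch only, not the arch implementations in the patch): a call with a given order covers PAGE_SIZE << order bytes starting at the given address. `SKETCH_PAGE_SIZE` is an assumed value and `sketch_clear_pages` / `sketch_is_zeroed` are hypothetical user-space names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size, illustration only */

/* Zero 2^order contiguous pages starting at 'page' in a single call. */
void sketch_clear_pages(void *page, unsigned int order)
{
    memset(page, 0, (size_t)SKETCH_PAGE_SIZE << order);
}

/* Return 1 if the first (SKETCH_PAGE_SIZE << order) bytes are all zero. */
int sketch_is_zeroed(const void *page, unsigned int order)
{
    const unsigned char *p = page;
    size_t n = (size_t)SKETCH_PAGE_SIZE << order;
    size_t i;

    for (i = 0; i < n; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

With order 0 this degenerates to a single-page clear, which is why the patch can define clear_page(__page) as clear_pages(__page, 0).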

Patch against 2.6.11-bk6

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.11/mm/page_alloc.c
===
--- linux-2.6.11.orig/mm/page_alloc.c   2005-03-10 10:57:06.0 -0800
+++ linux-2.6.11/mm/page_alloc.c2005-03-10 10:57:10.0 -0800
@@ -628,11 +628,19 @@ void fastcall free_cold_page(struct page
free_hot_cold_page(page, 1);
 }

-static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
+void prep_zero_page(struct page *page, unsigned int order, unsigned int gfp_flags)
 {
int i;

BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) == __GFP_HIGHMEM);
+
+#ifdef __HAVE_ARCH_CLEAR_PAGES
+   if (!PageHighMem(page)) {
+   clear_pages(page_address(page), order);
+   return;
+   }
+#endif
+
for(i = 0; i < (1 << order); i++)
clear_highpage(page + i);
 }
Index: linux-2.6.11/mm/hugetlb.c
===
--- linux-2.6.11.orig/mm/hugetlb.c  2005-03-01 23:38:12.0 -0800
+++ linux-2.6.11/mm/hugetlb.c   2005-03-10 10:57:10.0 -0800
@@ -78,7 +78,6 @@ void free_huge_page(struct page *page)
 struct page *alloc_huge_page(void)
 {
struct page *page;
-   int i;

spin_lock(&hugetlb_lock);
page = dequeue_huge_page();
@@ -89,8 +88,7 @@ struct page *alloc_huge_page(void)
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
page[1].mapping = (void *)free_huge_page;
-   for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-   clear_highpage(&page[i]);
+   prep_zero_page(page, HUGETLB_PAGE_ORDER, GFP_HIGHUSER);
return page;
 }

Index: linux-2.6.11/include/asm-ia64/page.h
===
--- linux-2.6.11.orig/include/asm-ia64/page.h   2005-03-01 23:37:48.0 -0800
+++ linux-2.6.11/include/asm-ia64/page.h   2005-03-10 10:57:10.0 -0800
@@ -56,8 +56,10 @@
 # ifdef __KERNEL__
 #  define STRICT_MM_TYPECHECKS

-extern void clear_page (void *page);
+extern void clear_pages (void *page, int order);
 extern void copy_page (void *to, void *from);
+#define clear_page(__page) clear_pages(__page, 0)
+#define __HAVE_ARCH_CLEAR_PAGES

 /*
  * clear_user_page() and copy_user_page() can't be inline functions because
Index: linux-2.6.11/arch/ia64/kernel/ia64_ksyms.c
===
--- linux-2.6.11.orig/arch/ia64/kernel/ia64_ksyms.c 2005-03-01 23:38:08.0 -0800
+++ linux-2.6.11/arch/ia64/kernel/ia64_ksyms.c  2005-03-10 10:57:10.0 -0800
@@ -38,7 +38,7 @@ EXPORT_SYMBOL(__down_trylock);
 EXPORT_SYMBOL(__up);

 #include <asm/page.h>
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(clear_pages);

 #ifdef CONFIG_VIRTUAL_MEM_MAP
 #include <linux/bootmem.h>
Index: linux-2.6.11/arch/ia64/lib/clear_page.S
===
--- linux-2.6.11.orig/arch/ia64/lib/clear_page.S    2005-03-01 23:37:47.0 -0800
+++ linux-2.6.11/arch/ia64/lib/clear_page.S 2005-03-10 10:57:10.0 -0800
@@ -7,6 +7,7 @@
  * 1/06/01 davidm  Tuned for Itanium.
  * 2/12/02 kchen   Tuned for both Itanium and McKinley
  * 3/08/02 davidm  Some more tweaking
+ * 12/10/04 clameter   Make it work on pages of order size
  */
 #include <linux/config.h>

@@ -29,27 +30,33 @@
 #define dst4   r11

 #define dst_last   r31
+#define totsize   r14

-GLOBAL_ENTRY(clear_page)
+GLOBAL_ENTRY(clear_pages)
.prologue
-   .regstk 1,0,0,0
-   mov r16 = PAGE_SIZE/L3_LINE_SIZE-1  // main loop count, -1=repeat/until
+   .regstk 2,0,0,0
+   mov r16 = PAGE_SIZE/L3_LINE_SIZE    // main loop count
+   mov totsize = PAGE_SIZE
.save ar.lc, saved_lc
mov saved_lc = ar.lc
-
+   ;;
.body
+   adds dst1 = 16, in0
mov ar.lc = (PREFETCH_LINES - 1)
mov dst_fetch = in0
-   adds dst1 = 16, in0
adds dst2 = 32, in0
+   shl r16 = r16, in1
+   shl totsize = totsize, in1
;;
 .fetch: stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
adds dst3 = 48, in0 // executing this multiple times is harmless

Re: [PATCH] add a clear_pages function to clear pages of higher order

2005-03-10 Thread Christoph Lameter
Changelog:
- use Kconfig and CONFIG_CLEAR_PAGES

Zeroing a page of arbitrary order in page_alloc.c and in hugetlb.c may benefit
from a clear_page variant that can zero multiple pages at once. The following
patch adds a function "clear_pages" that clears multiple contiguous pages in a
single call.

Patch against 2.6.11-bk6

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.11/mm/page_alloc.c
===
--- linux-2.6.11.orig/mm/page_alloc.c   2005-03-10 14:42:43.0 -0800
+++ linux-2.6.11/mm/page_alloc.c2005-03-10 15:01:53.0 -0800
@@ -628,11 +628,19 @@ void fastcall free_cold_page(struct page
free_hot_cold_page(page, 1);
 }

-static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
+void prep_zero_page(struct page *page, unsigned int order, unsigned int gfp_flags)
 {
int i;

BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) == __GFP_HIGHMEM);
+
+#ifdef CONFIG_CLEAR_PAGES
+   if (!PageHighMem(page)) {
+   clear_pages(page_address(page), order);
+   return;
+   }
+#endif
+
for(i = 0; i < (1 << order); i++)
clear_highpage(page + i);
 }
Index: linux-2.6.11/mm/hugetlb.c
===
--- linux-2.6.11.orig/mm/hugetlb.c  2005-03-01 23:38:12.0 -0800
+++ linux-2.6.11/mm/hugetlb.c   2005-03-10 15:01:53.0 -0800
@@ -78,7 +78,6 @@ void free_huge_page(struct page *page)
 struct page *alloc_huge_page(void)
 {
struct page *page;
-   int i;

spin_lock(&hugetlb_lock);
page = dequeue_huge_page();
@@ -89,8 +88,7 @@ struct page *alloc_huge_page(void)
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
page[1].mapping = (void *)free_huge_page;
-   for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-   clear_highpage(&page[i]);
+   prep_zero_page(page, HUGETLB_PAGE_ORDER, GFP_HIGHUSER);
return page;
 }

Index: linux-2.6.11/include/asm-ia64/page.h
===
--- linux-2.6.11.orig/include/asm-ia64/page.h   2005-03-01 23:37:48.0 -0800
+++ linux-2.6.11/include/asm-ia64/page.h   2005-03-10 15:02:47.0 -0800
@@ -56,8 +56,9 @@
 # ifdef __KERNEL__
 #  define STRICT_MM_TYPECHECKS

-extern void clear_page (void *page);
+extern void clear_pages (void *page, int order);
 extern void copy_page (void *to, void *from);
+#define clear_page(__page) clear_pages(__page, 0)

 /*
  * clear_user_page() and copy_user_page() can't be inline functions because
Index: linux-2.6.11/arch/ia64/kernel/ia64_ksyms.c
===
--- linux-2.6.11.orig/arch/ia64/kernel/ia64_ksyms.c 2005-03-01 23:38:08.0 -0800
+++ linux-2.6.11/arch/ia64/kernel/ia64_ksyms.c  2005-03-10 15:01:53.0 -0800
@@ -38,7 +38,7 @@ EXPORT_SYMBOL(__down_trylock);
 EXPORT_SYMBOL(__up);

 #include <asm/page.h>
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(clear_pages);

 #ifdef CONFIG_VIRTUAL_MEM_MAP
 #include <linux/bootmem.h>
Index: linux-2.6.11/arch/ia64/lib/clear_page.S
===
--- linux-2.6.11.orig/arch/ia64/lib/clear_page.S    2005-03-01 23:37:47.0 -0800
+++ linux-2.6.11/arch/ia64/lib/clear_page.S 2005-03-10 15:01:53.0 -0800
@@ -7,6 +7,7 @@
  * 1/06/01 davidm  Tuned for Itanium.
  * 2/12/02 kchen   Tuned for both Itanium and McKinley
  * 3/08/02 davidm  Some more tweaking
+ * 12/10/04 clameter   Make it work on pages of order size
  */
 #include <linux/config.h>

@@ -29,27 +30,33 @@
 #define dst4   r11

 #define dst_last   r31
+#define totsize   r14

-GLOBAL_ENTRY(clear_page)
+GLOBAL_ENTRY(clear_pages)
.prologue
-   .regstk 1,0,0,0
-   mov r16 = PAGE_SIZE/L3_LINE_SIZE-1  // main loop count, -1=repeat/until
+   .regstk 2,0,0,0
+   mov r16 = PAGE_SIZE/L3_LINE_SIZE    // main loop count
+   mov totsize = PAGE_SIZE
.save ar.lc, saved_lc
mov saved_lc = ar.lc
-
+   ;;
.body
+   adds dst1 = 16, in0
mov ar.lc = (PREFETCH_LINES - 1)
mov dst_fetch = in0
-   adds dst1 = 16, in0
adds dst2 = 32, in0
+   shl r16 = r16, in1
+   shl totsize = totsize, in1
;;
 .fetch: stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
adds dst3 = 48, in0 // executing this multiple times is harmless
br.cloop.sptk.few .fetch
+   add r16 = -1,r16
+   add dst_last = totsize, dst_fetch
+   adds dst4 = 64, in0
;;
-   addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
mov ar.lc = r16 // one L3 line per iteration
-   adds dst4 = 64, in0
+   adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last