Re: [PATCH/RFC] A method for clearing out page cache
Hi!

> So what it comes down to is
>
>	sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
>
> where `what_to_free' consists of a bunch of bitflags (unmapped pagecache,
> mapped pagecache, anonymous memory, slab, ...).

Heh, swsusp needs shrink_all_memory() and I'd like to use something more
generic, as shrink_all_memory() does not seem to work properly. I guess
that loop over all node_ids should be easy ;-).

								Pavel
-- 
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] A method for clearing out page cache
Andrew asked:
> So...  Cannot the application remove all its pagecache with
> posix_fadvise() prior to exiting?

Hang on ... The replies of Ray and Martin answer your immediate question.
But we (SGI) are still busy discussing the bigger picture behind the
scenes ...

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [PATCH/RFC] A method for clearing out page cache
Andrew Morton wrote:
> Paul Jackson <[EMAIL PROTECTED]> wrote:
>> As Martin wrote, when he submitted this patch:
>>> The motivation for this patch is for setting up High Performance
>>> Computing jobs, where initial memory placement is very important to
>>> overall performance.
>>
>> Any left over cache is wrong, for this situation.
>
> So...  Cannot the application remove all its pagecache with
> posix_fadvise() prior to exiting?

Even if we modified all applications to do this, it still wouldn't help
for dirty page cache, which would eventually be cleaned, and hang around
long after the application has departed. But the previous statement has a
false hypothesis, namely, that we could change all applications to do
this.

-- 
Best Regards,
Ray
---
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
[EMAIL PROTECTED]             [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better", so I installed Linux.
---
Re: [PATCH/RFC] A method for clearing out page cache
On Tue, Feb 22, 2005 at 10:45:35AM -0800, Andrew Morton wrote:
> Paul Jackson <[EMAIL PROTECTED]> wrote:
>> As Martin wrote, when he submitted this patch:
>>> The motivation for this patch is for setting up High Performance
>>> Computing jobs, where initial memory placement is very important to
>>> overall performance.
>>
>> Any left over cache is wrong, for this situation.
>
> So...  Cannot the application remove all its pagecache with
> posix_fadvise() prior to exiting?

I think Paul's referring to pagecache (as well as other caches) that are
on the node from other uses, not necessarily from another HPC job that has
recently terminated.

mh
-- 
Martin Hicks || Silicon Graphics Inc. || [EMAIL PROTECTED]
Re: [PATCH/RFC] A method for clearing out page cache
Paul Jackson <[EMAIL PROTECTED]> wrote:
> As Martin wrote, when he submitted this patch:
>> The motivation for this patch is for setting up High Performance
>> Computing jobs, where initial memory placement is very important to
>> overall performance.
>
> Any left over cache is wrong, for this situation.

So...  Cannot the application remove all its pagecache with
posix_fadvise() prior to exiting?
Re: [PATCH/RFC] A method for clearing out page cache
Ingo Molnar wrote:
> * Andrew Morton <[EMAIL PROTECTED]> wrote:
>>> . enable users to specify an 'allocation priority' of some sort, which
>>> kicks out the pagecache on the local node - or something like that.
>>
>> Yes, that would be preferable - I don't know what the difficulty is
>> with that.  sys_set_mempolicy() should provide a sufficiently good
>> hint.
>
> yes. I'm not against some flushing mechanism for debugging or test
> purposes (it can be useful to start from a new, clean state - and as
> such the sysctl, for root only and depending on KERNEL_DEBUG, is
> probably better than an explicit syscall), but the idea of giving a
> flushing API to applications is bad, I believe.

We're pretty agnostic about this. I agree that if we were to make this a
system call, then it should be restricted to root. Or make it a sysctl.
Whichever way you guys want to go is fine with us.

> It is the 'easy and incorrect path' to a number of NUMA (and non-NUMA)
> VM problems, and I fear that it will destroy the evolution of VM
> priority/placement/affinity APIs (NUMAlib, etc.).

I have two observations about this:

(1) It is our intent to use the infrastructure provided by this patch as
    the basis for an automatic (i.e. included with the VM) approach that
    selectively removes unused page cache pages before spilling off node.
    We just figured it would be easier to get the infrastructure in place
    first.

(2) If a sufficiently well behaved application knows in advance how much
    free memory it needs per node, then it makes sense to provide a
    mechanism for the application to request this, rather than for the VM
    to try to puzzle this out later. Automatic algorithms in the VM are
    never perfect; they should be reserved for those cases where the
    applications either cooperate in such a way as to make memory demands
    impossible to predict, or the application programmer can't (or can't
    take the time to) predict how much memory the application will use.

> At least making it sufficiently painful to use (via the originally
> proposed root-only sysctl) could still preserve some of the incentive
> to provide a clean solution for applications. 'Time to market'
> constraints should not be considered when adding core mechanisms.
>
>	Ingo

-- 
Best Regards,
Ray
---
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
[EMAIL PROTECTED]             [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better", so I installed Linux.
---
Re: [PATCH/RFC] A method for clearing out page cache
Ingo wrote:
> app designers very frequently think that the VM gets its act wrong
> (most of the time for the wrong reasons),

As Martin wrote, when he submitted this patch:
> The motivation for this patch is for setting up High Performance
> Computing jobs, where initial memory placement is very important to
> overall performance.

Any left over cache is wrong, for this situation. The only right answer
(through no fault of the VM, which can't predict such use) is to clear out
the past cache and ensure that all allocations are satisfied with
node-local memory, with no page-out delays, for all the threads in such
tightly coupled jobs. These jobs have been sized to use every ounce of CPU
and memory from sometimes hundreds of nodes, for hours or days, using
tightly coupled MPI and OpenMP codes. Any misplaced pages (off the local
node) or paging delays quickly lead to erratic and reduced performance.

Flushing all the cache like this hurts any normal load that has any
continuity of working set, and such flushing is not cheap. I have not
observed much interest in doing this outside of appropriate use when
starting up a big HPC app, as described above, or the test and debug
situations that you mention. For certain HPC apps, it can be essential to
repeatable job performance.

Granted, this might not be for most systems. Perhaps a CONFIG option, so
that by default this worked on builds for big honkin' NUMA boxes, but was
an -ENOSYS error on ordinary sized systems? Though I prefer not to create
artificial distinctions between configurations without good reason,
perhaps this is such a reason.

Making the API ugly won't reduce its use much; rather, it will just
increase code maintenance costs a bit and breed a few more bugs. Those who
think they want this will find a way to do it. If something's worth doing,
it's worth doing cleanly.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [PATCH/RFC] A method for clearing out page cache
* Andrew Morton <[EMAIL PROTECTED]> wrote:
>> . enable users to specify an 'allocation priority' of some sort, which
>> kicks out the pagecache on the local node - or something like that.
>
> Yes, that would be preferable - I don't know what the difficulty is
> with that.  sys_set_mempolicy() should provide a sufficiently good
> hint.

yes. I'm not against some flushing mechanism for debugging or test
purposes (it can be useful to start from a new, clean state - and as such
the sysctl, for root only and depending on KERNEL_DEBUG, is probably
better than an explicit syscall), but the idea of giving a flushing API to
applications is bad, I believe.

It is the 'easy and incorrect path' to a number of NUMA (and non-NUMA) VM
problems, and I fear that it will destroy the evolution of VM
priority/placement/affinity APIs (NUMAlib, etc.). At least making it
sufficiently painful to use (via the originally proposed root-only sysctl)
could still preserve some of the incentive to provide a clean solution for
applications. 'Time to market' constraints should not be considered when
adding core mechanisms.

	Ingo
Re: [PATCH/RFC] A method for clearing out page cache
Ingo Molnar <[EMAIL PROTECTED]> wrote:
> app designers very frequently think that the VM gets its act wrong
> (most of the time for the wrong reasons), and the last thing we want is
> to enable them to hack around real problems.

Not really. Memory reclaim tries to predict the future and expects some
sort of "average" workload. For some workloads that prediction is
hopelessly wrong. Although we could surely provide manual hinting
machinery which is less crude than this proposal.

> . enable users to specify an 'allocation priority' of some sort, which
> kicks out the pagecache on the local node - or something like that.

Yes, that would be preferable - I don't know what the difficulty is with
that.  sys_set_mempolicy() should provide a sufficiently good hint.
Re: [PATCH/RFC] A method for clearing out page cache
* Andrew Morton <[EMAIL PROTECTED]> wrote:
>> However, the first step is to do this manually from user space.
>
> Yup.  The thing is, lots of people want this feature for various
> reasons.  Not just numerical-computing-users-on-NUMA.  We should get it
> right for them too.
>
> Especially kernel developers, who have various nasty userspace tools
> which will manually reclaim pagecache.  But non-kernel-developers will
> use it too, when they think the VM is screwing them over ;)

app designers very frequently think that the VM gets its act wrong (most
of the time for the wrong reasons), and the last thing we want is to
enable them to hack around real problems. How are we supposed to debug VM
problems where one player periodically flushes the whole pagecache? What
if that flushing, when disabled, 'results in the app being broken' (_if_
the app gives any option to disable the flushing at all)?

Providing APIs to flush system caches, sysctl or syscall, is the road to
VM madness. If the goal is to override the pagecache (and other kernel
caches) on a given node then, for God's sake, think a bit harder. E.g.
enable users to specify an 'allocation priority' of some sort, which kicks
out the pagecache on the local node - or something like that. Giving a
half-assed tool to clean out one aspect of the system caches will only
muddy the waters, with no real road back to sanity.

	Ingo
Re: [PATCH/RFC] A method for clearing out page cache
Andrew wrote:
> Yes, I ... [clarifies pj's various confusions]

Yup - all sounds good - thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [PATCH/RFC] A method for clearing out page cache
Andrew Morton wrote:
> Ray Bryant <[EMAIL PROTECTED]> wrote:
>> We did it this way because it was easier to get it into SLES9 that
>> way.  But there is no particular reason that we couldn't use a system
>> call.  It's just that we figured adding system calls is hard.
>
> aarggh.  This is why you should target kernel.org kernels first.  Now
> we risk ending up with poor old suse carrying an obsolete interface and
> application developers having to cater for both interfaces.

I agree, but time-to-market decisions overrode that. Anyway, everyone uses
a program called "bcfree" to actually do the buffer-cache freeing, so
changing the interface is not as bad as all that.

Let us put something together along these lines and we will get back to
you. Thanks,

-- 
Best Regards,
Ray
---
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
[EMAIL PROTECTED]             [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better", so I installed Linux.
---
Re: [PATCH/RFC] A method for clearing out page cache
Ray Bryant <[EMAIL PROTECTED]> wrote:
> Andrew Morton wrote:
>> Martin Hicks <[EMAIL PROTECTED]> wrote:
>>> This patch introduces a new sysctl for NUMA systems that tries to
>>> drop as much of the page cache as possible from a set of nodes.  The
>>> motivation for this patch is for setting up High Performance
>>> Computing jobs, where initial memory placement is very important to
>>> overall performance.
>>
>> - Using a write to /proc for this seems a bit hacky.  Why not simply
>>   add a new system call for it?
>
> We did it this way because it was easier to get it into SLES9 that way.
> But there is no particular reason that we couldn't use a system call.
> It's just that we figured adding system calls is hard.

aarggh.  This is why you should target kernel.org kernels first.  Now we
risk ending up with poor old suse carrying an obsolete interface, and
application developers will have to cater for both interfaces.

>> If it does, then userspace could arrange for that concurrency by
>> starting a number of processes to perform the toss, each with a
>> different nodemask.
>
> That works fine as well if we can get a system call number assigned,
> and it avoids the hackiness of both /proc and the kernel threads.

syscall numbers are per-arch.  We don't need to assign a syscall number
for this one - we can surely have this ready for 2.6.12.  Simply include
i386 and ia64 in the initial patch and other architectures will catch up
pretty quickly.  (It would be nice to generate patches for the arch
maintainers, however).

>> - Dropping "as much pagecache as possible" might be a bit crude.  I
>>   wonder if we should pass in some additional parameter which
>>   specifies how much of the node's pagecache should be removed.
>>
>>   Or, better, specify how much free memory we will actually require on
>>   this node.  The syscall terminates when it determines that enough
>>   pagecache has been removed.
>
> Our thoughts exactly.  This is clearly a "big hammer" and we want to
> make a lighter hammer to free up a certain number of pages.  Indeed, we
> would like to have these calls occur automatically from __alloc_pages()
> when we try to allocate local storage and find that there isn't any.
> For our workloads, we want to free up unmapped, clean pagecache, if
> that is what is keeping us from allocating a local page.  Not all
> workloads want that, however, so we would probably use a sysctl() to
> enable/disable this.
>
> However, the first step is to do this manually from user space.

Yup.  The thing is, lots of people want this feature for various reasons.
Not just numerical-computing-users-on-NUMA.  We should get it right for
them too.

Especially kernel developers, who have various nasty userspace tools
which will manually reclaim pagecache.  But non-kernel-developers will use
it too, when they think the VM is screwing them over ;)

I think Solaris used to have such a tool - /usr/etc/chill, although I
don't know if it had kernel support.

>> - To make the syscall more general, we should be able to reclaim
>>   mapped pagecache and anonymous memory as well.
>>
>> So what it comes down to is
>>
>>	sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
>>
>> where `what_to_free' consists of a bunch of bitflags (unmapped
>> pagecache, mapped pagecache, anonymous memory, slab, ...).
>
> Do we have to implement all of those, or just allow for the possibility
> of that being implemented in the future?  E.g. in our case we'd just
> implement the bit that says "unmapped pagecache".

Well...  please take a look at what's involved.  It should just be a
matter of sprinkling a few tests such as

+	if (sc->mode & SC_RECLAIM_SLAB) {
		...
+	}

into the existing code.  If things turn nasty then we can take another
look at it.
Re: [PATCH/RFC] A method for clearing out page cache
Paul Jackson <[EMAIL PROTECTED]> wrote:
> Andrew wrote:
>>	sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
>> ...
>> - To make the syscall more general, we should be able to reclaim
>>   mapped pagecache and anonymous memory as well.
>
> sys_free_node_memory() - nice.
>
> Does it make sense to also have it be able to free up slab cache,
> calling shrink_slab()?

Yes, I suggested that slab be one of the `what_to_free' flags.  (Some of
this may be tricky to implement.  But a good interface with an
initially-crappy implementation is OK ;)

> Did you mean to pass a nodemask, or a single node id?  Passing a single
> node id is easier - we've shown that it is difficult to pass bitmaps
> across the user/kernel boundary without confusion.  But if only a
> single node id is passed, then you get the thread per node that you
> just argued was sometimes overkill.

I meant a single node ID.  With a bitmap, the kernel needs to futz around
scanning the bitmap, launching kernel threads, etc.  I'm proposing that
there be no kernel threads at all.  If you have four nodes:

	for i in 0 1 2 3
	do
		call-sys_free_node_memory $i -1 -1 &
	done

> I'd prefer the single node id, because it's easier to get right.

yup.
Re: [PATCH/RFC] A method for clearing out page cache
Andrew Morton wrote:
> Martin Hicks <[EMAIL PROTECTED]> wrote:
> >
> > This patch introduces a new sysctl for NUMA systems that tries to drop
> > as much of the page cache as possible from a set of nodes.  The
> > motivation for this patch is for setting up High Performance Computing
> > jobs, where initial memory placement is very important to overall
> > performance.
>
> - Using a write to /proc for this seems a bit hacky.  Why not simply add
>   a new system call for it?

We did it this way because it was easier to get it into SLES9 that way.
But there is no particular reason that we couldn't use a system call.
It's just that we figured adding system calls is hard.

> - Starting a kernel thread for each node might be overkill.  Yes, it
>   would take longer if one process was to do all the work, but does this
>   operation need to be very fast?

It is possible that this call might need to be executed at the start of
each batch job in the system.  The reason for using a kernel thread was
that there was no good way to start concurrency due to a write to /proc.

>   If it does, then userspace could arrange for that concurrency by
>   starting a number of processes to perform the toss, each with a
>   different nodemask.

That works fine as well if we can get a system call number assigned and
avoids the hackiness of both /proc and the kernel threads.

> - Dropping "as much pagecache as possible" might be a bit crude.  I
>   wonder if we should pass in some additional parameter which specifies
>   how much of the node's pagecache should be removed.  Or, better,
>   specify how much free memory we will actually require on this node.
>   The syscall terminates when it determines that enough pagecache has
>   been removed.

Our thoughts exactly.  This is clearly a "big hammer" and we want to
make a lighter hammer to free up a certain number of pages.  Indeed,
we would like to have these calls occur automatically from __alloc_pages()
when we try to allocate local storage and find that there isn't any.

For our workloads, we want to free up unmapped, clean pagecache, if that
is what is keeping us from allocating a local page.  Not all workloads
want that, however, so we would probably use a sysctl() to enable/disable
this.

However, the first step is to do this manually from user space.

> - To make the syscall more general, we should be able to reclaim mapped
>   pagecache and anonymous memory as well.
>
> So what it comes down to is
>
> 	sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
>
> where `what_to_free' consists of a bunch of bitflags (unmapped pagecache,
> mapped pagecache, anonymous memory, slab, ...).

Do we have to implement all of those or just allow for the possibility of
that being implemented in the future?  E. g. in our case we'd just
implement the bit that says "unmapped pagecache".

-- 
Best Regards,
Ray
---
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
[EMAIL PROTECTED]             [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better", so I installed Linux.
---
Re: [PATCH/RFC] A method for clearing out page cache
Andrew wrote:
> sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
> ...
> - To make the syscall more general, we should be able to reclaim mapped
>   pagecache and anonymous memory as well.

sys_free_node_memory() - nice.

Does it make sense to also have it be able to free up slab cache,
calling shrink_slab()?

Did you mean to pass a nodemask, or a single node id?  Passing a single
node id is easier - we've shown that it is difficult to pass bitmaps
across the user/kernel boundary without confusions.  But if only a
single node id is passed, then you get the thread per node that you just
argued was sometimes overkill.

I'd prefer the single node id, because it's easier to get right.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [PATCH/RFC] A method for clearing out page cache
On Mon, 21 Feb 2005 14:27:21 -0500, Martin Hicks <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I've made a bunch of changes that Paul suggested.  I've also responded
> to his concerns further down.  Paul correctly pointed out that this
> patch uses some helper functions that are part of the cpusets patch.  I
> should have mentioned this before.

<snip>

> This patch introduces a new sysctl for NUMA systems that tries to drop
> as much of the page cache as possible from a set of nodes.  The
> motivation for this patch is for setting up High Performance Computing
> jobs, where initial memory placement is very important to overall
> performance.

<snip>

> +	/* wait for the kernel threads to complete */
> +	while (atomic_read(&num_toss_threads_active) > 0) {
> +		__set_current_state(TASK_INTERRUPTIBLE);
> +		schedule_timeout(10);
> +	}

<snip>

Would it be possible to use msleep_interruptible() here?  Or is it a
strict check every 10 ticks, regardless of HZ?  Could a comment be
inserted indicating which is the case?

Thanks,
Nish
Re: [PATCH/RFC] A method for clearing out page cache
Martin Hicks <[EMAIL PROTECTED]> wrote:
>
> This patch introduces a new sysctl for NUMA systems that tries to drop
> as much of the page cache as possible from a set of nodes.  The
> motivation for this patch is for setting up High Performance Computing
> jobs, where initial memory placement is very important to overall
> performance.

- Using a write to /proc for this seems a bit hacky.  Why not simply add
  a new system call for it?

- Starting a kernel thread for each node might be overkill.  Yes, it
  would take longer if one process was to do all the work, but does this
  operation need to be very fast?

  If it does, then userspace could arrange for that concurrency by
  starting a number of processes to perform the toss, each with a
  different nodemask.

- Dropping "as much pagecache as possible" might be a bit crude.  I
  wonder if we should pass in some additional parameter which specifies
  how much of the node's pagecache should be removed.  Or, better,
  specify how much free memory we will actually require on this node.
  The syscall terminates when it determines that enough pagecache has
  been removed.

- To make the syscall more general, we should be able to reclaim mapped
  pagecache and anonymous memory as well.

So what it comes down to is

	sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)

where `what_to_free' consists of a bunch of bitflags (unmapped pagecache,
mapped pagecache, anonymous memory, slab, ...).
Re: [PATCH/RFC] A method for clearing out page cache
Hi,

I've made a bunch of changes that Paul suggested.  I've also responded
to his concerns further down.  Paul correctly pointed out that this
patch uses some helper functions that are part of the cpusets patch.  I
should have mentioned this before.

The major changes are:

- Cleanup proc_dobitmask_list() a bit more, including adding bounds
  checking on *lenp.
- An important bugfix in vmscan.c around line 390.  Go to the
  keep_locked label, not the "keep" label.
- Add locking in proc_do_toss_page_cache_nodes() to protect the global
  nodemask_t from getting corrupted.
- Change a few functions to "static"
- Paul Jackson's suggested changes to greatly simplify
  proc_do_toss_page_cache_nodes()

The patch is inlined at the end of the mail.

On Mon, Feb 14, 2005 at 07:37:04PM -0800, Paul Jackson wrote:
>
> 1) A couple of kmalloc's are done using lengths that
>    so far as I could tell, came straight from user land.

Okay, I've stuck in maximums that are based on MAX_NUMNODES.

> 2) Beware that this patch depends on the cpuset patch:
>    new-bitmap-list-format-for-cpusets.patch
>    which is still in *-mm only, for the routines
>    bitmap_scnlistprintf/bitmap_parselist.

Thanks.  I hadn't realized that.

> 3) Should the maxlen of a nodemask for the sysctl
>    handler for proc_do_toss_page_cache_nodes be the byte
>    length of the kernels internal binary nodemask, or

It is the byte length of the kernel's bitmask struct.

> 5) The requirement to read the string in one read(2) syscall
>    seemed like it might be draconian.  If the available

But that's the way the rest of the sysctl read functions work.  There's
no safe way that I can see to ensure that the data doesn't change in
between two consecutive read calls.

> 9) Comment - dont we need to protect the kernel global variable
>    toss_page_cache_nodes from simulaneous access by two tasks?

yes, I protected this with a semaphore.

mh

-- 
Martin Hicks          Wild Open Source Inc.
[EMAIL PROTECTED]      613-266-2296


This patch introduces a new sysctl for NUMA systems that tries to drop
as much of the page cache as possible from a set of nodes.  The
motivation for this patch is for setting up High Performance Computing
jobs, where initial memory placement is very important to overall
performance.

Signed-off-by: Martin Hicks <[EMAIL PROTECTED]>
Signed-off-by: Ray Bryant <[EMAIL PROTECTED]>

[EMAIL PROTECTED] patches]$ diffstat toss_page_cache_nodes_v2.patch
 include/linux/sysctl.h |    3 +
 kernel/sysctl.c        |   95
 mm/vmscan.c            |  105 -
 3 files changed, 201 insertions(+), 2 deletions(-)

Index: linux-2.6.10/include/linux/sysctl.h
===
--- linux-2.6.10.orig/include/linux/sysctl.h	2005-02-16 12:43:19.0 -0800
+++ linux-2.6.10/include/linux/sysctl.h	2005-02-19 10:36:41.0 -0800
@@ -170,6 +170,7 @@
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_TOSS_PAGE_CACHE_NODES=29, /* nodemask_t: nodes to free page cache on */
 };

@@ -803,6 +804,8 @@
 				     void __user *, size_t *, loff_t *);
 extern int proc_doulongvec_ms_jiffies_minmax(ctl_table *table, int, struct file *,
 				     void __user *, size_t *, loff_t *);
+extern int proc_dobitmap_list(ctl_table *table, int, struct file *,
+			      void __user *, size_t *, loff_t *);

 extern int do_sysctl (int __user *name, int nlen,
 		      void __user *oldval, size_t __user *oldlenp,
Index: linux-2.6.10/kernel/sysctl.c
===
--- linux-2.6.10.orig/kernel/sysctl.c	2005-02-16 12:43:19.0 -0800
+++ linux-2.6.10/kernel/sysctl.c	2005-02-21 10:49:18.0 -0800
@@ -41,6 +41,8 @@
 #include <linux/limits.h>
 #include <linux/dcache.h>
 #include <linux/syscalls.h>
+#include <linux/bitmap.h>
+#include <linux/nodemask.h>

 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -72,6 +74,12 @@
 			     void __user *, size_t *, loff_t *);
 #endif

+#ifdef CONFIG_NUMA
+extern nodemask_t toss_page_cache_nodes;
+extern int proc_do_toss_page_cache_nodes(ctl_table *, int, struct file *,
+					 void __user *, size_t *, loff_t *);
+#endif
+
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
 static int minolduid;
@@ -836,6 +844,16 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+#ifdef CONFIG_NUMA
+	{
+		.ctl_name	= VM_TOSS_PAGE_CACHE_NODES,
+		.procname	= "toss_page_cache_nodes",
+		.data		= &toss_page_cache_nodes,
+		.maxlen		=
Re: [PATCH/RFC] A method for clearing out page cache
Ray Bryant <[EMAIL PROTECTED]> wrote:
> Andrew Morton wrote:
> > Martin Hicks <[EMAIL PROTECTED]> wrote:
> > > This patch introduces a new sysctl for NUMA systems that tries to drop
> > > as much of the page cache as possible from a set of nodes.  The
> > > motivation for this patch is for setting up High Performance Computing
> > > jobs, where initial memory placement is very important to overall
> > > performance.
> >
> > - Using a write to /proc for this seems a bit hacky.  Why not simply add
> >   a new system call for it?
>
> We did it this way because it was easier to get it into SLES9 that way.
> But there is no particular reason that we couldn't use a system call.
> It's just that we figured adding system calls is hard.

aarggh.  This is why you should target kernel.org kernels first.  Now we
risk ending up with poor old suse carrying an obsolete interface and
application developers have to be able to cater for both interfaces.

> > If it does, then userspace could arrange for that concurrency by
> > starting a number of processes to perform the toss, each with a
> > different nodemask.
>
> That works fine as well if we can get a system call number assigned and
> avoids the hackiness of both /proc and the kernel threads.

syscall numbers are per-arch.  We don't need to assign a syscall number
for this one - we can surely have this ready for 2.6.12.  Simply include
i386 and ia64 in the initial patch and other architectures will catch up
pretty quickly.  (It would be nice to generate patches for the arch
maintainers, however).

> > - Dropping "as much pagecache as possible" might be a bit crude.  I
> >   wonder if we should pass in some additional parameter which specifies
> >   how much of the node's pagecache should be removed.  Or, better,
> >   specify how much free memory we will actually require on this node.
> >   The syscall terminates when it determines that enough pagecache has
> >   been removed.
>
> Our thoughts exactly.  This is clearly a "big hammer" and we want to
> make a lighter hammer to free up a certain number of pages.
>
> Indeed, we would like to have these calls occur automatically from
> __alloc_pages() when we try to allocate local storage and find that
> there isn't any.
>
> For our workloads, we want to free up unmapped, clean pagecache, if that
> is what is keeping us from allocating a local page.  Not all workloads
> want that, however, so we would probably use a sysctl() to enable/disable
> this.
>
> However, the first step is to do this manually from user space.

Yup.  The thing is, lots of people want this feature for various
reasons.  Not just numerical-computing-users-on-NUMA.  We should get it
right for them too.

Especially kernel developers, who have various nasty userspace tools
which will manually reclaim pagecache.  But non-kernel-developers will
use it too, when they think the VM is screwing them over ;)

I think Solaris used to have such a tool - /usr/etc/chill, although I
don't know if it had kernel support.

> > - To make the syscall more general, we should be able to reclaim mapped
> >   pagecache and anonymous memory as well.
> >
> > So what it comes down to is
> >
> > 	sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
> >
> > where `what_to_free' consists of a bunch of bitflags (unmapped pagecache,
> > mapped pagecache, anonymous memory, slab, ...).
>
> Do we have to implement all of those or just allow for the possibility
> of that being implemented in the future?  E. g. in our case we'd just
> implement the bit that says "unmapped pagecache".

Well... please take a look at what's involved.  It should just be a
matter of sprinkling a few tests such as

+	if (sc->mode & SC_RECLAIM_SLAB) {
	...
+	}

into the existing code.  If things turn nasty then we can take another
look at it.
Re: [PATCH/RFC] A method for clearing out page cache
Andrew Morton wrote:
> Ray Bryant <[EMAIL PROTECTED]> wrote:
> > We did it this way because it was easier to get it into SLES9 that way.
> > But there is no particular reason that we couldn't use a system call.
> > It's just that we figured adding system calls is hard.
>
> aarggh.  This is why you should target kernel.org kernels first.  Now we
> risk ending up with poor old suse carrying an obsolete interface and
> application developers have to be able to cater for both interfaces.

I agree, but time-to-market decisions overrode that.

Anyway, everyone uses a program called bcfree to actually do the
buffer-cache freeing, so changing the interface is not as bad as all
that.

Let us put something together along these lines and we will get back to
you.

Thanks,

-- 
Best Regards,
Ray
---
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
[EMAIL PROTECTED]             [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better", so I installed Linux.
---
Re: [PATCH/RFC] A method for clearing out page cache
Andrew wrote:
> Yes, I ... [clarifies pj's various confusions]

Yup - all sounds good - thanks.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: [PATCH/RFC] A method for clearing out page cache
* Andrew Morton <[EMAIL PROTECTED]> wrote:

> > However, the first step is to do this manually from user space.
>
> Yup.  The thing is, lots of people want this feature for various
> reasons.  Not just numerical-computing-users-on-NUMA.  We should get it
> right for them too.
>
> Especially kernel developers, who have various nasty userspace tools
> which will manually reclaim pagecache.  But non-kernel-developers will
> use it too, when they think the VM is screwing them over ;)

app designers very frequently think that the VM gets its act wrong (most
of the time for the wrong reasons), and the last thing we want to enable
them is to hack real problems around.  How are we supposed to debug VM
problems where one player periodically flushes the whole pagecache?  If
that flushing, when disabled, 'results in the app being broken' (_if_
the app gives any option to disable the flushing).

Providing APIs to flush system caches, sysctl or syscall, is the road to
VM madness.  If the goal is to override the pagecache (and other kernel
caches) on a given node then for God's sake, think a bit harder.  E.g.
enable users to specify an 'allocation priority' of some sort, which
kicks out the pagecache on the local node - or something like that.

Giving a half-assed tool to clean out one aspect of the system caches
will only muddy the waters, with no real road back to sanity.

	Ingo
Re: [PATCH/RFC] A method for clearing out page cache
On Mon, Feb 14, 2005 at 07:37:04PM -0800, Paul Jackson wrote:
> Questions concerning this page cache patch that Martin submitted,
> as a merge of something originally written by Ray Bryant.
>
> The following patch is not really a patch. It is a few questions, a
> couple minor space tweaks, and a never compiled nor tested rewrite of
> proc_do_toss_page_cache_nodes() to try to make it look a little
> prettier.

Thanks for the review, Paul. I'll take a harder look at your feedback
and reply.

--
Martin Hicks || Silicon Graphics Inc. || [EMAIL PROTECTED]
Re: [PATCH/RFC] A method for clearing out page cache
Questions concerning this page cache patch that Martin submitted, as a
merge of something originally written by Ray Bryant.

The following patch is not really a patch. It is a few questions, a
couple of minor space tweaks, and a never compiled nor tested rewrite of
proc_do_toss_page_cache_nodes() to try to make it look a little
prettier.

Some of the issues are cosmetic, but some I suspect warrant a competent
response by Martin or Ray before this goes into *-mm, such as some
questions as to whether locking is adequate, or whether a kmalloc() size
might be forced huge by the user. And my suggested rewrite changes the
kernel API in one error case, so better to decide that matter before it
is too widely used.

Specifically:

 1) A couple of kmalloc's are done using lengths that, so far as I could
    tell, came straight from user land. Never let the user size a kernel
    malloc without limit, as it makes it way too easy to ask for
    something huge and give the kernel indigestion. If the lengths in
    question are actually limited, then never mind (or add a terse
    one-line comment, for worry warts such as myself).

 2) Beware that this patch depends on the cpuset patch
    new-bitmap-list-format-for-cpusets.patch, which is still in *-mm
    only, for the routines bitmap_scnlistprintf/bitmap_parselist.

 3) Should the maxlen of a nodemask for the sysctl handler
    proc_do_toss_page_cache_nodes be the byte length of the kernel's
    internal binary nodemask, or a reasonable upper bound on the max
    length of the ascii representation thereof, which is about the value
    100 + 6 * MAX_NUMNODES when using the
    bitmap_scnlistprintf/bitmap_parselist format?

 4) A couple of existing blank lines were nuked by this patch - I
    restored them. I thought them to be nice blank lines ;).

 5) The requirement to read the string in one read(2) syscall seemed
    like it might be draconian. If the available apparatus supports it,
    better to allocate the ascii buffer on the open for read, let the
    reads (and seeks) feast on that buffer, using f_pos as it should be
    used, and free the buffer on the close. Mind you, I have no idea
    whether the sysctl.c apparatus conveniently supports this.

 6) The kernel header bitops.h is no longer needed by sysctl.c,
    following my (uncompiled, untested) rewrite.

 7) Instead of two counters to track how many threads remained to be
    waited for, toss_done and nodes_to_toss, my rewrite has just one:
    num_toss_threads_active. It bumps that value once for each kthread
    it starts, decrements it as each thread finishes, and waits for it
    to get back to zero in the loop.

 8) Several changes in the rewrite of proc_do_toss_page_cache_nodes():
    - rename 'retval' to 'ret' (more common, shorter)
    - nuke the bitmap and use nodemask routines
    - don't error if some nodes are offline (the general idea is to
      either do something useful and claim success, or do nothing at
      all and complain of error, but don't both do something useful
      and complain)
    - convert to a single return, at the bottom of the function
    - XXX Comment: doesn't this code require locking node_online_map?
    - remove unused 'started'
    - remove no longer used 'i'
    - remove no longer used 'errors'
    - replace a 3-line bitop for loop with a one-line for_each_node_mask
    - replace 15 lines of 'validity checking' with a one-line check for
      the node being online

 9) Comment - don't we need to protect the kernel global variable
    toss_page_cache_nodes from simultaneous access by two tasks?
Index: 2.6.11-rc4/include/linux/sysctl.h
===================================================================
--- 2.6.11-rc4.orig/include/linux/sysctl.h	2005-02-14 18:26:28.000000000 -0800
+++ 2.6.11-rc4/include/linux/sysctl.h	2005-02-14 18:27:31.000000000 -0800
@@ -803,6 +803,7 @@ extern int proc_doulongvec_ms_jiffies_mi
 		      struct file *, void __user *, size_t *, loff_t *);
 extern int proc_dobitmap_list(ctl_table *table, int, struct file *,
 		      void __user *, size_t *, loff_t *);
+
 extern int do_sysctl (int __user *name, int nlen, void __user *oldval,
 		      size_t __user *oldlenp, void __user *newval, size_t newlen);
Index: 2.6.11-rc4/kernel/sysctl.c
===================================================================
--- 2.6.11-rc4.orig/kernel/sysctl.c	2005-02-14 18:26:28.000000000 -0800
+++ 2.6.11-rc4/kernel/sysctl.c	2005-02-14 18:27:46.000000000 -0800
@@ -42,7 +42,6 @@
 #include <linux/dcache.h>
 #include <linux/syscalls.h>
 #include <linux/bitmap.h>
-#include <linux/bitops.h>
 #include <linux/nodemask.h>

 #include <asm/uaccess.h>
@@ -839,6 +838,8 @@ static ctl_table vm_table[] = {
 		.ctl_name	= VM_TOSS_PAGE_CACHE_NODES,
 		.procname	= "toss_page_cache_nodes",
 		.data		= &toss_page_cache_nodes,
+/* XXX
[PATCH/RFC] A method for clearing out page cache
Hi,

This patch introduces a new sysctl for NUMA systems that tries to drop
as much of the page cache as possible from a set of nodes. The
motivation for this patch is for setting up High Performance Computing
jobs, where initial memory placement is very important to overall
performance.

Currently, if a job is started and there is page cache lying around on a
particular node, then allocations will spill onto remote nodes and page
cache won't be reclaimed until the whole system is short on memory. This
can result in a significant performance hit for HPC applications that
planned on that memory being allocated locally.

This patch is intended to be used to clean out the entire page cache
before starting a new job. Ideally, we would like to clear only as much
page cache as is required to avoid non-local memory allocation. Patches
to do this can be built on top of this patch, so this patch should be
regarded as the first step in that direction. The long term goal is to
have some mechanism that would better control the page cache (and other
memory caches) for machines that put a higher priority on memory
placement than on maintaining big caches.

It allows you to clear page cache on nodes in the following manner:

	echo 1,3,9-12 > /proc/sys/vm/toss_page_cache_nodes

The patch was written by Ray Bryant <[EMAIL PROTECTED]> and forward
ported by me, Martin Hicks <[EMAIL PROTECTED]>, to 2.6.11-rc3-mm2.

Could we get this included in -mm, Andrew?

mh

--
Martin Hicks || Wild Open Source Inc. || [EMAIL PROTECTED]
613-266-2296


This patch introduces a new sysctl for NUMA systems that tries to drop
as much of the page cache as possible from a set of nodes. The
motivation for this patch is for setting up High Performance Computing
jobs, where initial memory placement is very important to overall
performance.
It allows you to clear page cache on nodes in the following manner:

	echo 1,3,9-12 > /proc/sys/vm/toss_page_cache_nodes

Signed-off-by: Martin Hicks <[EMAIL PROTECTED]>
Signed-off-by: Ray Bryant <[EMAIL PROTECTED]>

[EMAIL PROTECTED] patches]$ diffstat toss_page_cache_nodes.patch
 include/linux/sysctl.h |    4 +
 kernel/sysctl.c        |   82 +++++++++++++
 mm/vmscan.c            |  128 +++++++++++++++++++-
 3 files changed, 211 insertions(+), 3 deletions(-)

Index: linux-2.6.10/include/linux/sysctl.h
===================================================================
--- linux-2.6.10.orig/include/linux/sysctl.h	2005-02-11 10:54:13.000000000 -0800
+++ linux-2.6.10/include/linux/sysctl.h	2005-02-11 10:54:14.000000000 -0800
@@ -170,6 +170,7 @@
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_TOSS_PAGE_CACHE_NODES=29, /* nodemask_t: nodes to free page cache on */
 };

@@ -803,7 +804,8 @@
 		      void __user *, size_t *, loff_t *);
 extern int proc_doulongvec_ms_jiffies_minmax(ctl_table *table, int,
 		      struct file *, void __user *, size_t *, loff_t *);
-
+extern int proc_dobitmap_list(ctl_table *table, int, struct file *,
+		      void __user *, size_t *, loff_t *);
 extern int do_sysctl (int __user *name, int nlen, void __user *oldval,
 		      size_t __user *oldlenp, void __user *newval, size_t newlen);

Index: linux-2.6.10/kernel/sysctl.c
===================================================================
--- linux-2.6.10.orig/kernel/sysctl.c	2005-02-11 10:54:14.000000000 -0800
+++ linux-2.6.10/kernel/sysctl.c	2005-02-11 10:54:14.000000000 -0800
@@ -41,6 +41,9 @@
 #include <linux/limits.h>
 #include <linux/dcache.h>
 #include <linux/syscalls.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/nodemask.h>

 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -72,6 +75,12 @@
 		      void __user *, size_t *, loff_t *);
 #endif

+#ifdef CONFIG_NUMA
+extern nodemask_t toss_page_cache_nodes;
+extern int proc_do_toss_page_cache_nodes(ctl_table *, int, struct file *,
+		     void __user *, size_t *, loff_t *);
+#endif
+
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
 static int minolduid;
@@ -836,6 +845,16 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+#ifdef CONFIG_NUMA
+	{
+		.ctl_name	= VM_TOSS_PAGE_CACHE_NODES,
+		.procname	= "toss_page_cache_nodes",
+		.data		= &toss_page_cache_nodes,
+		.maxlen		= sizeof(nodemask_t),
+		.mode		= 0644,
+		.proc_handler	= &proc_do_toss_page_cache_nodes,
+	},
+#endif
 	{ .ctl_name = 0 }
 };

@@ -2071,6 +2090,68 @@
 		do_proc_dointvec_userhz_jiffies_conv, NULL);
 }

+/**
+ * proc_dobitmap_list -- read/write a