Re: userspace pagecache management tool

2007-03-08 Thread Andrew Morton

> On Thu, 08 Mar 2007 13:29:02 +0530 Vaidyanathan Srinivasan
> <[EMAIL PROTECTED]> wrote:
> > That all sounds reasonably doable.  It'd be pretty complex to do it
> > in-kernel but we could do it there too.  Problem is of course that the
> > above strategy is explicitly optimised for the backup program and if it's
> > in-kernel it becomes applicable to all other workloads.
> 
> This strategy looks very good.  However we are not considering the
> performance impact on the 'backup' application as such.  Removing
> pagecache pages brought in by the application without knowledge of
> the application's usage and behavior may severely affect its performance.
> 
> Certainly we are interested in improving system performance at the
> cost of certain applications, but not to such an extent that the backup
> process will drag on for an unreasonable amount of time.
> 
> Also backup processes may consist of a group of applications working
> on the same stream of data, such as a compression program, an encryption
> program, etc., which could be independent applications.

Well yes, if the application is that funky then suitably funky userspace
tricks will be needed to avoid hurting it.

> We should consider having a limit on pagecache usage rather than
> denying any space in the pagecache for these applications.

That's what containerisation is for:

run-in-container --memory=16M /bin/backup-program

This can be done today with x86_64 fake-numa, controlled by cpusets.  One
day, when we get our containerisation story sorted out, things will be more
convenient...

> Can fadvise() be enhanced to have a limit on pagecache usage and
> reclaim used pages in LRU order?  This way data stays for a little
> while for other applications to pick up from pagecache.
> 
> Pages already in memory or brought in by other applications need not
> be placed in this list and hence we prevent any collateral pageouts.

We could teach the presently-unimplemented POSIX_FADV_NOREUSE to dump this
file's pages at the tail of the inactive list (after cleaning them if
needed).  That way, they're the first to get reclaimed.

The standard says "Specifies that the application expects to access the
specified data once and then not reuse it thereafter." That's a bit
ambiguous: is it before the process accessed the data, or after?  Before, I
suspect.
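
For reference, a minimal sketch of how an application would issue the hint
under the "before" reading of the standard. read_once() is a hypothetical
helper, not part of the tool discussed in this thread. On a kernel where
POSIX_FADV_NOREUSE is unimplemented, as described above, the call should
simply have no effect.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes posix_fadvise() under strict -std= modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: read a file once, front to back, after telling
 * the kernel that the data will not be reused.
 * Returns the number of bytes read, or -1 on error.
 */
long read_once(const char *path)
{
    char buf[65536];
    long total = 0;
    ssize_t n;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* offset 0 with len 0 covers the whole file.  This is only advice,
     * so a failure here is deliberately ignored. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;

    close(fd);
    return n < 0 ? -1 : total;
}
```

The hint is issued once, before the first read(). Issuing it afterwards
would correspond to the other reading of the standard.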

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: userspace pagecache management tool

2007-03-08 Thread Vaidyanathan Srinivasan



Andrew Morton wrote:
> On Sun, 4 Mar 2007 00:01:55 +0100 bert hubert <[EMAIL PROTECTED]> wrote:
> 
>> On Sat, Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote:
>>>>> It is *not* a global instruction.  It uses setenv, so the user's policy
>>>>> affects only the target process and its forked children.
>>>> ... and all other processes accessing the same file(s)!
>>>>
>>>> Your library and the system calls may be limited to one process,
>>>> but the consequences are global.
>>> Yes.  So what?  If the user wants to go and evict libc.so from pagecache
>>> then he can do so - the kernel has provided syscalls with which this can be
>>> done for at least seven years.  Bad user, shouldn't do that.
>> While I agree with your sentiments that userspace can have a good idea on
>> how to deal with the page cache, your program does more than it claims to
>> do - because of how linux implements posix_fadvise.
>>
>> I don't think anybody expects or desires your program to actually *evict*
>> the stuff from the cache you are trying to access, which happens in case the
>> data was in the cache prior to starting your program.
>>
>> What people expect is that a solution such as you wrote it simply won't
>> *add* anything to the cache. They don't expect it will actually globally
>> *remove* stuff from the cache.
>>
>> Making a backup this way would hurt even worse than usual with your
>> pagecache management tool if the file being backed up was still being read.
>>
>> This is not your fault, but in practice, it makes your program less useful
>> than it could be.
> 
> yup.  As I said, it's a proof-of-concept.  It's a project.  And I have about
> one free femtosecond per fortnight :(
> 
>> One could conceivably fix that up using mincore and simply not fadvise if a
>> page was in core already.
> 
> Yes.  Let's flesh out the backup program policy some more:
> 
> - Unconditionally invalidate output files
> 
> - on entry to read(), probe pagecache, record which pages in the range are
>   present
> 
> - on entry to next read(), shoot down those pages from the previous read
>   which weren't in pagecache.
> 
> - But we can do better!  LRU the file's pages up to a certain number of pages.
> 
> - Once that point is exceeded, we need to reclaim some pages.  Which
>   ones?  Well, we've been observing all reads, so we can record which pages
>   were referenced once, and which ones were referenced multiple times so we
>   can do arbitrarily complex page aging in there.
> 
> - On close(), nuke all pages which weren't in core during open(), even if
>   this app referenced them multiple times.
> 
> - If the backup program decided to read its input files with mmap we're
>   rather screwed.  We can't intercept pagefaults so the best we can do is
>   to restore the file's pagecache to its previous state on close().
> 
>   Or if it's really a problem, get control in there somehow and
>   periodically poll the pagecache occupancy via mincore(), use madvise()
>   then fadvise() to trim it back.
> 
> That all sounds reasonably doable.  It'd be pretty complex to do it
> in-kernel but we could do it there too.  Problem is of course that the
> above strategy is explicitly optimised for the backup program and if it's
> in-kernel it becomes applicable to all other workloads.
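
The first steps of the quoted policy (probe residency, then shoot down only
the pages that were not already cached) can be sketched roughly as follows.
This is an illustration under assumptions, not the tool's code:
probe_resident() and drop_new_pages() are hypothetical names, the probe maps
the file PROT_NONE purely so that mincore() has an address range to inspect,
and pages are dropped one at a time for clarity rather than speed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes mincore() and posix_fadvise() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Record which pages of the first `len` bytes of fd are in pagecache.
 * Returns a malloc'd vector with one byte per page (bit 0 set means
 * resident) and stores the page count in *npages; NULL on failure.
 * Note: some newer kernels limit what mincore() reveals for files the
 * caller does not own or cannot write.
 */
unsigned char *probe_resident(int fd, size_t len, size_t *npages)
{
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    size_t n = (len + psz - 1) / psz;
    unsigned char *vec;
    void *map;

    assert(npages != NULL);
    if (len == 0)
        return NULL;
    /* PROT_NONE: we never touch the data, so nothing is faulted in. */
    map = mmap(NULL, len, PROT_NONE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return NULL;
    vec = malloc(n);
    if (vec != NULL && mincore(map, len, vec) != 0) {
        free(vec);
        vec = NULL;
    }
    munmap(map, len);
    if (vec != NULL)
        *npages = n;
    return vec;
}

/*
 * Advise away every page that was NOT resident when probed, leaving
 * pages that somebody else had already cached alone.
 * Returns the number of pages for which the advice was accepted.
 */
size_t drop_new_pages(int fd, const unsigned char *vec, size_t npages)
{
    off_t psz = (off_t)sysconf(_SC_PAGESIZE);
    size_t i, dropped = 0;

    for (i = 0; i < npages; i++) {
        if (vec[i] & 1)
            continue;   /* was in pagecache before we read it */
        if (posix_fadvise(fd, (off_t)i * psz, psz, POSIX_FADV_DONTNEED) == 0)
            dropped++;
    }
    return dropped;
}
```

A wrapper would call probe_resident() on entry to read() and
drop_new_pages() on entry to the next one. The LRU and page-aging steps
from the list are not covered by this sketch.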

This strategy looks very good.  However we are not considering the
performance impact on the 'backup' application as such.  Removing
pagecache pages brought in by the application without knowledge of
the application's usage and behavior may severely affect its performance.

Certainly we are interested in improving system performance at the
cost of certain applications, but not to such an extent that the backup
process will drag on for an unreasonable amount of time.

Also backup processes may consist of a group of applications working
on the same stream of data, such as a compression program, an encryption
program, etc., which could be independent applications.

We should consider having a limit on pagecache usage rather than
denying any space in the pagecache for these applications.

Can fadvise() be enhanced to have a limit on pagecache usage and
reclaim used pages in LRU order?  This way data stays for a little
while for other applications to pick up from pagecache.

Pages already in memory or brought in by other applications need not
be placed in this list and hence we prevent any collateral pageouts.



Re: userspace pagecache management tool

2007-03-07 Thread Andrew Morton

On Wed, 07 Mar 2007 11:39:02 + Pádraig Brady <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Tue, 06 Mar 2007 12:10:49 +
> > Pádraig Brady <[EMAIL PROTECTED]> wrote:
> >> Perhaps one could possibly just evict pages with _mapcount==0 ?
> > 
> > That is the present fadvise(FADV_DONTNEED) behaviour.
> 
> Ah right. It doesn't invalidate page_mapped() pages.

yup

> If that means it doesn't invalidate pages previously cached
> by other processes, then great.

It will do that.  This is why I point out that this userspace tool
could (easily) be enhanced to not invalidate pages which were
in pagecache prior to their being read by the managed application.

Re: userspace pagecache management tool

2007-03-07 Thread Pádraig Brady

Andrew Morton wrote:
> On Tue, 06 Mar 2007 12:10:49 +
> Pádraig Brady <[EMAIL PROTECTED]> wrote:
>> Perhaps one could possibly just evict pages with _mapcount==0 ?
> 
> That is the present fadvise(FADV_DONTNEED) behaviour.

Ah right. It doesn't invalidate page_mapped() pages.
If that means it doesn't invalidate pages previously cached
by other processes, then great.

However, I think what I meant was that fadvise(FADV_DONTNEED)
should only invalidate pages where page_count()<=1

From include/linux/mm.h:

" For pages belonging to inodes, the page_count() is the number of
  attaches, plus 1 if `private' contains something, plus one for
  the page cache itself."

cheers,
Pádraig.


Re: userspace pagecache management tool

2007-03-06 Thread Rik van Riel


Andrew Morton wrote:
> On Tue, 06 Mar 2007 12:10:49 +
> Pádraig Brady <[EMAIL PROTECTED]> wrote:
>
>> If I'm the target
>> audience for that API then it's broken as I'd mess it up,
>> or would take too long to get it right.
>>
>> Can't we just fix the posix_fadvise() implementation to
>> only evict pages paged in by the current process.
>
> The kernel doesn't have that information.


It doesn't _keep_ the information.  File readahead is done
in the process context, so we had it originally.

I agree though that we probably should not bother trying
to keep that kind of information :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-06 Thread Andrew Morton

On Tue, 06 Mar 2007 12:10:49 +
Pádraig Brady <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > Yes.  Let's flesh out the backup program policy some more:
> > 
> > - Unconditionally invalidate output files
> > 
> > - on entry to read(), probe pagecache, record which pages in the range are
> >   present
> > 
> > - on entry to next read(), shoot down those pages from the previous read
> >   which weren't in pagecache.
> > 
> > - But we can do better!  LRU the file's pages up to a certain number of
> >   pages.
> > 
> > - Once that point is exceeded, we need to reclaim some pages.  Which
> >   ones?  Well, we've been observing all reads, so we can record which pages
> >   were referenced once, and which ones were referenced multiple times so we
> >   can do arbitrarily complex page aging in there.
> > 
> > - On close(), nuke all pages which weren't in core during open(), even if
> >   this app referenced them multiple times.
> > 
> > - If the backup program decided to read its input files with mmap we're
> >   rather screwed.  We can't intercept pagefaults so the best we can do is
> >   to restore the file's pagecache to its previous state on close().
> > 
> >   Or if it's really a problem, get control in there somehow and
> >   periodically poll the pagecache occupancy via mincore(), use madvise()
> >   then fadvise() to trim it back.
> > 
> > That all sounds reasonably doable.  It'd be pretty complex to do it
> > in-kernel but we could do it there too.  Problem is of course that the
> > above strategy is explicitly optimised for the backup program and if it's
> > in-kernel it becomes applicable to all other workloads.
> 
> I can see the above being possible, but I can't see the reason
> for exposing that complexity to userspace.

That's sophistication, not complexity.  It doesn't have to do all that stuff
to be effective.

> If I'm the target
> audience for that API then it's broken as I'd mess it up,
> or would take too long to get it right.
> 
> Can't we just fix the posix_fadvise() implementation to
> only evict pages paged in by the current process?

The kernel doesn't have that information.

> Perhaps one could possibly just evict pages with _mapcount==0 ?

That is the present fadvise(FADV_DONTNEED) behaviour.

Re: userspace pagecache management tool

2007-03-06 Thread Pádraig Brady

Andrew Morton wrote:
> Yes.  Let's flesh out the backup program policy some more:
> 
> - Unconditionally invalidate output files
> 
> - on entry to read(), probe pagecache, record which pages in the range are
>   present
> 
> - on entry to next read(), shoot down those pages from the previous read
>   which weren't in pagecache.
> 
> - But we can do better!  LRU the file's pages up to a certain number of pages.
> 
> - Once that point is exceeded, we need to reclaim some pages.  Which
>   ones?  Well, we've been observing all reads, so we can record which pages
>   were referenced once, and which ones were referenced multiple times so we
>   can do arbitrarily complex page aging in there.
> 
> - On close(), nuke all pages which weren't in core during open(), even if
>   this app referenced them multiple times.
> 
> - If the backup program decided to read its input files with mmap we're
>   rather screwed.  We can't intercept pagefaults so the best we can do is
>   to restore the file's pagecache to its previous state on close().
> 
>   Or if it's really a problem, get control in there somehow and
>   periodically poll the pagecache occupancy via mincore(), use madvise()
>   then fadvise() to trim it back.
> 
> That all sounds reasonably doable.  It'd be pretty complex to do it
> in-kernel but we could do it there too.  Problem is of course that the
> above strategy is explicitly optimised for the backup program and if it's
> in-kernel it becomes applicable to all other workloads.

I can see the above being possible, but I can't see the reason
for exposing that complexity to userspace. If I'm the target
audience for that API then it's broken as I'd mess it up,
or would take too long to get it right.

Can't we just fix the posix_fadvise() implementation to
only evict pages paged in by the current process?
Perhaps one could possibly just evict pages with _mapcount==0 ?

cheers,
Pádraig.

Re: userspace pagecache management tool

2007-03-05 Thread Andrew Morton

On Mon, 05 Mar 2007 11:02:43 + Pádraig Brady <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > I've uploaded to http://userweb.kernel.org/~akpm/pagecache-management/ a
> > little tool which permits the management of the pagecache usage of
> > arbitrary applications.  Effectively it prevents the targeted application
> > from using any pagecache at all.
> 
> Cool, Kinda like noca?
> http://kernel.umbrella.ro/vm/

yup, same concept.

> I could easily read your code, though,
> whereas I couldn't immediately figure out what noca was doing.
> 
> I used posix_fadvise in an app I did recently:
> http://www.pixelbeat.org/programs/dvd-vr/
> There is a stream_data() func there that does:
> 
> read(src)
> write(dst)
> posix_fadvise(src)
> posix_fadvise(dst)
> 
> for performance I found I needed to do it in that order
> so that any readahead done with the read(src)
> was not thrown away by the posix_fadvise(src).
> In addition to the order, one must be careful
> to throw away only what you've actually written.
> 
> I'm not sure your lib gives enough control over this,
> as you essentially do:
> 
> posix_fadvise(src)
> read(src)
> posix_fadvise(dst)
> write(dst)

That could be so - it's just a demo.  But readahead should be OK - I only
invalidate from start-of-file up to current-offset-minus-pagesize.  So the
cache at and ahead of the linear reader is undisturbed.
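
A rough sketch of that trailing invalidation, assuming the wrapper can see
the reader's file offset. trailing_drop_len() and drop_behind() are
hypothetical names, and rounding the end of the range down to a page
boundary is a choice made here for illustration rather than something taken
from the tool.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes posix_fadvise() under strict -std= modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Length of the range [0, len) that may be dropped when the reader is
 * at offset `cur`: everything up to cur minus one page, rounded down to
 * a page boundary.  Returns 0 when nothing may be dropped yet.
 */
off_t trailing_drop_len(off_t cur, long page_size)
{
    off_t end;

    assert(page_size > 0);
    end = cur - page_size;
    if (end <= 0)
        return 0;
    return end - (end % page_size);
}

/*
 * Drop the pagecache behind a linear reader of fd.
 * Returns -1 if the offset cannot be determined, otherwise the
 * posix_fadvise() result (0 on success, an errno value on failure).
 */
int drop_behind(int fd)
{
    off_t cur = lseek(fd, 0, SEEK_CUR);
    off_t len;

    if (cur < 0)
        return -1;
    len = trailing_drop_len(cur, sysconf(_SC_PAGESIZE));
    if (len == 0)
        return 0;   /* a len of 0 would mean "through end of file" */
    return posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED);
}
```

The explicit check for a zero length matters: passing len 0 to
posix_fadvise() would cover the whole file, readahead included.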



Re: userspace pagecache management tool

2007-03-05 Thread Pádraig Brady

Andrew Morton wrote:
> I've uploaded to http://userweb.kernel.org/~akpm/pagecache-management/ a
> little tool which permits the management of the pagecache usage of
> arbitrary applications.  Effectively it prevents the targeted application
> from using any pagecache at all.

Cool, Kinda like noca?
http://kernel.umbrella.ro/vm/
Though I could easily read your code,
but couldn't immediately figure out what noca was doing.

I used posix_fadvise in an app I did recently:
http://www.pixelbeat.org/programs/dvd-vr/
There is a stream_data() func there that does:

read(src)
write(dst)
posix_fadvise(src)
posix_fadvise(dst)

for performance I found I needed to do it in that order
so that any readahead done with the read(src)
was not thrown away by the posix_fadvise(src).
In addition to the order, one must be careful
to throw away only what you've actually written.

I'm not sure your lib gives enough control over this,
as you essentially do:

posix_fadvise(src)
read(src)
posix_fadvise(dst)
write(dst)

cheers,
Pádraig.
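
The read/write/fadvise(src)/fadvise(dst) ordering described above can be
sketched in Python as follows.  This is not the dvd-vr stream_data() source;
the chunk size, the use of os.fdatasync() to make the written pages clean
before dropping them, and the assumption that both descriptors start at
offset 0 are all illustrative choices.

```python
import os


def stream_data(src_fd, dst_fd, chunk_size=1 << 20):
    """Copy src_fd to dst_fd, dropping pagecache only after each chunk is done.

    Both descriptors are assumed to be positioned at offset 0.
    Returns the number of bytes copied.
    """
    copied = 0
    while True:
        buf = os.read(src_fd, chunk_size)      # 1. read(src)
        if not buf:
            break
        view = memoryview(buf)
        while view:                            # 2. write(dst), handling short writes
            written = os.write(dst_fd, view)
            view = view[written:]
        # Dirty pages cannot be dropped, so flush them first (assumption:
        # this is one way to "throw away only what you've actually written").
        os.fdatasync(dst_fd)
        # 3. posix_fadvise(src): only the range just consumed, so readahead
        #    beyond the current offset is not discarded.
        os.posix_fadvise(src_fd, copied, len(buf), os.POSIX_FADV_DONTNEED)
        # 4. posix_fadvise(dst): only the range just written.
        os.posix_fadvise(dst_fd, copied, len(buf), os.POSIX_FADV_DONTNEED)
        copied += len(buf)
    return copied
```

Flushing after every chunk is the simplest way to keep the sketch correct; a
real program would probably batch the flushes to avoid stalling on each write.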

Re: userspace pagecache management tool

2007-03-04 Thread Rik van Riel


Andrew Morton wrote:
> On Sat, 03 Mar 2007 20:56:27 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
>
>> Andrew Morton wrote:
>>
>>>>> Doing a refault thing would help a bit, but stops working at a certain point.
>>>> At what point does it stop working?
>>>
>>> We need to store that this-page-got-reclaimed info somewhere.  I don't know
>>> how space-efficient that is.  Did anyone ever do an implementation?
>>
>> One 32 bit word per evicted page that we keep track of.
>
> ok...
>
> I wonder if we really need a new data structure to track that.  I mean,
> once a file-backed (or indeed swapcache) page has been reclaimed, its
> radix-tree slot is just sitting there with zeroes in it, asking us to reuse
> that space for something interesting, no?
>
> Of course, if all 64 pages in a radix-tree node get removed, we'll
> currently free the node itself.  We could stop doing that, but the effects
> of that might be pretty bad sometimes.  Instead, it sounds sensible to
> populate the now-null slot in the parent radix-tree node with an
> average/max/min/per-child-bitmap/whatever of the metrics for the 64
> non-resident pages which that non-leaf slot represents.  So as the period
> since a single page got evicted increases and increases, our information
> about its state becomes less and less accurate.
>
> If that inaccuracy is a problem then perhaps we could defer the collapsing
> of a now-empty node into its parent in some manner.


We know exactly how far to defer that collapsing, too.

We know at what rate we rotate through the active list,
and the size of the active list.

We also know the rate at which we reclaim pages, and
the size of the inactive list.

Combine the two, and you have an idea roughly how many
page faults there are between the accesses to the coldest
page on the active list.

We don't have to keep the evicted page history beyond
that point, because pages that get refaulted after such
a long interval have a longer inter-reference distance
and should go onto the inactive list - ie. the default
list for unknown pages.


>> If you can find holes in http://linux-mm.org/PageReplacementDesign
>> please let me know :)
>
> That all looks pretty non-crazy and implementable to me.  Alas, getting the
> stuff written and working is 1% of the effort.  The rest is the nasty hunt
> for new corner-cases and general productisation hassle.  But if initial
> results show benefit, I expect we could manage all that.


True, but I've looked through a few hundred VM bugzillas
to validate the design against all the common corner cases,
all the way from RHEL3 (which also has split anon/file
lists) through today.

I'm trying to keep the known-good bits of our policy as
much as possible, introducing big changes only for those
corner cases that plagued multiple VMs in the past.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-04 Thread Peter Zijlstra

On Sun, 2007-03-04 at 04:07 -0800, Andrew Morton wrote:
> On Sat, 03 Mar 2007 20:56:27 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
> 
> > Andrew Morton wrote:
> > 
> > >>> Doing a refault thing would help a bit, but stops working at a
> > >>> certain point.
> > >> At what point does it stop working?
> > > 
> > > We need to store that this-page-got-reclaimed info somewhere.  I don't
> > > know how space-efficient that is.  Did anyone ever do an implementation?
> > 
> > One 32 bit word per evicted page that we keep track of.
> 
> ok...
> 
> I wonder if we really need a new data structure to track that.  I mean,
> once a file-backed (or indeed swapcache) page has been reclaimed, its
> radix-tree slot is just sitting there with zeroes in it, asking us to reuse
> that space for something interesting, no?
> 
> Of course, if all 64 pages in a radix-tree node get removed, we'll
> currently free the node itself.  We could stop doing that, but the effects
> of that might be pretty bad sometimes.  Instead, it sounds sensible to
> populate the now-null slot in the parent radix-tree node with an
> average/max/min/per-child-bitmap/whatever of the metrics for the 64
> non-resident pages which that non-leaf slot represents.  So as the period
> since a single page got evicted increases and increases, our information
> about its state becomes less and less accurate.
> 
> If that inaccuracy is a problem then perhaps we could defer the collapsing
> of a now-empty node into its parent in some manner.

Getting the refault distance out of such a radix tree would be tricky.
One solution I can think of would entail keeping a global fault count,
storing the current fault count in the radix node at eviction, and on
refault subtracting the stored value from the global count. The downside,
however, is this global counter; perhaps we could do some smart percpu
count aggregate to fix it.

The other point you mention is when do we reap these radix tree nodes.
Normally nonresident information gets dropped once the distance exceeds
the size of memory, but these nodes don't have an explicit order.

The collapsing idea is interesting, esp. if we could delay the collapse
so that the avg refault distance would be in some relation to the error.
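
The global-counter scheme above can be modelled in a few lines of Python.
This is only a userspace toy, not kernel code: the class name RefaultTracker
and the dictionary standing in for the radix-tree slots are assumptions made
for illustration.

```python
class RefaultTracker:
    """Toy model: stamp evicted pages with a global fault count."""

    def __init__(self):
        self.fault_count = 0   # the single global counter
        self.evicted_at = {}   # page -> fault count when it was evicted

    def evict(self, page):
        # Store the current global count where the page used to live.
        self.evicted_at[page] = self.fault_count

    def fault(self, page):
        """Account one fault; return the refault distance, or None.

        None means no eviction record exists for this page.
        """
        self.fault_count += 1
        stamp = self.evicted_at.pop(page, None)
        if stamp is None:
            return None
        # Refault distance = faults since eviction, including this one.
        return self.fault_count - stamp
```

Every fault touches fault_count, which is the contention problem a per-CPU
aggregate would be meant to address.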


Re: userspace pagecache management tool

2007-03-04 Thread Andrew Morton

On Sat, 03 Mar 2007 20:56:27 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> 
> >>> Doing a refault thing would help a bit, but stops working at a
> >>> certain point.
> >> At what point does it stop working?
> > 
> > We need to store that this-page-got-reclaimed info somewhere.  I don't know
> > how space-efficient that is.  Did anyone ever do an implementation?
> 
> One 32 bit word per evicted page that we keep track of.

ok...

I wonder if we really need a new data structure to track that.  I mean,
once a file-backed (or indeed swapcache) page has been reclaimed, its
radix-tree slot is just sitting there with zeroes in it, asking us to reuse
that space for something interesting, no?

Of course, if all 64 pages in a radix-tree node get removed, we'll
currently free the node itself.  We could stop doing that, but the effects
of that might be pretty bad sometimes.  Instead, it sounds sensible to
populate the now-null slot in the parent radix-tree node with an
average/max/min/per-child-bitmap/whatever of the metrics for the 64
non-resident pages which that non-leaf slot represents.  So as the period
since a single page got evicted increases and increases, our information
about its state becomes less and less accurate.

If that inaccuracy is a problem then perhaps we could defer the collapsing
of a now-empty node into its parent in some manner.
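
To make the accuracy trade-off concrete, here is a small Python sketch of
collapsing one node's worth of per-page eviction metrics into a single
summary for the parent slot.  The EvictionSummary type, the choice of
min/max/average, and the use of integer "stamps" as the metric are all
assumptions; the message above leaves the exact metric open.

```python
from dataclasses import dataclass

SLOTS_PER_NODE = 64  # pages per radix-tree node, as stated in the message


@dataclass(frozen=True)
class EvictionSummary:
    """What the parent slot would remember about 64 non-resident pages."""
    min_stamp: int
    max_stamp: int
    avg_stamp: float


def collapse_node(stamps):
    """Summarise the eviction stamps of a fully evicted node."""
    if len(stamps) != SLOTS_PER_NODE:
        raise ValueError("expected one stamp per slot")
    return EvictionSummary(
        min_stamp=min(stamps),
        max_stamp=max(stamps),
        avg_stamp=sum(stamps) / len(stamps),
    )
```

After the collapse, a refault on any of the 64 pages can only be compared
against the summary, so the error is bounded by max_stamp - min_stamp.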

> > You mean design it and review the design before coding it?  You'll find few
> > objections there.
> 
> Few objections, but sadly also very few people interested in
> actually reviewing the design :(
> 
> If you can find holes in http://linux-mm.org/PageReplacementDesign
> please let me know :)

That all looks pretty non-crazy and implementable to me.  Alas, getting the
stuff written and working is 1% of the effort.  The rest is the nasty hunt
for new corner-cases and general productisation hassle.  But if initial
results show benefit, I expect we could manage all that.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 3 Mar 2007 21:35:59 -0500 "Lee Revell" <[EMAIL PROTECTED]> wrote:

> On 3/3/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > But yes, updatedb's pagecache usage will be mainly metadata, and this tool
> > doesn't address metadata pagecache, although it could do so.
> >
> 
> With no kernel changes?  How?  I can't find an equivalent API to
> posix_fadvise() for metadata.
> 

We can use mincore and fadvise against /dev/sda1, too.

mincore's linear search would hurt but you could just run fadvise
regularly.  A lot of the blockdev pagecache is pretty useless anyway: we've
already copied much of it into dentries and inodes, and some of ext2/3/4's
pagecache is already pinned by the fs.
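
A minimal Python sketch of the "just run fadvise regularly" option follows.
The helper name drop_cached_pages is an assumption; pointing it at a block
device node such as /dev/sda1 would normally need read permission on the
device, and only clean, unpinned pages can be expected to go away.

```python
import os


def drop_cached_pages(path):
    """Ask the kernel to drop cached pages for a file or block device node."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, len=0 means "the whole file" for posix_fadvise().
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Run periodically (for example from a timer), this avoids mincore's linear
scan at the cost of not knowing how much was actually resident.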


Re: userspace pagecache management tool

2007-03-03 Thread Lee Revell


On 3/3/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> But yes, updatedb's pagecache usage will be mainly metadata, and this tool
> doesn't address metadata pagecache, although it could do so.

With no kernel changes?  How?  I can't find an equivalent API to
posix_fadvise() for metadata.

Lee

Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Andrew Morton wrote:

>>> Doing a refault thing would help a bit, but stops working at a certain point.
>> At what point does it stop working?
>
> We need to store that this-page-got-reclaimed info somewhere.  I don't know
> how space-efficient that is.  Did anyone ever do an implementation?

One 32 bit word per evicted page that we keep track of.

> Of course, the pages need to be re-read again so there's a potential 100%
> hit there, which is in fact not a huge amount in this context.  Depends how
> often it occurs (all the time when refault is being useful?) versus what we
> gain from it.

At this point, when we see that a refaulted page is more
active than the coldest page on the active list, we can
also immediately shrink the active list.  That gives the
next inactive page a better chance to get promoted before
it gets evicted.

>> I am not asking this to be difficult, I just want to get Linux
>> a VM that does not need to be kludged up every time a distro
>> ships it to its customers.
>
> We have a communication problem here.  Please please please work harder to
> get these problems communicated to the MM developers.  The only vendor MM
> kludge of which I'm aware is a thing which Andrea is working on to address
> a large-shm-segment versus bulk-IO problem (yup, database).
>
> If you have enough of an understanding of a problem to be able to develop
> and productise a fix then share that info madly, asap.

The problem is that most of the distro patches are
kludges, which we would rather not see again in
future kernels.  They tend to work around the problem,
instead of being a proper fix, since reorganizing the
VM in the middle of a release is not an option.

However, incremental small-to-medium changes might
be an option for the upstream kernel, if you are
interested.

>> I believe one starting point would be a concept that people
>> cannot shoot holes in any more.  That is no guarantee, but
>> as long as the concept has known holes coding it up is likely
>> to be a waste of time since the code will need kludges to
>> deal with the problems later on and we'd be back to square
>> one.
>
> You mean design it and review the design before coding it?  You'll find few
> objections there.


Few objections, but sadly also very few people interested in
actually reviewing the design :(

If you can find holes in http://linux-mm.org/PageReplacementDesign
please let me know :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 20:23:07 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Sat, 03 Mar 2007 19:01:01 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
> 
> >> The use-once policy we have in the kernel should work
> >> perfectly fine for backups.  All we need to do is
> >> actually honor the accessed bit on active page cache
> >> pages, instead of flushing them onto the inactive
> >> list.
> >>
> >> What am I overlooking?
> > 
> > That'll improve backups but will break other things.
> > 
> > To do this effectively we'd need to change the policy so that new pagecache
> > allocations cause no scanning of used-twice pages at all.  So that even
> > after many gigs of backing up, the working set is still there.
> > 
> > Problem is, (for example) what about the person who has 80% of memory in
> > used-twice state and who then reads a file or files which are 20% or more
> > of the size of memory, two or more times.  It'll be 100% cache misses,
> > every time.  This will happen quite a lot.  IOW, once those pages are in
> > used-twice state, how does further pagecache activity ever get them _out_
> > of that state?  Only by joining the used-twice page set, and that can't
> > happen if the used-once-so-far pages got reclaimed.
> > 
> > Doing a refault thing would help a bit, but stops working at a certain
> > point.
> 
> At what point does it stop working?

We need to store that this-page-got-reclaimed info somewhere.  I don't know
how space-efficient that is.  Did anyone ever do an implementation?

Of course, the pages need to be re-read again so there's a potential 100%
hit there, which is in fact not a huge amount in this context.  Depends how
often it occurs (all the time when refault is being useful?) versus what we
gain from it.
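
The 80%/20% scenario quoted above can be reproduced with a toy Python model.
It assumes, purely for illustration, that used-twice pages are never scanned
and that used-once pages are reclaimed in FIFO order from whatever memory is
left; the function name cache_hits and the page counts are hypothetical, and
hits are only counted, not promoted.

```python
from collections import OrderedDict


def cache_hits(total_pages, used_twice_pages, file_pages, passes=2):
    """Count cache hits when a file is read sequentially `passes` times.

    Used-twice pages are treated as untouchable, so the file only competes
    for the remaining frames, which are recycled oldest-first.
    """
    free = total_pages - used_twice_pages
    cache = OrderedDict()          # cached file pages, oldest first
    hits = 0
    for _ in range(passes):
        for page in range(file_pages):
            if page in cache:
                hits += 1
                continue
            if free <= 0:
                continue           # nowhere to cache anything at all
            if len(cache) >= free:
                cache.popitem(last=False)   # reclaim the oldest used-once page
            cache[page] = None
    return hits
```

In this model cache_hits(100, 80, 19) returns 19, but cache_hits(100, 80, 21)
returns 0: as soon as the file is even one page larger than the unprotected
memory, each page is reclaimed just before it is needed again.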

> I am not asking this to be difficult, I just want to get Linux
> a VM that does not need to be kludged up every time a distro
> ships it to its customers.

We have a communication problem here.  Please please please work harder to
get these problems communicated to the MM developers.  The only vendor MM
kludge of which I'm aware is a thing which Andrea is working on to address
a large-shm-segment versus bulk-IO problem (yup, database).

If you have enough of an understanding of a problem to be able to develop
and productise a fix then share that info madly, asap.

otoh, rhel-on-the-desktop-or-smaller probably isn't a huge priority, which
can be taken advantage of.

> I believe one starting point would be a concept that people
> cannot shoot holes in any more.  That is no guarantee, but
> as long as the concept has known holes coding it up is likely
> to be a waste of time since the code will need kludges to
> deal with the problems later on and we'd be back to square
> one.

You mean design it and review the design before coding it?  You'll find few
objections there.


Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Eric St-Laurent wrote:

> While I think that more user space applications should use fadvise() to
> avoid polluting the page cache with unneeded data, I still think the
> kernel should be more fair in regard to page cache management.

> Personally, I've experienced some sluggish performance after copying
> large files around. Even more when using NFS. It's difficult to file a
> bug report for "interactive feel", I don't know how to measure it. I
> just feel it's a weak aspect of the OS.

Fairness and interactiveness are very hard to quantify and
measure, which makes it hard to justify patches that improve
this behaviour.

On the other hand, patches that improve benchmark results
are easily justifiable, which makes it easy to merge those
even if it comes at the expense of fairness.

I think fairness and robustness are important, but I have
not figured out a way to justify such changes for upstream
inclusion.  Well, except perhaps by coming up with artificial
test cases, but that feels like cheating :)

> My personal opinion is that the VM seems tuned for database-type
> workloads. Of course, making the page cache more fair to prevent one
> process from using most of it will most likely slow down database-type
> applications.

The database people disagree.  For one, the accessed bit
on active pages gets ignored, so Linux does not
properly keep the most actively used page cache pages in
memory.

Secondly, the VM can waste quite a lot of time scanning
over the anonymous pages that it does not even want to
evict from memory.  If the VM does not plan on evicting
anonymous memory (or shared memory segments), why waste
CPU time scanning them and randomizing their LRU order?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 3 Mar 2007 20:16:09 -0500 "Lee Revell" <[EMAIL PROTECTED]> wrote:

> On 3/3/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > The tool uses an LD_PRELOAD hack to intercept glibc's read(), pread(),
> > write(), pwrite(), close() and dup2() functions.  pagecache control is done
> > via posix_fadvise() and sync_file_range().
> >
> 
> How could this have any effect on the updatedb problem?  updatedb does
> not read() anything, it just open()s and stat()s every file on the
> disk.
> 

err, good point.  _one_ of those dang things which goes off when you've
stayed up too late does a lot of pagecache IO, not sure which one.  Maybe
rpmq?  But I'd expect that to be doing direct-io.

But yes, updatedb's pagecache usage will be mainly metadata, and this tool
doesn't address metadata pagecache, although it could do so.

It instantiated 5MB of pagecache and 20MB of slab, took about one minute.

rpm uses rather a lot of pagecache.

So yes, it looks like updatedb is a slab problem.

Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Andrew Morton wrote:
> On Sat, 03 Mar 2007 19:01:01 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

>> The use-once policy we have in the kernel should work
>> perfectly fine for backups.  All we need to do is
>> actually honor the accessed bit on active page cache
>> pages, instead of flushing them onto the inactive
>> list.
>>
>> What am I overlooking?
>
> That'll improve backups but will break other things.
>
> To do this effectively we'd need to change the policy so that new pagecache
> allocations cause no scanning of used-twice pages at all.  So that even
> after many gigs of backing up, the working set is still there.
>
> Problem is, (for example) what about the person who has 80% of memory in
> used-twice state and who then reads a file or files which are 20% or more
> of the size of memory, two or more times.  It'll be 100% cache misses,
> every time.  This will happen quite a lot.  IOW, once those pages are in
> used-twice state, how does further pagecache activity ever get them _out_
> of that state?  Only by joining the used-twice page set, and that can't
> happen if the used-once-so-far pages got reclaimed.
>
> Doing a refault thing would help a bit, but stops working at a certain point.


At what point does it stop working?

I am not asking this to be difficult, I just want to get Linux
a VM that does not need to be kludged up every time a distro
ships it to its customers.

I believe one starting point would be a concept that people
cannot shoot holes in any more.  That is no guarantee, but
as long as the concept has known holes coding it up is likely
to be a waste of time since the code will need kludges to
deal with the problems later on and we'd be back to square
one.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 17:02:31 -0800 Ray Lee <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > 
> 
> Would there be any other users of it than updatedb?

updatedb is the notorious one.

Alas, one can envisage sane workloads which really really really really
want to cache millions of dentries and inodes.  Workloads which get run
more often than once-per-day-at-4AM.  So if we "fix" updatedb, those
people with kernel-indistinguishable workloads get real unhappy.  We have
to pass instructions to the kernel to resolve this.

A sys_reclaim_dentry() would also need a sys_is_dentry_there() so updatedb
could restore the previous state.

Probably it'd be better to fix the well-known internal fragmentation problem
we have with the VFS caches.  That's fairly hard.

> I'm not coming up
> with much, but given that I'm not always clever, that doesn't mean much.
> 
>  A hypothetical on-demand file virus scanner is
> going to hit already cached or about-to-be-cached entries by definition.
> Perhaps some system audit daemon, such as tripwire. Well, that has the
> same access patterns as updatedb, doesn't it: a directory at a time.
> find, cp -a, the same.
> 
> So instead of sys_reclaim_dentry, how about extending fadvise to work on
> the fd returned via opendir?

That'd be pretty simple, but a) would reclaim the pagecache for the
directory and not the dentry object itself and b) will only be easy to do
for ext2 and minixfs, which maintain a separate pagecache per directory.

Yes, we could do a "nuke all the dentries in this directory" thing, but
that's equivalent to sys_reclaim_dentry() in a loop.

> And extending POSIX_FADV_NOREUSE on a file
> fd to drop the dentry at close?
> 
> (Call me chicken; I just don't want to be the guy suggesting a new
> syscall for a single or few users.)
> 
>  ~ ~
> 
> Alternately, there have been requests for a way for userspace to get
> notification of all file events for indexing of data and metadata
> (inotify, unfortunately, doesn't scale to a full filesystem). (cf.
> http://lkml.org/lkml/2006/9/30/98 .)

yes, that's a disappointment.

> That'd allow an updatedb daemon to
> keep the index up to date all the time, amortizing the cost. More
> usefully, it'd allow a content indexing daemon to stay up to date all
> the time, though inotify mostly works for those, I suppose.
> 
> (Hmm...
>   [EMAIL PROTECTED]:~$ find ~ -type d | wc -l
>   14067
> 
> ...right. So it probably works fine for normal people.)
> 
> Hey, waitaminute. This should be a solved problem? SELinux must have
> some sort of requirement for logging file access attempts. Google, at
> least, implies so. Perhaps whatever it implements could be lifted into
> the core kernel without dragging the rest behind it.

Maybe the syscall auditing code can be persuaded to spit out records which
can be used for this.

> Dunno. Who do we CC?

That's a problem.  Nobody and everybody.

Re: userspace pagecache management tool

2007-03-03 Thread Lee Revell


On 3/3/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> The tool uses an LD_PRELOAD hack to intercept glibc's read(), pread(),
> write(), pwrite(), close() and dup2() functions.  pagecache control is done
> via posix_fadvise() and sync_file_range().

How could this have any effect on the updatedb problem?  updatedb does
not read() anything, it just open()s and stat()s every file on the
disk.

Lee

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 19:14:59 -0500 Eric St-Laurent <[EMAIL PROTECTED]> wrote:

> On Sat, 2007-03-03 at 12:29 -0800, Andrew Morton wrote:
> 
> 
> > There is much more which could be done to make this code smarter, but I
> > think the lesson here is that we can produce a far, far better result doing
> > this work in userspace than we could ever hope to do with an in-kernel
> > implementation.  There are some enhancement suggestions in the
> > documentation file.
> 
> While I think that more user space applications should use fadvise() to
> avoid polluting the page cache with unneeded data, I still think the
> kernel should be more fair in regard to page cache management.
> 
> Personally, I've experienced some sluggish performance after copying
> large files around. Even more when using NFS. It's difficult to file a
> bug report for "interactive feel", I don't know how to measure it. I
> just feel it's a weak aspect of the OS.

yeah.  It'd be worth spending some time, try to come up with some set of
commands which produce an effect which you find objectionable.

> Surely it's possible to make the kernel a little bit better to protect
> the page cache from abuse, from simple or badly designed applications.
> 
> Why is fairness provided by the process scheduler with good results, yet
> it is somewhat easy for a process to cause slowdowns through page cache usage?
> 
> My personal opinion is that the VM seems tuned for database-type
> workloads.

VM hasn't actually been tuned *for* anything much at all, really.  Looking
back on it, much of the tweaking in there has been to avoid really bad
situations.  We put much work into avoiding the 100%, 1000% or 1%
slowdowns, but not a lot of work into providing the 15% speedups.

So it may well be that the result is not particularly great at anything,
but it's also not horridly bad at anything, either.  Or at least, it's not
supposed to be.

> Of course, making the page cache fairer, to prevent one
> process from using most of it, will most likely slow down database-type
> applications.

databases actually like to manage their own cache via various means.  There
are some situations in which bulk IO activities can trash the database's
cache.  That's the sort of thing which this tool is trying to help address.

> Maybe the situation should be reversed, much like the process scheduler.
> Fairness by default, and the possibility to request for more system
> resources by asking for them with necessary privileges. Much like
> SCHED_FIFO policy.

Well.  If the CPU scheduler makes a mistake, we see 5% or 15% degradations.
If the VM makes a mistake (or fails to read the operator's mind), we go to disk
and can suffer 1000% degradations or worse.


Re: userspace pagecache management tool

2007-03-03 Thread Eric St-Laurent

On Sat, 2007-03-03 at 12:29 -0800, Andrew Morton wrote:

> There is much more which could be done to make this code smarter, but I
> think the lesson here is that we can produce a far, far better result doing
> this work in userspace than we could ever hope to do with an in-kernel
> implementation.  There are some enhancement suggestions in the
> documentation file.

While I think that more user space applications should use fadvise() to
avoid polluting the page cache with unneeded data, I still think the
kernel should be more fair in regard to page cache management.

Personally, I've experienced some sluggish performance after copying
large files around. Even more when using NFS. It's difficult to file a
bug report for "interactive feel", I don't know how to measure it. I
just feel it's a weak aspect of the OS.

Surely it's possible to make the kernel a little bit better to protect
the page cache from abuse, from simple or badly designed applications.

Why is fairness provided by the process scheduler with good results, yet
it is somewhat easy for a process to cause slowdowns through page cache usage?

My personal opinion is that the VM seems tuned for database-type
workloads. Of course, making the page cache fairer, to prevent one
process from using most of it, will most likely slow down database-type
applications.

Maybe the situation should be reversed, much like the process scheduler.
Fairness by default, and the possibility to request for more system
resources by asking for them with necessary privileges. Much like
SCHED_FIFO policy.

- Eric


Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 19:01:01 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
> > 
> >> backup program
> > 
> > A suitable policy for a backup program would probably be to invalidate any
> > output file(s) and to invalidate those pages of the input files which were
> > not in cache when the backup program first opened those files.  That way
> > the backup program will have no effect on the cache state, except for the
> > race situation where someone read an uncached file while the backup program
> > was reading from it too.
> 
> The use-once policy we have in the kernel should work
> perfectly fine for backups.  All we need to do is
> actually honor the accessed bit on active page cache
> pages, instead of flushing them onto the inactive
> list.
> 
> What am I overlooking?

That'll improve backups but will break other things.

To do this effectively we'd need to change the policy so that new pagecache
allocations cause no scanning of used-twice pages at all.  So that even
after many gigs of backing up, the working set is still there.

Problem is, (for example) what about the person who has 80% of memory in
used-twice state and who then reads a file or files which are 20% or more of
the size of memory, two or more times.  It'll be 100% cache misses, every time.
This will happen quite a lot.  IOW, once those pages are in used-twice state,
how does further pagecache activity ever get them _out_ of that state?  Only
by joining the used-twice page set, and that can't happen if the
used-once-so-far pages got reclaimed.

Doing a refault thing would help a bit, but stops working at a certain point.
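
A toy model of the scenario above, as a sketch only: it is not kernel code,
the function name and page counts are invented for illustration, and it
assumes plain LRU over whatever memory is left outside a used-twice set that
is never scanned.  With 100 pages of memory and 80 of them held in used-twice
state, a file that needs even one page more than the remaining 20 misses on
every access, however many times it is re-read.

```python
from collections import OrderedDict


def cyclic_read_hits(free_pages, file_pages, passes):
    """Count cache hits when a file is read front to back `passes` times.

    Only `free_pages` slots are available, because the rest of memory is
    assumed to hold used-twice pages that new allocations never scan.
    """
    lru = OrderedDict()  # page index -> True, oldest entry first
    hits = 0
    for _ in range(passes):
        for page in range(file_pages):
            if page in lru:
                lru.move_to_end(page)  # referenced again: now most recent
                hits += 1
            else:
                if len(lru) >= free_pages:
                    lru.popitem(last=False)  # evict the least recently used
                lru[page] = True
    return hits


# 20 free slots, 21-page file, read three times: every access misses.
print(cyclic_read_hits(20, 21, 3))  # 0
# The same file trimmed to 20 pages fits, so passes two and three all hit.
print(cyclic_read_hits(20, 20, 3))  # 40
```

In this simplified model the cliff sits just above the free fraction rather
than exactly at it, but the shape of the problem is the same: the re-read
pages never survive long enough to be seen a second time.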

> > This can be added in an hour or two with no kernel changes (use mincore).
> 
> mincore only works for mmaped areas, we'd need an fincore
> to work with file handles.

The LD_PRELOAD code has the fd and can mmap it to perform the pagecache
probe.

fincore() would be a bit neater, but given the rarity with which mincore()
is used it's perhaps hard to justify adding a slightly more efficient and
slightly more convenient subset of mincore().
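
For illustration, the "mmap the fd and call mincore()" probe could look
roughly like the sketch below.  It is written in Python with ctypes only to
keep it short; an LD_PRELOAD library would do the same thing in C.
resident_pages() is a made-up name, and the private copy-on-write mapping is
used purely so that ctypes can take the mapping's address.  Note also that
some newer kernels may report every page as resident for files the caller
could not write to, so treat the result as a hint.

```python
import ctypes
import mmap
import os

PAGE = mmap.PAGESIZE

# CDLL(None) exposes the symbols already loaded into the process (libc).
_libc = ctypes.CDLL(None, use_errno=True)
_libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                          ctypes.POINTER(ctypes.c_ubyte)]
_libc.mincore.restype = ctypes.c_int


def resident_pages(fd):
    """Return one bool per page of the file: True if mincore() says it is in core."""
    size = os.fstat(fd).st_size
    if size == 0:
        return []
    # Mapping the file does not read it, so the probe itself adds no pagecache.
    mm = mmap.mmap(fd, size, access=mmap.ACCESS_COPY)
    try:
        npages = (size + PAGE - 1) // PAGE
        vec = (ctypes.c_ubyte * npages)()
        buf = ctypes.c_char.from_buffer(mm)
        try:
            if _libc.mincore(ctypes.addressof(buf), size, vec) != 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
        finally:
            del buf  # drop the buffer export so the mapping can be closed
        # Only the low bit of each vector byte is defined.
        return [bool(b & 1) for b in vec]
    finally:
        mm.close()
```

Calling resident_pages() on the input fd before the first read() gives the
"what was already cached" snapshot that the rest of the policy needs.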


Re: userspace pagecache management tool

2007-03-03 Thread Ray Lee

Andrew Morton wrote:
> 

Would there be any other users of it than updatedb? I'm not coming up
with much, but given that I'm not always clever, that doesn't mean much.

 A hypothetical on-demand file virus scanner is
going to hit already cached or about-to-be-cached entries by definition.
Perhaps some system audit daemon, such as tripwire. Well, that has the
same access patterns as updatedb, doesn't it: a directory at a time.
find, cp -a, the same.

So instead of sys_reclaim_dentry, how about extending fadvise to work on
the fd returned via opendir? And extending POSIX_FADV_NOREUSE on a file
fd to drop the dentry at close?

(Call me chicken; I just don't want to be the guy suggesting a new
syscall for a single or few users.)

 ~ ~

Alternately, there have been requests for a way for userspace to get
notification of all file events for indexing of data and metadata
(inotify, unfortunately, doesn't scale to a full filesystem). (cf.
http://lkml.org/lkml/2006/9/30/98 .) That'd allow an updatedb daemon to
keep the index up to date all the time, amortizing the cost. More
usefully, it'd allow a content indexing daemon to stay up to date all
the time, though inotify mostly works for those, I suppose.

(Hmm...
[EMAIL PROTECTED]:~$ find ~ -type d | wc -l
14067

...right. So it probably works fine for normal people.)

Hey, waitaminute. This should be a solved problem? SELinux must have
some sort of requirement for logging file access attempts. Google, at
least, implies so. Perhaps whatever it implements could be lifted into
the core kernel without dragging the rest behind it.

Dunno. Who do we CC?

Ray

Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Andrew Morton wrote:
> On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
>
>> backup program
>
> A suitable policy for a backup program would probably be to invalidate any
> output file(s) and to invalidate those pages of the input files which were
> not in cache when the backup program first opened those files.  That way
> the backup program will have no effect on the cache state, except for the
> race situation where someone read an uncached file while the backup program
> was reading from it too.

The use-once policy we have in the kernel should work
perfectly fine for backups.  All we need to do is
actually honor the accessed bit on active page cache
pages, instead of flushing them onto the inactive
list.

What am I overlooking?

> This can be added in an hour or two with no kernel changes (use mincore).


mincore only works for mmaped areas, we'd need an fincore
to work with file handles.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sun, 4 Mar 2007 00:01:55 +0100 bert hubert <[EMAIL PROTECTED]> wrote:

> On Sat, Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote:
> > > > It is *not* a global instruction.  It uses setenv, so the user's policy
> > > > affects only the target process and its forked children.
> > > 
> > > ... and all other processes accessing the same file(s)!
> > > 
> > > Your library and the system calls may be limited to one process,
> > > but the consequences are global.
> > 
> > Yes.  So what?  If the user wants to go and evict libc.so from pagecache
> > then he can do so - the kernel has provided syscalls with which this can be
> > done for at least seven years.  Bad user, shouldn't do that.
> 
> While I agree with your sentiments that userspace can have a good idea on
> how to deal with the page cache, your program does more than it claims to
> do - because of how linux implements posix_fadvise.
> 
> I don't think anybody expects or desires your program to actually *evict*
> the stuff you are trying to access from the cache, which is what happens if
> the data was in the cache prior to starting your program.
> 
> What people expect is that a solution such as the one you wrote simply won't
> *add* anything to the cache. They don't expect that it will actually globally
> *remove* stuff from the cache.
> 
> Making a backup this way would hurt even worse than usual with your
> pagecache management tool if the file being backed up was still being read.
> 
> This is not your fault, but in practice, it makes your program less useful
> than it could be.

yup.  As I said, it's a proof-of-concept.  It's a project.  And I have about one
free femtosecond per fortnight :(

> One could conceivably fix that up using mincore and simply not fadvise if a
> page was in core already.

Yes.  Let's flesh out the backup program policy some more:

- Unconditionally invalidate output files

- on entry to read(), probe pagecache, record which pages in the range are
  present

- on entry to next read(), shoot down those pages from the previous read
  which weren't in pagecache.

- But we can do better!  LRU the file's pages up to a certain number of pages.

- Once that point is exceeded, we need to reclaim some pages.  Which
  ones?  Well, we've been observing all reads, so we can record which pages
  were referenced once, and which ones were referenced multiple times so we
  can do arbitrarily complex page aging in there.

- On close(), nuke all pages which weren't in core during open(), even if
  this app referenced them multiple times.

- If the backup program decided to read its input files with mmap we're
  rather screwed.  We can't intercept pagefaults so the best we can do is
  to restore the file's pagecache to its previous state on close().

  Or if it's really a problem, get control in there somehow and
  periodically poll the pagecache occupancy via mincore(), use madvise()
  then fadvise() to trim it back.

That all sounds reasonably doable.  It'd be pretty complex to do it
in-kernel but we could do it there too.  Problem is, of course, that the
above strategy is explicitly optimised for the backup program and if it's
in-kernel it becomes applicable to all other workloads.
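
The "shoot down only what we brought in" step of the policy above can be
sketched as follows.  This is an assumption-laden illustration, not the
tool's code: ranges_to_drop() and drop_ranges() are made-up names, and the
per-page residency list is assumed to come from some earlier probe (for
example mincore() on a mapping of the file) taken before the read.

```python
import os


def ranges_to_drop(was_resident, page_size=4096):
    """Coalesce pages that were NOT cached before our read into byte ranges.

    `was_resident` holds one bool per page; the result is a list of
    (offset, length) tuples covering only the pages this process brought in.
    """
    runs = []
    start = None
    for i, present in enumerate(was_resident):
        if not present and start is None:
            start = i  # a run of our own pages begins here
        elif present and start is not None:
            runs.append((start * page_size, (i - start) * page_size))
            start = None
    if start is not None:  # the run extends to the end of the range
        runs.append((start * page_size,
                     (len(was_resident) - start) * page_size))
    return runs


def drop_ranges(fd, runs):
    """Ask the kernel to discard each range; pages cached by others are skipped."""
    for offset, length in runs:
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)


# Pages 0 and 3 were already cached, so only pages 1-2 and 4 are dropped:
# ranges_to_drop([True, False, False, True, False])
#   -> [(4096, 8192), (16384, 4096)]
```

Keeping the range computation separate from the fadvise call makes it easy
to layer the LRU and page-aging ideas on top: they only change which pages
end up marked as droppable.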



Re: userspace pagecache management tool

2007-03-03 Thread Erik Andersen

On Sat Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote:
> On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
> 
> > > It is *not* a global instruction.  It uses setenv, so the user's policy
> > > affects only the target process and its forked children.
> > 
> > ... and all other processes accessing the same file(s)!
> > 
> > Your library and the system calls may be limited to one process,
> > but the consequences are global.
> 
> Yes.  So what?  If the user wants to go and evict libc.so from pagecache
> then he can do so - the kernel has provided syscalls with which this can be
> done for at least seven years.  Bad user, shouldn't do that.

I think what Rik is pointing out is that as currently
implemented, posix_fadvise is a much bigger hammer than is
generally useful or desirable.

Using posix_fadvise says, in effect, "immediately drop this
stuff from the pagecache, consequences be damned".  If someone
else happens to be using the specified data, well too bad, they
suffer collateral damage.  Process A can, maliciously or
ignorantly, deny service to process B.

On the other hand, your old but super cool O_STREAMING patch took
a kinder, gentler approach, where applications could tell the
kernel "please do not keep this file descriptor's data in cache
on my account since I will not reuse it." If someone else however
was using the same data, the kernel would keep things cached as
usual and thereby avoid doing collateral damage.

 -Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 3 Mar 2007 14:58:48 -0800 "Ray Lee" <[EMAIL PROTECTED]> wrote:

> On 3/3/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > It is to address the "waah, backups fill my memory with pagecache" and the
> > "waah, updatedb swapped everything out" and the "waah, copying a DVD
> > gobbled all my memory" problems.
> 
> Is the updatedb problem really due to pagecache?

It's a combination of pagecache, slab cache and of course contention for
the disk.  In my experience the last preponderates: the disk is seeking
like mad and I can't get its attention.  Others report lots of swapout,
which will be a combination of slab and pagecache, varying degrees of each.

> > When running
> >
> > pagecache-management.sh dd if=100-mb-file of=foo
> > or
> > pagecache-management.sh cp -a /usr/src/linux-2.6.20 /usr/src/foo
> >
> > the amount of pagecache in the machine is pretty much unaltered.  Maybe a
> > megabyte of additional cache in the second case, because of ext3 indirect
> > blocks.
> 
> [EMAIL PROTECTED]:~/work/home/pagecache-management$ grep ext3_i /proc/slabinfo; ./pagecache-management.sh sudo updatedb; grep ext3_i /proc/slabinfo
> ext3_inode_cache   21024  23722   1584    2    1 : tunables   24   12    0 : slabdata  11861  11861      0
> ext3_inode_cache   41332  41332   1584    2    1 : tunables   24   12    0 : slabdata  20666  20666      0
> [EMAIL PROTECTED]:~/work/home/pagecache-management$ echo $(( 1584 * (41332-21024) ))
> 32167872

If 32 MB is the whole lot then by eliminating pagecache, we just solved the
problem.  But perhaps you instantiated a lot more VFS cache and all you're
seeing there is the leftovers.

> Or is there a /proc/sys/vm/* knob that can be tweaked for this
> before/after the updatedb?

/proc/sys/vm/vfs_cache_pressure should help.  I don't recall anyone
reporting its effects with updatedb.
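
For anyone who wants to try it, here is a rough sketch of the "tweak the
knob before, restore it after" experiment.  The helper name is made up and
the value to use is a guess: the default is 100, and larger values ask the
kernel to reclaim dentry and inode caches more eagerly.  Writing the real
knob needs root.

```python
import subprocess

KNOB = "/proc/sys/vm/vfs_cache_pressure"


def run_with_cache_pressure(cmd, pressure, knob=KNOB):
    """Run `cmd` with vfs_cache_pressure set to `pressure`, then restore it."""
    with open(knob) as f:
        saved = f.read().strip()
    with open(knob, "w") as f:
        f.write(str(pressure))
    try:
        return subprocess.call(cmd)
    finally:
        # Restore the old setting even if the command fails or is interrupted.
        with open(knob, "w") as f:
            f.write(saved)


# Example (as root):
# run_with_cache_pressure(["updatedb"], 1000)
```

Comparing /proc/slabinfo before and after, with and without the raised
setting, would show whether the knob actually changes what updatedb leaves
behind.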

> But yeah, I for one would happily submit patches to upstream authors
> to address this there. There's no reason code should be making the
> kernel guess its intention on these things.

I think so.  We're dealing with super-special cases here and often trying
to fix those in-kernel will degrade other, often more common cases.


Re: userspace pagecache management tool

2007-03-03 Thread bert hubert

On Sat, Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote:
> > > It is *not* a global instruction.  It uses setenv, so the user's policy
> > > affects only the target process and its forked children.
> > 
> > ... and all other processes accessing the same file(s)!
> > 
> > Your library and the system calls may be limited to one process,
> > but the consequences are global.
> 
> Yes.  So what?  If the user wants to go and evict libc.so from pagecache
> then he can do so - the kernel has provided syscalls with which this can be
> done for at least seven years.  Bad user, shouldn't do that.

While I agree with your sentiments that userspace can have a good idea on
how to deal with the page cache, your program does more than it claims to
do - because of how linux implements posix_fadvise.

I don't think anybody expects or desires your program to actually *evict*
the stuff you are trying to access from the cache, which is what happens if
the data was in the cache prior to starting your program.

What people expect is that a solution such as the one you wrote simply won't
*add* anything to the cache. They don't expect that it will actually globally
*remove* stuff from the cache.

Making a backup this way would hurt even worse than usual with your
pagecache management tool if the file being backed up was still being read.

This is not your fault, but in practice, it makes your program less useful
than it could be.

One could conceivably fix that up using mincore and simply not fadvise if a
page was in core already.

Bert
-- 
http://www.PowerDNS.com  Open source, database driven DNS Software 
http://netherlabs.nl  Open and Closed source services

Re: userspace pagecache management tool

2007-03-03 Thread Ray Lee


On 3/3/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> It is to address the "waah, backups fill my memory with pagecache" and the
> "waah, updatedb swapped everything out" and the "waah, copying a DVD
> gobbled all my memory" problems.

Is the updatedb problem really due to pagecache?

> When running
>
> pagecache-management.sh dd if=100-mb-file of=foo
> or
> pagecache-management.sh cp -a /usr/src/linux-2.6.20 /usr/src/foo
>
> the amount of pagecache in the machine is pretty much unaltered.  Maybe a
> megabyte of additional cache in the second case, because of ext3 indirect
> blocks.

[EMAIL PROTECTED]:~/work/home/pagecache-management$ grep ext3_i /proc/slabinfo; ./pagecache-management.sh sudo updatedb; grep ext3_i /proc/slabinfo
ext3_inode_cache   21024  23722   1584    2    1 : tunables   24   12    0 : slabdata  11861  11861      0
ext3_inode_cache   41332  41332   1584    2    1 : tunables   24   12    0 : slabdata  20666  20666      0
[EMAIL PROTECTED]:~/work/home/pagecache-management$ echo $(( 1584 * (41332-21024) ))
32167872

Or is there a /proc/sys/vm/* knob that can be tweaked for this
before/after the updatedb?

But yeah, I for one would happily submit patches to upstream authors
to address this there. There's no reason code should be making the
kernel guess its intention on these things.

Ray

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> backup program

A suitable policy for a backup program would probably be to invalidate any
output file(s) and to invalidate those pages of the input files which were
not in cache when the backup program first opened those files.  That way
the backup program will have no effect on the cache state, except for the
race situation where someone read an uncached file while the backup program
was reading from it too.

This can be added in an hour or two with no kernel changes (use mincore).


Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 17:28:35 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
> > 
> >>> It is *not* a global instruction.  It uses setenv, so the user's policy
> >>> affects only the target process and its forked children.
> >> ... and all other processes accessing the same file(s)!
> >>
> >> Your library and the system calls may be limited to one process,
> >> but the consequences are global.
> > 
> > Yes.  So what?  If the user wants to go and evict libc.so from pagecache
> > then he can do so - the kernel has provided syscalls with which this can be
> > done for at least seven years.  Bad user, shouldn't do that.
> 
> Are you saying the user should not use your script with their
> backup program?

No.

This is getting silly.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> 
> > Well, backup programs are a unique case.  Let's say instead that the user
> > has just generated a 600MB ISO image.
> > 
> > The kernel *just doesn't know* whether the user will next try to read the
> > kernel tree or will next try to read that ISO image.
> > 
> > That, Rik, is my point, and is the entire point of this work.
> 
> I still don't understand why "the backup program flushed my data out
> of the cache with POSIX_FADV_DONTNEED" is an improvement over "the
> backup program flushed my data out of the cache by reading other files".

Oh.  Well, yes, if the user elected to instruct the backup program to
invalidate both its input files and its output files and if it's a full
dump, you end up with nothing in pagecache.  Possibly a more sensible
setting would be to invalidate only the output.
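
As a rough illustration of the "invalidate only the output" setting, an
application (or an interposed write path) could do something like the
sketch below.  The function name is made up.  It uses fsync() where the
tool uses sync_file_range(), simply because Python's os module has no
wrapper for the latter as far as I know; the point is only that dirty pages
have to reach disk before POSIX_FADV_DONTNEED can discard them.

```python
import os


def write_without_caching(path, chunks):
    """Write `chunks` to `path`, then ask the kernel to drop the file's pagecache."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    total = 0
    try:
        for chunk in chunks:
            view = memoryview(chunk)
            while view:  # os.write() may write fewer bytes than asked
                n = os.write(fd, view)
                view = view[n:]
                total += n
        os.fsync(fd)  # dirty pages cannot be dropped, so write them back first
        # offset 0, length 0 means "from the start to the end of the file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```

Input files are left alone here, so whatever the rest of the system had
cached from them stays cached.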

But having some batch program come in from the side and perform a bulk read
of your present working set isn't very common.

> Your code may be useful for a few specialized situations,

That's quite wrong.  It is useful for a great number of well-known problem
scenarios, all of which are *already* "specialized situations".  It's *you*
who is chasing down the 1% scenarios and portraying them as general problems.
Backups only happen once in 24 hours, for example.

> but I don't
> see it actually fixing most of the examples you gave in your
> announcement, except for the DVD copying one.
> 

I don't know how much benefit it will provide for the updatedb problem - I
expect it'll help sometimes.  otoh maybe it'll worsen the existing slab internal
fragmentation problem, dunno.

But the other scenarios it solves completely and optimally.

Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Andrew Morton wrote:
> On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
>
>>> It is *not* a global instruction.  It uses setenv, so the user's policy
>>> affects only the target process and its forked children.
>> ... and all other processes accessing the same file(s)!
>>
>> Your library and the system calls may be limited to one process,
>> but the consequences are global.
>
> Yes.  So what?  If the user wants to go and evict libc.so from pagecache
> then he can do so - the kernel has provided syscalls with which this can be
> done for at least seven years.  Bad user, shouldn't do that.


Are you saying the user should not use your script with their
backup program?

Then what's the point?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> > It is *not* a global instruction.  It uses setenv, so the user's policy
> > affects only the target process and its forked children.
> 
> ... and all other processes accessing the same file(s)!
> 
> Your library and the system calls may be limited to one process,
> but the consequences are global.

Yes.  So what?  If the user wants to go and evict libc.so from pagecache
then he can do so - the kernel has provided syscalls with which this can be
done for at least seven years.  Bad user, shouldn't do that.


Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Andrew Morton wrote:

> Well, backup programs are a unique case.  Let's say instead that the user
> has just generated a 600MB ISO image.
>
> The kernel *just doesn't know* whether the user will next try to read the
> kernel tree or will next try to read that ISO image.
>
> That, Rik, is my point, and is the entire point of this work.


I still don't understand why "the backup program flushed my data out
of the cache with POSIX_FADV_DONTNEED" is an improvement over "the
backup program flushed my data out of the cache by reading other files".

Your code may be useful for a few specialized situations, but I don't
see it actually fixing most of the examples you gave in your
announcement, except for the DVD copying one.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Andrew Morton wrote:
> On Sat, 3 Mar 2007 22:41:09 +0100 bert hubert <[EMAIL PROTECTED]> wrote:
>
> > > How can you make global policy decisions based on the intent
> > > of one program?
> >
> > By not doing so.
>
> yup.
>
> > Andrew's program is fine in principle, except that the
> > linux kernel treats the communication of a program's intent as a global
> > instruction.
>
> argh.
>
> That felt good - let's do it again.
>
> argh.
>
> It is *not* a global instruction.  It uses setenv, so the user's policy
> affects only the target process and its forked children.


... and all other processes accessing the same file(s)!

Your library and the system calls may be limited to one process,
but the consequences are global.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 3 Mar 2007 22:41:09 +0100 bert hubert <[EMAIL PROTECTED]> wrote:

> > How can you make global policy decisions based on the intent
> > of one program?
> 
> By not doing so.

yup.

> Andrew's program is fine in principle, except that the
> linux kernel treats the communication of a program's intent as a global
> instruction.

argh.

That felt good - let's do it again.

argh.

It is *not* a global instruction.  It uses setenv, so the user's policy
affects only the target process and its forked children.
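
As a rough illustration of that scoping (a sketch, not the tool's actual
code): a wrapper can place its policy in a copy of the environment and hand
that copy to the command it launches.  The variable name
PAGECACHE_POLICY_DEMO and the helper run_with_policy() are made up for this
example.  Only the child and its descendants see the setting; what the child
then does to the pagecache pages of files shared with other processes is a
separate question.

```python
import os
import subprocess
import sys


def run_with_policy(argv, policy):
    """Run argv with a policy variable set only in the child's environment."""
    env = dict(os.environ)                  # copy, so the caller is untouched
    env["PAGECACHE_POLICY_DEMO"] = policy   # hypothetical variable name
    result = subprocess.run(argv, env=env, capture_output=True,
                            text=True, check=True)
    return result.stdout


# A stand-in "target program" that just reports what it sees.
CHILD = [sys.executable, "-c",
         "import os; print(os.environ.get('PAGECACHE_POLICY_DEMO'))"]

if __name__ == "__main__":
    print("child sees: ", run_with_policy(CHILD, "drop-behind").strip())
    print("parent sees:", os.environ.get("PAGECACHE_POLICY_DEMO"))
```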

> Also, Andrew's description is a tad misleading, as the size of the page
> cache might be altered a lot in case content is accessed that was previously
> cached!

That's true.

Although if the user knows that he'll want to use that data again soon, and
he elects to purge it all from cache beforehand then we're dealing with a pretty
dumb user.

If this user doesn't plan to use the data again, but some other user does,
then he loses.

> With my userspace hat on, I'd love to have a proper way to communicate my
> *program's* expectations to the kernel, without stomping other programs.

That is the aim and effect of this work.

> Also with the same hat on, I hope to rarely *need* to communicate my
> expectations because the kernel correctly predicts many cases.

yup.  It's those odd cases where it goes wrong which I'm addressing here.

And of course there's no way for the kernel to work out that it's about to
go wrong - kernel can't read user's mind.

Well.  That's not strictly true.  One could envisage a database-backed
learning program which observes the user's work patterns and, based on
various pattern-matchings, works out what the best strategy is likely to be
when the user starts some operation which he has performed previously.


Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 16:30:56 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Sat, 03 Mar 2007 15:40:42 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
> 
> >> I am sick and tired of the "this is hard, let userspace do it" attitude.
> > 
> > Anything you try to do in-kernel will catastrophically screw up some
> > workloads.  You don't have a chance of getting this right.
> 
> Any time you follow the directions of one userspace program,
> you can screw up others.  I suspect that userspace has far
> less of a chance of getting it right than the kernel.
> 
> ALSA would be a good example of why it is bad to export
> tuning knobs directly to userspace - many sound cards have
> non-standard names for the volume controls, making it almost
> impossible for userspace to present the user with a simple
> user interface for tweaking the volume.

What on earth are you talking about?  Please, go and look at the thing.

> > You are the kernel.  The user just read an entire kernel tree.  You face a
> > binary decision: do you cache that tree or do you not?  Your time starts
> > now.  What is your answer?
> 
> Let's turn this around.
> 
> The user has been accessing the kernel tree over and over
> again, for hours on end (compile testing a patch).  Along
> comes a backup program, that tells you to evict the whole
> thing from the cache.
> 
> What do you do?

Well, backup programs are a unique case.  Let's say instead that the user
has just generated a 600MB ISO image.

The kernel *just doesn't know* whether the user will next try to read the
kernel tree or will next try to read that ISO image.

That, Rik, is my point, and is the entire point of this work.

> How can you make global policy decisions based on the intent
> of one program?

You can't, that's why I did this work.

> Only the kernel knows the state of the whole system and has
> observed the behaviour of all the processes.

The kernel knows the past, and tries to predict the future from that past. 
Sometimes, as you well know, that goes badly wrong.  That's why I did this
work.

> One process has
> no idea what the other processes in the system are doing.

argh.  Please, next time click on the link?

http://userweb.kernel.org/~akpm/pagecache-management/


Re: userspace pagecache management tool

2007-03-03 Thread bert hubert

On Sat, Mar 03, 2007 at 04:30:56PM -0500, Rik van Riel wrote:

> The user has been accessing the kernel tree over and over
> again, for hours on end (compile testing a patch).  Along
> comes a backup program, that tells you to evict the whole
> thing from the cache.

This is arguably due to a linux misimplementation of posix_fadvise. SuS v3
clearly states:

  The posix_fadvise() function shall advise the implementation on the
  expected behavior of the application with respect to the data in the file
  associated with the open file descriptor

Note how it refers to the *application*. This is reiterated here:

  POSIX_FADV_WILLNEED
    Specifies that the application expects to access the specified data in
    the near future.

  POSIX_FADV_DONTNEED
    Specifies that the application expects that it will not access the
    specified data in the near future.

  POSIX_FADV_NOREUSE
    Specifies that the application expects to access the specified data once
    and then not reuse it thereafter.

Linux however implements posix_fadvise globally:

  POSIX_FADV_DONTNEED
    attempts to free cached pages associated with the specified region.

  POSIX_FADV_WILLNEED and POSIX_FADV_NOREUSE
    both initiate a non-blocking read of the specified region into the page
    cache.
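
For readers who have not used the call, here is a minimal sketch of how an
application passes these hints, using Python's os.posix_fadvise() wrapper.
The helper name read_once() and the chunk size are choices made for this
example only.  The comments restate the point made above: the hint comes
from one process, but the pages it refers to are shared.

```python
import os


def read_once(path, chunk_size=1 << 20):
    """Read a file sequentially, hinting the kernel before and after.

    Returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # "I expect to access this data soon."  A length of 0 means
        # "from offset to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        while True:
            data = os.read(fd, chunk_size)
            if not data:
                break
            total += len(data)
        # "I do not expect to access this data again soon."  On Linux this
        # asks the kernel to try to free the cached pages of the file
        # itself, so other processes using the same file are affected too.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```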

> How can you make global policy decisions based on the intent
> of one program?

By not doing so. Andrew's program is fine in principle, except that the
linux kernel treats the communication of a program's intent as a global
instruction.

Also, Andrew's description is a tad misleading, as the size of the page
cache might be altered a lot in case content is accessed that was previously
cached!

With my userspace hat on, I'd love to have a proper way to communicate my
*program's* expectations to the kernel, without stomping other programs.

Also with the same hat on, I hope to rarely *need* to communicate my
expectations because the kernel correctly predicts many cases.

Bert

-- 
http://www.PowerDNS.com  Open source, database driven DNS Software 
http://netherlabs.nl  Open and Closed source services

Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Andrew Morton wrote:
> On Sat, 03 Mar 2007 15:40:42 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
>
> > I am sick and tired of the "this is hard, let userspace do it" attitude.
>
> Anything you try to do in-kernel will catastrophically screw up some
> workloads.  You don't have a chance of getting this right.

Any time you follow the directions of one userspace program,
you can screw up others.  I suspect that userspace has far
less of a chance of getting it right than the kernel.

ALSA would be a good example of why it is bad to export
tuning knobs directly to userspace - many sound cards have
non-standard names for the volume controls, making it almost
impossible for userspace to present the user with a simple
user interface for tweaking the volume.

> You are the kernel.  The user just read an entire kernel tree.  You face a
> binary decision: do you cache that tree or do you not?  Your time starts
> now.  What is your answer?

Let's turn this around.

The user has been accessing the kernel tree over and over
again, for hours on end (compile testing a patch).  Along
comes a backup program, that tells you to evict the whole
thing from the cache.

What do you do?

How can you make global policy decisions based on the intent
of one program?

Only the kernel knows the state of the whole system and has
observed the behaviour of all the processes. One process has
no idea what the other processes in the system are doing.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 15:40:42 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> 
> > It is to address the "waah, backups fill my memory with pagecache" and the
> > "waah, updatedb swapped everything out" and the "waah, copying a DVD
> > gobbled all my memory" problems.
> 
> By removing pressure from the page cache, you'll only allow updatedb
> to grow the inode and dentry caches larger than before.

Well duh.

That's a two-order-of-magnitude lesser problem and only affects one of many
problematic workloads.

> I am sick and tired of the "this is hard, let userspace do it" attitude.

Anything you try to do in-kernel will catastrophically screw up some
workloads.  You don't have a chance of getting this right.

You are the kernel.  The user just read an entire kernel tree.  You face a
binary decision: do you cache that tree or do you not?  Your time starts
now.  What is your answer?

Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Andrew Morton wrote:

> It is to address the "waah, backups fill my memory with pagecache" and the
> "waah, updatedb swapped everything out" and the "waah, copying a DVD
> gobbled all my memory" problems.


By removing pressure from the page cache, you'll only allow updatedb
to grow the inode and dentry caches larger than before.

I am sick and tired of the "this is hard, let userspace do it" attitude.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

userspace pagecache management tool

2007-03-03 Thread Andrew Morton


I've uploaded to http://userweb.kernel.org/~akpm/pagecache-management/ a
little tool which permits the management of the pagecache usage of
arbitrary applications.  Effectively it prevents the targeted application
from using any pagecache at all.

It is to address the "waah, backups fill my memory with pagecache" and the
"waah, updatedb swapped everything out" and the "waah, copying a DVD
gobbled all my memory" problems.


Although it is little more than a proof-of-concept it seems to be fairly
useful.  When running

pagecache-management.sh dd if=100-mb-file of=foo
or
pagecache-management.sh cp -a /usr/src/linux-2.6.20 /usr/src/foo

the amount of pagecache in the machine is pretty much unaltered.  Maybe a
megabyte of additional cache in the second case, because of ext3 indirect
blocks.


The tool uses an LD_PRELOAD hack to intercept glibc's read(), pread(),
write(), pwrite(), close() and dup2() functions.  pagecache control is done
via posix_fadvise() and sync_file_range().
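
The drop-behind idea can be sketched without any LD_PRELOAD machinery.  The
Python below is not the tool: it is a plain copy loop that applies the same
two steps to each chunk.  Dirty pages cannot be freed, so each written range
is pushed to disk first and only then marked POSIX_FADV_DONTNEED.  Two
assumptions to flag: os.fdatasync() stands in for sync_file_range(), which
as far as I know Python's os module does not wrap (and fdatasync() is the
slower of the two), and the function name and 1 MiB chunk size are arbitrary.

```python
import os

CHUNK = 1 << 20  # 1 MiB per step; an arbitrary choice for this sketch


def copy_without_caching(src_path, dst_path, chunk_size=CHUNK):
    """Copy src to dst, asking the kernel to drop pages behind the copy.

    Returns the number of bytes copied.
    """
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    offset = 0
    try:
        while True:
            data = os.read(src, chunk_size)
            if not data:
                break
            view = memoryview(data)
            while view:                       # handle short writes
                written = os.write(dst, view)
                view = view[written:]
            # Write the dirty pages back so they become droppable.
            # Stand-in for sync_file_range(); see the note above.
            os.fdatasync(dst)
            # Ask the kernel to free the pages we have just finished with.
            os.posix_fadvise(dst, offset, len(data), os.POSIX_FADV_DONTNEED)
            os.posix_fadvise(src, offset, len(data), os.POSIX_FADV_DONTNEED)
            offset += len(data)
    finally:
        os.close(src)
        os.close(dst)
    return offset
```

Dropping the source range as well is the "invalidate both input and output"
policy; leaving out the second posix_fadvise() call gives the
"invalidate only the output" variant.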

btw, for a while I was using fdatasync() on close(), but it was slow,
because fdatasync() has to run an ext3 commit to commit the metadata. 
sync_file_range() doesn't do that, and the copy-a-kernel-tree testcase sped
up by a factor of five.  So sync_file_range() rocks, but the powerpc guys
haven't wired it up yet.


There is much more which could be done to make this code smarter, but I
think the lesson here is that we can produce a far, far better result doing
this work in userspace than we could ever hope to do with an in-kernel
implementation.  There are some enhancement suggestions in the
documentation file.


It would be good if someone could turn this into a real product, get it fed
into distros.  Once the design is settled we should look at moving all the
functionality into glibc itself, IMO, and get rid of the LD_PRELOAD trick.

It might help if the kernel offered APIs which permit userspace to query
the number of resident pages in a file (well, actually it already does,
kind-of: mincore()) and the ability to query the number of dirty pages in a
file, etc.  I'd be reluctant to tie the kernel ABI too closely to the
current pagecache implementation and data structures, but we can look at
these things.

Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Andrew Morton wrote:


Well, backup programs are a unique case.  Let's say instead that the user
has just generated a 600MB ISO image.

The kernel *just doesn't know* whether the user will next try to read the
kernel tree or will next try to read that ISO image.

That, Rik, is my point, and is the entire point of this work.


I still don't understand why "the backup program flushed my data out
of the cache with POSIX_FADV_DONTNEED" is an improvement over "the
backup program flushed my data out of the cache by reading other files".

Your code may be useful for a few specialized situations, but I don't
see it actually fixing most of the examples you gave in your
announcement, except for the DVD copying one.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel [EMAIL PROTECTED] wrote:

  It is *not* a global instruction.  It uses setenv, so the user's policy
  affects only the target process and its forked children.
 
 ... and all other processes accessing the same file(s)!
 
 Your library and the system calls may be limited to one process,
 but the consequences are global.

Yes.  So what?  If the user wants to go and evict libc.so from pagecache
then he can do so - the kernel has provided syscalls with which this can be
done for at least seven years.  Bad user, shouldn't do that.
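
For reference, a minimal sketch of the kind of call meant here: asking the
kernel to drop a file's cached pages with posix_fadvise(POSIX_FADV_DONTNEED).
It is written in Python for brevity rather than C, the helper name
drop_file_cache is made up for this illustration, and the advice is only a
hint, so dirty or mapped pages may well stay in pagecache.

```python
import os


def drop_file_cache(path):
    """Ask the kernel to drop the cached pages of `path` (hypothetical helper).

    Returns the file size in bytes. The advice is a hint, not a guarantee.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # offset=0 and len=0 mean "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return size
    finally:
        os.close(fd)
```

Because pagecache is shared, the effect is not limited to the calling
process: anyone else reading the same file will have to go back to disk.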


Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Andrew Morton wrote:

On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel [EMAIL PROTECTED] wrote:


It is *not* a global instruction.  It uses setenv, so the user's policy
affects only the target process and its forked children.

... and all other processes accessing the same file(s)!

Your library and the system calls may be limited to one process,
but the consequences are global.


Yes.  So what?  If the user wants to go and evict libc.so from pagecache
then he can do so - the kernel has provided syscalls with which this can be
done for at least seven years.  Bad user, shouldn't do that.


Are you saying the user should not use your script with their
backup program?

Then what's the point?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
 
  Well, backup programs are a unique case.  Let's say instead that the user
  has just generated a 600MB ISO image.
  
  The kernel *just doesn't know* whether the user will next try to read the
  kernel tree or will next try to read that ISO image.
  
  That, Rik, is my point, and is the entire point of this work.
 
 I still don't understand why "the backup program flushed my data out
 of the cache with POSIX_FADV_DONTNEED" is an improvement over "the
 backup program flushed my data out of the cache by reading other files".

Oh.  Well, yes, if the user elected to instruct the backup program to
invalidate both its input files and its output files and if it's a full
dump, you end up with nothing in pagecache.  Possibly a more sensible
setting would be to invalidate only the output.
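
A rough sketch of what "invalidate only the output" could look like, in
Python for brevity.  The helper name write_without_caching is made up for
this illustration, and os.fdatasync() stands in for sync_file_range(), which
Python's os module does not expose; treat it as an outline under those
assumptions rather than what the tool actually does.

```python
import os


def write_without_caching(path, chunks):
    """Write `chunks` to `path`, then ask the kernel to drop its pagecache.

    Hypothetical helper. Returns the number of bytes written.
    """
    total = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for chunk in chunks:
            view = memoryview(chunk)
            while len(view) > 0:
                # os.write() may write fewer bytes than requested.
                written = os.write(fd, view)
                view = view[written:]
                total += written
        # Dirty pages cannot be dropped, so push them to disk first.
        os.fdatasync(fd)
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```

The input files are left alone here, so whatever was cached before the copy
started is still subject to ordinary reclaim only.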

But having some batch program come in from the side and perform a bulk read
of your present working set isn't very common.

 Your code may be useful for a few specialized situations,

That's quite wrong.  It is useful for a great number of well-known problem
scenarios, all of which are *already* specialized situations.  It's *you*
who is chasing down the 1% scenarios and portraying them as general problems.
Backups only happen once in 24 hours, for example.

 but I don't
 see it actually fixing most of the examples you gave in your
 announcement, except for the DVD copying one.
 

I don't know how much benefit it will provide for the updatedb problem - I
expect it'll help sometimes.  otoh maybe it'll worsen the existing slab internal
fragmentation problem, dunno.

But the other scenarios it solves completely and optimally.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 17:28:35 -0500 Rik van Riel [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
  On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel [EMAIL PROTECTED] wrote:
  
  It is *not* a global instruction.  It uses setenv, so the user's policy
  affects only the target process and its forked children.
  ... and all other processes accessing the same file(s)!
 
  Your library and the system calls may be limited to one process,
  but the consequences are global.
  
  Yes.  So what?  If the user wants to go and evict libc.so from pagecache
  then he can do so - the kernel has provided syscalls with which this can be
  done for at least seven years.  Bad user, shouldn't do that.
 
 Are you saying the user should not use your script with their
 backup program?

No.

This is getting silly.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel [EMAIL PROTECTED] wrote:

 backup program

A suitable policy for a backup program would probably be to invalidate any
output file(s) and to invalidate those pages of the input files which were
not in cache when the backup program first opened those files.  That way
the backup program will have no effect on the cache state, except for the
race situation where someone read an uncached file while the backup program
was reading from it too.

This can be added in an hour or two with no kernel changes (use mincore).


Re: userspace pagecache management tool

2007-03-03 Thread Ray Lee


On 3/3/07, Andrew Morton [EMAIL PROTECTED] wrote:

It is to address the "waah, backups fill my memory with pagecache" and the
"waah, updatedb swapped everything out" and the "waah, copying a DVD
gobbled all my memory" problems.


Is the updatedb problem really due to pagecache?


When running

pagecache-management.sh dd if=100-mb-file of=foo
or
pagecache-management.sh cp -a /usr/src/linux-2.6.20 /usr/src/foo

the amount of pagecache in the machine is pretty much unaltered.  Maybe a
megabyte of additional cache in the second case, because of ext3 indirect
blocks.


[EMAIL PROTECTED]:~/work/home/pagecache-management$ grep ext3_i /proc/slabinfo; ./pagecache-management.sh sudo updatedb; grep ext3_i /proc/slabinfo
ext3_inode_cache   21024  23722   1584    2    1 : tunables   24   12    0 : slabdata  11861  11861      0
ext3_inode_cache   41332  41332   1584    2    1 : tunables   24   12    0 : slabdata  20666  20666      0
[EMAIL PROTECTED]:~/work/home/pagecache-management$ echo $(( 1584 * (41332-21024) ))
32167872
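
The same arithmetic as a small Python sketch, in case someone wants to repeat
the measurement.  It assumes the usual slabinfo column order (name,
active_objs, num_objs, objsize, ...) and an object size of 1584 bytes, as in
the shell arithmetic above; slab_bytes is a name made up for this example.

```python
def slab_bytes(slabinfo_line):
    """Return (cache name, active_objs * objsize) for one /proc/slabinfo line."""
    fields = slabinfo_line.split()
    name, active_objs, objsize = fields[0], int(fields[1]), int(fields[3])
    return name, active_objs * objsize


before = ("ext3_inode_cache 21024 23722 1584 2 1 : tunables 24 12 0 "
          ": slabdata 11861 11861 0")
after = ("ext3_inode_cache 41332 41332 1584 2 1 : tunables 24 12 0 "
         ": slabdata 20666 20666 0")

# Growth of the ext3 inode cache across the updatedb run, in bytes.
growth = slab_bytes(after)[1] - slab_bytes(before)[1]
print(growth)  # 32167872, roughly 30.7 MiB
```

This counts active objects only; partially used slabs pin more memory than
that, so the real footprint can be larger.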

Or is there a /proc/sys/vm/* knob that can be tweaked for this
before/after the updatedb?

But yeah, I for one would happily submit patches to upstream authors
to address this there. There's no reason code should be making the
kernel guess its intention on these things.

Ray

Re: userspace pagecache management tool

2007-03-03 Thread bert hubert

On Sat, Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote:
   It is *not* a global instruction.  It uses setenv, so the user's policy
   affects only the target process and its forked children.
  
  ... and all other processes accessing the same file(s)!
  
  Your library and the system calls may be limited to one process,
  but the consequences are global.
 
 Yes.  So what?  If the user wants to go and evict libc.so from pagecache
 then he can do so - the kernel has provided syscalls with which this can be
 done for at least seven years.  Bad user, shouldn't do that.

While I agree with your sentiments that userspace can have a good idea on
how to deal with the page cache, your program does more than it claims to
do - because of how linux implements posix_fadvise.

I don't think anybody expects or desires your program to actually *evict*
the stuff from the cache you are trying to access, which happens in case the
data was in the cache prior to starting your program.

What people expect is that a solution such as you wrote it simply won't
*add* anything to the cache. They don't expect it will actually globally
*remove* stuff from the cache.

Making a backup this way would hurt even worse than usual with your
pagecache management tool if the file being backed up was still being read.

This is not your fault, but in practice, it makes your program less useful
than it could be.

One could conceivably fix that up using mincore and simply not fadvise if a
page was in core already.

Bert
-- 
http://www.PowerDNS.com  Open source, database driven DNS Software 
http://netherlabs.nl  Open and Closed source services

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 3 Mar 2007 14:58:48 -0800 Ray Lee [EMAIL PROTECTED] wrote:

 On 3/3/07, Andrew Morton [EMAIL PROTECTED] wrote:
  It is to address the "waah, backups fill my memory with pagecache" and the
  "waah, updatedb swapped everything out" and the "waah, copying a DVD
  gobbled all my memory" problems.
 
 Is the updatedb problem really due to pagecache?

It's a combination of pagecache, slab cache and of course contention for
the disk.  In my experience the latter preponderates: the disk is seeking
like mad and I can't get its attention.  Others report lots of swapout,
which will be a combination of slab and pagecache, varying degrees of each.

  When running
 
  pagecache-management.sh dd if=100-mb-file of=foo
  or
  pagecache-management.sh cp -a /usr/src/linux-2.6.20 /usr/src/foo
 
  the amount of pagecache in the machine is pretty much unaltered.  Maybe a
  megabyte of additional cache in the second case, because of ext3 indirect
  blocks.
 
 [EMAIL PROTECTED]:~/work/home/pagecache-management$ grep ext3_i /proc/slabinfo; ./pagecache-management.sh sudo updatedb; grep ext3_i /proc/slabinfo
 ext3_inode_cache   21024  23722   1584    2    1 : tunables   24   12    0 : slabdata  11861  11861      0
 ext3_inode_cache   41332  41332   1584    2    1 : tunables   24   12    0 : slabdata  20666  20666      0
 [EMAIL PROTECTED]:~/work/home/pagecache-management$ echo $(( 1584 * (41332-21024) ))
 32167872

If 32 MB is the whole lot then by eliminating pagecache, we just solved the
problem.  But perhaps you instantiated a lot more VFS cache and all you're
seeing there is the leftovers.

 Or is there a /proc/sys/vm/* knob that can be tweaked for this
 before/after the updatedb?

/proc/sys/vm/vfs_cache_pressure should help.  I don't recall anyone
reporting its effects with updatedb.
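
For anyone who wants to try it, a sketch of raising the knob just for the
duration of one command and restoring it afterwards.  The helper name
sysctl_value and the value 1000 are arbitrary choices for this illustration
(the default is 100, and larger values make the kernel prefer reclaiming
dentries and inodes); writing under /proc/sys needs root, and whether this
helps updatedb at all is untested.

```python
import contextlib

VFS_CACHE_PRESSURE = "/proc/sys/vm/vfs_cache_pressure"


@contextlib.contextmanager
def sysctl_value(value, path=VFS_CACHE_PRESSURE):
    """Temporarily write `value` to a sysctl file, then restore the old value.

    Hypothetical helper; yields the previous setting as a string.
    """
    with open(path) as f:
        old = f.read().strip()
    with open(path, "w") as f:
        f.write(str(value))
    try:
        yield old
    finally:
        # Put the previous setting back even if the command failed.
        with open(path, "w") as f:
            f.write(old)


# Possible usage, as root:
#   import subprocess
#   with sysctl_value(1000):
#       subprocess.run(["updatedb"], check=True)
```

Comparing grep ext3_i /proc/slabinfo before and after, with and without the
changed setting, would show whether it makes any difference.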

 But yeah, I for one would happily submit patches to upstream authors
 to address this there. There's no reason code should be making the
 kernel guess its intention on these things.

I think so.  We're dealing with super-special cases here and often trying
to fix those in-kernel will degrade other, often more common cases.

wonders about sys_reclaim_dentry(const char *pathname)

Re: userspace pagecache management tool

2007-03-03 Thread Erik Andersen

On Sat Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote:
 On Sat, 03 Mar 2007 17:19:00 -0500 Rik van Riel [EMAIL PROTECTED] wrote:
 
   It is *not* a global instruction.  It uses setenv, so the user's policy
   affects only the target process and its forked children.
  
  ... and all other processes accessing the same file(s)!
  
  Your library and the system calls may be limited to one process,
  but the consequences are global.
 
 Yes.  So what?  If the user wants to go and evict libc.so from pagecache
 then he can do so - the kernel has provided syscalls with which this can be
 done for at least seven years.  Bad user, shouldn't do that.

I think what Rik is pointing out is that as currently
implemented, posix_fadvise is a much bigger hammer than is
generally useful or desirable.

Using posix_fadvise on the other hand says "immediately drop this
stuff from the pagecache, consequences be damned".  If someone
else happens to be using the specified data, well too bad, they
suffer collateral damage.  Process A can, maliciously or
ignorantly, deny service to process B.

On the other hand, your old but super cool O_STREAMING patch took
a kinder gentler approach, where applications could tell the
kernel "please do not keep this file descriptor's data in cache
on my account since I will not reuse it". If someone else however
was using the same data, the kernel would keep things cached as
usual and thereby avoid doing collateral damage.

 -Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sun, 4 Mar 2007 00:01:55 +0100 bert hubert [EMAIL PROTECTED] wrote:

 On Sat, Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote:
    It is *not* a global instruction.  It uses setenv, so the user's policy
    affects only the target process and its forked children.
   
   ... and all other processes accessing the same file(s)!
   
   Your library and the system calls may be limited to one process,
   but the consequences are global.
  
  Yes.  So what?  If the user wants to go and evict libc.so from pagecache
  then he can do so - the kernel has provided syscalls with which this can be
  done for at least seven years.  Bad user, shouldn't do that.
 
 While I agree with your sentiments that userspace can have a good idea on
 how to deal with the page cache, your program does more than it claims to
 do - because of how linux implements posix_fadvise.
 
 I don't think anybody expects or desires your program to actually *evict*
 the stuff from the cache you are trying to access, which happens in case the
 data was in the cache prior to starting your program.
 
 What people expect is that a solution such as you wrote it simply won't
 *add* anything to the cache. They don't expect it will actually globally
 *remove* stuff from the cache.
 
 Making a backup this way would hurt even worse than usual with your
 pagecache management tool if the file being backed up was still being read.
 
 This is not your fault, but in practice, it makes your program less useful
 than it could be.

yup.  As I said, it's a proof-of-concept.  It's a project.  And I have about one
free femtosecond per fortnight :(

 One could conceivably fix that up using mincore and simply not fadvise if a
 page was in core already.

Yes.  Let's flesh out the backup program policy some more:

- Unconditionally invalidate output files

- on entry to read(), probe pagecache, record which pages in the range are
  present

- on entry to next read(), shoot down those pages from the previous read
  which weren't in pagecache.

- But we can do better!  LRU the file's pages up to a certain number of pages.

- Once that point is exceeded, we need to reclaim some pages.  Which
  ones?  Well, we've been observing all reads, so we can record which pages
  were referenced once, and which ones were referenced multiple times so we
  can do arbitrarily complex page aging in there.

- On close(), nuke all pages which weren't in core during open(), even if
  this app referenced them multiple times.

- If the backup program decided to read its input files with mmap we're
  rather screwed.  We can't intercept pagefaults so the best we can do is
  to restore the file's pagecache to its previous state on close().

  Or if it's really a problem, get control in there somehow and
  periodically poll the pagecache occupancy via mincore(), use madvise()
  then fadvise() to trim it back.

That all sounds reasonably doable.  It'd be pretty complex to do it
in-kernel but we could do it there too.  Problem is of course that the
above strategy is explicitly optimised for the backup program and if it's
in-kernel it becomes applicable to all other workloads.
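
The "shoot down only what wasn't already cached" step of the policy above can
be sketched as follows, in Python for brevity.  The function names and the
4096-byte page size are assumptions made for this illustration; the residency
list is whatever the earlier probe recorded, one flag per page of the read.

```python
import os

PAGE_SIZE = 4096  # assumed; os.sysconf("SC_PAGE_SIZE") gives the real value


def ranges_to_drop(first_page, was_resident, page_size=PAGE_SIZE):
    """Coalesce pages that were NOT cached before the read into byte ranges.

    first_page:   index of the first page covered by the previous read()
    was_resident: one bool per page, recorded before that read()
    Returns (offset, length) pairs for posix_fadvise(POSIX_FADV_DONTNEED).
    """
    ranges = []
    start = None
    for i, resident in enumerate(was_resident):
        if not resident and start is None:
            start = i  # a run of newly cached pages begins here
        elif resident and start is not None:
            ranges.append(((first_page + start) * page_size,
                           (i - start) * page_size))
            start = None
    if start is not None:
        ranges.append(((first_page + start) * page_size,
                       (len(was_resident) - start) * page_size))
    return ranges


def shoot_down(fd, ranges):
    """Drop only the pages this process brought into pagecache."""
    for offset, length in ranges:
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
```

For example, ranges_to_drop(0, [True, False, False, True, False]) gives
[(4096, 8192), (16384, 4096)]: pages 0 and 3 were already cached, so they are
left alone.  The race with another reader pulling in the same pages is still
there.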



Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Andrew Morton wrote:

On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel [EMAIL PROTECTED] wrote:


backup program


A suitable policy for a backup program would probably be to invalidate any
output file(s) and to invalidate those pages of the input files which were
not in cache when the backup program first opened those files.  That way
the backup program will have no effect on the cache state, except for the
race situation where someone read an uncached file while the backup program
was reading from it too.


The use-once policy we have in the kernel should work
perfectly fine for backups.  All we need to do is
actually honor the accessed bit on active page cache
pages, instead of flushing them onto the inactive
list.

What am I overlooking?


This can be added in an hour or two with no kernel changes (use mincore).


mincore only works for mmaped areas, we'd need an fincore
to work with file handles.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 19:01:01 -0500 Rik van Riel [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
  On Sat, 03 Mar 2007 17:25:30 -0500 Rik van Riel [EMAIL PROTECTED] wrote:
  
  backup program
  
  A suitable policy for a backup program would probably be to invalidate any
  output file(s) and to invalidate those pages of the input files which were
  not in cache when the backup program first opened those files.  That way
  the backup program will have no effect on the cache state, except for the
  race situation where someone read an uncached file while the backup program
  was reading from it too.
 
 The use-once policy we have in the kernel should work
 perfectly fine for backups.  All we need to do is
 actually honor the accessed bit on active page cache
 pages, instead of flushing them onto the inactive
 list.
 
 What am I overlooking?

That'll improve backups but will break other things.

To do this effectively we'd need to change the policy so that new pagecache
allocations cause no scanning of used-twice pages at all.  So that even
after many gigs of backing up, the working set is still there.

Problem is, (for example) what about the person who has 80% of memory in
used-twice state and who then reads a file or files which are 20% or more of
the size of memory, two or more times.  It'll be 100% cache misses, every time.
This will happen quite a lot.  IOW, once those pages are in used-twice state,
how does further pagecache activity ever get them _out_ of that state?  Only
by joining the used-twice page set, and that can't happen if the
used-once-so-far pages got reclaimed.
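
A toy model of that failure mode, in Python.  This is not the kernel's
algorithm: it simply assumes that used-twice pages are never scanned, so a
newly read file competes only for the remaining memory, which is run as a
plain LRU.  All names and numbers are made up for illustration.

```python
from collections import OrderedDict


def cyclic_read_misses(mem_pages, protected_pages, file_pages, passes):
    """Count cache misses when a file is read sequentially `passes` times.

    `protected_pages` of memory hold used-twice pages that are never
    reclaimed; the file only gets the rest, managed as an LRU list.
    """
    available = mem_pages - protected_pages
    lru = OrderedDict()
    misses = 0
    for _ in range(passes):
        for page in range(file_pages):
            if page in lru:
                lru.move_to_end(page)    # hit: mark as most recently used
                continue
            misses += 1
            lru[page] = True
            if len(lru) > available:
                lru.popitem(last=False)  # evict the least recently used page
    return misses
```

With 100 pages of memory, 80 of them protected, reading a 25-page file three
times gives cyclic_read_misses(100, 80, 25, 3) == 75: every access misses,
because each page is evicted before it is referenced a second time and so
never gets the chance to be promoted.  A 15-page file misses only on the
first pass.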

Doing a refault thing would help a bit, but stops working at a certain point.


  This can be added in an hour or two with no kernel changes (use mincore).
 
 mincore only works for mmaped areas, we'd need an fincore
 to work with file handles.

The LD_PRELOAD code has the fd and can mmap it to perform the pagecache
probe.

fincore() would be a bit neater, but given the rarity with which mincore()
is used it's perhaps hard to justify adding a slightly more efficient and
slightly more convenient subset of mincore().
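
A sketch of that probe, in Python via ctypes since Python has no mincore()
wrapper of its own.  The function name resident_pages is made up, loading
libc with CDLL(None) and mapping with ACCESS_COPY are just convenient choices
for this illustration, and recent kernels may report every page as resident
for files the caller neither owns nor can write, so treat the result with
some care.

```python
import ctypes
import mmap
import os

# dlopen(NULL): look mincore() up in the already loaded C library.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                          ctypes.POINTER(ctypes.c_ubyte)]
_libc.mincore.restype = ctypes.c_int


def resident_pages(fd):
    """Return one bool per page of the file: True if it is in pagecache."""
    size = os.fstat(fd).st_size
    if size == 0:
        return []
    npages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    vec = (ctypes.c_ubyte * npages)()
    # A private mapping lets ctypes take its address. Nothing is read or
    # written through it, so the probe itself faults no pages in.
    with mmap.mmap(fd, size, access=mmap.ACCESS_COPY) as m:
        first = ctypes.c_ubyte.from_buffer(m)
        rc = _libc.mincore(ctypes.addressof(first), size, vec)
        err = ctypes.get_errno()
        del first  # drop the buffer export so the mmap can be closed
    if rc != 0:
        raise OSError(err, os.strerror(err))
    # Only the lowest bit of each byte is defined.
    return [bool(b & 1) for b in vec]
```

An LD_PRELOAD wrapper would do the same thing in C on the fd it already
holds, before passing the read() through.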


Re: userspace pagecache management tool

2007-03-03 Thread Ray Lee

Andrew Morton wrote:
 wonders about sys_reclaim_dentry(const char *pathname)

Would there be any other users of it than updatedb? I'm not coming up
with much, but given that I'm not always clever, that doesn't mean much.

thinks out loud... A hypothetical on-demand file virus scanner is
going to hit already cached or about-to-be-cached entries by definition.
Perhaps some system audit daemon, such as tripwire. Well, that has the
same access patterns as updatedb, doesn't it: a directory at a time.
find, cp -a, the same.

So instead of sys_reclaim_dentry, how about extending fadvise to work on
the fd returned via opendir? And extending POSIX_FADV_NOREUSE on a file
fd to drop the dentry at close?

(Call me chicken; I just don't want to be the guy suggesting a new
syscall for a single or few users.)

 ~ ~

Alternately, there have been requests for a way for userspace to get
notification of all file events for indexing of data and metadata
(inotify, unfortunately, doesn't scale to a full filesystem). (cf.
http://lkml.org/lkml/2006/9/30/98 .) That'd allow an updatedb daemon to
keep the index up to date all the time, amortizing the cost. More
usefully, it'd allow a content indexing daemon to stay up to date all
the time, though inotify mostly works for those, I suppose.

(Hmm...
[EMAIL PROTECTED]:~$ find ~ -type d | wc -l
14067

...right. So it probably works fine for normal people.)
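
The same check as a small Python sketch: inotify is not recursive and needs
one watch per directory, so the directory count can be compared against the
per-user watch limit.  The function names are made up for this example;
/proc/sys/fs/inotify/max_user_watches is where the limit is normally exposed.

```python
import os


def count_directories(root):
    """Count directories under `root`, including `root` itself.

    Roughly `find root -type d | wc -l`; symlinked directories are not
    followed or counted.
    """
    return sum(1 for _dirpath, _dirnames, _filenames in os.walk(root))


def inotify_watch_limit(path="/proc/sys/fs/inotify/max_user_watches"):
    """Read the per-user inotify watch limit."""
    with open(path) as f:
        return int(f.read())


# Possible usage:
#   home = os.path.expanduser("~")
#   print(count_directories(home), "directories, limit", inotify_watch_limit())
```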

Hey, waitaminute. This should be a solved problem? SELinux must have
some sort of requirement for logging file access attempts. Google, at
least, implies so. Perhaps whatever it implements could be lifted into
the core kernel without dragging the rest behind it.

Dunno. Who do we CC?

Ray

Re: userspace pagecache management tool

2007-03-03 Thread Eric St-Laurent

On Sat, 2007-03-03 at 12:29 -0800, Andrew Morton wrote:


 There is much more which could be done to make this code smarter, but I
 think the lesson here is that we can produce a far, far better result doing
 this work in userspace than we could ever hope to do with an in-kernel
 implementation.  There are some enhancement suggestions in the
 documentation file.

While I think that more user space applications should use fadvise() to
avoid polluting the page cache with unneeded data, I still think the
kernel should be more fair in regard to page cache management.

Personally, I've experienced some sluggish performance after copying
large files around. Even more when using NFS. It's difficult to file a
bug report for interactive feel, I don't know how to measure it. I
just feel it's a weak aspect of the OS.

Surely it's possible to make the kernel a little bit better to protect
the page cache from abuse, from simple or badly designed applications.

Why is fairness provided by the process scheduler with good results, yet
it is somewhat easy for a process to cause slowdowns from page cache usage?

My personal opinion is that the VM seems tuned for database-type
workloads. Of course, making the page cache more fair to prevent one
process from using most of it will most likely slow down database-type
applications.

Maybe the situation should be reversed, much like the process scheduler.
Fairness by default, and the possibility to request for more system
resources by asking for them with necessary privileges. Much like
SCHED_FIFO policy.


- Eric




Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 19:14:59 -0500 Eric St-Laurent [EMAIL PROTECTED] wrote:

 On Sat, 2007-03-03 at 12:29 -0800, Andrew Morton wrote:
 
 
  There is much more which could be done to make this code smarter, but I
  think the lesson here is that we can produce a far, far better result doing
  this work in userspace than we could ever hope to do with an in-kernel
  implementation.  There are some enhancement suggestions in the
  documentation file.
 
 While I think that more user space applications should use fadvise() to
 avoid polluting the page cache with unneeded data, I still think the
 kernel should be more fair in regard to page cache management.
 
 Personally, I've experienced some sluggish performance after copying
 large files around. Even more when using NFS. It's difficult to file a
 bug report for interactive feel, I don't know how to measure it. I
 just feel it's a weak aspect of the OS.

yeah.  It'd be worth spending some time, try to come up with some set of
commands which produce an effect which you find objectionable.

 Surely it's possible to make the kernel a little bit better to protect
 the page cache from abuse, from simple or badly designed applications.
 
 Why is fairness provided by the process scheduler with good results, yet
 it is somewhat easy for a process to cause slowdowns from page cache usage?
 
 My personal opinion is that the VM seems tuned for database-type
 workloads.

VM hasn't actually been tuned *for* anything much at all, really.  Looking
back on it, much of the tweaking in there has been to avoid really bad
situations.  We put much work into avoiding the 100%, 1000% or 1%
slowdowns, but not a lot of work into providing the 15% speedups.

So it may well be that the result is not particularly great at anything,
but it's also not horridly bad at anything, either.  Or at least, it's not
supposed to be.

 Of course, making the page cache more fair to prevent one
 process from using most of it will most likely slow down database-type
 applications.

databases actually like to manage their own cache via various means.  There
are some situations in which bulk IO activities can trash the database's
cache.  That's the sort of thing which this tool is trying to help address.

 Maybe the situation should be reversed, much like the process scheduler.
 Fairness by default, and the possibility to request for more system
 resources by asking for them with necessary privileges. Much like
 SCHED_FIFO policy.

Well.  If the CPU scheduler makes a mistake, we see 5% or 15% degradations.
If the VM makes a mistake (or fails to read the operator's mind), we go to disk
and can suffer 1000% degradations or worse.


Re: userspace pagecache management tool

2007-03-03 Thread Lee Revell


On 3/3/07, Andrew Morton [EMAIL PROTECTED] wrote:

The tool uses an LD_PRELOAD hack to intercept glibc's read(), pread(),
write(), pwrite(), close() and dup2() functions.  pagecache control is done
via posix_fadvise() and sync_file_range().



How could this have any effect on the updatedb problem?  updatedb does
not read() anything, it just open()s and stat()s every file on the
disk.

Lee

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 17:02:31 -0800 Ray Lee [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
  wonders about sys_reclaim_dentry(const char *pathname)
 
 Would there be any other users of it than updatedb?

updatedb is the notorious one.

Alas, one can envisage sane workloads which really really really really
want to cache millions of dentries and inodes.  Workloads which get run
more often than once-per-day-at-4AM.  So if we fix updatedb, those
people with kernel-indistinguishable workloads get real unhappy.  We have
to pass instructions to the kernel to resolve this.

A sys_reclaim_dentry() would also need a sys_is_dentry_there() so updatedb
could restore the previous state.

Probably it'd be better to fix the well-known internal fragmentation problem
we have with the VFS caches.  That's fairly hard.

 I'm not coming up
 with much, but given that I'm not always clever, that doesn't mean much.
 
 thinks out loud... A hypothetical on-demand file virus scanner is
 going to hit already cached or about-to-be-cached entries by definition.
 Perhaps some system audit daemon, such as tripwire. Well, that has the
 same access patterns as updatedb, doesn't it: a directory at a time.
 find, cp -a, the same.
 
 So instead of sys_reclaim_dentry, how about extending fadvise to work on
 the fd returned via opendir?

That'd be pretty simple, but a) would reclaim the pagecache for the
directory and not the dentry object itself and b) will only be easy to do
for ext2 and minixfs, which maintain a separate pagecache per directory.

Yes we could do a "nuke all the dentries in this directory" thing, but
that's equivalent to sys_reclaim_dentry() in a loop.

> And extending POSIX_FADV_NOREUSE on a file
> fd to drop the dentry at close?
>
> (Call me chicken; I just don't want to be the guy suggesting a new
> syscall for a single or few users.)
>
> ~ ~
>
> Alternately, there have been requests for a way for userspace to get
> notification of all file events for indexing of data and metadata
> (inotify, unfortunately, doesn't scale to a full filesystem). (cf.
> http://lkml.org/lkml/2006/9/30/98 .)

yes, that's a disappointment.

> That'd allow an updatedb daemon to
> keep the index up to date all the time, amortizing the cost. More
> usefully, it'd allow a content indexing daemon to stay up to date all
> the time, though inotify mostly works for those, I suppose.
>
> (Hmm...
>   [EMAIL PROTECTED]:~$ find ~ -type d | wc -l
>   14067
>
> ...right. So it probably works fine for normal people.)
>
> Hey, waitaminute. This should be a solved problem? SELinux must have
> some sort of requirement for logging file access attempts. Google, at
> least, implies so. Perhaps whatever it implements could be lifted into
> the core kernel without dragging the rest behind it.

Maybe the syscall auditing code can be persuaded to spit out records which
can be used for this.

> Dunno. Who do we CC?

That's a problem.  Nobody and everybody.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Andrew Morton wrote:

> On Sat, 03 Mar 2007 19:01:01 -0500 Rik van Riel [EMAIL PROTECTED] wrote:
>
> > The use-once policy we have in the kernel should work
> > perfectly fine for backups.  All we need to do is
> > actually honor the accessed bit on active page cache
> > pages, instead of flushing them onto the inactive
> > list.
> >
> > What am I overlooking?
>
> That'll improve backups but will break other things.
>
> To do this effectively we'd need to change the policy so that new pagecache
> allocations cause no scanning of used-twice pages at all.  So that even
> after many gigs of backing up, the working set is still there.
>
> Problem is, (for example) what about the person who has 80% of memory in
> used-twice state and who then reads a file or files which are 20% or more of
> the size of memory, two or more times.  It'll be 100% cache misses, every time.
> This will happen quite a lot.  IOW, once those pages are in used-twice state,
> how does further pagecache activity ever get them _out_ of that state?  Only
> by joining the used-twice page set, and that can't happen if the
> used-once-so-far pages got reclaimed.
>
> Doing a refault thing would help a bit, but stops working at a certain point.


At what point does it stop working?

I am not asking this to be difficult, I just want to get Linux
a VM that does not need to be kludged up every time a distro
ships it to its customers.

I believe one starting point would be a concept that people
cannot shoot holes in any more.  That is no guarantee, but
as long as the concept has known holes coding it up is likely
to be a waste of time since the code will need kludges to
deal with the problems later on and we'd be back to square
one.
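
The 80%/20% case quoted above can be reproduced with a toy model.  This is
only a sketch under stated assumptions, not kernel code: memory is 100
page frames, 80 of them hold used-twice pages that new allocations never
scan, and the remaining frames form a plain LRU of used-once pages.  The
function name and the numbers are made up for illustration, and promotion
to the used-twice set is deliberately left out.

```python
from collections import OrderedDict

def sequential_rereads(inactive_capacity, file_pages, passes):
    """Count cache hits when a file is read sequentially `passes` times
    through an LRU list limited to `inactive_capacity` page frames."""
    inactive = OrderedDict()          # page index -> None, oldest first
    hits = 0
    for _ in range(passes):
        for page in range(file_pages):
            if page in inactive:
                hits += 1
                inactive.move_to_end(page)
            else:
                if len(inactive) >= inactive_capacity:
                    # Reclaim the oldest used-once page; used-twice
                    # pages are never touched in this model.
                    inactive.popitem(last=False)
                inactive[page] = None
    return hits

MEMORY_PAGES = 100
USED_TWICE_PAGES = 80                 # protected working set
free_for_new = MEMORY_PAGES - USED_TWICE_PAGES

# A 25-page file read three times through the 20 unprotected frames.
hits_protected = sequential_rereads(free_for_new, 25, 3)
# The same reads if 30 frames were available to new pagecache.
hits_roomier = sequential_rereads(30, 25, 3)
print(hits_protected, hits_roomier)
```

In this model the file only has to be slightly larger than the
unprotected part of memory for every re-read to miss, because each page
is reclaimed shortly before it is needed again.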

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 3 Mar 2007 20:16:09 -0500 Lee Revell [EMAIL PROTECTED] wrote:

> On 3/3/07, Andrew Morton [EMAIL PROTECTED] wrote:
> > The tool uses an LD_PRELOAD hack to intercept glibc's read(), pread(),
> > write(), pwrite(), close() and dup2() functions.  pagecache control is done
> > via posix_fadvise() and sync_file_range().
>
> How could this have any effect on the updatedb problem?  updatedb does
> not read() anything, it just open()s and stat()s every file on the
> disk.
 

err, good point.  _one_ of those dang things which goes off when you've
stayed up too late does a lot of pagecache IO, not sure which one.  Maybe
rpmq?  But I'd expect that to be doing direct-io.

But yes, updatedb's pagecache usage will be mainly metadata, and this tool
doesn't address metadata pagecache, although it could do so.
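
For the file-data case, the write side of such a tool can be sketched
roughly as follows.  This is an illustration under assumptions, not the
tool's actual code: write_and_drop() and the chunk size are made-up
names and values, and os.fsync() stands in for sync_file_range(), which
Python's os module does not expose.  The flush matters because dirty
pages generally cannot be discarded until they have been written back.

```python
import os
import tempfile

def write_and_drop(path, data, chunk_size=1 << 20):
    """Write `data` to `path` in chunks and ask the kernel to drop each
    chunk from pagecache once it is on disk.  Returns bytes written."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    written = 0
    try:
        view = memoryview(data)
        while written < len(view):
            n = os.write(fd, view[written:written + chunk_size])
            # Flush first: fadvise(DONTNEED) only helps for clean pages.
            os.fsync(fd)
            # posix_fadvise is not available on every platform.
            if hasattr(os, "posix_fadvise"):
                os.posix_fadvise(fd, written, n, os.POSIX_FADV_DONTNEED)
            written += n
    finally:
        os.close(fd)
    return written

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "backup.img")   # hypothetical file name
    total = write_and_drop(target, b"x" * (3 << 20))
    size_on_disk = os.path.getsize(target)
print(total, size_on_disk)
```

Flushing after every chunk is slow; it is done here only to keep the
sketch short.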

<does an updatedb on a modest system>

It instantiated 5MB of pagecache and 20MB of slab, took about one minute.

<runs all the other things in /etc/cron.daily>

rpm uses rather a lot of pagecache.

So yes, it looks like updatedb is a slab problem.
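
A pagecache-versus-slab comparison like the one above can be taken from
/proc/meminfo before and after a run.  A minimal sketch, assuming the
"Cached" and "Slab" fields are present; the helper names are invented
and the two snapshots below are fabricated sample text shaped to match
the 5MB/20MB figures, not the actual measurements.

```python
def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo field names to sizes in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def cache_growth(before, after, keys=("Cached", "Slab")):
    """Per-field growth in kB between two parsed snapshots."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}

# Fabricated snapshots for illustration only.
BEFORE = "MemTotal: 1024000 kB\nCached: 100000 kB\nSlab: 30000 kB\n"
AFTER = "MemTotal: 1024000 kB\nCached: 105120 kB\nSlab: 50480 kB\n"

growth = cache_growth(parse_meminfo(BEFORE), parse_meminfo(AFTER))
print(growth)
```

On a live system the two strings would come from reading /proc/meminfo
before and after the job.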

Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Eric St-Laurent wrote:


> While I think that more user space applications should use fadvise() to
> avoid polluting the page cache with unneeded data, I still think the
> kernel should be more fair in regard to page cache management.
>
> Personally, I've experienced some sluggish performance after copying
> large files around. Even more when using NFS. It's difficult to file a
> bug report for interactive feel, I don't know how to measure it. I
> just feel it's a weak aspect of the OS.


Fairness and interactiveness are very hard to quantify and
measure, which makes it hard to justify patches that improve
this behaviour.

On the other hand, patches that improve benchmark results
are easily justifiable, which makes it easy to merge those
even if it comes at the expense of fairness.

I think fairness and robustness are important, but I have
not figured out a way to justify such changes for upstream
inclusion.  Well, except perhaps by coming up with artificial
test cases, but that feels like cheating :)


> My personal opinion is that the VM seems tuned for database-type
> workloads. Of course, making the page cache more fair to prevent one
> process from using most of it will most likely slow down database-type
> applications.


The database people disagree.  For one, the accessed bit
on active pages gets ignored, so Linux does not
properly keep the most actively used page cache pages in
memory.

Secondly, the VM can waste quite a lot of time scanning
over the anonymous pages that it does not even want to
evict from memory.  If the VM does not plan on evicting
anonymous memory (or shared memory segments), why waste
CPU time scanning them and randomizing their LRU order?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 03 Mar 2007 20:23:07 -0500 Rik van Riel [EMAIL PROTECTED] wrote:

> Andrew Morton wrote:
> > On Sat, 03 Mar 2007 19:01:01 -0500 Rik van Riel [EMAIL PROTECTED] wrote:
> >
> > > The use-once policy we have in the kernel should work
> > > perfectly fine for backups.  All we need to do is
> > > actually honor the accessed bit on active page cache
> > > pages, instead of flushing them onto the inactive
> > > list.
> > >
> > > What am I overlooking?
> >
> > That'll improve backups but will break other things.
> >
> > To do this effectively we'd need to change the policy so that new pagecache
> > allocations cause no scanning of used-twice pages at all.  So that even
> > after many gigs of backing up, the working set is still there.
> >
> > Problem is, (for example) what about the person who has 80% of memory in
> > used-twice state and who then reads a file or files which are 20% or more of
> > the size of memory, two or more times.  It'll be 100% cache misses, every time.
> > This will happen quite a lot.  IOW, once those pages are in used-twice state,
> > how does further pagecache activity ever get them _out_ of that state?  Only
> > by joining the used-twice page set, and that can't happen if the
> > used-once-so-far pages got reclaimed.
> >
> > Doing a refault thing would help a bit, but stops working at a certain point.
>
> At what point does it stop working?

We need to store that this-page-got-reclaimed info somewhere.  I don't know
how space-efficient that is.  Did anyone ever do an implementation?

Of course, the pages need to be re-read again so there's a potential 100%
hit there, which is in fact not a huge amount in this context.  Depends how
often it occurs (all the time when refault is being useful?) versus what we
gain from it.

> I am not asking this to be difficult, I just want to get Linux
> a VM that does not need to be kludged up every time a distro
> ships it to its customers.

We have a communication problem here.  Please please please work harder to
get these problems communicated to the MM developers.  The only vendor MM
kludge of which I'm aware is a thing which Andrea is working on to address
a large-shm-segment versus bulk-IO problem (yup, database).

If you have enough of an understanding of a problem to be able to develop
and productise a fix then share that info madly, asap.

otoh, rhel-on-the-desktop-or-smaller probably isn't a huge priority, which
can be taken advantage of.

> I believe one starting point would be a concept that people
> cannot shoot holes in any more.  That is no guarantee, but
> as long as the concept has known holes coding it up is likely
> to be a waste of time since the code will need kludges to
> deal with the problems later on and we'd be back to square
> one.

You mean design it and review the design before coding it?  You'll find few
objections there.


Re: userspace pagecache management tool

2007-03-03 Thread Rik van Riel


Andrew Morton wrote:


> > > Doing a refault thing would help a bit, but stops working at a certain point.
> >
> > At what point does it stop working?
>
> We need to store that this-page-got-reclaimed info somewhere.  I don't know
> how space-efficient that is.  Did anyone ever do an implementation?


One 32 bit word per evicted page that we keep track of.
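
To put a number on that: a back-of-the-envelope sketch, assuming 4 KiB
pages (typical on x86, but an assumption here) and nothing stored beyond
the one 32 bit word.  The names are made up for the example.

```python
PAGE_SIZE = 4096    # bytes per page; assumed
ENTRY_SIZE = 4      # bytes; one 32 bit word per evicted page

def tracking_bytes(evicted_bytes_tracked):
    """Metadata needed to remember the evicted pages that held
    `evicted_bytes_tracked` bytes of file data."""
    pages = evicted_bytes_tracked // PAGE_SIZE
    return pages * ENTRY_SIZE

one_gib = 1 << 30
overhead = tracking_bytes(one_gib)   # bytes of tracking per GiB evicted
ratio = ENTRY_SIZE / PAGE_SIZE       # overhead as a fraction of the data
print(overhead, ratio)
```

Under those assumptions, remembering 1 GiB worth of evicted pages costs
1 MiB, or about 0.1% of the data tracked.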


> Of course, the pages need to be re-read again so there's a potential 100%
> hit there, which is in fact not a huge amount in this context.  Depends how
> often it occurs (all the time when refault is being useful?) versus what we
> gain from it.


At this point, when we see that a refaulted page is more
active than the coldest page on the active list, we can
also immediately shrink the active list.  That gives the
next inactive page a better chance to get promoted before
it gets evicted.
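
A rough sketch of how such a refault check could look.  Everything here
is hypothetical: the EvictionLog class, the mixing constant, and above
all the rule "refault distance shorter than the active list means the
page is hotter than the coldest active page" are assumptions made for
illustration, not the design on the wiki or anything in the kernel.  A
Python dict also uses far more than one word per entry; only the key is
squashed to 32 bits.

```python
from collections import OrderedDict

class EvictionLog:
    """Bounded record of recently evicted pages (illustrative only)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.evictions = 0            # running count of evictions
        self.log = OrderedDict()      # 32-bit key -> eviction count

    @staticmethod
    def _key(mapping_id, index):
        # Arbitrary mixing of (file, page offset) into one 32-bit word.
        return (mapping_id * 2654435761 + index) & 0xFFFFFFFF

    def record_eviction(self, mapping_id, index):
        self.evictions += 1
        key = self._key(mapping_id, index)
        self.log.pop(key, None)
        self.log[key] = self.evictions
        if len(self.log) > self.max_entries:
            self.log.popitem(last=False)   # forget the oldest eviction

    def refault_distance(self, mapping_id, index):
        """Evictions since this page was evicted, or None if forgotten."""
        stamp = self.log.pop(self._key(mapping_id, index), None)
        return None if stamp is None else self.evictions - stamp

def should_shrink_active(distance, active_pages):
    """Assumed heuristic: a short refault distance suggests the page is
    hotter than the coldest page on the active list."""
    return distance is not None and distance < active_pages

log = EvictionLog(max_entries=1000)
log.record_eviction(1, 0)             # page 0 of file 1 is evicted
for i in range(1, 6):
    log.record_eviction(1, i)         # five more evictions follow
d = log.refault_distance(1, 0)        # page 0 faults back in
shrink = should_shrink_active(d, active_pages=100)
print(d, shrink)
```

Once the log has forgotten a page, the refault goes unnoticed, which is
one place where a scheme like this stops helping.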


> > I am not asking this to be difficult, I just want to get Linux
> > a VM that does not need to be kludged up every time a distro
> > ships it to its customers.
>
> We have a communication problem here.  Please please please work harder to
> get these problems communicated to the MM developers.  The only vendor MM
> kludge of which I'm aware is a thing which Andrea is working on to address
> a large-shm-segment versus bulk-IO problem (yup, database).
>
> If you have enough of an understanding of a problem to be able to develop
> and productise a fix then share that info madly, asap.


The problem is that most of the distro patches are
kludges, which we would rather not see again in
future kernels.  They tend to work around the problem,
instead of being a proper fix, since reorganizing the
VM in the middle of a release is not an option.

However, incremental small-to-medium changes might
be an option for the upstream kernel, if you are
interested.


> > I believe one starting point would be a concept that people
> > cannot shoot holes in any more.  That is no guarantee, but
> > as long as the concept has known holes coding it up is likely
> > to be a waste of time since the code will need kludges to
> > deal with the problems later on and we'd be back to square
> > one.
>
> You mean design it and review the design before coding it?  You'll find few
> objections there.


Few objections, but sadly also very few people interested in
actually reviewing the design :(

If you can find holes in http://linux-mm.org/PageReplacementDesign
please let me know :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: userspace pagecache management tool

2007-03-03 Thread Lee Revell


On 3/3/07, Andrew Morton [EMAIL PROTECTED] wrote:

> But yes, updatedb's pagecache usage will be mainly metadata, and this tool
> doesn't address metadata pagecache, although it could do so.



With no kernel changes?  How?  I can't find an equivalent API to
posix_fadvise() for metadata.

Lee

Re: userspace pagecache management tool

2007-03-03 Thread Andrew Morton

On Sat, 3 Mar 2007 21:35:59 -0500 Lee Revell [EMAIL PROTECTED] wrote:

> On 3/3/07, Andrew Morton [EMAIL PROTECTED] wrote:
> > But yes, updatedb's pagecache usage will be mainly metadata, and this tool
> > doesn't address metadata pagecache, although it could do so.
>
> With no kernel changes?  How?  I can't find an equivalent API to
> posix_fadvise() for metadata.
 

We can use mincore and fadvise against /dev/sda1, too.

mincore's linear search would hurt but you could just run fadvise
regularly.  A lot of the blockdev pagecache is pretty useless anyway: we've
already copied much of it into dentries and inodes, and some of ext2/3/4's
pagecache is already pinned by the fs.
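
The "just run fadvise regularly" part might look something like this.
A sketch only: drop_cached_pages() is a made-up helper, the demo runs
against a temporary file rather than a real block device, and pointing
it at something like /dev/sda1 would need read permission on the device.
The kernel treats the call as advice, so pinned or dirty pages may well
stay in memory.

```python
import os
import tempfile

def drop_cached_pages(path):
    """Ask the kernel to drop clean pagecache for `path` (a regular
    file or a block device).  Returns False if this platform's Python
    has no os.posix_fadvise, True once the advice has been issued."""
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # Offset 0 with length 0 covers the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return True

# Demo against a scratch file; the contents must survive the call.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(b"metadata block" * 1024)
    scratch_path = scratch.name
issued = drop_cached_pages(scratch_path)
with open(scratch_path, "rb") as f:
    contents = f.read()
os.unlink(scratch_path)
print(issued, len(contents))
```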
