Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-06 Thread Mike Galbraith

On Tue, 6 Mar 2001, Marcelo Tosatti wrote:

> On Fri, 2 Mar 2001, Mike Galbraith wrote:
>
> > On Thu, 1 Mar 2001, Rik van Riel wrote:
> >
> > > > > The merging at the elevator level only works if the requests sent to
> > > > > it are right next to each other on disk. This means that randomly
> > > > > sending stuff to disk really DOES DESTROY PERFORMANCE and there's
> > > > > nothing the elevator could ever hope to do about that.
> > > >
> > > > True to some (very real) extent because of the limited buffering
> > > > of requests.  However, I cannot find any useful information
> > > > that the vm is using to guarantee that it does not destroy
> > > > performance, by your own definition.
> > >
> > > Indeed. IMHO we should fix this by putting explicit IO
> > > clustering in the ->writepage() functions.
> >
> > I notice there's a patch sitting in my mailbox.. think I'll go read
> > it and think (grunt grunt;) about this issue some more.
>
> Mike,
>
> One important piece of information which is not currently being
> considered by page_launder() is the dirty buffers watermark.
>
> In general, it should not try to avoid writing dirty pages if we're above
> the dirty buffers watermark.

Agreed in theory.. I'll go try to measure.

-Mike




Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-06 Thread Marcelo Tosatti



On Fri, 2 Mar 2001, Mike Galbraith wrote:

> On Thu, 1 Mar 2001, Rik van Riel wrote:
> 
> > > > The merging at the elevator level only works if the requests sent to
> > > > it are right next to each other on disk. This means that randomly
> > > > sending stuff to disk really DOES DESTROY PERFORMANCE and there's
> > > > nothing the elevator could ever hope to do about that.
> > >
> > > True to some (very real) extent because of the limited buffering
> > > of requests.  However, I cannot find any useful information
> > > that the vm is using to guarantee that it does not destroy
> > > performance, by your own definition.
> >
> > Indeed. IMHO we should fix this by putting explicit IO
> > clustering in the ->writepage() functions.
> 
> I notice there's a patch sitting in my mailbox.. think I'll go read
> it and think (grunt grunt;) about this issue some more.

Mike, 

One important piece of information which is not currently being
considered by page_launder() is the dirty buffers watermark.

In general, it should not try to avoid writing dirty pages if we're above
the dirty buffers watermark.
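
As a rough sketch of this policy (the names, the structure and the
threshold below are illustrative only, not the 2.4 kernel's code):

/*
 * Toy model of the decision: page_launder() may defer writing dirty
 * pages to spare the disk, but only while dirty buffer memory is still
 * below a watermark.
 */
#include <stdbool.h>

struct vm_state {
        unsigned long dirty_buffer_pages;   /* pages with dirty buffers */
        unsigned long total_buffer_pages;   /* all buffer-cache pages   */
        unsigned int  dirty_watermark_pct;  /* e.g. 40                  */
};

static bool above_dirty_watermark(const struct vm_state *vm)
{
        if (vm->total_buffer_pages == 0)
                return false;
        return vm->dirty_buffer_pages * 100 >
               vm->total_buffer_pages * vm->dirty_watermark_pct;
}

/* Write this dirty page now, or skip it in the hope of a clean one? */
static bool should_write_dirty_page(const struct vm_state *vm,
                                    bool memory_pressure)
{
        if (above_dirty_watermark(vm))
                return true;        /* too much dirty data: stop deferring */
        return memory_pressure;     /* otherwise only write when we must   */
}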




Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-02 Thread Rik van Riel

On 1 Mar 2001, Linus Torvalds wrote:
> In article <[EMAIL PROTECTED]>,
> Rik van Riel  <[EMAIL PROTECTED]> wrote:
> >
> >I haven't tested it yet for a number of reasons. The most
> >important one is that the FreeBSD people have been playing
> >with this thing for a few years now and Matt Dillon has
> >told me the result of their tests ;)
>
> Note that the Linux VM is certainly different enough that I
> doubt the comparisons are all that valid. Especially actual
> virtual memory mapping is basically from another planet
> altogether, and heuristics that are appropriate for *BSD may not
> really translate all that well.

The main difference is that under Linux the size of the
inactive list is dynamic, while under FreeBSD the system
always tries to keep a (very) large inactive list around.

I'm not sure if, or how, this would influence the percentage
of dirty pages on the inactive list or how often we'd need to
flush something to disk as opposed to reclaiming clean pages.

> I'll take numbers over talk any day.  At least Mike had numbers,

The only number I saw when reading over this thread was that
Mike found that under one workload he tested the Linux kernel
ended up doing IO anyway about 2/3rds of the time.

This would also mean we'd be able to _avoid_ IO 1/3rd of the
time ;)

> In short, please don't argue against numbers.

I'm not arguing against his numbers, all I want to know is
if the patch has the same positive effect on other workloads
as well...

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/   http://distro.conectiva.com/




Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-01 Thread Mike Galbraith

On Thu, 1 Mar 2001, Rik van Riel wrote:

> > > The merging at the elevator level only works if the requests sent to
> > > it are right next to each other on disk. This means that randomly
> > > sending stuff to disk really DOES DESTROY PERFORMANCE and there's
> > > nothing the elevator could ever hope to do about that.
> >
> > True to some (very real) extent because of the limited buffering
> > of requests.  However, I cannot find any useful information
> > that the vm is using to guarantee that it does not destroy
> > performance, by your own definition.
>
> Indeed. IMHO we should fix this by putting explicit IO
> clustering in the ->writepage() functions.

I notice there's a patch sitting in my mailbox.. think I'll go read
it and think (grunt grunt;) about this issue some more.

Thanks for the input Rik.  I appreciate it.

-Mike




Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-01 Thread Mike Galbraith

On Thu, 1 Mar 2001, Chris Evans wrote:

> Oh dear.. not more "vm design by waving hands in the air". Come on people,
> improve the vm by careful profiling, tweaking and benching, not by
> throwing random patches in that seem cool in theory.

Excuse me.. we're trying to have a _constructive_ conversation here.

-Mike




Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-01 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Rik van Riel  <[EMAIL PROTECTED]> wrote:
>
>I haven't tested it yet for a number of reasons. The most
>important one is that the FreeBSD people have been playing
>with this thing for a few years now and Matt Dillon has
>told me the result of their tests ;)

Note that the Linux VM is certainly different enough that I doubt the
comparisons are all that valid. Especially actual virtual memory mapping
is basically from another planet altogether, and heuristics that are
appropriate for *BSD may not really translate all that well.

I'll take numbers over talk any day.  At least Mike had numbers, and
possible explanations for them. He also removed more code than he added,
which is always a good sign. 

In short, please don't argue against numbers. 

Linus



Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-01 Thread Rik van Riel

On Thu, 1 Mar 2001, Chris Evans wrote:
> On Thu, 1 Mar 2001, Rik van Riel wrote:
>
> > True. I think we want something in-between our ideas...
> ^^^
> > a while. This should make it possible for the disk reads to
> ^^
>
> Oh dear.. not more "vm design by waving hands in the air". Come
> on people, improve the vm by careful profiling, tweaking and
> benching, not by throwing random patches in that seem cool in
> theory.

Actually, this was more of "vm design by looking at what
the FreeBSD folks did, why it didn't work and how they
fixed it after 2 years of testing various things".

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/   http://distro.conectiva.com/




Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-01 Thread Marcelo Tosatti


On Thu, 1 Mar 2001, Chris Evans wrote:

> 
> On Thu, 1 Mar 2001, Rik van Riel wrote:
> 
> > True. I think we want something in-between our ideas...
> ^^^
> > a while. This should make it possible for the disk reads to
> ^^
> 
> Oh dear.. not more "vm design by waving hands in the air". Come on people,
> improve the vm by careful profiling, tweaking and benching, not by
> throwing random patches in that seem cool in theory.

OTOH, "careful profiling, tweaking and benching" are always limited to a
number workloads.






Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-01 Thread Chris Evans


On Thu, 1 Mar 2001, Rik van Riel wrote:

> True. I think we want something in-between our ideas...
^^^
> a while. This should make it possible for the disk reads to
^^

Oh dear.. not more "vm design by waving hands in the air". Come on people,
improve the vm by careful profiling, tweaking and benching, not by
throwing random patches in that seem cool in theory.

Cheers
Chris




Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-01 Thread Alan Cox

> Except that your code throws the random junk at the elevator all
> the time, while my code only bothers the elevator every once in
> a while. This should make it possible for the disk reads to
> continue with less interruptions.

Think about it this way: throwing the stuff at the I/O layer is saying
'please make this go away'. That's the VM decision. Scheduling the I/O is
an I/O and driver layer decision.







Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-01 Thread Alan Cox

> There is no mechanism in place that ensures that dirty pages can't
> get out of control, and they do in fact get out of control, and it
> is exacerbated (mho) by attempting to define 'too much I/O' without
> any information to base this definition upon.

I think this is a good point. If you do 'too much I/O' then the I/O gets
throttled by submit_bh(). The block I/O layer knows about 'too much I/O'.
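
A minimal sketch of that back-pressure (the queue depth and names are
invented for the illustration, not the 2.4 block layer API):

#include <stdbool.h>

#define QUEUE_DEPTH 128                 /* illustrative bound */

struct blk_queue {
        unsigned int in_flight;         /* requests handed to the driver */
};

/* Try to queue one more request; false means the submitter must wait. */
static bool try_submit(struct blk_queue *q)
{
        if (q->in_flight >= QUEUE_DEPTH)
                return false;           /* throttled: wait for completions */
        q->in_flight++;
        return true;
}

/* Called from the driver's completion path, making room for new I/O. */
static void complete_request(struct blk_queue *q)
{
        if (q->in_flight > 0)
                q->in_flight--;
}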




Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-01 Thread Rajagopal Ananthanarayanan

Rik van Riel wrote:

[ ... ]

> Except that your code throws the random junk at the elevator all
> the time, while my code only bothers the elevator every once in
> a while. This should make it possible for the disk reads to
> continue with less interruptions.
> 

Couldn't agree with you more. The elevator does a decent job
these days, but higher level clustering could do more ...

[ ...]

> Indeed. IMHO we should fix this by putting explicit IO
> clustering in the ->writepage() functions.

Enhancing writepage() to perform clustering is the first step.
In addition you want entities (kupdated, kswapd, et al.)
that currently work only with buffers to invoke writepage()
at appropriate points. Just today I sent a patch that does this,
and also combines delayed allocation, out to Al Viro for comments.
If anyone else is interested I can send it out to the list.

ananth.

--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--



Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-01 Thread Rik van Riel

On Thu, 1 Mar 2001, Mike Galbraith wrote:
> On Thu, 1 Mar 2001, Rik van Riel wrote:
> > On Thu, 1 Mar 2001, Mike Galbraith wrote:

> No no no and again no (perhaps I misread that bit).  But otoh,
> you haven't tested the patch I sent in good faith.  I sent it
> because I have thought about it.  I may be wrong in my
> interpretation of the results, but those results were thought
> about.. and they exist.

I haven't tested it yet for a number of reasons. The most
important one is that the FreeBSD people have been playing
with this thing for a few years now and Matt Dillon has
told me the result of their tests ;)

> > But if the amount of dirtied pages is _small_, it means that we can
> > allow the reads to continue uninterrupted for a while before we
> > flush all dirty pages in one go...
>
> "If wishes were horses, beggers would ride."
>
> There is no mechanysm in place that ensures that dirty pages
> can't get out of control, and they do in fact get out of
> control, and it is exaserbated (mho) by attempting to define
> 'too much I/O' without any information to base this definition
> upon.

True. I think we want something in-between our ideas...

> > Also, the elevator can only try to optimise whatever you throw at
> > it. If you throw random requests at the elevator, you cannot expect
> > it to do ANY GOOD ...
>
> This is a very good point (which I will think upon).  I ask you this
> in return.  Why do you think that the random junk you throw at the
> elevator is different than the random junk I throw at it? ;-)  I see
> no difference at all.. it's the same exact junk.

Except that your code throws the random junk at the elevator all
the time, while my code only bothers the elevator every once in
a while. This should make it possible for the disk reads to
continue with less interruptions.

> > The merging at the elevator level only works if the requests sent to
> > it are right next to each other on disk. This means that randomly
> > sending stuff to disk really DOES DESTROY PERFORMANCE and there's
> > nothing the elevator could ever hope to do about that.
>
> True to some (very real) extent because of the limited buffering
> of requests.  However, I cannot find any useful information
> that the vm is using to guarantee that it does not destroy
> performance, by your own definition.

Indeed. IMHO we should fix this by putting explicit IO
clustering in the ->writepage() functions.

Doing this, in combination with *WAITING* for dirty pages
to accumulate on the inactive list, will give us the
possibility to do more writeout of dirty data with fewer
disk seeks (and less slowdown of the reads).

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/   http://distro.conectiva.com/




Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-01 Thread Mike Galbraith

On Thu, 1 Mar 2001, Rik van Riel wrote:

> On Thu, 1 Mar 2001, Mike Galbraith wrote:
> > On Wed, 28 Feb 2001, Rik van Riel wrote:
> > > On Wed, 28 Feb 2001, Marcelo Tosatti wrote:
> > > > On Wed, 28 Feb 2001, Mike Galbraith wrote:
> > >
> > > > > That's one reason I tossed it out.  I don't _think_ it should have any
> > > > > negative effect on other loads, but a test run might find otherwise.
> > > >
> > > > Writes are more expensive than reads. Apart from the aggressive read
> > > > caching on the disk, writes have limited caching or no caching at all if
> > > > you need security (journalling, for example). (I'm not sure about write
> > > > caching details, any harddisk expert?)
> > >
> > > I suspect Mike needs to change his benchmark load a little
> > > so that it dirties only 10% of the pages (might be realistic
> > > for web and/or database loads).
> >
> > Asking the user to not dirty so many pages is wrong.  My benchmark
> > load is many compute intensive tasks which each dirty a few pages
> > while doing real work.  It would be unrealistic if it just dirtied
> > pages as fast as possible to intentionally jam up the vm, but it
> > doesn't do that.
>
> Asking you to test a different kind of workload is wrong ??

No no no and again no (perhaps I misread that bit).  But otoh, you
haven't tested the patch I sent in good faith.  I sent it because I
have thought about it.  I may be wrong in my interpretation of the
results, but those results were thought about.. and they exist.

> The kind of load I described _is_ realistic, think for example
> about ftp/www/MySQL servers...

Yes.  My favorite test load is also realistic.

> > > At that point, you should be able to see that doing writes
> > > all the time can really mess up read performance due to extra
> > > introduced seeks.
> >
> > The fact that writes are painful doesn't change the fact that data
> > must be written in order to free memory and proceed.  Besides, the
> > elevator is supposed to solve that not the allocator.. or?
>
> But if the amount of dirtied pages is _small_, it means that we can
> allow the reads to continue uninterrupted for a while before we
> flush all dirty pages in one go...

"If wishes were horses, beggers would ride."

There is no mechanysm in place that ensures that dirty pages can't
get out of control, and they do in fact get out of control, and it
is exaserbated (mho) by attempting to define 'too much I/O' without
any information to base this definition upon.

> Also, the elevator can only try to optimise whatever you throw at
> it. If you throw random requests at the elevator, you cannot expect
> it to do ANY GOOD ...

This is a very good point (which I will think upon).  I ask you this
in return.  Why do you think that the random junk you throw at the
elevator is different than the random junk I throw at it? ;-)  I see
no difference at all.. it's the same exact junk.  (it's junk because
neither of us knows that it will be optimizable.. it really is a
random bunch of pages because we have ZERO information concerning
the origins, destinations nor informational content of the pages we're
pushing.  We have no interest [only because we aren't clever enough
to be interested] in these things.)

> The merging at the elevator level only works if the requests sent to
> it are right next to each other on disk. This means that randomly
> sending stuff to disk really DOES DESTROY PERFORMANCE and there's
> nothing the elevator could ever hope to do about that.

True to some (very real) extent because of the limited buffering of
requests.  However, I cannot find any useful information that the
vm is using to guarantee that it does not destroy performance, by
your own definition.  If it's there and I'm just missing it, I'd thank
you heartily if you'd hit me upside the head with a clue-x-4 ;-)

> > > We probably want some in-between solution (like FreeBSD has today).
> > > The first time they see a dirty page, they mark it as seen, the
> > > second time they come across it in the inactive list, they flush it.
> > > This way IO is still delayed a bit and not done if there are enough
> > > clean pages around.
> >
> > (delayed write is fine, but I'll be upset if vmlinux doesn't show up
> > after I buy more ram;)
>
> Writing out of old data is a task independent of the VM. This is a
> job done by kupdate. The only thing the VM does is write pages out
> earlier when it's under memory pressure.

I was joking.

> > > Another solution would be to do some more explicit IO clustering and
> > > only flush _large_ clusters ... no need to invoke extra disk seeks
> > > just to free a single page, unless you only have single pages left.
> >
> > This sounds good.. except I keep thinking about the elevator.
> > Clusters disappear as soon as they hit the queues so clustering
> > at the vm level doesn't make any sense to me.
>
> You should think about the elevator a bit more. Feel for the poor
> thing and try to send it requests it can actually do something
> useful with ;)

I will, and I hope you can help me out with a little more food for
thought.

-Mike

Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-01 Thread Rik van Riel

On Thu, 1 Mar 2001, Mike Galbraith wrote:
> On Wed, 28 Feb 2001, Rik van Riel wrote:
> > On Wed, 28 Feb 2001, Marcelo Tosatti wrote:
> > > On Wed, 28 Feb 2001, Mike Galbraith wrote:
> >
> > > > That's one reason I tossed it out.  I don't _think_ it should have any
> > > > negative effect on other loads, but a test run might find otherwise.
> > >
> > > Writes are more expensive than reads. Apart from the aggressive read
> > > caching on the disk, writes have limited caching or no caching at all if
> > > you need security (journalling, for example). (I'm not sure about write
> > > caching details, any harddisk expert?)
> >
> > I suspect Mike needs to change his benchmark load a little
> > so that it dirties only 10% of the pages (might be realistic
> > for web and/or database loads).
>
> Asking the user to not dirty so many pages is wrong.  My benchmark
> load is many compute intensive tasks which each dirty a few pages
> while doing real work.  It would be unrealistic if it just dirtied
> pages as fast as possible to intentionally jam up the vm, but it
> doesn't do that.

Asking you to test a different kind of workload is wrong ??

The kind of load I described _is_ realistic, think for example
about ftp/www/MySQL servers...

> > At that point, you should be able to see that doing writes
> > all the time can really mess up read performance due to extra
> > introduced seeks.
>
> The fact that writes are painful doesn't change the fact that data
> must be written in order to free memory and proceed.  Besides, the
> elevator is supposed to solve that not the allocator.. or?

But if the amount of dirtied pages is _small_, it means that we can
allow the reads to continue uninterrupted for a while before we
flush all dirty pages in one go...

Also, the elevator can only try to optimise whatever you throw at
it. If you throw random requests at the elevator, you cannot expect
it to do ANY GOOD ...

The merging at the elevator level only works if the requests sent to
it are right next to each other on disk. This means that randomly
sending stuff to disk really DOES DESTROY PERFORMANCE and there's
nothing the elevator could ever hope to do about that.
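
A tiny sketch of why adjacency matters: two requests only coalesce when
one ends exactly where the other begins on disk, so scattered single-page
writes each pay for their own seek (names are illustrative, not the 2.4
elevator code):

#include <stdbool.h>

struct io_req {
        unsigned long start_sector;
        unsigned long nr_sectors;
};

/* New request continues directly after a queued one: back merge. */
static bool can_back_merge(const struct io_req *queued,
                           const struct io_req *incoming)
{
        return queued->start_sector + queued->nr_sectors ==
               incoming->start_sector;
}

/* New request ends exactly where a queued one starts: front merge. */
static bool can_front_merge(const struct io_req *queued,
                            const struct io_req *incoming)
{
        return incoming->start_sector + incoming->nr_sectors ==
               queued->start_sector;
}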

> > We probably want some in-between solution (like FreeBSD has today).
> > The first time they see a dirty page, they mark it as seen, the
> > second time they come across it in the inactive list, they flush it.
> > This way IO is still delayed a bit and not done if there are enough
> > clean pages around.
>
> (delayed write is fine, but I'll be upset if vmlinux doesn't show up
> after I buy more ram;)

Writing out of old data is a task independent of the VM. This is a
job done by kupdate. The only thing the VM does is write pages out
earlier when it's under memory pressure.

> > Another solution would be to do some more explicit IO clustering and
> > only flush _large_ clusters ... no need to invoke extra disk seeks
> > just to free a single page, unless you only have single pages left.
>
> This sounds good.. except I keep thinking about the elevator.
> Clusters disappear as soon as they hit the queues so clustering
> at the vm level doesn't make any sense to me.

You should think about the elevator a bit more. Feel for the poor
thing and try to send it requests it can actually do something
useful with ;)

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/   http://distro.conectiva.com/




Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-02-28 Thread Mike Galbraith

On Wed, 28 Feb 2001, Rik van Riel wrote:

> On Wed, 28 Feb 2001, Marcelo Tosatti wrote:
> > On Wed, 28 Feb 2001, Mike Galbraith wrote:
>
> > > That's one reason I tossed it out.  I don't _think_ it should have any
> > > negative effect on other loads, but a test run might find otherwise.
> >
> > Writes are more expensive than reads. Apart from the aggressive read
> > caching on the disk, writes have limited caching or no caching at all if
> > you need security (journalling, for example). (I'm not sure about write
> > caching details, any harddisk expert?)
>
> I suspect Mike needs to change his benchmark load a little
> so that it dirties only 10% of the pages (might be realistic
> for web and/or database loads).

Asking the user to not dirty so many pages is wrong.  My benchmark
load is many compute intensive tasks which each dirty a few pages
while doing real work.  It would be unrealistic if it just dirtied
pages as fast as possible to intentionally jam up the vm, but it
doesn't do that.

> At that point, you should be able to see that doing writes
> all the time can really mess up read performance due to extra
> introduced seeks.

The fact that writes are painful doesn't change the fact that data
must be written in order to free memory and proceed.  Besides, the
elevator is supposed to solve that not the allocator.. or?

> We probably want some in-between solution (like FreeBSD has today).
> The first time they see a dirty page, they mark it as seen, the
> second time they come across it in the inactive list, they flush it.
> This way IO is still delayed a bit and not done if there are enough
> clean pages around.

(delayed write is fine, but I'll be upset if vmlinux doesn't show up
after I buy more ram;)

> Another solution would be to do some more explicit IO clustering and
> only flush _large_ clusters ... no need to invoke extra disk seeks
> just to free a single page, unless you only have single pages left.

This sounds good.. except I keep thinking about the elevator.  Clusters
disappear as soon as they hit the queues so clustering at the vm level
doesn't make any sense to me.  Where pages actually land is a function
of the fs, and that gets torn down even further by the elevator.  If
you submit pages one at a time, the plug will build clusters for you.

I don't think that the vm has the information needed to make decisions
like this nor the responsibility to do so.  It's a customer of the I/O
layers beneath it.

-Mike




Clustered IO (was: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4)

2001-02-28 Thread Rajagopal Ananthanarayanan

Rik van Riel wrote:

> 
> Another solution would be to do some more explicit IO clustering and
> only flush _large_ clusters ... no need to invoke extra disk seeks
> just to free a single page, unless you only have single pages left.

Hi Rik,

Yes, clustering IO at the higher level can improve performance.
This improvement is on top of the excellent elevator changes that
Jens Axboe has done in 2.4.2. In XFS we are doing clustering
at writepage(). There are two paths:

1. page_launder() -> writepage() -> cluster
# this path under memory pressure.
2. try_to_free_buffers() -> writepage() -> cluster
# this path under background writing as in bdflush
# but can also be used by sync() type operations that
# work with buffers than pages.

Clustering by itself (in XFS) improves write performance by about 15-20%,
and we're seeing close to raw I/O performance. With clustering
the IO requests are pegged at 1024 sectors (512K bytes)
when performing large sequential writes ...
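
As a rough sketch of the clustering idea (the structures, names and the
128-page cap standing in for 512K are illustrative, not the XFS code):

#include <stdbool.h>
#include <stddef.h>

#define CLUSTER_PAGES 128               /* 128 x 4K pages = 512K per request */

struct file_page {
        unsigned long index;            /* page offset within the file */
        bool dirty;
};

/*
 * Starting from the page writepage() was asked to write, sweep up the
 * following contiguous dirty pages so that one large request reaches
 * the elevator instead of many single-page writes.
 */
static size_t build_write_cluster(struct file_page *pages, size_t npages,
                                  size_t start, struct file_page **out)
{
        size_t n = 0;

        while (start + n < npages && n < CLUSTER_PAGES &&
               pages[start + n].dirty &&
               pages[start + n].index == pages[start].index + n) {
                out[n] = &pages[start + n];
                n++;
        }
        return n;                       /* caller issues one write for all n */
}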


ananth.


--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--



Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-02-28 Thread Rik van Riel

On Wed, 28 Feb 2001, Marcelo Tosatti wrote:
> On Wed, 28 Feb 2001, Mike Galbraith wrote:

> > That's one reason I tossed it out.  I don't _think_ it should have any
> > negative effect on other loads, but a test run might find otherwise.
>
> Writes are more expensive than reads. Apart from the aggressive read
> caching on the disk, writes have limited caching or no caching at all if
> you need security (journalling, for example). (I'm not sure about write
> caching details, any harddisk expert?)

I suspect Mike needs to change his benchmark load a little
so that it dirties only 10% of the pages (might be realistic
for web and/or database loads).

At that point, you should be able to see that doing writes
all the time can really mess up read performance due to extra
introduced seeks.

We probably want some in-between solution (like FreeBSD has today).
The first time they see a dirty page, they mark it as seen, the
second time they come across it in the inactive list, they flush it.
This way IO is still delayed a bit and not done if there are enough
clean pages around.
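
A minimal sketch of that two-pass policy (structures and names are
illustrative, not the FreeBSD code):

#include <stdbool.h>

struct page_ent {
        bool dirty;
        bool seen_dirty;                /* marked on the first encounter */
};

enum scan_action { RECLAIM, MARK, FLUSH };

/* One step of the inactive-list scan. */
static enum scan_action inactive_scan_step(struct page_ent *page)
{
        if (!page->dirty)
                return RECLAIM;         /* clean: free it right away         */
        if (!page->seen_dirty) {
                page->seen_dirty = true;
                return MARK;            /* first encounter: defer the write  */
        }
        return FLUSH;                   /* second encounter: start the write */
}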

Another solution would be to do some more explicit IO clustering and
only flush _large_ clusters ... no need to invoke extra disk seeks
just to free a single page, unless you only have single pages left.

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/   http://distro.conectiva.com/




Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-02-28 Thread Mike Galbraith

On Wed, 28 Feb 2001, Marcelo Tosatti wrote:

> On Wed, 28 Feb 2001, Mike Galbraith wrote:
>
> > > > Have you tried to use SWAP_SHIFT as 4 instead of 5 on a stock 2.4.2-ac5 to
> > > > see if the system still swaps out too much?
> > >
> > > Not yet, but will do.
>
> But what about swapping behaviour?
>
> It still swaps too much?

Yes.

(returning to study mode)

-Mike

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-02-28 Thread Marcelo Tosatti



On Wed, 28 Feb 2001, Mike Galbraith wrote:

> > > Have you tried to use SWAP_SHIFT as 4 instead of 5 on a stock 2.4.2-ac5 to
> > > see if the system still swaps out too much?
> >
> > Not yet, but will do.

But what about swapping behaviour? 

It still swaps too much? 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-02-28 Thread Marcelo Tosatti



On Wed, 28 Feb 2001, Mike Galbraith wrote:

> On Tue, 27 Feb 2001, Marcelo Tosatti wrote:
> 
> > On Tue, 27 Feb 2001, Mike Galbraith wrote:
> >
> > > What the patch does is simply to push I/O as fast as we can.. we're
> > > by definition I/O bound and _can't_ defer it under any circumstance,
> > > for in this direction lies constipation.  The only thing in the world
> > > which will make it better is pushing I/O.
> >
> > In your I/O bound case, yes. But not in all cases.
> 
> That's one reason I tossed it out.  I don't _think_ it should have any
> negative effect on other loads, but a test run might find otherwise.

Writes are more expensive than reads. Apart from the aggressive read
caching on the disk, writes have limited caching or no caching at all if
you need security (journalling, for example). (I'm not sure about write
caching details, any harddisk expert?)

On read-intensive loads, doing IO to free memory (writing pages out) will
be horribly harmful to those reads (whose pages can be freed cheaply), so it's
better to avoid the writes as much as possible.

I remember Matthew Dillon (FreeBSD VM guy) had a read-intensive case where
using a 20:1 clean/flush ratio to free pages in FreeBSD's launder routine
(at that time, IIRC, their launder routine looped twice over the inactive
dirty list looking for clean pages to throw away, and only on the third
loop would it do IO) was still a problem for disk performance
because of the writes. Yes, it sounds weird.

I suppose you're running dbench. 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-02-28 Thread Mike Galbraith

On Wed, 28 Feb 2001, Rik van Riel wrote:

> On Wed, 28 Feb 2001, Marcelo Tosatti wrote:
> > On Wed, 28 Feb 2001, Mike Galbraith wrote:

> > > That's one reason I tossed it out.  I don't _think_ it should have any
> > > negative effect on other loads, but a test run might find otherwise.

> > Writes are more expensive than reads. Apart from the aggressive read
> > caching on the disk, writes have limited caching or no caching at all if
> > you need security (journalling, for example). (I'm not sure about write
> > caching details, any harddisk expert?)

> I suspect Mike needs to change his benchmark load a little
> so that it dirties only 10% of the pages (might be realistic
> for web and/or database loads).

Asking the user to not dirty so many pages is wrong.  My benchmark
load is many compute intensive tasks which each dirty a few pages
while doing real work.  It would be unrealistic if it just dirtied
pages as fast as possible to intentionally jam up the vm, but it
doesn't do that.

> At that point, you should be able to see that doing writes
> all the time can really mess up read performance due to extra
> introduced seeks.

The fact that writes are painful doesn't change the fact that data
must be written in order to free memory and proceed.  Besides, the
elevator is supposed to solve that not the allocator.. or?

> We probably want some in-between solution (like FreeBSD has today).
> The first time they see a dirty page, they mark it as seen, the
> second time they come across it in the inactive list, they flush it.
> This way IO is still delayed a bit and not done if there are enough
> clean pages around.

(delayed write is fine, but I'll be upset if vmlinux doesn't show up
after I buy more ram;)

> Another solution would be to do some more explicit IO clustering and
> only flush _large_ clusters ... no need to invoke extra disk seeks
> just to free a single page, unless you only have single pages left.

This sounds good.. except I keep thinking about the elevator.  Clusters
disappear as soon as they hit the queues so clustering at the vm level
doesn't make any sense to me.  Where pages actually land is a function
of the fs, and that gets torn down even further by the elevator.  If
you submit pages one at a time, the plug will build clusters for you.

I don't think that the vm has the information needed to make decisions
like this nor the responsibility to do so.  It's a customer of the I/O
layers beneath it.
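
An illustrative sketch of the merging argument above, with invented names
rather than the 2.4 block layer: pages submitted one at a time are
back-merged into the previous request whenever they are physically
adjacent, so sequential submission collapses into one large request while
scattered submission stays at one request (and one seek) per page.

#include <stdio.h>

#define MAX_REQ 1024

struct request { long start, len; };	/* in 4 KB page units */

static struct request queue[MAX_REQ];
static int nr_requests;

/* submit one page; merge with the previous request when adjacent on disk */
static void submit_page(long pgno)
{
	if (nr_requests) {
		struct request *last = &queue[nr_requests - 1];

		if (last->start + last->len == pgno) {
			last->len++;	/* back merge, no new request */
			return;
		}
	}
	if (nr_requests < MAX_REQ) {
		queue[nr_requests].start = pgno;
		queue[nr_requests].len = 1;
		nr_requests++;
	}
}

int main(void)
{
	long i;

	for (i = 0; i < 256; i++)	/* pages pushed out in disk order */
		submit_page(i);
	printf("sequential submission: %d requests\n", nr_requests);

	nr_requests = 0;
	for (i = 0; i < 256; i++)	/* same pages, scattered over the disk */
		submit_page((i * 131) % 4096);
	printf("scattered submission : %d requests\n", nr_requests);
	return 0;
}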

-Mike

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-02-27 Thread Mike Galbraith

> > Have you tried to use SWAP_SHIFT as 4 instead of 5 on a stock 2.4.2-ac5 to
> > see if the system still swaps out too much?
>
> Not yet, but will do.

Didn't help.  (It actually reduced throughput a little)

-Mike

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-02-27 Thread Mike Galbraith

On Tue, 27 Feb 2001, Marcelo Tosatti wrote:

> On Tue, 27 Feb 2001, Mike Galbraith wrote:
>
> > What the patch does is simply to push I/O as fast as we can.. we're
> > by definition I/O bound and _can't_ defer it under any circumstance,
> > for in this direction lies constipation.  The only thing in the world
> > which will make it better is pushing I/O.
>
> In your I/O bound case, yes. But not in all cases.

That's one reason I tossed it out.  I don't _think_ it should have any
negative effect on other loads, but a test run might find otherwise.

> > What we do right now (as kswapd) is scan a tiny portion of the active
> > page list, and then push an arbitrary amount of swap because we can't
> > possibly deactivate enough pages if our shortage is larger than the
> > search area (nr_active_pages >> 6).. repeat until give-up time.  In
> > practice here (test load, but still..), that leads to pushing soon
> > to be unneeded [supposition!] pages into swap a full 3/4 of the time.

(correction: it's 2/3 of the time not 3/4.. off by one bug in fingers;)

> Have you tried to use SWAP_SHIFT as 4 instead of 5 on a stock 2.4.2-ac5 to
> see if the system still swaps out too much?

Not yet, but will do.

-Mike

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-02-27 Thread Marcelo Tosatti


On Tue, 27 Feb 2001, Mike Galbraith wrote:

> On Tue, 27 Feb 2001, Rik van Riel wrote:
> 
> > On Tue, 27 Feb 2001, Mike Galbraith wrote:
> >
> > > Attempting to avoid doing I/O has been harmful to throughput here
> > > ever since the queueing/elevator woes were fixed. Ever since then,
> > > tossing attempts at avoidance has improved throughput markedly.
> > >
> > > IMHO, any patch which claims to improve throughput via code deletion
> > > should be worth a little eyeball time.. and maybe even a test run ;-)
> > >
> > > Comments welcome.
> >
> > Before even thinking about testing this thing, I'd like to
> > see some (detailed?) explanation from you why exactly you
> > think the changes in this patch are good and how + why they
> > work.
> 
> Ok.. quite reasonable ;-)
> 
> First and foremost:  What does refill_inactive_scan do?  It places
> work to do on a list.. and nothing more.  It frees no memory in and
> of itself.. none (but we count it as freed.. that's important). It
> is the amount of memory we want desperately to free in the immediate
> future.  We count on it getting freed.  The only way to free I/O bound
> memory is to do the I/O.. as fast as the I/O subsystem can sync it.
> 
> This is the nut.. scan/deactivate percentages are fairly meaningless
> unless we do something about these pages.
> 
> What the patch does is simply to push I/O as fast as we can.. we're
> by definition I/O bound and _can't_ defer it under any circumstance,
> for in this direction lies constipation.  The only thing in the world
> which will make it better is pushing I/O.

In your I/O bound case, yes. But not in all cases.

> If you test the patch, you'll notice one very important thing.  The
> system no longer over-reacts.. as badly.  That's a diagnostic point.
> (On my system under my favorite page turnover rate load, I see my box
> drowning in a pool of dirty pages.. which it's not allowed to drain)
> 
> What we do right now (as kswapd) is scan a tiny portion of the active
> page list, and then push an arbitrary amount of swap because we can't
> possibly deactivate enough pages if our shortage is larger than the
> search area (nr_active_pages >> 6).. repeat until give-up time.  In
> practice here (test load, but still..), that leads to pushing soon
> to be unneeded [supposition!] pages into swap a full 3/4 of the time.

Have you tried to use SWAP_SHIFT as 4 instead of 5 on a stock 2.4.2-ac5 to
see if the system still swaps out too much?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-02-27 Thread Mike Galbraith

On Tue, 27 Feb 2001, Rik van Riel wrote:

> On Tue, 27 Feb 2001, Mike Galbraith wrote:
>
> > Attempting to avoid doing I/O has been harmful to throughput here
> > ever since the queueing/elevator woes were fixed. Ever since then,
> > tossing attempts at avoidance has improved throughput markedly.
> >
> > IMHO, any patch which claims to improve throughput via code deletion
> > should be worth a little eyeball time.. and maybe even a test run ;-)
> >
> > Comments welcome.
>
> Before even thinking about testing this thing, I'd like to
> see some (detailed?) explanation from you why exactly you
> think the changes in this patch are good and how + why they
> work.

Ok.. quite reasonable ;-)

First and foremost:  What does refill_inactive_scan do?  It places
work to do on a list.. and nothing more.  It frees no memory in and
of itself.. none (but we count it as freed.. that's important). It
is the amount of memory we want desperately to free in the immediate
future.  We count on it getting freed.  The only way to free I/O bound
memory is to do the I/O.. as fast as the I/O subsystem can sync it.

This is the nut.. scan/deactivate percentages are fairly meaningless
unless we do something about these pages.

What the patch does is simply to push I/O as fast as we can.. we're
by definition I/O bound and _can't_ defer it under any circumstance,
for in this direction lies constipation.  The only thing in the world
which will make it better is pushing I/O.

If you test the patch, you'll notice one very important thing.  The
system no longer over-reacts.. as badly.  That's a diagnostic point.
(On my system under my favorite page turnover rate load, I see my box
drowning in a pool of dirty pages.. which it's not allowed to drain)

What we do right now (as kswapd) is scan a tiny portion of the active
page list, and then push an arbitrary amount of swap because we can't
possibly deactivate enough pages if our shortage is larger than the
search area (nr_active_pages >> 6).. repeat until give-up time.  In
practice here (test load, but still..), that leads to pushing soon
to be unneeded [supposition!] pages into swap a full 3/4 of the time.
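
A back-of-the-envelope illustration of the scan-window arithmetic above;
only the >> 6 comes from the text, the other numbers are made up:

#include <stdio.h>

int main(void)
{
	long nr_active_pages = 100000;		/* ~400 MB of 4 KB active pages */
	long shortage = 4000;			/* pages we would like to free */
	long window = nr_active_pages >> 6;	/* pages examined per call */

	printf("scan window per call : %ld pages\n", window);
	printf("shortage             : %ld pages\n", shortage);
	printf("left over at best    : %ld pages, even if every scanned page "
	       "is deactivated\n", shortage - window);
	return 0;
}

With those numbers one call can deactivate at most about 1.5% of the active
list, so any shortage larger than that falls through to swap-out, which is
the behaviour described above.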

> IMHO it would be good to not apply ANY code to the stable
> kernel tree unless we understand what it does and what the
> author meant the code to do...

Yes.. I agree 100%.  I was not suggesting that this be blindly
integrated.  (I know me.. can get all cornfoosed and fsck up;)

-Mike

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-02-27 Thread Rik van Riel

On Tue, 27 Feb 2001, Mike Galbraith wrote:

> Attempting to avoid doing I/O has been harmful to throughput here
> ever since the queueing/elevator woes were fixed. Ever since then,
> tossing attempts at avoidance has improved throughput markedly.
>
> IMHO, any patch which claims to improve throughput via code deletion
> should be worth a little eyeball time.. and maybe even a test run ;-)
>
> Comments welcome.

Before even thinking about testing this thing, I'd like to
see some (detailed?) explanation from you why exactly you
think the changes in this patch are good and how + why they
work.

IMHO it would be good to not apply ANY code to the stable
kernel tree unless we understand what it does and what the
author meant the code to do...

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/   http://distro.conectiva.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[patch][rfc][rft] vm throughput 2.4.2-ac4

2001-02-27 Thread Mike Galbraith

Hi,

Attempting to avoid doing I/O has been harmful to throughput here
ever since the queueing/elevator woes were fixed. Ever since then,
tossing attempts at avoidance has improved throughput markedly.

IMHO, any patch which claims to improve throughput via code deletion
should be worth a little eyeball time.. and maybe even a test run ;-)

Comments welcome.

-Mike

--- linux-2.4.2-ac4/mm/page_alloc.c.org Mon Feb 26 11:19:27 2001
+++ linux-2.4.2-ac4/mm/page_alloc.c Tue Feb 27 10:31:10 2001
@@ -274,7 +274,7 @@
 struct page * __alloc_pages(zonelist_t *zonelist, unsigned long order)
 {
zone_t **zone;
-   int direct_reclaim = 0;
+   int direct_reclaim = 0, loop = 0;
unsigned int gfp_mask = zonelist->gfp_mask;
struct page * page;

@@ -366,7 +366,7 @@
 *   able to free some memory we can't free ourselves
 */
wakeup_kswapd();
-   if (gfp_mask & __GFP_WAIT) {
+   if (gfp_mask & __GFP_WAIT && loop) {
__set_current_state(TASK_RUNNING);
current->policy |= SCHED_YIELD;
schedule();
@@ -440,7 +440,7 @@
memory_pressure++;
try_to_free_pages(gfp_mask);
wakeup_bdflush(0);
-   if (!order)
+   if (!order || loop++ < (1 << order))
goto try_again;
}
}
--- linux-2.4.2-ac4/mm/vmscan.c.org Mon Feb 26 09:31:46 2001
+++ linux-2.4.2-ac4/mm/vmscan.c Tue Feb 27 09:04:50 2001
@@ -278,6 +278,8 @@
/* Always start by trying to penalize the process that is allocating memory */
if (mm)
retval = swap_out_mm(mm, swap_amount(mm));
+   if (retval)
+   return retval;

/* Then, look at the other mm's */
counter = (mmlist_nr << SWAP_SHIFT) >> priority;
@@ -418,8 +420,8 @@
 #define MAX_LAUNDER (1 << page_cluster)
 int page_launder(int gfp_mask, int user)
 {
-   int launder_loop, maxscan, flushed_pages, freed_pages, maxlaunder;
-   int can_get_io_locks, sync, target, shortage;
+   int maxscan, flushed_pages, freed_pages, maxlaunder;
+   int can_get_io_locks;
struct list_head * page_lru;
struct page * page;
struct zone_struct * zone;
@@ -430,15 +432,10 @@
 */
can_get_io_locks = gfp_mask & __GFP_IO;

-   target = free_shortage();
-
-   sync = 0;
-   launder_loop = 0;
maxlaunder = 0;
flushed_pages = 0;
freed_pages = 0;

-dirty_page_rescan:
spin_lock(&pagemap_lru_lock);
maxscan = nr_inactive_dirty_pages;
while ((page_lru = inactive_dirty_list.prev) != &inactive_dirty_list &&
@@ -446,6 +443,9 @@
page = list_entry(page_lru, struct page, lru);
zone = page->zone;

+   if ((user && freed_pages + flushed_pages > MAX_LAUNDER)
+   || !free_shortage())
+   break;
/* Wrong page on list?! (list corruption, should not happen) */
if (!PageInactiveDirty(page)) {
printk("VM: page_launder, wrong page on list.\n");
@@ -464,18 +464,7 @@
continue;
}

-   /*
-* Disk IO is really expensive, so we make sure we
-* don't do more work than needed.
-* Note that clean pages from zones with enough free
-* pages still get recycled and dirty pages from these
-* zones can get flushed due to IO clustering.
-*/
-   if (freed_pages + flushed_pages > target && !free_shortage())
-   break;
-   if (launder_loop && !maxlaunder)
-   break;
-   if (launder_loop && zone->inactive_clean_pages +
+   if (zone->inactive_clean_pages +
zone->free_pages > zone->pages_high)
goto skip_page;

@@ -500,14 +489,6 @@
if (!writepage)
goto page_active;

-   /* First time through? Move it to the back of the list */
-   if (!launder_loop) {
-   list_del(page_lru);
-   list_add(page_lru, &inactive_dirty_list);
-   UnlockPage(page);
-   continue;
-   }
-
/* OK, do a physical asynchronous write to swap.  */
ClearPageDirty(page);
page_cache_get(page);
@@ -517,7 +498,6 @@
/* XXX: all ->writepage()s should use nr_async_pages */
if (!PageSwapCache(page))
flushed_pages++;
-   maxlaunder--;
page_cache_release(page);
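
A simplified user-space sketch of the control flow the vmscan.c hunks above
produce (stubbed helpers, invented data, not the kernel code): one pass over
the inactive-dirty list, dirty pages written on first encounter, and the
loop bounded by free_shortage() and, for user callers, MAX_LAUNDER.
MAX_LAUNDER is a plain constant here; in the patch it is (1 << page_cluster).

#include <stdio.h>

#define MAX_LAUNDER	32		/* stand-in for (1 << page_cluster) */

struct page { int dirty; };

static int shortage = 64;		/* stub for free_shortage() */

static int free_shortage(void)
{
	return shortage;
}

static void writepage(struct page *p)	/* stub: "start an async write" */
{
	p->dirty = 0;
}

static void page_launder(struct page *list, int n, int user)
{
	int freed_pages = 0, flushed_pages = 0, i;

	for (i = 0; i < n; i++) {
		/* the patched break condition: stop when done or over budget */
		if ((user && freed_pages + flushed_pages > MAX_LAUNDER) ||
		    !free_shortage())
			break;
		if (list[i].dirty) {	/* dirty: write it on first encounter */
			writepage(&list[i]);
			flushed_pages++;
			continue;
		}
		freed_pages++;		/* clean: reclaim it right away */
		shortage--;
	}
	printf("freed %d, flushed %d, stopped at entry %d of %d\n",
	       freed_pages, flushed_pages, i, n);
}

int main(void)
{
	struct page list[100];
	int i;

	for (i = 0; i < 100; i++)
		list[i].dirty = (i % 3 == 0);	/* every third page is dirty */
	page_launder(list, 100, 1);		/* user call, bounded by MAX_LAUNDER */
	return 0;
}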

   
