Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-29 Thread Rik van Riel

On Fri, 29 Sep 2000, Andrea Arcangeli wrote:
> On Fri, Sep 29, 2000 at 11:39:18AM -0300, Rik van Riel wrote:
> > OK, good to see that we agree on the fact that we
> > should age and swapout all pages equally aggressively.
> 
> Actually I think we should start looking at the mapped stuff
> _only_ when the I/O cache aging is relevant. If the I/O cache
> aging isn't relevant there's no point in looking at the mapped
> stuff since there's cache pollution going on.

> If the cache is re-used (so if it's useful) that's a completely
> different issue and in that case unmapping potentially unused
> stuff is the right thing to do of course.

This is why I want to do:

1) equal aging of all pages in the system
2) page aging to have properties of both LRU and LFU
3) drop-behind to cope with streaming IO in a good way

and maybe:
4) move unmapped pages to the inactive_clean list for
   immediate reclaiming but put pages which are/were
   mapped on the inactive_dirty list so we keep them a
   little bit longer


The only way to reliably know if the cache is re-used a
lot is to do the page aging for unmapped and mapped pages
in the same way. If we don't do that, we won't be able to
make a sensible comparison between the activity of pages
in different places.
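
To make the idea concrete, here is a tiny user-space model of that
kind of aging (the struct, field names and constants are made up for
illustration, not the actual struct page or the real VM code):

/* Toy model of up/down page aging with both LRU and LFU character.
 * The struct, field names and constants are illustrative only; this
 * is not the 2.4 struct page or the real aging code.
 */
#include <stdio.h>

#define PAGE_AGE_START 2
#define PAGE_AGE_ADV   3	/* referenced pages gain age (the LFU part) */
#define PAGE_AGE_MAX   64

struct toy_page {
	int age;		/* page stays off the reclaim lists while > 0 */
	int referenced;		/* stands in for the pte/page accessed bit */
};

/* One aging pass, applied identically to mapped and unmapped pages. */
static void age_page(struct toy_page *p)
{
	if (p->referenced) {
		p->referenced = 0;	/* test-and-clear the accessed bit */
		p->age += PAGE_AGE_ADV;
		if (p->age > PAGE_AGE_MAX)
			p->age = PAGE_AGE_MAX;
	} else {
		p->age /= 2;		/* unreferenced pages decay (the LRU part) */
	}
}

int main(void)
{
	struct toy_page streaming = { PAGE_AGE_START, 0 };
	struct toy_page hot       = { PAGE_AGE_START, 0 };
	int pass;

	for (pass = 0; pass < 4; pass++) {
		age_page(&streaming);	/* touched once, never again */
		hot.referenced = 1;	/* keeps being touched between passes */
		age_page(&hot);
	}
	printf("streaming age=%d, hot age=%d\n", streaming.age, hot.age);
	return 0;
}

The streaming page decays to zero and becomes reclaimable after a few
passes, while the frequently referenced page accumulates age; doing
this the same way for every page is what makes the activity comparison
meaningful.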

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/




Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-29 Thread Andrea Arcangeli

On Fri, Sep 29, 2000 at 11:39:18AM -0300, Rik van Riel wrote:
> OK, good to see that we agree on the fact that we
> should age and swapout all pages equally aggressively.

Actually I think we should start looking at the mapped stuff _only_ when the
I/O cache aging is relevant. If the I/O cache aging isn't relevant there's no
point in looking at the mapped stuff since there's cache pollution going on.
It's much less costly to drop a page from the unmapped cache than to play with
pagetables, and having a slow read() is much better than having to fault into
the .text areas (because the process is designed in a way that expects read()
to block, so it may do it asynchronously or in a separate thread or whatever).
A `cp /dev/zero .` shouldn't swapout/unmap anything.

If the cache is re-used (so if it's useful) that's a completely different issue
and in that case unmapping potentially unused stuff is the right thing to do of
course.

Andrea



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-29 Thread Rik van Riel

On Thu, 28 Sep 2000, Andrea Arcangeli wrote:
> On Thu, Sep 28, 2000 at 08:16:32AM -0300, Rik van Riel wrote:
> > Andrea, I have the strong impression that your idea of
> > memory balancing is based on the idea that the OS should
> > out-smart the application instead of looking at the usage
> > pattern of the pages in memory.
> 
> Not sure what you mean with out-smart.
> 
> My only point is that the OS actually can only swapout such shm.
> If that SHM is not supposed to be swapped out and if the OS I/O
> cache has more aging than the shm cache, then the OS should
> tell the DBMS that it's time to shrink some shm pages by freeing
> them.

OK, good to see that we agree on the fact that we
should age and swapout all pages equally aggressively.

> > of the pages in question, instead of making presumptions
> > based on what kind of cache the page is in.
> 
> For the mapped pages we never make presumptions. We always check
> the accessed bit and that's the most reliable info to know if
> the page has been accessed recently (it is set by CPU accesses
> through the pte, not only during page faults or cache hits).
> With the current design pages mapped multiple times will be
> overaged a bit but this can't be fixed until we make a page->pte
> reverse lookup...

Indeed.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/




Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-28 Thread Juan J. Quintela

> "ingo" == Ingo Molnar <[EMAIL PROTECTED]> writes:

Hi

ingo> 2) introducing sys_flush(), which flushes pages from the pagecache.

Isn't mincore() supposed to be able to do that (yes, right now it is not
implemented, but the interface is there for it)?

Just curious.
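
For reference, mincore() today only reports residency; a minimal
sketch of what the existing interface gives you (the anonymous mapping
and page count are chosen arbitrarily):

/* mincore() only reports which pages of a mapping are resident in
 * core; it does not flush anything.  Mapping size is arbitrary. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	size_t len = 8 * psize;
	unsigned char vec[8];
	char *map;
	int i;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return 1;

	map[0] = 1;			/* fault in the first page only */

	if (mincore(map, len, vec) == 0)
		for (i = 0; i < 8; i++)
			printf("page %d: %s\n", i,
			       (vec[i] & 1) ? "resident" : "not resident");

	munmap(map, len);
	return 0;
}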

-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-28 Thread Andrea Arcangeli

On Thu, Sep 28, 2000 at 05:13:59PM +0200, Ingo Molnar wrote:
> Can anyone see any problems with the concept of this approach? This can be

It works only on top of a filesystem, while all the clever checkpointing stuff
is done internally by the DB (in fact it _needs_ O_SYNC when it works on the
fs).
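
For reference, the O_SYNC write path on a filesystem looks roughly
like this (a minimal sketch; the file name and block size are
placeholders):

/* Minimal O_SYNC write: the call returns only once the block has
 * reached stable storage, which the DB's checkpointing logic relies
 * on.  "datafile.dbf" and the block size are just placeholders. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char block[8192];
	int fd;

	memset(block, 0, sizeof(block));

	fd = open("datafile.dbf", O_RDWR | O_CREAT | O_SYNC, 0600);
	if (fd < 0)
		return 1;

	if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block)) {
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}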

Andrea



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-28 Thread Ingo Molnar


On Thu, 28 Sep 2000, Andrea Arcangeli wrote:

> The DBMS uses shared SCSI disks across multiple hosts on the same SCSI
> bus and synchronizes the distributed cache via TCP. Tell me how to do
> that with the OS cache and mmap.

this could be supported by:

1) mlock()-ing the whole mapping.

2) introducing sys_flush(), which flushes pages from the pagecache.

3) doing sys_msync() after dirtying a range and before sending a TCP
   event.

Whenever the DB-cache-flush-event comes over TCP, it calls sys_flush() for
that given virtual address range or file address space range. Sys_flush
flushes the page from the pagecache and unmaps the address. Whenever it's
needed again by the application it will be faulted in and read from disk.

Can anyone see any problems with the concept of this approach? This can be
used for a page-granularity distributed IO cache.

(there are some smaller problems with this approach, like mlock() on a big
range only being possible for privileged users, but that's not an issue IMO.)
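
A rough user-space sketch of that sequence (the proposed sys_flush()
does not exist, so posix_fadvise(POSIX_FADV_DONTNEED) stands in for
the pagecache-flush step here; the file name, sizes and the TCP
plumbing are placeholders):

/* Sketch of the proposed sequence for a page-granularity distributed
 * I/O cache.  The proposed sys_flush() does not exist, so
 * posix_fadvise(POSIX_FADV_DONTNEED) stands in for the "drop it from
 * the pagecache" step; a real sys_flush() would also unmap the range
 * and would have to override the mlock.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* We dirtied [off, off+len): push it to the shared disk, then the
 * caller sends a TCP invalidation event to the other hosts. */
static int publish_range(char *map, off_t off, size_t len)
{
	return msync(map + off, len, MS_SYNC);	/* step 3 of the proposal */
}

/* A peer told us [off, off+len) changed on disk: drop our cached copy
 * so the next access faults it back in from the shared disk. */
static int invalidate_range(int fd, off_t off, size_t len)
{
	return posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}

int main(void)
{
	size_t len = 1 << 20;
	char *map;
	int fd;

	fd = open("shared.dat", O_RDWR);	/* placeholder file */
	if (fd < 0)
		return 1;
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	mlock(map, len);			/* step 1 of the proposal */

	memset(map, 0xab, 4096);		/* dirty one page */
	publish_range(map, 0, 4096);
	/* ... send a TCP event for (0, 4096) to the other hosts ... */

	invalidate_range(fd, 4096, 4096);	/* pretend a peer event arrived */

	munlock(map, len);
	munmap(map, len);
	close(fd);
	return 0;
}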

Ingo




Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-28 Thread Andrea Arcangeli

On Thu, Sep 28, 2000 at 01:31:40PM +0200, Ingo Molnar wrote:
> if the shm contains raw I/O data, then that's flawed application design -
> an mmap()-ed file should be used instead. Shm is equivalent to shared

The DBMS uses shared SCSI disks across multiple hosts on the same SCSI bus
and synchronizes the distributed cache via TCP. Tell me how to do that
with the OS cache and mmap.

Andrea



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-28 Thread Andrea Arcangeli

On Thu, Sep 28, 2000 at 08:16:32AM -0300, Rik van Riel wrote:
> Andrea, I have the strong impression that your idea of
> memory balancing is based on the idea that the OS should
> out-smart the application instead of looking at the usage
> pattern of the pages in memory.

Not sure what you mean with out-smart.

My only point is that the OS actually can only swapout such shm. If that
SHM is not supposed to be swapped out and if the OS I/O cache has more aging
than the shm cache, then the OS should tell the DBMS that it's time to shrink
some shm pages by freeing them.

> of the pages in question, instead of making presumptions
> based on what kind of cache the page is in.

For the mapped pages we never make presumptions. We always check the accessed
bit and that's the most reliable info to know if the page has been accessed
recently (it is set by CPU accesses through the pte, not only during page
faults or cache hits).  With the current design pages mapped multiple times
will be overaged a bit but this can't be fixed until we make a page->pte
reverse lookup...

Andrea



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-28 Thread Andrea Arcangeli

On Thu, Sep 28, 2000 at 07:08:51AM -0300, Rik van Riel wrote:
> taking care of this itself. But this is not something the OS
> should prescribe to the application.

Agreed.

> (unless the SHM users tell you that this is the normal way
> they use SHM ... but as Christoph just told us, it isn't)

shm is not used as I/O cache by 90% of the apps out there because normal apps
use the OS cache functionality (90% of those apps don't use rawio to share a
black box that looks like a scsi disk via a SCSI bus connected to other hosts
as well).

I certainly agree shm swapin/swapout is very important. (we moved shm
swapout/swapin to the swap cache with readaround for that reason)

Andrea



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-28 Thread Ingo Molnar


On Thu, 28 Sep 2000, Rik van Riel wrote:

> The OS has no business knowing what's inside that SHM page.

exactly.

> IF the shm contains I/O cache, maybe you're right. However,
> until you know that this is the case, optimising for that
> situation just doesn't make any sense.

if the shm contains raw I/O data, then that's flawed application design -
an mmap()-ed file should be used instead. Shm is equivalent to shared
anonymous pages.

Ingo




Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-28 Thread Rik van Riel

On Thu, 28 Sep 2000, Rik van Riel wrote:
> On Wed, 27 Sep 2000, Andrea Arcangeli wrote:

> > But again: if the shm contains I/O cache it should be released
> > and not swapped out.  Swapping out shmfs that contains I/O cache
> > would be exactly like swapping out page-cache.
> 
> The OS has no business knowing what's inside that SHM page.

Hmm, now I woke up maybe I should formulate this in a
different way.

Andrea, I have the strong impression that your idea of
memory balancing is based on the idea that the OS should
out-smart the application instead of looking at the usage
pattern of the pages in memory.

This is fundamentally different from the idea that the OS
should make decisions based on the observed usage patterns
of the pages in question, instead of making presumptions
based on what kind of cache the page is in.

I've been away for 10 days and have been sitting on a bus
all last night so my judgement may be off. I'd certainly
like to hear I'm wrong ;)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/




Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-28 Thread Rik van Riel

On Wed, 27 Sep 2000, Andrea Arcangeli wrote:
> On Wed, Sep 27, 2000 at 10:11:43AM +0200, Christoph Rohland wrote:
> > I just checked one oracle system and it did not lock the memory. And I
> 
> If that memory is used for I/O cache then such memory should be
> released when the system runs into swap instead of swapping it
> out too (otherwise it's not cache anymore and it could be slower
> than re-reading the real data from disk with rawio).

It could also be faster. If the database spent half an hour
gathering pieces of data from all over the database, it might
be faster to keep it in one place in swap so it can be read
in again in one swoop.  (I had an interesting talk about this
with a database person while at OLS)

But that's not the point. If your assertion is true, then the
database will probably be using an mlock()ed SHM region and
taking care of this itself. But this is not something the OS
should prescribe to the application.

If the OS finds that certain SHM pages are used far less than
the pages in the I/O cache, then those SHM pages should be
swapped out. The system's job is to keep the most used pages
of data in memory to minimise the number of page faults
happening. Trying to outsmart the application shouldn't (IMHO
of course) be part of that job...

> > Customers with performance problems very often start with too little
> > memory, but they cannot upgrade until this really big job finishes :-(
> > 
> > Another issue about shm swapping is interactive transactions, where
> > some users have very large contexts and go for a coffee before
> > submitting. This memory can be swapped. 
> 
> Agreed, that's why I said shm performance under swap is very important
> as well (I'm not underestimating it).
> 
> But again: if the shm contains I/O cache it should be released
> and not swapped out.  Swapping out shmfs that contains I/O cache
> would be exactly like swapping out page-cache.

The OS has no business knowing what's inside that SHM page.
IF the shm contains I/O cache, maybe you're right. However,
until you know that this is the case, optimising for that
situation just doesn't make any sense.

(unless the SHM users tell you that this is the normal way
they use SHM ... but as Christoph just told us, it isn't)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/




Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-27 Thread Andrea Arcangeli

On Wed, Sep 27, 2000 at 12:25:44PM -0600, Erik Andersen wrote:
> Or sysinfo(2).  Same thing...

The sysinfo structure doesn't export the number of active pages in the system.

Andrea



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-27 Thread Erik Andersen

On Wed Sep 27, 2000 at 07:42:00PM +0200, Andrea Arcangeli wrote:
> 
> You should of course poll the /proc/meminfo. (/proc/meminfo works in O(1) in
> 2.4.x so it's just the overhead of a read syscall)

Or sysinfo(2).  Same thing...

 -Erik

--
Erik B. Andersen   email:  [EMAIL PROTECTED]
--This message was written using 73% post-consumer electrons--



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-27 Thread Andrea Arcangeli

On Wed, Sep 27, 2000 at 06:56:42PM +0200, Christoph Rohland wrote:
> Yes, but how does the application detect that it should free the mem?

The trivial way is not to detect it: let the user select how much memory it
will use as cache, take it locked and then don't care (the user will have to
decrease the size of the shm by hand to drop some cache). From the OS point of
view it's like not having that RAM at all, and there will be zero performance
difference compared to thrashing into swap without such memory. (on 2.2.x this
is not true because of a complexity problem in shrink_mmap that is solved with
the real lru in 2.4.x)

The other way is to have a shm cache that shrinks dynamically by looking at
/proc/meminfo and at the aging of its own cache. Again the user should give a
minimum and a maximum amount of shm cache to keep locked in memory. Then you
look at "freemem + cache + buffers - active cache" and you can tell when you're
going to run into swap. Specifically, with classzone you'll run into swap when
that value is near zero. So when that value is near zero you know it's time to
shrink the shm cache dynamically if it has a low age, otherwise the machine
will thrash into swap badly and performance will decrease. (you could start
shrinking when that value drops below a number of megabytes, again configurable
via a form)

You should of course poll the /proc/meminfo. (/proc/meminfo works in O(1) in
2.4.x so it's just the overhead of a read syscall)
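
A rough sketch of such a polling loop (the /proc/meminfo field names
vary between kernel versions; "Active:" is used here as an
approximation of the active cache, and the threshold is arbitrary):

/* Rough sketch of the polling heuristic: estimate how far the box is
 * from going into swap and shrink the application's own (low-age) shm
 * cache when the headroom gets small.  Field names and threshold are
 * illustrative only.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long meminfo_kb(const char *field)
{
	char line[128];
	long val = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, field, strlen(field)) == 0) {
			sscanf(line + strlen(field), " %ld", &val);
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	const long threshold_kb = 32 * 1024;	/* arbitrary: act below 32MB headroom */

	for (;;) {
		long headroom = meminfo_kb("MemFree:") + meminfo_kb("Cached:")
			      + meminfo_kb("Buffers:") - meminfo_kb("Active:");

		if (headroom >= 0 && headroom < threshold_kb) {
			/* about to push the machine into swap: shrink the
			 * locked shm cache instead of letting the VM thrash */
			printf("headroom %ld kB: shrink the shm cache\n", headroom);
		}
		sleep(5);	/* reading /proc/meminfo is cheap, polling is fine */
	}
	return 0;
}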

These DBs using rawio really want to substitute part of the kernel cache
functionality, so it's quite natural that they also don't want the kernel to
play with their caches while they run. They would need some more interaction
with the kernel memory balancing (possibly via async signals) to get their shm
reclaimed dynamically more cleanly and efficiently by registering for this
functionality (they could get a signal when the machine runs into swap, and
then the DB chooses whether it's worth releasing some locked cache after
looking at /proc/meminfo and the working set of its own caches).

> Also you often have more overhead reading out of a database than
> having preprocessed data in swap. 

Yes I see, it of course depends on the kind of cache (if it's very close to the
on-disk format then it more probably shouldn't be swapped out).

Andrea



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-27 Thread Christoph Rohland

Andrea Arcangeli <[EMAIL PROTECTED]> writes:

> On Wed, Sep 27, 2000 at 10:11:43AM +0200, Christoph Rohland wrote:
> > I just checked one oracle system and it did not lock the memory. And I
> 
> If that memory is used for I/O cache then such memory should be
> released when the system runs into swap instead of swapping it out
> too (otherwise it's not cache anymore and it could be slower than
> re-reading the real data from disk with rawio).

Yes, but how does the application detect that it should free the mem?
Also you often have more overhead reading out of a database than
having preprocessed data in swap. 

> > Customers with performance problems very often start with too little
> > memory, but they cannot upgrade until this really big job finishes :-(
> > 
> > Another issue about shm swapping is interactive transactions, where
> > some users have very large contexts and go for a coffee before
> > submitting. This memory can be swapped. 
> 
> Agreed, that's why I said shm performance under swap is very important
> as well (I'm not underestimating it).

fine :-)

Greetings
Christoph



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-27 Thread Andrea Arcangeli

On Wed, Sep 27, 2000 at 10:11:43AM +0200, Christoph Rohland wrote:
> I just checked one oracle system and it did not lock the memory. And I

If that memory is used for I/O cache then such memory should be released when
the system runs into swap instead of swapping it out too (otherwise it's not
cache anymore and it could be slower than re-reading the real data from disk
with rawio).

> Customers with performance problems very often start with too little
> memory, but they cannot upgrade until this really big job finishes :-(
> 
> Another issue about shm swapping is interactive transactions, where
> some users have very large contexts and go for a coffee before
> submitting. This memory can be swapped. 

Agreed, that's why I said shm performance under swap is very important
as well (I'm not underestimating it).

But again: if the shm contains I/O cache it should be released and not swapped
out.  Swapping out shmfs that contains I/O cache would be exactly like swapping
out page-cache.

Andrea



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-27 Thread Christoph Rohland

Ingo Molnar <[EMAIL PROTECTED]> writes:

> On 27 Sep 2000, Christoph Rohland wrote:
> 
> > Nobody should rely on shm swapping for productive use. But you have
> > changing/increasing loads on application servers and all of a sudden
> > you run oom. In this case the system should behave and it is _very_
> > good to have a smooth behaviour.
> 
> it might make sense even in production use. If there is some calculation
> that has to be done only once per month, then sure the customer can decide
> to wait for it a few hours until it swaps itself ready, instead of buying
> gigs of RAM just to execute this single operation faster. Uncooperative
> OOM in such cases is a show-stopper. Or are you saying the same thing? :-)

That's what I meant by the coffee break. In a big installation
somebody is always drinking coffee :-)
 
You also often have different loads during daytime and
nighttime. Swapping buffers out to swap disk instead of rereading from
the database makes a lot of sense for this. But a single job should
never swap. (It works for two months and then next month you get the
big escalation and you would love to have hotplug memory)

So swapping happens in productive use. But nobody should rely on
that too much. 

And I completely agree that uncooperative OOM is not acceptable.

Greetings
Christoph



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-27 Thread Ingo Molnar


On 27 Sep 2000, Christoph Rohland wrote:

> Nobody should rely on shm swapping for productive use. But you have
> changing/increasing loads on application servers and all of a sudden
> you run oom. In this case the system should behave and it is _very_
> good to have a smooth behaviour.

it might make sense even in production use. If there is some calculation
that has to be done only once per month, then sure the customer can decide
to wait for it a few hours until it swaps itself ready, instead of buying
gigs of RAM just to execute this single operation faster. Uncooperative
OOM in such cases is a show-stopper. Or are you saying the same thing? :-)

Ingo




Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-27 Thread Christoph Rohland

Andrea Arcangeli <[EMAIL PROTECTED]> writes:

> That said, I've heard of real world programs that have a .text larger than 2G

=:-O

> I know Oracle (and most other DBs) are very shm intensive.  However
> the fact you say the shm is not locked in memory is really news to
> me. I really remembered that the shm was locked.

I just checked one oracle system and it did not lock the memory. And I
do not think that the other databases do it by default either.

And our application server definitely doesn't do it. And it uses loads
of shared memory. We will soon have application servers with 16 GB of
memory at customer sites which will have the whole memory in shmfs.

> I also don't see the point of keeping data cache in the swap. Swap
> involves SMP tlb flushes and all the other big overhead that you
> could avoid by sizing properly the shm cache and taking it locked.
> 
> Note: having very fast shm swapout/swapin is a very good thing (in fact
> we introduced readaround of the swapin and moved shm swapout/swapin
> locking to the swap cache in early 2.3.x exactly for that
> reason). But I just don't think DBMSs need that.

Nobody should rely on shm swapping for productive use. But you have
changing/increasing loads on application servers and all of a sudden
you run oom. In this case the system should behave and it is _very_
good to have a smooth behaviour. 

Customers with performance problems very often start with too little
memory, but they cannot upgrade until this really big job finishes :-(

Another issue about shm swapping is interactive transactions, where
some users have very large contexts and go for a coffee before
submitting. This memory can be swapped. 

Greetings
Christoph



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-26 Thread Andrea Arcangeli

On Tue, Sep 26, 2000 at 10:00:12PM +0200, Peter Osterlund wrote:
> Therefore, no matter what algorithm you use in elevator_linus() the total
> number of seeks should be the same.

It isn't. There's a big difference between the two algorithms, and all your
previous emails were completely correct about the "theoretical" additional
seeks during starvation avoidance.

> an even better way to sort the requests. I think the important thing is
> trying to minimize the total amount of seek time. But doing that requires

The current algorithm only tries to minimize the total amount of seek time.
That's what the elevator reordering is there for.
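
For comparison, a toy one-way (CSCAN-like) insertion keyed purely on
sector number looks like this; the struct is made up, and the real
elevator_linus() additionally limits how many times a queued request
may be passed over:

/* Toy one-way elevator: keep the queue sorted by sector so requests
 * are serviced in ascending order and the head sweeps in one
 * direction.  The struct is made up; the real elevator_linus() also
 * enforces a per-request limit on how often it may be passed over.
 */
#include <stdio.h>
#include <stdlib.h>

struct toy_request {
	unsigned long sector;
	struct toy_request *next;
};

static void elevator_insert(struct toy_request **head, struct toy_request *rq)
{
	struct toy_request **p = head;

	/* walk until the first queued request beyond rq's sector */
	while (*p && (*p)->sector <= rq->sector)
		p = &(*p)->next;
	rq->next = *p;
	*p = rq;
}

int main(void)
{
	unsigned long sectors[] = { 800, 100, 450, 9000, 20 };
	struct toy_request *queue = NULL, *rq;
	size_t i;

	for (i = 0; i < sizeof(sectors) / sizeof(sectors[0]); i++) {
		rq = malloc(sizeof(*rq));
		rq->sector = sectors[i];
		elevator_insert(&queue, rq);
	}
	for (rq = queue; rq; rq = rq->next)
		printf("%lu ", rq->sector);	/* 20 100 450 800 9000 */
	printf("\n");
	return 0;
}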

Andrea



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-26 Thread Peter Osterlund

On Tue, 26 Sep 2000, Andrea Arcangeli wrote:

: > smaller with your algorithm? (I later realized that request merging is
: > done before the elevator function kicks in, so your algorithm should
: 
: Not sure what you mean. There are two cases: the bh is merged, or
: the bh will be queued in a new request because merging isn't possible.
: 
: Your change deals only with the latter case and that should be
: pretty orthogonal to what the merging stuff does.

I previously thought that elevator_linus() was called first, and then
elevator_linus_merge() was invoked to merge sequential requests before
they were sent to the driver. If that had been the case, your version of
elevator_linus() would have generated more seeks than CSCAN.

But as I said, I was mistaken, it doesn't work that way. The
elevator_linus() function is only called if merging isn't possible.
Therefore, no matter what algorithm you use in elevator_linus() the total
number of seeks should be the same.

Therefore, if you are not trying to be fair (as CSCAN is) maybe there is
an even better way to sort the requests. I think the important thing is
trying to minimize the total amount of seek time. But doing that requires
knowledge about how the seek time depends on the seek distance.

-- 
Peter Österlund  Email: [EMAIL PROTECTED]
Sköndalsvägen 35[EMAIL PROTECTED]
S-128 66 Sköndal Home page: http://home1.swipnet.se/~w-15919
Sweden   Phone: +46 8 942647





Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-26 Thread Christoph Rohland

Andrea Arcangeli <[EMAIL PROTECTED]> writes:

> Could you tell me what's wrong in having an app with a 1.5G mapped executable
> (or a tiny executable but with a 1.5G shared/private file mapping if you
> prefer),

O.K. that sounds more reasonable. I was reading image as program
text... and a 1.5GB program text is something I have never seen (and
hopefully will never see :-)

> 300M of shm (or 300M of anonymous memory if you prefer) and 200M as
> filesystem cache?

I don't really see a reason for fs cache in the application. I think
that parallel applications tend to either share mostly all or nothing,
but I may be wrong here.

> The application has a misc I/O load that in some part will run out
> of the working set; what's wrong with this?
> 
> What's ridiculous? Please elaborate.

I think we fixed this misreading. 

But still IMHO you underestimate the importance of shared memory for a
lot of applications in the high end. There is not only Oracle out
there and most of the shared memory is _not_ locked.

Greetings
Christoph



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-26 Thread Andrea Arcangeli

On Tue, Sep 26, 2000 at 06:20:47PM +0200, Christoph Rohland wrote:
> O.K. that sounds more reasonable. I was reading image as program
> text... and a 1.5GB program text is something I have never seen (and
> hopefully will never see :-)

:)

From the point of view of the shrink_mmap algorithm's complexity, a 1.5GB .text
is completely equal to a large 1.5GB MAP_SHARED or MAP_PRIVATE mapping
(it doesn't need to be the .text of the program).

That said, I've heard of real world programs that have a .text larger than 2G
(that's why I wasn't very careful to say it doesn't need to be a 1.5G
.text but that any other page-cache mapping that large would have the same
effect).

> > 300M of shm (or 300M of anonymous memory if you prefer) and 200M as
> > filesystem cache?
> 
> I don't really see a reason for fs cache in the application. I think

Infact the application can as well use rawio.

> that parallel applications tend to either share mostly all or nothing,
> but I may be wrong here.

And then at some point you'll run `find /` or `tar mylatestsources.tar.gz
sources/` or updatedb starts up or whatever. And you don't need more
than 200M of fs cache for that purpose.

Think of the O(N) complexity that we had in si_meminfo (guess why in 2.4.x
`free` says 0 in the shared field). It made it impossible to run `xosview` on
a 10G box (it was stalling for seconds).

And si_meminfo was only counting 1 field, not rolling pages around the
lru, grabbing locks and dirtying cachelines.

That's a plain complexity/scalability issue as far as I can tell, and classzone
solves it completely.  When you run tar with your 1.5G shared mapping in memory
and you happen to hit the low watermark and need to recycle some bytes of
old cache, you'll run as fast as without the mapping in memory. There will be
zero difference in performance.  (just like now, if you run `free` on a 10G
machine it runs as fast as on a 4mbyte machine)

> I think we fixed this misreading. 

I should have explained things more carefully in the first place, sorry.

> But still IMHO you underestimate the importance of shared memory for a
> lot of applications in the high end. There is not only Oracle out
> there and most of the shared memory is _not_ locked.

Well I wasn't claiming that this optimization is very significant for DB
applications (at least for DBs that don't use quite big file mappings).

I know Oracle (and most other DBs) are very shm intensive.  However the fact
you say the shm is not locked in memory is really news to me. I really
remembered that the shm was locked.

I also don't see the point of keeping data cache in the swap. Swap involves SMP
tlb flushes and all the other big overhead that you could avoid by sizing
properly the shm cache and taking it locked.

Note: having very fast shm swapout/swapin is a very good thing (in fact we
introduced readaround of the swapin and moved shm swapout/swapin locking to the
swap cache in early 2.3.x exactly for that reason). But I just don't think
DBMSs need that.

Note: simulations are a completely different thing (their evolution is not
predictable). Simulations can sure thrash shm into swap anytime (but Oracle
shouldn't do that AFAIK).

Andrea



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-26 Thread Andrea Arcangeli

On Tue, Sep 26, 2000 at 12:14:18AM +0200, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 10:52:08PM +0200, Peter Osterlund wrote:
> > Do you know why? Is it because the average seek distance becomes
> 
> Good question. No I don't know why right now. I'll try again just to be 200%
> sure and I'll let you know the results.

These are the numbers produced by my current blkdev tree based on
test8-pre5 with only the spinlock-1 patch on it:

-
2.4.0-test8-pre5 + blkdev-1 - IA32 2-way SMP LVM-stripe IDE

        File   Block  Num  Seq Read     Rand Read    Seq Write    Rand Write
 Dir    Size    Size  Thr  Rate (CPU%)  Rate (CPU%)  Rate (CPU%)  Rate (CPU%)
----- ------ ------- ---  -----------  -----------  -----------  -----------
    .   2544    4096   1  16.38 7.99%  0.647 1.16%  15.60 14.8%  1.330 5.53%
    .   2544    4096   2  16.34 10.9%  0.676 1.12%  15.70 17.2%  1.330 5.95%
    .   2544    4096   4  16.30 10.9%  0.690 1.07%  15.55 17.9%  1.324 6.24%
    .   2544    4096   8  15.71 12.1%  0.713 1.06%  15.11 17.8%  1.327 6.01%
        File   Block  Num  Seq Read     Rand Read    Seq Write    Rand Write
 Dir    Size    Size  Thr  Rate (CPU%)  Rate (CPU%)  Rate (CPU%)  Rate (CPU%)
----- ------ ------- ---  -----------  -----------  -----------  -----------
    .   2544    4096   1  16.41 7.82%  0.716 0.82%  15.91 14.6%  1.334 4.86%
    .   2544    4096   2  16.44 11.1%  0.715 0.91%  15.82 17.1%  1.316 4.67%
    .   2544    4096   4  16.39 10.9%  0.722 0.95%  15.52 17.9%  1.322 5.07%
    .   2544    4096   8  16.02 11.8%  0.742 0.99%  15.13 17.8%  1.329 5.06%

andrea@laser:/mnt/p > ~/dbench/dbench 40
40 clients started
...+.+...+...+..+.+++..++...++..+.++...+++++...+++++..+...+...+.+..+...++...+.+.+...+.+.++.+
Throughput 10.7262 MB/sec (NB.4077 MB/sec  107.262 MBit/sec)
andrea@laser:/mnt/p > ~/dbench/dbench 40
40 clients started
...+...+.+.+..+..+...+.+...+.++...++.++...+..+.+...+...+..++.+.+++++...+...++.++...+...+...+...+...+.+.+
Throughput 11.7624 MB/sec (NB.703 MB/sec  117.624 MBit/sec)

 

Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-26 Thread Andrea Arcangeli

On Tue, Sep 26, 2000 at 08:54:23AM +0200, Christoph Rohland wrote:
> "Stephen C. Tweedie" <[EMAIL PROTECTED]> writes:
> 
> > Hi,
> > 
> > On Mon, Sep 25, 2000 at 09:32:42PM +0200, Andrea Arcangeli wrote:
> > 
> > > Having shrink_mmap browse the mapped page cache is as useless
> > > as having shrink_mmap browse kernel memory and anonymous pages
> > > as it does in 2.2.x, as far as I can tell. It's an algorithmic
> > > complexity problem and it will waste lots of CPU.
> > 
> > It's a compromise between CPU cost and Getting It Right.  Ignoring the
> > mmap is not a good solution either.
> > 
> > > Now think of this simple real-life example. A 2G RAM machine running
> > > an executable image of 1.5G, 300M in shm and 200M in cache.
> 
> Hey that's ridiculous: 1.5G executable image and 300M shm? Take it
> vice-versa and you are approaching real life.

Could you tell me what's wrong with having an app with a 1.5G mapped executable
(or a tiny executable but with a 1.5G shared/private file mapping if you
prefer), 300M of shm (or 300M of anonymous memory if you prefer) and 200M as
filesystem cache?

The application has a misc I/O load that in part falls outside the
working set; what's wrong with this?

What's ridiculous? Please elaborate.

To emulate that workload we only need to mmap(1.5G, MAP_PRIVATE or MAP_SHARED),
fault into it, and run bonnie.
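
For illustration, a minimal userspace sketch of that emulation (the file name,
size and page step are placeholders; error handling is reduced to the bare
minimum):

	/*
	 * Map a large file and touch one byte per page so the whole mapping
	 * becomes resident page cache, then sit there while an I/O benchmark
	 * (bonnie, tiobench, ...) runs in parallel.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 1500UL << 20;	/* ~1.5G, placeholder size */
		int fd = open("bigfile", O_RDONLY);
		unsigned long sum = 0;
		const char *p;
		size_t off;

		if (fd < 0)
			return 1;
		p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		for (off = 0; off < len; off += 4096)
			sum += p[off];		/* fault in every page */
		printf("touched, checksum %lu\n", sum);
		pause();			/* keep the mapping alive */
		return 0;
	}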

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-26 Thread Jamie Lokier

Peter Osterlund wrote:
> Btw, does anyone know how the seek time depends on the seek distance
> on modern disk hardware?

I know very little but I've seen it before on this list.

For larger seeks, the head is accelerated then decelerated to roughly
the right track.  That time is roughly proportional to the square root of the
seek distance, which we can assume is approximately the difference in block
number on the disk.

After that, there is a settling time which is roughly constant for all
seeks, though I wouldn't be surprised to find it's a bit smaller for
really tiny seeks.

There are different rules when there are multiple independent
heads, for RAID arrays etc. (just as a reminder).
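
As a rough illustration of that model (the coefficients below are invented,
not measured; real settle times and acceleration profiles vary per drive):

	#include <math.h>

	/*
	 * Toy seek-time model: an acceleration/deceleration phase roughly
	 * proportional to sqrt(distance), plus a roughly constant settling
	 * time.  Purely illustrative numbers.
	 */
	static double seek_time_ms(double distance_blocks)
	{
		const double accel_coeff_ms = 0.05;	/* scales the sqrt term */
		const double settle_ms = 2.0;		/* head settling time */

		if (distance_blocks <= 0)
			return 0.0;			/* no seek needed */
		return accel_coeff_ms * sqrt(distance_blocks) + settle_ms;
	}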

enjoy,
-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-26 Thread Christoph Rohland

Andrea Arcangeli [EMAIL PROTECTED] writes:

> Could you tell me what's wrong with having an app with a 1.5G mapped executable
> (or a tiny executable but with a 1.5G shared/private file mapping if you
> prefer),

O.K. that sounds more reasonable. I was reading "image" as program
text... and a 1.5GB program text is something I have never seen (and
hopefully will never see :-)

> 300M of shm (or 300M of anonymous memory if you prefer) and 200M as
> filesystem cache?

I don't really see a reason for fs cache in the application. I think
that parallel applications tend to either share mostly all or nothing,
but I may be wrong here.

> The application has a misc I/O load that in part falls outside the
> working set; what's wrong with this?
>
> What's ridiculous? Please elaborate.

I think we fixed this misreading. 

But still IMHO you underestimate the importance of shared memory for a
lot of applications in the high end. There is not only Oracle out
there and most of the shared memory is _not_ locked.

Greetings
Christoph
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-26 Thread Andrea Arcangeli

On Tue, Sep 26, 2000 at 10:00:12PM +0200, Peter Osterlund wrote:
> Therefore, no matter what algorithm you use in elevator_linus() the total
> number of seeks should be the same.

It isn't. There's a big difference between the two algorithms, and all your
previous emails were completely correct about the "theoretical" additional seeks
during starvation avoidance.

> an even better way to sort the requests. I think the important thing is
> trying to minimize the total amount of seek time. But doing that requires

The current algorithm only tries to minimize the total amount of seek time.
That's what the elevator reordering is there for.
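
To make the contrast concrete, here is a heavily simplified sketch of the two
insertion policies being compared. It is illustrative only: the structure and
field names are invented and this is not the actual elevator_linus() code.

	struct req_sketch {
		unsigned long sector;	/* starting sector of the request */
		unsigned long age;	/* how long it has been queued */
	};

	/* CSCAN-style rule: insert purely by ascending sector order. */
	static int cscan_goes_before(struct req_sketch *new, struct req_sketch *q)
	{
		return new->sector < q->sector;
	}

	/*
	 * Latency-bounded rule: still prefer ascending sector order, but
	 * never pass a request that has already waited too long, so old
	 * requests cannot be starved forever (at the price of some extra
	 * seeks while the old request is drained).
	 */
	static int elevator_goes_before(struct req_sketch *new, struct req_sketch *q,
					unsigned long max_age)
	{
		if (q->age > max_age)
			return 0;
		return new->sector < q->sector;
	}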

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Linus Torvalds



On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> 
> The machine will run low on memory as soon as I read 200mbyte from disk.

So? 

Yes, at that point we'll do the LRU dance. Then we won't be low on memory
any more, and we won't do the LRU dance any more. What's the magic in
zoneinfo that makes it not have to do the same thing?

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 03:30:10PM -0700, Linus Torvalds wrote:
> On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> > 
> > I'm talking about the fact that if you have a file mmapped in 1.5G of RAM
> > test9 will waste time rolling between LRUs 384000 pages, while classzone
> > won't ever see 1 of those pages until you run low on fs cache.
> 
> What drugs are you on? Nobody looks at the LRU's until the system is low
> on memory. Sure, there's some background activity, but what are you

The system is low on memory when you run `free` and you see a value
< freepages_high*PAGE_SIZE in the "free" column first row.

> talking about? It's only when you're low on memory that _either_ approach
> starts looking at the LRU list.

The machine will run low on memory as soon as I read 200mbyte from disk.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Andrea Arcangeli

On Tue, Sep 26, 2000 at 12:30:28AM +0200, Juan J. Quintela wrote:
> Which is completely wrong if the program uses _any not completely_
> unusual locality of reference.  Think twice about that: it is more
> probable that you need more than 300MB of filesystem cache than that you
> have an application that references _randomly_ 1.5GB of data.  You need
> to balance that _always_ :((

The application doesn't reference 1.5GB of data randomly. Assume
there's a big executable, 2G large (and yes, I know they exist), and I run it.
After some hours its RSS is 1.5G. OK?

So now this program also shmget a 300 Mbyte shm segment.

Now this program starts reading and writing terabytes of data that
wouldn't fit in cache even if there were 300G of RAM (and
this is possible too). Or maybe the program itself uses rawio,
but then at a certain point you use the machine to run a tar somewhere.

Now tell me why this program needs more than 200Mbyte of fs cache
if the kernel doesn't waste time on the mapped pages (as in
classzone).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Rik van Riel

On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 08:54:57PM +0100, Stephen C. Tweedie wrote:

> > basically the whole of memory is data cache, some of which is mapped
> > and some of which is not?
> 
> As I said in the last email, aging on the cache is supposed to do that.
> 
> Wasting CPU and increasing the complexity of the algorithm is a price
> that I won't pay just to get the information on when it's time
> to recall swap_out().

You must be joking. Page replacement should be tuned to
do good page replacement, not just to be easy on the CPU.
(though a heavily thrashing system /is/ easy on the cpu,
I'll have to admit that)

> If the cache has no age, it means I'd better throw it out instead
> of swapping/unmapping stuff, simple?

Simple, yes. But completely BOGUS if you don't age the cache
and the mapped pages at the same rate!

If I age your pages twice as much as my pages, is it still
only fair that your pages will be swapped out first? ;)

> > anything since last time.  Anything that only ages per-pte, not
> > per-page, is simply going to die horribly under such load, and any
> 
> The aging on the fs cache is done per-page.

And the same should be done for other pages as well.
If you don't do that, you'll have big problems keeping
page replacement balanced and making the system work well
under various loads.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 07:26:56PM -0300, Rik van Riel wrote:
> IMHO this is a minor issue because:

I don't think it's a minor issue.

If you don't have a reschedule point in your equivalent of shrink_mmap and this
1.5G happens to be consecutive in the LRU order (quite probable if it's
been paged in at a fast rate), then you may even hang in interruptible mode for
seconds as soon as somebody starts reading from disk. 2.4.x has to scale to
dozens of gigabytes of RAM, as there are archs supporting that amount of RAM.

> 2) you don't /want/ to run low on fs cache, you want

So I can't read more than the fs cache can hold? I must be
allowed to do that (that's 200 MB of RAM, which can be more than enough
if the server mainly generates pollution anyway).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 08:54:57PM +0100, Stephen C. Tweedie wrote:
> OK, and here's another simple real life example.  A 2GB RAM machine
> running something like Oracle with a hundred client processes all
> shm-mapping the same shared memory segment.

Oracle takes the SHM locked, and it will never run on a machine without
enough memory.

> Oh, and you're also doing lots of file IO.  How on earth do you decide
> what to swap and what to page out in this sort of scenario, where
> basically the whole of memory is data cache, some of which is mapped
> and some of which is not?

As I said in the last email, aging on the cache is supposed to do that.

Wasting CPU and increasing the complexity of the algorithm is a price
that I won't pay just to get the information on when it's time
to recall swap_out().

If the cache has no age, it means I'd better throw it out instead
of swapping/unmapping stuff, simple?

> anything since last time.  Anything that only ages per-pte, not
> per-page, is simply going to die horribly under such load, and any

The aging on the fs cache is done per-page.

The per-pte issue happens once we have already taken the difficult decision (that it
was time to swap out), and you have the same problem because you don't know the
chain of ptes that point to the physical page (so you refresh the referenced
bit more often). Once we have the chain of ptes pointing to the page,
classzone will only need a real LRU for the mapped pages and can use it instead of
walking pagetables.
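
For illustration, a minimal sketch of what such a "chain of ptes per page"
(reverse mapping) could look like. This is not from any actual patch; the
structure, the extra struct page field and the helper names are invented,
and a ptep_test_and_clear_young()-style accessed-bit helper is assumed:

	#include <linux/mm.h>
	#include <asm/pgtable.h>

	/* One entry per pte currently mapping a given physical page. */
	struct pte_chain_sketch {
		pte_t *ptep;
		struct pte_chain_sketch *next;
	};

	/* Hypothetical extra field: struct page { ...; struct pte_chain_sketch *rmap; }; */

	/*
	 * With such a chain, aging/unmapping can go page -> ptes directly,
	 * instead of walking every process's pagetables to find the mappings.
	 */
	static int page_referenced_sketch(struct pte_chain_sketch *rmap)
	{
		struct pte_chain_sketch *pc;
		int referenced = 0;

		for (pc = rmap; pc; pc = pc->next)
			if (ptep_test_and_clear_young(pc->ptep))
				referenced = 1;
		return referenced;
	}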

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Linus Torvalds



On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> 
> I'm talking about the fact that if you have a file mmapped in 1.5G of RAM
> test9 will waste time rolling between LRUs 384000 pages, while classzone
> won't ever see 1 of those pages until you run low on fs cache.

What drugs are you on? Nobody looks at the LRU's until the system is low
on memory. Sure, there's some background activity, but what are you
talking about? It's only when you're low on memory that _either_ approach
starts looking at the LRU list.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Juan J. Quintela

> "andrea" == Andrea Arcangeli <[EMAIL PROTECTED]> writes:

Hi

andrea> I'm talking about the fact that if you have a file mmapped in 1.5G of RAM
andrea> test9 will waste time rolling between LRUs 384000 pages, while classzone
andrea> won't ever see 1 of those pages until you run low on fs cache.

Which is completely wrong if the program uses _any not completely_
unusual locality of reference.  Think twice about that: it is more
probable that you need more than 300MB of filesystem cache than that you
have an application that references _randomly_ 1.5GB of data.  You need
to balance that _always_ :((

I think that there is no silver bullet here :(

Later, Juan.

-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Rik van Riel

On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 04:26:17PM -0300, Rik van Riel wrote:
> > > > It doesn't --- that is part of the design.  The vm scanner propagates
> > > 
> > > And that's the inferior part of the design IMHO.
> > 
> > Indeed, but physical page based aging is a definite
> > 2.5 thing ... ;(
> 
> I'm talking about the fact that if you have a file mmapped in
> 1.5G of RAM test9 will waste time rolling between LRUs 384000
> pages, while classzone won't ever see 1 of those pages until you
> run low on fs cache.

IMHO this is a minor issue because:
1) you need to do page replacement with shared pages
   right
2) you don't /want/ to run low on fs cache, you want
   to have a good balance between the cache(s) and
   the processes

OTOH, if you have a way to keep fair page aging and
fix the CPU time issue at the same time, I'd love
to see it.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 04:26:17PM -0300, Rik van Riel wrote:
> > > It doesn't --- that is part of the design.  The vm scanner propagates
> > 
> > And that's the inferior part of the design IMHO.
> 
> Indeed, but physical page based aging is a definite
> 2.5 thing ... ;(

I'm talking about the fact that if you have a file mmapped in 1.5G of RAM
test9 will waste time rolling between LRUs 384000 pages, while classzone
won't ever see 1 of those pages until you run low on fs cache.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 11:28:55PM +0200, Jens Axboe wrote:
> q->plug_device_fn(q, ...);
> list_add(...)
> generic_unplug_device(q);
> 
> would suffice in scsi_lib for now.

It looks sane to me.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 10:52:08PM +0200, Peter Osterlund wrote:
> Do you know why? Is it because the average seek distance becomes

Good question. No I don't know why right now. I'll try again just to be 200%
sure and I'll let you know the results.

> smaller with your algorithm? (I later realized that request merging is
> done before the elevator function kicks in, so your algorithm should

Not sure what you mean. There are two cases: the bh is merged, or
the bh will be queued in a new request because merging isn't possible.

Your change deals only with the latter case and that should be
pretty orthogonal to what the merging stuff does.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Jens Axboe

On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > The scsi layer currently "manually" does a list_add on the queue itself,
> > which doesn't look too healthy.
> 
> It's grabbing the io_request_lock so it looks healthy for now :)

It's safe alright, but if we want to do the generic_unplug_queue
instead of just hitting the request_fn (which might do anything
anyway), it would be nicer to expose this part of the block layer
(i.e. have a general way of queueing a request to the request_queue).
But I guess just

q->plug_device_fn(q, ...);
list_add(...)
generic_unplug_device(q);

would suffice in scsi_lib for now.

-- 
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Alexander Viro




On Sun, 24 Sep 2000, Linus Torvalds wrote:

[directories in pagecache on ext2]

> > I'll do it and post the result tomorrow. I bet that there will be issues
> > I've overlooked (stuff that happens to work on UFS, but needs to be more
> > general for ext2), so it's going as "very alpha", but hey, it's pretty
> > straightforward, so there is a chance to debug it fast. Yes, famous last
> > words and all such...
> 
> Sure.

All right, I think I've got something that may work. Yes, there were issues -
UFS has a constant directory chunk size (1 sector), while ext2 makes it
equal to the fs block size. _Bad_ idea, since sector writes are atomic and
block writes are not... Oh well, so ext2 is slightly less robust. It required some
changes; I'll do the initial testing and post the patch once it passes
the trivial tests.

BTW, why on Earth had we done it that way? It has no noticeable effect
on directory fragmentation, it makes the code (both in page- and buffer-cache
variants) more complex, and it's less robust (by definition - the directory layout
may be broken more easily)... What was the point?

Not that we could do something about that now (albeit as a ro-compat feature
it would be nice), but I'm curious about the reasons...
Cheers,
Al

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Peter Osterlund

Andrea Arcangeli <[EMAIL PROTECTED]> writes:

> The new elevator ordering algorithm returns me much better numbers
> than the CSCAN one with tiobench.

Do you know why? Is it because the average seek distance becomes
smaller with your algorithm? (I later realized that request merging is
done before the elevator function kicks in, so your algorithm should
not produce more seeks than a CSCAN algorithm. Unfortunately I didn't
realize this when I wrote my CSCAN patch.)

Btw, does anyone know how the seek time depends on the seek distance
on modern disk hardware?

-- 
Peter Österlund  Email: [EMAIL PROTECTED]
Sköndalsvägen 35[EMAIL PROTECTED]
S-128 66 Sköndal Home page: http://home1.swipnet.se/~w-15919
Sweden   Phone: +46 8 942647

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Stephen C. Tweedie

Hi,

On Mon, Sep 25, 2000 at 09:32:42PM +0200, Andrea Arcangeli wrote:

> Having shrink_mmap browse the mapped page cache is as useless
> as having shrink_mmap browse kernel memory and anonymous pages
> as it does in 2.2.x, as far as I can tell. It's an algorithmic
> complexity problem and it will waste lots of CPU.

It's a compromise between CPU cost and Getting It Right.  Ignoring the
mmap is not a good solution either.

> Now think of this simple real-life example. A 2G RAM machine running an executable
> image of 1.5G, 300M in shm and 200M in cache.

OK, and here's another simple real life example.  A 2GB RAM machine
running something like Oracle with a hundred client processes all
shm-mapping the same shared memory segment.

Oh, and you're also doing lots of file IO.  How on earth do you decide
what to swap and what to page out in this sort of scenario, where
basically the whole of memory is data cache, some of which is mapped
and some of which is not?

If you don't separate out the propagation of referenced bits from the
actual page aging, then every time you pass over the whole VM working
set, you're likely to find a handful of live references to some of the
shared memory, and a hundred or so references that haven't done
anything since last time.  Anything that only ages per-pte, not
per-page, is simply going to die horribly under such load, and any
imbalance between pure filesystem cache and VM pressure will be
magnified to the point where one dominates.

Hence my observation that it's really easy to find special cases where
certain optimisations make a ton of sense, but you often lose balance
in the process.  

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Rik van Riel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 07:06:57PM +0100, Stephen C. Tweedie wrote:
> > Good.  One of the problems we always had in the past, though, was that
> > getting the relative aging of cache vs. vmas was easy if you had a
> > small set of test loads, but it was really, really hard to find a
> > balance that didn't show pathological behaviour in the worst cases.
> 
> Yep, that's not trivial.

It is. Just do physical-page based aging (so you age all the
pages in the system the same) and the problem is solved.

> > > I may be overlooking something but where do you notice when a page
> > > gets unmapped from the last mapping and put it back into a place
> > > that can be reached from shrink_mmap (or whatever the cache recycler is)?
> > 
> > It doesn't --- that is part of the design.  The vm scanner propagates
> 
> And that's the inferior part of the design IMHO.

Indeed, but physical page based aging is a definite
2.5 thing ... ;(

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Stephen C. Tweedie

Hi,

On Mon, Sep 25, 2000 at 07:03:47PM +0200, Andrea Arcangeli wrote:
> 
> > This really seems to be the biggest difference between the two
> > approaches right now.  The FreeBSD folks believe fervently that one of
> > [ aging cache and mapped pages in the same cycle ]
> 
> Right.
> 
> And since you move the page into the active list only once you reach it from
> the cache recycler and you find it with page->age != 0, you also spend time
> putting those pages back and forth from those LRU lists, while in my approach the
> mapped pages are never seen by the cache recycler and no cycles are spent on
> them. This means in a pure fs read test with cache pollution going on, there's
> _no_way_ that classzone touches or notices _any_ mapped page in its path.

The "age==0" pages are basically just "pages we are ready to get rid
of right away".  The alternative to having that inactive list is to do
what we do today --- which is to throw away the pages immediately.
Having that extra list is simply giving pages a last chance before
evicting them.  It allows us to run reliably with fewer physically
free pages --- we can reap inactive pages with no IO so those pages
are as good as free for most purposes.

The alternative to moving pages to the inactive list would be freeing
them completely.  Moving a page back to the active list from inactive
is equivalent to avoiding a disk IO to pull in the page from backing
store.  It's supposed to be an optimisation to save physically
freeing things unless we really, really need to.  It is _not_ a
transition which recently referenced pages encounter.
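
A tiny sketch of that "last chance" idea, in illustrative pseudocode only:
the list names, the PG_inactive-style flag and the age values below are
invented, and the actual test9 lists and transitions differ in detail.

	#include <linux/list.h>
	#include <linux/mm.h>
	#include <asm/bitops.h>

	static LIST_HEAD(active_list_sketch);
	static LIST_HEAD(inactive_list_sketch);
	#define PG_inactive_sketch	30	/* invented flag bit */

	/* Pages whose age reaches zero are parked, not freed outright. */
	static void deactivate_sketch(struct page *page)
	{
		if (page->age == 0 &&
		    !test_and_set_bit(PG_inactive_sketch, &page->flags)) {
			/* last chance: reclaimable with no I/O, still rescuable */
			list_del(&page->lru);
			list_add(&page->lru, &inactive_list_sketch);
		}
	}

	/* Touching an inactive page rescues it, saving a disk read. */
	static void touch_sketch(struct page *page)
	{
		if (test_and_clear_bit(PG_inactive_sketch, &page->flags)) {
			list_del(&page->lru);
			list_add(&page->lru, &active_list_sketch);
			page->age = 2;		/* arbitrary restart age */
		}
	}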

> > the main reasons that their VM rocks is that it ages cache pages and
> > mapped pages at the same rate.  Having both on the same aging list
> > achieves that.  Separating the two raises the question of how to
> > balance the aging of cache vs. swap in a fair manner.
> 
> I believe increasing the aging in the unmapped cache should take care of that
> fine. (it was working pretty much fine also with only 1 bit of most
> frequently used aging plus the LRU order of the list)

Good.  One of the problems we always had in the past, though, was that
getting the relative aging of cache vs. vmas was easy if you had a
small set of test loads, but it was really, really hard to find a
balance that didn't show pathological behaviour in the worst cases.

> > > In classzone the aging exists too but it's _completely_ orthogonal to how
> > > the rest of the VM works.
> > 
> > Umm, that applies to Rik's stuff too!
> 
> I may be overlooking something but where do you notice when a page
> gets unmapped from the last mapping and put it back into a place
> that can be reached from shrink_mmap (or whatever the cache recycler is)?

It doesn't --- that is part of the design.  The vm scanner propagates
referenced bits to the struct page, so the new shrink_mmap can do its
aging based on whether a page has been referenced at all recently, not
caring whether the reference was a VM reference or a page cache
reference.  That is done specifically to address the balance issue
between VM and filesystem memory pressure.

> Since no mapped page can in any way be freed by the cache recycler
> (you need to unmap it first from swap_out at the moment), if you
> reach those pages from the cache recycler somehow it means
> you're wasting CPU (I couldn't reach any mapped page from the
> cache recycler in classzone, and in fact the mapped pages weren't
> linked in any LRU at all, to save even more CPU).

That's not how the current VM is supposed to work.  The cache scanner
isn't meant to reclaim pages --- it is meant to update the age
information on pages, which is not quite the same job.  If it finds
pages whose age becomes zero, those are shifted to the inactive list,
and once that list is large enough (ie. we have enough freeable
pages), it can give up.  The inactive list then gets physically freed
on demand.

The fact that we have a common loop in the VM for updating all age
information is central to the design, and requires the cache recycler
to pass over all those pages.  By doing it that way, rather than from
the VM scan, we can avoid one of the really bad properties of the old
2.0 aging code --- it means that for shared pages, we only do the
aging once per walk over the pages regardless of how many ptes refer
to the page.  This avoids the nasty worst-case behaviour of having a
recently-referenced page thrown out of memory just because there also
happened to be a lot of old, unused references to it too. 
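
As an illustration of the mechanism described above, here is a minimal sketch
of propagating referenced bits and then aging per physical page. The helper
names, the age constants and the exact decay rule are assumptions, not the
actual test9 code:

	#include <linux/mm.h>
	#include <asm/bitops.h>
	#include <asm/pgtable.h>

	#define PAGE_AGE_ADV_SKETCH	3	/* invented tuning values */
	#define PAGE_AGE_MAX_SKETCH	64

	/*
	 * VM scan: fold the per-pte accessed bit into the struct page,
	 * once per pte, without deciding anything yet.
	 */
	static void propagate_reference_sketch(struct page *page, pte_t *ptep)
	{
		if (ptep_test_and_clear_young(ptep))
			set_bit(PG_referenced, &page->flags);
	}

	/*
	 * Cache recycler: age every page by the same rule, whether the
	 * reference came from a mapping or from page-cache I/O, and only
	 * once per walk no matter how many ptes point at the page.
	 */
	static void age_page_sketch(struct page *page)
	{
		if (test_and_clear_bit(PG_referenced, &page->flags)) {
			page->age += PAGE_AGE_ADV_SKETCH;
			if (page->age > PAGE_AGE_MAX_SKETCH)
				page->age = PAGE_AGE_MAX_SKETCH;
		} else if (page->age) {
			page->age--;	/* decay while idle */
		}
	}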

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 07:21:48PM +0200, bert hubert wrote:
> Ok, sorry. Kernel development is proceeding at a furious pace and I sometimes
> lose track. 

No problem :).

> I seem to remember that people were impressed by classzone, but that the
> implementation was very non-trivial and hard to grok. One of the reasons

Yes. Classzone is certainly more complex.

> There is no such thing as 'under swap'. There are lots of loadpatterns that
> will generate different kinds of memory pressure. Just calling it 'under
> swap' gives entirely the wrong impression. 

Sorry for not being precise. I meant one of those load patterns.

> 'rivaling virtual memory' code. Energies spent on Rik's VM will yield far
> higher differential improvement. 

I've spent effort on classzone as well, and since I think it's a way superior
approach I'll at least port it on top of 2.4.0-test9, as soon as time
permits, to generate some numbers.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread bert hubert

> We're talking about shrink_[id]cache_memory change. That have _nothing_ to do
> with the VM changes that happened anywhere between test8 and test9-pre6.
> 
> You were talking about a different thing.

Ok, sorry. Kernel development is proceeding at a furious pace and I sometimes
lose track. 

> I consider the current approach the wrong way to go and for this reason I
> prefer to spend time porting/improving classzone.

I seem to remember that people were impressed by classzone, but that the
implementation was very non-trivial and hard to grok. One of the reasons
Rik's vm made it (so far) is that it is pretty straightforward, with all the
marks of the right amount of simplicity. 

> In the meantime if you want to go back to 2.4.0-test1-ac22-class++ to give
> it a try under swap to see the difference in the behaviour and compare
> (Mike said it's still an order of magnitude faster with his "make -j30
> bzImage" testcase and he's always very reliable in his reports).

There is no such thing as 'under swap'. There are lots of load patterns that
will generate different kinds of memory pressure. Just calling it 'under
swap' gives entirely the wrong impression. 

Although Mike's compile is a relevant benchmark, every VM has cases for
which it excels, and cases for which it sucks. This appears to be a general
property of VM design. 

Given knowledge of the algorithms used, you can always dream up a situation
where it will fail. It's a bit like writing the halting problem algorithm.
Same goes the other way around, every VM will have a 'shining benchmark' -
hence the invention of benchmarketing.

We used to have a bad virtual memory implementation that was sometimes well
tuned, so that a lot of ordinary cases showed acceptable performance. We now have
an elegant VM that works reasonably well, but needs more tweaking.

What is the point of all this ranting? Think twice before embarking on
'rivaling virtual memory' code. Energy spent on Rik's VM will yield a far
higher differential improvement.

Regards,


bert hubert

-- 
PowerDNS Versatile DNS Services  
Trilab   The Technology People   
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 07:05:02PM +0200, Ingo Molnar wrote:
> yep - and Jens i'm sorry about the outburst. Until a bug is found it's
> unrealistic to blame anything.

I think the only bug that may be to blame in the elevator is the EXCLUSIVE wakeup
thing (and I've not benchmarked it alone to see if it makes any real-world
performance difference, but for sure its behaviour wasn't intentional). Anything
else related to the elevator internals should perform better than the old
elevator (aka the 2.2.15 one). The new elevator ordering algorithm gives me
much better numbers than the CSCAN one with tiobench. Also consider that the latency
control is completely disabled by default at the moment, so there are no barriers
unless you change that with elvtune.

Also, I'm using -r 250 and -w 500 and it doesn't really change anything in the
numbers compared to much larger values (but it fixes the starvation problem).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Ingo Molnar


On Mon, 25 Sep 2000, Linus Torvalds wrote:

> Blaming the elevator is unfair and unrealistic. [...]

yep - and Jens i'm sorry about the outburst. Until a bug is found it's
unrealistic to blame anything.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 05:24:42PM +0100, Stephen C. Tweedie wrote:
> Your other recent complaint, that newly-swapped pages end up on the
> wrong end of the LRU lists and can't be reclaimed without cycling the
> rest of the pages in shrink_mmap, is also cured in Rik's code, by
> placing pages which are queued for swapout on a different list
> altogether.  I thought we had managed to agree in Ottawa that such a
> cure for the old 2.4 VM was desirable.

Yes, I've seen it and the fix looks OK. It's the deactivate_page call when
we swap out the anonymous page. I overlooked it at first, I apologise.

> > The mapped pages were never seen by anything except swap_out, if they were
> > mapped (it's not an "if page->age then move into the active list"; with
> > classzone the page was _just_ in the active list in the first place since it
> > was mapped).
> 
> This really seems to be the biggest difference between the two
> approaches right now.  The FreeBSD folks believe fervently that one of

Right.

And since you move the page into the active list only once you reach it from
the cache recycler and you find it with page->age != 0, you also spend time
putting those pages back and forth from those LRU lists, while in my approach the
mapped pages are never seen by the cache recycler and no cycles are spent on
them. This means in a pure fs read test with cache pollution going on, there's
_no_way_ that classzone touches or notices _any_ mapped page in its path.

I think you can't be faster than classzone here.

When the cache isn't polluted, by adding some more bits of aging I'll know better
when it's time to unmap/swap out stuff. (It already works this way, but with only
literally 1 bit of aging at the moment.)

> the main reasons that their VM rocks is that it ages cache pages and
> mapped pages at the same rate.  Having both on the same aging list
> achieves that.  Separating the two raises the question of how to
> balance the aging of cache vs. swap in a fair manner.

I believe increasing the aging in the unmapped cache should take care of that
fine. (it was working pretty much fine also with only 1 bit of most
frequently used aging plus the LRU order of the list)

> > In classzone the aging exists too but it's _completely_ orthogonal to how the
> > rest of the VM works.
> 
> Umm, that applies to Rik's stuff too!

I may be overlooking something but where do you notice when a page
gets unmapped from the last mapping and put it back into a place
that can be reached from shrink_mmap (or whatever the cache recycler is)?

Since no mapped page can in any way be freed by the cache recycler
(you need to unmap it first from swap_out at the moment), if you
reach those pages from the cache recycler somehow it means
you're wasting CPU (I couldn't reach any mapped page from the
cache recycler in classzone, and in fact the mapped pages weren't
linked in any LRU at all, to save even more CPU).

> Good, the best theoretical VM in the world can fall apart instantly on
> contact with the real world. :-)

:))

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Stephen C. Tweedie

Hi,

On Mon, Sep 25, 2000 at 01:41:37AM +0200, Andrea Arcangeli wrote:
> 
> Since you're talking about this, I'll soon (as soon as I finish some other
> thing that is still work in progress) release a classzone against the latest
> 2.4.x. My approach is _quite_ different from the current VM. The current approach is
> very imperfect and based solely on aging, whereas classzone had hooks into
> the pagefault paths and all other map/unmap points to have perfect accounting of
> the amount of active/inactive stuff.

Andrea, I'm not quite sure what you're saying here.  Could you be a
bit more specific?

The current VM _does_ track the amount of active/inactive stuff.  It
does so by keeping separate list of active and inactive stuff.
Accounting on memory pressure on these different lists is used to
generate dynamic targets for how many pages we aim to have on those
lists, so aging/reclaim activity is tuned to the current memory load.

Your other recent complaint, that newly-swapped pages end up on the
wrong end of the LRU lists and can't be reclaimed without cycling the
rest of the pages in shrink_mmap, is also cured in Rik's code, by
placing pages which are queued for swapout on a different list
altogether.  I thought we had managed to agree in Ottawa that such a
cure for the old 2.4 VM was desirable.

> The mapped pages were never seen by
> anything except swap_out, if they were mapped (it's not an "if page->age then move
> into the active list"; with classzone the page was _just_ in the active list in
> the first place since it was mapped).

This really seems to be the biggest difference between the two
approaches right now.  The FreeBSD folks believe fervently that one of
the main reasons that their VM rocks is that it ages cache pages and
mapped pages at the same rate.  Having both on the same aging list
achieves that.  Separating the two raises the question of how to
balance the aging of cache vs. swap in a fair manner.

> In classzone the aging exists too but it's _completely_ orthogonal to how the rest
> of the VM works.

Umm, that applies to Rik's stuff too!

> This is my humble opinion at least. I may be wrong. I'll let you know
> once I have a patch I'm happy with and some real-life numbers to prove my
> theory.

Good, the best theoretical VM in the world can fall apart instantly on
contact with the real world. :-)

Cheers, 
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 06:20:40PM +0200, Ingo Molnar wrote:
> i only suggested this as a debugging helper, instead of the suggested

I don't think removing the superlock from all filesystems is a good thing at this
stage (I agree with SCT that doing it only for ext2 [that's what we mostly care
about] would be possible). Who cares if UFS grabs the super lock or not?

grep lock_super fs/ext2/*.c is enough and we don't need debugging in the
scheduler for that.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Ingo Molnar


On Mon, 25 Sep 2000, Alexander Viro wrote:

> > i'd suggest to simply BUG() in schedule() if the superblock lock is held
> > not directly by lock_super. Holding the superblock lock is IMO quite rude
> > anyway (for performance and latency) - is there any place where we hold it
> > for a long time and it's unavoidable?
> 
> Ingo, schedule() has no bloody business _knowing_ about superblock
> locks in the first place. Yes, ext2 should not bother taking it at
> all. For completely unrelated reasons.

i only suggested this as a debugging helper, instead of the suggested
ext2_getblk() BUG() helper. Obviously schedule() has no business knowing
about filesystem locks.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Alexander Viro



On Mon, 25 Sep 2000, Ingo Molnar wrote:

> 
> On Mon, 25 Sep 2000, Stephen C. Tweedie wrote:
> 
> > Sorry, but in this case you have got a lot more variables than you
> > seem to think.  The obvious lock is the ext2 superblock lock, but
> > there are side cases with quota and O_SYNC which are much less
> > commonly triggered.  That's not even starting to consider the other
> > dozens of filesystems in the kernel which have to be audited if we
> > change the locking requirements for GFP calls.
> 
> i'd suggest to simply BUG() in schedule() if the superblock lock is held
> not directly by lock_super. Holding the superblock lock is IMO quite rude
> anyway (for performance and latency) - is there any place where we hold it
> for a long time and it's unavoidable?

Ingo, schedule() has no bloody business _knowing_ about superblock locks
in the first place. Yes, ext2 should not bother taking it at all. For
completely unrelated reasons.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Ingo Molnar


On Mon, 25 Sep 2000, Stephen C. Tweedie wrote:

> Sorry, but in this case you have got a lot more variables than you
> seem to think.  The obvious lock is the ext2 superblock lock, but
> there are side cases with quota and O_SYNC which are much less
> commonly triggered.  That's not even starting to consider the other
> dozens of filesystems in the kernel which have to be audited if we
> change the locking requirements for GFP calls.

i'd suggest to simply BUG() in schedule() if the superblock lock is held
not directly by lock_super. Holding the superblock lock is IMO quite rude
anyway (for performance and latency) - is there any place where we hold it
for a long time and it's unavoidable?

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks

2000-09-25 Thread Stephen C. Tweedie

Hi,

On Mon, Sep 25, 2000 at 12:36:50AM +0200, bert hubert wrote:
> On Mon, Sep 25, 2000 at 12:13:42AM +0200, Andrea Arcangeli wrote:
> > On Sun, Sep 24, 2000 at 10:43:03PM +0100, Stephen C. Tweedie wrote:
> > > any form of serialisation on the quota file).  This feels like rather
> > > a lot of new and interesting deadlocks to be introducing so late in
> > > 2.4.  :-)
> 
> True. But they also appear to be found and solved at an impressive rate.
> These deadlocks are fatal and don't hide in corners, whereas the previous mm
> problems used to be very hard to spot and fix, there not being real
> showstoppers, except for abysmal performance. [1]

Sorry, but in this case you have got a lot more variables than you
seem to think.  The obvious lock is the ext2 superblock lock, but
there are side cases with quota and O_SYNC which are much less
commonly triggered.  That's not even starting to consider the other
dozens of filesystems in the kernel which have to be audited if we
change the locking requirements for GFP calls.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Andrea Arcangeli

On Sun, Sep 24, 2000 at 11:39:13PM -0300, Marcelo Tosatti wrote:
> - Change kmem_cache_shrink to return the number of freed pages. 

I did that too, extending a patch from Mark. I also removed the first_not_full
ugliness, providing LIFO behaviour for the completely freed slabs (so
kmem_cache_reap removes the oldest completely unused slabs from the queue, not
the most recently used ones with potentially live cache in the CPU). 
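
For illustration, a rough sketch of the interface change being described
(the structures and names below are invented; the real slab code is more
involved):

	#include <linux/list.h>

	struct slab_sketch {
		struct list_head list;
		int pages;			/* pages backing this slab */
	};

	struct cache_sketch {
		struct list_head slabs_free;	/* completely unused slabs */
	};

	/*
	 * Reap completely free slabs, oldest first, and report how many
	 * pages were released so the caller can account reclaim progress.
	 */
	static int cache_shrink_sketch(struct cache_sketch *cachep)
	{
		int freed = 0;

		while (!list_empty(&cachep->slabs_free)) {
			struct slab_sketch *slabp =
				list_entry(cachep->slabs_free.prev,
					   struct slab_sketch, list);

			list_del(&slabp->list);
			freed += slabp->pages;	/* free_pages() would go here */
		}
		return freed;
	}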

> There was a comment on the shrink functions about making
> kmem_cache_shrink() work on a GFP_DMA/GFP_HIGHMEM basis to free only the
> wanted pages by the current allocation. 

This is meaningless at the moment because it can't be addressed without
classzone logic in the allocator (classzone means that the allocator will pass
to the memory balancing code the information about _which_ classzone you have
to allocate memory from, so you won't waste time synchronously balancing
unrelated zones).

My patch is here (it isn't going to apply cleanly due to the test9 changes
in do_try_to_free_pages, but porting is trivial). It was tested and
it was working for me.


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test7/slab-1

BTW, here there's a fix for a longstanding SMP race (since swap_out and msync
don't run with the big lock) that can corrupt memory: 


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test5/msync-smp-race-1

Here's the fix for another SMP race in establish_pte:


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test5/tlb-flush-smp-race-1

The fix for this last bit is ugly but it's safe, because Manfred said s390 has
a flush_tlb_page that atomically flushes and makes the pte invalid (a cleaner
fix means moving part of establish_pte into the arch inlines).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 04:53:05PM +0200, Ingo Molnar wrote:
> sorry - i said it was *noticed* by Dimitris. (and sent to l-k IIRC)

I didn't know.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Ingo Molnar


On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> > the EXCLUSIVE thing was noticed by Dimitris i think, and it makes tons of
> 
> Actually I'm the one who introduced the EXCLUSIVE thing there and I audited

sorry - i said it was *noticed* by Dimitris. (and sent to l-k IIRC)

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 04:29:42PM +0200, Ingo Molnar wrote:
> There is no guarantee at all that the reader will win. If reads and writes
> racing for request slots ever becomes a problem then we should introduce a
> separate read and write waitqueue.

I agree. However, here I also have an in-flight per-queue limit of locked stuff
(otherwise, with 512k-sized requests on SCSI, I could fill 128 MB of RAM with
locked memory in a few seconds, and I don't want to decrease the size of the queue
because it has to be large for aggressive reordering when the requests are 4k each).
This in-flight-per-queue limit is actually a non-exclusive wakeup and it
triggers more often than the request shortage (because most of the time writes
are consecutive), so having two waitqueues, with the reads registering
themselves in both, shouldn't be a very significant improvement at the moment (I
should first care about a wake-one in-flight-limit-per-queue wakeup :).

> the EXCLUSIVE thing was noticed by Dimitris i think, and it makes tons of

Actually I'm the one who introduced the EXCLUSIVE thing there, and I audited
_all_ the device drivers to check that they do 1 wakeup for each 1 request they
release before sending it off to Linus. But I never thought (until a few days ago)
about the fact that if a read completes a reserved request, the write won't be
able to accept it.

So long term we'll do two wake-one queues with reads registered in both.
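
A rough sketch of what that could look like (hypothetical: the real queue has a single wait_for_request waitqueue, the per-rw wait_for_request[] pair below is invented for illustration, and this is not code from any posted patch): writers sleep only on the write queue while readers register on both, so whichever kind of request gets freed, the wake-one wakeup always hits a task that is allowed to take it.

static struct request *get_request_wait_sketch(request_queue_t *q, int rw)
{
        DECLARE_WAITQUEUE(wait, current);
        DECLARE_WAITQUEUE(wait2, current);
        struct request *rq;

        add_wait_queue(&q->wait_for_request[rw], &wait);
        if (rw == READ)
                /* readers are also entitled to slots freed on the write side */
                add_wait_queue(&q->wait_for_request[WRITE], &wait2);
        for (;;) {
                /* wake-one semantics, as in the TASK_EXCLUSIVE scheme above */
                set_current_state(TASK_UNINTERRUPTIBLE | TASK_EXCLUSIVE);
                spin_lock_irq(&io_request_lock);
                rq = get_request(q, rw);
                spin_unlock_irq(&io_request_lock);
                if (rq)
                        break;
                generic_unplug_device(q);
                schedule();
        }
        set_current_state(TASK_RUNNING);
        remove_wait_queue(&q->wait_for_request[rw], &wait);
        if (rw == READ)
                remove_wait_queue(&q->wait_for_request[WRITE], &wait2);
        return rq;
}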

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 04:18:54PM +0200, Jens Axboe wrote:
> The scsi layer currently "manually" does a list_add on the queue itself,
> which doesn't look too healthy.

It's grabbing the io_request_lock so it looks healthy for now :)

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Jens Axboe

On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > The sg problem was different. When sg queues a request, it invokes the
> > request_fn to handle it. But if the queue is currently plugged, the
> > scsi_request_fn will not do anything.
> 
> That will explain it, yes. In the same way, for correctness, those should also
> be converted from request_fn to generic_unplug_device, right? (this

Yes, that would be the right fix. However, then we also need some
way of inserting requests in the queue and letting it plug when appropriate.
The scsi layer currently "manually" does a list_add on the queue itself,
which doesn't look too healthy.

> will also avoid spuriously recalling request_fn because the device is still
> in the tq_disk queue even when the I/O generated by the below request_fn
> has completed)
> 
>   if (major >= COMPAQ_SMART2_MAJOR+0 && major <= COMPAQ_SMART2_MAJOR+7)
>           (q->request_fn)(q);
>   if (major >= DAC960_MAJOR+0 && major <= DAC960_MAJOR+7)
>           (q->request_fn)(q);

AFAIR, Eric tried to talk to the Compaq folks (and Leonard too, I dunno)
about why they want this. What came of it, I don't know.

-- 
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 04:11:34PM +0200, Jens Axboe wrote:
> Interesting. I haven't done any serious benching with the CSCAN introduction
> in elevator_linus, I'll try that too.

Changing only that, the performance decreased reproducibly from 16 to 14
mbyte/sec in the read test with 2 threads.

So far I'm testing only IDE with LVM striping on two equally fast disks on
separate IDE channels.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 04:08:38PM +0200, Jens Axboe wrote:
> The sg problem was different. When sg queues a request, it invokes the
> request_fn to handle it. But if the queue is currently plugged, the
> scsi_request_fn will not do anything.

That will explain it, yes. In the same way, for correctness, those should also
be converted from request_fn to generic_unplug_device, right? (this
will also avoid spuriously recalling request_fn because the device is still in the
tq_disk queue even when the I/O generated by the below request_fn has completed)

	if (major >= COMPAQ_SMART2_MAJOR+0 && major <= COMPAQ_SMART2_MAJOR+7)
		(q->request_fn)(q);
	if (major >= DAC960_MAJOR+0 && major <= DAC960_MAJOR+7)
		(q->request_fn)(q);
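
A sketch of the conversion being suggested here (not a posted patch): going through the unplug path means the request_fn only runs when the queue actually has queued work, and a plugged queue gets unplugged instead of silently doing nothing.

	/* sketch: let the unplug logic decide whether request_fn has work */
	if (major >= COMPAQ_SMART2_MAJOR+0 && major <= COMPAQ_SMART2_MAJOR+7)
		generic_unplug_device(q);
	if (major >= DAC960_MAJOR+0 && major <= DAC960_MAJOR+7)
		generic_unplug_device(q);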

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Ingo Molnar


On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> driver (and I very much hope that with EXCLUSIVE gone away and the
> wait_on_* fixed those hangs will go away because I don't see anything else
> wrong at this moment).

the EXCLUSIVE thing only optimizes the wakeup, it's not semantic! How
is it better to let 100 processes race for one freed-up request slot?
There is no guarantee at all that the reader will win. If reads and writes
racing for request slots ever becomes a problem then we should introduce a
separate read and write waitqueue.

the EXCLUSIVE thing was noticed by Dimitris i think, and it makes tons of
(performance) sense.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Jens Axboe

On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > And a new elevator was introduced some months ago to solve this.
> 
> And now that I've done some benchmarking, it seems the major optimization comes from
> the implementation of the new _ordering_ algorithm in test2, not really from
> the removal of the more fine-grained latency control (that said, I'm not going to
> reintroduce the previous latency control; the current one doesn't provide
> great latency but it's ok).

Yes, I found this the greatest improvement too.

> As soon as I patch my tree with Peter's perfect CSCAN ordering (which only changes
> the ordering algorithm), tiotest performance drops significantly in the
> 2-thread-reading case. The elvtune settings don't matter; it's only a matter
> of the ordering.

Interesting. I haven't done any serious benching with the CSCAN introduction
in elevator_linus, I'll try that too.

-- 
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Jens Axboe

On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > i had yesterday - those were simple VM deadlocks. I dont see any deadlocks
> 
> Definitely. They can't explain anything about the VM deadlocks. I was
> _only_ talking about the blkdev hangs that caused you to unplug the
> queue at each reschedule in tux and that Eric reported me for the SG
> driver (and I very much hope that with EXCLUSIVE gone away and the
> wait_on_* fixed those hangs will go away because I don't see anything else
> wrong at this moment).

The sg problem was different. When sg queues a request, it invokes the
request_fn to handle it. But if the queue is currently plugged, the
scsi_request_fn will not do anything.

-- 
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 03:49:52PM +0200, Jens Axboe wrote:
> And a new elevator was introduced some months ago to solve this.

And now that I've done some benchmarking, it seems the major optimization comes from
the implementation of the new _ordering_ algorithm in test2, not really from
the removal of the more fine-grained latency control (that said, I'm not going to
reintroduce the previous latency control; the current one doesn't provide great
latency but it's ok).

As soon as I patch my tree with Peter's perfect CSCAN ordering (which only changes
the ordering algorithm), tiotest performance drops significantly in the
2-thread-reading case. The elvtune settings don't matter; it's only a matter of
the ordering.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Jens Axboe

On Mon, Sep 25 2000, Ingo Molnar wrote:
> > The changes made were never half-done. The recent bug fixes have
> > mainly been about removing cruft from the earlier elevator and fixing a bug
> > where the elevator insert would screw up a bit. So I'd call that fine
> > tuning or adjusting, not fixing half-done stuff.
> 
> sorry i did not mean to offend you - unadjusted and unfixed stuff hanging
> around in the kernel for months is 'half done' for me.

No offense taken, I just tried to explain my view. And in light of
the bad test2, I'd like the new changes to not have any "issues". So
this work has been going on for the last month or so, and I think we are
finally getting to agreement on what needs to be done now and how. WIP.

> > And a new elevator was introduced some months ago to solve this.
> 
> and these are still not solved in the vanilla kernel, as recent complaints
> on l-k prove.

Different problems, though :(. However, I believe they are solved in
Andrea's and my current tree. It just needs the final cleaning; more later.

-- 
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Ingo Molnar


On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> I was _only_ talking about the blkdev hangs [...]

I guess this was just a miscommunication. It never 'hung', it just performed
reads at 20k/sec or so (without any writes being done in the
background). A 'hang' for me is a deadlock or lockup, not a slowdown.

> that caused you to unplug the queue at each reschedule in tux and that
> Eric reported me for the SG driver (and I very much hope that with
> EXCLUSIVE gone away and the wait_on_* fixed those hangs will go away
> because I don't see anything else wrong at this moment).

okay, i'll test this.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 03:57:31PM +0200, Ingo Molnar wrote:
> i had yesterday - those were simple VM deadlocks. I dont see any deadlocks

Definitely. They can't explain anything about the VM deadlocks. I was
_only_ talking about the blkdev hangs that caused you to unplug the
queue at each reschedule in tux and that Eric reported to me for the SG
driver (and I very much hope that with EXCLUSIVE gone away and the
wait_on_* fixed those hangs will go away because I don't see anything else
wrong at this moment).

> but one of these two fixes could explain the slowdown I saw on and off for
> quite some time, seeing very bad read performance occasionally. (do you
> remember my sched.c tq_disk hack?)

Exactly, that's the only thing I was talking about in this subthread.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Ingo Molnar


On Mon, 25 Sep 2000, Jens Axboe wrote:

> The changes made were never half-done. The recent bug fixes have
> mainly been about removing cruft from the earlier elevator and fixing a bug
> where the elevator insert would screw up a bit. So I'd call that fine
> tuning or adjusting, not fixing half-done stuff.

sorry i did not mean to offend you - unadjusted and unfixed stuff hanging
around in the kernel for months is 'half done' for me.

> > the first reports about bad write performance came right after the
> > original elevator patches went in, about 6 months ago.
> 
> And a new elevator was introduced some months ago to solve this.

and these are still not solved in the vanilla kernel, as recent complaints
on l-k prove.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Ingo Molnar


On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> - sync_page(page);
>   set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> + sync_page(page);

> - run_task_queue(&tq_disk);
>   set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> + run_task_queue(&tq_disk);

these look like genuine fixes, but i dont think they can explain the hangs
i had yesterday - those were simple VM deadlocks. I dont see any deadlocks
today - but i'm running the unsafe B2 variant of the vmfixes patch. (and i
have no swapping enabled which simplifies my VM setup.)

but one of these two fixes could explain the slowdown I saw on and off for
quite some time, seeing very bad read performance occasionally. (do you
remember my sched.c tq_disk hack?)

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Stephen C. Tweedie

Hi,

On Mon, Sep 25, 2000 at 04:02:30AM +0200, Andrea Arcangeli wrote:
> On Sun, Sep 24, 2000 at 09:27:39PM -0400, Alexander Viro wrote:
> > So help testing the patches to them. Arrgh...
> 
> I think I'd better fix the bugs that I know about before testing patches that
> try to remove the superblock_lock at this stage.

Right.  If we're introducing new deadlock possibilities, then sure we
can fix the obvious cases in ext2, but it will be next to impossible
to do a thorough audit of all of the other filesystems.  Adding in the
new shrink_icache loop into the VFS just feels too dangerous right
now.

Of course, that doesn't mean we shouldn't remove the excessive
superblock locking from ext2 --- rather, it is simply more robust to
keep the two issues separate.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 03:10:51PM +0200, Ingo Molnar wrote:
> yep. But i dont understand why this makes any difference - the waitqueue

It makes a difference because your sleeping reads won't get the wakeup
even when they could queue their reserved read request (they have
to wait for the FIFO to roll around or for some write to complete).

> wakeup is FIFO, so any other request will eventually arrive. Could you
> explain this bug a bit better?

Well, it may not explain an infinite hang because, as you say, the write that got
the spurious wakeup will unplug the queue and after some time the reads will be
woken up. So maybe that wasn't the reason for your hangs, because I remember your
problem looked more like an infinite hang that was only solved by kflushd
writing some more stuff and unplugging the queue as a side effect (however I'm
not sure, since I never experienced those myself). 

But I hope if it wasn't that one it's the below fix that will help:

Index: mm/filemap.c
===================================================================
RCS file: /home/andrea/cvs/linux/mm/filemap.c,v
retrieving revision 1.1.1.5.2.3
retrieving revision 1.1.1.5.2.4
diff -u -r1.1.1.5.2.3 -r1.1.1.5.2.4
--- mm/filemap.c	2000/09/21 03:11:53	1.1.1.5.2.3
+++ mm/filemap.c	2000/09/25 03:33:31	1.1.1.5.2.4
@@ -622,8 +622,8 @@
 
 	add_wait_queue(&page->wait, &wait);
 	do {
-		sync_page(page);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		sync_page(page);
 		if (!PageLocked(page))
 			break;
 		schedule();
Index: fs/buffer.c
===================================================================
RCS file: /home/andrea/cvs/linux/fs/buffer.c,v
retrieving revision 1.1.1.5.2.1
retrieving revision 1.1.1.5.2.2
diff -u -r1.1.1.5.2.1 -r1.1.1.5.2.2
--- fs/buffer.c	2000/09/06 19:57:51	1.1.1.5.2.1
+++ fs/buffer.c	2000/09/25 03:33:30	1.1.1.5.2.2
@@ -147,8 +147,8 @@
 	atomic_inc(&bh->b_count);
 	add_wait_queue(&bh->b_wait, &wait);
 	do {
-		run_task_queue(&tq_disk);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		run_task_queue(&tq_disk);
 		if (!buffer_locked(bh))
 			break;
 		schedule();


Think about what happens if the buffer returns locked between set_task_state(tsk,
TASK_UNINTERRUPTIBLE) and if (!buffer_locked(bh)). The window is very small, but
it looks like a genuine window for a deadlock (and this one could certainly explain
infinite hangs in read... even if it looks even less realistic than the
EXCLUSIVE task thing).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Ingo Molnar


On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> > yet another elevator algorithm we need a squeaky clean VM balancer above
> 
> FYI: My current tree (based on 2.4.0-test8-pre5) delivers 16mbyte/sec
> in the tiobench write test compared to clean 2.4.0-test8-pre5 that
> delivers 8mbyte/sec

great! I'm happy we have a fine-tuned elevator again.

> Also, I found the reason for your hang: it's the TASK_EXCLUSIVE in
> wait_for_request. The high part of the queue is reserved for reads.
> Now if a read completes and it wakes up a write, you'll hang.

yep. But i dont understand why this makes any difference - the waitqueue
wakeup is FIFO, so any other request will eventually arrive. Could you
explain this bug a bit better?

> If you think I should delay those fixes to do something else, I don't
> agree, sorry.

no, I never meant it. I find it very good that those half-done changes are
cleaned up and the remaining bugs / performance problems are eliminated -
the first reports about bad write performance came right after the
original elevator patches went in, about 6 months ago.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] vmfixes-2.4.0-test9-B2

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 12:13:08PM +0200, Ingo Molnar wrote:
> 
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> 
> > Not sure if this is the right moment for those changes though, I'm not
> > worried about ext2 but about the other non-networked fses that nobody
> > uses regularly.
> 
> it *is* the right moment to clean these issues up. These kinds of things

I'm talking about the removal of the superblock lock from the filesystems.

Note: I don't have problems with the removal of the superblock lock even if it's
done at this stage. I'm not the one who can choose those things; it's Linus's
responsibility to take the final decision for the official tree. But don't ask
me to test patches that remove the superblock lock _at_this_stage_, before I
can run a stable and fast 2.4.x, because I won't do that. Period.

> yet another elevator algorithm we need a squeaky clean VM balancer above

FYI: My current tree (based on 2.4.0-test8-pre5) delivers 16mbyte/sec in the
tiobench write test, compared to clean 2.4.0-test8-pre5 that delivers 8mbyte/sec,
with only blkdev layer changes between the two kernels (and no,
that's not a matter of the elevator, since there are no seeks in the test
and I've not changed the elevator sorting algorithm during the bench).

Also, I found the reason for your hang: it's the TASK_EXCLUSIVE in
wait_for_request. The high part of the queue is reserved for reads.
Now if a read completes and it wakes up a write, you'll hang.
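
To illustrate the mechanism (a sketch with invented names -- nr_free_requests, QUEUE_RESERVED_FOR_READS and dequeue_free_request() are not the real ll_rw_blk.c identifiers): when the only free slots sit in the read-reserved high part of the queue, a writer that receives the exclusive wake-one wakeup cannot take the slot, goes back to sleep, and no sleeping reader is ever woken for it.

/* Sketch only. */
static struct request *get_request_sketch(request_queue_t *q, int rw)
{
        if (q->nr_free_requests == 0)
                return NULL;
        /* the high part of the free list is kept for reads */
        if (rw == WRITE && q->nr_free_requests <= QUEUE_RESERVED_FOR_READS)
                return NULL;    /* woken writer gives up; the wakeup is wasted */
        q->nr_free_requests--;
        return dequeue_free_request(q, rw);     /* hypothetical helper */
}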

If you think I should delay those fixes to do something else, I don't agree,
sorry. 

> all. Please help identifying, fixing, debugging and testing these VM
> balancing issues. This is tough work and it needs to be done.

I had an alternative VM that I prefer from a design standpoint; I'll improve
it and I'll maintain it.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/


