Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Fri, 29 Sep 2000, Andrea Arcangeli wrote:
> On Fri, Sep 29, 2000 at 11:39:18AM -0300, Rik van Riel wrote:
> > OK, good to see that we agree on the fact that we
> > should age and swapout all pages equally aggressively.
>
> Actually I think we should start looking at the mapped stuff
> _only_ when the I/O cache aging is relevant. If the I/O cache
> aging isn't relevant there's no point in looking at the mapped
> stuff, since there's cache pollution going on.
>
> If the cache is re-used (so if it's useful) that's a completely
> different issue, and in that case unmapping potentially unused
> stuff is the right thing to do, of course.

This is why I want to do:

1) equal aging of all pages in the system
2) page aging to have properties of both LRU and LFU
3) drop-behind to cope with streaming IO in a good way

and maybe:

4) move unmapped pages to the inactive_clean list for immediate
   reclaiming, but put pages which are/were mapped on the
   inactive_dirty list so we keep them a little bit longer

The only way to reliably know if the cache is re-used a lot is by
making sure we do the page aging for unmapped and mapped pages the
same way. If we don't do that, we won't be able to make a sensible
comparison between the activity of pages in different places.

regards,

Rik
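To make points 1 and 2 concrete, here is a toy model of page aging
that has both LRU and LFU properties. This is an illustrative sketch
only; the struct and constants are stand-ins, not the 2.4 kernel code:

struct page { unsigned age; };                  /* stub for illustration */

enum { PAGE_AGE_ADV = 3, PAGE_AGE_MAX = 64 };   /* assumed constants */

static void age_page(struct page *page, int referenced)
{
        if (referenced) {
                /* LFU flavour: frequently touched pages accumulate age */
                page->age += PAGE_AGE_ADV;
                if (page->age > PAGE_AGE_MAX)
                        page->age = PAGE_AGE_MAX;
        } else {
                /* LRU flavour: idle pages decay exponentially */
                page->age >>= 1;
        }
        /* pages that reach age 0 become candidates for the
         * inactive_clean/inactive_dirty lists mentioned above */
}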
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Fri, Sep 29, 2000 at 11:39:18AM -0300, Rik van Riel wrote:
> OK, good to see that we agree on the fact that we
> should age and swapout all pages equally aggressively.

Actually I think we should start looking at the mapped stuff
_only_ when the I/O cache aging is relevant. If the I/O cache
aging isn't relevant there's no point in looking at the mapped
stuff, since there's cache pollution going on.

It's much less costly to drop a page from the unmapped cache than
to play with pagetables, and also having a slow read() is much
better than having to fault in the .text areas (because the process
is going to be designed in a way that expects read to block, so it
may do it asynchronously or in a separate thread or whatever). A
`cp /dev/zero .` shouldn't swapout/unmap anything.

If the cache is re-used (so if it's useful) that's a completely
different issue, and in that case unmapping potentially unused
stuff is the right thing to do, of course.

Andrea
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Thu, 28 Sep 2000, Andrea Arcangeli wrote:
> On Thu, Sep 28, 2000 at 08:16:32AM -0300, Rik van Riel wrote:
> > Andrea, I have the strong impression that your idea of
> > memory balancing is based on the idea that the OS should
> > out-smart the application instead of looking at the usage
> > pattern of the pages in memory.
>
> Not sure what you mean with out-smart.
>
> My only point is that the OS actually can only swapout such shm.
> If that SHM is not supposed to be swapped out, and if the OS I/O
> cache has more aging than the shm cache, then the OS should tell
> the DBMS that it's time to shrink some shm pages by freeing them.

OK, good to see that we agree on the fact that we should age and
swapout all pages equally aggressively.

> > of the pages in question, instead of making presumptions
> > based on what kind of cache the page is in.
>
> For the mapped pages we never make presumptions. We always check
> the accessed bit, and that's the most reliable info to know if
> the page has been accessed recently (it is set by CPU accesses
> through the pte, not only during page faults or cache hits).
> With the current design pages mapped multiple times will be
> overaged a bit, but this can't be fixed until we make a page->pte
> reverse lookup...

Indeed.

regards,

Rik
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
> "ingo" == Ingo Molnar <[EMAIL PROTECTED]> writes: Hi ingo> 2) introducing sys_flush(), which flushes pages from the pagecache. It is not supposed that mincore can do that (yes, just now it is not implemented, but the interface is there to do that)? Just curious. -- In theory, practice and theory are the same, but in practice they are different -- Larry McVoy - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Thu, Sep 28, 2000 at 05:13:59PM +0200, Ingo Molnar wrote:
> Can anyone see any problems with the concept of this approach? This can be

It works only on top of a filesystem, while all the clever
checkpointing stuff is done internally by the DB (in fact it
_needs_ O_SYNC when it works on the fs).

Andrea
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Thu, 28 Sep 2000, Andrea Arcangeli wrote:

> The DBMS uses shared SCSI disks across multiple hosts on the same SCSI
> bus and synchronizes the distributed cache via TCP. Tell me how to do
> that with the OS cache and mmap.

this could be supported by:

1) mlock()-ing the whole mapping.

2) introducing sys_flush(), which flushes pages from the pagecache.

3) doing sys_msync() after dirtying a range and before sending a
   TCP event.

Whenever the DB-cache-flush-event comes over TCP, it calls
sys_flush() for that given virtual address range or file address
space range. sys_flush() flushes the page from the pagecache and
unmaps the address. Whenever it's needed again by the application
it will be faulted in and read from disk.

Can anyone see any problems with the concept of this approach? This
can be used for a page-granularity distributed IO cache. (there are
some smaller problems with this approach, like mlock() on a big
range can only be done by privileged users, but that's not an issue
IMO.)

Ingo
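To make the proposal concrete, here is a rough sketch of the
resulting userspace protocol. sys_flush() is hypothetical (no such
syscall exists), so its signature is assumed, and send_flush_event()
stands in for whatever the application uses to notify its peers:

#include <sys/mman.h>
#include <stddef.h>

/* assumed interface: drop the pagecache pages backing this range
 * and unmap them, per point 2 above */
extern int sys_flush(void *addr, size_t len);

/* application-defined: tell the other hosts over TCP */
extern int send_flush_event(int tcp_fd, void *addr, size_t len);

/* after dirtying a range: write it to the shared disk, then
 * publish the event (point 3 above) */
static int publish_dirty_range(int tcp_fd, void *addr, size_t len)
{
        if (msync(addr, len, MS_SYNC) < 0)
                return -1;
        return send_flush_event(tcp_fd, addr, len);
}

/* on receiving a flush event from a peer: invalidate our stale
 * copy; the next access faults the page in from the shared disk */
static int handle_flush_event(void *addr, size_t len)
{
        return sys_flush(addr, len);
}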
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Thu, Sep 28, 2000 at 01:31:40PM +0200, Ingo Molnar wrote:
> if the shm contains raw I/O data, then that's flawed application design -
> an mmap()-ed file should be used instead. Shm is equivalent to shared

The DBMS uses shared SCSI disks across multiple hosts on the same
SCSI bus and synchronizes the distributed cache via TCP. Tell me how
to do that with the OS cache and mmap.

Andrea
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Thu, Sep 28, 2000 at 08:16:32AM -0300, Rik van Riel wrote:
> Andrea, I have the strong impression that your idea of
> memory balancing is based on the idea that the OS should
> out-smart the application instead of looking at the usage
> pattern of the pages in memory.

Not sure what you mean with out-smart.

My only point is that the OS actually can only swapout such shm.
If that SHM is not supposed to be swapped out, and if the OS I/O
cache has more aging than the shm cache, then the OS should tell
the DBMS that it's time to shrink some shm pages by freeing them.

> of the pages in question, instead of making presumptions
> based on what kind of cache the page is in.

For the mapped pages we never make presumptions. We always check
the accessed bit, and that's the most reliable info to know if the
page has been accessed recently (it is set by CPU accesses through
the pte, not only during page faults or cache hits). With the
current design pages mapped multiple times will be overaged a bit,
but this can't be fixed until we make a page->pte reverse lookup...

Andrea
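A userspace model of the accessed-bit sampling described above: the
CPU sets a "young" bit in the pte on every access through the
mapping, and the VM scan tests and clears it to see whether the page
was touched since the last pass. The pte type here is a toy stand-in
(though 0x20 happens to be the accessed bit on i386):

typedef unsigned long pte_t;            /* toy pte for illustration */

#define PTE_YOUNG 0x20UL                /* accessed bit */

static int test_and_clear_young(pte_t *ptep)
{
        if (*ptep & PTE_YOUNG) {
                *ptep &= ~PTE_YOUNG;    /* re-arm; hardware sets it
                                         * again on the next access */
                return 1;               /* referenced since last scan */
        }
        return 0;
}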
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Thu, Sep 28, 2000 at 07:08:51AM -0300, Rik van Riel wrote:
> taking care of this itself. But this is not something the OS
> should prescribe to the application.

Agreed.

> (unless the SHM users tell you that this is the normal way
> they use SHM ... but as Christoph just told us, it isn't)

shm is not used as I/O cache by 90% of the apps out there, because
normal apps use the OS cache functionality (90% of those apps don't
use rawio to share a black box that looks like a SCSI disk via a
SCSI bus connected to other hosts as well). I for sure agree shm
swapin/swapout is very important. (we moved shm swapout/swapin to
the swap cache with readaround for that reason)

Andrea
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Thu, 28 Sep 2000, Rik van Riel wrote:

> The OS has no business knowing what's inside that SHM page.

exactly.

> IF the shm contains I/O cache, maybe you're right. However,
> until you know that this is the case, optimising for that
> situation just doesn't make any sense.

if the shm contains raw I/O data, then that's flawed application
design - an mmap()-ed file should be used instead. Shm is equivalent
to shared anonymous pages.

Ingo
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Thu, 28 Sep 2000, Rik van Riel wrote:
> On Wed, 27 Sep 2000, Andrea Arcangeli wrote:
> > But again: if the shm contains I/O cache it should be released
> > and not swapped out. Swapping out shmfs that contains I/O cache
> > would be exactly like swapping out page-cache.
>
> The OS has no business knowing what's inside that SHM page.

Hmm, now that I woke up, maybe I should formulate this in a
different way.

Andrea, I have the strong impression that your idea of memory
balancing is based on the idea that the OS should out-smart the
application instead of looking at the usage pattern of the pages in
memory.

This is fundamentally different from the idea that the OS should
make decisions based on the observed usage patterns of the pages in
question, instead of making presumptions based on what kind of
cache the page is in.

I've been away for 10 days and have been sitting on a bus all last
night, so my judgement may be off. I'd certainly like to hear I'm
wrong ;)

regards,

Rik
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Wed, 27 Sep 2000, Andrea Arcangeli wrote:
> On Wed, Sep 27, 2000 at 10:11:43AM +0200, Christoph Rohland wrote:
> > I just checked one oracle system and it did not lock the memory. And I
>
> If that memory is used for I/O cache then such memory should be
> released when the system runs into swap instead of swapping it
> out too (otherwise it's not cache anymore and it could be slower
> than re-reading the real data from disk with rawio).

It could also be faster. If the database spent half an hour
gathering pieces of data from all over the database, it might be
faster to keep it in one place in swap so it can be read in again
in one swoop. (I had an interesting talk about this with a database
person while at OLS)

But that's not the point. If your assertion is true, then the
database will probably be using an mlock()ed SHM region and taking
care of this itself. But this is not something the OS should
prescribe to the application.

If the OS finds that certain SHM pages are used far less than the
pages in the I/O cache, then those SHM pages should be swapped out.
The system's job is to keep the most used pages of data in memory
to minimise the amount of page faults happening. Trying to outsmart
the application shouldn't (IMHO of course) be part of that job...

> > Customers with performance problems very often start with too little
> > memory, but they cannot upgrade until this really big job finishes :-(
> >
> > Another issue about shm swapping is interactive transactions, where
> > some users have very large contexts and go for a coffee before
> > submitting. This memory can be swapped.
>
> Agreed, that's why I said shm performance under swap is very important
> as well (I'm not underestimating it).
>
> But again: if the shm contains I/O cache it should be released
> and not swapped out. Swapping out shmfs that contains I/O cache
> would be exactly like swapping out page-cache.

The OS has no business knowing what's inside that SHM page.

IF the shm contains I/O cache, maybe you're right. However, until
you know that this is the case, optimising for that situation just
doesn't make any sense. (unless the SHM users tell you that this is
the normal way they use SHM ... but as Christoph just told us, it
isn't)

regards,

Rik
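For what it's worth, a sketch of how a database that really cannot
afford to have its cache swapped would pin it itself, along the
lines Rik suggests. The size is made up, and for SysV shm the lock
is done with shmctl(SHM_LOCK) rather than mlock():

#include <sys/ipc.h>
#include <sys/shm.h>

#define CACHE_BYTES (300 << 20)         /* assumed 300M cache */

static void *attach_locked_cache(void)
{
        int id = shmget(IPC_PRIVATE, CACHE_BYTES, IPC_CREAT | 0600);
        void *p;

        if (id < 0)
                return NULL;
        p = shmat(id, NULL, 0);
        if (p == (void *)-1)
                return NULL;
        /* pin the whole segment in RAM; this needs root privileges,
         * which is the mlock() limitation mentioned elsewhere in
         * the thread */
        if (shmctl(id, SHM_LOCK, NULL) < 0)
                return NULL;            /* fall back to swappable shm */
        return p;
}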
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Wed, Sep 27, 2000 at 12:25:44PM -0600, Erik Andersen wrote:
> Or sysinfo(2). Same thing...

sysinfo structure doesn't export the number of active pages in the
system.

Andrea
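For comparison, the fields sysinfo(2) does export; there is no
active-page count among them, which is the point:

#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
        struct sysinfo si;

        if (sysinfo(&si) == 0)
                printf("total %lu free %lu buffers %lu shared %lu "
                       "swap free %lu\n",
                       si.totalram, si.freeram, si.bufferram,
                       si.sharedram, si.freeswap);
        return 0;
}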
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Wed Sep 27, 2000 at 07:42:00PM +0200, Andrea Arcangeli wrote:
>
> You should of course poll /proc/meminfo. (/proc/meminfo works in O(1) in
> 2.4.x so it's just the overhead of a read syscall)

Or sysinfo(2). Same thing...

 -Erik
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Wed, Sep 27, 2000 at 06:56:42PM +0200, Christoph Rohland wrote:
> Yes, but how does the application detect that it should free the mem?

The trivial way is not to detect it at all: let the user select how
much memory the application will use as cache, take it locked, and
then not care (the user will have to decrease the size of the shm by
hand if he wants to drop some cache). From the OS point of view it's
like not having that RAM at all, and there will be zero performance
difference compared to trashing into swap without such memory. (on
2.2.x this is not true, because of a complexity problem in
shrink_mmap that is solved with the real LRU in 2.4.x)

The other way is to have an shm cache that shrinks dynamically, by
polling /proc/meminfo and looking at the aging of the application's
own cache. Again, the user should give a minimum and a maximum
amount of shm cache to keep locked in memory. Then you look at
"freemem + cache + buffers - active cache" and you can say when
you're going to run into swap. Specifically, with classzone you'll
run into swap when that value is near zero. So when that value is
near zero you know it's time to shrink the shm cache dynamically if
it has a low age; otherwise the machine will trash into swap badly
and performance will decrease. (you could start shrinking when the
value drops below an amount of megabytes that is again configurable
via a form)

You should of course poll /proc/meminfo. (/proc/meminfo works in
O(1) in 2.4.x so it's just the overhead of a read syscall)

These DBs using rawio really want to substitute part of the kernel
cache functionality, so it's quite natural that they also don't want
the kernel to play with their caches while they run, and that they
would need some more interaction with the kernel memory balancing
(possibly via async signals) to get their shm reclaimed dynamically
more cleanly and efficiently, by registering for this functionality
(they could get a signal when the machine runs into swap, and then
the DB chooses whether it's worth releasing some locked cache after
looking at /proc/meminfo and at the working set of its own caches).

> Also you often have more overhead reading out of a database than
> having preprocessed data in swap.

Yes I see; it of course depends on the kind of cache (if it's very
near to the on-disk format then it probably shouldn't be swapped
out).

Andrea
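Put together, the polling scheme described above amounts to
something like the following sketch. The /proc/meminfo field names
match the 2.4-era layout, but the threshold, the use of the whole
Active field as "active cache", and shrink_shm_cache() are all
assumptions:

#include <stdio.h>
#include <string.h>

extern void shrink_shm_cache(void);     /* application-defined */

/* read one "Tag: <n> kB" field from /proc/meminfo */
static long meminfo_kb(const char *tag)
{
        FILE *f = fopen("/proc/meminfo", "r");
        char line[128], name[32];
        long v, kb = -1;

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "%31[^:]: %ld", name, &v) == 2 &&
                    strcmp(name, tag) == 0) {
                        kb = v;
                        break;
                }
        }
        fclose(f);
        return kb;
}

/* rough "freemem + cache + buffers - active cache" */
static long headroom_kb(void)
{
        return meminfo_kb("MemFree") + meminfo_kb("Cached")
             + meminfo_kb("Buffers") - meminfo_kb("Active");
}

/* poll this periodically: shrink our own cache before the box
 * starts trashing into swap */
void balance_shm_cache(void)
{
        const long threshold = 8 * 1024;        /* assumed 8M headroom */

        if (headroom_kb() < threshold)
                shrink_shm_cache();
}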
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
Andrea Arcangeli <[EMAIL PROTECTED]> writes:

> On Wed, Sep 27, 2000 at 10:11:43AM +0200, Christoph Rohland wrote:
> > I just checked one oracle system and it did not lock the memory. And I
>
> If that memory is used for I/O cache then such memory should be
> released when the system runs into swap instead of swapping it out
> too (otherwise it's not cache anymore and it could be slower than
> re-reading the real data from disk with rawio).

Yes, but how does the application detect that it should free the
mem? Also you often have more overhead reading out of a database
than having preprocessed data in swap.

> > Customers with performance problems very often start with too little
> > memory, but they cannot upgrade until this really big job finishes :-(
> >
> > Another issue about shm swapping is interactive transactions, where
> > some users have very large contexts and go for a coffee before
> > submitting. This memory can be swapped.
>
> Agreed, that's why I said shm performance under swap is very important
> as well (I'm not underestimating it).

fine :-)

Greetings
Christoph
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Wed, Sep 27, 2000 at 10:11:43AM +0200, Christoph Rohland wrote:
> I just checked one oracle system and it did not lock the memory. And I

If that memory is used for I/O cache then such memory should be
released when the system runs into swap instead of swapping it out
too (otherwise it's not cache anymore and it could be slower than
re-reading the real data from disk with rawio).

> Customers with performance problems very often start with too little
> memory, but they cannot upgrade until this really big job finishes :-(
>
> Another issue about shm swapping is interactive transactions, where
> some users have very large contexts and go for a coffee before
> submitting. This memory can be swapped.

Agreed, that's why I said shm performance under swap is very
important as well (I'm not underestimating it).

But again: if the shm contains I/O cache it should be released and
not swapped out. Swapping out shmfs that contains I/O cache would be
exactly like swapping out page-cache.

Andrea
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
Ingo Molnar <[EMAIL PROTECTED]> writes:

> On 27 Sep 2000, Christoph Rohland wrote:
> > Nobody should rely on shm swapping for productive use. But you have
> > changing/increasing loads on application servers and all of a sudden
> > you run oom. In this case the system should behave, and it is _very_
> > good to have smooth behaviour.
>
> it might make sense even in production use. If there is some calculation
> that has to be done only once per month, then sure the customer can decide
> to wait for it a few hours until it swaps itself ready, instead of buying
> gigs of RAM just to execute this single operation faster. Uncooperative
> OOM in such cases is a show-stopper. Or are you saying the same thing? :-)

That's what I meant with the coffee break. In a big installation
somebody is always drinking coffee :-)

You also often have different loads during daytime and nighttime.
Swapping buffers out to swap disk instead of rereading from the
database makes a lot of sense for this. But a single job should
never swap. (It works for two months, and then next month you get
the big escalation and you would love to have hotplug memory)

So swapping happens in productive use. But nobody should rely on it
too much. And I completely agree that uncooperative OOM is not
acceptable.

Greetings
Christoph
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On 27 Sep 2000, Christoph Rohland wrote:

> Nobody should rely on shm swapping for productive use. But you have
> changing/increasing loads on application servers and all of a sudden
> you run oom. In this case the system should behave, and it is _very_
> good to have smooth behaviour.

it might make sense even in production use. If there is some
calculation that has to be done only once per month, then sure the
customer can decide to wait for it a few hours until it swaps itself
ready, instead of buying gigs of RAM just to execute this single
operation faster. Uncooperative OOM in such cases is a show-stopper.
Or are you saying the same thing? :-)

Ingo
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
Andrea Arcangeli <[EMAIL PROTECTED]> writes:

> Said that, I heard of real world programs that have a .text larger
> than 2G

=:-O

> I know Oracle (and most other DBs) are very shm intensive. However
> the fact you say the shm is not locked in memory is really news to
> me. I really remembered that the shm was locked.

I just checked one oracle system and it did not lock the memory.
And I do not think that the other databases do it by default
either. And our application server definitely doesn't do it. And it
uses loads of shared memory. We will soon have application servers
with 16 GB memory at customer sites, which will have the whole
memory in shmfs.

> I also don't see the point of keeping data cache in the swap. Swap
> involves SMP tlb flushes and all the other big overhead that you
> could avoid by sizing the shm cache properly and taking it locked.
>
> Note: having very fast shm swapout/swapin is a very good thing (in
> fact we introduced readaround for the swapin and moved shm
> swapout/swapin locking to the swap cache in early 2.3.x exactly for
> that reason). But I just don't think DBMSs needed that.

Nobody should rely on shm swapping for productive use. But you have
changing/increasing loads on application servers and all of a
sudden you run oom. In this case the system should behave, and it
is _very_ good to have smooth behaviour.

Customers with performance problems very often start with too
little memory, but they cannot upgrade until this really big job
finishes :-(

Another issue about shm swapping is interactive transactions, where
some users have very large contexts and go for a coffee before
submitting. This memory can be swapped.

Greetings
Christoph
Re: [patch] vmfixes-2.4.0-test9-B2
On Tue, Sep 26, 2000 at 10:00:12PM +0200, Peter Osterlund wrote:
> Therefore, no matter what algorithm you use in elevator_linus() the total
> number of seeks should be the same.

It isn't. There's a big difference between the two algorithms, and
all your previous emails were completely correct about the
"theoretical" additional seeks during starvation avoidance.

> an even better way to sort the requests. I think the important thing is
> trying to minimize the total amount of seek time. But doing that requires

The current algorithm only tries to minimize the total amount of
seek time. That's what the elevator reordering is there for.

Andrea
Re: [patch] vmfixes-2.4.0-test9-B2
On Tue, 26 Sep 2000, Andrea Arcangeli wrote:

: > smaller with your algorithm? (I later realized that request merging is
: > done before the elevator function kicks in, so your algorithm should
:
: Not sure what you mean. There are two cases: the bh is merged, or
: the bh will be queued in a new request because merging isn't possible.
:
: Your change deals only with the latter case and that should be
: pretty orthogonal to what the merging stuff does.

I previously thought that elevator_linus() was called first, and
that elevator_linus_merge() was then invoked to merge sequential
requests before they were sent to the driver. If that had been the
case, your version of elevator_linus() would have generated more
seeks than CSCAN. But as I said, I was mistaken; it doesn't work
that way. The elevator_linus() function is only called if merging
isn't possible. Therefore, no matter what algorithm you use in
elevator_linus(), the total number of seeks should be the same.

Therefore, if you are not trying to be fair (as CSCAN is), maybe
there is an even better way to sort the requests. I think the
important thing is trying to minimize the total amount of seek time.
But doing that requires knowledge about how the seek time depends on
the seek distance.

--
Peter Österlund
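For concreteness, a toy version of the CSCAN ordering Peter refers
to. The real elevator code is different and also deals with merging
and starvation; the circular-key trick below just shows the one-way
sweep:

struct request { unsigned long sector; struct request *next; };

/* order sectors along one circular sweep starting at `head`;
 * unsigned wraparound puts sectors behind the head after the
 * sectors still ahead of it */
static int cscan_before(unsigned long head, unsigned long a,
                        unsigned long b)
{
        return (a - head) < (b - head);
}

static void cscan_insert(struct request **q, struct request *rq,
                         unsigned long head)
{
        while (*q && cscan_before(head, (*q)->sector, rq->sector))
                q = &(*q)->next;
        rq->next = *q;          /* requests behind the head wait for
                                 * the next sweep: that's the fairness */
        *q = rq;
}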
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
Andrea Arcangeli <[EMAIL PROTECTED]> writes:

> Could you tell me what's wrong in having an app with a 1.5G mapped
> executable (or a tiny executable but with a 1.5G shared/private file
> mapping if you prefer),

O.K., that sounds more reasonable. I was reading image as program
text... and a 1.5GB program text is something I have never seen (and
hopefully will never see :-)

> 300M of shm (or 300M of anonymous memory if you prefer) and 200M as
> filesystem cache?

I don't really see a reason for fs cache in the application. I
think that parallel applications tend to share either mostly all or
nothing, but I may be wrong here.

> The application has a misc I/O load that in some part will run out
> of the working set; what's wrong with this?
>
> What's ridiculous? Please elaborate.

I think we fixed this misreading. But still, IMHO you underestimate
the importance of shared memory for a lot of applications in the
high end. There is not only Oracle out there, and most of the shared
memory is _not_ locked.

Greetings
Christoph
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Tue, Sep 26, 2000 at 06:20:47PM +0200, Christoph Rohland wrote:
> O.K., that sounds more reasonable. I was reading image as program
> text... and a 1.5GB program text is something I have never seen (and
> hopefully will never see :-)

:)

From the shrink_mmap complexity-of-the-algorithm point of view a
1.5GB .text is completely equal to a large 1.5GB MAP_SHARED or a
large 1.5GB MAP_PRIVATE mapping (it doesn't need to be the .text of
the program). That said, I have heard of real world programs that
have a .text larger than 2G (that's why I was careful to say it
doesn't need to be a 1.5G .text, and that any other page-cache
mapping that large would have the same effect).

> > 300M of shm (or 300M of anonymous memory if you prefer) and 200M as
> > filesystem cache?
>
> I don't really see a reason for fs cache in the application. I think

In fact the application can as well use rawio.

> that parallel applications tend to share either mostly all or
> nothing, but I may be wrong here.

And then at some point you'll run `find /` or `tar
mylatestsources.tar.gz sources/` or updatedb is started up or
whatever. And you don't need more than 200M of fs cache for that
purpose.

Think of the O(N) complexity that we had in si_meminfo (guess why in
2.4.x `free` says 0 in the shared field). It made it impossible to
run `xosview` on a 10G box (it stalled for seconds). And si_meminfo
was only counting one field, not rolling pages around the lru while
grabbing locks and dirtying cachelines. That's a plain
complexity/scalability issue as far as I can tell, and classzone
solves it completely. When you run tar with your 1.5G shared mapping
in memory and you happen to hit the low watermark and need to
recycle some bytes of old cache, you'll run as fast as without the
mapping in memory. There will be zero difference in performance.
(just like now `free` runs as fast on a 10G machine as on a 4mbyte
machine)

> I think we fixed this misreading.

I should have explained things more carefully in the first place,
sorry.

> But still, IMHO you underestimate the importance of shared memory
> for a lot of applications in the high end. There is not only Oracle
> out there, and most of the shared memory is _not_ locked.

Well, I wasn't claiming that this optimization is very sensitive for
DB applications (at least for DBs that don't use quite big file
mappings).

I know Oracle (and most other DBs) are very shm intensive. However
the fact you say the shm is not locked in memory is really news to
me. I really remembered that the shm was locked.

I also don't see the point of keeping data cache in the swap. Swap
involves SMP tlb flushes and all the other big overhead that you
could avoid by sizing the shm cache properly and taking it locked.

Note: having very fast shm swapout/swapin is a very good thing (in
fact we introduced readaround for the swapin and moved shm
swapout/swapin locking to the swap cache in early 2.3.x exactly for
that reason). But I just don't think DBMSs needed that.

Note: simulations are completely a different thing (their evolution
is not predictable). Simulations can sure trash shm into swap
anytime (but Oracle shouldn't do that AFAIK).

Andrea
Re: [patch] vmfixes-2.4.0-test9-B2
On Tue, Sep 26, 2000 at 12:14:18AM +0200, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 10:52:08PM +0200, Peter Osterlund wrote:
> > Do you know why? Is it because the average seek distance becomes
>
> Good question. No I don't know why right now. I'll try again just to
> be 200% sure and I'll let you know the results.

These are the numbers produced by my current blkdev tree based on
test8-pre5 with only the spinlock-1 patch on it:

- 2.4.0-test8-pre5 + blkdev-1 - IA32 2-way SMP LVM-stripe IDE

File  File  Block Num  Seq Read     Rand Read    Seq Write    Rand Write
Dir   Size  Size  Thr  Rate (CPU%)  Rate (CPU%)  Rate (CPU%)  Rate (CPU%)
----- ----- ----- ---  -----------  -----------  -----------  -----------
.     2544  4096  1    16.38 7.99%  0.647 1.16%  15.60 14.8%  1.330 5.53%
.     2544  4096  2    16.34 10.9%  0.676 1.12%  15.70 17.2%  1.330 5.95%
.     2544  4096  4    16.30 10.9%  0.690 1.07%  15.55 17.9%  1.324 6.24%
.     2544  4096  8    15.71 12.1%  0.713 1.06%  15.11 17.8%  1.327 6.01%

File  File  Block Num  Seq Read     Rand Read    Seq Write    Rand Write
Dir   Size  Size  Thr  Rate (CPU%)  Rate (CPU%)  Rate (CPU%)  Rate (CPU%)
----- ----- ----- ---  -----------  -----------  -----------  -----------
.     2544  4096  1    16.41 7.82%  0.716 0.82%  15.91 14.6%  1.334 4.86%
.     2544  4096  2    16.44 11.1%  0.715 0.91%  15.82 17.1%  1.316 4.67%
.     2544  4096  4    16.39 10.9%  0.722 0.95%  15.52 17.9%  1.322 5.07%
.     2544  4096  8    16.02 11.8%  0.742 0.99%  15.13 17.8%  1.329 5.06%

andrea@laser:/mnt/p > ~/dbench/dbench 40
40 clients started
[dbench progress output omitted]
Throughput 10.7262 MB/sec (NB=13.4077 MB/sec 107.262 MBit/sec)

andrea@laser:/mnt/p > ~/dbench/dbench 40
40 clients started
[dbench progress output omitted]
Throughput 11.7624 MB/sec (NB=14.703 MB/sec 117.624 MBit/sec)
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Tue, Sep 26, 2000 at 08:54:23AM +0200, Christoph Rohland wrote:
> "Stephen C. Tweedie" <[EMAIL PROTECTED]> writes:
>
> > Hi,
> >
> > On Mon, Sep 25, 2000 at 09:32:42PM +0200, Andrea Arcangeli wrote:
> >
> > > Having shrink_mmap browse the mapped page cache is as useless as
> > > having shrink_mmap browse kernel memory and anonymous pages as it
> > > does in 2.2.x, as far as I can tell. It's an algorithm complexity
> > > problem and it will waste lots of CPU.
> >
> > It's a compromise between CPU cost and Getting It Right. Ignoring the
> > mmap is not a good solution either.
> >
> > > Now think of this simple real life example. A 2G RAM machine running
> > > an executable image of 1.5G, 300M in shm and 200M in cache.
>
> Hey that's ridiculous: 1.5G executable image and 300M shm? Take it
> vice-versa and you are approaching real life.

Could you tell me what's wrong in having an app with a 1.5G mapped
executable (or a tiny executable but with a 1.5G shared/private file
mapping if you prefer), 300M of shm (or 300M of anonymous memory if
you prefer) and 200M as filesystem cache? The application has a misc
I/O load that in some part will run out of the working set; what's
wrong with this?

What's ridiculous? Please elaborate.

To emulate that workload we only need to mmap(1.5G, MAP_PRIVATE or
MAP_SHARED), fault into it, and run bonnie.

Andrea
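A minimal sketch of that emulation: map 1.5G of a file and touch
every page so it is resident, then run bonnie from another shell
while the program holds the mapping. The file name and the hardcoded
page size are illustrative:

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

#define MAP_BYTES (1536UL << 20)        /* 1.5G */

int main(void)
{
        int fd = open("bigfile", O_RDWR);  /* assumed >= 1.5G file */
        char *p;
        unsigned long off;
        volatile char sink;

        if (fd < 0)
                return 1;
        p = mmap(NULL, MAP_BYTES, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
                return 1;
        for (off = 0; off < MAP_BYTES; off += 4096)
                sink = p[off];          /* fault every page in */
        pause();                        /* keep the mapping alive
                                         * while bonnie runs */
        return 0;
}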
Re: [patch] vmfixes-2.4.0-test9-B2
Peter Osterlund wrote:
> Btw, does anyone know how the seek time depends on the seek distance
> on modern disk hardware?

I know very little but I've seen it discussed before on this list.

For larger seeks, the head is accelerated and then decelerated to roughly the right track. That time is approximately the square root of the seek distance, which we can assume is approximately the difference in block number on the disk. After that, there is a settling time which is roughly constant for all seeks, though I wouldn't be surprised to find it's a bit smaller for really tiny seeks.

There are different rules when there are multiple independent heads, for RAID arrays etc. (just to remind).

enjoy,
-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
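Jamie's rule of thumb can be written down as a formula, seek_time(d) = a*sqrt(d) + settle; the coefficients in this sketch are invented purely for illustration:

    #include <math.h>
    #include <stdio.h>

    /* invented coefficients: time in ms, distance in "blocks" */
    static double seek_time_ms(double distance)
    {
        const double accel = 0.1;    /* ms per sqrt(block) of travel */
        const double settle = 2.0;   /* roughly constant settling time */

        if (distance == 0.0)
            return 0.0;
        return accel * sqrt(distance) + settle;
    }

    int main(void)                   /* link with -lm */
    {
        printf("seek(100)     = %.2f ms\n", seek_time_ms(100));
        printf("seek(10000)   = %.2f ms\n", seek_time_ms(10000));
        printf("seek(1000000) = %.2f ms\n", seek_time_ms(1000000));
        return 0;
    }

Note how sublinear this is: a seek 100 times longer costs only 10 times more travel time, which is why merging and ordering matter more than raw distance.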
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
Andrea Arcangeli <[EMAIL PROTECTED]> writes:

> Could you tell me what's wrong in having an app with a 1.5G mapped
> executable (or a tiny executable but with a 1.5G shared/private file
> mapping if you prefer),

O.K. that sounds more reasonable. I was reading image as program text... and a 1.5GB program text is something I have never seen (and hopefully will never see :-)

> 300M of shm (or 300M of anonymous memory if you prefer) and 200M of
> filesystem cache?

I don't really see a reason for fs cache in the application. I think that parallel applications tend to either share mostly all or nothing, but I may be wrong here.

> The application has a misc I/O load that in part runs outside the
> working set; what's wrong with this? What's ridiculous? Please elaborate.

I think we fixed this misreading.

But still IMHO you underestimate the importance of shared memory for a lot of applications in the high end. There is not only Oracle out there and most of the shared memory is _not_ locked.

Greetings
Christoph
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
On Tue, Sep 26, 2000 at 10:00:12PM +0200, Peter Osterlund wrote:
> Therefore, no matter what algorithm you use in elevator_linus() the
> total number of seeks should be the same.

It isn't. There's a big difference between the two algorithms, and all your previous emails were completely correct about the "theoretical" additional seeks during starvation avoidance.

> an even better way to sort the requests. I think the important thing
> is trying to minimize the total amount of seek time. But doing that requires

The current algorithm only tries to minimize the total amount of seek time. That's what the elevator reordering is there for.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
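For readers following the elevator subthread, this is roughly what sector-sorted (CSCAN-style) insertion looks like as a toy, ignoring the current head position and the wrap handling a real CSCAN needs; it is not the 2.4 elevator code:

    #include <stdio.h>

    struct request {
        unsigned long sector;
        struct request *next;
    };

    /* keep the queue in ascending sector order so the head sweeps in
     * one direction; a real CSCAN also wraps requests that fall
     * behind the head, omitted here */
    static void cscan_insert(struct request **head, struct request *rq)
    {
        while (*head && (*head)->sector < rq->sector)
            head = &(*head)->next;
        rq->next = *head;
        *head = rq;
    }

    int main(void)
    {
        struct request reqs[] = { {700, 0}, {100, 0}, {400, 0}, {0, 0} };
        struct request *q = 0;
        int i;

        for (i = 0; i < 4; i++)
            cscan_insert(&q, &reqs[i]);
        for (; q; q = q->next)
            printf("%lu ", q->sector);   /* prints: 0 100 400 700 */
        printf("\n");
        return 0;
    }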
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> > The machine will run low on memory as soon as I read 200 Mbyte from disk.

So? Yes, at that point we'll do the LRU dance. Then we won't be low on memory any more, and we won't do the LRU dance any more.

What's the magic in zoneinfo that makes it not have to do the same thing?

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Mon, Sep 25, 2000 at 03:30:10PM -0700, Linus Torvalds wrote:
> On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> >
> > I'm talking about the fact that if you have a file mmapped in 1.5G of RAM
> > test9 will waste time rolling between LRUs 384000 pages, while classzone
> > won't ever see 1 of those pages until you run low on fs cache.
>
> What drugs are you on? Nobody looks at the LRU's until the system is low
> on memory. Sure, there's some background activity, but what are you

The system is low on memory when you run `free` and you see a value smaller than freepages_high*PAGE_SIZE in the first row of the "free" column.

> talking about? It's only when you're low on memory that _either_ approach
> starts looking at the LRU list.

The machine will run low on memory as soon as I read 200 Mbyte from disk.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Tue, Sep 26, 2000 at 12:30:28AM +0200, Juan J. Quintela wrote:
> Which is completely wrong if the program uses _any not completely_
> unusual locality of reference. Think twice about that: it is more
> probable that you need more than 300MB of filesystem cache than that
> you have an application that references _randomly_ 1.5GB of data.
> You need to balance that _always_ :((

The application doesn't reference 1.5GB of data randomly. Assume there's a big executable, 2G large (and yes, I know there are), and I run it. After some hours its RSS is 1.5G. OK? So now this program also shmgets a 300 Mbyte shm segment. Now this program starts reading and writing terabytes of data that wouldn't fit in cache even if there were 300G of RAM (and this is possible too). Or maybe the program itself uses rawio, but then at a certain point you use the machine to run a tar somewhere.

Now tell me why this program needs more than 200 Mbyte of fs cache if the kernel doesn't waste time on the mapped pages (as in classzone).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 08:54:57PM +0100, Stephen C. Tweedie wrote:
> > basically the whole of memory is data cache, some of which is mapped
> > and some of which is not?
>
> As I said in the last email, aging on the cache is supposed to do that.
>
> Wasting CPU and increasing the complexity of the algorithm is a price
> that I won't pay just to get the information on when it's time
> to recall swap_out().

You must be joking. Page replacement should be tuned to do good page replacement, not just to be easy on the CPU. (though a heavily thrashing system /is/ easy on the cpu, I'll have to admit that)

> If the cache has no age it means I'd better throw it out instead
> of swapping/unmapping out stuff, simple?

Simple, yes. But completely BOGUS if you don't age the cache and the mapped pages at the same rate! If I age your pages twice as much as my pages, is it still only fair that your pages will be swapped out first? ;)

> > anything since last time. Anything that only ages per-pte, not
> > per-page, is simply going to die horribly under such load, and any
>
> The aging on the fs cache is done per-page.

And the same should be done for other pages as well. If you don't do that, you'll have big problems keeping page replacement balanced and making the system work well under various loads.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
	-- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
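The aging policy being argued over fits in a few lines. A toy version of up/down aging applied uniformly to every physical page, whether it backs a mapping or plain file cache (constants and names are illustrative, not Rik's actual code):

    #include <stdio.h>

    struct page {
        int age;
        int referenced;   /* stands in for the pte/page referenced bit */
    };

    #define PAGE_AGE_ADV 2    /* invented constants */
    #define PAGE_AGE_MAX 64

    static void age_page(struct page *p)
    {
        if (p->referenced) {
            p->referenced = 0;
            p->age += PAGE_AGE_ADV;          /* linear up */
            if (p->age > PAGE_AGE_MAX)
                p->age = PAGE_AGE_MAX;
        } else {
            p->age /= 2;                     /* exponential down */
        }
        /* age == 0 makes the page a reclaim candidate, mapped or
         * not: applying one function to all pages is the fairness */
    }

    int main(void)
    {
        struct page hot = { 0, 1 }, cold = { 8, 0 };

        age_page(&hot);
        age_page(&cold);
        printf("hot: %d, cold: %d\n", hot.age, cold.age);  /* 2, 4 */
        return 0;
    }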
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Mon, Sep 25, 2000 at 07:26:56PM -0300, Rik van Riel wrote:
> IMHO this is a minor issue because:

I don't think it's a minor issue. If you don't have a reschedule point in your equivalent of shrink_mmap, and this 1.5G happens to be consecutive in the LRU order (quite probable if it was paged in at a fast rate), then you may even hang in interruptible mode for seconds as soon as somebody starts reading from disk. 2.4.x has to scale to dozens of gigabytes of RAM, as there are archs supporting that amount of RAM.

> 2) you don't /want/ to run low on fs cache, you want

So I can't read more than what the fs cache can take? I must be allowed to do that (those 200 Mbyte of RAM can be more than enough if the server mainly generates pollution anyway).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
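The "reschedule point" Andrea is asking for is the standard 2.4-era idiom `if (current->need_resched) schedule();` inside long scan loops. A userspace analogue of the same idea, using the 384000-page figure from the thread:

    #include <sched.h>
    #include <stdio.h>

    #define NR_PAGES 384000UL   /* the 1.5G / 4k figure from the thread */

    int main(void)
    {
        unsigned long i, scanned = 0;

        for (i = 0; i < NR_PAGES; i++) {
            scanned++;                /* stand-in for "age one page" */
            if ((i & 1023) == 0)
                sched_yield();        /* the reschedule point */
        }
        printf("scanned %lu pages with latency breaks\n", scanned);
        return 0;
    }

Without the periodic yield, a scan of this length monopolizes the CPU for its entire duration, which is the multi-second hang being described.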
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Mon, Sep 25, 2000 at 08:54:57PM +0100, Stephen C. Tweedie wrote:
> OK, and here's another simple real life example. A 2GB RAM machine
> running something like Oracle with a hundred client processes all
> shm-mapping the same shared memory segment.

Oracle takes the SHM locked, and it will never run on a machine without enough memory.

> Oh, and you're also doing lots of file IO. How on earth do you decide
> what to swap and what to page out in this sort of scenario, where
> basically the whole of memory is data cache, some of which is mapped
> and some of which is not?

As I said in the last email, aging on the cache is supposed to do that.

Wasting CPU and increasing the complexity of the algorithm is a price that I won't pay just to get the information on when it's time to recall swap_out(). If the cache has no age it means I'd better throw it out instead of swapping/unmapping out stuff, simple?

> anything since last time. Anything that only ages per-pte, not
> per-page, is simply going to die horribly under such load, and any

The aging on the fs cache is done per-page. The per-pte issue happens when we have just taken the difficult decision (that it was time to swap out), and you have the same problem because you don't know the chain of ptes that point to the physical page (so you refresh the referenced bit more often). Once we have the chain of ptes pointing to the page, classzone will only need a real LRU for the mapped pages, to use instead of walking pagetables.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
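The page->pte reverse lookup mentioned here (what later grew into the rmap design) would look roughly like this as data structures; this is a hypothetical sketch, nothing that exists in 2.4.0-test9:

    #include <stdio.h>

    typedef unsigned long pte_t;
    #define PTE_REFERENCED 0x1UL

    struct pte_chain {                /* invented: one node per mapping */
        pte_t *pte;
        struct pte_chain *next;
    };

    struct page {
        struct pte_chain *chain;      /* all ptes mapping this page */
    };

    /* gather and clear the referenced bits in O(mappings), with no
     * pagetable walk at all */
    static int page_referenced(struct page *p)
    {
        struct pte_chain *c;
        int ref = 0;

        for (c = p->chain; c; c = c->next) {
            if (*c->pte & PTE_REFERENCED) {
                ref = 1;
                *c->pte &= ~PTE_REFERENCED;
            }
        }
        return ref;
    }

    int main(void)
    {
        pte_t p1 = PTE_REFERENCED, p2 = 0;
        struct pte_chain c2 = { &p2, 0 }, c1 = { &p1, &c2 };
        struct page pg = { &c1 };

        printf("referenced: %d\n", page_referenced(&pg));  /* 1 */
        printf("referenced: %d\n", page_referenced(&pg));  /* 0 */
        return 0;
    }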
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Tue, 26 Sep 2000, Andrea Arcangeli wrote: > > I'm talking about the fact that if you have a file mmapped in 1.5G of RAM > test9 will waste time rolling between LRUs 384000 pages, while classzone > won't ever see 1 of those pages until you run low on fs cache. What drugs are you on? Nobody looks at the LRU's until the system is low on memory. Sure, there's some background activity, but what are you talking about? It's only when you're low on memory that _either_ approach starts looking at the LRU list. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
> "andrea" == Andrea Arcangeli <[EMAIL PROTECTED]> writes: Hi andrea> I'm talking about the fact that if you have a file mmapped in 1.5G of RAM andrea> test9 will waste time rolling between LRUs 384000 pages, while classzone andrea> won't ever see 1 of those pages until you run low on fs cache. Which is completely wrong if the program uses _any not completely_ unusual locality of reference. Think twice about that, it is more probable that you need more that 300MB of filesystem cache that you have an aplication that references _randomly_ 1.5GB of data. You need to balance that _always_ :(( I think that there is no silver bullet here :( Later, Juan. -- In theory, practice and theory are the same, but in practice they are different -- Larry McVoy - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 04:26:17PM -0300, Rik van Riel wrote:
> > > > It doesn't --- that is part of the design. The vm scanner propagates
> > >
> > > And that's the inferior part of the design IMHO.
> >
> > Indeed, but physical page based aging is a definite
> > 2.5 thing ... ;(
>
> I'm talking about the fact that if you have a file mmapped in
> 1.5G of RAM test9 will waste time rolling between LRUs 384000
> pages, while classzone won't ever see 1 of those pages until you
> run low on fs cache.

IMHO this is a minor issue because:
1) you need to do page replacement with shared pages right
2) you don't /want/ to run low on fs cache, you want
   to have a good balance between the cache(s) and the processes

OTOH, if you have a way to keep fair page aging and fix the CPU time issue at the same time, I'd love to see it.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
	-- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Mon, Sep 25, 2000 at 04:26:17PM -0300, Rik van Riel wrote:
> > > It doesn't --- that is part of the design. The vm scanner propagates
> >
> > And that's the inferior part of the design IMHO.
>
> Indeed, but physical page based aging is a definite
> 2.5 thing ... ;(

I'm talking about the fact that if you have a file mmapped in 1.5G of RAM test9 will waste time rolling between LRUs 384000 pages, while classzone won't ever see 1 of those pages until you run low on fs cache.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25, 2000 at 11:28:55PM +0200, Jens Axboe wrote:
> 	q->plug_device_fn(q, ...);
> 	list_add(...)
> 	generic_unplug_device(q);
>
> would suffice in scsi_lib for now.

It looks sane to me.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25, 2000 at 10:52:08PM +0200, Peter Osterlund wrote: > Do you know why? Is it because the average seek distance becomes Good question. No I don't know why right now. I'll try again just to be 200% sure and I'll let you know the results. > smaller with your algorithm? (I later realized that request merging is > done before the elevator function kicks in, so your algorithm should Not sure what you mean. There are two cases: the bh is merged, or the bh will be queued in a new request because merging isn't possible. Your change deals only with the latter case and that should be pretty orthogonal to what the merging stuff does. Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > The scsi layer currently "manually" does a list_add on the queue itself,
> > which doesn't look too healthy.
>
> It's grabbing the io_request_lock so it looks healthy for now :)

It's safe alright, but if we want to do the generic_unplug_queue instead of just hitting the request_fn (which might do anything anyway), it would be nicer to expose this part of the block layer (i.e. have a general way of queueing a request to the request_queue). But I guess just

	q->plug_device_fn(q, ...);
	list_add(...)
	generic_unplug_device(q);

would suffice in scsi_lib for now.

--
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
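A toy model of the plugging mechanism under discussion, showing both the ordering Jens sketches and why sg's direct request_fn call was a no-op on a plugged queue; pure userspace illustration, not scsi_lib code:

    #include <stdio.h>

    #define MAXRQ 8

    struct queue {
        int plugged;
        int nrq;
        int rq[MAXRQ];        /* stand-in for the request list */
    };

    static void request_fn(struct queue *q)
    {
        if (q->plugged)
            return;           /* plugged: nothing happens */
        while (q->nrq)
            printf("servicing request %d\n", q->rq[--q->nrq]);
    }

    static void queue_request(struct queue *q, int rq)
    {
        q->plugged = 1;       /* plug_device_fn */
        q->rq[q->nrq++] = rq; /* list_add */
    }

    static void generic_unplug(struct queue *q)
    {
        q->plugged = 0;
        request_fn(q);        /* now the work really happens */
    }

    int main(void)
    {
        struct queue q = { 0, 0, {0} };

        queue_request(&q, 1);
        request_fn(&q);       /* sg's mistake: no-op while plugged */
        queue_request(&q, 2);
        generic_unplug(&q);   /* correct: unplugging services both */
        return 0;
    }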
Re: [patch] vmfixes-2.4.0-test9-B2
On Sun, 24 Sep 2000, Linus Torvalds wrote:

[directories in pagecache on ext2]

> > I'll do it and post the result tomorrow. I bet that there will be issues
> > I've overlooked (stuff that happens to work on UFS, but needs to be more
> > general for ext2), so it's going as "very alpha", but hey, it's pretty
> > straightforward, so there is a chance to debug it fast. Yes, famous last
> > words and all such...
>
> Sure.

All right, I think I've got something that may work. Yes, there were issues - UFS has a constant directory chunk size (1 sector), while ext2 makes it equal to the fs block size. _Bad_ idea, since sector writes are atomic and block ones are not... Oh well, so ext2 is slightly less robust. It required some changes; I'll do the initial testing and post the patch once it passes the trivial tests.

BTW, why on Earth did we do it that way? It has no noticeable effect on directory fragmentation, it makes the code (in both the page- and buffer-cache variants) more complex, and it's less robust (by definition - directory layout may be broken more easily)... What was the point? Not that we can do anything about it now (albeit as a ro-compat feature it would be nice), but I'm curious about the reasons...

Cheers,
Al
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
Andrea Arcangeli <[EMAIL PROTECTED]> writes: > The new elevator ordering algorithm returns me much better numbers > than the CSCAN one with tiobench. Do you know why? Is it because the average seek distance becomes smaller with your algorithm? (I later realized that request merging is done before the elevator function kicks in, so your algorithm should not produce more seeks than a CSCAN algorithm. Unfortunately I didn't realize this when I wrote my CSCAN patch.) Btw, does anyone know how the seek time depends on the seek distance on modern disk hardware? -- Peter Österlund Email: [EMAIL PROTECTED] Sköndalsvägen 35[EMAIL PROTECTED] S-128 66 Sköndal Home page: http://home1.swipnet.se/~w-15919 Sweden Phone: +46 8 942647 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
Hi, On Mon, Sep 25, 2000 at 09:32:42PM +0200, Andrea Arcangeli wrote: > Having shrink_mmap that browse the mapped page cache is useless > as having shrink_mmap browsing kernel memory and anonymous pages > as it does in 2.2.x as far I can tell. It's an algorithm > complexity problem and it will waste lots of CPU. It's a compromise between CPU cost and Getting It Right. Ignoring the mmap is not a good solution either. > Now think this simple real life example. A 2G RAM machine running an executable > image of 1.5G, 300M in shm and 200M in cache. OK, and here's another simple real life example. A 2GB RAM machine running something like Oracle with a hundred client processes all shm-mapping the same shared memory segment. Oh, and you're also doing lots of file IO. How on earth do you decide what to swap and what to page out in this sort of scenario, where basically the whole of memory is data cache, some of which is mapped and some of which is not? If you don't separate out the propagation of referenced bits from the actual page aging, then every time you pass over the whole VM working set, you're likely to find a handful of live references to some of the shared memory, and a hundred or so references that haven't done anything since last time. Anything that only ages per-pte, not per-page, is simply going to die horribly under such load, and any imbalance between pure filesystem cache and VM pressure will be magnified to the point where one dominates. Hence my observation that it's really easy to find special cases where certain optimisations make a ton of sense, but you often lose balance in the process. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 07:06:57PM +0100, Stephen C. Tweedie wrote:
> > Good. One of the problems we always had in the past, though, was that
> > getting the relative aging of cache vs. vmas was easy if you had a
> > small set of test loads, but it was really, really hard to find a
> > balance that didn't show pathological behaviour in the worst cases.
>
> Yep, that's not trivial.

It is. Just do physical-page based aging (so you age all the pages in the system the same) and the problem is solved.

> > > I may be overlooking something but where do you notice when a page
> > > gets unmapped from the last mapping and put it back into a place
> > > that can be reached from shrink_mmap (or whatever the cache recycler is)?
> >
> > It doesn't --- that is part of the design. The vm scanner propagates
>
> And that's the inferior part of the design IMHO.

Indeed, but physical page based aging is a definite 2.5 thing ... ;(

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
	-- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
Hi,

On Mon, Sep 25, 2000 at 07:03:47PM +0200, Andrea Arcangeli wrote:
>
> > This really seems to be the biggest difference between the two
> > approaches right now. The FreeBSD folks believe fervently that one of
> [ aging cache and mapped pages in the same cycle ]
>
> Right.
>
> And since you move the page into the active list only once you reach it from
> the cache recycler and you find it with page->age != 0, you also spend time
> putting those pages back and forth between those LRU lists, while in my
> approach the mapped pages are never seen by the cache recycler and no cycle
> is spent on them. This means in a pure fs read test with cache pollution
> going on, there's _no_way_ that classzone touches or notices _any_ mapped
> page in its path.

The "age==0" pages are basically just "pages we are ready to get rid of right away". The alternative to having that inactive list is to do what we do today --- which is to throw away the pages immediately. Having that extra list simply gives pages a last chance before evicting them. It allows us to run reliably with fewer physically free pages --- we can reap inactive pages with no IO, so those pages are as good as free for most purposes.

The alternative to moving pages to the inactive list would be freeing them completely. Moving a page back to the active list from inactive is equivalent to avoiding a disk IO to pull the page in from backing store. It's supposed to be an optimisation to save physically freeing things unless we really, really need to. It is _not_ a transition which recently referenced pages encounter.

> > the main reasons that their VM rocks is that it ages cache pages and
> > mapped pages at the same rate. Having both on the same aging list
> > achieves that. Separating the two raises the question of how to
> > balance the aging of cache vs. swap in a fair manner.
>
> I believe increasing the aging in the unmapped cache should take care of that
> fine. (it was working pretty much fine also with only 1 bit of most
> frequently used aging plus the LRU order of the list)

Good. One of the problems we always had in the past, though, was that getting the relative aging of cache vs. vmas was easy if you had a small set of test loads, but it was really, really hard to find a balance that didn't show pathological behaviour in the worst cases.

> > > In classzone the aging exists too but it's _completely_ orthogonal to how
> > > the rest of the VM works.
> >
> > Umm, that applies to Rik's stuff too!
>
> I may be overlooking something but where do you notice when a page
> gets unmapped from the last mapping and put it back into a place
> that can be reached from shrink_mmap (or whatever the cache recycler is)?

It doesn't --- that is part of the design. The vm scanner propagates referenced bits to the struct page, so the new shrink_mmap can do its aging based on whether a page has been referenced at all recently, not caring whether the reference was a VM reference or a page cache reference. That is done specifically to address the balance issue between VM and filesystem memory pressure.

> Since no mapped page can in any way be freed by the cache recycler
> (you need to unmap it first from swap_out at the moment), if you
> reach those pages from the cache recycler someway it means you're
> wasting CPU (I couldn't reach any mapped page from the cache recycler
> in classzone and in fact the mapped pages weren't linked in any LRU
> at all to save even more CPU).

That's not how the current VM is supposed to work.
The cache scanner isn't meant to reclaim pages --- it is meant to update the age information on pages, which is not quite the same job. If it finds pages whose age becomes zero, those are shifted to the inactive list, and once that list is large enough (ie. we have enough freeable pages), it can give up. The inactive list then gets physically freed on demand.

The fact that we have a common loop in the VM for updating all age information is central to the design, and requires the cache recycler to pass over all those pages. By doing it that way, rather than from the VM scan, we can avoid one of the really bad properties of the old 2.0 aging code --- it means that for shared pages, we only do the aging once per walk over the pages regardless of how many ptes refer to the page. This avoids the nasty worst-case behaviour of having a recently-referenced page thrown out of memory just because there also happened to be a lot of old, unused references to it too.

Cheers,
Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
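Stephen's description of the scanner boils down to something like the following toy: age in place, demote age==0 pages to an inactive list, and reclaim from there without I/O. A minimal sketch with invented constants, not the test9 implementation:

    #include <stdio.h>

    struct page {
        int age;
        int referenced;
        struct page *next;
    };

    static struct page *active, *inactive;

    static void refill_inactive_scan(void)
    {
        struct page **pp = &active;

        while (*pp) {
            struct page *p = *pp;

            if (p->referenced) {       /* VM or cache hit, same thing */
                p->referenced = 0;
                p->age += 2;           /* age up */
            } else {
                p->age /= 2;           /* age down */
            }
            if (p->age == 0) {         /* demote: cheap to reclaim now */
                *pp = p->next;
                p->next = inactive;
                inactive = p;
            } else {
                pp = &p->next;
            }
        }
    }

    /* reclaim needs no I/O: just pop from the inactive list */
    static struct page *reclaim_page(void)
    {
        struct page *p = inactive;
        if (p)
            inactive = p->next;
        return p;
    }

    int main(void)
    {
        struct page a = { 1, 0, 0 }, b = { 4, 1, &a };

        active = &b;
        refill_inactive_scan();        /* a ages to 0 and is demoted */
        printf("reclaimed %p (a is %p)\n",
               (void *)reclaim_page(), (void *)&a);
        return 0;
    }

Note that a shared page is aged once per pass here, however many ptes map it, which is exactly the property Stephen contrasts with the old 2.0 per-reference aging.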
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Mon, Sep 25, 2000 at 07:21:48PM +0200, bert hubert wrote:
> Ok, sorry. Kernel development is proceeding at a furious pace and I sometimes
> lose track.

No problem :).

> I seem to remember that people were impressed by classzone, but that the
> implementation was very non-trivial and hard to grok. One of the reasons

Yes. Classzone is certainly more complex.

> There is no such thing as 'under swap'. There are lots of load patterns that
> will generate different kinds of memory pressure. Just calling it 'under
> swap' gives entirely the wrong impression.

Sorry for not being precise. I meant one of those load patterns.

> 'rivaling virtual memory' code. Energy spent on Rik's VM will yield a far
> higher differential improvement.

I've spent effort on classzone as well, and since I think it's a way superior approach I'll at least port it on top of 2.4.0-test9 as soon as time permits, to generate some numbers.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
> We're talking about the shrink_[id]cache_memory change. That has _nothing_
> to do with the VM changes that happened anywhere between test8 and
> test9-pre6.
>
> You were talking about a different thing.

Ok, sorry. Kernel development is proceeding at a furious pace and I sometimes lose track.

> I consider the current approach the wrong way to go and for this reason I
> prefer to spend time porting/improving classzone.

I seem to remember that people were impressed by classzone, but that the implementation was very non-trivial and hard to grok. One of the reasons Rik's vm made it (so far) is that it is pretty straightforward, with all the marks of the right amount of simplicity.

> In the meantime if you want to go back to 2.4.0-test1-ac22-class++, give
> it a try under swap to see the difference in behaviour and compare
> (Mike said it's still an order of magnitude faster with his "make -j30
> bzImage" testcase and he's always very reliable in his reports).

There is no such thing as 'under swap'. There are lots of load patterns that will generate different kinds of memory pressure. Just calling it 'under swap' gives entirely the wrong impression.

Although Mike's compile is a relevant benchmark, every VM has cases for which it excels and cases for which it sucks. This appears to be a general property of VM design. Given knowledge of the algorithms used, you can always dream up a situation where it will fail. It's a bit like writing the halting problem algorithm. The same goes the other way around: every VM will have a 'shining benchmark' - hence the invention of benchmarketing.

We used to have a bad virtual memory implementation that was sometimes well tuned, so lots of ordinary cases showed acceptable performance. We now have an elegant VM that works reasonably well, but needs more tweaking.

What is the point of all this ranting? Think twice before embarking on 'rivaling virtual memory' code. Energy spent on Rik's VM will yield a far higher differential improvement.

Regards,

bert hubert
--
PowerDNS                     Versatile DNS Services
Trilab                       The Technology People
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25, 2000 at 07:05:02PM +0200, Ingo Molnar wrote:
> yep - and Jens i'm sorry about the outburst. Until a bug is found it's
> unrealistic to blame anything.

I think the only bug possibly to blame in the elevator is the EXCLUSIVE wakeup thing (and I've not benchmarked it alone to see if it makes any real world performance difference, but for sure its behaviour wasn't intentional). Anything else related to the elevator internals should perform better than the old elevator (aka the 2.2.15 one). The new elevator ordering algorithm returns me much better numbers than the CSCAN one with tiobench.

Also consider that the latency control is at the moment completely disabled by default, so there are no barriers unless you change that with elvtune. Also I'm using -r 250 and -w 500 and it doesn't really change anything in the numbers compared to very big values (but it fixes the starvation problem).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, 25 Sep 2000, Linus Torvalds wrote: > Blaming the elevator is unfair and unrealistic. [...] yep - and Jens i'm sorry about the outburst. Until a bug is found it's unrealistic to blame anything. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Mon, Sep 25, 2000 at 05:24:42PM +0100, Stephen C. Tweedie wrote:
> Your other recent complaint, that newly-swapped pages end up on the
> wrong end of the LRU lists and can't be reclaimed without cycling the
> rest of the pages in shrink_mmap, is also cured in Rik's code, by
> placing pages which are queued for swapout on a different list
> altogether. I thought we had managed to agree in Ottawa that such a
> cure for the old 2.4 VM was desirable.

Yes, I've seen it and the fix looks OK. It's the deactivate_page call when we swap out the anonymous page. I overlooked it at first, I apologise.

> > The mapped pages were never seen by anything except swap_out, if they were
> > mapped (it's not "if page->age then move into the active list"; with
> > classzone the page was _just_ in the active list in the first place since
> > it was mapped).
>
> This really seems to be the biggest difference between the two
> approaches right now. The FreeBSD folks believe fervently that one of

Right.

And since you move the page into the active list only once you reach it from the cache recycler and you find it with page->age != 0, you also spend time putting those pages back and forth between those LRU lists, while in my approach the mapped pages are never seen by the cache recycler and no cycle is spent on them. This means in a pure fs read test with cache pollution going on, there's _no_way_ that classzone touches or notices _any_ mapped page in its path. I think you can't be faster than classzone here. When the cache isn't polluted, adding some more bits of aging will let me know better when it's time to unmap/swapout stuff. (it just works this way, but with literally only 1 bit of aging at the moment)

> the main reasons that their VM rocks is that it ages cache pages and
> mapped pages at the same rate. Having both on the same aging list
> achieves that. Separating the two raises the question of how to
> balance the aging of cache vs. swap in a fair manner.

I believe increasing the aging in the unmapped cache should take care of that fine. (it was working pretty much fine also with only 1 bit of most frequently used aging plus the LRU order of the list)

> > In classzone the aging exists too but it's _completely_ orthogonal to how
> > the rest of the VM works.
>
> Umm, that applies to Rik's stuff too!

I may be overlooking something but where do you notice when a page gets unmapped from the last mapping and put it back into a place that can be reached from shrink_mmap (or whatever the cache recycler is)?

Since no mapped page can in any way be freed by the cache recycler (you need to unmap it first from swap_out at the moment), if you reach those pages from the cache recycler someway it means you're wasting CPU (I couldn't reach any mapped page from the cache recycler in classzone and in fact the mapped pages weren't linked in any LRU at all to save even more CPU).

> Good, the best theoretical VM in the world can fall apart instantly on
> contact with the real world. :-)

:))

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
Hi,

On Mon, Sep 25, 2000 at 01:41:37AM +0200, Andrea Arcangeli wrote:
>
> Since you're talking about this I'll soon (as soon as I finish some other
> thing that is just work in progress) release a classzone against the latest
> 2.4.x. My approach is _quite_ different from the current VM. The current
> approach is very imperfect and based solely on aging, whereas classzone had
> hooks into the pagefault paths and all other map/unmap points to have
> perfect accounting of the amount of active/inactive stuff.

Andrea, I'm not quite sure what you're saying here. Could you be a bit more specific? The current VM _does_ track the amount of active/inactive stuff. It does so by keeping separate lists of active and inactive stuff. Accounting of memory pressure on these different lists is used to generate dynamic targets for how many pages we aim to have on those lists, so aging/reclaim activity is tuned to the current memory load.

Your other recent complaint, that newly-swapped pages end up on the wrong end of the LRU lists and can't be reclaimed without cycling the rest of the pages in shrink_mmap, is also cured in Rik's code, by placing pages which are queued for swapout on a different list altogether. I thought we had managed to agree in Ottawa that such a cure for the old 2.4 VM was desirable.

> The mapped pages were never seen by anything except swap_out, if they were
> mapped (it's not "if page->age then move into the active list"; with
> classzone the page was _just_ in the active list in the first place since
> it was mapped).

This really seems to be the biggest difference between the two approaches right now. The FreeBSD folks believe fervently that one of the main reasons that their VM rocks is that it ages cache pages and mapped pages at the same rate. Having both on the same aging list achieves that. Separating the two raises the question of how to balance the aging of cache vs. swap in a fair manner.

> In classzone the aging exists too but it's _completely_ orthogonal to how
> the rest of the VM works.

Umm, that applies to Rik's stuff too!

> This is my humble opinion at least. I may be wrong. I'll let you know
> once I have a patch I'm happy with and some real life numbers to prove my
> theory.

Good, the best theoretical VM in the world can fall apart instantly on contact with the real world. :-)

Cheers,
Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Mon, Sep 25, 2000 at 06:20:40PM +0200, Ingo Molnar wrote:
> i only suggested this as a debugging helper, instead of the suggested

I don't think removing the superlock from all filesystems is a good thing at this stage (I agree with SCT that doing it only for ext2 [that's what we mostly care about] would be possible). Who cares if UFS grabs the super lock or not? `grep lock_super fs/ext2/*.c` is enough and we don't need debugging in the scheduler for that.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Mon, 25 Sep 2000, Alexander Viro wrote: > > i'd suggest to simply BUG() in schedule() if the superblock lock is held > > not directly by lock_super. Holding the superblock lock is IMO quite rude > > anyway (for performance and latency) - is there any place where we hold it > > for a long time and it's unavoidable? > > Ingo, schedule() has no bloody business _knowing_ about superblock > locks in the first place. Yes, ext2 should not bother taking it at > all. For completely unrelated reasons. i only suggested this as a debugging helper, instead of the suggested ext2_getblk() BUG() helper. Obviously schedule() has no business knowing about filesystem locks. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Mon, 25 Sep 2000, Ingo Molnar wrote: > > On Mon, 25 Sep 2000, Stephen C. Tweedie wrote: > > > Sorry, but in this case you have got a lot more variables than you > > seem to think. The obvious lock is the ext2 superblock lock, but > > there are side cases with quota and O_SYNC which are much less > > commonly triggered. That's not even starting to consider the other > > dozens of filesystems in the kernel which have to be audited if we > > change the locking requirements for GFP calls. > > i'd suggest to simply BUG() in schedule() if the superblock lock is held > not directly by lock_super. Holding the superblock lock is IMO quite rude > anyway (for performance and latency) - is there any place where we hold it > for a long time and it's unavoidable? Ingo, schedule() has no bloody business _knowing_ about superblock locks in the first place. Yes, ext2 should not bother taking it at all. For completely unrelated reasons. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
On Mon, 25 Sep 2000, Stephen C. Tweedie wrote: > Sorry, but in this case you have got a lot more variables than you > seem to think. The obvious lock is the ext2 superblock lock, but > there are side cases with quota and O_SYNC which are much less > commonly triggered. That's not even starting to consider the other > dozens of filesystems in the kernel which have to be audited if we > change the locking requirements for GFP calls. i'd suggest to simply BUG() in schedule() if the superblock lock is held not directly by lock_super. Holding the superblock lock is IMO quite rude anyway (for performance and latency) - is there any place where we hold it for a long time and it's unavoidable? Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
Hi, On Mon, Sep 25, 2000 at 12:36:50AM +0200, bert hubert wrote: > On Mon, Sep 25, 2000 at 12:13:42AM +0200, Andrea Arcangeli wrote: > > On Sun, Sep 24, 2000 at 10:43:03PM +0100, Stephen C. Tweedie wrote: > > > any form of serialisation on the quota file). This feels like rather > > > a lot of new and interesting deadlocks to be introducing so late in > > > 2.4. :-) > > True. But they also appear to be found and solved at an impressive rate. > These deadlocks are fatal and don't hide in corners, whereas the previous mm > problems used to be very hard to spot and fix, there not being real > showstoppers, except for abysmal performance. [1] Sorry, but in this case you have got a lot more variables than you seem to think. The obvious lock is the ext2 superblock lock, but there are side cases with quota and O_SYNC which are much less commonly triggered. That's not even starting to consider the other dozens of filesystems in the kernel which have to be audited if we change the locking requirements for GFP calls. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
On Sun, Sep 24, 2000 at 11:39:13PM -0300, Marcelo Tosatti wrote:
> - Change kmem_cache_shrink to return the number of freed pages.

I did that too, extending a patch from Mark. I also removed the first_not_full ugliness, providing LIFO behaviour for the completely freed slabs (so kmem_cache_reap removes the oldest completely unused slabs from the queue, not the most recently used ones with potentially live cache in the CPU).

> There was a comment on the shrink functions about making
> kmem_cache_shrink() work on a GFP_DMA/GFP_HIGHMEM basis to free only the
> wanted pages by the current allocation.

This is meaningless at the moment because it can't be addressed without classzone logic in the allocator (classzone means that the allocator will pass to the memory balancing code the information about _which_ classzone you have to allocate memory from, so you won't waste time synchronously balancing unrelated zones).

My patch is here (it isn't going to apply cleanly due to the test9 changes in do_try_to_free_pages, but porting is trivial). It was tested and it was working for me.

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test7/slab-1

BTW, here there's a fix for a longstanding SMP race (since swap_out and msync don't run with the big lock) that can corrupt memory:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test5/msync-smp-race-1

Here the fix for another SMP race in establish_pte:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test5/tlb-flush-smp-race-1

The fix for this last bit is ugly but it's safe, because Manfred said s390 has a flush_tlb_page that atomically flushes and makes the pte invalid (a cleaner fix means moving part of establish_pte into the arch inlines).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
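The LIFO point about fully-free slabs, in miniature: newly freed slabs stay on the hot end for reuse, while the reaper eats from the cold end. An illustrative sketch only, not the slab allocator:

    #include <stdio.h>

    #define RING 8

    static int ring[RING];    /* fully-free slabs, newest at the head */
    static int head, count;

    static void slab_became_free(int id)    /* push on the hot end */
    {
        ring[head] = id;
        head = (head + 1) % RING;
        count++;
    }

    static int slab_reuse(void)             /* allocation: cache-hot */
    {
        if (!count)
            return -1;
        head = (head + RING - 1) % RING;
        count--;
        return ring[head];
    }

    static int kmem_cache_reap_one(void)    /* reaper: cache-cold */
    {
        int tail;
        if (!count)
            return -1;
        tail = (head + RING - count) % RING;
        count--;
        return ring[tail];
    }

    int main(void)
    {
        slab_became_free(1);                     /* freed long ago */
        slab_became_free(2);
        slab_became_free(3);                     /* freed just now */
        printf("reuse -> %d (hot)\n", slab_reuse());           /* 3 */
        printf("reap  -> %d (cold)\n", kmem_cache_reap_one()); /* 1 */
        return 0;
    }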
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25, 2000 at 04:53:05PM +0200, Ingo Molnar wrote: > sorry - i said it was *noticed* by Dimitris. (and sent to l-k IIRC) I didn't know. Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, 25 Sep 2000, Andrea Arcangeli wrote: > > the EXCLUSIVE thing was noticed by Dimitris i think, and it makes tons of > > Actually I'm the one who introduced the EXCLUSIVE thing there and I audited sorry - i said it was *noticed* by Dimitris. (and sent to l-k IIRC) Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25, 2000 at 04:29:42PM +0200, Ingo Molnar wrote:
> There is no guarantee at all that the reader will win. If reads and writes
> racing for request slots ever becomes a problem then we should introduce a
> separate read and write waitqueue.

I agree. However here I also have an in-flight per-queue limit on locked stuff (otherwise with 512k-sized requests on SCSI I could lock 128 Mbyte of RAM in a few seconds, and I don't want to decrease the size of the queue because it has to be large for aggressive reordering when the requests are 4k each). This in-flight per-queue limit is actually a non-exclusive wakeup, and it triggers more often than the request shortage (because most of the time writes are consecutive), so having two waitqueues with the reads registering themselves in both shouldn't be a very significant improvement at the moment (I should first care about a wake-one in-flight-limit-per-queue wakeup :).

> the EXCLUSIVE thing was noticed by Dimitris i think, and it makes tons of

Actually I'm the one who introduced the EXCLUSIVE thing there, and I audited _all_ the device drivers to check they do 1 wakeup for each request they release, before sending it off to Linus. But I never thought (until some days ago) about the fact that if a read completes a reserved request, a write won't be able to accept it. So long term we'll do two wake-one queues with reads registered in both.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
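The split-waitqueue design sketched here (one read queue, one write queue, readers registered in both) can be modelled as a simple wakeup policy; this toy is an assumption-laden illustration, not ll_rw_blk code:

    #include <stdio.h>

    enum { READ, WRITE };

    static int read_waiters, write_waiters;

    /* a freed request slot wakes exactly one waiter; prefer the
     * direction that completed, and let readers (registered in both
     * queues) also pick up write completions nobody else wants */
    static void request_freed(int dir)
    {
        if (dir == READ && read_waiters) {
            read_waiters--;
            printf("woke a reader\n");
        } else if (write_waiters) {
            write_waiters--;
            printf("woke a writer\n");
        } else if (read_waiters) {
            read_waiters--;
            printf("woke a reader (via the write queue)\n");
        }
    }

    int main(void)
    {
        read_waiters = 1;
        write_waiters = 1;
        request_freed(READ);    /* reader gets the read completion */
        request_freed(WRITE);   /* writer gets the write completion */
        request_freed(WRITE);   /* nobody left: no wakeup */
        return 0;
    }

Contrast this with a single exclusive queue, where the read completion in the first call could just as easily have woken the writer.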
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25, 2000 at 04:18:54PM +0200, Jens Axboe wrote: > The scsi layer currently "manually" does a list_add on the queue itself, > which doesn't look too healthy. It's grabbing the io_request_lock so it looks healthy for now :) Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > The sg problem was different. When sg queues a request, it invokes the
> > request_fn to handle it. But if the queue is currently plugged, the
> > scsi_request_fn will not do anything.
>
> That will explain it, yes. In the same way, for correctness, those should
> also be converted from request_fn to generic_unplug_device, right? (this

Yes, that would be the right fix. However, then we also need some way of inserting requests in the queue and letting it plug when appropriate. The scsi layer currently "manually" does a list_add on the queue itself, which doesn't look too healthy.

> will also avoid recalling the request_fn spuriously, because the device
> stays in the tq_disk queue even when the I/O generated by the request_fn
> below has completed)
>
> 	if (major >= COMPAQ_SMART2_MAJOR+0 && major <= COMPAQ_SMART2_MAJOR+7)
> 		(q->request_fn)(q);
> 	if (major >= DAC960_MAJOR+0 && major <= DAC960_MAJOR+7)
> 		(q->request_fn)(q);

AFAIR, Eric tried to talk to the Compaq folks (and Leonard too, I dunno) about why they want this. What came of it, I don't know.

--
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25, 2000 at 04:11:34PM +0200, Jens Axboe wrote:
> Interesting. I haven't done any serious benching with the CSCAN introduction
> in elevator_linus, I'll try that too.

Changing only that, the performance decreased reproducibly from 16 to 14 Mbyte/sec in the read test with 2 threads. So far I'm testing only IDE, with LVM striping across two equally fast disks on separate IDE channels.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25, 2000 at 04:08:38PM +0200, Jens Axboe wrote:
> The sg problem was different. When sg queues a request, it invokes the
> request_fn to handle it. But if the queue is currently plugged, the
> scsi_request_fn will not do anything.

That will explain it, yes. In the same way, for correctness, those should also be converted from request_fn to generic_unplug_device, right? (this will also avoid recalling the request_fn spuriously, because the device stays in the tq_disk queue even when the I/O generated by the request_fn below has completed)

	if (major >= COMPAQ_SMART2_MAJOR+0 && major <= COMPAQ_SMART2_MAJOR+7)
		(q->request_fn)(q);
	if (major >= DAC960_MAJOR+0 && major <= DAC960_MAJOR+7)
		(q->request_fn)(q);

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> driver (and I very much hope that with EXCLUSIVE gone away and the
> wait_on_* fixed those hangs will go away because I don't see anything else
> wrong at this moment).

the EXCLUSIVE thing only optimizes the wakeup, it's not semantic! How is it better to let 100 processes race for one freed-up request slot? There is no guarantee at all that the reader will win. If reads and writes racing for request slots ever becomes a problem then we should introduce a separate read and write waitqueue.

the EXCLUSIVE thing was noticed by Dimitris i think, and it makes tons of (performance) sense.

	Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > And a new elevator was introduced some months ago to solve this.
>
> And now that I've done some benchmarks, it seems the major optimization
> comes from the implementation of the new _ordering_ algorithm in test2,
> not really from the removal of the more fine-grained latency control (that
> said, I'm not going to reintroduce the previous latency control; the
> current one doesn't provide great latency but it's ok).

Yes, I found this the greatest improvement too.

> As soon as I patch my tree with Peter's perfect CSCAN ordering (which only
> changes the ordering algorithm), tiotest performance drops significantly
> in the 2-thread read case. elvtune settings don't matter; it's purely a
> matter of the ordering.

Interesting. I haven't done any serious benching with the CSCAN introduction
in elevator_linus, I'll try that too.

--
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > I had yesterday - those were simple VM deadlocks. I don't see any deadlocks
>
> Definitely. They can't explain anything about the VM deadlocks. I was
> _only_ talking about the blkdev hangs that caused you to unplug the queue
> at each reschedule in tux and that Eric reported to me for the SG driver
> (and I very much hope that with EXCLUSIVE gone away and the wait_on_*
> fixed those hangs will go away, because I don't see anything else wrong at
> this moment).

The sg problem was different. When sg queues a request, it invokes the
request_fn to handle it. But if the queue is currently plugged, the
scsi_request_fn will not do anything.

--
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25, 2000 at 03:49:52PM +0200, Jens Axboe wrote:
> And a new elevator was introduced some months ago to solve this.

And now that I've done some benchmarks, it seems the major optimization
comes from the implementation of the new _ordering_ algorithm in test2, not
really from the removal of the more fine-grained latency control (that said,
I'm not going to reintroduce the previous latency control; the current one
doesn't provide great latency but it's ok).

As soon as I patch my tree with Peter's perfect CSCAN ordering (which only
changes the ordering algorithm), tiotest performance drops significantly in
the 2-thread read case. elvtune settings don't matter; it's purely a matter
of the ordering.

Andrea
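Peter's CSCAN patch itself isn't quoted in the thread, but the difference in
ordering policy is easy to model. The following toy program (hypothetical
sector numbers, no relation to the real elevator_linus data structures)
shows the service order produced by a classic two-way elevator (SCAN) versus
one-way CSCAN from the same head position:

	/* Toy comparison of SCAN vs CSCAN request ordering (illustration
	 * only; the real elevator sorts requests at insertion time). */
	#include <stdio.h>
	#include <stdlib.h>

	static int cmp_long(const void *a, const void *b)
	{
		long x = *(const long *)a, y = *(const long *)b;
		return (x > y) - (x < y);
	}

	/* SCAN: sweep up from the head, then sweep back down. */
	static void scan_order(long head, const long *sec, int n)
	{
		printf("SCAN  from %ld:", head);
		for (int i = 0; i < n; i++)
			if (sec[i] >= head)
				printf(" %ld", sec[i]);
		for (int i = n - 1; i >= 0; i--)
			if (sec[i] < head)
				printf(" %ld", sec[i]);
		printf("\n");
	}

	/* CSCAN: sweep up from the head, then wrap to the lowest sector
	 * and continue sweeping up -- one direction only. */
	static void cscan_order(long head, const long *sec, int n)
	{
		printf("CSCAN from %ld:", head);
		for (int i = 0; i < n; i++)
			if (sec[i] >= head)
				printf(" %ld", sec[i]);
		for (int i = 0; i < n; i++)
			if (sec[i] < head)
				printf(" %ld", sec[i]);
		printf("\n");
	}

	int main(void)
	{
		long pending[] = { 40, 150, 520, 610, 780, 900 };
		int n = sizeof(pending) / sizeof(pending[0]);

		qsort(pending, n, sizeof(long), cmp_long);
		scan_order(500, pending, n);	/* 520 610 780 900 150 40 */
		cscan_order(500, pending, n);	/* 520 610 780 900 40 150 */
		return 0;
	}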
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25 2000, Ingo Molnar wrote:
> > The changes made were never half-done. The recent bug fixes have mainly
> > been to remove cruft from the earlier elevator and to fix a bug where
> > the elevator insert would screw up a bit. So I'd call that fine tuning
> > or adjusting, not fixing half-done stuff.
>
> Sorry, I did not mean to offend you - unadjusted and unfixed stuff hanging
> around in the kernel for months is 'half done' for me.

No offense taken, I just tried to explain my view. And in light of the bad
test2, I'd like the new changes to not have any "issues". So this work has
been going on for the last month or so, and I think we are finally reaching
agreement on what needs to be done now and how. WIP.

> > And a new elevator was introduced some months ago to solve this.
>
> And these are still not solved in the vanilla kernel, as recent complaints
> on l-k prove.

Different problems, though :(. However, I believe they are solved in
Andrea's and my current tree. It just needs the final cleaning; more later.

--
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> I was _only_ talking about the blkdev hangs [...]

I guess this was just miscommunication. It never 'hung', it just performed
reads at 20k/sec or so (without any writes being done in the background). A
'hang' for me is a deadlock or lockup, not a slowdown.

> that caused you to unplug the queue at each reschedule in tux and that
> Eric reported to me for the SG driver (and I very much hope that with
> EXCLUSIVE gone away and the wait_on_* fixed those hangs will go away,
> because I don't see anything else wrong at this moment).

Okay, I'll test this.

Ingo
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25, 2000 at 03:57:31PM +0200, Ingo Molnar wrote:
> I had yesterday - those were simple VM deadlocks. I don't see any deadlocks

Definitely. They can't explain anything about the VM deadlocks. I was _only_
talking about the blkdev hangs that caused you to unplug the queue at each
reschedule in tux and that Eric reported to me for the SG driver (and I very
much hope that with EXCLUSIVE gone away and the wait_on_* fixed those hangs
will go away, because I don't see anything else wrong at this moment).

> But one of these two fixes could explain the slowdown I saw on and off for
> quite some time, seeing very bad read performance occasionally. (Do you
> remember my sched.c tq_disk hack?)

Exactly, that's the only thing I was talking about in this subthread.

Andrea
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, 25 Sep 2000, Jens Axboe wrote:
> The changes made were never half-done. The recent bug fixes have mainly
> been to remove cruft from the earlier elevator and to fix a bug where the
> elevator insert would screw up a bit. So I'd call that fine tuning or
> adjusting, not fixing half-done stuff.

Sorry, I did not mean to offend you - unadjusted and unfixed stuff hanging
around in the kernel for months is 'half done' for me.

> > the first reports about bad write performance came right after the
> > original elevator patches went in, about 6 months ago.
>
> And a new elevator was introduced some months ago to solve this.

And these are still not solved in the vanilla kernel, as recent complaints
on l-k prove.

Ingo
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> -	sync_page(page);
> 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> +	sync_page(page);

> -	run_task_queue(&tq_disk);
> 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> +	run_task_queue(&tq_disk);

These look like genuine fixes, but I don't think they can explain the hangs
I had yesterday - those were simple VM deadlocks. I don't see any deadlocks
today - but I'm running the unsafe B2 variant of the vmfixes patch (and I
have no swapping enabled, which simplifies my VM setup).

But one of these two fixes could explain the slowdown I saw on and off for
quite some time, seeing very bad read performance occasionally. (Do you
remember my sched.c tq_disk hack?)

Ingo
Re: [patch] vmfixes-2.4.0-test9-B2
Hi,

On Mon, Sep 25, 2000 at 04:02:30AM +0200, Andrea Arcangeli wrote:
> On Sun, Sep 24, 2000 at 09:27:39PM -0400, Alexander Viro wrote:
> > So help testing the patches to them. Arrgh...
>
> I think I'd better fix the bugs that I know about before testing patches
> that try to remove the superblock_lock at this stage.

Right. If we're introducing new deadlock possibilities, then sure, we can
fix the obvious cases in ext2, but it will be next to impossible to do a
thorough audit of all of the other filesystems. Adding the new shrink_icache
loop into the VFS just feels too dangerous right now.

Of course, that doesn't mean we shouldn't remove the excessive superblock
locking from ext2 --- rather, it is simply more robust to keep the two
issues separate.

--Stephen
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25, 2000 at 03:10:51PM +0200, Ingo Molnar wrote:
> yep. But i dont understand why this makes any difference - the waitqueue

It makes a difference because your sleeping reads won't get the wakeup even
while they could queue their reserved read request (they have to wait for
the FIFO to roll or for some write to complete).

> wakeup is FIFO, so any other request will eventually arrive. Could you
> explain this bug a bit better?

Well, it may not explain an infinite hang, because as you say the write that
got the spurious wakeup will unplug the queue, and after some time the reads
will be woken up. So maybe that wasn't the reason for your hangs, because I
remember your problem looked more like an infinite hang that was only solved
by kflushd writing some more stuff and unplugging the queue as a side effect
(however I'm not sure, since I never experienced those myself). But if it
wasn't that one, I hope it's the fix below that will help:

Index: mm/filemap.c
===================================================================
RCS file: /home/andrea/cvs/linux/mm/filemap.c,v
retrieving revision 1.1.1.5.2.3
retrieving revision 1.1.1.5.2.4
diff -u -r1.1.1.5.2.3 -r1.1.1.5.2.4
--- mm/filemap.c	2000/09/21 03:11:53	1.1.1.5.2.3
+++ mm/filemap.c	2000/09/25 03:33:31	1.1.1.5.2.4
@@ -622,8 +622,8 @@
 	add_wait_queue(&page->wait, &wait);
 	do {
-		sync_page(page);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		sync_page(page);
 		if (!PageLocked(page))
 			break;
 		schedule();
Index: fs/buffer.c
===================================================================
RCS file: /home/andrea/cvs/linux/fs/buffer.c,v
retrieving revision 1.1.1.5.2.1
retrieving revision 1.1.1.5.2.2
diff -u -r1.1.1.5.2.1 -r1.1.1.5.2.2
--- fs/buffer.c	2000/09/06 19:57:51	1.1.1.5.2.1
+++ fs/buffer.c	2000/09/25 03:33:30	1.1.1.5.2.2
@@ -147,8 +147,8 @@
 	atomic_inc(&bh->b_count);
 	add_wait_queue(&bh->b_wait, &wait);
 	do {
-		run_task_queue(&tq_disk);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		run_task_queue(&tq_disk);
 		if (!buffer_locked(bh))
 			break;
 		schedule();

Think of what happens if the buffer returns locked between the
set_task_state(tsk, TASK_UNINTERRUPTIBLE) and the if (!buffer_locked(bh))
check. The window is very small, but it looks like a genuine window for a
deadlock. (And this one could surely explain infinite hangs in read... even
if it looks even less realistic than the EXCLUSIVE task thing.)

Andrea
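The pattern behind both hunks -- set the task state before the final
condition check, so a wakeup cannot slip into the gap between checking and
sleeping -- has a familiar user-space analogue. Below is a hedged sketch
using pthreads (the kernel uses wait queues and task states, not condition
variables; only the check-under-the-same-synchronization idea carries over):

	/* User-space analogue of the sleep/wakeup ordering above: the
	 * waiter re-checks the condition under the same lock the waker
	 * takes, so no wakeup can be lost between check and sleep. */
	#include <pthread.h>
	#include <stdio.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
	static int buffer_locked = 1;	/* "I/O still in flight" */

	static void *io_completion(void *arg)
	{
		(void)arg;
		pthread_mutex_lock(&lock);
		buffer_locked = 0;		/* like unlock_buffer() */
		pthread_cond_broadcast(&cond);	/* like wake_up(&bh->b_wait) */
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		pthread_create(&t, NULL, io_completion, NULL);

		/* Like: set_task_state(); kick the queue; re-check;
		 * schedule().  The check happens while we already hold
		 * the lock, so a completion cannot slip by unnoticed. */
		pthread_mutex_lock(&lock);
		while (buffer_locked)
			pthread_cond_wait(&cond, &lock);
		pthread_mutex_unlock(&lock);

		pthread_join(t, NULL);
		printf("buffer unlocked, waiter proceeds\n");
		return 0;
	}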
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > yet another elevator algorithm we need a squeaky clean VM balancer above
>
> FYI: My current tree (based on 2.4.0-test8-pre5) delivers 16mbyte/sec in
> the tiobench write test, compared to clean 2.4.0-test8-pre5 that delivers
> 8mbyte/sec

Great! I'm happy we have a fine-tuned elevator again.

> Also, I found the reason for your hang: it's the TASK_EXCLUSIVE in
> wait_for_request. The high part of the queue is reserved for reads. Now
> if a read completes and it wakes up a write, you'll hang.

Yep. But I don't understand why this makes any difference - the waitqueue
wakeup is FIFO, so any other request will eventually arrive. Could you
explain this bug a bit better?

> If you think I should delay those fixes to do something else, I don't
> agree, sorry.

No, I never meant that. I find it very good that those half-done changes are
being cleaned up and the remaining bugs / performance problems eliminated -
the first reports about bad write performance came right after the original
elevator patches went in, about 6 months ago.

Ingo
Re: [patch] vmfixes-2.4.0-test9-B2
On Mon, Sep 25, 2000 at 12:13:08PM +0200, Ingo Molnar wrote:
>
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
>
> > Not sure if this is the right moment for those changes though; I'm not
> > worried about ext2 but about the other non-networked fses that nobody
> > uses regularly.
>
> it *is* the right moment to clean these issues up. These kinds of things

I'm talking about the removal of the superblock lock from the filesystems.
Note: I don't have problems with the removal of the superblock lock even if
it's done at this stage. I'm not the one who can choose those things; it's
Linus's responsibility to take the final decision for the official tree.
But don't ask me to test patches that remove the superblock lock
_at_this_stage_, before I can run a stable and fast 2.4.x, because I won't
do that. Period.

> yet another elevator algorithm we need a squeaky clean VM balancer above

FYI: My current tree (based on 2.4.0-test8-pre5) delivers 16mbyte/sec in the
tiobench write test, compared to clean 2.4.0-test8-pre5 that delivers
8mbyte/sec, with only blkdev layer changes between the two kernels (and no,
that's not a matter of the elevator, since there are no seeks in the test
and I haven't changed the elevator sorting algorithm during the bench).

Also, I found the reason for your hang: it's the TASK_EXCLUSIVE in
wait_for_request. The high part of the queue is reserved for reads. Now if a
read completes and it wakes up a write, you'll hang.

If you think I should delay those fixes to do something else, I don't agree,
sorry.

> all. Please help identifying, fixing, debugging and testing these VM
> balancing issues. This is tough work and it needs to be done.

I have an alternative VM that I prefer from a design standpoint; I'll
improve it and I'll maintain it.

Andrea
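The wait_for_request scenario Andrea describes can be made concrete with a
toy model (the slot counts and the reservation check below are illustrative
guesses at the 2.4 behaviour, not kernel source):

	/* Toy model of the lost exclusive wakeup: one freed request slot,
	 * a FIFO wait queue with a writer ahead of a reader, and a write
	 * limit that keeps writers out of the read-reserved slots. */
	#include <stdio.h>

	#define NR_REQUESTS	64
	#define WRITE_LIMIT	(NR_REQUESTS * 2 / 3)	/* writers' share */
	#define READ_RESERVED	(NR_REQUESTS - WRITE_LIMIT)

	static int can_allocate(char kind, int free_slots)
	{
		if (kind == 'R')	/* reads may take any free slot */
			return free_slots > 0;
		/* writes must leave the read-reserved slots alone */
		return free_slots > READ_RESERVED;
	}

	int main(void)
	{
		char fifo[2] = { 'W', 'R' };	/* writer queued first */
		int free_slots = 1;		/* one read just completed */

		/* Exclusive wakeup: only the first waiter is woken. */
		if (!can_allocate(fifo[0], free_slots))
			printf("wakeup consumed by a writer that cannot "
			       "allocate; the reader sleeps on\n");
		return 0;
	}

Whether this produces a permanent hang or just a long stall depends on
whether something else later unplugs the queue or frees more slots, which
matches the back-and-forth between Ingo and Andrea above.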