Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 04 May 2007 11:39:22 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > I'm still not understanding.  The terms you're using are a bit ambiguous.
> >
> > What does "find some dirty unallocated blocks" mean?  Find a page which
> > is dirty and which does not have a disk mapping?
> >
> > Normally the above operation would be implemented via
> > ext4_writeback_writepage(), and it runs under lock_page().
>
> I'm mostly worried about the delayed allocation case. My impression was
> that holding a number of pages locked isn't a good idea, even if they're
> locked in index order. So, I was going to set a number of pages writeback,
> then allocate blocks for all of them at once, then put the proper blocknrs
> into the bh's (or PG_mappedtodisk?).

ooh, that sounds hacky and quite worrisome.  If someone comes in and does an
fsync() we've lost our synchronisation point.  Yes, all callers happen to do
lock_page(); wait_on_page_writeback(); (I think) but we've never considered
a bare PageWriteback() as something which protects page internals.

We're OK wrt page reclaim and we're OK wrt truncate and invalidate.  As long
as the page is uptodate we _should_ be OK wrt readpage().  But still, it'd
be better to use the standard locking rather than inventing new rules, if
possible.

I'd be 100% OK with locking multiple pages in ascending pgoff_t order.
Locking the page is the standard way of doing this synchronisation, and the
only problem I can think of is that having a tremendous number of pages
locked could cause the wake_up_page() waitqueue hashes to get overloaded and
go slow.  But it's also possible to lock many, many pages with readahead and
nobody has reported problems there.

> > > going to commit
> > > find inode I dirty
> > > do NOT find these blocks because they're allocated only,
> > >     but pages/bhs aren't mapped to them
> > > start commit
> >
> > I think you're assuming here that commit would be using ->t_sync_datalist
> > to locate dirty buffer_heads.
>
> nope, I mean sb->inode->page walk.
>
> > But under this proposal, t_sync_datalist just gets removed: the new
> > ordered-data mode _only_ needs to do the sb->inode->page walk.  So if I'm
> > understanding you, the way in which we'd handle any such race is to make
> > kjournald's writeback of the dirty pages block in lock_page().  Once it
> > gets the page lock it can look to see if some other thread has mapped
> > the page to disk.
>
> if I'm right about holding a number of pages locked, then they won't be
> locked, but in writeback. of course kjournald can block on writeback as
> well, but how does it find pages with *newly allocated* blocks only?

I don't think we'd want kjournald to do that.  Even if a page was dirtied by
an overwrite, we'd want to write it back during commit, just from a
quality-of-implementation point of view.  If we were to leave these pages
unwritten during commit then a post-recovery file could have a mix of
up-to-five-second-old data and up-to-30-second-old data.

> > It may turn out that kjournald needs a private way of getting at the
> > I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we
> > had the radix-tree-of-dirty-inodes thing then that's easy enough to do
> > anyway, with a tagged search.  But I expect that a single pass through
> > the superblock's dirty inodes would suffice for ordered-data.  Files
> > which have chattr +j would screw things up, as usual.
>
> not dirty inodes only, but rather some fast way to find pages with newly
> allocated pages.

Newly allocated blocks, you mean?  Just write out the overwritten blocks as
well as the new ones, I reckon.  It's what we do now.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Andrew Morton wrote:
> I'm still not understanding.  The terms you're using are a bit ambiguous.
>
> What does "find some dirty unallocated blocks" mean?  Find a page which is
> dirty and which does not have a disk mapping?
>
> Normally the above operation would be implemented via
> ext4_writeback_writepage(), and it runs under lock_page().

I'm mostly worried about the delayed allocation case. My impression was that
holding a number of pages locked isn't a good idea, even if they're locked
in index order. So, I was going to set a number of pages writeback, then
allocate blocks for all of them at once, then put the proper blocknrs into
the bh's (or PG_mappedtodisk?).

> > going to commit
> > find inode I dirty
> > do NOT find these blocks because they're allocated only,
> >     but pages/bhs aren't mapped to them
> > start commit
>
> I think you're assuming here that commit would be using ->t_sync_datalist
> to locate dirty buffer_heads.

nope, I mean sb->inode->page walk.

> But under this proposal, t_sync_datalist just gets removed: the new
> ordered-data mode _only_ needs to do the sb->inode->page walk.  So if I'm
> understanding you, the way in which we'd handle any such race is to make
> kjournald's writeback of the dirty pages block in lock_page().  Once it
> gets the page lock it can look to see if some other thread has mapped the
> page to disk.

if I'm right about holding a number of pages locked, then they won't be
locked, but in writeback. of course kjournald can block on writeback as
well, but how does it find pages with *newly allocated* blocks only?

> It may turn out that kjournald needs a private way of getting at the
> I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we
> had the radix-tree-of-dirty-inodes thing then that's easy enough to do
> anyway, with a tagged search.  But I expect that a single pass through the
> superblock's dirty inodes would suffice for ordered-data.  Files which
> have chattr +j would screw things up, as usual.

not dirty inodes only, but rather some fast way to find pages with newly
allocated pages.

> I assume (hope) that your delayed allocation code implements
> ->writepages()?  Doing the allocation one-page-at-a-time sounds painful...

indeed. this is a root cause of all this complexity.

thanks, Alex
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 04 May 2007 10:57:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:
> > > Andrew Morton wrote:
> > > > Yes, there can be issues with needing to allocate journal space
> > > > within the context of a commit.  But
> > >
> > > no-no, this isn't required. we only need to mark pages/blocks within
> > > the transaction, otherwise a race is possible: we allocate blocks in a
> > > transaction, then the transaction starts to commit, then we mark
> > > pages/blocks to be flushed before commit.
> >
> > I don't understand.  Can you please describe the race in more detail?
>
> if I understood your idea right, then in data=ordered mode, the commit
> thread writes all dirty mapped blocks before the real commit.
>
> say, we have two threads: t1 is a thread doing flushing and t2 is a
> commit thread
>
> t1                                    t2
> find dirty inode I
> find some dirty unallocated blocks
> journal_start()
> allocate blocks
> attach them to I
> journal_stop()

I'm still not understanding.  The terms you're using are a bit ambiguous.

What does "find some dirty unallocated blocks" mean?  Find a page which is
dirty and which does not have a disk mapping?

Normally the above operation would be implemented via
ext4_writeback_writepage(), and it runs under lock_page().

>                                       going to commit
>                                       find inode I dirty
>                                       do NOT find these blocks because
>                                       they're allocated only, but
>                                       pages/bhs aren't mapped to them
>                                       start commit

I think you're assuming here that commit would be using ->t_sync_datalist
to locate dirty buffer_heads.

But under this proposal, t_sync_datalist just gets removed: the new
ordered-data mode _only_ needs to do the sb->inode->page walk.  So if I'm
understanding you, the way in which we'd handle any such race is to make
kjournald's writeback of the dirty pages block in lock_page().  Once it gets
the page lock it can look to see if some other thread has mapped the page to
disk.

It may turn out that kjournald needs a private way of getting at the
I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we had
the radix-tree-of-dirty-inodes thing then that's easy enough to do anyway,
with a tagged search.  But I expect that a single pass through the
superblock's dirty inodes would suffice for ordered-data.  Files which have
chattr +j would screw things up, as usual.

I assume (hope) that your delayed allocation code implements
->writepages()?  Doing the allocation one-page-at-a-time sounds painful...

>                                       map pages/bhs to just allocated
>                                       blocks
>
> so, either we mark pages/bhs some way within
> journal_start()--journal_stop() or the commit thread should do a lookup
> for all dirty pages. the latter doesn't sound nice, IMHO.

I don't think I'm understanding you fully yet.
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Andrew Morton wrote:
> On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:
> > Andrew Morton wrote:
> > > Yes, there can be issues with needing to allocate journal space within
> > > the context of a commit.  But
> >
> > no-no, this isn't required. we only need to mark pages/blocks within the
> > transaction, otherwise a race is possible: we allocate blocks in a
> > transaction, then the transaction starts to commit, then we mark
> > pages/blocks to be flushed before commit.
>
> I don't understand.  Can you please describe the race in more detail?

if I understood your idea right, then in data=ordered mode, the commit
thread writes all dirty mapped blocks before the real commit.

say, we have two threads: t1 is a thread doing flushing and t2 is a commit
thread

t1                                    t2
find dirty inode I
find some dirty unallocated blocks
journal_start()
allocate blocks
attach them to I
journal_stop()
                                      going to commit
                                      find inode I dirty
                                      do NOT find these blocks because
                                      they're allocated only, but
                                      pages/bhs aren't mapped to them
                                      start commit
map pages/bhs to just allocated blocks

so, either we mark pages/bhs some way within journal_start()--journal_stop()
or the commit thread should do a lookup for all dirty pages. the latter
doesn't sound nice, IMHO.

thanks, Alex
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > Yes, there can be issues with needing to allocate journal space within
> > the context of a commit.  But
>
> no-no, this isn't required. we only need to mark pages/blocks within the
> transaction, otherwise a race is possible: we allocate blocks in a
> transaction, then the transaction starts to commit, then we mark
> pages/blocks to be flushed before commit.

I don't understand.  Can you please describe the race in more detail?
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Andrew Morton wrote:
> Yes, there can be issues with needing to allocate journal space within
> the context of a commit.  But

no-no, this isn't required. we only need to mark pages/blocks within the
transaction, otherwise a race is possible: we allocate blocks in a
transaction, then the transaction starts to commit, then we mark
pages/blocks to be flushed before commit.

> a) If the page has newly allocated space on disk then the metadata which
>    refers to that page is already in the journal: no new journal space
>    needed.
>
> b) If the page doesn't have space allocated on disk then we don't need to
>    write it out at ordered-mode commit time, because the post-recovery
>    filesystem will not have any references to that page.
>
> c) If the page is dirty due to overwrite then no metadata update was
>    required.
>
> IOW, under what circumstances would an ordered-mode commit need to
> allocate space for a delayed-allocate page?

no need to allocate space within the commit thread, I think. only to take
care of the race I described above. in a hackish version of data=ordered
for delayed allocation I used a counter of submitted bio's with
newly-allocated blocks, and the commit thread waits for the counter to
reach 0.

> However b) might lead to the hey-my-file-is-full-of-zeroes problem.

thanks, Alex
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Thu, 03 May 2007 21:38:10 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > We can make great improvements here, and I've (twice) previously
> > described how: hoist the entire ordered-mode data handling out of ext3,
> > and out of the buffer_head layer and move it up into the VFS pagecache
> > layer.  Basically, do ordered-data with a commit-time inode walk,
> > calling do_sync_mapping_range().
> >
> > Do it in the VFS.  Make reiserfs use it, remove reiserfs ordered-mode
> > too.  Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes
> > problem there.
>
> I'm not sure it's that easy.
>
> if we move to pages, then we have to mark pages to be flushed while
> holding the transaction open. now take delayed allocation into account:
> we need to allocate a number of blocks at once and then mark all pages
> mapped, again within the context of the same transaction.

Yes, there can be issues with needing to allocate journal space within the
context of a commit.  But

a) If the page has newly allocated space on disk then the metadata which
   refers to that page is already in the journal: no new journal space
   needed.

b) If the page doesn't have space allocated on disk then we don't need to
   write it out at ordered-mode commit time, because the post-recovery
   filesystem will not have any references to that page.

c) If the page is dirty due to overwrite then no metadata update was
   required.

IOW, under what circumstances would an ordered-mode commit need to allocate
space for a delayed-allocate page?

However b) might lead to the hey-my-file-is-full-of-zeroes problem.

> so, an implementation would look like the following?
>
> generic_writepages() {
>         /* collect set of contiguous dirty pages */
>         foo_get_blocks() {
>                 foo_journal_start();
>                 foo_new_blocks();
>                 foo_attach_blocks_to_inode();
>                 generic_mark_pages_mapped();
>                 foo_journal_stop();
>         }
> }
>
> another question is: will it scale well, given that the number of dirty
> inodes can be much larger than the number of inodes with dirty mapped
> blocks (in the delayed allocation case, for example)?

Possibly - zillions of dirty-for-atime inodes might get in the way.  A
short-term fix would be to create a separate dirty-inode list on the
superblock (ug).  A long-term fix is to rip out all the per-superblock
dirty-inode lists and use a radix-tree.  Not for lookup purposes, but for
the tree's ability to do tagged and restartable searches.
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Andrew Morton wrote:
> We can make great improvements here, and I've (twice) previously described
> how: hoist the entire ordered-mode data handling out of ext3, and out of
> the buffer_head layer and move it up into the VFS pagecache layer.
> Basically, do ordered-data with a commit-time inode walk, calling
> do_sync_mapping_range().
>
> Do it in the VFS.  Make reiserfs use it, remove reiserfs ordered-mode
> too.  Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes
> problem there.

I'm not sure it's that easy.

if we move to pages, then we have to mark pages to be flushed while holding
the transaction open. now take delayed allocation into account: we need to
allocate a number of blocks at once and then mark all pages mapped, again
within the context of the same transaction.

so, an implementation would look like the following?

generic_writepages() {
        /* collect set of contiguous dirty pages */
        foo_get_blocks() {
                foo_journal_start();
                foo_new_blocks();
                foo_attach_blocks_to_inode();
                generic_mark_pages_mapped();
                foo_journal_stop();
        }
}

another question is: will it scale well, given that the number of dirty
inodes can be much larger than the number of inodes with dirty mapped
blocks (in the delayed allocation case, for example)?

thanks, Alex
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Wed, 2007-05-02 at 08:53 +0200, Jens Axboe wrote:
> On Fri, Apr 27 2007, Linus Torvalds wrote:
> > So I do believe that we could probably do something about the IO
> > scheduling _too_:
> >
> >  - break up large write requests (yeah, it will make for worse IO
> >    throughput, but if we make it configurable, and especially with
> >    controllers that don't have insane overheads per command, the
> >    difference between 128kB requests and 16MB requests is probably not
> >    really even noticeable - SCSI things with large per-command overheads
> >    are just stupid)
> >
> >    Generating huge requests will automatically mean that they are
> >    "unbreakable" from an IO scheduler perspective, so it's bad for
> >    latency for other requests once they've started.
>
> Overlooked this one initially... We actually don't generate huge
> requests, exactly because of that.  Even if the device can do large
> requests (most SATA disks today can do 32meg), we default to 512kB as
> the largest one that we will build due to file system requests.  It's
> trivial to reduce that limit, see /sys/block/<dev>/queue/max_sectors_kb.
> That controls the maximum per-request size.

For the record, I haven't been able to stall KDE for ages with
data=writeback.

	-Mike
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, Apr 27 2007, Linus Torvalds wrote:
> So I do believe that we could probably do something about the IO
> scheduling _too_:
>
>  - break up large write requests (yeah, it will make for worse IO
>    throughput, but if we make it configurable, and especially with
>    controllers that don't have insane overheads per command, the
>    difference between 128kB requests and 16MB requests is probably not
>    really even noticeable - SCSI things with large per-command overheads
>    are just stupid)
>
>    Generating huge requests will automatically mean that they are
>    "unbreakable" from an IO scheduler perspective, so it's bad for
>    latency for other requests once they've started.

Overlooked this one initially... We actually don't generate huge requests,
exactly because of that.  Even if the device can do large requests (most
SATA disks today can do 32meg), we default to 512kB as the largest one that
we will build due to file system requests.  It's trivial to reduce that
limit, see /sys/block/<dev>/queue/max_sectors_kb.  That controls the
maximum per-request size.

-- 
Jens Axboe
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Sat, Apr 28 2007, Linus Torvalds wrote:
> > The main problem is that if the user extracts a tar archive, tar
> > eventually blocks on writeback I/O --- O.K.  But if bash attempts to
> > write one page to the .bash_history file at the same time, it blocks
> > too --- bad, the user is annoyed.
>
> Right, but it's actually very unlikely.  Think about it: the person who
> extracts the tar-archive is perhaps dirtying a thousand pages, while the
> .bash_history writeback is doing a single one.  Which process do you
> think is going to hit the "oops, we went over the limit" case 99.9% of
> the time?
>
> The _really_ annoying problem is when you just have absolutely tons of
> memory dirty, and you start doing the writeback: if you saturate the IO
> queues totally, it simply doesn't matter _who_ starts the writeback,
> because anybody who needs to do any IO at all (not necessarily writing)
> is going to be blocked.
>
> This is why having gigabytes of dirty data (or even "just" hundreds of
> megs) can be so annoying.
>
> Even with a good software IO scheduler, when you have disks that do
> tagged queueing, if you fill up the disk queue with a few dozen (depends
> on the disk what the queue limit is) huge write requests, it doesn't
> really matter if the _software_ queuing then gives a big advantage to
> reads coming in.  They'll _still_ be waiting for a long time, especially
> since you don't know what the disk firmware is going to do.
>
> It's possible that we could do things like refusing to use all tag
> entries on the disk for writing.  That would probably help latency a
> _lot_.  Right now, if we do writeback, and fill up all the slots on the
> disk, we cannot even feed the disk the read request immediately - we'll
> have to wait for some of the writes to finish before we can even queue
> the read to the disk.
>
> (Of course, if disks don't support tagged queueing, you'll never have
> this problem at all, but most disks do these days, and I strongly suspect
> it really can aggravate latency numbers a lot).
>
> Jens? Comments? Or do you do that already?

Yes, CFQ tries to handle that quite aggressively already.  With the
emergence of NCQ on SATA, it has become a much bigger problem since it's
seen so easily on the desktop.  The SCSI people usually don't care about
latency that much, so not many complaints there.

The recently posted patch series for CFQ that I will submit soon for 2.6.22
has more fixes/tweaks for this.

-- 
Jens Axboe
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Sat, Apr 28 2007, Mikulas Patocka wrote:
> > So perhaps if there's any privileged reads going on then we should
> > limit writes to a depth of 2 at most, with some timeout mechanism that
> > would
>
> SCSI has a "high priority" bit in the command block, so you can just set
> it --- but I am not sure how well disks support it.

I'd be surprised if it was useful.

-- 
Jens Axboe
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Sat, 28 Apr 2007, Lee Revell wrote:
> On 4/28/07, Mikulas Patocka <[EMAIL PROTECTED]> wrote:
> > I most wonder why vim fsyncs its swapfile regularly (blocking typing
> > during that) and doesn't fsync the resulting file on :w :-/
>
> Never seen this.  Why would fsync block typing unless vim was doing disk
> IO for every keystroke?
>
> Lee

Not for every keystroke, but after some time it calls fsync().  During the
execution of that call, the keyboard is blocked.  It is not normally a
problem (fsync executes very fast), but it starts to be a problem on an
extremely overloaded system.

Mikulas
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Lee Revell wrote:
> On 4/28/07, Mikulas Patocka <[EMAIL PROTECTED]> wrote:
> > I most wonder why vim fsyncs its swapfile regularly (blocking typing
> > during that) and doesn't fsync the resulting file on :w :-/
>
> Never seen this.  Why would fsync block typing unless vim was doing disk
> IO for every keystroke?

It does do that, for the crash-recovery files it maintains.
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Sat, Apr 28, 2007 at 07:37:17AM +0200, Mikulas Patocka wrote:
> > SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
> > phase tree filesystems (TUX2); it writes inside normal used structures,
> > but it marks each structure with generation tags --- when it updates the
> > global table of tags, it atomically makes several structures valid.  I
> > don't know about this idea being used elsewhere.
>
> So how is this generation structure organized ? paper ?

The paper is in the CITSA 2006 proceedings (but you likely don't have them
and I signed some statement that I can't post it elsewhere :-( )

Basically the idea is this:

* you have an array containing 65536 32-bit numbers --- the crash count
  table --- that array is on disk and in memory (see struct __spadfs->cct
  in my sources)

* you have a 16-bit value --- the crash count; that value is on disk and in
  memory too (see struct __spadfs->cc)

* On mount, you load the crash count table and crash count from disk to
  memory.  You increment the crash count on disk (but leave the old one in
  memory).  You increment one entry in the crash count table --- cct[cc]
  --- in memory, but leave the old one on disk.

* On sync, you write all metadata buffers, do a write barrier, write one
  sector of the crash count table from memory to disk and do a write
  barrier again.

* On unmount, you sync and decrement the crash count on disk --- so the
  crash count counts crashes --- it is increased each time you mount and
  don't unmount.

Consistency of structures:

* Each directory entry has two tags --- a 32-bit transaction count (txc)
  and a 16-bit crash count (cc).

* You create a directory entry with entry->txc = fs->txc[fs->cc] and
  entry->cc = fs->cc

* A directory entry is considered valid if fs->txc[entry->cc] >= entry->txc
  (see macro CC_VALID)

* If the directory entry is not valid, it is skipped during a directory
  scan, as if it weren't there

--- so you create a directory entry and it's valid.  If the system crashes,
it will load the crash count table from disk and there's a one-less value
than entry->txc, so the entry will be invalid.  It will also run with an
increased cc, so it will never touch txc at an old index, so the entry will
be invalid forever.

--- if you sync, you write the crash count table to disk and the directory
entry is atomically made valid forever (because values in the crash count
table never decrease)

In my implementation, the top bit of entry->txc is used to mark whether the
entry is scheduled for adding or deletion, so that you can atomically add
one directory entry and delete another.

Space allocation bitmaps or lists are managed in such a way that there are
two copies and a cc/txc pair determining which one is valid.  Files are
extended in such a way that each file has two "size" entries and a cc/txc
pair denoting which one is valid, so that you can atomically
extend/truncate a file and mark its space allocated/freed in bitmaps or
lists (BTW, this cc/txc pair is the same one that denotes whether the
directory entry is valid, and another bit determines which of these two
functions --- to save space).

Mikulas
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Sat, Apr 28, 2007 at 07:37:17AM +0200, Mikulas Patocka wrote:
> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
> phase tree filesystems (TUX2); it writes inside normal used structures,
> but it marks each structure with generation tags --- when it updates the
> global table of tags, it atomically makes several structures valid.  I
> don't know about this idea being used elsewhere.

So how is this generation structure organized ? paper ?

bill
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On 4/28/07, Mikulas Patocka <[EMAIL PROTECTED]> wrote:
> I most wonder why vim fsyncs its swapfile regularly (blocking typing
> during that) and doesn't fsync the resulting file on :w :-/

Never seen this.  Why would fsync block typing unless vim was doing disk IO
for every keystroke?

Lee
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
> hm, fsync.  Aside: why the heck do applications think that their data is
> so important that they need to fsync it all the time.  I used to run a
> kernel on my laptop which had "return 0;" at the top of fsync() and
> fdatasync().  Most pleasurable.  But wedging for 20 minutes is probably
> excessive punishment.

I most wonder why vim fsyncs its swapfile regularly (blocking typing during
that) and doesn't fsync the resulting file on :w :-/

Mikulas
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Sat, 28 Apr 2007, Linus Torvalds wrote: The main problem is that if the user extracts tar archive, tar eventually blocks on writeback I/O --- O.K. But if bash attempts to write one page to .bash_history file at the same time, it blocks too --- bad, the user is annoyed. Right, but it's actually very unlikely. Think about it: the person who extracts the tar-archive is perhaps dirtying a thousand pages, while the .bash_history writeback is doing a single one. Which process do you think is going to hit the "oops, we went over the limit" case 99.9% of the time? Both. See balance_dirty_pages --- you loop there if global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS) + global_page_state(NR_WRITEBACK) is over limit. So tar gets there first, start writeback, blocks. Innocent process calling one small write() gets there too (while writeback has not yet finished), sees that the expression is over limit and blocks too. Really, you go to ballance_dirty_pages with 1/8 probability, so small writers will block with that probability --- better than blocking always, but still annoying. The _really_ annoying problem is when you just have absolutely tons of memory dirty, and you start doing the writeback: if you saturate the IO queues totally, it simply doesn't matter _who_ starts the writeback, because anybody who needs to do any IO at all (not necessarily writing) is going to be blocked. I saw this writeback problem on machine that had a lot of memory (1G), internal fast disk where the distribution was installed and very slow external SCSI disk (6MB/s or so). When I did heavy write on the external disk and writeback started, the computer almost completely locked up --- any process trying to write anything to the fast disk blocked until writeback on the slow disk finishes. 
(that machine had some old RHEL kernel and it is not mine so I can't test new kernels on it --- but the above fragment of code shows that the problem still exists today)

Mikulas
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
> So perhaps if there's any privileged reads going on then we should limit
> writes to a depth of 2 at most, with some timeout mechanism that would

SCSI has a "high priority" bit in the command block, so you can just set it --- but I am not sure how well disks support it.

Mikulas

> gradually allow the deepening of the hardware queue, as long as no
> highprio reads come inbetween? With 2 pending requests and even assuming
> worst-case seeks the user-visible latency would be on the order of 20-30
> msecs, which is at the edge of human perception. The problem comes when a
> hardware queue of 32-64 entries starves that one highprio read which then
> results in a 2+ seconds latency.
>
> 	Ingo
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Sat, 28 Apr 2007 09:30:06 -0700 (PDT) Linus Torvalds <[EMAIL PROTECTED]> wrote: > There are worse examples. Try connecting some flash disk over USB-1, and > untar to it. Ugh. > > I'd love to have some per-device dirty limit, but it's harder than it > should be. this one should help: Patch: per device dirty throttling http://lwn.net/Articles/226709/ -- Paolo Ornati Linux 2.6.21-cfs-v7-g13fe02de on x86_64
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
* Linus Torvalds <[EMAIL PROTECTED]> wrote: > Even with a good software IO scheduler, when you have disks that do > tagged queueing, if you fill up the disk queue with a few dozen > (depends on the disk what the queue limit is) huge write requests, it > doesn't really matter if the _software_ queuing then gives a big > advantage to reads coming in. They'll _still_ be waiting for a long > time, especially since you don't know what the disk firmware is going > to do. by far the largest advantage of tagged queueing is when we go from 1 pending request to 2 pending requests. The rest helps too for certain workloads (especially benchmarks), but if the IRQ handling is fast enough, having just 2 is more than enough to get 80% of the advantage of, say, a hardware queue with a depth of 64. So perhaps if there's any privileged reads going on then we should limit writes to a depth of 2 at most, with some timeout mechanism that would gradually allow the deepening of the hardware queue, as long as no highprio reads come inbetween? With 2 pending requests and even assuming worst-case seeks the user-visible latency would be on the order of 20-30 msecs, which is at the edge of human perception. The problem comes when a hardware queue of 32-64 entries starves that one highprio read which then results in a 2+ seconds latency. Ingo
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Sat, 28 Apr 2007, Matthias Andree wrote: > > Another thing that is rather unpleasant (haven't yet tried fiddling with > the dirty limits) is UDF to DVD-RAM - try rsyncing /home to a DVD-RAM, > that's going to leave you with tons of dirty buffers that clear slowly > -- "watch -n 1 grep -i dirty /proc/meminfo" is boring, but elucidating... Now *this* is actually really really nasty. There are worse examples. Try connecting some flash disk over USB-1, and untar to it. Ugh. I'd love to have some per-device dirty limit, but it's harder than it should be. Linus
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Sat, 28 Apr 2007, Mikulas Patocka wrote: > > > > Especially with lots of memory, allowing 40% of that memory to be dirty is > > just insane (even if we limit it to "just" 40% of the normal memory zone. > > That can be gigabytes. And no amount of IO scheduling will make it > > pleasant to try to handle the situation where that much memory is dirty. > > What about using different dirtypage limits for different processes? Not good. We inadvertently actually had a very strange case of that, in the sense that we had different dirtypage limits depending on the type of the allocation: if somebody used GFP_HIGHUSER, he'd be looking at the percentage as a percentage of _all_ memory, but if somebody used GFP_KERNEL he'd look at it as a percentage of just the normal low memory. So effectively they had different limits (the percentage may have been the same, but the _meaning_ of the percentage changed ;) And it's really problematic, because it means that the process that has a high tolerance for dirty memory will happily dirty a lot of RAM, and then when the process that has a _low_ tolerance comes along, it might write just a single byte, and go "oh, damn, I'm way over my dirty limits, I will now have to start doing writeouts like mad". Your form is much better: > --- i.e. every process has dirtypage activity counter, that is increased when > it dirties a page and decreased over time. ..but it is really hard to do, and in particular, it's really hard to make any kinds of guarantees that when you have a hundred processes, they won't go over the total dirty limit together! And one of the reasons for the dirty limit is that the VM really wants to know that it always has enough clean memory that it can throw away, so that even if it needs to do allocations while under IO, it's not totally screwed.
An example of this is using dirty mmap with a networked filesystem: with 2.6.20 and later, this should actually _work_ fairly reliably, exactly because we now also count the dirty mapped pages in the dirty limits, so we never get into the situation that we used to be able to get into, where some process had mapped all of RAM, and dirtied it without the kernel even realizing, and then when the kernel needed more memory (in order to write some of it back), it was totally screwed. So we do need the "global limit", as just a VM safety issue. We could do some per-process counters in addition to that, but generally, the global limit actually ends up doing the right thing: heavy writers are more likely to _hit_ the limit, so statistically the people who write most are also the people who end up having to clean up - so it's all fair. > The main problem is that if the user extracts tar archive, tar eventually > blocks on writeback I/O --- O.K. But if bash attempts to write one page to > .bash_history file at the same time, it blocks too --- bad, the user is > annoyed. Right, but it's actually very unlikely. Think about it: the person who extracts the tar-archive is perhaps dirtying a thousand pages, while the .bash_history writeback is doing a single one. Which process do you think is going to hit the "oops, we went over the limit" case 99.9% of the time? The _really_ annoying problem is when you just have absolutely tons of memory dirty, and you start doing the writeback: if you saturate the IO queues totally, it simply doesn't matter _who_ starts the writeback, because anybody who needs to do any IO at all (not necessarily writing) is going to be blocked. This is why having gigabytes of dirty data (or even "just" hundreds of megs) can be so annoying.
Even with a good software IO scheduler, when you have disks that do tagged queueing, if you fill up the disk queue with a few dozen (depends on the disk what the queue limit is) huge write requests, it doesn't really matter if the _software_ queuing then gives a big advantage to reads coming in. They'll _still_ be waiting for a long time, especially since you don't know what the disk firmware is going to do. It's possible that we could do things like refusing to use all tag entries on the disk for writing. That would probably help latency a _lot_. Right now, if we do writeback, and fill up all the slots on the disk, we cannot even feed the disk the read request immediately - we'll have to wait for some of the writes to finish before we can even queue the read to the disk. (Of course, if disks don't support tagged queueing, you'll never have this problem at all, but most disks do these days, and I strongly suspect it really can aggravate latency numbers a lot). Jens? Comments? Or do you do that already? Linus
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Sat, 28 Apr 2007 10:51:48 +0200 Matthias Andree <[EMAIL PROTECTED]> wrote: > On Fri, 27 Apr 2007, Andrew Morton wrote: > > > But none of this explains a 20-minute hang, unless a *lot* of fsyncs are > > being performed, perhaps. > > Another thing that is rather unpleasant (haven't yet tried fiddling with > the dirty limits) is UDF to DVD-RAM - try rsyncing /home to a DVD-RAM, > that's going to leave you with tons of dirty buffers that clear slowly > -- "watch -n 1 grep -i dirty /proc/meminfo" is boring, but elucidating... > yes, a few people are attacking that from various angles at present. It's tricky - writeback has to juggle a lot of balls. We'll get there.
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 27 Apr 2007, Andrew Morton wrote: > But none of this explains a 20-minute hang, unless a *lot* of fsyncs are > being performed, perhaps. Another thing that is rather unpleasant (haven't yet tried fiddling with the dirty limits) is UDF to DVD-RAM - try rsyncing /home to a DVD-RAM, that's going to leave you with tons of dirty buffers that clear slowly -- "watch -n 1 grep -i dirty /proc/meminfo" is boring, but elucidating... -- Matthias Andree
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 27 Apr 2007, Linus Torvalds wrote: > Oh, well.. Journalling sucks. > > I was actually _really_ hoping that somebody would come along and tell > everybody that this whole journal-logging is stupid, and that it's just > better to not ever re-write blocks on disk, but instead write to new > blocks with version numbers (and not re-use old blocks until new versions > are stable on disk). Only that you need direct-overwrite support to be able to safely trash data you no longer need... -- Matthias Andree
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 27 Apr 2007, Linus Torvalds wrote: > > > On Fri, 27 Apr 2007, Marat Buharov wrote: > > > > On 4/27/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > > > Aside: why the heck do applications think that their data is so important > > > that they need to fsync it all the time. I used to run a kernel on my > > > laptop which had "return 0;" at the top of fsync() and fdatasync(). Most > > > pleasurable. > > > > So, if having fake fsync() and fdatasync() is pleasurable for laptop > > and desktop, may be it's time to add option into Kconfig which > > disables normal fsync behaviour in favor of robust desktop? > > This really is an ext3 issue, not "fsync()". > > On a good filesystem, when you do "fsync()" on a file, nothing at all > happens to any other files. On ext3, it seems to sync the global journal, This behavior has been in Linux and sort of official since the early 2.4.X days - remember the discussion on fsync()ing directory changes for MTAs that led to the mount option "dirsync" for ext?fs so that rename(), link() and stuff like that became synchronous even without fsync()ing the parent directory? I can look up archive references if need be. Surely four years ago, if not five (this is from the top of my head, not a quotable fact I verified from the LKML archives though). > I used to run reiserfs, and it had its problems, but this was the > "feature" of ext3 that I've disliked most. If you run a MUA with local > mail, it will do fsync's for most things, and things really hickup if you > are doing some other writes at the same time. In contrast, with reiser, if > you did a big untar or some other big write, if somebody fsync'ed a small > file, it wasn't even a blip on the radar - the fsync would sync just that > small thing. It's not as though I'd recommend reiserfs. I have seen one major corruption recently in openSUSE 10.2 with ext3, but I've had constant headaches with reiserfs since the day it went into S.u.S.E. kernels, until I switched away from reiserfs some years ago. -- Matthias Andree
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Sat, 2007-04-28 at 00:01 -0700, Andrew Morton wrote: > On Sat, 28 Apr 2007 08:32:32 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote: > > > On Sat, 2007-04-28 at 06:25 +0200, Mike Galbraith wrote: > > As promised, I tested with a kernel that I know for fact that I have > > tested heavy IO on previously, and behavior was identically horrid, so > > it's not something new that snuck in ~recently, my disk just got a _lot_ > > fuller in the meantime (12k mp3s munch a lot). > > Just to clarify here - you're saying that some older kernel is as sucky as > 2.6.21, and that (presumably) dropping the dirty ratios makes things a bit > better on the old kernel as well? I didn't drop dirty ratios, only verified that behavior was just as horrible as 2.6.21. > Actually, I'm surprised that data=writeback didn't help much. If the > present theories are correct it should have helped quite a lot, because in > data=writeback mode fsync(small-file) will not cause > fdatasync(everything-else). data=writeback did help quite noticeably, just not enough. -Mike
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Sat, 28 Apr 2007 08:32:32 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote: > On Sat, 2007-04-28 at 06:25 +0200, Mike Galbraith wrote: > > On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote: > > > > > Actually, you don't need to apply the patch - just do > > > > > > echo 5 > /proc/sys/vm/dirty_background_ratio > > > echo 10 > /proc/sys/vm/dirty_ratio > > > > That seems to have done the trick. Amarok and GUI aren't exactly speed > > demons while writeout is happening, but they are not hanging for > > eternities. > > As promised, I tested with a kernel that I know for fact that I have > tested heavy IO on previously, and behavior was identically horrid, so > it's not something new that snuck in ~recently, my disk just got a _lot_ > fuller in the meantime (12k mp3s munch a lot). Just to clarify here - you're saying that some older kernel is as sucky as 2.6.21, and that (presumably) dropping the dirty ratios makes things a bit better on the old kernel as well? > I also verified that I don't need to use the dirty data restrictions > with ext2, all is just peachy using stock settings. Amarok switches > songs quickly, and GUI doesn't hang. Behavior is that expected of a > heavily loaded IO subsystem, and is 1000% better than ext3 with my very > full disk. Yes, the very full disk could explain why things are _so_ bad. Not only does fsync() force vast amounts of writeout, it's also seeky writeout. > Journaling is very nice, but I think I'll be much better off without it > responsiveness wise. Well, physical journalling with ordered data is bad here. Other forms of journalling which don't introduce this great contention point shouldn't be as bad. Actually, I'm surprised that data=writeback didn't help much. If the present theories are correct it should have helped quite a lot, because in data=writeback mode fsync(small-file) will not cause fdatasync(everything-else). 
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Sat, 2007-04-28 at 06:25 +0200, Mike Galbraith wrote: > On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote: > > > Actually, you don't need to apply the patch - just do > > > > echo 5 > /proc/sys/vm/dirty_background_ratio > > echo 10 > /proc/sys/vm/dirty_ratio > > That seems to have done the trick. Amarok and GUI aren't exactly speed > demons while writeout is happening, but they are not hanging for > eternities. As promised, I tested with a kernel that I know for fact that I have tested heavy IO on previously, and behavior was identically horrid, so it's not something new that snuck in ~recently, my disk just got a _lot_ fuller in the meantime (12k mp3s munch a lot). I also verified that I don't need to use the dirty data restrictions with ext2, all is just peachy using stock settings. Amarok switches songs quickly, and GUI doesn't hang. Behavior is that expected of a heavily loaded IO subsystem, and is 1000% better than ext3 with my very full disk. Journaling is very nice, but I think I'll be much better off without it responsiveness wise. -Mike
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 27 Apr 2007, Linus Torvalds wrote:
> On Fri, 27 Apr 2007, Mike Galbraith wrote:
> >
> > As subject states, my GUI is going away for extended periods of time
> > when my very full and likely highly fragmented (how to find out)
> > filesystem is under heavy write load. While write is under way, if
> > amarok (mp3 player) is running, no song change will occur until write is
> > finished, and the GUI can go _entirely_ comatose for very long periods.
> > Usually, it will come back to life after write is finished, but
> > occasionally, a complete GUI restart is necessary.
>
> One thing to try out (and dammit, I should make it the default now in
> 2.6.21) is to just make the dirty limits much lower. We've been talking
> about this for ages, I think this might be the right time to do it.
>
> Especially with lots of memory, allowing 40% of that memory to be dirty is
> just insane (even if we limit it to "just" 40% of the normal memory zone.
> That can be gigabytes. And no amount of IO scheduling will make it
> pleasant to try to handle the situation where that much memory is dirty.

What about using different dirtypage limits for different processes? --- i.e. every process has a dirtypage activity counter, that is increased when it dirties a page and decreased over time. Compute the limit for a process as some inverse of this counter --- so that processes that dirtied a lot of pages will be blocked at a lower limit and processes that dirtied few pages will be blocked at a higher limit.

The main problem is that if the user extracts tar archive, tar eventually blocks on writeback I/O --- O.K. But if bash attempts to write one page to .bash_history file at the same time, it blocks too --- bad, the user is annoyed.
(I don't have time to write and test it, it is just an idea --- I found these writeback lockups of the whole system annoying too)

Mikulas
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 27 Apr 2007, Bill Huey wrote:
> On Fri, Apr 27, 2007 at 12:50:34PM -0700, Linus Torvalds wrote:
> > Oh, well.. Journalling sucks.
> >
> > I was actually _really_ hoping that somebody would come along and tell
> > everybody that this whole journal-logging is stupid, and that it's just
> > better to not ever re-write blocks on disk, but instead write to new
> > blocks with version numbers (and not re-use old blocks until new versions
> > are stable on disk). There was even somebody who did something like that
> > for a PhD thesis, I forget the details (and it apparently died when the
> > thesis was presumably accepted ;).
>
> That sounds a whole lot like NetApp's WAFL file system and is heavily
> patented.
>
> bill

Hi

SpadFS doesn't write to unallocated parts like log filesystems (LFS) or phase tree filesystems (TUX2); it writes inside normal used structures, but it marks each structure with generation tags --- when it updates the global table of tags, it atomically makes several structures valid. I don't know about this idea being used elsewhere.

Its fsync is slow too (needs to write all (meta)data too), but it at least doesn't livelock --- fsync is basically:

* write all buffers and wait for completion
* take lock preventing metadata updates
* write all buffers again (those that were updated while previous write was in progress) and wait for completion
* update global generation count table
* release the lock

Maybe Suse will be paying me from this autumn to make more features to it --- so far it works, doesn't eat data, but isn't much known :)

Mikulas
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Sat, 28 Apr 2007, Mikulas Patocka wrote:
> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
> phase tree filesystems (TUX2);

--- BTW, I don't think that writing to unallocated parts of disk is good idea. These filesystems have cool write benchmarks, but one subtle (and unbenchmarkable) problem: They group files according to time when they were created and not according to directory hierarchy. When the user has directory with project files and he edited different files at different times, normal filesystems will place the files near each other (so that "grep blabla *" is fast) and log-structured filesystems will scatter the files over the whole disk.

Mikulas
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote: > Actually, you don't need to apply the patch - just do > > echo 5 > /proc/sys/vm/dirty_background_ratio > echo 10 > /proc/sys/vm/dirty_ratio That seems to have done the trick. Amarok and GUI aren't exactly speed demons while writeout is happening, but they are not hanging for eternities. -Mike
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 27 Apr 2007 13:31:30 -0600 Andreas Dilger <[EMAIL PROTECTED]> wrote: > On Apr 27, 2007 08:30 -0700, Linus Torvalds wrote: > > On a good filesystem, when you do "fsync()" on a file, nothing at all > > happens to any other files. On ext3, it seems to sync the global journal, > > which means that just about *everything* that writes even a single byte > > (well, at least anything journalled, which would be all the normal > > directory ops etc) to disk will just *stop* dead cold! > > > > It's horrid. And it really is ext3, not "fsync()". > > > > I used to run reiserfs, and it had its problems, but this was the > > "feature" of ext3 that I've disliked most. If you run a MUA with local > > mail, it will do fsync's for most things, and things really hickup if you > > are doing some other writes at the same time. In contrast, with reiser, if > > you did a big untar or some other big write, if somebody fsync'ed a small > > file, it wasn't even a blip on the radar - the fsync would sync just that > > small thing. > > It's true that this is a "feature" of ext3 with data=ordered (the default), > but I suspect the same thing is now true in reiserfs too. The reason is > that if a journal commit doesn't flush the data as well then a crash will > result in garbage (from old deleted files) being visible in the newly > allocated file. People used to complain about this with reiserfs all the > time having corrupt data in new files after a crash, which is why I believe > it was fixed. People still complain about hey-my-files-are-all-full-of-zeroes on XFS. > There definitely are some problems with the ext3 journal commit though. > If the journal is full it will cause the whole journal to checkpoint out > to the filesystem synchronously even if just space for a small transaction > is needed. That is doubly bad if you have a very large journal. I believe > Alex has a patch to have it checkpoint much smaller chunks to the fs. 
We can make great improvements here, and I've (twice) previously described how: hoist the entire ordered-mode data handling out of ext3, and out of the buffer_head layer and move it up into the VFS pagecache layer. Basically, do ordered-data with a commit-time inode walk, calling do_sync_mapping_range(). Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too. Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there. And guess what? We can then partly fix _this_ problem too. If we're running a commit on behalf of fsync(inode1) and we come across an inode2 which doesn't have any block allocation metadata in this commit, we don't need to sync inode2's pages. Weep. It's times like this when I want to escape all this patch-wrangling nonsense and go do some real stuff.
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 27 Apr 2007 13:09:06 -0600 Zan Lynx <[EMAIL PROTECTED]> wrote: > On Fri, 2007-04-27 at 11:31 -0700, Andrew Morton wrote: > [snip] > > ext3's problem here is that a single fsync() requires that ext3 sync the > > whole filesystem. Because > > > > - a journal commit can contain metadata from multiple files, and if we > > want to journal one file's metadata via fsync(), we unavoidably journal > > all the other file's metadata at the same time. > > > > - ordered mode requires that we write a file's data blocks prior to > > journalling the metadata which refers to those blocks. > > > > net result: syncing anything syncs the whole world. > > > > There are a few areas in which this could conceivably be tuned up: if a > > particular file doesn't currently have any metadata in the commit, we don't > > actually need to sync its data blocks: we could just transfer them into > > next commit. Hard, unlikely to be of benefit. > [snip] > > How about mixing the ordered and data journal modes? If the data blocks > would fit, have fsync write them into the journal as is done in > data=journal mode. Then that file data is committed to disk as fsync > requires, but it shouldn't require flushing all the previous metadata to > get an ordered guarantee. In some ways that would be quite neat: if a process does a small write then fsyncs it, write it all into the journal. That avoids a seek out to the file's data blocks. However it'd be quite hard to do, I expect: we don't know until commit time how much data has been written to this file (actually, we don't even know at commit-time, but we could, with quite some work, find out). But none of this will solve the problem, because even with your optimised fsync(), we still need to write out bonnie's large file at commit time, when we fsync() your small write to a different file. (And when I say "this problem" I refer to the known-about problem which we're discussing here. 
I suspect this in fact isn't Mike's problem - 20 minutes is crazy - it's not attributable to the fsync-syncs-everything problem unless Mike's GUI is doing a huge number of separate fsyncs)
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 27 Apr 2007, Jan Engelhardt wrote: > > Interesting. For my laptop, I have configured like 90 for > dirty_background_ratio and 95 for dirty_ratio. Makes for a nice > delayed write, but I do not do workloads bigger than extracing kernel > tarballs (~250 MB) and coding away on that machine (488 MB RAM) anyway. > Setting it to something like 95, I could probably rm -Rf the kernel > tree again and the disk never gets active because it is all cached. > But if dirty_ratio is lowered, the disk will get active soon. Yes. For laptops, you may want to - raise the dirty limits - increase the dirty scan times but you do realize that if you then need memory for something else, latency just becomes *horrible*. So even on laptops, it's not obviously the right thing to do (these days, throwing money at the problem instead, and getting one of the nice new 1.8" flash disks, will solve all issues: you'd have no reason to try to delay spinning up the disk anyway). Linus
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Linus Torvalds wrote:
> On Fri, 27 Apr 2007, Andreas Dilger wrote:
> > It's true that this is a "feature" of ext3 with data=ordered (the default),
> > but I suspect the same thing is now true in reiserfs too.
>
> Oh, well.. Journalling sucks.

Go back to ext2? ;)

> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new versions
> are stable on disk).

Ah, "copy on write"! ZFS (Sun) and WAFL (NetApp) do this. Don't know about WAFL, but ZFS does logging too.

-Manoj
--
Manoj Joseph http://kerneljunkie.blogspot.com/
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Linus Torvalds wrote:
> On Fri, 27 Apr 2007, Andreas Dilger wrote:
> > It's true that this is a "feature" of ext3 with data=ordered (the default),
> > but I suspect the same thing is now true in reiserfs too.
>
> Oh, well.. Journalling sucks.
>
> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new versions
> are stable on disk).

That sort of sounds like something NCR used to do in the mainframe days: files had generation numbers, and multiple generations of the files were kept around, with the OS automatically removing the older ones.

> There was even somebody who did something like that for a PhD thesis, I
> forget the details (and it apparently died when the thesis was presumably
> accepted ;).
>
> Linus

--
"They that give up essential liberty to obtain temporary safety, deserve neither liberty nor safety." (Ben Franklin)
"The course of history shows that as a government grows, liberty decreases." (Thomas Jefferson)
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Linus Torvalds wrote:
> There was even somebody who did something like that for a PhD thesis, I
> forget the details (and it apparently died when the thesis was presumably
> accepted ;).
>
> Linus

You mean SpadFS [1], right?

Gabriel

[1] http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
In article <[EMAIL PROTECTED]> you write:
> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new versions
> are stable on disk).
>
> There was even somebody who did something like that for a PhD thesis, I
> forget the details (and it apparently died when the thesis was presumably
> accepted ;).

If you mean tux2, it died because of patent issues:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0208.3/0332.html

Mike.
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, Apr 27, 2007 at 12:50:34PM -0700, Linus Torvalds wrote:
> Oh, well.. Journalling sucks.
>
> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new versions
> are stable on disk).
>
> There was even somebody who did something like that for a PhD thesis, I
> forget the details (and it apparently died when the thesis was presumably
> accepted ;).

That sounds a whole lot like NetApp's WAFL file system and is heavily patented.

bill
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Apr 27 2007 08:18, Linus Torvalds wrote:
>
> Actually, you don't need to apply the patch - just do
>
>	echo 5 > /proc/sys/vm/dirty_background_ratio
>	echo 10 > /proc/sys/vm/dirty_ratio
>
> and say if it seems to improve things. I think those are much saner
> defaults especially for a desktop system (and probably for most servers
> too, for that matter).

Interesting. For my laptop, I have configured like 90 for dirty_background_ratio and 95 for dirty_ratio. Makes for a nice delayed write, but I do not do workloads bigger than extracting kernel tarballs (~250 MB) and coding away on that machine (488 MB RAM) anyway. Setting it to something like 95, I could probably rm -Rf the kernel tree again and the disk never gets active because it is all cached. But if dirty_ratio is lowered, the disk will get active soon.

> Historical note: allowing about half of memory to contain dirty pages made
> more sense back in the days when people had 16-64MB of memory, and a
> single untar of even fairly small projects would otherwise hit the disk.
> But memory sizes have grown *much* more quickly than disk speeds (and
> latency requirements have gone down, not up), so a default that may
> actually have been perfectly fine at some point seems crazy these days..

Jan
RE: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
The idea has not died and some NAS/file server vendors have already been doing this for some time. (I am not sure, but is WAFS the same thing?)

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:linux-kernel-[EMAIL PROTECTED] On Behalf Of Linus Torvalds
> Sent: Friday, April 27, 2007 12:51 PM
> To: Andreas Dilger
> Cc: Marat Buharov; Andrew Morton; Mike Galbraith; LKML; Jens Axboe; [EMAIL PROTECTED]; Alex Tomas
> Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
>
> On Fri, 27 Apr 2007, Andreas Dilger wrote:
> >
> > It's true that this is a "feature" of ext3 with data=ordered (the default),
> > but I suspect the same thing is now true in reiserfs too.
>
> Oh, well.. Journalling sucks.
>
> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new versions
> are stable on disk).
>
> There was even somebody who did something like that for a PhD thesis, I
> forget the details (and it apparently died when the thesis was presumably
> accepted ;).
>
> Linus
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 27 Apr 2007, Andreas Dilger wrote:
>
> It's true that this is a "feature" of ext3 with data=ordered (the default),
> but I suspect the same thing is now true in reiserfs too.

Oh, well.. Journalling sucks.

I was actually _really_ hoping that somebody would come along and tell everybody that this whole journal-logging is stupid, and that it's just better to not ever re-write blocks on disk, but instead write to new blocks with version numbers (and not re-use old blocks until new versions are stable on disk).

There was even somebody who did something like that for a PhD thesis, I forget the details (and it apparently died when the thesis was presumably accepted ;).

Linus
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 2007-04-27 at 13:31 -0600, Andreas Dilger wrote:
> I believe Alex has a patch to have it checkpoint much smaller chunks to the fs.

I wouldn't be averse to test driving such a patch (understatement). You have a pointer?

-Mike
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Linus Torvalds wrote:
> On Fri, 27 Apr 2007, John Anthony Kazos Jr. wrote:
> > Could[/should] this stuff be changed from ratios to amounts? Or a quick
> > boot-time test to use a ratio if the memory is small and an amount (like
> > tax brackets, I would expect) if it's great?
>
> Yes, the "percentage" thing was likely wrong. That said, there *is* some
> correlation between "lots of memory" and "high-end machine", and that in
> turn tends to correlate with "fast disk", so I don't think the percentage
> approach is really *horribly* wrong.
>
> The main issue with the percentage is that we do export them as such
> through the /proc/ interface, and they are easy to change and understand.
> So changing them to amounts is non-trivial if you also want to support the
> old interfaces - and the advantage isn't obvious enough that it's a
> clear-cut case.

I wonder if it would be useful if the limit was "data we can write out in 1 (configurable) second". This would typically mean either one 50MB (depending on disk) contiguous block or 100-200 scattered blocks (since the typical disk latency is about 5-10ms). Has anyone tried something like this?

Mark
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Apr 27, 2007 08:30 -0700, Linus Torvalds wrote:
> On a good filesystem, when you do "fsync()" on a file, nothing at all
> happens to any other files. On ext3, it seems to sync the global journal,
> which means that just about *everything* that writes even a single byte
> (well, at least anything journalled, which would be all the normal
> directory ops etc) to disk will just *stop* dead cold!
>
> It's horrid. And it really is ext3, not "fsync()".
>
> I used to run reiserfs, and it had its problems, but this was the
> "feature" of ext3 that I've disliked most. If you run a MUA with local
> mail, it will do fsyncs for most things, and things really hiccup if you
> are doing some other writes at the same time. In contrast, with reiser, if
> you did a big untar or some other big write, if somebody fsync'ed a small
> file, it wasn't even a blip on the radar - the fsync would sync just that
> small thing.

It's true that this is a "feature" of ext3 with data=ordered (the default), but I suspect the same thing is now true in reiserfs too. The reason is that if a journal commit doesn't flush the data as well then a crash will result in garbage (from old deleted files) being visible in the newly allocated file. People used to complain about this with reiserfs all the time, having corrupt data in new files after a crash, which is why I believe it was fixed.

There definitely are some problems with the ext3 journal commit though. If the journal is full it will cause the whole journal to checkpoint out to the filesystem synchronously even if just space for a small transaction is needed. That is doubly bad if you have a very large journal. I believe Alex has a patch to have it checkpoint much smaller chunks to the fs.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 2007-04-27 at 11:31 -0700, Andrew Morton wrote:
> But none of this explains a 20-minute hang, unless a *lot* of fsyncs are
> being performed, perhaps.

Yes. I need to do a lot more testing. All I see is one, and it's game over. Bizarre.

-Mike
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote:
> Actually, you don't need to apply the patch - just do
>
>	echo 5 > /proc/sys/vm/dirty_background_ratio
>	echo 10 > /proc/sys/vm/dirty_ratio

I'll try this, and do some testing with other kernels as well.

-Mike
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 2007-04-27 at 11:31 -0700, Andrew Morton wrote:
[snip]
> ext3's problem here is that a single fsync() requires that ext3 sync the
> whole filesystem. Because
>
> - a journal commit can contain metadata from multiple files, and if we
>   want to journal one file's metadata via fsync(), we unavoidably journal
>   all the other files' metadata at the same time.
>
> - ordered mode requires that we write a file's data blocks prior to
>   journalling the metadata which refers to those blocks.
>
> net result: syncing anything syncs the whole world.
>
> There are a few areas in which this could conceivably be tuned up: if a
> particular file doesn't currently have any metadata in the commit, we don't
> actually need to sync its data blocks: we could just transfer them into the
> next commit. Hard, unlikely to be of benefit.
[snip]

How about mixing the ordered and data journal modes? If the data blocks would fit, have fsync write them into the journal as is done in data=journal mode. Then that file data is committed to disk as fsync requires, but it shouldn't require flushing all the previous metadata to get an ordered guarantee. Or so it seems to me.

--
Zan Lynx <[EMAIL PROTECTED]>
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 27 Apr 2007 08:18:34 -0700 (PDT) Linus Torvalds <[EMAIL PROTECTED]> wrote:

>	echo 5 > /proc/sys/vm/dirty_background_ratio
>	echo 10 > /proc/sys/vm/dirty_ratio

That'll help a lot. ext3's problem here is that a single fsync() requires that ext3 sync the whole filesystem. Because

- a journal commit can contain metadata from multiple files, and if we want to journal one file's metadata via fsync(), we unavoidably journal all the other files' metadata at the same time.

- ordered mode requires that we write a file's data blocks prior to journalling the metadata which refers to those blocks.

net result: syncing anything syncs the whole world.

There are a few areas in which this could conceivably be tuned up: if a particular file doesn't currently have any metadata in the commit, we don't actually need to sync its data blocks: we could just transfer them into the next commit. Hard, unlikely to be of benefit.

Arguably, we could get away without syncing overwritten data blocks. Users would occasionally see older data than they otherwise would have after a crash. Could help a bit in some circumstances.

But none of this explains a 20-minute hang, unless a *lot* of fsyncs are being performed, perhaps.
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Linus Torvalds wrote:
> On Fri, 27 Apr 2007, John Anthony Kazos Jr. wrote:
> > Could[/should] this stuff be changed from ratios to amounts? Or a quick
> > boot-time test to use a ratio if the memory is small and an amount (like
> > tax brackets, I would expect) if it's great?
>
> Yes, the "percentage" thing was likely wrong. That said, there *is* some
> correlation between "lots of memory" and "high-end machine", and that in
> turn tends to correlate with "fast disk", so I don't think the percentage
> approach is really *horribly* wrong.
>
> The main issue with the percentage is that we do export them as such
> through the /proc/ interface, and they are easy to change and understand.
> So changing them to amounts is non-trivial if you also want to support the
> old interfaces - and the advantage isn't obvious enough that it's a
> clear-cut case.

We could add a new "limit" field, though. If it defaulted to 0 (unlimited) the default behavior wouldn't change.
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 27 Apr 2007, John Anthony Kazos Jr. wrote:
>
> Could[/should] this stuff be changed from ratios to amounts? Or a quick
> boot-time test to use a ratio if the memory is small and an amount (like
> tax brackets, I would expect) if it's great?

Yes, the "percentage" thing was likely wrong. That said, there *is* some correlation between "lots of memory" and "high-end machine", and that in turn tends to correlate with "fast disk", so I don't think the percentage approach is really *horribly* wrong.

The main issue with the percentage is that we do export them as such through the /proc/ interface, and they are easy to change and understand. So changing them to amounts is non-trivial if you also want to support the old interfaces - and the advantage isn't obvious enough that it's a clear-cut case.

Linus
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
> One thing to try out (and dammit, I should make it the default now in
> 2.6.21) is to just make the dirty limits much lower. We've been talking
> about this for ages, I think this might be the right time to do it.

Could[/should] this stuff be changed from ratios to amounts? Or a quick boot-time test to use a ratio if the memory is small and an amount (like tax brackets, I would expect) if it's great?
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 27 Apr 2007, Marat Buharov wrote:
>
> On 4/27/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > Aside: why the heck do applications think that their data is so important
> > that they need to fsync it all the time. I used to run a kernel on my
> > laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> > pleasurable.
>
> So, if having fake fsync() and fdatasync() is pleasurable for laptop
> and desktop, may be it's time to add option into Kconfig which
> disables normal fsync behaviour in favor of robust desktop?

This really is an ext3 issue, not "fsync()".

On a good filesystem, when you do "fsync()" on a file, nothing at all happens to any other files. On ext3, it seems to sync the global journal, which means that just about *everything* that writes even a single byte (well, at least anything journalled, which would be all the normal directory ops etc) to disk will just *stop* dead cold!

It's horrid. And it really is ext3, not "fsync()".

I used to run reiserfs, and it had its problems, but this was the "feature" of ext3 that I've disliked most. If you run a MUA with local mail, it will do fsyncs for most things, and things really hiccup if you are doing some other writes at the same time. In contrast, with reiser, if you did a big untar or some other big write, if somebody fsync'ed a small file, it wasn't even a blip on the radar - the fsync would sync just that small thing.

Maybe I'm wrong on the exact details (I'm not really up on the ext3 journal handling ;^), but you don't even have to know about any internals at all: you can just test it. Gaak.

Linus
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 27 Apr 2007, Mike Galbraith wrote:
>
> As subject states, my GUI is going away for extended periods of time
> when my very full and likely highly fragmented (how to find out)
> filesystem is under heavy write load. While write is under way, if
> amarok (mp3 player) is running, no song change will occur until write is
> finished, and the GUI can go _entirely_ comatose for very long periods.
> Usually, it will come back to life after write is finished, but
> occasionally, a complete GUI restart is necessary.

One thing to try out (and dammit, I should make it the default now in 2.6.21) is to just make the dirty limits much lower. We've been talking about this for ages, I think this might be the right time to do it.

Especially with lots of memory, allowing 40% of that memory to be dirty is just insane (even if we limit it to "just" 40% of the normal memory zone). That can be gigabytes. And no amount of IO scheduling will make it pleasant to try to handle the situation where that much memory is dirty.

So I do believe that we could probably do something about the IO scheduling _too_:

 - break up large write requests (yeah, it will make for worse IO throughput, but if we make it configurable, and especially with controllers that don't have insane overheads per command, the difference between 128kB requests and 16MB requests is probably not really even noticeable - SCSI things with large per-command overheads are just stupid). Generating huge requests will automatically mean that they are "unbreakable" from an IO scheduler perspective, so it's bad for latency for other requests once they've started.

 - maybe be more aggressive about prioritizing reads over writes.

but in the meantime, what happens if you apply this patch? Actually, you don't need to apply the patch - just do

	echo 5 > /proc/sys/vm/dirty_background_ratio
	echo 10 > /proc/sys/vm/dirty_ratio

and say if it seems to improve things.
I think those are much saner defaults especially for a desktop system (and probably for most servers too, for that matter). Even 10% of memory dirty can be a whole lot of RAM, but it should hopefully be _better_ than the insane default we have now.

Historical note: allowing about half of memory to contain dirty pages made more sense back in the days when people had 16-64MB of memory, and a single untar of even fairly small projects would otherwise hit the disk. But memory sizes have grown *much* more quickly than disk speeds (and latency requirements have gone down, not up), so a default that may actually have been perfectly fine at some point seems crazy these days..

Linus

---
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f469e3c..a794945 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -67,12 +67,12 @@ static inline long sync_writeback_pages(void)
 /*
  * Start background writeback (via pdflush) at this percentage
  */
-int dirty_background_ratio = 10;
+int dirty_background_ratio = 5;
 
 /*
  * The generator of dirty data starts writeback at this percentage
  */
-int vm_dirty_ratio = 40;
+int vm_dirty_ratio = 10;
 
 /*
  * The interval between `kupdate'-style writebacks, in jiffies
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Peter Zijlstra wrote:
> No way is globally disabling fsync() a good thing. I guess Andrew just is
> a sucker for punishment :-)

Mmm... perhaps another nice thing to include in laptop-mode operation?
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Marat Buharov wrote:
> On 4/27/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > Aside: why the heck do applications think that their data is so important
> > that they need to fsync it all the time. I used to run a kernel on my
> > laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> > pleasurable.
>
> So, if having fake fsync() and fdatasync() is pleasurable for laptop
> and desktop, may be it's time to add option into Kconfig which
> disables normal fsync behaviour in favor of robust desktop?

Sure, a noop fsync/fdatasync would speed up some things. And I am sure Andrew Morton knew what he was doing and the consequences. But unless you care nothing about your data, you should not do it. It is as simple as that. No, it does not give you a robust desktop!!

-Manoj
--
Manoj Joseph http://kerneljunkie.blogspot.com/
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 2007-04-27 at 15:59 +0400, Marat Buharov wrote:
> On 4/27/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > Aside: why the heck do applications think that their data is so important
> > that they need to fsync it all the time. I used to run a kernel on my
> > laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> > pleasurable.
>
> So, if having fake fsync() and fdatasync() is pleasurable for laptop
> and desktop, may be it's time to add option into Kconfig which
> disables normal fsync behaviour in favor of robust desktop?

Nah, just teaching user-space to behave itself should be sufficient; there is just no way kicker can justify doing a fdatasync(). I mean, come on, it's just showing a friggin menu. I have always wondered why that thing was so damn slow, like it needs to fetch stuff like that from all four corners of disk, feh! Just sliding over a sub-menu can take more than a second; I mean, it _really_ is faster to just start things from your favourite shell.

No way is globally disabling fsync() a good thing. I guess Andrew just is a sucker for punishment :-)
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On 4/27/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> Aside: why the heck do applications think that their data is so important
> that they need to fsync it all the time. I used to run a kernel on my
> laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> pleasurable.

So, if having fake fsync() and fdatasync() is pleasurable for laptop and desktop, may be it's time to add option into Kconfig which disables normal fsync behaviour in favor of robust desktop?
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 2007-04-27 at 01:33 -0700, Andrew Morton wrote:
> Another livelock possibility is that bonnie is redirtying pages faster than
> commit can write them out, so commit got livelocked:
>
> When I was doing the original port-from-2.2 I found that an application
> which does
>
>	for ( ; ; )
>		pwrite(fd, "", 1, 0);
>
> would permanently livelock the fs. I fixed that, but it was six years ago,
> and perhaps we later unfixed it.

Well, box doesn't seem the least bit upset after quite a while now, so I guess it didn't get unfixed.

-Mike
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 2007-04-27 at 01:33 -0700, Andrew Morton wrote:
> On Fri, 27 Apr 2007 09:59:27 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:
>
> > Greetings,
> >
> > As subject states, my GUI is going away for extended periods of time
> > when my very full and likely highly fragmented (how to find out)
> > filesystem is under heavy write load. While write is under way, if
> > amarok (mp3 player) is running, no song change will occur until write is
> > finished, and the GUI can go _entirely_ comatose for very long periods.
> > Usually, it will come back to life after write is finished, but
> > occasionally, a complete GUI restart is necessary.
>
> I'd be suspecting a GUI bug if a restart is necessary. Perhaps it went to
> lunch for so long in the kernel that some time-based thing went bad.

Yeah, there have been some KDE updates, maybe something went south. I know for sure that nothing this horrible used to happen during IO. But then when I used to regularly test IO, my disk heads didn't have to traverse nearly as much either.

> Right. One possibility here is that bonnie is stuffing new dirty blocks
> onto the committing transaction's ordered-data list and JBD commit is
> livelocking. Only we're not supposed to be putting those blocks on that
> list.
>
> Another livelock possibility is that bonnie is redirtying pages faster than
> commit can write them out, so commit got livelocked:
>
> When I was doing the original port-from-2.2 I found that an application
> which does
>
>	for ( ; ; )
>		pwrite(fd, "", 1, 0);
>
> would permanently livelock the fs. I fixed that, but it was six years ago,
> and perhaps we later unfixed it.

I'll try that.

> It would be most interesting to try data=writeback.

Seems somewhat better, but nothing close to tolerable. I still had to hot-key to a VT and kill the bonnie.

> hm, fsync.
>
> Aside: why the heck do applications think that their data is so important
> that they need to fsync it all the time.
I used to run a kernel on my > laptop which had "return 0;" at the top of fsync() and fdatasync(). Most > pleasurable. I thought unkind thoughts when I saw those traces :) Thanks, -Mike - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
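The pwrite() loop Andrew quotes is easy to turn into a runnable experiment. Below is a hedged sketch (the filename and iteration count are mine, and the loop is bounded so it terminates; the original ran forever): it keeps rewriting byte 0 of a file, so the file's first block never stops being dirty, which is the condition the livelock depends on.

```python
import os

# Bounded sketch of the redirtying loop quoted above.  Every pwrite()
# lands on the same byte, so the file's first block is perpetually
# dirty; on the unfixed ext3 this was enough to livelock JBD commit.
fd = os.open("livelock-test.dat", os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
for _ in range(100_000):
    os.pwrite(fd, b"\x00", 0)   # one NUL byte at offset 0, like pwrite(fd, "", 1, 0)
os.close(fd)

# The file never grows: 100,000 writes, all to the same offset.
print(os.path.getsize("livelock-test.dat"))  # → 1
```

Run alongside the ordered-mode commit path, the interesting question is whether commit ever finishes while this loop runs, not what the loop itself does.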
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
On Fri, 27 Apr 2007 09:59:27 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:

> Greetings,
>
> As subject states, my GUI is going away for extended periods of time
> when my very full and likely highly fragmented (how to find out)
> filesystem is under heavy write load. While write is under way, if
> amarok (mp3 player) is running, no song change will occur until write is
> finished, and the GUI can go _entirely_ comatose for very long periods.
> Usually, it will come back to life after write is finished, but
> occasionally, a complete GUI restart is necessary.

I'd be suspecting a GUI bug if a restart is necessary. Perhaps it went to
lunch for so long in the kernel that some time-based thing went bad.

> The longest comatose period to date was ~20 minutes with 2.6.20.7 a few
> days ago. I was letting SuSE's software update programs update my SuSE
> 10.2 system, and started a bonnie while it was running (because I had
> been seeing this on recent kernels, and wanted to see if it was in
> stable as well), WHAM, instant dead GUI. When this happens, kbd and
> mouse events work fine, so I hot-keyed to a VT via CTRL+ALT+F1, and
> killed the bonnie. No joy, the GUI stayed utterly comatose until the
> updater finished roughly 20 minutes later, at which time the shells I'd
> tried to start popped up, and all worked as if nothing bad had ever
> happened. During the time in between, no window could be brought into
> focus, nada.
>
> While a bonnie is writing, if I poke KDE's menu button, that will
> instantly trigger nastiness, and a trace (this one was with a cfs
> kernel, but I just did the same with virgin 2.6.21) shows that "kicker",
> KDE's launcher proggy, does an fdatasync for some reason, and that's the
> end of its world for ages. When clicking on amarok's icon, it does an
> fsync, and that's the last thing that will happen in its world until
> the write is done as well. I've repeated this with the CFQ and AS IO
> schedulers.

Well that all sucks.

> I have a couple of old kernels lying around that I can test with, but I
> think it's going to be the same. Seems to be ext3's journal that is
> causing my woes. Below this trace of kicker is one of amarok during
> its dead-to-the-world time.
>
> Box is a 3GHz P4, Intel ICH5; SMP/UP doesn't matter. The .config of the
> latest kernel tested is attached. Mount options are
> noatime,nodiratime,acl,user_xattr.
>
> [ 308.046646] kicker  D 0044  0  5897  1 (NOTLB)
> [ 308.052611]  f32abe4c 00200082 83398b5a 0044 c01c251e f32ab000 f32ab000 c01169b6
> [ 308.060926]  f772fbcc cdc7e694 0039 8339857a 0044 83398b5a 0044
> [ 308.069422]  c1b5ab00 c1b5aac0 00353928 f32abe80 c01c7ab8 f32ab000 c1b5ab10
> [ 308.077927] Call Trace:
> [ 308.080568]  [] log_wait_commit+0x9d/0x11f
> [ 308.085549]  [] journal_stop+0x1a1/0x22a
> [ 308.090364]  [] journal_force_commit+0x1d/0x20
> [ 308.095699]  [] ext3_force_commit+0x24/0x26
> [ 308.100774]  [] ext3_write_inode+0x2d/0x3b
> [ 308.105771]  [] __writeback_single_inode+0x2df/0x3a9
> [ 308.111633]  [] sync_inode+0x15/0x38
> [ 308.116093]  [] ext3_sync_file+0xbd/0xc8
> [ 308.120900]  [] do_fsync+0x58/0x8b
> [ 308.125188]  [] __do_fsync+0x20/0x2f
> [ 308.129656]  [] sys_fdatasync+0x10/0x12
> [ 308.134384]  [] sysenter_past_esp+0x5d/0x81
> [ 308.139441]  ===

Right. One possibility here is that bonnie is stuffing new dirty blocks
onto the committing transaction's ordered-data list and JBD commit is
livelocking. Only we're not supposed to be putting those blocks on that
list.

Another livelock possibility is that bonnie is redirtying pages faster
than commit can write them out, so commit got livelocked:

When I was doing the original port-from-2.2 I found that an application
which does

    for ( ; ; )
        pwrite(fd, "", 1, 0);

would permanently livelock the fs. I fixed that, but it was six years ago,
and perhaps we later unfixed it.

It would be most interesting to try data=writeback.

> [ 311.755953] bonnie  D 0046  0  6146  5929 (NOTLB)
> [ 311.761929]  e7622a60 00200082 04d7e5fe 0046 03332bd5 e7622000 c02c0c54
> [ 311.770244]  d8eaabcc e7622a64 f7d0c3ec 04d7e521 0046 04d7e5fe 0046
> [ 311.778758]  e7622aa4 f4f94400 e7622aac e7622a68 c04a2f06 e7622a70 c018b105 e7622a8c
> [ 311.787261] Call Trace:
> [ 311.789904]  [] io_schedule+0xe/0x16
> [ 311.794373]  [] sync_buffer+0x2e/0x32
> [ 311.798927]  [] __wait_on_bit_lock+0x3f/0x62
> [ 311.804089]  [] out_of_line_wait_on_bit_lock+0x5f/0x67
> [ 311.810115]  [] __lock_buffer+0x2b/0x31
> [ 311.814846]  [] sync_dirty_buffer+0x88/0xc3
> [ 311.819921]  [] journal_dirty_data+0x1dd/0x205
> [ 311.825256]  [] ext3_journal_dirty_data+0x12/0x37
> [ 311.830858]  [] journal_dirty_data_fn+0x15/0x1c
> [ 311.836280]  [] walk_page_buffers+0x36/0x68
> [ 311.841347]  [] ext3_ordered_writepage+0x11a/0x
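For anyone wanting to try the data=writeback suggestion: a hedged example of mounting an ext3 filesystem in writeback mode (the device and mount point are illustrative, not from this thread). Note that writeback mode drops ordered mode's data-before-metadata guarantee, so stale data can appear in recently written files after a crash.

```shell
# Mount an ext3 filesystem in writeback journaling mode
# (illustrative device and mount point; adjust for your system):
mount -t ext3 -o data=writeback /dev/hda3 /mnt/test

# Or persistently via /etc/fstab (illustrative line):
#   /dev/hda3  /home  ext3  noatime,nodiratime,data=writeback  1 2
```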
[ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Greetings,

As subject states, my GUI is going away for extended periods of time
when my very full and likely highly fragmented (how to find out)
filesystem is under heavy write load. While write is under way, if
amarok (mp3 player) is running, no song change will occur until write is
finished, and the GUI can go _entirely_ comatose for very long periods.
Usually, it will come back to life after write is finished, but
occasionally, a complete GUI restart is necessary.

The longest comatose period to date was ~20 minutes with 2.6.20.7 a few
days ago. I was letting SuSE's software update programs update my SuSE
10.2 system, and started a bonnie while it was running (because I had
been seeing this on recent kernels, and wanted to see if it was in
stable as well), WHAM, instant dead GUI. When this happens, kbd and
mouse events work fine, so I hot-keyed to a VT via CTRL+ALT+F1, and
killed the bonnie. No joy, the GUI stayed utterly comatose until the
updater finished roughly 20 minutes later, at which time the shells I'd
tried to start popped up, and all worked as if nothing bad had ever
happened. During the time in between, no window could be brought into
focus, nada.

While a bonnie is writing, if I poke KDE's menu button, that will
instantly trigger nastiness, and a trace (this one was with a cfs
kernel, but I just did the same with virgin 2.6.21) shows that "kicker",
KDE's launcher proggy, does an fdatasync for some reason, and that's the
end of its world for ages. When clicking on amarok's icon, it does an
fsync, and that's the last thing that will happen in its world until
the write is done as well. I've repeated this with the CFQ and AS IO
schedulers.

I have a couple of old kernels lying around that I can test with, but I
think it's going to be the same. Seems to be ext3's journal that is
causing my woes. Below this trace of kicker is one of amarok during
its dead-to-the-world time.

Box is a 3GHz P4, Intel ICH5; SMP/UP doesn't matter. The .config of the
latest kernel tested is attached. Mount options are
noatime,nodiratime,acl,user_xattr.

[ 308.046646] kicker  D 0044  0  5897  1 (NOTLB)
[ 308.052611]  f32abe4c 00200082 83398b5a 0044 c01c251e f32ab000 f32ab000 c01169b6
[ 308.060926]  f772fbcc cdc7e694 0039 8339857a 0044 83398b5a 0044
[ 308.069422]  c1b5ab00 c1b5aac0 00353928 f32abe80 c01c7ab8 f32ab000 c1b5ab10
[ 308.077927] Call Trace:
[ 308.080568]  [] log_wait_commit+0x9d/0x11f
[ 308.085549]  [] journal_stop+0x1a1/0x22a
[ 308.090364]  [] journal_force_commit+0x1d/0x20
[ 308.095699]  [] ext3_force_commit+0x24/0x26
[ 308.100774]  [] ext3_write_inode+0x2d/0x3b
[ 308.105771]  [] __writeback_single_inode+0x2df/0x3a9
[ 308.111633]  [] sync_inode+0x15/0x38
[ 308.116093]  [] ext3_sync_file+0xbd/0xc8
[ 308.120900]  [] do_fsync+0x58/0x8b
[ 308.125188]  [] __do_fsync+0x20/0x2f
[ 308.129656]  [] sys_fdatasync+0x10/0x12
[ 308.134384]  [] sysenter_past_esp+0x5d/0x81
[ 308.139441]  ===
[ 311.755953] bonnie  D 0046  0  6146  5929 (NOTLB)
[ 311.761929]  e7622a60 00200082 04d7e5fe 0046 03332bd5 e7622000 c02c0c54
[ 311.770244]  d8eaabcc e7622a64 f7d0c3ec 04d7e521 0046 04d7e5fe 0046
[ 311.778758]  e7622aa4 f4f94400 e7622aac e7622a68 c04a2f06 e7622a70 c018b105 e7622a8c
[ 311.787261] Call Trace:
[ 311.789904]  [] io_schedule+0xe/0x16
[ 311.794373]  [] sync_buffer+0x2e/0x32
[ 311.798927]  [] __wait_on_bit_lock+0x3f/0x62
[ 311.804089]  [] out_of_line_wait_on_bit_lock+0x5f/0x67
[ 311.810115]  [] __lock_buffer+0x2b/0x31
[ 311.814846]  [] sync_dirty_buffer+0x88/0xc3
[ 311.819921]  [] journal_dirty_data+0x1dd/0x205
[ 311.825256]  [] ext3_journal_dirty_data+0x12/0x37
[ 311.830858]  [] journal_dirty_data_fn+0x15/0x1c
[ 311.836280]  [] walk_page_buffers+0x36/0x68
[ 311.841347]  [] ext3_ordered_writepage+0x11a/0x191
[ 311.847027]  [] generic_writepages+0x1f3/0x305
[ 311.852344]  [] do_writepages+0x37/0x39
[ 311.857064]  [] __writeback_single_inode+0x96/0x3a9
[ 311.862842]  [] sync_sb_inodes+0x1bc/0x27f
[ 311.867830]  [] writeback_inodes+0x98/0xe1
[ 311.872819]  [] balance_dirty_pages_ratelimited_nr+0xc4/0x1bf
[ 311.879461]  [] generic_file_buffered_write+0x32e/0x677
[ 311.885576]  [] __generic_file_aio_write_nolock+0x2e2/0x57f
[ 311.892044]  [] generic_file_aio_write+0x60/0xd4
[ 311.897553]  [] ext3_file_write+0x27/0xa5
[ 311.902455]  [] do_sync_write+0xcd/0x103
[ 311.907270]  [] vfs_write+0xa8/0x128
[ 311.911738]  [] sys_write+0x3d/0x64
[ 311.916111]  [] sysenter_past_esp+0x5d/0x81
[ 311.921185]  ===
[ 311.924763] pdflush  D 0046  0  6147  5 (L-TLB)
[ 311.930739]  ec7e2ef0 0046 03f2b0ea 0046 ec7e2f0c c0186b45 ec7e2000 c01169b6
[ 311.939052]  ea14069c ec7e2f00 ec7e2f00 03f2afc9 0046 03f2b0ea 0046 0282
[ 311.947557]  ec7e2f00 ab4c ec7e2f30 ec7e2f20 c04a3689
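The kicker trace above ends in sys_fdatasync → log_wait_commit: the GUI process is waiting for someone else's journal commit, not its own data. A quick userspace way to observe that kind of stall is to time a tiny fdatasync() while a heavy writer runs. This is a hedged sketch (the filename is mine), not something from the thread:

```python
import os
import time

# Write one small block, then time how long fdatasync() takes.  On
# ext3 data=ordered the sync can end up waiting for the whole
# committing transaction (everyone's dirty data), not just this file.
fd = os.open("sync-probe.dat", os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
os.write(fd, b"x" * 4096)
t0 = time.monotonic()
os.fdatasync(fd)
elapsed = time.monotonic() - t0
os.close(fd)
print(f"fdatasync of 4 KiB took {elapsed:.3f}s")
```

Run on an idle filesystem this returns in milliseconds; run next to a bonnie-style streaming writer on the same ext3 mount, the reported time should balloon to however long the journal commit takes.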