Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Andrew Morton
On Fri, 04 May 2007 11:39:22 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > I'm still not understanding.  The terms you're using are a bit ambiguous.
> > 
> > What does "find some dirty unallocated blocks" mean?  Find a page which is
> > dirty and which does not have a disk mapping?
> > 
> > Normally the above operation would be implemented via
> > ext4_writeback_writepage(), and it runs under lock_page().
> 
> I'm mostly worried about the delayed allocation case. My impression was that
> holding a number of pages locked isn't a good idea, even if they're locked
> in index order. so, I was going to mark a number of pages writeback, then
> allocate blocks for all of them at once, then put the proper block numbers
> into the bh's (or PG_mappedtodisk?).

ooh, that sounds hacky and quite worrisome.  If someone comes in and does
an fsync() we've lost our synchronisation point.  Yes, all callers happen
to do

lock_page();
wait_on_page_writeback();

(I think) but we've never considered a bare PageWriteback() as something
which protects page internals.  We're OK wrt page reclaim and we're OK wrt
truncate and invalidate.  As long as the page is uptodate we _should_ be OK
wrt readpage().  But still, it'd be better to use the standard locking
rather than inventing new rules, if poss.


I'd be 100% OK with locking multiple pages in ascending pgoff_t order. 
Locking the page is the standard way of doing this synchronisation and the
only problem I can think of is that having a tremendous number of pages
locked could cause the wake_up_page() waitqueue hashes to get overloaded
and go slow.  But it's also possible to lock many, many pages with
readahead and nobody has reported problems in there.
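As a hedged sketch (not code from this thread): batch-locking pages in
ascending pgoff_t order could look like the following, where
find_lock_page() and wait_on_page_writeback() are the real pagecache
primitives and the helper itself is invented for illustration.

#include <linux/pagemap.h>

/*
 * Lock a run of pages in ascending pgoff_t order --- the standard
 * deadlock-avoidance rule.  Error handling is illustrative only.
 */
static int lock_page_range(struct address_space *mapping,
			   pgoff_t start, unsigned int nr,
			   struct page **pages)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		/* find_lock_page() returns the page already locked */
		pages[i] = find_lock_page(mapping, start + i);
		if (!pages[i])
			goto undo;
		/* honour the usual lock_page() + wait convention */
		wait_on_page_writeback(pages[i]);
	}
	return 0;
undo:
	while (i--) {
		unlock_page(pages[i]);
		page_cache_release(pages[i]);
	}
	return -ENOENT;
}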


> > 
> > 
> >> going to commit
> >> find inode I dirty
> >> do NOT find these blocks because they're allocated only,
> >>   but pages/bhs aren't mapped to them
> >> start commit
> > 
> > I think you're assuming here that commit would be using ->t_sync_datalist
> > to locate dirty buffer_heads.
> 
> nope, I mean sb->inode->page walk.
> 
> > But under this proposal, t_sync_datalist just gets removed: the new
> > ordered-data mode _only_ needs to do the sb->inode->page walk.  So if I'm
> > understanding you, the way in which we'd handle any such race is to make
> > kjournald's writeback of the dirty pages block in lock_page().  Once it
> > gets the page lock it can look to see if some other thread has mapped the
> > page to disk.
> 
> if I'm right that holding a number of pages locked is a bad idea, then they
> won't be locked, but under writeback. of course kjournald can block on
> writeback as well, but how does it find pages with *newly allocated* blocks
> only?

I don't think we'd want kjournald to do that.  Even if a page was dirtied
by an overwrite, we'd want to write it back during commit, just from a
quality-of-implementation point of view.  If we were to leave these pages
unwritten during commit then a post-recovery file could have a mix of
up-to-five-second-old data and up-to-30-seconds-old data.

> > It may turn out that kjournald needs a private way of getting at the
> > I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we
> > had the radix-tree-of-dirty-inodes thing then that's easy enough to do
> > anyway, with a tagged search.  But I expect that a single pass through the
> > superblock's dirty inodes would suffice for ordered-data.  Files which
> > have chattr +j would screw things up, as usual.
> 
> not dirty inodes only, but rather some fast way to find pages with newly
> allocated pages.

Newly allocated blocks, you mean?

Just write out the overwritten blocks as well as the new ones, I reckon. 
It's what we do now.



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Alex Tomas

Andrew Morton wrote:

I'm still not understanding.  The terms you're using are a bit ambiguous.

What does "find some dirty unallocated blocks" mean?  Find a page which is
dirty and which does not have a disk mapping?

Normally the above operation would be implemented via
ext4_writeback_writepage(), and it runs under lock_page().


I'm mostly worried about the delayed allocation case. My impression was that
holding a number of pages locked isn't a good idea, even if they're locked
in index order. so, I was going to mark a number of pages writeback, then
allocate blocks for all of them at once, then put the proper block numbers
into the bh's (or PG_mappedtodisk?).





going to commit
find inode I dirty
do NOT find these blocks because they're allocated only,
  but pages/bhs aren't mapped to them
start commit


I think you're assuming here that commit would be using ->t_sync_datalist
to locate dirty buffer_heads.


nope, I mean sb->inode->page walk.


But under this proposal, t_sync_datalist just gets removed: the new
ordered-data mode _only_ needs to do the sb->inode->page walk.  So if I'm
understanding you, the way in which we'd handle any such race is to make
kjournald's writeback of the dirty pages block in lock_page().  Once it
gets the page lock it can look to see if some other thread has mapped the
page to disk.


if I'm right that holding a number of pages locked is a bad idea, then they
won't be locked, but under writeback. of course kjournald can block on
writeback as well, but how does it find pages with *newly allocated* blocks
only?


It may turn out that kjournald needs a private way of getting at the
I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we
had the radix-tree-of-dirty-inodes thing then that's easy enough to do
anyway, with a tagged search.  But I expect that a single pass through the
superblock's dirty inodes would suffice for ordered-data.  Files which
have chattr +j would screw things up, as usual.


not dirty inodes only, but rather some fast way to find pages with newly
allocated pages.


I assume (hope) that your delayed allocation code implements
->writepages()?  Doing the allocation one-page-at-a-time sounds painful...


indeed, this is the root cause of all this complexity.
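As a rough sketch of the ->writepages() shape being hoped for here (the
foo_* helpers are hypothetical; pagevec_lookup_tag() is the real
primitive for walking a mapping's dirty pages):

static int foo_writepages(struct address_space *mapping,
			  struct writeback_control *wbc)
{
	struct pagevec pvec;
	pgoff_t index = 0;
	unsigned nr;

	pagevec_init(&pvec, 0);
	/* gather runs of dirty pages via the pagecache dirty tag */
	while ((nr = pagevec_lookup_tag(&pvec, mapping, &index,
					PAGECACHE_TAG_DIRTY,
					PAGEVEC_SIZE))) {
		/*
		 * one block allocation for the whole run instead of one
		 * per page --- this is what avoids the one-page-at-a-time
		 * pain; both helpers below are invented names
		 */
		foo_alloc_and_map_run(mapping->host, pvec.pages, nr);
		foo_submit_run(pvec.pages, nr);
		pagevec_release(&pvec);
	}
	return 0;
}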

thanks, Alex




Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Andrew Morton
On Fri, 04 May 2007 10:57:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:
> > 
> >> Andrew Morton wrote:
> >>> Yes, there can be issues with needing to allocate journal space within the
> >>> context of a commit.  But
> >> no-no, this isn't required. we only need to mark pages/blocks within
> >> a transaction, otherwise a race is possible: we allocate blocks in a
> >> transaction, then the transaction starts to commit, then we mark
> >> pages/blocks to be flushed before commit.
> > 
> > I don't understand.  Can you please describe the race in more detail?
> 
> if I understood your idea right, then in data=ordered mode the commit
> thread writes all dirty mapped blocks before the real commit.
> 
> say we have two threads: t1 is a thread doing flushing and t2 is the
> commit thread
> 
> t1                                    t2
> find dirty inode I
> find some dirty unallocated blocks
> journal_start()
> allocate blocks
> attach them to I
> journal_stop()

I'm still not understanding.  The terms you're using are a bit ambiguous.

What does "find some dirty unallocated blocks" mean?  Find a page which is
dirty and which does not have a disk mapping?

Normally the above operation would be implemented via
ext4_writeback_writepage(), and it runs under lock_page().
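For reference, a hedged sketch of that shape (foo_* names are invented;
only the locking rule is the point --- the check-and-create of the disk
mapping is serialised by the page lock the caller already holds):

static int foo_writeback_writepage(struct page *page,
				   struct writeback_control *wbc)
{
	struct inode *inode = page->mapping->host;

	BUG_ON(!PageLocked(page));	/* ->writepage runs under lock_page() */

	if (!page_has_buffers(page) ||
	    !buffer_mapped(page_buffers(page)))
		foo_map_page_blocks(inode, page);	/* hypothetical helper */

	return block_write_full_page(page, foo_get_block, wbc);
}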


>                                       going to commit
>                                       find inode I dirty
>                                       do NOT find these blocks because
>                                         they're allocated only, but
>                                         pages/bhs aren't mapped to them
>                                       start commit

I think you're assuming here that commit would be using ->t_sync_datalist
to locate dirty buffer_heads.

But under this proposal, t_sync_datalist just gets removed: the new
ordered-data mode _only_ needs to do the sb->inode->page walk.  So if I'm
understanding you, the way in which we'd handle any such race is to make
kjournald's writeback of the dirty pages block in lock_page().  Once it
gets the page lock it can look to see if some other thread has mapped the
page to disk.



It may turn out that kjournald needs a private way of getting at the
I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we
had the radix-tree-of-dirty-inodes thing then that's easy enough to do
anyway, with a tagged search.  But I expect that a single pass through the
superblock's dirty inodes would suffice for ordered-data.  Files which
have chattr +j would screw things up, as usual.
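A hedged sketch of that single pass (list handling and locking are
simplified; a real version would need inode_lock, reference counting and
so on):

static void ordered_commit_writeout(struct super_block *sb)
{
	struct inode *inode;

	/* one pass over the superblock's dirty inodes at commit time */
	list_for_each_entry(inode, &sb->s_dirty, i_list) {
		struct writeback_control wbc = {
			.sync_mode	= WB_SYNC_ALL,
			.nr_to_write	= LONG_MAX,
		};

		if (!(inode->i_state & I_DIRTY_PAGES))
			continue;
		/* blocks in lock_page() against any racing allocator */
		do_writepages(inode->i_mapping, &wbc);
	}
}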

I assume (hope) that your delayed allocation code implements
->writepages()?  Doing the allocation one-page-at-a-time sounds painful...

> 
> map pages/bhs to just-allocated blocks
> 
> 
> so, either we mark pages/bhs some way within journal_start()--journal_stop(),
> or the commit thread should do a lookup for all dirty pages. the latter
> doesn't sound nice, IMHO.
> 
> 

I don't think I'm understanding you fully yet.


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Alex Tomas

Andrew Morton wrote:

On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:


Andrew Morton wrote:

Yes, there can be issues with needing to allocate journal space within the
context of a commit.  But

no-no, this isn't required. we only need to mark pages/blocks within a
transaction, otherwise a race is possible: we allocate blocks in a transaction,
then the transaction starts to commit, then we mark pages/blocks to be flushed
before commit.


I don't understand.  Can you please describe the race in more detail?


if I understood your idea right, then in data=ordered mode the commit thread
writes all dirty mapped blocks before the real commit.

say we have two threads: t1 is a thread doing flushing and t2 is the commit
thread

t1                                      t2
find dirty inode I
find some dirty unallocated blocks
journal_start()
allocate blocks
attach them to I
journal_stop()
                                        going to commit
                                        find inode I dirty
                                        do NOT find these blocks because
                                          they're allocated only, but
                                          pages/bhs aren't mapped to them
                                        start commit

map pages/bhs to just-allocated blocks


so, either we mark pages/bhs some way within journal_start()--journal_stop(),
or the commit thread should do a lookup for all dirty pages. the latter
doesn't sound nice, IMHO.

thanks, Alex





Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Andrew Morton
On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > Yes, there can be issues with needing to allocate journal space within the
> > context of a commit.  But
> 
> no-no, this isn't required. we only need to mark pages/blocks within a
> transaction, otherwise a race is possible: we allocate blocks in a
> transaction, then the transaction starts to commit, then we mark
> pages/blocks to be flushed before commit.

I don't understand.  Can you please describe the race in more detail?


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Alex Tomas

Andrew Morton wrote:

Yes, there can be issues with needing to allocate journal space within the
context of a commit.  But


no-no, this isn't required. we only need to mark pages/blocks within a
transaction, otherwise a race is possible: we allocate blocks in a transaction,
then the transaction starts to commit, then we mark pages/blocks to be flushed
before commit.


a) If the page has newly allocated space on disk then the metadata which
   refers to that page is already in the journal: no new journal space
   needed.

b) If the page doesn't have space allocated on disk then we don't need
   to write it out at ordered-mode commit time, because the post-recovery
   filesystem will not have any references to that page.

c) If the page is dirty due to overwrite then no metadata update was required.

IOW, under what circumstances would an ordered-mode commit need to allocate
space for a delayed-allocate page?


no need to allocate space within the commit thread, I think; only to take care
of the race I described above. in a hackish version of data=ordered for delayed
allocation I used a counter of submitted bio's with newly-allocated blocks, and
the commit thread waits for the counter to reach 0.
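Such a counter scheme might look roughly like this (a sketch with
invented names, using the bio completion signature of that era):

static atomic_t newalloc_in_flight = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(newalloc_wait);

static int newalloc_end_io(struct bio *bio, unsigned int bytes_done, int err)
{
	if (bio->bi_size)
		return 1;	/* not fully complete yet */
	if (atomic_dec_and_test(&newalloc_in_flight))
		wake_up(&newalloc_wait);
	bio_put(bio);
	return 0;
}

/* submission path: count bios that carry newly-allocated blocks */
static void submit_newalloc_bio(struct bio *bio)
{
	atomic_inc(&newalloc_in_flight);
	bio->bi_end_io = newalloc_end_io;
	submit_bio(WRITE, bio);
}

/* commit thread: drain the counter before writing the commit record */
static void wait_for_newalloc_io(void)
{
	wait_event(newalloc_wait, atomic_read(&newalloc_in_flight) == 0);
}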



However b) might lead to the hey-my-file-is-full-of-zeroes problem.



thanks, Alex



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-03 Thread Andrew Morton
On Thu, 03 May 2007 21:38:10 +0400
Alex Tomas <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > We can make great improvements here, and I've (twice) previously described
> > how: hoist the entire ordered-mode data handling out of ext3, and out of
> > the buffer_head layer and move it up into the VFS pagecache layer. 
> > Basically, do ordered-data with a commit-time inode walk, calling
> > do_sync_mapping_range().
> > 
> > Do it in the VFS.  Make reiserfs use it, remove reiserfs ordered-mode too. 
> > Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.
> 
> I'm not sure it's that easy.
> 
> if we move to pages, then we have to mark pages to be flushed while holding
> the transaction open. now take delayed allocation into account: we need
> to allocate a number of blocks at once and then mark all pages mapped,
> again within the context of the same transaction.

Yes, there can be issues with needing to allocate journal space within the
context of a commit.  But

a) If the page has newly allocated space on disk then the metadata which
   refers to that page is already in the journal: no new journal space
   needed.

b) If the page doesn't have space allocated on disk then we don't need
   to write it out at ordered-mode commit time, because the post-recovery
   filesystem will not have any references to that page.

c) If the page is dirty due to overwrite then no metadata update was required.

IOW, under what circumstances would an ordered-mode commit need to allocate
space for a delayed-allocate page?

However b) might lead to the hey-my-file-is-full-of-zeroes problem.

> so, an implementation
> would look like the following?
> 
> generic_writepages() {
>         /* collect set of contig. dirty pages */
>         foo_get_blocks() {
>                 foo_journal_start();
>                 foo_new_blocks();
>                 foo_attach_blocks_to_inode();
>                 generic_mark_pages_mapped();
>                 foo_journal_stop();
>         }
> }
> 
> another question is: will it scale well, given that the number of dirty
> inodes can be much larger than the number of inodes with dirty mapped
> blocks (in the delayed allocation case, for example)?

Possibly - zillions of dirty-for-atime inodes might get in the way.  A
short-term fix would be to create a separate dirty-inode list on the
superblock (ug).  A long-term fix is to rip all the per-superblock
dirty-inode lists and use a radix-tree.  Not for lookup purposes, but for
the tree's ability to do tagged and restartable searches.
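A hedged sketch of such a walk (the tree, its tag and the locking are
hypothetical; radix_tree_gang_lookup_tag() is the real primitive that
makes it tagged and restartable):

#define DIRTY_TAG 0	/* hypothetical tag for dirty inodes */

static void writeback_dirty_inodes(struct radix_tree_root *tree)
{
	struct inode *batch[16];
	unsigned long first = 0;
	unsigned int i, nr;

	do {
		/* tagged lookup: only inodes marked dirty are returned */
		nr = radix_tree_gang_lookup_tag(tree, (void **)batch,
						first, 16, DIRTY_TAG);
		for (i = 0; i < nr; i++) {
			/* restartable: resume after the last inode seen */
			first = batch[i]->i_ino + 1;
			write_inode_now(batch[i], 0);
		}
	} while (nr);
}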


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-03 Thread Alex Tomas

Andrew Morton wrote:

We can make great improvements here, and I've (twice) previously described
how: hoist the entire ordered-mode data handling out of ext3, and out of
the buffer_head layer and move it up into the VFS pagecache layer. 
Basically, do ordered-data with a commit-time inode walk, calling
do_sync_mapping_range().

Do it in the VFS.  Make reiserfs use it, remove reiserfs ordered-mode too. 
Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.


I'm not sure it's that easy.

if we move to pages, then we have to mark pages to be flushed while holding
the transaction open. now take delayed allocation into account: we need
to allocate a number of blocks at once and then mark all pages mapped,
again within the context of the same transaction. so, an implementation
would look like the following?

generic_writepages() {
        /* collect set of contig. dirty pages */
        foo_get_blocks() {
                foo_journal_start();
                foo_new_blocks();
                foo_attach_blocks_to_inode();
                generic_mark_pages_mapped();
                foo_journal_stop();
        }
}

another question is: will it scale well, given that the number of dirty
inodes can be much larger than the number of inodes with dirty mapped
blocks (in the delayed allocation case, for example)?

thanks, Alex






Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-02 Thread Mike Galbraith
On Wed, 2007-05-02 at 08:53 +0200, Jens Axboe wrote:
> On Fri, Apr 27 2007, Linus Torvalds wrote:
> > So I do believe that we could probably do something about the IO 
> > scheduling _too_:
> > 
> >  - break up large write requests (yeah, it will make for worse IO
> >    throughput, but if we make it configurable, and especially with
> >    controllers that don't have insane overheads per command, the
> >    difference between 128kB requests and 16MB requests is probably not
> >    really even noticeable - SCSI things with large per-command overheads
> >    are just stupid)
> > 
> >    Generating huge requests will automatically mean that they are
> >    "unbreakable" from an IO scheduler perspective, so it's bad for latency
> >    for other requests once they've started.
> 
> Overlooked this one initially... We actually don't generate huge
> requests, exactly because of that. Even if the device can do large
> requests (most SATA disks today can do 32meg), we default to 512kB as
> the largest one that we will build due to file system requests. It's
> trivial to reduce that limit, see /sys/block/<dev>/queue/max_sectors_kb.
> That controls the maximum per-request size.

For the record, I haven't been able to stall KDE for ages with
data=writeback.

-Mike



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-02 Thread Jens Axboe
On Fri, Apr 27 2007, Linus Torvalds wrote:
> So I do believe that we could probably do something about the IO 
> scheduling _too_:
> 
>  - break up large write requests (yeah, it will make for worse IO
>    throughput, but if we make it configurable, and especially with
>    controllers that don't have insane overheads per command, the
>    difference between 128kB requests and 16MB requests is probably not
>    really even noticeable - SCSI things with large per-command overheads
>    are just stupid)
> 
>    Generating huge requests will automatically mean that they are
>    "unbreakable" from an IO scheduler perspective, so it's bad for latency
>    for other requests once they've started.

Overlooked this one initially... We actually don't generate huge
requests, exactly because of that. Even if the device can do large
requests (most SATA disks today can do 32meg), we default to 512kB as
the largest one that we will build due to file system requests. It's
trivial to reduce that limit, see /sys/block/<dev>/queue/max_sectors_kb.
That controls the maximum per-request size.
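For instance, a small userspace program along these lines can inspect
and lower that limit (sda is just an example device; writing needs
root):

#include <stdio.h>

int main(void)
{
	const char *path = "/sys/block/sda/queue/max_sectors_kb";
	FILE *f = fopen(path, "r+");
	int kb;

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%d", &kb) == 1)
		printf("current max request size: %d kB\n", kb);

	/* cap file-system requests at 128kB */
	rewind(f);
	fprintf(f, "128\n");
	fclose(f);
	return 0;
}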

-- 
Jens Axboe



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-30 Thread Jens Axboe
On Sat, Apr 28 2007, Linus Torvalds wrote:
> > The main problem is that if the user extracts a tar archive, tar eventually
> > blocks on writeback I/O --- O.K. But if bash attempts to write one page to
> > the .bash_history file at the same time, it blocks too --- bad, the user is
> > annoyed.
> 
> Right, but it's actually very unlikely. Think about it: the person who 
> extracts the tar-archive is perhaps dirtying a thousand pages, while the 
> .bash_history writeback is doing a single one. Which process do you think 
> is going to hit the "oops, we went over the limit" case 99.9% of the time?
> 
> The _really_ annoying problem is when you just have absolutely tons of 
> memory dirty, and you start doing the writeback: if you saturate the IO 
> queues totally, it simply doesn't matter _who_ starts the writeback, 
> because anybody who needs to do any IO at all (not necessarily writing) is 
> going to be blocked.
> 
> This is why having gigabytes of dirty data (or even "just" hundreds of 
> megs) can be so annoying.
> 
> Even with a good software IO scheduler, when you have disks that do tagged 
> queueing, if you fill up the disk queue with a few dozen (depends on the 
> disk what the queue limit is) huge write requests, it doesn't really 
> matter if the _software_ queuing then gives a big advantage to reads 
> coming in. They'll _still_ be waiting for a long time, especially since 
> you don't know what the disk firmware is going to do.
> 
> It's possible that we could do things like refusing to use all tag entries 
> on the disk for writing. That would probably help latency a _lot_. Right 
> now, if we do writeback, and fill up all the slots on the disk, we cannot 
> even feed the disk the read request immediately - we'll have to wait for 
> some of the writes to finish before we can even queue the read to the 
> disk.
> 
> (Of course, if disks don't support tagged queueing, you'll never have this 
> problem at all, but most disks do these days, and I strongly suspect it 
> really can aggravate latency numbers a lot).
> 
> Jens? Comments? Or do you do that already?

Yes, CFQ tries to handle that quite aggressively already. With the
emergence of NCQ on SATA, it has become a much bigger problem since it's
seen so easily on the desktop. The SCSI people usually don't care about
latency that much, so not many complaints there.

The recently posted patch series for CFQ that I will submit soon for
2.6.22 has more fixes/tweaks for this.
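The tag-reservation idea Linus floats above might be sketched like this
(purely illustrative --- this is not CFQ's code, and struct queue_state
is invented):

struct queue_state {
	unsigned int queue_depth;	/* tags the device exposes */
	unsigned int writes_in_flight;	/* writes currently holding tags */
};

/* keep a couple of tags free so an incoming read never waits for a slot */
#define WRITE_TAG_LIMIT(depth)	((depth) - 2)

static int may_dispatch_write(const struct queue_state *qs)
{
	return qs->writes_in_flight < WRITE_TAG_LIMIT(qs->queue_depth);
}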


-- 
Jens Axboe



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-30 Thread Jens Axboe
On Sat, Apr 28 2007, Mikulas Patocka wrote:
> >So perhaps if there's any privileged reads going on then we should limit
> >writes to a depth of 2 at most, with some timeout mechanism that would
> 
> SCSI has a "high priority" bit in the command block, so you can just set 
> it --- but I am not sure how well disks support it.

I'd be surprised if it was useful.

-- 
Jens Axboe



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-29 Thread Mikulas Patocka

On Sat, 28 Apr 2007, Lee Revell wrote:


On 4/28/07, Mikulas Patocka <[EMAIL PROTECTED]> wrote:

What I wonder most is why vim fsyncs its swapfile regularly (blocking typing
during that) yet doesn't fsync the resulting file on :w  :-/


Never seen this.  Why would fsync block typing unless vim was doing
disk IO for every keystroke?

Lee


Not for every keystroke, but after some time it calls fsync(). During 
execution of that call, the keyboard is blocked. It is not normally a 
problem (fsync executes very fast), but it starts to be a problem on an 
extremely overloaded system.


Mikulas


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-29 Thread Mark Lord

Lee Revell wrote:

On 4/28/07, Mikulas Patocka <[EMAIL PROTECTED]> wrote:

What I wonder most is why vim fsyncs its swapfile regularly (blocking typing
during that) yet doesn't fsync the resulting file on :w  :-/


Never seen this.  Why would fsync block typing unless vim was doing
disk IO for every keystroke?


It does do that, for the crash-recovery files it maintains.


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mikulas Patocka

On Sat, Apr 28, 2007 at 07:37:17AM +0200, Mikulas Patocka wrote:

SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
phase tree filesystems (TUX2); it writes inside the normal structures in
use, but it marks each structure with generation tags --- when it updates
the global table of tags, it atomically makes several structures valid. I
don't know about this idea being used elsewhere.


So how is this generation structure organized? Is there a paper?


The paper is in the CITSA 2006 proceedings (but you likely don't have them, 
and I signed some statement that I can't post it elsewhere :-( )


Basically the idea is this:
* you have an array containing 65536 32-bit numbers --- the crash count table 
--- that array is on disk and in memory (see struct __spadfs->cct in my sources)
* you have a 16-bit value --- the crash count; that value is on disk and in 
memory too (see struct __spadfs->cc)


* On mount, you load the crash count table and the crash count from disk to 
memory. You increment the crash count on disk (but leave the old value in 
memory). You increment one entry in the crash count table --- cct[cc] --- in 
memory, but leave the old value on disk.
* On sync, you write all metadata buffers, do a write barrier, write one 
sector of the crash count table from memory to disk, and do a write 
barrier again.

* On unmount, you sync and decrement the crash count on disk.

--- so the crash count counts crashes --- it is increased each time you 
mount and don't cleanly unmount.


Consistency of structures:
* Each directory entry has two tags --- a 32-bit transaction count (txc) 
and a 16-bit crash count (cc).
* You create a directory entry with entry->txc = fs->cct[fs->cc] and 
entry->cc = fs->cc
* A directory entry is considered valid if fs->cct[entry->cc] >= entry->txc 
(see macro CC_VALID)
* If the directory entry is not valid, it is skipped during a directory 
scan, as if it weren't there
--- so you create a directory entry and it's valid. If the system crashes, 
it will load the crash count table from disk, where the value is one less 
than entry->txc, so the entry will be invalid. It will also run with an 
increased cc, so it will never touch cct[] at an old index, and the entry 
will stay invalid forever.
--- if you sync, you write the crash count table to disk and the directory 
entry is atomically made valid forever (because values in the crash count 
table never decrease)
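Reconstructed from the description above, the validity test would look
roughly like this (a sketch --- the real SpadFS code may differ in
naming and detail):

struct __spadfs {
	u32 cct[65536];	/* crash count table, on disk and in memory */
	u16 cc;		/* current crash count */
};

struct spad_entry {	/* illustrative on-disk tags */
	u32 txc;	/* transaction count at creation */
	u16 cc;		/* crash count at creation */
};

/*
 * Valid if the crash count table, at the crash count the entry was
 * created under, has reached the entry's transaction count.  After a
 * crash the on-disk table is one behind, so unsynced entries become
 * invisible atomically.
 */
static int cc_valid(const struct __spadfs *fs, const struct spad_entry *e)
{
	return fs->cct[e->cc] >= e->txc;
}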


In my implementation, the top bit of entry->txc is used to mark whether 
the entry is scheduled for addition or deletion, so that you can atomically 
add one directory entry and delete another.


Space allocation bitmaps or lists are managed in such a way that there are 
two copies, and a cc/txc pair determines which one is valid.


Files are extended in such a way that each file has two "size" entries and 
a cc/txc pair denoting which one is valid, so that you can atomically 
extend/truncate a file and mark its space allocated/freed in bitmaps or 
lists (BTW, this cc/txc pair is the same one that denotes whether the 
directory entry is valid; another bit determines which of these two 
functions it serves --- to save space).


Mikulas


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread hui
On Sat, Apr 28, 2007 at 07:37:17AM +0200, Mikulas Patocka wrote:
> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or 
> phase tree filesystems (TUX2); it writes inside normal used structures, 
> but it marks each structure with generation tags --- when it updates 
> global table of tags, it atomically makes several structures valid. I 
> don't know about this idea being used elsewhere.

So how is this generation structure organized ? paper ?

bill



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Lee Revell

On 4/28/07, Mikulas Patocka <[EMAIL PROTECTED]> wrote:
> I most wonder, why vim fsyncs its swapfile regularly (blocking typing
> during that) and doesn't fsync the resulting file on :w  :-/

Never seen this.  Why would fsync block typing unless vim was doing
disk IO for every keystroke?

Lee


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mikulas Patocka

> hm, fsync.
>
> Aside: why the heck do applications think that their data is so important
> that they need to fsync it all the time.  I used to run a kernel on my
> laptop which had "return 0;" at the top of fsync() and fdatasync().  Most
> pleasurable.
>
> But wedging for 20 minutes is probably excessive punishment.


I most wonder, why vim fsyncs its swapfile regularly (blocking typing 
during that) and doesn't fsync the resulting file on :w  :-/


Mikulas


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mikulas Patocka



On Sat, 28 Apr 2007, Linus Torvalds wrote:


> > The main problem is that if the user extracts tar archive, tar eventually
> > blocks on writeback I/O --- O.K. But if bash attempts to write one page to
> > .bash_history file at the same time, it blocks too --- bad, the user is
> > annoyed.
>
> Right, but it's actually very unlikely. Think about it: the person who
> extracts the tar-archive is perhaps dirtying a thousand pages, while the
> .bash_history writeback is doing a single one. Which process do you think
> is going to hit the "oops, we went over the limit" case 99.9% of the time?


Both. See balance_dirty_pages --- you loop there if
global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS) + 
global_page_state(NR_WRITEBACK) is over the limit.

So tar gets there first, starts writeback, blocks. An innocent process calling 
one small write() gets there too (while writeback has not yet finished), 
sees that the expression is over the limit and blocks too.

Really, you go to balance_dirty_pages with 1/8 probability, so small 
writers will block with that probability --- better than blocking always, 
but still annoying.
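
In outline, the logic being described has this shape (a simplified sketch,
not the verbatim mm/page-writeback.c of that era; the helpers marked as
stand-ins are invented here):

extern unsigned long global_page_state(int item);   /* real kernel API */
extern unsigned long dirty_limit;           /* stand-in for the computed limit */
extern void start_writeback_and_wait(void); /* stand-in helper */
enum { NR_FILE_DIRTY, NR_UNSTABLE_NFS, NR_WRITEBACK };

static int ratelimit_counter;

void balance_dirty_pages_ratelimited(void)
{
        if (++ratelimit_counter < 8)    /* only ~1 in 8 dirtyings get checked */
                return;
        ratelimit_counter = 0;

        /*
         * Once here, ANY writer loops until the global totals drop below
         * the limit, no matter how few pages it dirtied itself.
         */
        while (global_page_state(NR_FILE_DIRTY) +
               global_page_state(NR_UNSTABLE_NFS) +
               global_page_state(NR_WRITEBACK) > dirty_limit)
                start_writeback_and_wait();
}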



> The _really_ annoying problem is when you just have absolutely tons of
> memory dirty, and you start doing the writeback: if you saturate the IO
> queues totally, it simply doesn't matter _who_ starts the writeback,
> because anybody who needs to do any IO at all (not necessarily writing) is
> going to be blocked.


I saw this writeback problem on a machine that had a lot of memory (1G), a 
fast internal disk where the distribution was installed, and a very slow 
external SCSI disk (6MB/s or so). When I did heavy writes on the external 
disk and writeback started, the computer almost completely locked up --- 
any process trying to write anything to the fast disk blocked until 
writeback on the slow disk finished.
(That machine had some old RHEL kernel and it is not mine, so I can't test 
new kernels on it --- but the above fragment of code shows that the 
problem still exists today.)


Mikulas



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mikulas Patocka

> So perhaps if there's any privileged reads going on then we should limit
> writes to a depth of 2 at most, with some timeout mechanism that would

SCSI has a "high priority" bit in the command block, so you can just set 
it --- but I am not sure how well disks support it.

Mikulas

> gradually allow the deepening of the hardware queue, as long as no
> highprio reads come inbetween? With 2 pending requests and even assuming
> worst-case seeks the user-visible latency would be on the order of 20-30
> msecs, which is at the edge of human perception. The problem comes when
> a hardware queue of 32-64 entries starves that one highprio read which
> then results in a 2+ seconds latency.
>
> Ingo




Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Paolo Ornati
On Sat, 28 Apr 2007 09:30:06 -0700 (PDT)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> There are worse examples. Try connecting some flash disk over USB-1, and 
> untar to it. Ugh.
> 
> I'd love to have some per-device dirty limit, but it's harder than it 
> should be.

this one should help:

Patch: per device dirty throttling
http://lwn.net/Articles/226709/

-- 
Paolo Ornati
Linux 2.6.21-cfs-v7-g13fe02de on x86_64


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> Even with a good software IO scheduler, when you have disks that do 
> tagged queueing, if you fill up the disk queue with a few dozen 
> (depends on the disk what the queue limit is) huge write requests, it 
> doesn't really matter if the _software_ queuing then gives a big 
> advantage to reads coming in. They'll _still_ be waiting for a long 
> time, especially since you don't know what the disk firmware is going 
> to do.

by far the largest advantage of tagged queueing is when we go from 1 
pending request to 2 pending requests. The rest helps too for certain 
workloads (especially benchmarks), but if the IRQ handling is fast 
enough, having just 2 is more than enough to get 80% of the advantage of, 
say, a hardware queue with a depth of 64.

So perhaps if there's any privileged reads going on then we should limit 
writes to a depth of 2 at most, with some timeout mechanism that would 
gradually allow the deepening of the hardware queue, as long as no 
highprio reads come inbetween? With 2 pending requests and even assuming 
worst-case seeks the user-visible latency would be on the order of 20-30 
msecs, which is at the edge of human perception. The problem comes when 
a hardware queue of 32-64 entries starves that one highprio read which 
then results in a 2+ seconds latency.
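
Something like this, perhaps (a hypothetical sketch of the proposed policy,
with invented names --- not existing block-layer code):

#define DEEPEN_TIMEOUT  100     /* ticks without a highprio read; arbitrary */
#define MAX_DEPTH       64

struct write_governor {
        int max_write_depth;            /* current cap on in-flight writes */
        unsigned long last_hiprio;      /* tick of the last highprio read */
};

/* a highprio read arrived: clamp writes back to depth 2 immediately */
void note_hiprio_read(struct write_governor *g, unsigned long now)
{
        g->last_hiprio = now;
        g->max_write_depth = 2;
}

/* may another write be sent to the hardware queue right now? */
int may_dispatch_write(struct write_governor *g, int writes_in_flight,
                       unsigned long now)
{
        /* quiet for a while: gradually allow the queue to deepen again */
        if (now - g->last_hiprio > DEEPEN_TIMEOUT &&
            g->max_write_depth < MAX_DEPTH)
                g->max_write_depth++;
        return writes_in_flight < g->max_write_depth;
}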

Ingo


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Linus Torvalds


On Sat, 28 Apr 2007, Matthias Andree wrote:
> 
> Another thing that is rather unpleasant (haven't yet tried fiddling with
> the dirty limits) is UDF to DVD-RAM - try rsyncing /home to a DVD-RAM,
> that's going to leave you with tons of dirty buffers that clear slowly
> -- "watch -n 1 grep -i dirty /proc/meminfo" is boring, but elucidating...

Now *this* is actually really really nasty.

There are worse examples. Try connecting some flash disk over USB-1, and 
untar to it. Ugh.

I'd love to have some per-device dirty limit, but it's harder than it 
should be.

Linus


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Linus Torvalds


On Sat, 28 Apr 2007, Mikulas Patocka wrote:
> > 
> > Especially with lots of memory, allowing 40% of that memory to be dirty is
> > just insane (even if we limit it to "just" 40% of the normal memory zone).
> > That can be gigabytes. And no amount of IO scheduling will make it
> > pleasant to try to handle the situation where that much memory is dirty.
> 
> What about using different dirtypage limits for different processes?

Not good. We inadvertently actually had a very strange case of that, in the 
sense that we had different dirtypage limits depending on the type of the 
allocation: if somebody used GFP_HIGHUSER, he'd be looking at the 
percentage as a percentage of _all_ memory, but if somebody used 
GFP_KERNEL he'd look at it as a percentage of just the normal low memory. 
So effectively they had different limits (the percentage may have been the 
same, but the _meaning_ of the percentage changed ;)

And it's really problematic, because it means that the process that has a 
high tolerance for dirty memory will happily dirty a lot of RAM, and then 
when the process that has a _low_ tolerance comes along, it might write 
just a single byte, and go "oh, damn, I'm way over my dirty limits, I will 
now have to start doing writeouts like mad".

Your form is much better:

> --- i.e. every process has dirtypage activity counter, that is increased when
> it dirties a page and decreased over time.

..but it is really hard to do, and in particular, it's really hard to make 
any kind of guarantee that when you have a hundred processes, they won't 
go over the total dirty limit together!

And one of the reasons for the dirty limit is that the VM really wants to 
know that it always has enough clean memory that it can throw away that 
even if it needs to do allocations while under IO, it's not totally 
screwed.  An example of this is using dirty mmap with a networked 
filesystem: with 2.6.20 and later, this should actually _work_ fairly 
reliably, exactly because we now also count the dirty mapped pages in the 
dirty limits, so we never get into the situation that we used to be able 
to get into, where some process had mapped all of RAM, and dirtied it 
without the kernel even realizing, and then when the kernel needed more 
memory (in order to write some of it back), it was totally screwed.

So we do need the "global limit", as just a VM safety issue. We could do 
some per-process counters in addition to that, but generally, the global 
limit actually ends up doing the right thing: heavy writers are more 
likely to _hit_ the limit, so statistically the people who write most are 
also the people who end up having to clean up - so it's all fair.

> The main problem is that if the user extracts tar archive, tar eventually
> blocks on writeback I/O --- O.K. But if bash attempts to write one page to
> .bash_history file at the same time, it blocks too --- bad, the user is
> annoyed.

Right, but it's actually very unlikely. Think about it: the person who 
extracts the tar-archive is perhaps dirtying a thousand pages, while the 
.bash_history writeback is doing a single one. Which process do you think 
is going to hit the "oops, we went over the limit" case 99.9% of the time?

The _really_ annoying problem is when you just have absolutely tons of 
memory dirty, and you start doing the writeback: if you saturate the IO 
queues totally, it simply doesn't matter _who_ starts the writeback, 
because anybody who needs to do any IO at all (not necessarily writing) is 
going to be blocked.

This is why having gigabytes of dirty data (or even "just" hundreds of 
megs) can be so annoying.

Even with a good software IO scheduler, when you have disks that do tagged 
queueing, if you fill up the disk queue with a few dozen (depends on the 
disk what the queue limit is) huge write requests, it doesn't really 
matter if the _software_ queuing then gives a big advantage to reads 
coming in. They'll _still_ be waiting for a long time, especially since 
you don't know what the disk firmware is going to do.

It's possible that we could do things like refusing to use all tag entries 
on the disk for writing. That would probably help latency a _lot_. Right 
now, if we do writeback, and fill up all the slots on the disk, we cannot 
even feed the disk the read request immediately - we'll have to wait for 
some of the writes to finish before we can even queue the read to the 
disk.
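
For illustration, the reservation could be as dumb as this (hypothetical
sketch with invented names, not something the block layer does):

/* keep a few hardware tags free for reads at all times */
#define DISK_TAGS       32      /* whatever the device advertises */
#define READ_RESERVE    4

int may_queue_request(int tags_in_use, int is_write)
{
        if (is_write)
                return tags_in_use < DISK_TAGS - READ_RESERVE;
        return tags_in_use < DISK_TAGS; /* reads may use every slot */
}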

(Of course, if disks don't support tagged queueing, you'll never have this 
problem at all, but most disks do these days, and I strongly suspect it 
really can aggravate latency numbers a lot).

Jens? Comments? Or do you do that already?

Linus


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Andrew Morton
On Sat, 28 Apr 2007 10:51:48 +0200 Matthias Andree <[EMAIL PROTECTED]> wrote:

> On Fri, 27 Apr 2007, Andrew Morton wrote:
> 
> > But none of this explains a 20-minute hang, unless a *lot* of fsyncs are
> > being performed, perhaps.
> 
> Another thing that is rather unpleasant (haven't yet tried fiddling with
> the dirty limits) is UDF to DVD-RAM - try rsyncing /home to a DVD-RAM,
> that's going to leave you with tons of dirty buffers that clear slowly
> -- "watch -n 1 grep -i dirty /proc/meminfo" is boring, but elucidating...
> 

yes, a few people are attacking that from various angles at present.  It's
tricky - writeback has to juggle a lot of balls.  We'll get there.



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Matthias Andree
On Fri, 27 Apr 2007, Andrew Morton wrote:

> But none of this explains a 20-minute hang, unless a *lot* of fsyncs are
> being performed, perhaps.

Another thing that is rather unpleasant (haven't yet tried fiddling with
the dirty limits) is UDF to DVD-RAM - try rsyncing /home to a DVD-RAM,
that's going to leave you with tons of dirty buffers that clear slowly
-- "watch -n 1 grep -i dirty /proc/meminfo" is boring, but elucidating...

-- 
Matthias Andree


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Matthias Andree
On Fri, 27 Apr 2007, Linus Torvalds wrote:

> Oh, well.. Journalling sucks.
> 
> I was actually _really_ hoping that somebody would come along and tell 
> everybody that this whole journal-logging is stupid, and that it's just 
> better to not ever re-write blocks on disk, but instead write to new 
> blocks with version numbers (and not re-use old blocks until new versions 
> are stable on disk).

Only that you need direct-overwrite support to be able to safely trash
data you no longer need...

-- 
Matthias Andree


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Matthias Andree
On Fri, 27 Apr 2007, Linus Torvalds wrote:

> 
> 
> On Fri, 27 Apr 2007, Marat Buharov wrote:
> >
> > On 4/27/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > > Aside: why the heck do applications think that their data is so important
> > > that they need to fsync it all the time.  I used to run a kernel on my
> > > laptop which had "return 0;" at the top of fsync() and fdatasync().  Most
> > > pleasurable.
> > 
> > So, if having fake fsync() and fdatasync() is pleasurable for laptop
> > and desktop, may be it's time to add option into Kconfig which
> > disables normal fsync behaviour in favor of robust desktop?
> 
> This really is an ext3 issue, not "fsync()".
> 
> On a good filesystem, when you do "fsync()" on a file, nothing at all 
> happens to any other files. On ext3, it seems to sync the global journal, 

This behavior has been in Linux and sort of official since the early
2.4.X days - remember the discussion on fsync()ing directory changes for
MTAs that led to the mount option "dirsync" for ext?fs so that rename(),
link() and stuff like that became synchronous even without fsync()ing
the parent directory? I can look up archive references if need be.

Surely four years ago, if not five (this is from the top of my head, not
a quotable fact I verified from the LKML archives though).

> I used to run reiserfs, and it had its problems, but this was the 
> "feature" of ext3 that I've disliked most. If you run a MUA with local 
> mail, it will do fsync's for most things, and things really hiccup if you 
> are doing some other writes at the same time. In contrast, with reiser, if 
> you did a big untar or some other big write, if somebody fsync'ed a small 
> file, it wasn't even a blip on the radar - the fsync would sync just that 
> small thing.

It's not as though I'd recommend reiserfs. I have seen one major
corruption recently in openSUSE 10.2 with ext3, but I've had constant
headaches with reiserfs since the day it went into S.u.S.E. kernels at
the time until I switched away from reiserfs some years ago.

-- 
Matthias Andree


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mike Galbraith
On Sat, 2007-04-28 at 00:01 -0700, Andrew Morton wrote:
> On Sat, 28 Apr 2007 08:32:32 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:
> 
> > On Sat, 2007-04-28 at 06:25 +0200, Mike Galbraith wrote:

> > As promised, I tested with a kernel that I know for fact that I have
> > tested heavy IO on previously, and behavior was identically horrid, so
> > it's not something new that snuck in ~recently, my disk just got a _lot_
> > fuller in the meantime (12k mp3s munch a lot).
> 
> Just to clarify here - you're saying that some older kernel is as sucky as
> 2.6.21, and that (presumably) dropping the dirty ratios makes things a bit
> better on the old kernel as well?

I didn't drop dirty ratios, only verified that behavior was just as
horrible as 2.6.21.

> Actually, I'm surprised that data=writeback didn't help much.  If the
> present theories are correct it should have helped quite a lot, because in
> data=writeback mode fsync(small-file) will not cause
> fdatasync(everything-else).

data=writeback did help quite noticeably, just not enough.

-Mike



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Andrew Morton
On Sat, 28 Apr 2007 08:32:32 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:

> On Sat, 2007-04-28 at 06:25 +0200, Mike Galbraith wrote:
> > On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote:
> > 
> > > Actually, you don't need to apply the patch - just do
> > > 
> > >   echo 5 > /proc/sys/vm/dirty_background_ratio
> > >   echo 10 > /proc/sys/vm/dirty_ratio
> > 
> > That seems to have done the trick.  Amarok and GUI aren't exactly speed
> > demons while writeout is happening, but they are not hanging for
> > eternities.
> 
> As promised, I tested with a kernel that I know for fact that I have
> tested heavy IO on previously, and behavior was identically horrid, so
> it's not something new that snuck in ~recently, my disk just got a _lot_
> fuller in the meantime (12k mp3s munch a lot).

Just to clarify here - you're saying that some older kernel is as sucky as
2.6.21, and that (presumably) dropping the dirty ratios makes things a bit
better on the old kernel as well?

> I also verified that I don't need to use the dirty data restrictions
> with ext2, all is just peachy using stock settings.  Amarok switches
> songs quickly, and GUI doesn't hang.  Behavior is that expected of a
> heavily loaded IO subsystem, and is 1000% better than ext3 with my very
> full disk.

Yes, the very full disk could explain why things are _so_ bad.  Not only
does fsync() force vast amounts of writeout, it's also seeky writeout.

> Journaling is very nice, but I think I'll be much better off without it
> responsiveness wise.

Well, physical journalling with ordered data is bad here.  Other forms of
journalling which don't introduce this great contention point shouldn't be
as bad.


Actually, I'm surprised that data=writeback didn't help much.  If the
present theories are correct it should have helped quite a lot, because in
data=writeback mode fsync(small-file) will not cause
fdatasync(everything-else).



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mike Galbraith
On Sat, 2007-04-28 at 06:25 +0200, Mike Galbraith wrote:
> On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote:
> 
> > Actually, you don't need to apply the patch - just do
> > 
> > echo 5 > /proc/sys/vm/dirty_background_ratio
> > echo 10 > /proc/sys/vm/dirty_ratio
> 
> That seems to have done the trick.  Amarok and GUI aren't exactly speed
> demons while writeout is happening, but they are not hanging for
> eternities.

As promised, I tested with a kernel that I know for a fact I have
tested heavy IO on previously, and behavior was identically horrid, so
it's not something new that snuck in ~recently; my disk just got a _lot_
fuller in the meantime (12k mp3s munch a lot).

I also verified that I don't need to use the dirty data restrictions
with ext2, all is just peachy using stock settings.  Amarok switches
songs quickly, and GUI doesn't hang.  Behavior is that expected of a
heavily loaded IO subsystem, and is 1000% better than ext3 with my very
full disk.

Journaling is very nice, but I think I'll be much better off without it
responsiveness wise.

-Mike



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mikulas Patocka



On Fri, 27 Apr 2007, Linus Torvalds wrote:

> On Fri, 27 Apr 2007, Mike Galbraith wrote:
> >
> > As subject states, my GUI is going away for extended periods of time
> > when my very full and likely highly fragmented (how to find out)
> > filesystem is under heavy write load.  While write is under way, if
> > amarok (mp3 player) is running, no song change will occur until write is
> > finished, and the GUI can go _entirely_ comatose for very long periods.
> > Usually, it will come back to life after write is finished, but
> > occasionally, a complete GUI restart is necessary.
>
> One thing to try out (and dammit, I should make it the default now in
> 2.6.21) is to just make the dirty limits much lower. We've been talking
> about this for ages, I think this might be the right time to do it.
>
> Especially with lots of memory, allowing 40% of that memory to be dirty is
> just insane (even if we limit it to "just" 40% of the normal memory zone).
> That can be gigabytes. And no amount of IO scheduling will make it
> pleasant to try to handle the situation where that much memory is dirty.


What about using different dirtypage limits for different processes?

--- i.e. every process has a dirtypage activity counter that is increased 
when it dirties a page and decays over time. Compute the limit for a 
process as some inverse of this counter --- so that processes that dirtied 
a lot of pages will be blocked at a lower limit and processes that dirtied 
few pages will be blocked at a higher limit.
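
As a sketch in C (illustrative only --- as said below, an untested idea, and
all names here are invented):

/* hypothetical per-process dirty-activity counter with decay */
struct dirty_activity {
        unsigned long counter;
};

void note_page_dirtied(struct dirty_activity *a)
{
        a->counter++;                   /* bumped on every dirtied page */
}

void decay_tick(struct dirty_activity *a)
{
        a->counter -= a->counter / 8;   /* periodic exponential decay */
}

/* heavy dirtiers hit their limit much earlier than light ones */
unsigned long process_dirty_limit(const struct dirty_activity *a,
                                  unsigned long global_limit)
{
        return global_limit / (1 + a->counter);
}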


The main problem is that if the user extracts tar archive, tar eventually 
blocks on writeback I/O --- O.K. But if bash attempts to write one page to 
.bash_history file at the same time, it blocks too --- bad, the user is 
annoyed.


(I don't have time to write and test it, it is just an idea --- I found 
these writeback lockups of the whole system annoying too)


Mikulas


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mikulas Patocka

On Fri, 27 Apr 2007, Bill Huey wrote:

> On Fri, Apr 27, 2007 at 12:50:34PM -0700, Linus Torvalds wrote:
> > Oh, well.. Journalling sucks.
> >
> > I was actually _really_ hoping that somebody would come along and tell
> > everybody that this whole journal-logging is stupid, and that it's just
> > better to not ever re-write blocks on disk, but instead write to new
> > blocks with version numbers (and not re-use old blocks until new versions
> > are stable on disk).
> >
> > There was even somebody who did something like that for a PhD thesis, I
> > forget the details (and it apparently died when the thesis was presumably
> > accepted ;).
>
> That sounds a whole lot like NetApp's WAFL file system and is heavily
> patented.
>
> bill


Hi

SpadFS doesn't write to unallocated parts like log filesystems (LFS) or 
phase tree filesystems (TUX2); it writes inside normal used structures, 
but it marks each structure with generation tags --- when it updates 
global table of tags, it atomically makes several structures valid. I 
don't know about this idea being used elsewhere.


Its fsync is slow too (it needs to write all (meta)data as well), but it at 
least doesn't livelock --- fsync is basically:

* write all buffers and wait for completion
* take lock preventing metadata updates
* write all buffers again (those that were updated while previous write 
was in progress) and wait for completion

* update global generation count table
* release the lock
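
In pseudo-C (helper names invented here, not the actual SpadFS code):

struct spad_fs;                         /* opaque; stands for the fs state */
void write_all_buffers_and_wait(void);  /* hypothetical helpers */
void lock_metadata_updates(struct spad_fs *fs);
void unlock_metadata_updates(struct spad_fs *fs);
void write_generation_count_table(struct spad_fs *fs);

void spadfs_style_fsync(struct spad_fs *fs)
{
        write_all_buffers_and_wait();           /* pass 1: bulk of the data */

        lock_metadata_updates(fs);              /* freeze metadata updates */
        write_all_buffers_and_wait();           /* pass 2: only buffers dirtied
                                                   while pass 1 was in flight */

        write_generation_count_table(fs);       /* the atomic commit point */
        unlock_metadata_updates(fs);
}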

Maybe Suse will be paying me from this autumn to add more features to it 
--- so far it works and doesn't eat data, but it isn't much known :)


Mikulas
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mikulas Patocka

On Fri, 27 Apr 2007, Bill Huey wrote:


On Fri, Apr 27, 2007 at 12:50:34PM -0700, Linus Torvalds wrote:

Oh, well.. Journalling sucks.

I was actually _really_ hoping that somebody would come along and tell
everybody that this whole journal-logging is stupid, and that it's just
better to not ever re-write blocks on disk, but instead write to new
blocks with version numbers (and not re-use old blocks until new versions
are stable on disk).

There was even somebody who did something like that for a PhD thesis, I
forget the details (and it apparently died when the thesis was presumably
accepted ;).


That sounds a whole lot like NetApp's WAFL file system and is heavily 
patented.


bill


Hi

SpadFS doesn't write to unallocated parts like log filesystems (LFS) or 
phase tree filesystems (TUX2); it writes inside normal used structures, 
but it marks each structure with generation tags --- when it updates 
global table of tags, it atomically makes several structures valid. I 
don't know about this idea being used elsewhere.


It's fsync is slow too (needs to write all (meta)data too), but it at 
least doesn't livelock --- fsync is basically:

* write all buffers and wait for completion
* take lock preventing metadata updates
* write all buffers again (those that were updated while previous write 
was in progress) and wait for completion

* update global generation count table
* release the lock

Maybe Suse will be paying me from this autumn to make more features to it 
--- so far it works, doesn't eat data, but isn't much known :)


Mikulas
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mike Galbraith
On Sat, 2007-04-28 at 06:25 +0200, Mike Galbraith wrote:
 On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote:
 
  Actually, you don't need to apply the patch - just do
  
  echo 5  /proc/sys/vm/dirty_background_ratio
  echo 10  /proc/sys/vm/dirty_ratio
 
 That seems to have done the trick.  Amarok and GUI aren't exactly speed
 demons while writeout is happening, but they are not hanging for
 eternities.

As promised, I tested with a kernel that I know for fact that I have
tested heavy IO on previously, and behavior was identically horrid, so
it's not something new that snuck in ~recently, my disk just got a _lot_
fuller in the meantime (12k mp3s munch a lot).

I also verified that I don't need to use the dirty data restrictions
with ext2, all is just peachy using stock settings.  Amarok switches
songs quickly, and GUI doesn't hang.  Behavior is that expected of a
heavily loaded IO subsystem, and is 1000% better than ext3 with my very
full disk.

Journaling is very nice, but I think I'll be much better off without it
responsiveness wise.

-Mike

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mikulas Patocka



On Fri, 27 Apr 2007, Linus Torvalds wrote:




On Fri, 27 Apr 2007, Mike Galbraith wrote:


As subject states, my GUI is going away for extended periods of time
when my very full and likely highly fragmented (how to find out)
filesystem is under heavy write load.  While write is under way, if
amarok (mp3 player) is running, no song change will occur until write is
finished, and the GUI can go _entirely_ comatose for very long periods.
Usually, it will come back to life after write is finished, but
occasionally, a complete GUI restart is necessary.


One thing to try out (and dammit, I should make it the default now in
2.6.21) is to just make the dirty limits much lower. We've been talking
about this for ages, I think this might be the right time to do it.

Especially with lots of memory, allowing 40% of that memory to be dirty is
just insane (even if we limit it to just 40% of the normal memory zone.
That can be gigabytes. And no amount of IO scheduling will make it
pleasant to try to handle the situation where that much memory is dirty.


What about using different dirtypage limits for different processes?

--- i.e. every process has dirtypage activity counter, that is increased 
when it dirties a page and decreased over time. Compute the limit for 
process as some inverse of this counter --- so that processes that dirtied 
a lot of pages will be blocked at lower limit and processes that dirtied 
few pages will be blocked at higher limit.


The main problem is that if the user extracts tar archive, tar eventually 
blocks on writeback I/O --- O.K. But if bash attempts to write one page to 
.bash_history file at the same time, it blocks too --- bad, the user is 
annoyed.


(I don't have time to write and test it, it is just an idea --- I found 
these writeback lockups of the whole system annoying too)


Mikulas
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Andrew Morton
On Sat, 28 Apr 2007 08:32:32 +0200 Mike Galbraith [EMAIL PROTECTED] wrote:

 On Sat, 2007-04-28 at 06:25 +0200, Mike Galbraith wrote:
  On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote:
  
   Actually, you don't need to apply the patch - just do
   
 echo 5  /proc/sys/vm/dirty_background_ratio
 echo 10  /proc/sys/vm/dirty_ratio
  
  That seems to have done the trick.  Amarok and GUI aren't exactly speed
  demons while writeout is happening, but they are not hanging for
  eternities.
 
 As promised, I tested with a kernel that I know for fact that I have
 tested heavy IO on previously, and behavior was identically horrid, so
 it's not something new that snuck in ~recently, my disk just got a _lot_
 fuller in the meantime (12k mp3s munch a lot).

Just to clarify here - you're saying that some older kernel is as sucky as
2.6.21, and that (presumably) dropping the dirty ratios makes things a bit
better on the old kernel as well?

 I also verified that I don't need to use the dirty data restrictions
 with ext2, all is just peachy using stock settings.  Amarok switches
 songs quickly, and GUI doesn't hang.  Behavior is that expected of a
 heavily loaded IO subsystem, and is 1000% better than ext3 with my very
 full disk.

Yes, the very full disk could explain why things are _so_ bad.  Not only
does fsync() force vast amounts of writeout, it's also seeky writeout.

 Journaling is very nice, but I think I'll be much better off without it
 responsiveness wise.

Well, physical journalling with ordered data is bad here.  Other forms of
journalling which don't introduce this great contention point shouldn't be
as bad.


Actually, I'm surprised that data=writeback didn't help much.  If the
present theories are correct it should have helped quite a lot, because in
data=writeback mode fsync(small-file) will not cause
fdatasync(everything-else).

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mike Galbraith
On Sat, 2007-04-28 at 00:01 -0700, Andrew Morton wrote:
 On Sat, 28 Apr 2007 08:32:32 +0200 Mike Galbraith [EMAIL PROTECTED] wrote:
 
  On Sat, 2007-04-28 at 06:25 +0200, Mike Galbraith wrote:

  As promised, I tested with a kernel that I know for fact that I have
  tested heavy IO on previously, and behavior was identically horrid, so
  it's not something new that snuck in ~recently, my disk just got a _lot_
  fuller in the meantime (12k mp3s munch a lot).
 
 Just to clarify here - you're saying that some older kernel is as sucky as
 2.6.21, and that (presumably) dropping the dirty ratios makes things a bit
 better on the old kernel as well?

I didn't drop dirty ratios, only verified that behavior was just as
horrible as 2.6.21.

 Actually, I'm surprised that data=writeback didn't help much.  If the
 present theories are correct it should have helped quite a lot, because in
 data=writeback mode fsync(small-file) will not cause
 fdatasync(everything-else).

data=writeback did help quite noticeably, just not enough.

-Mike

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Matthias Andree
On Fri, 27 Apr 2007, Linus Torvalds wrote:

 
 
 On Fri, 27 Apr 2007, Marat Buharov wrote:
 
  On 4/27/07, Andrew Morton [EMAIL PROTECTED] wrote:
   Aside: why the heck do applications think that their data is so important
   that they need to fsync it all the time.  I used to run a kernel on my
   laptop which had return 0; at the top of fsync() and fdatasync().  Most
   pleasurable.
  
  So, if having fake fsync() and fdatasync() is pleasurable for laptop
  and desktop, may be it's time to add option into Kconfig which
  disables normal fsync behaviour in favor of robust desktop?
 
 This really is an ext3 issue, not fsync().
 
 On a good filesystem, when you do fsync() on a file, nothing at all 
 happens to any other files. On ext3, it seems to sync the global journal, 

This behavior has been in Linux and sort of official since the early
2.4.X days - remember the discussion on fsync()ing directory changes for
MTAs that led to the mount option dirsync for ext?fs so that rename(),
link() and stuff like that became synchronous even without fsync()ing
the parent directory? I can look up archive references if need be.

Surely four years ago, if not five (this is from the top of my head, not
a quotable fact I verified from the LKML archives though).

 I used to run reiserfs, and it had its problems, but this was the 
 feature of ext3 that I've disliked most. If you run a MUA with local 
 mail, it will do fsync's for most things, and things really hickup if you 
 are doing some other writes at the same time. In contrast, with reiser, if 
 you did a big untar or some other big write, if somebody fsync'ed a small 
 file, it wasn't even a blip on the radar - the fsync would sync just that 
 small thing.

It's not as though I'd recommend reiserfs. I have seen one major
corruption recently in openSUSE 10.2 with ext3, but I've had constant
headaches with reiserfs since the day it went into S.u.S.E. kernels at
the time until I switched away from reiserfs some years ago.

-- 
Matthias Andree
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Matthias Andree
On Fri, 27 Apr 2007, Linus Torvalds wrote:

 Oh, well.. Journalling sucks.
 
 I was actually _really_ hoping that somebody would come along and tell 
 everybody that this whole journal-logging is stupid, and that it's just 
 better to not ever re-write blocks on disk, but instead write to new 
 blocks with version numbers (and not re-use old blocks until new versions 
 are stable on disk).

Only that you need direct-overwrite support to be able to safely trash
data you no longer need...

-- 
Matthias Andree
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Matthias Andree
On Fri, 27 Apr 2007, Andrew Morton wrote:

 But none of this explains a 20-minute hang, unless a *lot* of fsyncs are
 being performed, perhaps.

Another thing that is rather unpleasant (haven't yet tried fiddling with
the dirty limits) is UDF to DVD-RAM - try rsyncing /home to a DVD-RAM,
that's going to leave you with tons of dirty buffers that clear slowly
-- watch -n 1 grep -i dirty /proc/meminfo is boring, but elucidating...

-- 
Matthias Andree
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Andrew Morton
On Sat, 28 Apr 2007 10:51:48 +0200 Matthias Andree [EMAIL PROTECTED] wrote:

 On Fri, 27 Apr 2007, Andrew Morton wrote:
 
  But none of this explains a 20-minute hang, unless a *lot* of fsyncs are
  being performed, perhaps.
 
 Another thing that is rather unpleasant (haven't yet tried fiddling with
 the dirty limits) is UDF to DVD-RAM - try rsyncing /home to a DVD-RAM,
 that's going to leave you with tons of dirty buffers that clear slowly
 -- watch -n 1 grep -i dirty /proc/meminfo is boring, but elucidating...
 

yes, a few people are attacking that from various angles at present.  It's
tricky - writeback has to juggle a lot of balls.  We'll get there.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Linus Torvalds


On Sat, 28 Apr 2007, Mikulas Patocka wrote:
  
  Especially with lots of memory, allowing 40% of that memory to be dirty is
  just insane (even if we limit it to just 40% of the normal memory zone.
  That can be gigabytes. And no amount of IO scheduling will make it
  pleasant to try to handle the situation where that much memory is dirty.
 
 What about using different dirtypage limits for different processes?

Not good. We inadvertedly actually had a very strange case of that, in the 
sense that we had different dirtypage limits depending on the type of the 
allocation: if somebody used GFP_HIGHUSER, he'd be looking at the 
percentage as a percentage of _all_ memory, but if somebody used 
GFP_KERNEL he'd look at it as a percentage of just the normal low memory. 
So effectively they had different limits (the percentage may have been the 
same, but the _meaning_ of the percentage changed ;)

And it's really problematic, because it means that the process that has a 
high tolerance for dirty memory will happily dirty a lot of RAM, and then 
when the process that has a _low_ tolerance comes along, it might write 
just a single byte, and go oh, damn, I'm way over my dirty limits, I will 
now have to start doing writeouts like mad.

Your form is much better:

 --- i.e. every process has dirtypage activity counter, that is increased when
 it dirties a page and decreased over time.

..but is really hard to do, and in particular, it's really hard to make 
any kinds of guarantees that when you have a hundred processes, they won't 
go over the total dirty limit together!

And one of the reasons for the dirty limit is that the VM really wants to 
know that it always has enough clean memory that it can throw away that 
even if it needs to do allocations while under IO, it's not totally 
screwed.  An example of this is using dirty mmap with a networked 
filesystem: with 2.6.20 and later, this should actually _work_ fairly 
reliably, exactly because we now also count the dirty mapped pages in the 
dirty limits, so we never get into the situation that we used to be able 
to get into, where some process had mapped all of RAM, and dirtied it 
without the kernel even realizing, and then when the kernel needed more 
memory (in order to write some of it back), it was totally screwed.

So we do need the global limit, as just a VM safety issue. We could do 
some per-process counters in addition to that, but generally, the global 
limit actually ends up doing the right thing: heavy writers are more 
likely to _hit_ the limit, so statistically the people who write most are 
also the people who end up havign to clean up - so it's all fair.

 The main problem is that if the user extracts tar archive, tar eventually
 blocks on writeback I/O --- O.K. But if bash attempts to write one page to
 .bash_history file at the same time, it blocks too --- bad, the user is
 annoyed.

Right, but it's actually very unlikely. Think about it: the person who 
extracts the tar-archive is perhaps dirtying a thousand pages, while the 
.bash_history writeback is doing a single one. Which process do you think 
is going to hit the oops, we went over the limit case 99.9% of the time?

The _really_ annoying problem is when you just have absolutely tons of 
memory dirty, and you start doing the writeback: if you saturate the IO 
queues totally, it simply doesn't matter _who_ starts the writeback, 
because anybody who needs to do any IO at all (not necessarily writing) is 
going to be blocked.

This is why having gigabytes of dirty data (or even just hundreds of 
megs) can be so annoying.

Even with a good software IO scheduler, when you have disks that do tagged 
queueing, if you fill up the disk queue with a few dozen (depends on the 
disk what the queue limit is) huge write requests, it doesn't really 
matter if the _software_ queuing then gives a big advantage to reads 
coming in. They'll _still_ be waiting for a long time, especially since 
you don't know what the disk firmware is going to do.

It's possible that we could do things like refusing to use all tag entries 
on the disk for writing. That would probably help latency a _lot_. Right 
now, if we do writeback, and fill up all the slots on the disk, we cannot 
even feed the disk the read request immediately - we'll have to wait for 
some of the writes to finish before we can even queue the read to the 
disk.

(Of course, if disks don't support tagged queueing, you'll never have this 
problem at all, but most disks do these days, and I strongly suspect it 
really can aggravate latency numbers a lot).

Jens? Comments? Or do you do that already?

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Linus Torvalds


On Sat, 28 Apr 2007, Matthias Andree wrote:
 
 Another thing that is rather unpleasant (haven't yet tried fiddling with
 the dirty limits) is UDF to DVD-RAM - try rsyncing /home to a DVD-RAM,
 that's going to leave you with tons of dirty buffers that clear slowly
 -- watch -n 1 grep -i dirty /proc/meminfo is boring, but elucidating...

Now *this* is actually really really nasty.

There are worse examples. Try connecting some flash disk over USB-1, and 
untar to it. Ugh.

I'd love to have some per-device dirty limit, but it's harder than it 
should be.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

 Even with a good software IO scheduler, when you have disks that do 
 tagged queueing, if you fill up the disk queue with a few dozen 
 (depends on the disk what the queue limit is) huge write requests, it 
 doesn't really matter if the _software_ queuing then gives a big 
 advantage to reads coming in. They'll _still_ be waiting for a long 
 time, especially since you don't know what the disk firmware is going 
 to do.

by far the largest advantage of tagged queueing is when we go from 1 
pending request to 2 pending requests. The rest helps too for certain 
workloads (especially benchmarks), but if the IRQ handling is fast 
enough, having just 2 is more than enough to get 80% of the advantage of 
say of hardware-queue with a depth of 64.

So perhaps if there's any privileged reads going on then we should limit 
writes to a depth of 2 at most, with some timeout mechanism that would 
gradually allow the deepening of the hardware queue, as long as no 
highprio reads come inbetween? With 2 pending requests and even assuming 
worst-case seeks the user-visible latency would be on the order of 20-30 
msecs, which is at the edge of human perception. The problem comes when 
a hardware queue of 32-64 entries starves that one highprio read which 
then results in a 2+ seconds latency.

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Paolo Ornati
On Sat, 28 Apr 2007 09:30:06 -0700 (PDT)
Linus Torvalds [EMAIL PROTECTED] wrote:

 There are worse examples. Try connecting some flash disk over USB-1, and 
 untar to it. Ugh.
 
 I'd love to have some per-device dirty limit, but it's harder than it 
 should be.

this one should help:

Patch: per device dirty throttling
http://lwn.net/Articles/226709/

-- 
Paolo Ornati
Linux 2.6.21-cfs-v7-g13fe02de on x86_64
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mikulas Patocka

So perhaps if there's any privileged reads going on then we should limit
writes to a depth of 2 at most, with some timeout mechanism that would


SCSI has a high priority bit in the command block, so you can just set 
it --- but I am not sure how well do disks support it.


Mikulas


gradually allow the deepening of the hardware queue, as long as no
highprio reads come inbetween? With 2 pending requests and even assuming
worst-case seeks the user-visible latency would be on the order of 20-30
msecs, which is at the edge of human perception. The problem comes when
a hardware queue of 32-64 entries starves that one highprio read which
then results in a 2+ seconds latency.

Ingo


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mikulas Patocka



On Sat, 28 Apr 2007, Linus Torvalds wrote:


The main problem is that if the user extracts tar archive, tar eventually
blocks on writeback I/O --- O.K. But if bash attempts to write one page to
.bash_history file at the same time, it blocks too --- bad, the user is
annoyed.


Right, but it's actually very unlikely. Think about it: the person who
extracts the tar-archive is perhaps dirtying a thousand pages, while the
.bash_history writeback is doing a single one. Which process do you think
is going to hit the oops, we went over the limit case 99.9% of the time?


Both. See balance_dirty_pages --- you loop there if
global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS) + 
global_page_state(NR_WRITEBACK) is over limit.


So tar gets there first, start writeback, blocks. Innocent process calling 
one small write() gets there too (while writeback has not yet finished), 
sees that the expression is over limit and blocks too.


Really, you go to ballance_dirty_pages with 1/8 probability, so small 
writers will block with that probability --- better than blocking always, 
but still annoying.



The _really_ annoying problem is when you just have absolutely tons of
memory dirty, and you start doing the writeback: if you saturate the IO
queues totally, it simply doesn't matter _who_ starts the writeback,
because anybody who needs to do any IO at all (not necessarily writing) is
going to be blocked.


I saw this writeback problem on machine that had a lot of memory (1G), 
internal fast disk where the distribution was installed and very slow 
external SCSI disk (6MB/s or so). When I did heavy write on the external 
disk and writeback started, the computer almost completely locked up --- 
any process trying to write anything to the fast disk blocked until 
writeback on the slow disk finishes.
(that machine had some old RHEL kernel and it is not mine so I can't test 
new kernels on it --- but the above fragment of code shows that the 
problem still exists today)


Mikulas

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mikulas Patocka

hm, fsync.

Aside: why the heck do applications think that their data is so important
that they need to fsync it all the time.  I used to run a kernel on my
laptop which had return 0; at the top of fsync() and fdatasync().  Most
pleasurable.

But wedging for 20 minutes is probably excessive punishment.


I most wonder, why vim fsyncs its swapfile regularly (blocking typing 
during that) and doesn't fsync the resulting file on :w  :-/


Mikulas
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Lee Revell

On 4/28/07, Mikulas Patocka [EMAIL PROTECTED] wrote:

I most wonder, why vim fsyncs its swapfile regularly (blocking typing
during that) and doesn't fsync the resulting file on :w  :-/


Never seen this.  Why would fsync block typing unless vim was doing
disk IO for every keystroke?

Lee


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread hui
On Sat, Apr 28, 2007 at 07:37:17AM +0200, Mikulas Patocka wrote:
> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or 
> phase tree filesystems (TUX2); it writes inside normal used structures, 
> but it marks each structure with generation tags --- when it updates 
> global table of tags, it atomically makes several structures valid. I 
> don't know about this idea being used elsewhere.

So how is this generation structure organized? Is there a paper?

bill



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-28 Thread Mikulas Patocka

> On Sat, Apr 28, 2007 at 07:37:17AM +0200, Mikulas Patocka wrote:
> > SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
> > phase tree filesystems (TUX2); it writes inside normal used structures,
> > but it marks each structure with generation tags --- when it updates
> > global table of tags, it atomically makes several structures valid. I
> > don't know about this idea being used elsewhere.
> 
> So how is this generation structure organized? Is there a paper?


The paper is in the CITSA 2006 proceedings (but you likely don't have 
them, and I signed some statement that I can't post it elsewhere :-( )


Basically the idea is this:
* You have an array of 65536 32-bit numbers --- the crash count table --- 
which exists on disk and in memory (see struct __spadfs->cct in my sources).
* You have a 16-bit value --- the crash count --- which is on disk and in 
memory too (see struct __spadfs->cc).


* On mount, you load the crash count table and the crash count from disk 
to memory. You increment the crash count on disk (but leave the old value 
in memory). You increment one entry in the crash count table --- cct[cc] 
--- in memory, but leave the old value on disk.
* On sync, you write all metadata buffers, do a write barrier, write the 
one dirty sector of the crash count table from memory to disk, and do a 
write barrier again.

* On unmount, you sync and decrement the crash count on disk.

--- so the crash count counts crashes --- it is increased each time you 
mount without a clean unmount.
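
(A schematic sketch of that bookkeeping --- the names are illustrative,
not the actual SpadFS declarations, and the I/O helpers are hypothetical
stubs standing in for real buffer and barrier writes:)

	#include <stdint.h>

	struct spadfs {
		uint32_t txc[65536];	/* crash count table: disk + memory */
		uint16_t cc;		/* crash count: disk + memory */
	};

	void spadfs_mount(struct spadfs *fs)
	{
		/* table and cc have just been loaded from disk */
		write_cc_to_disk(fs->cc + 1);	/* bump cc on disk only */
		fs->txc[fs->cc]++;		/* bump cct[cc] in memory only */
	}

	void spadfs_sync(struct spadfs *fs)
	{
		write_all_metadata_buffers();
		write_barrier();
		write_cct_sector_to_disk(fs->txc, fs->cc);	/* one sector */
		write_barrier();
	}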


Consistency of structures:
* Each directory entry has two tags --- a 32-bit transaction count (txc) 
and a 16-bit crash count (cc).
* You create a directory entry with entry->txc = fs->txc[fs->cc] and 
entry->cc = fs->cc.
* A directory entry is considered valid if fs->txc[entry->cc] >= entry->txc 
(see macro CC_VALID).
* If the directory entry is not valid, it is skipped during directory 
scan, as if it wasn't there.
--- so you create a directory entry and it's valid. If the system crashes, 
it will load the crash count table from disk, where the value is one less 
than entry->txc, so the entry will be invalid. It will also run with an 
increased cc, so it will never touch txc at the old index, and the entry 
will stay invalid forever.
--- if you sync, you write the crash count table to disk and the directory 
entry is atomically made valid forever (because values in the crash count 
table never decrease).
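
(Continuing the sketch above --- again illustrative, not the real
CC_VALID macro; the top-bit mask anticipates the add/delete flag
described just below:)

	struct entry_tags {
		uint32_t txc;	/* fs->txc[fs->cc] at creation time */
		uint16_t cc;	/* fs->cc at creation time */
	};

	/* Valid once the on-disk table has caught up with the stamped value. */
	int cc_valid(const struct spadfs *fs, const struct entry_tags *e)
	{
		return fs->txc[e->cc] >= (e->txc & 0x7fffffff);
	}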


In my implementation, the top bit of entry->txc is used to mark whether 
the entry is scheduled for addition or deletion, so that you can atomically 
add one directory entry and delete another.


Space allocation bitmaps or lists are managed in such a way that there are 
two copies, with a cc/txc pair determining which one is valid.


Files are extended in such a way that each file has two size entries and 
a cc/txc pair denoting which one is valid, so that you can atomically 
extend/truncate a file and mark its space allocated/freed in the bitmaps 
or lists (BTW, this cc/txc pair is the same one that denotes whether the 
directory entry is valid; another bit selects between these two functions 
--- to save space).
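
(The two-copy selection might look like the following; the slot
discipline here is a guess at the mechanism, not taken from the SpadFS
sources:)

	struct file_sizes {
		uint64_t size[2];	/* old and new size */
		struct entry_tags tag;	/* stamped when size[1] was written */
	};

	uint64_t live_size(const struct spadfs *fs, const struct file_sizes *f)
	{
		/* the new size once the update has been committed by a
		   sync, the old size if we crashed before that */
		return f->size[cc_valid(fs, &f->tag) ? 1 : 0];
	}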


Mikulas


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Mikulas Patocka



On Sat, 28 Apr 2007, Mikulas Patocka wrote:
> On Fri, 27 Apr 2007, Bill Huey wrote:
> > Hi
> 
> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or 
> phase tree filesystems (TUX2);


--- BTW, I don't think that writing to unallocated parts of the disk is a 
good idea. These filesystems have cool write benchmarks, but they suffer 
one subtle (and unbenchmarkable) problem:
They group files according to the time when they were created, not 
according to the directory hierarchy.
When the user has a directory of project files and has edited different 
files at different times, normal filesystems will place the files near 
each other (so that "grep blabla *" is fast), while log-structured 
filesystems will scatter the files over the whole disk.


Mikulas


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Mike Galbraith
On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote:

> Actually, you don't need to apply the patch - just do
> 
>   echo 5 > /proc/sys/vm/dirty_background_ratio
>   echo 10 > /proc/sys/vm/dirty_ratio

That seems to have done the trick.  Amarok and GUI aren't exactly speed
demons while writeout is happening, but they are not hanging for
eternities.

-Mike



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Andrew Morton
On Fri, 27 Apr 2007 13:31:30 -0600
Andreas Dilger <[EMAIL PROTECTED]> wrote:

> On Apr 27, 2007  08:30 -0700, Linus Torvalds wrote:
> > On a good filesystem, when you do "fsync()" on a file, nothing at all 
> > happens to any other files. On ext3, it seems to sync the global journal, 
> > which means that just about *everything* that writes even a single byte 
> > (well, at least anything journalled, which would be all the normal 
> > directory ops etc) to disk will just *stop* dead cold!
> > 
> > It's horrid. And it really is ext3, not "fsync()".
> > 
> > I used to run reiserfs, and it had its problems, but this was the 
> > "feature" of ext3 that I've disliked most. If you run a MUA with local 
> > mail, it will do fsync's for most things, and things really hiccup if you 
> > are doing some other writes at the same time. In contrast, with reiser, if 
> > you did a big untar or some other big write, if somebody fsync'ed a small 
> > file, it wasn't even a blip on the radar - the fsync would sync just that 
> > small thing.
> 
> It's true that this is a "feature" of ext3 with data=ordered (the default),
> but I suspect the same thing is now true in reiserfs too.  The reason is
> that if a journal commit doesn't flush the data as well then a crash will
> result in garbage (from old deleted files) being visible in the newly
> allocated file.  People used to complain about this with reiserfs all the
> time having corrupt data in new files after a crash, which is why I believe
> it was fixed.

People still complain about hey-my-files-are-all-full-of-zeroes on XFS.

> There definitely are some problems with the ext3 journal commit though.
> If the journal is full it will cause the whole journal to checkpoint out
> to the filesystem synchronously even if just space for a small transaction
> is needed.  That is doubly bad if you have a very large journal.  I believe
> Alex has a patch to have it checkpoint much smaller chunks to the fs.
> 

We can make great improvements here, and I've (twice) previously described
how: hoist the entire ordered-mode data handling out of ext3, and out of
the buffer_head layer and move it up into the VFS pagecache layer. 
Basically, do ordered-data with a commit-time inode walk, calling
do_sync_mapping_range().
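
(A hypothetical sketch of that commit-time walk --- not real ext3/JBD
code; locking, livelock avoidance and error handling are all elided:)

	static void commit_ordered_data(struct super_block *sb)
	{
		struct inode *inode;

		list_for_each_entry(inode, &sb->s_dirty, i_list) {
			/* push the inode's dirty pages, and wait, before
			   committing the metadata that refers to them */
			do_sync_mapping_range(inode->i_mapping, 0, LLONG_MAX,
					SYNC_FILE_RANGE_WRITE |
					SYNC_FILE_RANGE_WAIT_AFTER);
		}
	}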

Do it in the VFS.  Make reiserfs use it, remove reiserfs ordered-mode too. 
Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.

And guess what?  We can then partly fix _this_ problem too.  If we're
running a commit on behalf of fsync(inode1) and we come across an inode2
which doesn't have any block allocation metadata in this commit, we don't
need to sync inode2's pages.

Weep.  It's times like this when I want to escape all this patch-wrangling
nonsense and go do some real stuff.


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Andrew Morton
On Fri, 27 Apr 2007 13:09:06 -0600
Zan Lynx <[EMAIL PROTECTED]> wrote:

> On Fri, 2007-04-27 at 11:31 -0700, Andrew Morton wrote:
> [snip]
> > ext3's problem here is that a single fsync() requires that ext3 sync the
> > whole filesystem.  Because
> > 
> > - a journal commit can contain metadata from multiple files, and if we
> >   want to journal one file's metadata via fsync(), we unavoidably journal
> >   all the other file's metadata at the same time.
> > 
> > - ordered mode requires that we write a file's data blocks prior to
> >   journalling the metadata which refers to those blocks.
> > 
> > net result: syncing anything syncs the whole world.
> > 
> > There are a few areas in which this could conceivably be tuned up: if a
> > particular file doesn't currently have any metadata in the commit, we don't
> > actually need to sync its data blocks: we could just transfer them into
> > next commit.  Hard, unlikely to be of benefit.
> [snip]
> 
> How about mixing the ordered and data journal modes?  If the data blocks
> would fit, have fsync write them into the journal as is done in
> data=journal mode.  Then that file data is committed to disk as fsync
> requires, but it shouldn't require flushing all the previous metadata to
> get an ordered guarantee.

In some ways that would be quite neat: if a process does a small write then
fsyncs it, write it all into the journal.  That avoids a seek out to the
file's data blocks.

However it'd be quite hard to do, I expect: we don't know until commit time
how much data has been written to this file (actually, we don't even know
at commit-time, but we could, with quite some work, find out).

But none of this will solve the problem, because even with your optimised
fsync(), we still need to write out bonnie's large file at commit time,
when we fsync() your small write to a different file.

(And when I say "this problem" I refer to the known-about problem which
we're discussing here.  I suspect this in fact isn't Mike's problem - 20
minutes is crazy - it's not attributable to the fsync-syncs-everything
problem unless Mike's GUI is doing a huge number of separate fsyncs)


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Linus Torvalds


On Fri, 27 Apr 2007, Jan Engelhardt wrote:
> 
> Interesting. For my laptop, I have configured like 90 for
> dirty_background_ratio and 95 for dirty_ratio. Makes for a nice
> delayed write, but I do not do workloads bigger than extracing kernel
> tarballs (~250 MB) and coding away on that machine (488 MB RAM) anyway.
> Setting it to something like 95, I could probably rm -Rf the kernel
> tree again and the disk never gets active because it is all cached.
> But if dirty_ratio is lowered, the disk will get active soon.

Yes. For laptops, you may want to
 - raise the dirty limits
 - increase the dirty scan times

but you do realize that if you then need memory for something else, 
latency just becomes *horrible*. So even on laptops, it's not obviously 
the right thing to do (these days, throwing money at the problem instead, 
and getting one of the nice new 1.8" flash disks, will solve all issues: 
you'd have no reason to try to delay spinning up the disk anyway).

Linus


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Manoj Joseph

Linus Torvalds wrote:
> On Fri, 27 Apr 2007, Andreas Dilger wrote:
> > It's true that this is a "feature" of ext3 with data=ordered (the default),
> > but I suspect the same thing is now true in reiserfs too.
> 
> Oh, well.. Journalling sucks.

Go back to ext2? ;)

> I was actually _really_ hoping that somebody would come along and tell 
> everybody that this whole journal-logging is stupid, and that it's just 
> better to not ever re-write blocks on disk, but instead write to new 
> blocks with version numbers (and not re-use old blocks until new versions 
> are stable on disk).

Ah, "copy on write"! ZFS (Sun) and WAFL (NetApp) do this. I don't know 
about WAFL, but ZFS does logging too.

-Manoj

--
Manoj Joseph
http://kerneljunkie.blogspot.com/



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Stephen Clark

Linus Torvalds wrote:
> On Fri, 27 Apr 2007, Andreas Dilger wrote:
> > It's true that this is a "feature" of ext3 with data=ordered (the default),
> > but I suspect the same thing is now true in reiserfs too.
> 
> Oh, well.. Journalling sucks.
> 
> I was actually _really_ hoping that somebody would come along and tell 
> everybody that this whole journal-logging is stupid, and that it's just 
> better to not ever re-write blocks on disk, but instead write to new 
> blocks with version numbers (and not re-use old blocks until new versions 
> are stable on disk).

That sort of sounds like something NCR used to do in the mainframe days: 
files had generation numbers, and multiple generations of the files were 
kept around, with the OS automatically removing the older ones.

> There was even somebody who did something like that for a PhD thesis, I 
> forget the details (and it apparently died when the thesis was presumably 
> accepted ;).
> 
> Linus

--
"They that give up essential liberty to obtain temporary safety, 
deserve neither liberty nor safety."  (Ben Franklin)

"The course of history shows that as a government grows, liberty 
decreases."  (Thomas Jefferson)


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Gabriel C

Linus Torvalds wrote:
> There was even somebody who did something like that for a PhD thesis, I 
> forget the details (and it apparently died when the thesis was presumably 
> accepted ;).
> 
> Linus

You mean SpadFS[1], right?

Gabriel

[1] http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Miquel van Smoorenburg
In article <[EMAIL PROTECTED]> you write:
>I was actually _really_ hoping that somebody would come along and tell 
>everybody that this whole journal-logging is stupid, and that it's just 
>better to not ever re-write blocks on disk, but instead write to new 
>blocks with version numbers (and not re-use old blocks until new versions 
>are stable on disk).
>
>There was even somebody who did something like that for a PhD thesis, I 
>forget the details (and it apparently died when the thesis was presumably 
>accepted ;).

If you mean tux2, it died because of patent issues:

http://www.uwsg.iu.edu/hypermail/linux/kernel/0208.3/0332.html

Mike.


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread hui
On Fri, Apr 27, 2007 at 12:50:34PM -0700, Linus Torvalds wrote:
> Oh, well.. Journalling sucks.
> 
> I was actually _really_ hoping that somebody would come along and tell 
> everybody that this whole journal-logging is stupid, and that it's just 
> better to not ever re-write blocks on disk, but instead write to new 
> blocks with version numbers (and not re-use old blocks until new versions 
> are stable on disk).
> 
> There was even somebody who did something like that for a PhD thesis, I 
> forget the details (and it apparently died when the thesis was presumably 
> accepted ;).

That sounds a whole lot like NetApp's WAFL file system and is heavily patented.

bill



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Jan Engelhardt

On Apr 27 2007 08:18, Linus Torvalds wrote:
>
>Actually, you don't need to apply the patch - just do
>
>   echo 5 > /proc/sys/vm/dirty_background_ratio
>   echo 10 > /proc/sys/vm/dirty_ratio
>
>and say if it seems to improve things. I think those are much saner 
>defaults especially for a desktop system (and probably for most servers 
>too, for that matter).

Interesting. For my laptop, I have configured like 90 for
dirty_background_ratio and 95 for dirty_ratio. Makes for a nice
delayed write, but I do not do workloads bigger than extracting kernel
tarballs (~250 MB) and coding away on that machine (488 MB RAM) anyway.
Setting it to something like 95, I could probably rm -Rf the kernel
tree again and the disk never gets active because it is all cached.
But if dirty_ratio is lowered, the disk will get active soon.

>Historical note: allowing about half of memory to contain dirty pages made 
>more sense back in the days when people had 16-64MB of memory, and a 
>single untar of even fairly small projects would otherwise hit the disk. 
>But memory sizes have grown *much* more quickly than disk speeds (and 
>latency requirements have gone down, not up), so a default that may 
>actually have been perfectly fine at some point seems crazy these days..



Jan


RE: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Hua Zhong
The idea has not died and some NAS/file server vendors have already been
doing this for some time. (I am not sure but is WAFS the same thing?)

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:linux-kernel-
> [EMAIL PROTECTED] On Behalf Of Linus Torvalds
> Sent: Friday, April 27, 2007 12:51 PM
> To: Andreas Dilger
> Cc: Marat Buharov; Andrew Morton; Mike Galbraith; LKML; Jens Axboe;
> [EMAIL PROTECTED]; Alex Tomas
> Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose
> when FS is under heavy write load (massive starvation)
> 
> 
> 
> On Fri, 27 Apr 2007, Andreas Dilger wrote:
> >
> > It's true that this is a "feature" of ext3 with data=ordered (the
> default),
> > but I suspect the same thing is now true in reiserfs too.
> 
> Oh, well.. Journalling sucks.
> 
> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new
> versions
> are stable on disk).
> 
> There was even somebody who did something like that for a PhD thesis, I
> forget the details (and it apparently died when the thesis was
> presumably
> accepted ;).
> 
>   Linus
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Linus Torvalds


On Fri, 27 Apr 2007, Andreas Dilger wrote:
>
> It's true that this is a "feature" of ext3 with data=ordered (the default),
> but I suspect the same thing is now true in reiserfs too.

Oh, well.. Journalling sucks.

I was actually _really_ hoping that somebody would come along and tell 
everybody that this whole journal-logging is stupid, and that it's just 
better to not ever re-write blocks on disk, but instead write to new 
blocks with version numbers (and not re-use old blocks until new versions 
are stable on disk).

There was even somebody who did something like that for a PhD thesis, I 
forget the details (and it apparently died when the thesis was presumably 
accepted ;).

Linus


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Mike Galbraith
On Fri, 2007-04-27 at 13:31 -0600, Andreas Dilger wrote:
> I believe
> Alex has a patch to have it checkpoint much smaller chunks to the fs.

I wouldn't be averse to test driving such a patch (understatement). You
have a pointer?

-Mike



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Marko Macek

Linus Torvalds wrote:
> On Fri, 27 Apr 2007, John Anthony Kazos Jr. wrote:
> > Could[/should] this stuff be changed from ratios to amounts? Or a quick 
> > boot-time test to use a ratio if the memory is small and an amount (like 
> > tax brackets, I would expect) if it's great?
> 
> Yes, the "percentage" thing was likely wrong. That said, there *is* some 
> correlation between "lots of memory" and "high-end machine", and that in 
> turn tends to correlate with "fast disk", so I don't think the percentage 
> approach is really *horribly* wrong.
> 
> The main issue with the percentage is that we do export them as such 
> through the /proc/ interface, and they are easy to change and understand. 
> So changing them to amounts is non-trivial if you also want to support the 
> old interfaces - and the advantage isn't obvious enough that it's a 
> clear-cut case.

I wonder if it would be useful if the limit was "data we can write out 
in 1 (configurable) second". This would typically mean either one ~50MB 
contiguous block (depending on disk) or 100-200 scattered blocks (since 
typical disk latency is about 5-10ms).

Has anyone tried something like this?

Mark


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Andreas Dilger
On Apr 27, 2007  08:30 -0700, Linus Torvalds wrote:
> On a good filesystem, when you do "fsync()" on a file, nothing at all 
> happens to any other files. On ext3, it seems to sync the global journal, 
> which means that just about *everything* that writes even a single byte 
> (well, at least anything journalled, which would be all the normal 
> directory ops etc) to disk will just *stop* dead cold!
> 
> It's horrid. And it really is ext3, not "fsync()".
> 
> I used to run reiserfs, and it had its problems, but this was the 
> "feature" of ext3 that I've disliked most. If you run a MUA with local 
> mail, it will do fsync's for most things, and things really hiccup if you 
> are doing some other writes at the same time. In contrast, with reiser, if 
> you did a big untar or some other big write, if somebody fsync'ed a small 
> file, it wasn't even a blip on the radar - the fsync would sync just that 
> small thing.

It's true that this is a "feature" of ext3 with data=ordered (the default),
but I suspect the same thing is now true in reiserfs too.  The reason is
that if a journal commit doesn't flush the data as well then a crash will
result in garbage (from old deleted files) being visible in the newly
allocated file.  People used to complain about this with reiserfs all the
time having corrupt data in new files after a crash, which is why I believe
it was fixed.

There definitely are some problems with the ext3 journal commit though.
If the journal is full it will cause the whole journal to checkpoint out
to the filesystem synchronously even if just space for a small transaction
is needed.  That is doubly bad if you have a very large journal.  I believe
Alex has a patch to have it checkpoint much smaller chunks to the fs.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Mike Galbraith
On Fri, 2007-04-27 at 11:31 -0700, Andrew Morton wrote:

> But none of this explains a 20-minute hang, unless a *lot* of fsyncs are
> being performed, perhaps.

Yes.  I need to do a lot more testing.  All I see is one, and it's game
over.  Bizarre.

-Mike



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Mike Galbraith
On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote:

> Actually, you don't need to apply the patch - just do
> 
>   echo 5 > /proc/sys/vm/dirty_background_ratio
>   echo 10 > /proc/sys/vm/dirty_ratio
> 

I'll try this, and do some testing with other kernels as well.

-Mike



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Zan Lynx
On Fri, 2007-04-27 at 11:31 -0700, Andrew Morton wrote:
[snip]
> ext3's problem here is that a single fsync() requires that ext3 sync the
> whole filesystem.  Because
> 
> - a journal commit can contain metadata from multiple files, and if we
>   want to journal one file's metadata via fsync(), we unavoidably journal
>   all the other file's metadata at the same time.
> 
> - ordered mode requires that we write a file's data blocks prior to
>   journalling the metadata which refers to those blocks.
> 
> net result: syncing anything syncs the whole world.
> 
> There are a few areas in which this could conceivably be tuned up: if a
> particular file doesn't currently have any metadata in the commit, we don't
> actually need to sync its data blocks: we could just transfer them into
> next commit.  Hard, unlikely to be of benefit.
[snip]

How about mixing the ordered and data journal modes?  If the data blocks
would fit, have fsync write them into the journal as is done in
data=journal mode.  Then that file data is committed to disk as fsync
requires, but it shouldn't require flushing all the previous metadata to
get an ordered guarantee.

Or so it seems to me.
-- 
Zan Lynx <[EMAIL PROTECTED]>




Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Andrew Morton
On Fri, 27 Apr 2007 08:18:34 -0700 (PDT) Linus Torvalds <[EMAIL PROTECTED]> 
wrote:

>   echo 5 > /proc/sys/vm/dirty_background_ratio
>   echo 10 > /proc/sys/vm/dirty_ratio

That'll help a lot.

ext3's problem here is that a single fsync() requires that ext3 sync the
whole filesystem.  Because

- a journal commit can contain metadata from multiple files, and if we
  want to journal one file's metadata via fsync(), we unavoidably journal
  all the other file's metadata at the same time.

- ordered mode requires that we write a file's data blocks prior to
  journalling the metadata which refers to those blocks.

net result: syncing anything syncs the whole world.

There are a few areas in which this could conceivably be tuned up: if a
particular file doesn't currently have any metadata in the commit, we don't
actually need to sync its data blocks: we could just transfer them into
next commit.  Hard, unlikely to be of benefit.

Arguably, we could get away without syncing overwritten data blocks.  Users
would occasionally see older data than they otherwise would have after a
crash.  Could help a bit in some circumstances.

But none of this explains a 20-minute hang, unless a *lot* of fsyncs are
being performed, perhaps.
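
(The sync-the-world effect is easy to measure with a toy test like the
one below --- illustrative, not from this thread; run it on a
data=ordered mount and compare with data=writeback:)

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/time.h>
	#include <unistd.h>

	static double now(void)
	{
		struct timeval tv;
		gettimeofday(&tv, NULL);
		return tv.tv_sec + tv.tv_usec / 1e6;
	}

	int main(void)
	{
		static char buf[1 << 16];
		int i, fd;

		if (fork() == 0) {	/* writer: dirty pages non-stop */
			int big = open("big", O_WRONLY|O_CREAT|O_TRUNC, 0644);
			for (;;)
				write(big, buf, sizeof(buf));
		}

		fd = open("small", O_WRONLY|O_CREAT|O_TRUNC, 0644);
		for (i = 0; i < 10; i++) {	/* time fsync of one byte */
			double t;
			sleep(1);
			write(fd, "x", 1);
			t = now();
			fsync(fd);
			printf("fsync took %.3f s\n", now() - t);
		}
		return 0;	/* note: kill the writer child by hand */
	}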



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Chuck Ebbert
Linus Torvalds wrote:
> 
> On Fri, 27 Apr 2007, John Anthony Kazos Jr. wrote:
>> Could[/should] this stuff be changed from ratios to amounts? Or a quick 
>> boot-time test to use a ratio if the memory is small and an amount (like 
>> tax brackets, I would expect) if it's great?
> 
> Yes, the "percentage" thing was likely wrong. That said, there *is* some 
> correlation between "lots of memory" and "high-end machine", and that in 
> turn tends to correlate with "fast disk", so I don't think the percentage 
> approach is really *horribly* wrong.
> 
> The main issue with the percentage is that we do export them as such 
> through the /proc/ interface, and they are easy to change and understand. 
> So changing them to amounts is non-trivial if you also want to support the 
> old interfaces - and the advantage isn't obvious enough that it's a 
> clear-cut case.
> 

We could add a new "limit" field, though. If it defaulted to 0 (unlimited)
the default behavior wouldn't change.


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Linus Torvalds


On Fri, 27 Apr 2007, John Anthony Kazos Jr. wrote:
> 
> Could[/should] this stuff be changed from ratios to amounts? Or a quick 
> boot-time test to use a ratio if the memory is small and an amount (like 
> tax brackets, I would expect) if it's great?

Yes, the "percentage" thing was likely wrong. That said, there *is* some 
correlation between "lots of memory" and "high-end machine", and that in 
turn tends to correlate with "fast disk", so I don't think the percentage 
approach is really *horribly* wrong.

The main issue with the percentage is that we do export them as such 
through the /proc/ interface, and they are easy to change and understand. 
So changing them to amounts is non-trivial if you also want to support the 
old interfaces - and the advantage isn't obvious enough that it's a 
clear-cut case.

Linus


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread John Anthony Kazos Jr.
> One thing to try out (and dammit, I should make it the default now in 
> 2.6.21) is to just make the dirty limits much lower. We've been talking 
> about this for ages, I think this might be the right time to do it.

Could[/should] this stuff be changed from ratios to amounts? Or a quick 
boot-time test to use a ratio if the memory is small and an amount (like 
tax brackets, I would expect) if it's great?


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Linus Torvalds


On Fri, 27 Apr 2007, Marat Buharov wrote:
>
> On 4/27/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > Aside: why the heck do applications think that their data is so important
> > that they need to fsync it all the time.  I used to run a kernel on my
> > laptop which had "return 0;" at the top of fsync() and fdatasync().  Most
> > pleasurable.
> 
> So, if having fake fsync() and fdatasync() is pleasurable for laptop
> and desktop, may be it's time to add option into Kconfig which
> disables normal fsync behaviour in favor of robust desktop?

This really is an ext3 issue, not "fsync()".

On a good filesystem, when you do "fsync()" on a file, nothing at all 
happens to any other files. On ext3, it seems to sync the global journal, 
which means that just about *everything* that writes even a single byte 
(well, at least anything journalled, which would be all the normal 
directory ops etc) to disk will just *stop* dead cold!

It's horrid. And it really is ext3, not "fsync()".

I used to run reiserfs, and it had its problems, but this was the 
"feature" of ext3 that I've disliked most. If you run a MUA with local 
mail, it will do fsync's for most things, and things really hiccup if you 
are doing some other writes at the same time. In contrast, with reiser, if 
you did a big untar or some other big write, if somebody fsync'ed a small 
file, it wasn't even a blip on the radar - the fsync would sync just that 
small thing.

Maybe I'm wrong on the exact details (I'm not really up on the ext3 
journal handling ;^), but you don't even have to know about any internals 
at all: you can just test it. Gaak.

Linus


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Linus Torvalds


On Fri, 27 Apr 2007, Mike Galbraith wrote:
> 
> As subject states, my GUI is going away for extended periods of time
> when my very full and likely highly fragmented (how to find out)
> filesystem is under heavy write load.  While write is under way, if
> amarok (mp3 player) is running, no song change will occur until write is
> finished, and the GUI can go _entirely_ comatose for very long periods.
> Usually, it will come back to life after write is finished, but
> occasionally, a complete GUI restart is necessary.

One thing to try out (and dammit, I should make it the default now in 
2.6.21) is to just make the dirty limits much lower. We've been talking 
about this for ages, I think this might be the right time to do it.

Especially with lots of memory, allowing 40% of that memory to be dirty is 
just insane (even if we limit it to "just" 40% of the normal memory zone). 
That can be gigabytes. And no amount of IO scheduling will make it 
pleasant to try to handle the situation where that much memory is dirty.

So I do believe that we could probably do something about the IO 
scheduling _too_:

 - break up large write requests (yeah, it will make for worse IO 
   throughput, but if make it configurable, and especially with 
   controllers that don't have insane overheads per command, the 
   difference between 128kB requests and 16MB requests is probably not 
   really even noticeable - SCSI things with large per-command overheads 
   are just stupid)

   Generating huge requests will automatically mean that they are 
   "unbreakable" from an IO scheduler perspective, so it's bad for latency 
   for other requests once they've started.

 - maybe be more aggressive about prioritizing reads over writes.

but in the meantime, what happens if you apply this patch?

Actually, you don't need to apply the patch - just do

echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio

and say if it seems to improve things. I think those are much saner 
defaults especially for a desktop system (and probably for most servers 
too, for that matter).

Even 10% of memory dirty can be a whole lot of RAM, but it should 
hopefully be _better_ than the insane default we have now.
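
(The sysctl names for those knobs are vm.dirty_background_ratio and
vm.dirty_ratio, so the equivalent persistent setting --- assuming a
standard sysctl setup --- would be:)

	# /etc/sysctl.conf, applied at boot (or via "sysctl -p")
	vm.dirty_background_ratio = 5
	vm.dirty_ratio = 10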

Historical note: allowing about half of memory to contain dirty pages made 
more sense back in the days when people had 16-64MB of memory, and a 
single untar of even fairly small projects would otherwise hit the disk. 
But memory sizes have grown *much* more quickly than disk speeds (and 
latency requirements have gone down, not up), so a default that may 
actually have been perfectly fine at some point seems crazy these days..

Linus

---
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f469e3c..a794945 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -67,12 +67,12 @@ static inline long sync_writeback_pages(void)
 /*
  * Start background writeback (via pdflush) at this percentage
  */
-int dirty_background_ratio = 10;
+int dirty_background_ratio = 5;
 
 /*
  * The generator of dirty data starts writeback at this percentage
  */
-int vm_dirty_ratio = 40;
+int vm_dirty_ratio = 10;
 
 /*
  * The interval between `kupdate'-style writebacks, in jiffies


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Mark Lord

Peter Zijlstra wrote:
> No way is globally disabling fsync() a good thing. I guess Andrew just
> is a sucker for punishment :-)

Mmm... perhaps another nice thing to include in laptop-mode operation?


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Manoj Joseph

Marat Buharov wrote:
> On 4/27/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > Aside: why the heck do applications think that their data is so important
> > that they need to fsync it all the time.  I used to run a kernel on my
> > laptop which had "return 0;" at the top of fsync() and fdatasync().  Most
> > pleasurable.
> 
> So, if having fake fsync() and fdatasync() is pleasurable for laptop
> and desktop, may be it's time to add option into Kconfig which
> disables normal fsync behaviour in favor of robust desktop?

Sure, a noop fsync/fdatasync would speed up some things. And I am sure 
Andrew Morton knew what he was doing and the consequences.

But unless you care nothing about your data, you should not do it. It is 
as simple as that. No, it does not give you a robust desktop!!


-Manoj

--
Manoj Joseph
http://kerneljunkie.blogspot.com/


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Peter Zijlstra
On Fri, 2007-04-27 at 15:59 +0400, Marat Buharov wrote:
> On 4/27/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > Aside: why the heck do applications think that their data is so important
> > that they need to fsync it all the time.  I used to run a kernel on my
> > laptop which had "return 0;" at the top of fsync() and fdatasync().  Most
> > pleasurable.
> 
> So, if having fake fsync() and fdatasync() is pleasurable for laptop
> and desktop, may be it's time to add option into Kconfig which
> disables normal fsync behaviour in favor of robust desktop?

Nah, just teaching user-space to behave themselves should be sufficient;
there is just no way kicker can justify doing a fdatasync(), I mean,
come on, it's just showing a friggin menu. I have always wondered why that
thing was so damn slow, like it needs to fetch stuff like that from all
four corners of disk, feh!

Just sliding over a sub-menu can take more than a second; I mean, it
_really_ is just faster to just start things from your favourite shell.

No way is globally disabling fsync() a good thing. I guess Andrew just
is a sucker for punishment :-)




Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Marat Buharov

On 4/27/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> Aside: why the heck do applications think that their data is so important
> that they need to fsync it all the time.  I used to run a kernel on my
> laptop which had "return 0;" at the top of fsync() and fdatasync().  Most
> pleasurable.


So, if having fake fsync() and fdatasync() is pleasurable for laptop
and desktop, may be it's time to add option into Kconfig which
disables normal fsync behaviour in favor of robust desktop?


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Mike Galbraith
On Fri, 2007-04-27 at 01:33 -0700, Andrew Morton wrote:

> Another livelock possibility is that bonnie is redirtying pages faster than
> commit can write them out, so commit got livelocked:
> 
> When I was doing the original port-from-2.2 I found that an application
> which does
> 
>   for ( ; ; )
>   pwrite(fd, "", 1, 0);
> 
> would permanently livelock the fs.  I fixed that, but it was six years ago,
> and perhaps we later unfixed it.

Well, box doesn't seem the least bit upset after quite a while now, so I
guess it didn't get unfixed.

-Mike



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Mike Galbraith
On Fri, 2007-04-27 at 01:33 -0700, Andrew Morton wrote:
> On Fri, 27 Apr 2007 09:59:27 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:
> 
> > Greetings,
> > 
> > As subject states, my GUI is going away for extended periods of time
> > when my very full and likely highly fragmented (how to find out)
> > filesystem is under heavy write load.  While write is under way, if
> > amarok (mp3 player) is running, no song change will occur until write is
> > finished, and the GUI can go _entirely_ comatose for very long periods.
> > Usually, it will come back to life after write is finished, but
> > occasionally, a complete GUI restart is necessary.
> 
> I'd be suspecting a GUI bug if a restart is necessary.  Perhaps it went to
> lunch for so long in the kernel that some time-based thing went bad.

Yeah, there have been some KDE updates, maybe something went south.  I
know for sure that nothing this horrible used to happen during IO.  But
then when I used to regularly test IO, my disk heads didn't have to
traverse nearly as much either.

> Right.  One possibility here is that bonnie is stuffing new dirty blocks
> onto the committing transaction's ordered-data list and JBD commit is
> livelocking.  Only we're not supposed to be putting those blocks on that
> list.
> 
> Another livelock possibility is that bonnie is redirtying pages faster than
> commit can write them out, so commit got livelocked:
> 
> When I was doing the original port-from-2.2 I found that an application
> which does
> 
>   for ( ; ; )
>   pwrite(fd, "", 1, 0);
> 
> would permanently livelock the fs.  I fixed that, but it was six years ago,
> and perhaps we later unfixed it.

I'll try that.

> It would be most interesting to try data=writeback.

Seems somewhat better, but nothing close to tolerable.  I still had to
hot-key to a VT and kill the bonnie.

> hm, fsync.
> 
> Aside: why the heck do applications think that their data is so important
> that they need to fsync it all the time.  I used to run a kernel on my
> laptop which had "return 0;" at the top of fsync() and fdatasync().  Most
> pleasurable.

I thought unkind thoughts when I saw those traces :)

Thanks,

-Mike



Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Andrew Morton
On Fri, 27 Apr 2007 09:59:27 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:

> Greetings,
> 
> As subject states, my GUI is going away for extended periods of time
> when my very full and likely highly fragmented (how to find out)
> filesystem is under heavy write load.  While write is under way, if
> amarok (mp3 player) is running, no song change will occur until write is
> finished, and the GUI can go _entirely_ comatose for very long periods.
> Usually, it will come back to life after write is finished, but
> occasionally, a complete GUI restart is necessary.

I'd be suspecting a GUI bug if a restart is necessary.  Perhaps it went to
lunch for so long in the kernel that some time-based thing went bad.

> The longest comatose period to date was ~20 minutes with 2.6.20.7 a few
> days ago.  I was letting SuSE's software update programs update my SuSE
> 10.2 system, and started a bonnie while it was running (because I had
> been seeing this on recent kernels, and wanted to see if it was in
> stable as well), WHAM, instant dead GUI.  When this happens, kbd and
> mouse events work fine, so I hot-keyed to a VT via CTRL+ALT+F1, and
> killed the bonnie.  No joy, GUI stayed utterly comatose until the
> updater finished roughly 20 minutes later, at which time the shells I'd
> tried to start popped up, and all worked as if nothing bad had ever
> happened.  During the time in between, no window could be brought into
> focus, nada.
> 
> While a bonnie is writing, if I poke KDE's menu button, that will
> instantly trigger nastiness, and a trace (this one was with a cfs
> kernel, but I just did same with virgin 2.6.21) shows that "kicker",
> KDE's launcher proggy does an fdatasync for some reason, and that's the
> end of its world for ages.  When clicking on amarok's icon, it does an
> fsync, and that's the last thing that will happen in its world until
> write is done as well.  I've repeated this with CFQ and AS IO
> schedulers.

Well that all sucks.

> I have a couple of old kernels lying around that I can test with, but I
> think it's going to be the same.  Seems to be ext3's journal that is
> causing my woes.  Below this trace of kicker is one of amarok during
> its dead-to-the-world time.
> 
> Box is 3GHz P4 intel ICH5, SMP/UP doesn't matter.  .config is latest
> kernel tested attached.  Mount options are
> noatime,nodiratime,acl,user_xattr.
> 
> 
> [  308.046646] kickerD 0044 0  5897  1 (NOTLB)
> [  308.052611]f32abe4c 00200082 83398b5a 0044 c01c251e f32ab000 
> f32ab000 c01169b6 
> [  308.060926]f772fbcc cdc7e694 0039 8339857a 0044 83398b5a 
> 0044  
> [  308.069422]c1b5ab00 c1b5aac0 00353928 f32abe80 c01c7ab8  
> f32ab000 c1b5ab10 
> [  308.077927] Call Trace:
> [  308.080568]  [] log_wait_commit+0x9d/0x11f
> [  308.085549]  [] journal_stop+0x1a1/0x22a
> [  308.090364]  [] journal_force_commit+0x1d/0x20
> [  308.095699]  [] ext3_force_commit+0x24/0x26
> [  308.100774]  [] ext3_write_inode+0x2d/0x3b
> [  308.105771]  [] __writeback_single_inode+0x2df/0x3a9
> [  308.111633]  [] sync_inode+0x15/0x38
> [  308.116093]  [] ext3_sync_file+0xbd/0xc8
> [  308.120900]  [] do_fsync+0x58/0x8b
> [  308.125188]  [] __do_fsync+0x20/0x2f
> [  308.129656]  [] sys_fdatasync+0x10/0x12
> [  308.134384]  [] sysenter_past_esp+0x5d/0x81
> [  308.139441]  ===

Right.  One possibility here is that bonnie is stuffing new dirty blocks
onto the committing transaction's ordered-data list and JBD commit is
livelocking.  Only we're not supposed to be putting those blocks on that
list.

Another livelock possibility is that bonnie is redirtying pages faster than
commit can write them out, so commit got livelocked:

When I was doing the original port-from-2.2 I found that an application
which does

for ( ; ; )
pwrite(fd, "", 1, 0);

would permanently livelock the fs.  I fixed that, but it was six years ago,
and perhaps we later unfixed it.
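
(As a self-contained test, the loop above amounts to the following ---
illustrative; it simply redirties the same page forever:)

	#define _XOPEN_SOURCE 500
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("victim", O_WRONLY | O_CREAT, 0644);

		for (;;)
			pwrite(fd, "", 1, 0);	/* rewrite byte 0 endlessly */
	}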

It would be most interesting to try data=writeback.

> [  311.755953] bonnieD 0046 0  6146   5929 (NOTLB)
> [  311.761929]e7622a60 00200082 04d7e5fe 0046 03332bd5  
> e7622000 c02c0c54 
> [  311.770244]d8eaabcc e7622a64 f7d0c3ec 04d7e521 0046 04d7e5fe 
> 0046  
> [  311.778758]e7622aa4 f4f94400 e7622aac e7622a68 c04a2f06 e7622a70 
> c018b105 e7622a8c 
> [  311.787261] Call Trace:
> [  311.789904]  [] io_schedule+0xe/0x16
> [  311.794373]  [] sync_buffer+0x2e/0x32
> [  311.798927]  [] __wait_on_bit_lock+0x3f/0x62
> [  311.804089]  [] out_of_line_wait_on_bit_lock+0x5f/0x67
> [  311.810115]  [] __lock_buffer+0x2b/0x31
> [  311.814846]  [] sync_dirty_buffer+0x88/0xc3
> [  311.819921]  [] journal_dirty_data+0x1dd/0x205
> [  311.825256]  [] ext3_journal_dirty_data+0x12/0x37
> [  311.830858]  [] journal_dirty_data_fn+0x15/0x1c
> [  311.836280]  [] walk_page_buffers+0x36/0x68
> [  311.841347]  [] 

[ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Mike Galbraith
Greetings,

As subject states, my GUI is going away for extended periods of time
when my very full and likely highly fragmented (how to find out)
filesystem is under heavy write load.  While write is under way, if
amarok (mp3 player) is running, no song change will occur until write is
finished, and the GUI can go _entirely_ comatose for very long periods.
Usually, it will come back to life after write is finished, but
occasionally, a complete GUI restart is necessary.

The longest comatose period to date was ~20 minutes with 2.6.20.7 a few
days ago.  I was letting SuSE's software update programs update my SuSE
10.2 system, and started a bonnie while it was running (because I had
been seeing this on recent kernels, and wanted to see if it was in
stable as well), WHAM, instant dead GUI.  When this happens, kbd and
mouse events work fine, so I hot-keyed to a VT via CTRL+ALT+F1, and
killed the bonnie.  No joy, GUI stayed utterly comatose until the
updater finished roughly 20 minutes later, at which time the shells I'd
tried to start popped up, and all worked as if nothing bad had ever
happened.  During the time in between, no window could be brought into
focus, nada.

While a bonnie is writing, if I poke KDE's menu button, that will
instantly trigger nastiness, and a trace (this one was with a cfs
kernel, but I just did same with virgin 2.6.21) shows that "kicker",
KDE's launcher proggy does an fdatasync for some reason, and that's the
end of its world for ages.  When clicking on amarok's icon, it does an
fsync, and that's the last thing that will happen in its world until
write is done as well.  I've repeated this with CFQ and AS IO
schedulers.

I have a couple of old kernels lying around that I can test with, but I
think it's going to be the same.  Seems to be ext3's journal that is
causing my woes.  Below this trace of kicker is one of amarok during
its dead-to-the-world time.

Box is 3GHz P4 intel ICH5, SMP/UP doesn't matter.  .config is latest
kernel tested attached.  Mount options are
noatime,nodiratime,acl,user_xattr.


[  308.046646] kickerD 0044 0  5897  1 (NOTLB)
[  308.052611]f32abe4c 00200082 83398b5a 0044 c01c251e f32ab000 
f32ab000 c01169b6 
[  308.060926]f772fbcc cdc7e694 0039 8339857a 0044 83398b5a 
0044  
[  308.069422]c1b5ab00 c1b5aac0 00353928 f32abe80 c01c7ab8  
f32ab000 c1b5ab10 
[  308.077927] Call Trace:
[  308.080568]  [c01c7ab8] log_wait_commit+0x9d/0x11f
[  308.085549]  [c01c250f] journal_stop+0x1a1/0x22a
[  308.090364]  [c01c2fce] journal_force_commit+0x1d/0x20
[  308.095699]  [c01bac1e] ext3_force_commit+0x24/0x26
[  308.100774]  [c01b50ea] ext3_write_inode+0x2d/0x3b
[  308.105771]  [c0186cf0] __writeback_single_inode+0x2df/0x3a9
[  308.111633]  [c0187641] sync_inode+0x15/0x38
[  308.116093]  [c01b1695] ext3_sync_file+0xbd/0xc8
[  308.120900]  [c0189a09] do_fsync+0x58/0x8b
[  308.125188]  [c0189a5c] __do_fsync+0x20/0x2f
[  308.129656]  [c0189a7b] sys_fdatasync+0x10/0x12
[  308.134384]  [c0103eec] sysenter_past_esp+0x5d/0x81
[  308.139441]  ===
[  311.755953] bonnie        D 0046 0  6146   5929 (NOTLB)
[  311.761929] e7622a60 00200082 04d7e5fe 0046 03332bd5  e7622000 c02c0c54
[  311.770244] d8eaabcc e7622a64 f7d0c3ec 04d7e521 0046 04d7e5fe 0046
[  311.778758] e7622aa4 f4f94400 e7622aac e7622a68 c04a2f06 e7622a70 c018b105 e7622a8c
[  311.787261] Call Trace:
[  311.789904]  [c04a2f06] io_schedule+0xe/0x16
[  311.794373]  [c018b105] sync_buffer+0x2e/0x32
[  311.798927]  [c04a3756] __wait_on_bit_lock+0x3f/0x62
[  311.804089]  [c04a37d8] out_of_line_wait_on_bit_lock+0x5f/0x67
[  311.810115]  [c018b248] __lock_buffer+0x2b/0x31
[  311.814846]  [c018bb56] sync_dirty_buffer+0x88/0xc3
[  311.819921]  [c01c3ce4] journal_dirty_data+0x1dd/0x205
[  311.825256]  [c01b3300] ext3_journal_dirty_data+0x12/0x37
[  311.830858]  [c01b333a] journal_dirty_data_fn+0x15/0x1c
[  311.836280]  [c01b277d] walk_page_buffers+0x36/0x68
[  311.841347]  [c01b552f] ext3_ordered_writepage+0x11a/0x191
[  311.847027]  [c0152133] generic_writepages+0x1f3/0x305
[  311.852344]  [c015227c] do_writepages+0x37/0x39
[  311.857064]  [c0186aa7] __writeback_single_inode+0x96/0x3a9
[  311.862842]  [c0187037] sync_sb_inodes+0x1bc/0x27f
[  311.867830]  [c01875e3] writeback_inodes+0x98/0xe1
[  311.872819]  [c015240a] balance_dirty_pages_ratelimited_nr+0xc4/0x1bf
[  311.879461]  [c014df55] generic_file_buffered_write+0x32e/0x677
[  311.885576]  [c014e580] __generic_file_aio_write_nolock+0x2e2/0x57f
[  311.892044]  [c014e87d] generic_file_aio_write+0x60/0xd4
[  311.897553]  [c01b14f7] ext3_file_write+0x27/0xa5
[  311.902455]  [c016ab7b] do_sync_write+0xcd/0x103
[  311.907270]  [c016b37a] vfs_write+0xa8/0x128
[  311.911738]  [c016b873] sys_write+0x3d/0x64
[  311.916111]  [c0103eec] sysenter_past_esp+0x5d/0x81
[  311.921185]  ===
[  311.924763] pdflush   D 0046 0  6147  5 (L-TLB)
[  311.930739] ec7e2ef0 0046 03f2b0ea 0046 ec7e2f0c c0186b45 ec7e2000 c01169b6
[  311.939052] ea14069c ec7e2f00 ec7e2f00 03f2afc9 0046 03f2b0ea 0046 0282
[  311.947557] ec7e2f00 ab4c ec7e2f30 ec7e2f20
