Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-13 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote:

> I have measured the bh refcount before the buffer_uptodate() for a few days.
> I found out that the bh refcount sometimes reached to 0 .
> So, I think following modifications are effective.
> 
> diff -Nru 2.4.30-rc3/fs/jbd/commit.c 2.4.30-rc3_patch/fs/jbd/commit.c
> --- 2.4.30-rc3/fs/jbd/commit.c2005-04-06 17:14:47.0 +0900
> +++ 2.4.30-rc3_patch/fs/jbd/commit.c  2005-04-06 17:18:49.0 +0900
> @@ -302,11 +303,14 @@
>   if (unlikely(!buffer_uptodate(bh)))
>   err = -EIO;
>   /* the journal_head may have been removed now */
> + put_bh(bh);
>   lock_journal(journal);
>   goto write_out_data;

...

In further testing, this chunk is causing some problems.  It is entirely
legal for a buffer to be !uptodate here, although the path is somewhat
convoluted.  The trouble is, once the IO has completed,
journal_dirty_data() can steal the buffer from the committing
transaction.  And once that happens, journal_unmap_buffer() can then
release the buffer and clear the uptodate bit.  

I've run this under buffer tracing and can show you this exact path on a
trace.  It only takes a few seconds to reproduce under fsx.

For this to have happened, though, journal_dirty_data needs to have
waited on the data, so it has an opportunity to see the !uptodate state
at that point.

I think it's safe enough to sidestep this by double-checking to see
whether the bh is still on the committing transaction after we've waited
for the buffer and retaken journaling locks, but before testing
buffer_uptodate().

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-13 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote:

 I have measured the bh refcount before the buffer_uptodate() for a few days.
 I found out that the bh refcount sometimes reached to 0 .
 So, I think following modifications are effective.
 
 diff -Nru 2.4.30-rc3/fs/jbd/commit.c 2.4.30-rc3_patch/fs/jbd/commit.c
 --- 2.4.30-rc3/fs/jbd/commit.c2005-04-06 17:14:47.0 +0900
 +++ 2.4.30-rc3_patch/fs/jbd/commit.c  2005-04-06 17:18:49.0 +0900
 @@ -302,11 +303,14 @@
   if (unlikely(!buffer_uptodate(bh)))
   err = -EIO;
   /* the journal_head may have been removed now */
 + put_bh(bh);
   lock_journal(journal);
   goto write_out_data;

...

In further testing, this chunk is causing some problems.  It is entirely
legal for a buffer to be !uptodate here, although the path is somewhat
convoluted.  The trouble is, once the IO has completed,
journal_dirty_data() can steal the buffer from the committing
transaction.  And once that happens, journal_unmap_buffer() can then
release the buffer and clear the uptodate bit.  

I've run this under buffer tracing and can show you this exact path on a
trace.  It only takes a few seconds to reproduce under fsx.

For this to have happened, though, journal_dirty_data needs to have
waited on the data, so it has an opportunity to see the !uptodate state
at that point.

I think it's safe enough to sidestep this by double-checking to see
whether the bh is still on the committing transaction after we've waited
for the buffer and retaken journaling locks, but before testing
buffer_uptodate().

--Stephen

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-11 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-04-11 at 21:46, Andrew Morton wrote:
> "Stephen C. Tweedie" <[EMAIL PROTECTED]> wrote:
> >
> > Andrew, what was the exact illegal state of the pages you were seeing
> >  when fixing that recent leak?  It looks like it's nothing more complex
> >  than dirty buffers on an anon page.
> 
> Correct.

Good --- I think I've got the debugging solid against 2.4 for that case,
patching against 2.6 now.

> >  I think that simply calling
> >  try_to_release_page() for all the remaining buffers at umount time will
> 
> Presumably these pages have no ->mapping, so try_to_release_page() will
> call try_to_free_buffers().

Exactly.

> >  The only thing that would be required on
> >  top of that would be a check that the page is also on the VM LRU lists.
> 
> Why do we have dirty buffers left over at umount time?

In the leak case we're worried about, the buffers are dirty but the page
is anon so there's nothing to clean them up.  The buffers _will_ be
destroyed by unmount, sure; invalidate_bdev() should see to that.  But
I'm doing the bh leak check before we get to the final
invalidate_bdev(), so they should still be available for testing at that
point.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-11 Thread Andrew Morton
"Stephen C. Tweedie" <[EMAIL PROTECTED]> wrote:
>
> Andrew, what was the exact illegal state of the pages you were seeing
>  when fixing that recent leak?  It looks like it's nothing more complex
>  than dirty buffers on an anon page.

Correct.

>  I think that simply calling
>  try_to_release_page() for all the remaining buffers at umount time will

Presumably these pages have no ->mapping, so try_to_release_page() will
call try_to_free_buffers().

>  be enough to catch these; if that function fails, it tells us that the
>  VM can't reclaim these pages.

Yes, if the buffers are dirty then 2.4's try_to_free_buffers() won't free
them.

>  The only thing that would be required on
>  top of that would be a check that the page is also on the VM LRU lists.

Why do we have dirty buffers left over at umount time?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-11 Thread Stephen C. Tweedie
Hi,

On Thu, 2005-04-07 at 16:51, Stephen C. Tweedie wrote:

> I'm currently running with the buffer-trace debug patch, on 2.4, with a
> custom patch to put every buffer jbd ever sees onto a per-superblock
> list, and remove it only when the bh is destroyed in
> put_unused_buffer_head().  At unmount time, we can walk that list to
> find stale buffers attached to data pages (invalidate_buffers() already
> does such a walk for metadata.)

After a 50-process dbench run, unmount is showing ~300 buffers
unclaimed; that seems to be OK, it's a large stress test doing _lots_ of
create/unlink and we expect to see a few buffers around at the end where
they were truncated during commit and couldn't be instantly reclaimed.

The main thing now is to test these buffers to make 100% sure that they
are in a state where the VM can reclaim them.  That looks fine: the
buffers I see are unjournaled, have no jh, and are clean and on the
BUF_CLEAN lru.

Andrew, what was the exact illegal state of the pages you were seeing
when fixing that recent leak?  It looks like it's nothing more complex
than dirty buffers on an anon page.  I think that simply calling
try_to_release_page() for all the remaining buffers at umount time will
be enough to catch these; if that function fails, it tells us that the
VM can't reclaim these pages.  The only thing that would be required on
top of that would be a check that the page is also on the VM LRU lists.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-11 Thread Stephen C. Tweedie
Hi,

On Thu, 2005-04-07 at 16:51, Stephen C. Tweedie wrote:

 I'm currently running with the buffer-trace debug patch, on 2.4, with a
 custom patch to put every buffer jbd ever sees onto a per-superblock
 list, and remove it only when the bh is destroyed in
 put_unused_buffer_head().  At unmount time, we can walk that list to
 find stale buffers attached to data pages (invalidate_buffers() already
 does such a walk for metadata.)

After a 50-process dbench run, unmount is showing ~300 buffers
unclaimed; that seems to be OK, it's a large stress test doing _lots_ of
create/unlink and we expect to see a few buffers around at the end where
they were truncated during commit and couldn't be instantly reclaimed.

The main thing now is to test these buffers to make 100% sure that they
are in a state where the VM can reclaim them.  That looks fine: the
buffers I see are unjournaled, have no jh, and are clean and on the
BUF_CLEAN lru.

Andrew, what was the exact illegal state of the pages you were seeing
when fixing that recent leak?  It looks like it's nothing more complex
than dirty buffers on an anon page.  I think that simply calling
try_to_release_page() for all the remaining buffers at umount time will
be enough to catch these; if that function fails, it tells us that the
VM can't reclaim these pages.  The only thing that would be required on
top of that would be a check that the page is also on the VM LRU lists.

--Stephen

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-11 Thread Andrew Morton
Stephen C. Tweedie [EMAIL PROTECTED] wrote:

 Andrew, what was the exact illegal state of the pages you were seeing
  when fixing that recent leak?  It looks like it's nothing more complex
  than dirty buffers on an anon page.

Correct.

  I think that simply calling
  try_to_release_page() for all the remaining buffers at umount time will

Presumably these pages have no -mapping, so try_to_release_page() will
call try_to_free_buffers().

  be enough to catch these; if that function fails, it tells us that the
  VM can't reclaim these pages.

Yes, if the buffers are dirty then 2.4's try_to_free_buffers() won't free
them.

  The only thing that would be required on
  top of that would be a check that the page is also on the VM LRU lists.

Why do we have dirty buffers left over at umount time?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-11 Thread Stephen C. Tweedie
Hi,

On Mon, 2005-04-11 at 21:46, Andrew Morton wrote:
 Stephen C. Tweedie [EMAIL PROTECTED] wrote:
 
  Andrew, what was the exact illegal state of the pages you were seeing
   when fixing that recent leak?  It looks like it's nothing more complex
   than dirty buffers on an anon page.
 
 Correct.

Good --- I think I've got the debugging solid against 2.4 for that case,
patching against 2.6 now.

   I think that simply calling
   try_to_release_page() for all the remaining buffers at umount time will
 
 Presumably these pages have no -mapping, so try_to_release_page() will
 call try_to_free_buffers().

Exactly.

   The only thing that would be required on
   top of that would be a check that the page is also on the VM LRU lists.
 
 Why do we have dirty buffers left over at umount time?

In the leak case we're worried about, the buffers are dirty but the page
is anon so there's nothing to clean them up.  The buffers _will_ be
destroyed by unmount, sure; invalidate_bdev() should see to that.  But
I'm doing the bh leak check before we get to the final
invalidate_bdev(), so they should still be available for testing at that
point.

--Stephen

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-07 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 21:10, Stephen C. Tweedie wrote:

> However, 2.6 is suspected of still having leaks in ext3.  To be certain
> that we're not just backporting one of those to 2.4, we need to
> understand who exactly is going to clean up these bh's if they are in
> fact unused once we complete this code and do the final put_bh().  
> 
> I'll give this patch a spin with Andrew's fsx-based leak stress and see
> if anything unpleasant appears.  

I'm currently running with the buffer-trace debug patch, on 2.4, with a
custom patch to put every buffer jbd ever sees onto a per-superblock
list, and remove it only when the bh is destroyed in
put_unused_buffer_head().  At unmount time, we can walk that list to
find stale buffers attached to data pages (invalidate_buffers() already
does such a walk for metadata.)

I just ran it with a quick test and it found 300,000 buffers still
present at unmount.  Whoops, I guess I need to move the check to _after_
the final invalidate_inodes() call. :-)

But this method ought to allow us to test for this sort of leak a lot
more reliably, and to report via the usual buffer history tracing the
most recent activity on any bh that does leak.

Andrew, I'll give this a try on 2.6 once I've got this 2.4 patch tested
for Marcelo.  I've got uptodate buffer_trace for 2.6 anyway, so it
shouldn't be too hard.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-07 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 21:10, Stephen C. Tweedie wrote:

 However, 2.6 is suspected of still having leaks in ext3.  To be certain
 that we're not just backporting one of those to 2.4, we need to
 understand who exactly is going to clean up these bh's if they are in
 fact unused once we complete this code and do the final put_bh().  
 
 I'll give this patch a spin with Andrew's fsx-based leak stress and see
 if anything unpleasant appears.  

I'm currently running with the buffer-trace debug patch, on 2.4, with a
custom patch to put every buffer jbd ever sees onto a per-superblock
list, and remove it only when the bh is destroyed in
put_unused_buffer_head().  At unmount time, we can walk that list to
find stale buffers attached to data pages (invalidate_buffers() already
does such a walk for metadata.)

I just ran it with a quick test and it found 300,000 buffers still
present at unmount.  Whoops, I guess I need to move the check to _after_
the final invalidate_inodes() call. :-)

But this method ought to allow us to test for this sort of leak a lot
more reliably, and to report via the usual buffer history tracing the
most recent activity on any bh that does leak.

Andrew, I'll give this a try on 2.6 once I've got this 2.4 patch tested
for Marcelo.  I've got uptodate buffer_trace for 2.6 anyway, so it
shouldn't be too hard.

--Stephen

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-06 Thread Hifumi Hisashi
Hi,
At 23:20 05/04/06, Stephen C. Tweedie wrote:
>Yes.  But it is conventional to interpret a short write as being a
>failure.  Returning less bytes than were requested in the write
>indicates that the rest failed.  It just doesn't give the exact nature
>of the failure (EIO vs ENOSPC etc.)  For regular files, a short write is
>never permitted unless there are errors of some description.
When commit_write() FULLY succeed (requested bytes == returned bytes) but
generic_osync_inode() return error due to I/O failure, current 
do_generic_file_write() cannot
return error. I encountered above situation a lot under an I/O trouble 
condition .

In ver 2.6.11, the return value of generic_osync_inode() is returned 
directry to user
when I/O failure occur.

thanks.
  

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-06 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote:

> I have measured the bh refcount before the buffer_uptodate() for a few days.
> I found out that the bh refcount sometimes reached to 0 .
> So, I think following modifications are effective.

Thanks --- it certainly looks like this should fix the dereferencing of
invalid bh's, and this code mirrors what 2.6 does around that area.

However, 2.6 is suspected of still having leaks in ext3.  To be certain
that we're not just backporting one of those to 2.4, we need to
understand who exactly is going to clean up these bh's if they are in
fact unused once we complete this code and do the final put_bh().  

I'll give this patch a spin with Andrew's fsx-based leak stress and see
if anything unpleasant appears.  

Cheers,
 Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-06 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote:

>  >Certainly it's normal for a short read/write to imply either error or
>  >EOF, without the error necessarily needing to be returned explicitly.
>  >I'm not convinced that the Singleunix language actually requires that,
>  >but it seems the most obvious and consistent behaviour.

> When an O_SYNC flag is set , if commit_write() succeed but 
> generic_osync_inode() return
> error due to I/O failure, write() must fail .

Yes.  But it is conventional to interpret a short write as being a
failure.  Returning less bytes than were requested in the write
indicates that the rest failed.  It just doesn't give the exact nature
of the failure (EIO vs ENOSPC etc.)  For regular files, a short write is
never permitted unless there are errors of some description.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-06 Thread Hifumi Hisashi
Hi.
At 07:40 05/04/06, Stephen C. Tweedie wrote:
>Sorry, was offline for a week last week; I'll try to look at this more
>closely tomorrow.  Checking the buffer_uptodate() without either a
>refcount or a lock certainly looks unsafe at first glance.
>
>There are lots of ways to pin the bh in that particular bit of the
>code.  The important thing will be to do so without causing leaks if
>we're truly finished with the buffer after this flush.
>
I have measured the bh refcount before the buffer_uptodate() for a few days.
I found out that the bh refcount sometimes reached to 0 .
So, I think following modifications are effective.
diff -Nru 2.4.30-rc3/fs/jbd/commit.c 2.4.30-rc3_patch/fs/jbd/commit.c
--- 2.4.30-rc3/fs/jbd/commit.c  2005-04-06 17:14:47.0 +0900
+++ 2.4.30-rc3_patch/fs/jbd/commit.c2005-04-06 17:18:49.0 +0900
@@ -295,6 +295,7 @@
struct buffer_head *bh;
jh = jh->b_tprev;/* Wait on the last written */
bh = jh2bh(jh);
+   get_bh(bh);
if (buffer_locked(bh)) {
spin_unlock(_datalist_lock);
unlock_journal(journal);
@@ -302,11 +303,14 @@
if (unlikely(!buffer_uptodate(bh)))
err = -EIO;
/* the journal_head may have been removed now */
+   put_bh(bh);
lock_journal(journal);
goto write_out_data;
} else if (buffer_dirty(bh)) {
+   put_bh(bh);
goto write_out_data_locked;
}
+   put_bh(bh);
} while (jh != commit_transaction->t_sync_datalist);
goto write_out_data_locked;

>
>> > If some of the write succeeded and some failed, then I believe the
>> > correct behaviour is to return the number of bytes that succeeded.
>> > However this change to the return status (remember the above patch is
>> > a reversal) causes any failure to over-ride any success. This, I
>> > think, is wrong.
>>
>> Yeap, that part also looks wrong.
>
>Certainly it's normal for a short read/write to imply either error or
>EOF, without the error necessarily needing to be returned explicitly.
>I'm not convinced that the Singleunix language actually requires that,
>but it seems the most obvious and consistent behaviour.
>
>--Stephen
When an O_SYNC flag is set , if commit_write() succeed but 
generic_osync_inode() return
error due to I/O failure, write() must fail .

I think that following error handling code is rational in 
do_generic_file_write() .

if (file->f_flags & O_SYNC)
err = (status < 0) ? status : written;
else
err = written ? written : status;
out:
return err;
Thanks. 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-06 Thread Hifumi Hisashi
Hi.
At 07:40 05/04/06, Stephen C. Tweedie wrote:
Sorry, was offline for a week last week; I'll try to look at this more
closely tomorrow.  Checking the buffer_uptodate() without either a
refcount or a lock certainly looks unsafe at first glance.

There are lots of ways to pin the bh in that particular bit of the
code.  The important thing will be to do so without causing leaks if
we're truly finished with the buffer after this flush.

I have measured the bh refcount before the buffer_uptodate() for a few days.
I found out that the bh refcount sometimes reached to 0 .
So, I think following modifications are effective.
diff -Nru 2.4.30-rc3/fs/jbd/commit.c 2.4.30-rc3_patch/fs/jbd/commit.c
--- 2.4.30-rc3/fs/jbd/commit.c  2005-04-06 17:14:47.0 +0900
+++ 2.4.30-rc3_patch/fs/jbd/commit.c2005-04-06 17:18:49.0 +0900
@@ -295,6 +295,7 @@
struct buffer_head *bh;
jh = jh-b_tprev;/* Wait on the last written */
bh = jh2bh(jh);
+   get_bh(bh);
if (buffer_locked(bh)) {
spin_unlock(journal_datalist_lock);
unlock_journal(journal);
@@ -302,11 +303,14 @@
if (unlikely(!buffer_uptodate(bh)))
err = -EIO;
/* the journal_head may have been removed now */
+   put_bh(bh);
lock_journal(journal);
goto write_out_data;
} else if (buffer_dirty(bh)) {
+   put_bh(bh);
goto write_out_data_locked;
}
+   put_bh(bh);
} while (jh != commit_transaction-t_sync_datalist);
goto write_out_data_locked;


  If some of the write succeeded and some failed, then I believe the
  correct behaviour is to return the number of bytes that succeeded.
  However this change to the return status (remember the above patch is
  a reversal) causes any failure to over-ride any success. This, I
  think, is wrong.

 Yeap, that part also looks wrong.

Certainly it's normal for a short read/write to imply either error or
EOF, without the error necessarily needing to be returned explicitly.
I'm not convinced that the Singleunix language actually requires that,
but it seems the most obvious and consistent behaviour.

--Stephen
When an O_SYNC flag is set , if commit_write() succeed but 
generic_osync_inode() return
error due to I/O failure, write() must fail .

I think that following error handling code is rational in 
do_generic_file_write() .

if (file-f_flags  O_SYNC)
err = (status  0) ? status : written;
else
err = written ? written : status;
out:
return err;
Thanks. 

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-06 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote:

  Certainly it's normal for a short read/write to imply either error or
  EOF, without the error necessarily needing to be returned explicitly.
  I'm not convinced that the Singleunix language actually requires that,
  but it seems the most obvious and consistent behaviour.

 When an O_SYNC flag is set , if commit_write() succeed but 
 generic_osync_inode() return
 error due to I/O failure, write() must fail .

Yes.  But it is conventional to interpret a short write as being a
failure.  Returning less bytes than were requested in the write
indicates that the rest failed.  It just doesn't give the exact nature
of the failure (EIO vs ENOSPC etc.)  For regular files, a short write is
never permitted unless there are errors of some description.

--Stephen

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-06 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote:

 I have measured the bh refcount before the buffer_uptodate() for a few days.
 I found out that the bh refcount sometimes reached to 0 .
 So, I think following modifications are effective.

Thanks --- it certainly looks like this should fix the dereferencing of
invalid bh's, and this code mirrors what 2.6 does around that area.

However, 2.6 is suspected of still having leaks in ext3.  To be certain
that we're not just backporting one of those to 2.4, we need to
understand who exactly is going to clean up these bh's if they are in
fact unused once we complete this code and do the final put_bh().  

I'll give this patch a spin with Andrew's fsx-based leak stress and see
if anything unpleasant appears.  

Cheers,
 Stephen

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-06 Thread Hifumi Hisashi
Hi,
At 23:20 05/04/06, Stephen C. Tweedie wrote:
Yes.  But it is conventional to interpret a short write as being a
failure.  Returning less bytes than were requested in the write
indicates that the rest failed.  It just doesn't give the exact nature
of the failure (EIO vs ENOSPC etc.)  For regular files, a short write is
never permitted unless there are errors of some description.
When commit_write() FULLY succeed (requested bytes == returned bytes) but
generic_osync_inode() return error due to I/O failure, current 
do_generic_file_write() cannot
return error. I encountered above situation a lot under an I/O trouble 
condition .

In ver 2.6.11, the return value of generic_osync_inode() is returned 
directry to user
when I/O failure occur.

thanks.
  

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-05 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-03-30 at 12:59, Marcelo Tosatti wrote:

> > I'm not certain that this is right, but it seems possible and would
> > explain the symptoms.  Maybe Stephen or Andrew could comments?
> 
> Andrew, Stephen?

Sorry, was offline for a week last week; I'll try to look at this more
closely tomorrow.  Checking the buffer_uptodate() without either a
refcount or a lock certainly looks unsafe at first glance.

There are lots of ways to pin the bh in that particular bit of the
code.  The important thing will be to do so without causing leaks if
we're truly finished with the buffer after this flush.


> > If some of the write succeeded and some failed, then I believe the
> > correct behaviour is to return the number of bytes that succeeded.
> > However this change to the return status (remember the above patch is
> > a reversal) causes any failure to over-ride any success. This, I
> > think, is wrong.
> 
> Yeap, that part also looks wrong.

Certainly it's normal for a short read/write to imply either error or
EOF, without the error necessarily needing to be returned explicitly. 
I'm not convinced that the Singleunix language actually requires that,
but it seems the most obvious and consistent behaviour.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-04-05 Thread Stephen C. Tweedie
Hi,

On Wed, 2005-03-30 at 12:59, Marcelo Tosatti wrote:

  I'm not certain that this is right, but it seems possible and would
  explain the symptoms.  Maybe Stephen or Andrew could comments?
 
 Andrew, Stephen?

Sorry, was offline for a week last week; I'll try to look at this more
closely tomorrow.  Checking the buffer_uptodate() without either a
refcount or a lock certainly looks unsafe at first glance.

There are lots of ways to pin the bh in that particular bit of the
code.  The important thing will be to do so without causing leaks if
we're truly finished with the buffer after this flush.


  If some of the write succeeded and some failed, then I believe the
  correct behaviour is to return the number of bytes that succeeded.
  However this change to the return status (remember the above patch is
  a reversal) causes any failure to over-ride any success. This, I
  think, is wrong.
 
 Yeap, that part also looks wrong.

Certainly it's normal for a short read/write to imply either error or
EOF, without the error necessarily needing to be returned explicitly. 
I'm not convinced that the Singleunix language actually requires that,
but it seems the most obvious and consistent behaviour.

--Stephen

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-03-30 Thread Marcelo Tosatti
On Wed, Mar 30, 2005 at 02:06:39PM +1000, Neil Brown wrote:
> On Tuesday March 29, [EMAIL PROTECTED] wrote:
> > 
> > Attached is the backout patch, for convenience.
> 
> Thanks.  I had another look, and think I may be able to see the
> problem.  If I'm right, it is a problem with this patch.
> 
> > diff -Nru a/fs/jbd/commit.c b/fs/jbd/commit.c
> > --- a/fs/jbd/commit.c   2005-03-29 18:50:55 -03:00
> > +++ b/fs/jbd/commit.c   2005-03-29 18:50:55 -03:00
> > @@ -92,7 +92,7 @@
> > struct buffer_head *wbuf[64];
> > int bufs;
> > int flags;
> > -   int err = 0;
> > +   int err;
> > unsigned long blocknr;
> > char *tagp = NULL;
> > journal_header_t *header;
> > @@ -299,8 +299,6 @@
> > spin_unlock(_datalist_lock);
> > unlock_journal(journal);
> > wait_on_buffer(bh);
> > -   if (unlikely(!buffer_uptodate(bh)))
> > -   err = -EIO;
> > /* the journal_head may have been removed now */
> > lock_journal(journal);
> > goto write_out_data;
> 
> 
> I think the "!buffer_update(bh)" test is not safe at this point as,
> after the wait_on_buffer which could cause a schedule, the bh may
> no longer exist, or be for the same block.  There doesn't seem to be
> any locking or refcounting that would keep it valid.
> 
> Note the comment "the journal_head may have been removed now".
> If the journal_head is gone, the associated buffer_head is likely gone
> as well. 

Seems to be possible, yes.

> I'm not certain that this is right, but it seems possible and would
> explain the symptoms.  Maybe Stephen or Andrew could comments?

Andrew, Stephen?

> > --- a/mm/filemap.c  2005-03-29 18:50:55 -03:00
> > +++ b/mm/filemap.c  2005-03-29 18:50:55 -03:00
> > @@ -3261,12 +3261,7 @@
> > status = generic_osync_inode(inode, 
> > OSYNC_METADATA|OSYNC_DATA);
> > }
> > 
> > -   /*
> > -* generic_osync_inode always returns 0 or negative value.
> > -* So 'status < written' is always true, and written should
> > -* be returned if status >= 0.
> > -*/
> > -   err = (status < 0) ? status : written;
> > +   err = written ? written : status;
> >  out:
> >  
> > return err;
> 
> As an aside, this looks extremely dubious to me.
> 
> There is a loop earlier in this routine (do_generic_file_write) that
> passes a piece-at-a-time of the write request to prepare_write /
> commit_write.
> Successes are counted in 'written'.  A failure causes the loop to
> abort with a status in 'status'.
> 
> If some of the write succeeded and some failed, then I believe the
> correct behaviour is to return the number of bytes that succeeded.
> However this change to the return status (remember the above patch is
> a reversal) causes any failure to over-ride any success. This, I
> think, is wrong.

Yeap, that part also looks wrong.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-03-30 Thread Marcelo Tosatti
On Wed, Mar 30, 2005 at 02:06:39PM +1000, Neil Brown wrote:
 On Tuesday March 29, [EMAIL PROTECTED] wrote:
  
  Attached is the backout patch, for convenience.
 
 Thanks.  I had another look, and think I may be able to see the
 problem.  If I'm right, it is a problem with this patch.
 
  diff -Nru a/fs/jbd/commit.c b/fs/jbd/commit.c
  --- a/fs/jbd/commit.c   2005-03-29 18:50:55 -03:00
  +++ b/fs/jbd/commit.c   2005-03-29 18:50:55 -03:00
  @@ -92,7 +92,7 @@
  struct buffer_head *wbuf[64];
  int bufs;
  int flags;
  -   int err = 0;
  +   int err;
  unsigned long blocknr;
  char *tagp = NULL;
  journal_header_t *header;
  @@ -299,8 +299,6 @@
  spin_unlock(journal_datalist_lock);
  unlock_journal(journal);
  wait_on_buffer(bh);
  -   if (unlikely(!buffer_uptodate(bh)))
  -   err = -EIO;
  /* the journal_head may have been removed now */
  lock_journal(journal);
  goto write_out_data;
 
 
 I think the !buffer_update(bh) test is not safe at this point as,
 after the wait_on_buffer which could cause a schedule, the bh may
 no longer exist, or be for the same block.  There doesn't seem to be
 any locking or refcounting that would keep it valid.
 
 Note the comment the journal_head may have been removed now.
 If the journal_head is gone, the associated buffer_head is likely gone
 as well. 

Seems to be possible, yes.

 I'm not certain that this is right, but it seems possible and would
 explain the symptoms.  Maybe Stephen or Andrew could comments?

Andrew, Stephen?

  --- a/mm/filemap.c  2005-03-29 18:50:55 -03:00
  +++ b/mm/filemap.c  2005-03-29 18:50:55 -03:00
  @@ -3261,12 +3261,7 @@
  status = generic_osync_inode(inode, 
  OSYNC_METADATA|OSYNC_DATA);
  }
  
  -   /*
  -* generic_osync_inode always returns 0 or negative value.
  -* So 'status  written' is always true, and written should
  -* be returned if status = 0.
  -*/
  -   err = (status  0) ? status : written;
  +   err = written ? written : status;
   out:
   
  return err;
 
 As an aside, this looks extremely dubious to me.
 
 There is a loop earlier in this routine (do_generic_file_write) that
 passes a piece-at-a-time of the write request to prepare_write /
 commit_write.
 Successes are counted in 'written'.  A failure causes the loop to
 abort with a status in 'status'.
 
 If some of the write succeeded and some failed, then I believe the
 correct behaviour is to return the number of bytes that succeeded.
 However this change to the return status (remember the above patch is
 a reversal) causes any failure to over-ride any success. This, I
 think, is wrong.

Yeap, that part also looks wrong.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-03-29 Thread Neil Brown
On Tuesday March 29, [EMAIL PROTECTED] wrote:
> 
> Attached is the backout patch, for convenience.

Thanks.  I had another look, and think I may be able to see the
problem.  If I'm right, it is a problem with this patch.

> diff -Nru a/fs/jbd/commit.c b/fs/jbd/commit.c
> --- a/fs/jbd/commit.c 2005-03-29 18:50:55 -03:00
> +++ b/fs/jbd/commit.c 2005-03-29 18:50:55 -03:00
> @@ -92,7 +92,7 @@
>   struct buffer_head *wbuf[64];
>   int bufs;
>   int flags;
> - int err = 0;
> + int err;
>   unsigned long blocknr;
>   char *tagp = NULL;
>   journal_header_t *header;
> @@ -299,8 +299,6 @@
>   spin_unlock(_datalist_lock);
>   unlock_journal(journal);
>   wait_on_buffer(bh);
> - if (unlikely(!buffer_uptodate(bh)))
> - err = -EIO;
>   /* the journal_head may have been removed now */
>   lock_journal(journal);
>   goto write_out_data;


I think the "!buffer_update(bh)" test is not safe at this point as,
after the wait_on_buffer which could cause a schedule, the bh may
no longer exist, or be for the same block.  There doesn't seem to be
any locking or refcounting that would keep it valid.

Note the comment "the journal_head may have been removed now".
If the journal_head is gone, the associated buffer_head is likely gone
as well. 

I'm not certain that this is right, but it seems possible and would
explain the symptoms.  Maybe Stephen or Andrew could comments?


> --- a/mm/filemap.c2005-03-29 18:50:55 -03:00
> +++ b/mm/filemap.c2005-03-29 18:50:55 -03:00
> @@ -3261,12 +3261,7 @@
>   status = generic_osync_inode(inode, 
> OSYNC_METADATA|OSYNC_DATA);
>   }
>   
> - /*
> -  * generic_osync_inode always returns 0 or negative value.
> -  * So 'status < written' is always true, and written should
> -  * be returned if status >= 0.
> -  */
> - err = (status < 0) ? status : written;
> + err = written ? written : status;
>  out:
>  
>   return err;

As an aside, this looks extremely dubious to me.

There is a loop earlier in this routine (do_generic_file_write) that
passes a piece-at-a-time of the write request to prepare_write /
commit_write.
Successes are counted in 'written'.  A failure causes the loop to
abort with a status in 'status'.

If some of the write succeeded and some failed, then I believe the
correct behaviour is to return the number of bytes that succeeded.
However this change to the return status (remember the above patch is
a reversal) causes any failure to over-ride any success. This, I
think, is wrong.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

2005-03-29 Thread Neil Brown
On Tuesday March 29, [EMAIL PROTECTED] wrote:
 
 Attached is the backout patch, for convenience.

Thanks.  I had another look, and think I may be able to see the
problem.  If I'm right, it is a problem with this patch.

 diff -Nru a/fs/jbd/commit.c b/fs/jbd/commit.c
 --- a/fs/jbd/commit.c 2005-03-29 18:50:55 -03:00
 +++ b/fs/jbd/commit.c 2005-03-29 18:50:55 -03:00
 @@ -92,7 +92,7 @@
   struct buffer_head *wbuf[64];
   int bufs;
   int flags;
 - int err = 0;
 + int err;
   unsigned long blocknr;
   char *tagp = NULL;
   journal_header_t *header;
 @@ -299,8 +299,6 @@
   spin_unlock(journal_datalist_lock);
   unlock_journal(journal);
   wait_on_buffer(bh);
 - if (unlikely(!buffer_uptodate(bh)))
 - err = -EIO;
   /* the journal_head may have been removed now */
   lock_journal(journal);
   goto write_out_data;


I think the !buffer_update(bh) test is not safe at this point as,
after the wait_on_buffer which could cause a schedule, the bh may
no longer exist, or be for the same block.  There doesn't seem to be
any locking or refcounting that would keep it valid.

Note the comment the journal_head may have been removed now.
If the journal_head is gone, the associated buffer_head is likely gone
as well. 

I'm not certain that this is right, but it seems possible and would
explain the symptoms.  Maybe Stephen or Andrew could comments?


 --- a/mm/filemap.c2005-03-29 18:50:55 -03:00
 +++ b/mm/filemap.c2005-03-29 18:50:55 -03:00
 @@ -3261,12 +3261,7 @@
   status = generic_osync_inode(inode, 
 OSYNC_METADATA|OSYNC_DATA);
   }
   
 - /*
 -  * generic_osync_inode always returns 0 or negative value.
 -  * So 'status  written' is always true, and written should
 -  * be returned if status = 0.
 -  */
 - err = (status  0) ? status : written;
 + err = written ? written : status;
  out:
  
   return err;

As an aside, this looks extremely dubious to me.

There is a loop earlier in this routine (do_generic_file_write) that
passes a piece-at-a-time of the write request to prepare_write /
commit_write.
Successes are counted in 'written'.  A failure causes the loop to
abort with a status in 'status'.

If some of the write succeeded and some failed, then I believe the
correct behaviour is to return the number of bytes that succeeded.
However this change to the return status (remember the above patch is
a reversal) causes any failure to over-ride any success. This, I
think, is wrong.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/