Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-11 Thread Hugh Dickins
On Thu, 11 Oct 2007, Ryan Finnie wrote:
> On 10/11/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > shit.  That's a nasty bug.  Really userspace should be testing for -1, but
> > the msync() library function should only ever return 0 or -1.
> >
> > Does this fix it?
> >
> > --- a/mm/page-writeback.c~a
> > +++ a/mm/page-writeback.c
> > @@ -850,8 +850,10 @@ retry:
> >
> > ret = (*writepage)(page, wbc, data);
> >
> > -   if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE))
> > +   if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> > unlock_page(page);
> > +   ret = 0;
> > +   }
> > if (ret || (--(wbc->nr_to_write) <= 0))
> > done = 1;
> > if (wbc->nonblocking && bdi_write_congested(bdi)) {
> > _
> >
> 
> Pekka Enberg replied with an identical patch a few days ago, but for
> some reason the same condition flows up to msync as -1 EIO instead of
> AOP_WRITEPAGE_ACTIVATE with that patch applied.  The last part of the
> thread is below.  Thanks.

Each time I sit down to follow what's going on with writepage and
unionfs and msync, I get distracted: I really haven't researched
this properly.

But I keep suspecting that the answer might be the patch below (which
rather follows what drivers/block/rd.c is doing).  I'm especially
worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned
to userspace, bad enough in itself, you might be liable to hit that
BUG_ON(page_mapped(page)).  shmem_writepage does not expect to be
called by anyone outside mm/vmscan.c, but unionfs can now get to it?

Please let us know if this patch does fix it:
then I'll try harder to work out what goes on.

Thanks,
Hugh

--- 2.6.23/mm/shmem.c   2007-10-09 21:31:38.0 +0100
+++ linux/mm/shmem.c2007-10-12 01:25:46.0 +0100
@@ -916,6 +916,11 @@ static int shmem_writepage(struct page *
struct inode *inode;
 
BUG_ON(!PageLocked(page));
+   if (!wbc->for_reclaim) {
+   set_page_dirty(page);
+   unlock_page(page);
+   return 0;
+   }
BUG_ON(page_mapped(page));
 
mapping = page->mapping;
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: XFS regression?

2007-10-11 Thread David Chinner
On Fri, Oct 12, 2007 at 07:53:53AM +1000, David Chinner wrote:
> On Thu, Oct 11, 2007 at 03:15:12PM +0100, Andrew Clayton wrote:
> > On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:
> > 
> > > So it's almost certainly pointing at an elevator or driver change, not an
> > > XFS change.
> > 
> > heh, git bisect begs to differ :)
> > 
> > 4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit commit
> > 4c60658e0f4e253cf275f12b7c76bf128515a774 Author: David Chinner <[EMAIL 
> > PROTECTED]>
> > Date:   Sat Nov 11 18:05:00 2006 +1100
> > 
> > [XFS] Prevent a deadlock when xfslogd unpins inodes.
> 
> Oh, of course - I failed to notice the significance of
> this loop in your test:
> 
>   while [foo]; do
>   touch fred
>   rm fred
>   done
> 
> The inode allocator keeps reusing the same inode.  If the
> transaction that did the unlink has not hit the disk before we
> allocate the inode again, we have to force the log to get the unlink
> transaction to disk to get the xfs inode unpinned (i.e. able to be
> modified in memory again).
> 
> It's the log force I/O that's introducing the latency.
> 
> If we don't force the log, then we have a possible use-after free
> of the linux inode because of a fundamental mismatch between
> the XFS inode life cycle and the linux inode life cycle. The
> use-after free only occurs on large machines under heavy, heavy
> metadata load to many disks and filesystems (requires enough
> traffic to overload an xfslogd) and is very difficult to
> reproduce (large machine, lots of disks and 20-30 hours MTTF).
> 
> I'll have a look at other ways to solve this problem, but it
> took 6 months to find a solution to the race in the first place
> so don't hold your breath.

You can breath again. Here's a test patch (warning - may harm
kittens - not fully tested or verified) that solves both
the use-after-free issue (by avoiding it altogether) as well the
unlink/create latency because the log force is no longer there.

(yay! xfsqa test 016 passes again ;)

It does have other possible side effects triggering extra
log forces elsewhere on inode writeback and affects sync behaviour
so it's only a proof of concept at this point.

The latency output looks like:



open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000281>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000184>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000188>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000182>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000225>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000182>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000185>



open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000405>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000224>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000199>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000163>



open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000145>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.817704>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000148>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000143>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000154>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000147>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000158>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000144>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000379>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000151>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000190>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000191>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000150>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.797099>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000139>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 
<0.000163>



So we still see some operations show high latency - that will most likely
be due to the allocation transaction filling a log buffer and pushing it
to disk, then having to wait because I/O is in progress on all other log
buffers.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


---
 fs/xfs/linux-2.6/xfs_iops.c |   16 
 fs/xfs/xfs_iget.c   |   18 --
 fs/xfs/xfs_inod

Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-11 Thread Ryan Finnie
On 10/11/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> shit.  That's a nasty bug.  Really userspace should be testing for -1, but
> the msync() library function should only ever return 0 or -1.
>
> Does this fix it?
>
> --- a/mm/page-writeback.c~a
> +++ a/mm/page-writeback.c
> @@ -850,8 +850,10 @@ retry:
>
> ret = (*writepage)(page, wbc, data);
>
> -   if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE))
> +   if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> unlock_page(page);
> +   ret = 0;
> +   }
> if (ret || (--(wbc->nr_to_write) <= 0))
> done = 1;
> if (wbc->nonblocking && bdi_write_congested(bdi)) {
> _
>

Pekka Enberg replied with an identical patch a few days ago, but for
some reason the same condition flows up to msync as -1 EIO instead of
AOP_WRITEPAGE_ACTIVATE with that patch applied.  The last part of the
thread is below.  Thanks.

Ryan

On 10/7/07, Ryan Finnie <[EMAIL PROTECTED]> wrote:
> On 10/7/07, Pekka J Enberg <[EMAIL PROTECTED]> wrote:
> > On 10/7/07, Erez Zadok <[EMAIL PROTECTED]> wrote:
> > > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes
> > > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland.
> > > Therefore, some user programs fail, esp. if they're written such as
> > > this:
> >
> ...
> > It's a kernel bug. AOP_WRITEPAGE_ACTIVATE is a hint to the VM to avoid
> > writeback of the page in the near future. I wonder if it's enough that we
> > change the return value to zero from
> > mm/page-writeback.c:write_cache_pages() in case we hit 
> > AOP_WRITEPAGE_ACTIVE...
>
> Doesn't appear to be enough.  I can't figure out why (since it appears
> write_cache_pages bubbles up directly to sys_msync), but with that
> patch applied, in my test case[1], msync returns -1 EIO.  However,
> with the exact same kernel without that patch applied, msync returns
> 524288 (AOP_WRITEPAGE_ACTIVATE).  But as your patch specifically flips
> 524288 to 0, I can't figure out how it eventually returns  -1 EIO.
>
> Ryan
>
> [1] "apt-get check" on a unionfs2 mount backed by tmpfs over cdrom,
> standard livecd setup
>
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: XFS regression?

2007-10-11 Thread David Chinner
On Thu, Oct 11, 2007 at 03:15:12PM +0100, Andrew Clayton wrote:
> On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:
> 
> > So it's almost certainly pointing at an elevator or driver change, not an
> > XFS change.
> 
> heh, git bisect begs to differ :)
> 
> 4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit commit
> 4c60658e0f4e253cf275f12b7c76bf128515a774 Author: David Chinner <[EMAIL 
> PROTECTED]>
> Date:   Sat Nov 11 18:05:00 2006 +1100
> 
> [XFS] Prevent a deadlock when xfslogd unpins inodes.

Oh, of course - I failed to notice the significance of
this loop in your test:

while [foo]; do
touch fred
rm fred
done

The inode allocator keeps reusing the same inode.  If the
transaction that did the unlink has not hit the disk before we
allocate the inode again, we have to force the log to get the unlink
transaction to disk to get the xfs inode unpinned (i.e. able to be
modified in memory again).

It's the log force I/O that's introducing the latency.

If we don't force the log, then we have a possible use-after free
of the linux inode because of a fundamental mismatch between
the XFS inode life cycle and the linux inode life cycle. The
use-after free only occurs on large machines under heavy, heavy
metadata load to many disks and filesystems (requires enough
traffic to overload an xfslogd) and is very difficult to
reproduce (large machine, lots of disks and 20-30 hours MTTF).

I'll have a look at other ways to solve this problem, but it
took 6 months to find a solution to the race in the first place
so don't hold your breath.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-11 Thread Andrew Morton
On Sun, 7 Oct 2007 15:20:19 -0400
Erez Zadok <[EMAIL PROTECTED]> wrote:

> According to vfs.txt, ->writepage() may return AOP_WRITEPAGE_ACTIVATE back
> to the VFS/VM.  Indeed some filesystems such as tmpfs can return
> AOP_WRITEPAGE_ACTIVATE; and stackable file systems (e.g., Unionfs) also
> return AOP_WRITEPAGE_ACTIVATE if the lower f/s returned it.
> 
> Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes
> returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland.
> Therefore, some user programs fail, esp. if they're written such as this:
> 
>  err = msync(...);
>  if (err != 0)
>   // fail
> 
> They temporarily fixed the specific program in question (apt-get) to check
> 
>  if (err < 0)
>   // fail
> 
> Is this a bug indeed, or are user programs supposed to handle
> AOP_WRITEPAGE_ACTIVATE (I hope not the latter).  If it's a kernel bug, what
> should the kernel return: a zero, or an -errno (and which one)?
> 

shit.  That's a nasty bug.  Really userspace should be testing for -1, but
the msync() library function should only ever return 0 or -1.

Does this fix it?

--- a/mm/page-writeback.c~a
+++ a/mm/page-writeback.c
@@ -850,8 +850,10 @@ retry:
 
ret = (*writepage)(page, wbc, data);
 
-   if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE))
+   if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
unlock_page(page);
+   ret = 0;
+   }
if (ret || (--(wbc->nr_to_write) <= 0))
done = 1;
if (wbc->nonblocking && bdi_write_congested(bdi)) {
_

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: XFS regression?

2007-10-11 Thread Andrew Clayton
On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:

> So it's almost certainly pointing at an elevator or driver change, not
> an XFS change.

heh, git bisect begs to differ :)

4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit
commit 4c60658e0f4e253cf275f12b7c76bf128515a774
Author: David Chinner <[EMAIL PROTECTED]>
Date:   Sat Nov 11 18:05:00 2006 +1100

[XFS] Prevent a deadlock when xfslogd unpins inodes.

The previous fixes for the use after free in xfs_iunpin left a nasty log
deadlock when xfslogd unpinned the inode and dropped the last reference to
the inode. the ->clear_inode() method can issue transactions, and if the
log was full, the transaction could push on the log and get stuck trying
to push the inode it was currently unpinning.

To fix this, we provide xfs_iunpin a guarantee that it will always have a
valid xfs_inode <-> linux inode link or a particular flag will be set on
the inode. We then use log forces during lookup to ensure transactions are
completed before we recycle the inode. This ensures that xfs_iunpin will
never use the linux inode after it is being freed, and any lookup on an
inode on the reclaim list will wait until it is safe to attach a new linux
inode to the xfs inode.

SGI-PV: 956832
SGI-Modid: xfs-linux-melb:xfs-kern:27359a

Signed-off-by: David Chinner <[EMAIL PROTECTED]>
Signed-off-by: Shailendra Tripathi <[EMAIL PROTECTED]>
Signed-off-by: Takenori Nagano <[EMAIL PROTECTED]>
Signed-off-by: Tim Shimmin <[EMAIL PROTECTED]>

:04 04 1c47ba44bc404456c87c5a493d543a8d30696b88 
92319b34585d03c64e53890a80e550b579a0363d M  fs


If you'd like more info, don't hesitate to ask.

> Cheers,
> 
> dave.

Cheers,

Andrew
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: XFS regression?

2007-10-11 Thread Andrew Clayton
On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:

 
> Latencies are an order of magnitude lower at 60-70ms because the disks
> have less deep queues. This is expected - deep queues and multiple
> outstanding I/Os are the enemy of single I/O latency
> 
> If I remount with barriers enabled, the latency at nr_requests=128
> goes up to a consistent 2.2s. Not surprising, we're flushing the drive
> cache very regularly now and it points to the create or truncate
> transaction having to pushing log buffers to disk. The latency remains
> at 70-80ms at nr_requests=4.

Thanks for the info. I did try fiddling nr_requests but I made it
bigger. I'll try with it lower.
 
> > It seems this problem was introduced between 2.6.18 and 2.6.19. 
> 
> When the new SATA driver infrastructure was introduced. Do you have
> NCQ enabled on more recent kernels and not on 2.6.18? If so, try
> disabling it and see if the problem goes away

Unfortunately the drives in the file server don't support NCQ. Not sure
if it's supported in the machine I was testing on (it's certainly a few
years old).
 
> > The other thing I've found is that if I do the dd to an ext3 fs (on
> > the same disk at least) while running the test in the XFS fs then I
> > also see the latencies.
> 
> So it's almost certainly pointing at an elevator or driver change, not
> an XFS change.

OK, though it doesn't seem to effect ext3. I'm going to run a git
bisect to see what it comes up with.

> Cheers,
> 
> dave.

Cheers,

Andrew
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/31] IGET: Stop BFS from using iget() and read_inode() [try #3]

2007-10-11 Thread David Howells
Roel Kluin <[EMAIL PROTECTED]> wrote:

> > +   if (IS_ERR(inode))
> > +   return ERR_PTR(-ENOMEM);
> > +   if (!(inode->i_state & I_NEW))
> > +   return inode;
> 
> Don't you have to unlock_new_inode(inode) before returning?

In the first case, no because an OOM error was returned rather than an inode,
and in the second case, no because an extant non-locked inode was returned.

The inode is only returned locked by iget_locked() if it is also new - in
which case we don't return in either of the above two statements.

> > +error:
> 
> and also here?
> 
> > +   iget_failed(inode);
> > +   return ERR_PTR(-EIO);

Take a look at what iget_failed() does (patch #3):

void iget_failed(struct inode *inode)
{
make_bad_inode(inode);
unlock_new_inode(inode);
iput(inode);
}

David
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html