Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Thu, 11 Oct 2007, Ryan Finnie wrote: > On 10/11/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > > shit. That's a nasty bug. Really userspace should be testing for -1, but > > the msync() library function should only ever return 0 or -1. > > > > Does this fix it? > > > > --- a/mm/page-writeback.c~a > > +++ a/mm/page-writeback.c > > @@ -850,8 +850,10 @@ retry: > > > > ret = (*writepage)(page, wbc, data); > > > > - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) > > + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { > > unlock_page(page); > > + ret = 0; > > + } > > if (ret || (--(wbc->nr_to_write) <= 0)) > > done = 1; > > if (wbc->nonblocking && bdi_write_congested(bdi)) { > > _ > > > > Pekka Enberg replied with an identical patch a few days ago, but for > some reason the same condition flows up to msync as -1 EIO instead of > AOP_WRITEPAGE_ACTIVATE with that patch applied. The last part of the > thread is below. Thanks. Each time I sit down to follow what's going on with writepage and unionfs and msync, I get distracted: I really haven't researched this properly. But I keep suspecting that the answer might be the patch below (which rather follows what drivers/block/rd.c is doing). I'm especially worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned to userspace, bad enough in itself, you might be liable to hit that BUG_ON(page_mapped(page)). shmem_writepage does not expect to be called by anyone outside mm/vmscan.c, but unionfs can now get to it? Please let us know if this patch does fix it: then I'll try harder to work out what goes on. Thanks, Hugh --- 2.6.23/mm/shmem.c 2007-10-09 21:31:38.0 +0100 +++ linux/mm/shmem.c2007-10-12 01:25:46.0 +0100 @@ -916,6 +916,11 @@ static int shmem_writepage(struct page * struct inode *inode; BUG_ON(!PageLocked(page)); + if (!wbc->for_reclaim) { + set_page_dirty(page); + unlock_page(page); + return 0; + } BUG_ON(page_mapped(page)); mapping = page->mapping; - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: XFS regression?
On Fri, Oct 12, 2007 at 07:53:53AM +1000, David Chinner wrote: > On Thu, Oct 11, 2007 at 03:15:12PM +0100, Andrew Clayton wrote: > > On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote: > > > > > So it's almost certainly pointing at an elevator or driver change, not an > > > XFS change. > > > > heh, git bisect begs to differ :) > > > > 4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit commit > > 4c60658e0f4e253cf275f12b7c76bf128515a774 Author: David Chinner <[EMAIL > > PROTECTED]> > > Date: Sat Nov 11 18:05:00 2006 +1100 > > > > [XFS] Prevent a deadlock when xfslogd unpins inodes. > > Oh, of course - I failed to notice the significance of > this loop in your test: > > while [foo]; do > touch fred > rm fred > done > > The inode allocator keeps reusing the same inode. If the > transaction that did the unlink has not hit the disk before we > allocate the inode again, we have to force the log to get the unlink > transaction to disk to get the xfs inode unpinned (i.e. able to be > modified in memory again). > > It's the log force I/O that's introducing the latency. > > If we don't force the log, then we have a possible use-after free > of the linux inode because of a fundamental mismatch between > the XFS inode life cycle and the linux inode life cycle. The > use-after free only occurs on large machines under heavy, heavy > metadata load to many disks and filesystems (requires enough > traffic to overload an xfslogd) and is very difficult to > reproduce (large machine, lots of disks and 20-30 hours MTTF). > > I'll have a look at other ways to solve this problem, but it > took 6 months to find a solution to the race in the first place > so don't hold your breath. You can breath again. Here's a test patch (warning - may harm kittens - not fully tested or verified) that solves both the use-after-free issue (by avoiding it altogether) as well the unlink/create latency because the log force is no longer there. (yay! xfsqa test 016 passes again ;) It does have other possible side effects triggering extra log forces elsewhere on inode writeback and affects sync behaviour so it's only a proof of concept at this point. The latency output looks like: open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000281> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000184> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000188> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000182> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000225> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000182> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000185> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000405> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000224> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000199> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000163> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000145> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.817704> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000148> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000143> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000154> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000147> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000158> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000144> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000379> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000151> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000190> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000191> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000150> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.797099> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000139> open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000163> So we still see some operations show high latency - that will most likely be due to the allocation transaction filling a log buffer and pushing it to disk, then having to wait because I/O is in progress on all other log buffers. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- fs/xfs/linux-2.6/xfs_iops.c | 16 fs/xfs/xfs_iget.c | 18 -- fs/xfs/xfs_inod
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On 10/11/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > shit. That's a nasty bug. Really userspace should be testing for -1, but > the msync() library function should only ever return 0 or -1. > > Does this fix it? > > --- a/mm/page-writeback.c~a > +++ a/mm/page-writeback.c > @@ -850,8 +850,10 @@ retry: > > ret = (*writepage)(page, wbc, data); > > - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) > + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { > unlock_page(page); > + ret = 0; > + } > if (ret || (--(wbc->nr_to_write) <= 0)) > done = 1; > if (wbc->nonblocking && bdi_write_congested(bdi)) { > _ > Pekka Enberg replied with an identical patch a few days ago, but for some reason the same condition flows up to msync as -1 EIO instead of AOP_WRITEPAGE_ACTIVATE with that patch applied. The last part of the thread is below. Thanks. Ryan On 10/7/07, Ryan Finnie <[EMAIL PROTECTED]> wrote: > On 10/7/07, Pekka J Enberg <[EMAIL PROTECTED]> wrote: > > On 10/7/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > > > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes > > > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. > > > Therefore, some user programs fail, esp. if they're written such as > > > this: > > > ... > > It's a kernel bug. AOP_WRITEPAGE_ACTIVATE is a hint to the VM to avoid > > writeback of the page in the near future. I wonder if it's enough that we > > change the return value to zero from > > mm/page-writeback.c:write_cache_pages() in case we hit > > AOP_WRITEPAGE_ACTIVE... > > Doesn't appear to be enough. I can't figure out why (since it appears > write_cache_pages bubbles up directly to sys_msync), but with that > patch applied, in my test case[1], msync returns -1 EIO. However, > with the exact same kernel without that patch applied, msync returns > 524288 (AOP_WRITEPAGE_ACTIVATE). But as your patch specifically flips > 524288 to 0, I can't figure out how it eventually returns -1 EIO. > > Ryan > > [1] "apt-get check" on a unionfs2 mount backed by tmpfs over cdrom, > standard livecd setup > - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: XFS regression?
On Thu, Oct 11, 2007 at 03:15:12PM +0100, Andrew Clayton wrote: > On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote: > > > So it's almost certainly pointing at an elevator or driver change, not an > > XFS change. > > heh, git bisect begs to differ :) > > 4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit commit > 4c60658e0f4e253cf275f12b7c76bf128515a774 Author: David Chinner <[EMAIL > PROTECTED]> > Date: Sat Nov 11 18:05:00 2006 +1100 > > [XFS] Prevent a deadlock when xfslogd unpins inodes. Oh, of course - I failed to notice the significance of this loop in your test: while [foo]; do touch fred rm fred done The inode allocator keeps reusing the same inode. If the transaction that did the unlink has not hit the disk before we allocate the inode again, we have to force the log to get the unlink transaction to disk to get the xfs inode unpinned (i.e. able to be modified in memory again). It's the log force I/O that's introducing the latency. If we don't force the log, then we have a possible use-after free of the linux inode because of a fundamental mismatch between the XFS inode life cycle and the linux inode life cycle. The use-after free only occurs on large machines under heavy, heavy metadata load to many disks and filesystems (requires enough traffic to overload an xfslogd) and is very difficult to reproduce (large machine, lots of disks and 20-30 hours MTTF). I'll have a look at other ways to solve this problem, but it took 6 months to find a solution to the race in the first place so don't hold your breath. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sun, 7 Oct 2007 15:20:19 -0400 Erez Zadok <[EMAIL PROTECTED]> wrote: > According to vfs.txt, ->writepage() may return AOP_WRITEPAGE_ACTIVATE back > to the VFS/VM. Indeed some filesystems such as tmpfs can return > AOP_WRITEPAGE_ACTIVATE; and stackable file systems (e.g., Unionfs) also > return AOP_WRITEPAGE_ACTIVATE if the lower f/s returned it. > > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. > Therefore, some user programs fail, esp. if they're written such as this: > > err = msync(...); > if (err != 0) > // fail > > They temporarily fixed the specific program in question (apt-get) to check > > if (err < 0) > // fail > > Is this a bug indeed, or are user programs supposed to handle > AOP_WRITEPAGE_ACTIVATE (I hope not the latter). If it's a kernel bug, what > should the kernel return: a zero, or an -errno (and which one)? > shit. That's a nasty bug. Really userspace should be testing for -1, but the msync() library function should only ever return 0 or -1. Does this fix it? --- a/mm/page-writeback.c~a +++ a/mm/page-writeback.c @@ -850,8 +850,10 @@ retry: ret = (*writepage)(page, wbc, data); - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { unlock_page(page); + ret = 0; + } if (ret || (--(wbc->nr_to_write) <= 0)) done = 1; if (wbc->nonblocking && bdi_write_congested(bdi)) { _ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: XFS regression?
On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote: > So it's almost certainly pointing at an elevator or driver change, not > an XFS change. heh, git bisect begs to differ :) 4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit commit 4c60658e0f4e253cf275f12b7c76bf128515a774 Author: David Chinner <[EMAIL PROTECTED]> Date: Sat Nov 11 18:05:00 2006 +1100 [XFS] Prevent a deadlock when xfslogd unpins inodes. The previous fixes for the use after free in xfs_iunpin left a nasty log deadlock when xfslogd unpinned the inode and dropped the last reference to the inode. the ->clear_inode() method can issue transactions, and if the log was full, the transaction could push on the log and get stuck trying to push the inode it was currently unpinning. To fix this, we provide xfs_iunpin a guarantee that it will always have a valid xfs_inode <-> linux inode link or a particular flag will be set on the inode. We then use log forces during lookup to ensure transactions are completed before we recycle the inode. This ensures that xfs_iunpin will never use the linux inode after it is being freed, and any lookup on an inode on the reclaim list will wait until it is safe to attach a new linux inode to the xfs inode. SGI-PV: 956832 SGI-Modid: xfs-linux-melb:xfs-kern:27359a Signed-off-by: David Chinner <[EMAIL PROTECTED]> Signed-off-by: Shailendra Tripathi <[EMAIL PROTECTED]> Signed-off-by: Takenori Nagano <[EMAIL PROTECTED]> Signed-off-by: Tim Shimmin <[EMAIL PROTECTED]> :04 04 1c47ba44bc404456c87c5a493d543a8d30696b88 92319b34585d03c64e53890a80e550b579a0363d M fs If you'd like more info, don't hesitate to ask. > Cheers, > > dave. Cheers, Andrew - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: XFS regression?
On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote: > Latencies are an order of magnitude lower at 60-70ms because the disks > have less deep queues. This is expected - deep queues and multiple > outstanding I/Os are the enemy of single I/O latency > > If I remount with barriers enabled, the latency at nr_requests=128 > goes up to a consistent 2.2s. Not surprising, we're flushing the drive > cache very regularly now and it points to the create or truncate > transaction having to pushing log buffers to disk. The latency remains > at 70-80ms at nr_requests=4. Thanks for the info. I did try fiddling nr_requests but I made it bigger. I'll try with it lower. > > It seems this problem was introduced between 2.6.18 and 2.6.19. > > When the new SATA driver infrastructure was introduced. Do you have > NCQ enabled on more recent kernels and not on 2.6.18? If so, try > disabling it and see if the problem goes away Unfortunately the drives in the file server don't support NCQ. Not sure if it's supported in the machine I was testing on (it's certainly a few years old). > > The other thing I've found is that if I do the dd to an ext3 fs (on > > the same disk at least) while running the test in the XFS fs then I > > also see the latencies. > > So it's almost certainly pointing at an elevator or driver change, not > an XFS change. OK, though it doesn't seem to effect ext3. I'm going to run a git bisect to see what it comes up with. > Cheers, > > dave. Cheers, Andrew - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/31] IGET: Stop BFS from using iget() and read_inode() [try #3]
Roel Kluin <[EMAIL PROTECTED]> wrote: > > + if (IS_ERR(inode)) > > + return ERR_PTR(-ENOMEM); > > + if (!(inode->i_state & I_NEW)) > > + return inode; > > Don't you have to unlock_new_inode(inode) before returning? In the first case, no because an OOM error was returned rather than an inode, and in the second case, no because an extant non-locked inode was returned. The inode is only returned locked by iget_locked() if it is also new - in which case we don't return in either of the above two statements. > > +error: > > and also here? > > > + iget_failed(inode); > > + return ERR_PTR(-EIO); Take a look at what iget_failed() does (patch #3): void iget_failed(struct inode *inode) { make_bad_inode(inode); unlock_new_inode(inode); iput(inode); } David - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html