Re: [PATCH 4/8] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
On Fri, Jan 06, 2017 at 03:11:03PM +0100, Michal Hocko wrote: > From: Michal Hocko <mho...@suse.com> > > kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore} > API to prevent from reclaim recursion into the fs because vmalloc can > invoke unconditional GFP_KERNEL allocations and these functions might be > called from the NOFS contexts. The memalloc_noio_save will enforce > GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be > unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should > provide exactly what we need here - implicit GFP_NOFS context. > > Changes since v1 > - s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages > as per Brian Foster > > Signed-off-by: Michal Hocko <mho...@suse.com> > --- Looks fine to me: Reviewed-by: Brian Foster <bfos...@redhat.com> > fs/xfs/kmem.c| 10 +- > fs/xfs/xfs_buf.c | 8 > 2 files changed, 9 insertions(+), 9 deletions(-) > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c > index a76a05dae96b..d69ed5e76621 100644 > --- a/fs/xfs/kmem.c > +++ b/fs/xfs/kmem.c > @@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags) > void * > kmem_zalloc_large(size_t size, xfs_km_flags_t flags) > { > - unsigned noio_flag = 0; > + unsigned nofs_flag = 0; > void*ptr; > gfp_t lflags; > > @@ -80,14 +80,14 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags) >* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering >* the filesystem here and potentially deadlocking. >*/ > - if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) > - noio_flag = memalloc_noio_save(); > + if (flags & KM_NOFS) > + nofs_flag = memalloc_nofs_save(); > > lflags = kmem_flags_convert(flags); > ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL); > > - if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) > - memalloc_noio_restore(noio_flag); > + if (flags & KM_NOFS) > + memalloc_nofs_restore(nofs_flag); > > return ptr; > } > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index 7f0a01f7b592..8cb8dd4cdfd8 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c > @@ -441,17 +441,17 @@ _xfs_buf_map_pages( > bp->b_addr = NULL; > } else { > int retried = 0; > - unsigned noio_flag; > + unsigned nofs_flag; > > /* >* vm_map_ram() will allocate auxillary structures (e.g. >* pagetables) with GFP_KERNEL, yet we are likely to be under >* GFP_NOFS context here. Hence we need to tell memory reclaim > - * that we are in such a context via PF_MEMALLOC_NOIO to prevent > + * that we are in such a context via PF_MEMALLOC_NOFS to prevent >* memory reclaim re-entering the filesystem here and >* potentially deadlocking. >*/ > - noio_flag = memalloc_noio_save(); > + nofs_flag = memalloc_nofs_save(); > do { > bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count, > -1, PAGE_KERNEL); > @@ -459,7 +459,7 @@ _xfs_buf_map_pages( > break; > vm_unmap_aliases(); > } while (retried++ <= 1); > - memalloc_noio_restore(noio_flag); > + memalloc_nofs_restore(nofs_flag); > > if (!bp->b_addr) > return -ENOMEM; > -- > 2.11.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
On Thu, Dec 15, 2016 at 03:07:09PM +0100, Michal Hocko wrote: > From: Michal Hocko <mho...@suse.com> > > xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite > some time ago. We would like to make this concept more generic and use > it for other filesystems as well. Let's start by giving the flag a > more genric name PF_MEMALLOC_NOFS which is in line with an exiting Typos: generic existing > PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO > contexts. Replace all PF_FSTRANS usage from the xfs code in the first > step before we introduce a full API for it as xfs uses the flag directly > anyway. > > This patch doesn't introduce any functional change. > > Signed-off-by: Michal Hocko <mho...@suse.com> > --- Otherwise seems fine to me: Reviewed-by: Brian Foster <bfos...@redhat.com> > fs/xfs/kmem.c | 4 ++-- > fs/xfs/kmem.h | 2 +- > fs/xfs/libxfs/xfs_btree.c | 2 +- > fs/xfs/xfs_aops.c | 6 +++--- > fs/xfs/xfs_trans.c| 12 ++-- > include/linux/sched.h | 2 ++ > 6 files changed, 15 insertions(+), 13 deletions(-) > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c > index 339c696bbc01..a76a05dae96b 100644 > --- a/fs/xfs/kmem.c > +++ b/fs/xfs/kmem.c > @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags) >* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering >* the filesystem here and potentially deadlocking. >*/ > - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS)) > + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) > noio_flag = memalloc_noio_save(); > > lflags = kmem_flags_convert(flags); > ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL); > > - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS)) > + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) > memalloc_noio_restore(noio_flag); > > return ptr; > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h > index ea3984091d58..e40ddd12900b 100644 > --- a/fs/xfs/kmem.h > +++ b/fs/xfs/kmem.h > @@ -51,7 +51,7 @@ kmem_flags_convert(xfs_km_flags_t flags) > lflags = GFP_ATOMIC | __GFP_NOWARN; > } else { > lflags = GFP_KERNEL | __GFP_NOWARN; > - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS)) > + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) > lflags &= ~__GFP_FS; > } > > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c > index 21e6a6ab6b9a..a2672ba4dc33 100644 > --- a/fs/xfs/libxfs/xfs_btree.c > +++ b/fs/xfs/libxfs/xfs_btree.c > @@ -2866,7 +2866,7 @@ xfs_btree_split_worker( > struct xfs_btree_split_args *args = container_of(work, > struct xfs_btree_split_args, > work); > unsigned long pflags; > - unsigned long new_pflags = PF_FSTRANS; > + unsigned long new_pflags = PF_MEMALLOC_NOFS; > > /* >* we are in a transaction context here, but may also be doing work > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index 0f56fcd3a5d5..61ca9f9c5a12 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc( >* We hand off the transaction to the completion thread now, so >* clear the flag here. >*/ > - current_restore_flags_nested(>t_pflags, PF_FSTRANS); > + current_restore_flags_nested(>t_pflags, PF_MEMALLOC_NOFS); > return 0; > } > > @@ -252,7 +252,7 @@ xfs_setfilesize_ioend( >* thus we need to mark ourselves as being in a transaction manually. >* Similarly for freeze protection. >*/ > - current_set_flags_nested(>t_pflags, PF_FSTRANS); > + current_set_flags_nested(>t_pflags, PF_MEMALLOC_NOFS); > __sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS); > > /* we abort the update if there was an IO error */ > @@ -1015,7 +1015,7 @@ xfs_do_writepage( >* Given that we do not allow direct reclaim to call us, we should >* never be called while in a filesystem transaction. >*/ > - if (WARN_ON_ONCE(current->flags & PF_FSTRANS)) > + if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS)) > goto redirty; > > /* > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > index 70f42ea86dfb..f5969c8274fc 100644 > --- a/fs/xfs/xfs_t
Re: [PATCH 5/9] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
On Thu, Dec 15, 2016 at 03:07:11PM +0100, Michal Hocko wrote: > From: Michal Hocko> > kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore} > API to prevent from reclaim recursion into the fs because vmalloc can > invoke unconditional GFP_KERNEL allocations and these functions might be > called from the NOFS contexts. The memalloc_noio_save will enforce > GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be > unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should > provide exactly what we need here - implicit GFP_NOFS context. > > Signed-off-by: Michal Hocko > --- > fs/xfs/kmem.c| 10 +- > fs/xfs/xfs_buf.c | 8 > 2 files changed, 9 insertions(+), 9 deletions(-) > ... > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index f31ae592dcae..5c6f9bd4d8be 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c > @@ -441,17 +441,17 @@ _xfs_buf_map_pages( > bp->b_addr = NULL; > } else { > int retried = 0; > - unsigned noio_flag; > + unsigned nofs_flag; > > /* >* vm_map_ram() will allocate auxillary structures (e.g. >* pagetables) with GFP_KERNEL, yet we are likely to be under >* GFP_NOFS context here. Hence we need to tell memory reclaim > - * that we are in such a context via PF_MEMALLOC_NOIO to prevent > + * that we are in such a context via PF_MEMALLOC_NOFS to prevent >* memory reclaim re-entering the filesystem here and >* potentially deadlocking. >*/ > - noio_flag = memalloc_noio_save(); > + nofs_flag = memalloc_nofs_save(); > do { > bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count, > -1, PAGE_KERNEL); > @@ -459,7 +459,7 @@ _xfs_buf_map_pages( > break; > vm_unmap_aliases(); > } while (retried++ <= 1); > - memalloc_noio_restore(noio_flag); > + memalloc_noio_restore(nofs_flag); memalloc_nofs_restore() ? Brian > > if (!bp->b_addr) > return -ENOMEM; > -- > 2.10.2 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/9 v2] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives
On Fri, Dec 16, 2016 at 04:40:41PM +0100, Michal Hocko wrote: > Updated patch after Mike noticed a BUG_ON when KM_NOLOCKDEP is used. > --- > From 1497e713e11639157aef21cae29052cb3dc7ab44 Mon Sep 17 00:00:00 2001 > From: Michal Hocko> Date: Thu, 15 Dec 2016 13:06:43 +0100 > Subject: [PATCH] xfs: introduce and use KM_NOLOCKDEP to silence reclaim > lockdep false positives > > Now that the page allocator offers __GFP_NOLOCKDEP let's introduce > KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it > also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing > KM_NOFS tags to keep lockdep happy") and use the new flag for them > instead. There is really no reason to make these allocations contexts > weaker just because of the lockdep which even might not be enabled > in most cases. > Hi Michal, I haven't gone back to fully grok b17cb364dbbb ("xfs: fix missing KM_NOFS tags to keep lockdep happy"), so I'm not really familiar with the original problem. FWIW, there was another KM_NOFS instance added by that commit in xlog_cil_prepare_log_vecs() that is now in xlog_cil_alloc_shadow_bufs(). Perhaps Dave can confirm whether the original issue still applies..? Brian > Changes since v1 > - check for KM_NOLOCKDEP in kmem_flags_convert to not hit sanity BUG_ON > as per Mike Galbraith > > Signed-off-by: Michal Hocko > --- > fs/xfs/kmem.h| 6 +- > fs/xfs/libxfs/xfs_da_btree.c | 4 ++-- > fs/xfs/xfs_buf.c | 2 +- > fs/xfs/xfs_dir2_readdir.c| 2 +- > 4 files changed, 9 insertions(+), 5 deletions(-) > > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h > index 689f746224e7..d5d634ef1f7f 100644 > --- a/fs/xfs/kmem.h > +++ b/fs/xfs/kmem.h > @@ -33,6 +33,7 @@ typedef unsigned __bitwise xfs_km_flags_t; > #define KM_NOFS ((__force xfs_km_flags_t)0x0004u) > #define KM_MAYFAIL ((__force xfs_km_flags_t)0x0008u) > #define KM_ZERO ((__force xfs_km_flags_t)0x0010u) > +#define KM_NOLOCKDEP ((__force xfs_km_flags_t)0x0020u) > > /* > * We use a special process flag to avoid recursive callbacks into > @@ -44,7 +45,7 @@ kmem_flags_convert(xfs_km_flags_t flags) > { > gfp_t lflags; > > - BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO)); > + BUG_ON(flags & > ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP)); > > if (flags & KM_NOSLEEP) { > lflags = GFP_ATOMIC | __GFP_NOWARN; > @@ -57,6 +58,9 @@ kmem_flags_convert(xfs_km_flags_t flags) > if (flags & KM_ZERO) > lflags |= __GFP_ZERO; > > + if (flags & KM_NOLOCKDEP) > + lflags |= __GFP_NOLOCKDEP; > + > return lflags; > } > > diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c > index f2dc1a950c85..b8b5f6914863 100644 > --- a/fs/xfs/libxfs/xfs_da_btree.c > +++ b/fs/xfs/libxfs/xfs_da_btree.c > @@ -2429,7 +2429,7 @@ xfs_buf_map_from_irec( > > if (nirecs > 1) { > map = kmem_zalloc(nirecs * sizeof(struct xfs_buf_map), > - KM_SLEEP | KM_NOFS); > + KM_SLEEP | KM_NOLOCKDEP); > if (!map) > return -ENOMEM; > *mapp = map; > @@ -2488,7 +2488,7 @@ xfs_dabuf_map( >*/ > if (nfsb != 1) > irecs = kmem_zalloc(sizeof(irec) * nfsb, > - KM_SLEEP | KM_NOFS); > + KM_SLEEP | KM_NOLOCKDEP); > > nirecs = nfsb; > error = xfs_bmapi_read(dp, (xfs_fileoff_t)bno, nfsb, irecs, > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index 7f0a01f7b592..f31ae592dcae 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c > @@ -1785,7 +1785,7 @@ xfs_alloc_buftarg( > { > xfs_buftarg_t *btp; > > - btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOFS); > + btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOLOCKDEP); > > btp->bt_mount = mp; > btp->bt_dev = bdev->bd_dev; > diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c > index 003a99b83bd8..033ed65d7ce6 100644 > --- a/fs/xfs/xfs_dir2_readdir.c > +++ b/fs/xfs/xfs_dir2_readdir.c > @@ -503,7 +503,7 @@ xfs_dir2_leaf_getdents( > length = howmany(bufsize + geo->blksize, (1 << geo->fsblog)); > map_info = kmem_zalloc(offsetof(struct xfs_dir2_leaf_map_info, map) + > (length * sizeof(struct xfs_bmbt_irec)), > -KM_SLEEP | KM_NOFS); > +KM_SLEEP | KM_NOLOCKDEP); > map_info->map_size = length; > > /* > -- > 2.10.2 > > -- > Michal Hocko > SUSE Labs > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this
Re: Is is possible to submit binary image as fstest test case?
On Thu, Oct 06, 2016 at 04:12:56PM +0800, Qu Wenruo wrote: > Hi, > > Just as the title says, for some case(OK, btrfs again) we need to catch a > file system in special timing. > > In this specific case, we need to grab a btrfs image undergoing balancing, > just before the balance finished. > > Although we can use flakey to drop all write, we still don't have method to > catch the timing of the that transaction. > > > On the other hand, we can tweak our local kernel, adding msleep()/message > and dump the disk during the sleep. > And the image I dumped can easily trigger btrfs kernel and user-space bug. > > So I'm wondering if I can just upload a zipped raw image as part of the test > case? > Doesn't necessarily bother me one way or the other, but something we've done with XFS in such situations is introduce a DEBUG mode only sysfs tunable that delays certain infrastructure (log recovery in our case) to coordinate with test cases that try to reproduce such timing/racing problems. See test xfs/051 for an example.. Brian > Thanks, > Qu > > > -- > To unsubscribe from this list: send the line "unsubscribe fstests" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] generic/311: Disable dmesg check
On Fri, Jul 17, 2015 at 12:56:43AM -0400, Chandan Rajendra wrote: When running generic/311 on Btrfs' subpagesize-blocksize patchset (on ppc64 with 4k sectorsize and 16k node/leaf size) I noticed the following call trace, BTRFS (device dm-0): parent transid verify failed on 29720576 wanted 160 found 158 BTRFS (device dm-0): parent transid verify failed on 29720576 wanted 160 found 158 BTRFS: Transaction aborted (error -5) WARNING: at /root/repos/linux/fs/btrfs/super.c:260 Modules linked in: CPU: 3 PID: 30769 Comm: umount Tainted: GWL 4.0.0-rc5-11671-g8b82e73e #63 task: c00079aaddb0 ti: c00079a48000 task.ti: c00079a48000 NIP: c0499aa0 LR: c0499a9c CTR: c0779630 REGS: c00079a4b480 TRAP: 0700 Tainted: GW L (4.0.0-rc5-11671-g8b82e73e) MSR: 800100029032 SF,EE,ME,IR,DR,RI CR: 28008828 XER: 2000 CFAR: c0a23914 SOFTE: 1 GPR00: c0499a9c c00079a4b700 c103bdf8 0025 GPR04: 0001 0502 c107e918 0cda GPR08: 0007 0007 0001 c10f5044 GPR12: 28008822 cfdc0d80 2000 10152e00 GPR16: 010002979380 10140724 GPR20: GPR24: c000151f61a8 c00055e5e800 c0aac270 GPR28: 04a4 fffb c00055e5e800 c000679204d0 NIP [c0499aa0] .__btrfs_abort_transaction+0x180/0x190 LR [c0499a9c] .__btrfs_abort_transaction+0x17c/0x190 Call Trace: [c00079a4b700] [c0499a9c] .__btrfs_abort_transaction+0x17c/0x190 (unreliable) [c00079a4b7a0] [c0541678] .__btrfs_run_delayed_items+0xe8/0x220 [c00079a4b850] [c04d5b3c] .btrfs_commit_transaction+0x37c/0xca0 [c00079a4b960] [c049824c] .btrfs_sync_fs+0x6c/0x1a0 [c00079a4ba00] [c0255270] .sync_filesystem+0xd0/0x100 [c00079a4ba80] [c0218070] .generic_shutdown_super+0x40/0x170 [c00079a4bb10] [c0218598] .kill_anon_super+0x18/0x30 [c00079a4bb90] [c0498418] .btrfs_kill_super+0x18/0xc0 [c00079a4bc10] [c0218ac8] .deactivate_locked_super+0x98/0xe0 [c00079a4bc90] [c023e744] .cleanup_mnt+0x54/0xa0 [c00079a4bd10] [c00b7d14] .task_work_run+0x114/0x150 [c00079a4bdb0] [c0015f84] .do_notify_resume+0x74/0x80 [c00079a4be30] [c0009838] .ret_from_except_lite+0x64/0x68 Instruction dump: ebc1fff0 ebe1fff8 4bfffb28 6000 3ce2ffcd 38e7e818 4bbc 3c62ffd2 7fa4eb78 3863b808 48589e1d 6000 0fe0 4bfffedc 6000 6000 BTRFS: error (device dm-0) in __btrfs_run_delayed_items:1188: errno=-5 IO failure The call trace is seen when executing _run_test() for the 8th time. The above trace is actually a false-positive failure as indicated below, fsync-tester fsync(fd) Write delayed inode item to fs tree (assume transid to be 160) (assume tree block to start at logical address 29720576) md5sum $testfile This causes a delayed inode to be added Load flakey table i.e. drop writes that are initiated from now onwards Unmount filesystem btrfs_sync_fs is invoked Write 29720576 metadata block to disk free_extent_buffer(29720576) release_extent_buffer(29720576) Start writing delayed inode Traverse the fs tree (assume the parent tree block of 29720576 is still in memory) When reading 29720576 from disk, parent's blkptr will have generation set to 160. But the on-disk tree block will have an older generation (say, 158). Transid verification fails and hence the transaction gets aborted The test only cares about the FS instance before the unmount operation (i.e. the synced FS). Hence to get the test to pass, ignore the false-positive trace that could be generated. Signed-off-by: Chandan Rajendra chan...@linux.vnet.ibm.com --- tests/generic/311 | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tests/generic/311 b/tests/generic/311 index d21b6eb..cd6391d 100755 --- a/tests/generic/311 +++ b/tests/generic/311 @@ -64,6 +64,8 @@ _require_xfs_io_command falloc [ -x $here/src/fsync-tester ] || _notrun fsync-tester not built +_disable_dmesg_check + Hmm, I don't think this is something we'd want to do unconditionally. E.g., if something hits the logs for xfs or ext4, we probably want to hear about it. Perhaps check that the fs is btrfs and possibly the fs params are such that the known warning occurs..? I'd defer to the btrfs folks on how best to check that, so long as it doesn't affect other fs'. Brian rm -f $seqres.full SEED=1 testfile=$SCRATCH_MNT/$seq.fsync -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe fstests in the body of a message to
Re: [PATCH] xfstests: generic: test for discard properly discarding unused extents
On Wed, Apr 01, 2015 at 03:01:07PM -0400, Jeff Mahoney wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 4/1/15 2:44 PM, Brian Foster wrote: On Mon, Mar 30, 2015 at 03:11:06PM -0400, Jeff Mahoney wrote: This tests tests four conditions where discard can potentially not discard unused extents completely. We test, with -o discard and with fstrim, scenarios of removing many relatively small files and removing several large files. The important part of the two scenarios is that the large files must be large enough to span a blockgroup alone. It's possible for an entire block group to be emptied and dropped without an opportunity to discard individual extents as would happen with smaller files. The test confirms the discards have occured by using a sparse file mounted via loopback to punch holes and then check how many blocks are still allocated within the file. Signed-off-by: Jeff Mahoney je...@suse.com --- The code looks mostly Ok to me, a few notes below. Those aside, this is a longish test. It takes me about 8 minutes to run on my typical low end vm. My test hardware is a 16 core / 16 GB RAM machine using a commodity SSD. It ran pretty quickly. Yeah, as Dave mentioned, I test on anything from beefier bare metal hardware to fairly resource constrained VMs. We certainly have longer running tests, but sometimes I exclude them when I'm just trying to do regression tests on ongoing development work, etc. I suppose I should start by explaining that I wrote the test to be btrfs specific and then realized that the only thing that was /actually/ btrfs-specific was the btrfs filesystem sync call. I ran it on XFS to ensure it worked as expected, but didn't have any reason to try to adapt it to work in any other environment. Is the 1GB block group magic value mutable in any way, or is it a hardcoded thing (for btrfs I presume)? It would be nice if we could shrink that a bit. If not, perhaps there are some other ways to reduce the runtime... It's not hardcoded for btrfs, but it is by far the most common sized block group. I'd prefer to test what people are using. Ok... - Is there any reason a single discard or trim test instance must be all large or small files? In other words, is there something that this wouldn't catch if the 10GB were 50% filled with large files and %50 with small files? That would allow us to trim the maximum on the range of small file creation and only have two invocations instead of four. Only to draw attention to the obvious failure cases, which are probably specific to btrfs. If a file spans an entire block group and is removed, it skips the individual discards and depends on the block group removal to discard the entire thing (this wasn't happening). If there are lots of small files, it hits different paths, and I wanted to make it clear which one each mode of the test was targeting. Otherwise, whoever hits the failure is going to end up having to do it manually, which defeats the purpose of having an automated test case, IM O. So it seems to me this is somewhat a mix of a functional test, a stress test and a targeted regression test. IIUC, the regression could probably be reproduced and tested for using a smaller block group and thus require significantly less time. The functional test requires testing that the various discard mechanisms work appropriately (e.g., mix of files), but still shouldn't require writing so much data. A stress test certainly requires a larger fs and writing a bit more data. That could probably be managed in a variety of ways. You could throw the whole thing under btrfs as is, split it up into separate tests that are grouped such that it's easier to exclude the longer running test, use fs-specific values as discussed above, use a percentage of SCRATCH_DEV and let the tester determine the time based on the devices under test, etc. The only thing I really care about is the length of time running the test on slower storage. The idea of scaling the file sizes seems a reasonable enough workaround to me, assuming we can get that 8 minutes down to a couple minutes or so. - If the 1GB thing is in fact a btrfs thing, could we make the core test a bit more size agnostic (e.g., perhaps pass the file count/size values as parameters) and then scale the parameters up exclusively for btrfs? For example, set defaults of fssize=1G, largefile=100MB, smallfile=[512b-5MB] or something of that nature and override them to the 10GB, 1GB, 32k-... values for btrfs? That way we don't need to write as much data for fs' where it might not be necessary. If someone wants to weigh in on what sane defaults for other file systems might be, sure. The order of magnitude numbers I threw out above seem reasonable to me, at least for xfs. We can always tweak it later. tests/generic/326 | 164
Re: [PATCH] xfstests: generic: test for discard properly discarding unused extents
On Mon, Mar 30, 2015 at 03:11:06PM -0400, Jeff Mahoney wrote: This tests tests four conditions where discard can potentially not discard unused extents completely. We test, with -o discard and with fstrim, scenarios of removing many relatively small files and removing several large files. The important part of the two scenarios is that the large files must be large enough to span a blockgroup alone. It's possible for an entire block group to be emptied and dropped without an opportunity to discard individual extents as would happen with smaller files. The test confirms the discards have occured by using a sparse file mounted via loopback to punch holes and then check how many blocks are still allocated within the file. Signed-off-by: Jeff Mahoney je...@suse.com --- The code looks mostly Ok to me, a few notes below. Those aside, this is a longish test. It takes me about 8 minutes to run on my typical low end vm. Is the 1GB block group magic value mutable in any way, or is it a hardcoded thing (for btrfs I presume)? It would be nice if we could shrink that a bit. If not, perhaps there are some other ways to reduce the runtime... - Is there any reason a single discard or trim test instance must be all large or small files? In other words, is there something that this wouldn't catch if the 10GB were 50% filled with large files and %50 with small files? That would allow us to trim the maximum on the range of small file creation and only have two invocations instead of four. - If the 1GB thing is in fact a btrfs thing, could we make the core test a bit more size agnostic (e.g., perhaps pass the file count/size values as parameters) and then scale the parameters up exclusively for btrfs? For example, set defaults of fssize=1G, largefile=100MB, smallfile=[512b-5MB] or something of that nature and override them to the 10GB, 1GB, 32k-... values for btrfs? That way we don't need to write as much data for fs' where it might not be necessary. tests/generic/326 | 164 ++ tests/generic/326.out | 5 ++ tests/generic/group | 1 + 3 files changed, 170 insertions(+) create mode 100644 tests/generic/326 create mode 100644 tests/generic/326.out diff --git a/tests/generic/326 b/tests/generic/326 new file mode 100644 index 000..923a27f --- /dev/null +++ b/tests/generic/326 @@ -0,0 +1,164 @@ +#! /bin/bash +# FSQA Test No. 326 +# +# This test uses a loopback mount with PUNCH_HOLE support to test +# whether discard operations are working as expected. +# +# It tests both -odiscard and fstrim. +# +# Copyright (C) 2015 SUSE. All Rights Reserved. +# Author: Jeff Mahoney je...@suse.com +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo QA output created by $seq + +tmp=/tmp/$$ +status=1 # failure is the default! +trap _cleanup; exit \$status 0 1 2 3 15 + +loopdev= +tmpdir= +_cleanup() +{ + [ -n $tmpdir ] umount $tmpdir + [ -n $loopdev ] losetup -d $loopdev +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter + +# real QA test starts here +_need_to_be_root +_supported_fs generic +_supported_os Linux +_require_scratch +_require_fstrim + +rm -f $seqres.full + +_scratch_mkfs $seqres.full +_require_fs_space $SCRATCH_MNT $(( 10 * 1024 * 1024 )) +_scratch_mount + +test_discard() +{ + discard=$1 + files=$2 + + tmpfile=$SCRATCH_MNT/testfs.img.$$ + tmpdir=$SCRATCH_MNT/testdir.$$ + mkdir -p $tmpdir || _fail !!! failed to create temp mount dir + + # Create a sparse file to host the file system + dd if=/dev/zero of=$tmpfile bs=1M count=1 seek=10240 $seqres.full \ + || _fail !!! failed to create fs image file xfs_io -c truncate ... ? + + opts= + if [ $discard = discard ]; then + opts=-o discard + fi + losetup -f $tmpfile + loopdev=$(losetup -j $tmpfile|awk -F: '{print $1}') + _mkfs_dev $loopdev $seqres.full + $MOUNT_PROG $opts $loopdev $tmpdir \ + || _fail !!! failed to loopback mount + + if [ $files = large ]; then + # Create files larger than 1GB so each one occupies + # more than one
Re: Fiemap inconsistent behaviour when file offset range isn't on a block boundary
On Tue, Jul 15, 2014 at 04:20:29PM +0630, Chandan Rajendra wrote: All the filesystems created and used below have 4k blocksize. The file.bin file mentioned below is 8k in size and has two 4k extents. The test program used can be found at http://fpaste.org/118057/. 1. First run (file range being passed is on block boundaries). , | [root@guest0 btrfs]# for f in $(ls -1 /mnt/{btrfs,ext4,xfs}/file.bin) ; | do | echo -- File: $f ---; | /root/print-fiemap 0 8192 $f; | done | -- File: /mnt/btrfs/file.bin --- | File range: 0 - 8191. | Found 2 extents. | Fiemap information: | Logical: 0 | Physical: 12582912 | Length: 4096 | Flags: | | Logical: 4096 | Physical: 12656640 | Length: 4096 | Flags: FIEMAP_EXTENT_LAST | | | -- File: /mnt/ext4/file.bin --- | File range: 0 - 8191. | Found 2 extents. | Fiemap information: | Logical: 0 | Physical: 135270400 | Length: 4096 | Flags: | | Logical: 4096 | Physical: 135278592 | Length: 4096 | Flags: FIEMAP_EXTENT_LAST | | | -- File: /mnt/xfs/file.bin --- | File range: 0 - 8191. | Found 2 extents. | Fiemap information: | Logical: 0 | Physical: 49152 | Length: 4096 | Flags: | | Logical: 4096 | Physical: 57344 | Length: 4096 | Flags: FIEMAP_EXTENT_LAST | ` 2. If the file offset range being passed as input to fiemap ioctl is not on block boundaries and falls within an extent's range then that extent is skipped. , | [root@guest0 btrfs]# for f in $(ls -1 /mnt/{btrfs,ext4,xfs}/file.bin) ; | do | echo -- File: $f ---; | /root/print-fiemap 1 4095 $f; | done | -- File: /mnt/btrfs/file.bin --- | File range: 1 - 4095. | Found 0 extents. | | | -- File: /mnt/ext4/file.bin --- | File range: 1 - 4095. | Found 1 extents. | Fiemap information: | Logical: 0 | Physical: 135270400 | Length: 4096 | Flags: | | -- File: /mnt/xfs/file.bin --- | File range: 1 - 4095. | Found 2 extents. | Fiemap information: | Logical: 0 | Physical: 49152 | Length: 4096 | Flags: | | Logical: 4096 | Physical: 57344 | Length: 4096 | Flags: FIEMAP_EXTENT_LAST | ` From linux/Documentation/filesystems/fiemap.txt, fm_start, and fm_length specify the logical range within the file which the process would like mappings for. Extents returned mirror those on disk - that is, the logical offset of the 1st returned extent may start before fm_start, and the range covered by the last returned extent may end after fm_length. All offsets and lengths are in bytes. So IMHO, the above would mean that all the extents that map the file range [fm_start, fm_start + fm_length - 1] should be returned by a fiemap ioctl call (as done by ext4). In the case of Btrfs, the commit ea8efc74bd0402b4d5f663d007b4e25fa29ea778 i.e. Btrfs: make sure not to return overlapping extents to fiemap, forces the first extent returned by btrfs_fiemap() to start from fm_start (if fm_start is greater than the file offset mapped by the containing extent's first byte). Can somebody please list some example scenarios where extent_fiemap() ends up returning dupclicate and overlapping extents? Also, the commit 4d479cf010d56ec9c54f302d039918f1296b i.e. Btrfs: sectorsize align offsets in fiemap, rounds up first byte of the file offset range to the next block. Shouldn't it be rounded down instead? XFS lists both the extents even though the first one encompasses the file range specified in the input. I gave this a test on XFS with a file that looks like this: # xfs_bmap -v /mnt/file /mnt/file: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSETTOTAL FLAGS 0: [0..15]: 102368..1023831 (51168..51183) 16 1 1: [16..63]:1832..18790 (1832..1879)48 1 I narrowed the print_fiemap behavior down to this: # ./print_fiemap 1 7680 /mnt/file File range: 1 - 7680. Found 1 extents. Fiemap information: Logical: 0 Physical: 52412416 Length: 8192 Flags: FIEMAP_EXTENT_UNWRITTEN | # ./print_fiemap 1 7681 /mnt/file File range: 1 - 7681. Found 2 extents. Fiemap information:
Re: Fiemap inconsistent behaviour when file offset range isn't on a block boundary
On Tue, Jul 15, 2014 at 08:15:12AM -0400, Brian Foster wrote: On Tue, Jul 15, 2014 at 04:20:29PM +0630, Chandan Rajendra wrote: All the filesystems created and used below have 4k blocksize. The file.bin file mentioned below is 8k in size and has two 4k extents. The test program used can be found at http://fpaste.org/118057/. 1. First run (file range being passed is on block boundaries). , | [root@guest0 btrfs]# for f in $(ls -1 /mnt/{btrfs,ext4,xfs}/file.bin) ; | do | echo -- File: $f ---; | /root/print-fiemap 0 8192 $f; | done | -- File: /mnt/btrfs/file.bin --- | File range: 0 - 8191. | Found 2 extents. | Fiemap information: |Logical: 0 |Physical: 12582912 |Length: 4096 |Flags: | |Logical: 4096 |Physical: 12656640 |Length: 4096 |Flags: FIEMAP_EXTENT_LAST | | | -- File: /mnt/ext4/file.bin --- | File range: 0 - 8191. | Found 2 extents. | Fiemap information: |Logical: 0 |Physical: 135270400 |Length: 4096 |Flags: | |Logical: 4096 |Physical: 135278592 |Length: 4096 |Flags: FIEMAP_EXTENT_LAST | | | -- File: /mnt/xfs/file.bin --- | File range: 0 - 8191. | Found 2 extents. | Fiemap information: |Logical: 0 |Physical: 49152 |Length: 4096 |Flags: | |Logical: 4096 |Physical: 57344 |Length: 4096 |Flags: FIEMAP_EXTENT_LAST | ` 2. If the file offset range being passed as input to fiemap ioctl is not on block boundaries and falls within an extent's range then that extent is skipped. , | [root@guest0 btrfs]# for f in $(ls -1 /mnt/{btrfs,ext4,xfs}/file.bin) ; | do | echo -- File: $f ---; | /root/print-fiemap 1 4095 $f; | done | -- File: /mnt/btrfs/file.bin --- | File range: 1 - 4095. | Found 0 extents. | | | -- File: /mnt/ext4/file.bin --- | File range: 1 - 4095. | Found 1 extents. | Fiemap information: |Logical: 0 |Physical: 135270400 |Length: 4096 |Flags: | | -- File: /mnt/xfs/file.bin --- | File range: 1 - 4095. | Found 2 extents. | Fiemap information: |Logical: 0 |Physical: 49152 |Length: 4096 |Flags: | |Logical: 4096 |Physical: 57344 |Length: 4096 |Flags: FIEMAP_EXTENT_LAST | ` From linux/Documentation/filesystems/fiemap.txt, fm_start, and fm_length specify the logical range within the file which the process would like mappings for. Extents returned mirror those on disk - that is, the logical offset of the 1st returned extent may start before fm_start, and the range covered by the last returned extent may end after fm_length. All offsets and lengths are in bytes. So IMHO, the above would mean that all the extents that map the file range [fm_start, fm_start + fm_length - 1] should be returned by a fiemap ioctl call (as done by ext4). In the case of Btrfs, the commit ea8efc74bd0402b4d5f663d007b4e25fa29ea778 i.e. Btrfs: make sure not to return overlapping extents to fiemap, forces the first extent returned by btrfs_fiemap() to start from fm_start (if fm_start is greater than the file offset mapped by the containing extent's first byte). Can somebody please list some example scenarios where extent_fiemap() ends up returning dupclicate and overlapping extents? Also, the commit 4d479cf010d56ec9c54f302d039918f1296b i.e. Btrfs: sectorsize align offsets in fiemap, rounds up first byte of the file offset range to the next block. Shouldn't it be rounded down instead? XFS lists both the extents even though the first one encompasses the file range specified in the input. I gave this a test on XFS with a file that looks like this: # xfs_bmap -v /mnt/file /mnt/file: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSETTOTAL FLAGS 0: [0..15]: 102368..1023831 (51168..51183) 16 1 1: [16..63]:1832..18790 (1832..1879