Re: [PATCH 4/8] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2017-01-09 Thread Brian Foster
On Fri, Jan 06, 2017 at 03:11:03PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mho...@suse.com>
> 
> kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
> API to prevent from reclaim recursion into the fs because vmalloc can
> invoke unconditional GFP_KERNEL allocations and these functions might be
> called from the NOFS contexts. The memalloc_noio_save will enforce
> GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
> unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
> provide exactly what we need here - implicit GFP_NOFS context.
> 
> Changes since v1
> - s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
>   as per Brian Foster
> 
> Signed-off-by: Michal Hocko <mho...@suse.com>
> ---

Looks fine to me:

Reviewed-by: Brian Foster <bfos...@redhat.com>

>  fs/xfs/kmem.c| 10 +-
>  fs/xfs/xfs_buf.c |  8 
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index a76a05dae96b..d69ed5e76621 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
>  void *
>  kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>  {
> - unsigned noio_flag = 0;
> + unsigned nofs_flag = 0;
>   void*ptr;
>   gfp_t   lflags;
>  
> @@ -80,14 +80,14 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
>* the filesystem here and potentially deadlocking.
>*/
> - if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> - noio_flag = memalloc_noio_save();
> + if (flags & KM_NOFS)
> + nofs_flag = memalloc_nofs_save();
>  
>   lflags = kmem_flags_convert(flags);
>   ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
>  
> - if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> - memalloc_noio_restore(noio_flag);
> + if (flags & KM_NOFS)
> + memalloc_nofs_restore(nofs_flag);
>  
>   return ptr;
>  }
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 7f0a01f7b592..8cb8dd4cdfd8 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -441,17 +441,17 @@ _xfs_buf_map_pages(
>   bp->b_addr = NULL;
>   } else {
>   int retried = 0;
> - unsigned noio_flag;
> + unsigned nofs_flag;
>  
>   /*
>* vm_map_ram() will allocate auxillary structures (e.g.
>* pagetables) with GFP_KERNEL, yet we are likely to be under
>* GFP_NOFS context here. Hence we need to tell memory reclaim
> -  * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> +  * that we are in such a context via PF_MEMALLOC_NOFS to prevent
>* memory reclaim re-entering the filesystem here and
>* potentially deadlocking.
>*/
> - noio_flag = memalloc_noio_save();
> + nofs_flag = memalloc_nofs_save();
>   do {
>   bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
>   -1, PAGE_KERNEL);
> @@ -459,7 +459,7 @@ _xfs_buf_map_pages(
>   break;
>   vm_unmap_aliases();
>   } while (retried++ <= 1);
> - memalloc_noio_restore(noio_flag);
> + memalloc_nofs_restore(nofs_flag);
>  
>   if (!bp->b_addr)
>   return -ENOMEM;
> -- 
> 2.11.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

2016-12-16 Thread Brian Foster
On Thu, Dec 15, 2016 at 03:07:09PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mho...@suse.com>
> 
> xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> some time ago. We would like to make this concept more generic and use
> it for other filesystems as well. Let's start by giving the flag a
> more genric name PF_MEMALLOC_NOFS which is in line with an exiting

Typos: generic   existing

> PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> step before we introduce a full API for it as xfs uses the flag directly
> anyway.
> 
> This patch doesn't introduce any functional change.
> 
> Signed-off-by: Michal Hocko <mho...@suse.com>
> ---

Otherwise seems fine to me:

Reviewed-by: Brian Foster <bfos...@redhat.com>

>  fs/xfs/kmem.c |  4 ++--
>  fs/xfs/kmem.h |  2 +-
>  fs/xfs/libxfs/xfs_btree.c |  2 +-
>  fs/xfs/xfs_aops.c |  6 +++---
>  fs/xfs/xfs_trans.c| 12 ++--
>  include/linux/sched.h |  2 ++
>  6 files changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 339c696bbc01..a76a05dae96b 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
>* the filesystem here and potentially deadlocking.
>*/
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>   noio_flag = memalloc_noio_save();
>  
>   lflags = kmem_flags_convert(flags);
>   ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
>  
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>   memalloc_noio_restore(noio_flag);
>  
>   return ptr;
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index ea3984091d58..e40ddd12900b 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -51,7 +51,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
>   lflags = GFP_ATOMIC | __GFP_NOWARN;
>   } else {
>   lflags = GFP_KERNEL | __GFP_NOWARN;
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>   lflags &= ~__GFP_FS;
>   }
>  
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 21e6a6ab6b9a..a2672ba4dc33 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
>   struct xfs_btree_split_args *args = container_of(work,
>   struct xfs_btree_split_args, 
> work);
>   unsigned long   pflags;
> - unsigned long   new_pflags = PF_FSTRANS;
> + unsigned long   new_pflags = PF_MEMALLOC_NOFS;
>  
>   /*
>* we are in a transaction context here, but may also be doing work
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 0f56fcd3a5d5..61ca9f9c5a12 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
>* We hand off the transaction to the completion thread now, so
>* clear the flag here.
>*/
> - current_restore_flags_nested(>t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(>t_pflags, PF_MEMALLOC_NOFS);
>   return 0;
>  }
>  
> @@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
>* thus we need to mark ourselves as being in a transaction manually.
>* Similarly for freeze protection.
>*/
> - current_set_flags_nested(>t_pflags, PF_FSTRANS);
> + current_set_flags_nested(>t_pflags, PF_MEMALLOC_NOFS);
>   __sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
>  
>   /* we abort the update if there was an IO error */
> @@ -1015,7 +1015,7 @@ xfs_do_writepage(
>* Given that we do not allow direct reclaim to call us, we should
>* never be called while in a filesystem transaction.
>*/
> - if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
> + if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>   goto redirty;
>  
>   /*
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 70f42ea86dfb..f5969c8274fc 100644
> --- a/fs/xfs/xfs_t

Re: [PATCH 5/9] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2016-12-16 Thread Brian Foster
On Thu, Dec 15, 2016 at 03:07:11PM +0100, Michal Hocko wrote:
> From: Michal Hocko 
> 
> kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
> API to prevent from reclaim recursion into the fs because vmalloc can
> invoke unconditional GFP_KERNEL allocations and these functions might be
> called from the NOFS contexts. The memalloc_noio_save will enforce
> GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
> unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
> provide exactly what we need here - implicit GFP_NOFS context.
> 
> Signed-off-by: Michal Hocko 
> ---
>  fs/xfs/kmem.c| 10 +-
>  fs/xfs/xfs_buf.c |  8 
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f31ae592dcae..5c6f9bd4d8be 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -441,17 +441,17 @@ _xfs_buf_map_pages(
>   bp->b_addr = NULL;
>   } else {
>   int retried = 0;
> - unsigned noio_flag;
> + unsigned nofs_flag;
>  
>   /*
>* vm_map_ram() will allocate auxillary structures (e.g.
>* pagetables) with GFP_KERNEL, yet we are likely to be under
>* GFP_NOFS context here. Hence we need to tell memory reclaim
> -  * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> +  * that we are in such a context via PF_MEMALLOC_NOFS to prevent
>* memory reclaim re-entering the filesystem here and
>* potentially deadlocking.
>*/
> - noio_flag = memalloc_noio_save();
> + nofs_flag = memalloc_nofs_save();
>   do {
>   bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
>   -1, PAGE_KERNEL);
> @@ -459,7 +459,7 @@ _xfs_buf_map_pages(
>   break;
>   vm_unmap_aliases();
>   } while (retried++ <= 1);
> - memalloc_noio_restore(noio_flag);
> + memalloc_noio_restore(nofs_flag);

memalloc_nofs_restore() ?

Brian

>  
>   if (!bp->b_addr)
>   return -ENOMEM;
> -- 
> 2.10.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/9 v2] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

2016-12-16 Thread Brian Foster
On Fri, Dec 16, 2016 at 04:40:41PM +0100, Michal Hocko wrote:
> Updated patch after Mike noticed a BUG_ON when KM_NOLOCKDEP is used.
> ---
> From 1497e713e11639157aef21cae29052cb3dc7ab44 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Thu, 15 Dec 2016 13:06:43 +0100
> Subject: [PATCH] xfs: introduce and use KM_NOLOCKDEP to silence reclaim
>  lockdep false positives
> 
> Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
> KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
> also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
> KM_NOFS tags to keep lockdep happy") and use the new flag for them
> instead. There is really no reason to make these allocations contexts
> weaker just because of the lockdep which even might not be enabled
> in most cases.
> 

Hi Michal,

I haven't gone back to fully grok b17cb364dbbb ("xfs: fix missing
KM_NOFS tags to keep lockdep happy"), so I'm not really familiar with
the original problem. FWIW, there was another KM_NOFS instance added by
that commit in xlog_cil_prepare_log_vecs() that is now in
xlog_cil_alloc_shadow_bufs(). Perhaps Dave can confirm whether the
original issue still applies..?

Brian

> Changes since v1
> - check for KM_NOLOCKDEP in kmem_flags_convert to not hit sanity BUG_ON
>   as per Mike Galbraith
> 
> Signed-off-by: Michal Hocko 
> ---
>  fs/xfs/kmem.h| 6 +-
>  fs/xfs/libxfs/xfs_da_btree.c | 4 ++--
>  fs/xfs/xfs_buf.c | 2 +-
>  fs/xfs/xfs_dir2_readdir.c| 2 +-
>  4 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 689f746224e7..d5d634ef1f7f 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -33,6 +33,7 @@ typedef unsigned __bitwise xfs_km_flags_t;
>  #define KM_NOFS  ((__force xfs_km_flags_t)0x0004u)
>  #define KM_MAYFAIL   ((__force xfs_km_flags_t)0x0008u)
>  #define KM_ZERO  ((__force xfs_km_flags_t)0x0010u)
> +#define KM_NOLOCKDEP ((__force xfs_km_flags_t)0x0020u)
>  
>  /*
>   * We use a special process flag to avoid recursive callbacks into
> @@ -44,7 +45,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  {
>   gfp_t   lflags;
>  
> - BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO));
> + BUG_ON(flags & 
> ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP));
>  
>   if (flags & KM_NOSLEEP) {
>   lflags = GFP_ATOMIC | __GFP_NOWARN;
> @@ -57,6 +58,9 @@ kmem_flags_convert(xfs_km_flags_t flags)
>   if (flags & KM_ZERO)
>   lflags |= __GFP_ZERO;
>  
> + if (flags & KM_NOLOCKDEP)
> + lflags |= __GFP_NOLOCKDEP;
> +
>   return lflags;
>  }
>  
> diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
> index f2dc1a950c85..b8b5f6914863 100644
> --- a/fs/xfs/libxfs/xfs_da_btree.c
> +++ b/fs/xfs/libxfs/xfs_da_btree.c
> @@ -2429,7 +2429,7 @@ xfs_buf_map_from_irec(
>  
>   if (nirecs > 1) {
>   map = kmem_zalloc(nirecs * sizeof(struct xfs_buf_map),
> -   KM_SLEEP | KM_NOFS);
> +   KM_SLEEP | KM_NOLOCKDEP);
>   if (!map)
>   return -ENOMEM;
>   *mapp = map;
> @@ -2488,7 +2488,7 @@ xfs_dabuf_map(
>*/
>   if (nfsb != 1)
>   irecs = kmem_zalloc(sizeof(irec) * nfsb,
> - KM_SLEEP | KM_NOFS);
> + KM_SLEEP | KM_NOLOCKDEP);
>  
>   nirecs = nfsb;
>   error = xfs_bmapi_read(dp, (xfs_fileoff_t)bno, nfsb, irecs,
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 7f0a01f7b592..f31ae592dcae 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1785,7 +1785,7 @@ xfs_alloc_buftarg(
>  {
>   xfs_buftarg_t   *btp;
>  
> - btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOFS);
> + btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOLOCKDEP);
>  
>   btp->bt_mount = mp;
>   btp->bt_dev =  bdev->bd_dev;
> diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> index 003a99b83bd8..033ed65d7ce6 100644
> --- a/fs/xfs/xfs_dir2_readdir.c
> +++ b/fs/xfs/xfs_dir2_readdir.c
> @@ -503,7 +503,7 @@ xfs_dir2_leaf_getdents(
>   length = howmany(bufsize + geo->blksize, (1 << geo->fsblog));
>   map_info = kmem_zalloc(offsetof(struct xfs_dir2_leaf_map_info, map) +
>   (length * sizeof(struct xfs_bmbt_irec)),
> -KM_SLEEP | KM_NOFS);
> +KM_SLEEP | KM_NOLOCKDEP);
>   map_info->map_size = length;
>  
>   /*
> -- 
> 2.10.2
> 
> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this 

Re: Is is possible to submit binary image as fstest test case?

2016-10-06 Thread Brian Foster
On Thu, Oct 06, 2016 at 04:12:56PM +0800, Qu Wenruo wrote:
> Hi,
> 
> Just as the title says, for some case(OK, btrfs again) we need to catch a
> file system in special timing.
> 
> In this specific case, we need to grab a btrfs image undergoing balancing,
> just before the balance finished.
> 
> Although we can use flakey to drop all write, we still don't have method to
> catch the timing of the that transaction.
> 
> 
> On the other hand, we can tweak our local kernel, adding msleep()/message
> and dump the disk during the sleep.
> And the image I dumped can easily trigger btrfs kernel and user-space bug.
> 
> So I'm wondering if I can just upload a zipped raw image as part of the test
> case?
> 

Doesn't necessarily bother me one way or the other, but something we've
done with XFS in such situations is introduce a DEBUG mode only sysfs
tunable that delays certain infrastructure (log recovery in our case) to
coordinate with test cases that try to reproduce such timing/racing
problems.

See test xfs/051 for an example..

Brian

> Thanks,
> Qu
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] generic/311: Disable dmesg check

2015-07-17 Thread Brian Foster
On Fri, Jul 17, 2015 at 12:56:43AM -0400, Chandan Rajendra wrote:
 When running generic/311 on Btrfs' subpagesize-blocksize patchset (on ppc64
 with 4k sectorsize and 16k node/leaf size) I noticed the following call trace,
 
 BTRFS (device dm-0): parent transid verify failed on 29720576 wanted 160 
 found 158
 BTRFS (device dm-0): parent transid verify failed on 29720576 wanted 160 
 found 158
 BTRFS: Transaction aborted (error -5)
 
 WARNING: at /root/repos/linux/fs/btrfs/super.c:260
 Modules linked in:
 CPU: 3 PID: 30769 Comm: umount Tainted: GWL 
 4.0.0-rc5-11671-g8b82e73e #63
 task: c00079aaddb0 ti: c00079a48000 task.ti: c00079a48000
 NIP: c0499aa0 LR: c0499a9c CTR: c0779630
 REGS: c00079a4b480 TRAP: 0700   Tainted: GW   L   
 (4.0.0-rc5-11671-g8b82e73e)
 MSR: 800100029032 SF,EE,ME,IR,DR,RI  CR: 28008828  XER: 2000
 CFAR: c0a23914 SOFTE: 1
 GPR00: c0499a9c c00079a4b700 c103bdf8 0025
 GPR04: 0001 0502 c107e918 0cda
 GPR08: 0007 0007 0001 c10f5044
 GPR12: 28008822 cfdc0d80 2000 10152e00
 GPR16: 010002979380 10140724  
 GPR20:    
 GPR24: c000151f61a8  c00055e5e800 c0aac270
 GPR28: 04a4 fffb c00055e5e800 c000679204d0
 NIP [c0499aa0] .__btrfs_abort_transaction+0x180/0x190
 LR [c0499a9c] .__btrfs_abort_transaction+0x17c/0x190
 Call Trace:
 [c00079a4b700] [c0499a9c] .__btrfs_abort_transaction+0x17c/0x190 
 (unreliable)
 [c00079a4b7a0] [c0541678] .__btrfs_run_delayed_items+0xe8/0x220
 [c00079a4b850] [c04d5b3c] .btrfs_commit_transaction+0x37c/0xca0
 [c00079a4b960] [c049824c] .btrfs_sync_fs+0x6c/0x1a0
 [c00079a4ba00] [c0255270] .sync_filesystem+0xd0/0x100
 [c00079a4ba80] [c0218070] .generic_shutdown_super+0x40/0x170
 [c00079a4bb10] [c0218598] .kill_anon_super+0x18/0x30
 [c00079a4bb90] [c0498418] .btrfs_kill_super+0x18/0xc0
 [c00079a4bc10] [c0218ac8] .deactivate_locked_super+0x98/0xe0
 [c00079a4bc90] [c023e744] .cleanup_mnt+0x54/0xa0
 [c00079a4bd10] [c00b7d14] .task_work_run+0x114/0x150
 [c00079a4bdb0] [c0015f84] .do_notify_resume+0x74/0x80
 [c00079a4be30] [c0009838] .ret_from_except_lite+0x64/0x68
 Instruction dump:
 ebc1fff0 ebe1fff8 4bfffb28 6000 3ce2ffcd 38e7e818 4bbc 3c62ffd2
 7fa4eb78 3863b808 48589e1d 6000 0fe0 4bfffedc 6000 6000
 BTRFS: error (device dm-0) in __btrfs_run_delayed_items:1188: errno=-5 IO 
 failure
 
 
 The call trace is seen when executing _run_test() for the 8th time.
 The above trace is actually a false-positive failure as indicated below,
  fsync-tester
fsync(fd)
Write delayed inode item to fs tree
  (assume transid to be 160)
  (assume tree block to start at logical address 29720576)
  md5sum $testfile
This causes a delayed inode to be added
  Load flakey table
i.e. drop writes that are initiated from now onwards
  Unmount filesystem
btrfs_sync_fs is invoked
  Write 29720576 metadata block to disk
  free_extent_buffer(29720576)
release_extent_buffer(29720576)
Start writing delayed inode
  Traverse the fs tree
(assume the parent tree block of 29720576 is still in memory)
When reading 29720576 from disk, parent's blkptr will have generation
set to 160. But the on-disk tree block will have an older
generation (say, 158). Transid verification fails and hence the
transaction gets aborted
 
 The test only cares about the FS instance before the unmount
 operation (i.e. the synced FS). Hence to get the test to pass, ignore the
 false-positive trace that could be generated.
 
 Signed-off-by: Chandan Rajendra chan...@linux.vnet.ibm.com
 ---
  tests/generic/311 | 2 ++
  1 file changed, 2 insertions(+)
 
 diff --git a/tests/generic/311 b/tests/generic/311
 index d21b6eb..cd6391d 100755
 --- a/tests/generic/311
 +++ b/tests/generic/311
 @@ -64,6 +64,8 @@ _require_xfs_io_command falloc
  
  [ -x $here/src/fsync-tester ] || _notrun fsync-tester not built
  
 +_disable_dmesg_check
 +

Hmm, I don't think this is something we'd want to do unconditionally.
E.g., if something hits the logs for xfs or ext4, we probably want to
hear about it.

Perhaps check that the fs is btrfs and possibly the fs params are such
that the known warning occurs..? I'd defer to the btrfs folks on how
best to check that, so long as it doesn't affect other fs'.

Brian

  rm -f $seqres.full
  SEED=1
  testfile=$SCRATCH_MNT/$seq.fsync
 -- 
 2.1.0
 
 --
 To unsubscribe from this list: send the line unsubscribe fstests in
 the body of a message to 

Re: [PATCH] xfstests: generic: test for discard properly discarding unused extents

2015-04-02 Thread Brian Foster
On Wed, Apr 01, 2015 at 03:01:07PM -0400, Jeff Mahoney wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 On 4/1/15 2:44 PM, Brian Foster wrote:
  On Mon, Mar 30, 2015 at 03:11:06PM -0400, Jeff Mahoney wrote:
  This tests tests four conditions where discard can potentially 
  not discard unused extents completely.
  
  We test, with -o discard and with fstrim, scenarios of removing 
  many relatively small files and removing several large files.
  
  The important part of the two scenarios is that the large files 
  must be large enough to span a blockgroup alone. It's possible 
  for an entire block group to be emptied and dropped without an 
  opportunity to discard individual extents as would happen with 
  smaller files.
  
  The test confirms the discards have occured by using a sparse 
  file mounted via loopback to punch holes and then check how many 
  blocks are still allocated within the file.
  
  Signed-off-by: Jeff Mahoney je...@suse.com ---
  
  The code looks mostly Ok to me, a few notes below. Those aside, 
  this is a longish test. It takes me about 8 minutes to run on my 
  typical low end vm.
 
 My test hardware is a 16 core / 16 GB RAM machine using a commodity
 SSD. It ran pretty quickly.
 

Yeah, as Dave mentioned, I test on anything from beefier bare metal
hardware to fairly resource constrained VMs. We certainly have longer
running tests, but sometimes I exclude them when I'm just trying to do
regression tests on ongoing development work, etc.

 I suppose I should start by explaining that I wrote the test to be
 btrfs specific and then realized that the only thing that was
 /actually/ btrfs-specific was the btrfs filesystem sync call. I ran it
 on XFS to ensure it worked as expected, but didn't have any reason to
 try to adapt it to work in any other environment.
 
  Is the 1GB block group magic value mutable in any way, or is it a 
  hardcoded thing (for btrfs I presume)? It would be nice if we could
  shrink that a bit. If not, perhaps there are some other ways to
  reduce the runtime...
 
 It's not hardcoded for btrfs, but it is by far the most common sized
 block group. I'd prefer to test what people are using.
 

Ok...

  - Is there any reason a single discard or trim test instance must 
  be all large or small files? In other words, is there something 
  that this wouldn't catch if the 10GB were 50% filled with large 
  files and %50 with small files? That would allow us to trim the 
  maximum on the range of small file creation and only have two 
  invocations instead of four.
 
 Only to draw attention to the obvious failure cases, which are
 probably specific to btrfs. If a file spans an entire block group and
 is removed, it skips the individual discards and depends on the block
 group removal to discard the entire thing (this wasn't happening). If
 there are lots of small files, it hits different paths, and I wanted
 to make it clear which one each mode of the test was targeting.
 Otherwise, whoever hits the failure is going to end up having to do it
 manually, which defeats the purpose of having an automated test case, IM
 O.
 

So it seems to me this is somewhat a mix of a functional test, a stress
test and a targeted regression test. IIUC, the regression could probably
be reproduced and tested for using a smaller block group and thus
require significantly less time. The functional test requires testing
that the various discard mechanisms work appropriately (e.g., mix of
files), but still shouldn't require writing so much data. A stress test
certainly requires a larger fs and writing a bit more data.

That could probably be managed in a variety of ways. You could throw the
whole thing under btrfs as is, split it up into separate tests that are
grouped such that it's easier to exclude the longer running test, use
fs-specific values as discussed above, use a percentage of SCRATCH_DEV
and let the tester determine the time based on the devices under test,
etc.

The only thing I really care about is the length of time running the
test on slower storage. The idea of scaling the file sizes seems a
reasonable enough workaround to me, assuming we can get that 8 minutes
down to a couple minutes or so.

  - If the 1GB thing is in fact a btrfs thing, could we make the
  core test a bit more size agnostic (e.g., perhaps pass the file 
  count/size values as parameters) and then scale the parameters up 
  exclusively for btrfs? For example, set defaults of fssize=1G, 
  largefile=100MB, smallfile=[512b-5MB] or something of that nature 
  and override them to the 10GB, 1GB, 32k-... values for btrfs? That 
  way we don't need to write as much data for fs' where it might not 
  be necessary.
 
 If someone wants to weigh in on what sane defaults for other file
 systems might be, sure.
 

The order of magnitude numbers I threw out above seem reasonable to me,
at least for xfs. We can always tweak it later.

  tests/generic/326 | 164

Re: [PATCH] xfstests: generic: test for discard properly discarding unused extents

2015-04-01 Thread Brian Foster
On Mon, Mar 30, 2015 at 03:11:06PM -0400, Jeff Mahoney wrote:
 This tests tests four conditions where discard can potentially not
 discard unused extents completely.
 
 We test, with -o discard and with fstrim, scenarios of removing many
 relatively small files and removing several large files.
 
 The important part of the two scenarios is that the large files must be
 large enough to span a blockgroup alone. It's possible for an
 entire block group to be emptied and dropped without an opportunity to
 discard individual extents as would happen with smaller files.
 
 The test confirms the discards have occured by using a sparse file
 mounted via loopback to punch holes and then check how many blocks
 are still allocated within the file.
 
 Signed-off-by: Jeff Mahoney je...@suse.com
 ---

The code looks mostly Ok to me, a few notes below. Those aside, this is
a longish test. It takes me about 8 minutes to run on my typical low end
vm.

Is the 1GB block group magic value mutable in any way, or is it a
hardcoded thing (for btrfs I presume)? It would be nice if we could
shrink that a bit. If not, perhaps there are some other ways to reduce
the runtime...

- Is there any reason a single discard or trim test instance must be all
large or small files? In other words, is there something that this
wouldn't catch if the 10GB were 50% filled with large files and %50 with
small files? That would allow us to trim the maximum on the range of
small file creation and only have two invocations instead of four.

- If the 1GB thing is in fact a btrfs thing, could we make the core test
a bit more size agnostic (e.g., perhaps pass the file count/size
values as parameters) and then scale the parameters up exclusively for
btrfs? For example, set defaults of fssize=1G, largefile=100MB,
smallfile=[512b-5MB] or something of that nature and override them to
the 10GB, 1GB, 32k-... values for btrfs? That way we don't need to write
as much data for fs' where it might not be necessary.

  tests/generic/326 | 164 
 ++
  tests/generic/326.out |   5 ++
  tests/generic/group   |   1 +
  3 files changed, 170 insertions(+)
  create mode 100644 tests/generic/326
  create mode 100644 tests/generic/326.out
 
 diff --git a/tests/generic/326 b/tests/generic/326
 new file mode 100644
 index 000..923a27f
 --- /dev/null
 +++ b/tests/generic/326
 @@ -0,0 +1,164 @@
 +#! /bin/bash
 +# FSQA Test No. 326
 +#
 +# This test uses a loopback mount with PUNCH_HOLE support to test
 +# whether discard operations are working as expected.
 +#
 +# It tests both -odiscard and fstrim.
 +#
 +# Copyright (C) 2015 SUSE. All Rights Reserved.
 +# Author: Jeff Mahoney je...@suse.com
 +#
 +# This program is free software; you can redistribute it and/or
 +# modify it under the terms of the GNU General Public License as
 +# published by the Free Software Foundation.
 +#
 +# This program is distributed in the hope that it would be useful,
 +# but WITHOUT ANY WARRANTY; without even the implied warranty of
 +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 +# GNU General Public License for more details.
 +#
 +# You should have received a copy of the GNU General Public License
 +# along with this program; if not, write the Free Software Foundation,
 +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
 +#---
 +#
 +
 +seq=`basename $0`
 +seqres=$RESULT_DIR/$seq
 +echo QA output created by $seq
 +
 +tmp=/tmp/$$
 +status=1 # failure is the default!
 +trap _cleanup; exit \$status 0 1 2 3 15
 +
 +loopdev=
 +tmpdir=
 +_cleanup()
 +{
 + [ -n $tmpdir ]  umount $tmpdir
 + [ -n $loopdev ]  losetup -d $loopdev
 +}
 +
 +# get standard environment, filters and checks
 +. ./common/rc
 +. ./common/filter
 +
 +# real QA test starts here
 +_need_to_be_root
 +_supported_fs generic
 +_supported_os Linux
 +_require_scratch
 +_require_fstrim
 +
 +rm -f $seqres.full
 +
 +_scratch_mkfs  $seqres.full
 +_require_fs_space $SCRATCH_MNT $(( 10 * 1024 * 1024 ))
 +_scratch_mount
 +
 +test_discard()
 +{
 + discard=$1
 + files=$2
 +
 + tmpfile=$SCRATCH_MNT/testfs.img.$$
 + tmpdir=$SCRATCH_MNT/testdir.$$
 + mkdir -p $tmpdir || _fail !!! failed to create temp mount dir
 +
 + # Create a sparse file to host the file system
 + dd if=/dev/zero of=$tmpfile bs=1M count=1 seek=10240  $seqres.full \
 + || _fail !!! failed to create fs image file

xfs_io -c truncate ... ?

 +
 + opts=
 + if [ $discard = discard ]; then
 + opts=-o discard
 + fi
 + losetup -f $tmpfile
 + loopdev=$(losetup -j $tmpfile|awk -F: '{print $1}')
 + _mkfs_dev $loopdev  $seqres.full
 + $MOUNT_PROG $opts $loopdev $tmpdir \
 + || _fail !!! failed to loopback mount
 +
 + if [ $files = large ]; then
 + # Create files larger than 1GB so each one occupies
 + # more than one 

Re: Fiemap inconsistent behaviour when file offset range isn't on a block boundary

2014-07-15 Thread Brian Foster
On Tue, Jul 15, 2014 at 04:20:29PM +0630, Chandan Rajendra wrote:
 All the filesystems created and used below have 4k blocksize. The
 file.bin file mentioned below is 8k in size and has two 4k
 extents. The test program used can be found at http://fpaste.org/118057/.
 
 1. First run (file range being passed is on block boundaries).
,
| [root@guest0 btrfs]# for f in $(ls -1 /mnt/{btrfs,ext4,xfs}/file.bin) ;
| do
| echo -- File: $f ---;
| /root/print-fiemap 0 8192 $f;
| done
| -- File: /mnt/btrfs/file.bin ---
| File range: 0 - 8191.
| Found 2 extents.
| Fiemap information:
|  Logical: 0
|  Physical: 12582912
|  Length: 4096
|  Flags:
| 
|  Logical: 4096
|  Physical: 12656640
|  Length: 4096
|  Flags: FIEMAP_EXTENT_LAST |
| 
| -- File: /mnt/ext4/file.bin ---
| File range: 0 - 8191.
| Found 2 extents.
| Fiemap information:
|  Logical: 0
|  Physical: 135270400
|  Length: 4096
|  Flags:
| 
|  Logical: 4096
|  Physical: 135278592
|  Length: 4096
|  Flags: FIEMAP_EXTENT_LAST |
| 
| -- File: /mnt/xfs/file.bin ---
| File range: 0 - 8191.
| Found 2 extents.
| Fiemap information:
|  Logical: 0
|  Physical: 49152
|  Length: 4096
|  Flags:
| 
|  Logical: 4096
|  Physical: 57344
|  Length: 4096
|  Flags: FIEMAP_EXTENT_LAST |
`
 
 2. If the file offset range being passed as input to fiemap ioctl is
not on block boundaries and falls within an extent's range then that
extent is skipped.
,
| [root@guest0 btrfs]# for f in $(ls -1 /mnt/{btrfs,ext4,xfs}/file.bin) ;
|  do
|  echo -- File: $f ---;
|  /root/print-fiemap 1 4095 $f;
|  done
| -- File: /mnt/btrfs/file.bin ---
| File range: 1 - 4095.
| Found 0 extents.
| 
| 
| -- File: /mnt/ext4/file.bin ---
| File range: 1 - 4095.
| Found 1 extents.
| Fiemap information:
|  Logical: 0
|  Physical: 135270400
|  Length: 4096
|  Flags:
| 
| -- File: /mnt/xfs/file.bin ---
| File range: 1 - 4095.
| Found 2 extents.
| Fiemap information:
|  Logical: 0
|  Physical: 49152
|  Length: 4096
|  Flags:
| 
|  Logical: 4096
|  Physical: 57344
|  Length: 4096
|  Flags: FIEMAP_EXTENT_LAST |
`
 
From linux/Documentation/filesystems/fiemap.txt, fm_start, and
fm_length specify the logical range within the file which the
process would like mappings for. Extents returned mirror those on
disk - that is, the logical offset of the 1st returned extent may
start before fm_start, and the range covered by the last returned
extent may end after fm_length. All offsets and lengths are in
bytes.
 
So IMHO, the above would mean that all the extents that map the
file range [fm_start, fm_start + fm_length - 1] should be returned
by a fiemap ioctl call (as done by ext4).
 
In the case of Btrfs, the commit
ea8efc74bd0402b4d5f663d007b4e25fa29ea778 i.e. Btrfs: make sure not
to return overlapping extents to fiemap, forces the first extent
returned by btrfs_fiemap() to start from fm_start (if fm_start is
greater than the file offset mapped by the containing extent's
first byte). Can somebody please list some example scenarios where
extent_fiemap() ends up returning dupclicate and overlapping
extents?
Also, the commit 4d479cf010d56ec9c54f302d039918f1296b
i.e. Btrfs: sectorsize align offsets in fiemap, rounds up first
byte of the file offset range to the next block. Shouldn't it be
rounded down instead?
 
XFS lists both the extents even though the first one encompasses the
file range specified in the input.
  

I gave this a test on XFS with a file that looks like this:

# xfs_bmap -v /mnt/file 
/mnt/file:
 EXT: FILE-OFFSET  BLOCK-RANGE  AG AG-OFFSETTOTAL FLAGS
   0: [0..15]: 102368..1023831 (51168..51183)  16 1
   1: [16..63]:1832..18790 (1832..1879)48 1

I narrowed the print_fiemap behavior down to this:

# ./print_fiemap 1 7680 /mnt/file 
File range: 1 - 7680.
Found 1 extents.
Fiemap information:
Logical: 0
Physical: 52412416
Length: 8192
Flags: FIEMAP_EXTENT_UNWRITTEN | 

# ./print_fiemap 1 7681 /mnt/file 
File range: 1 - 7681.
Found 2 extents.
Fiemap information:
   

Re: Fiemap inconsistent behaviour when file offset range isn't on a block boundary

2014-07-15 Thread Brian Foster
On Tue, Jul 15, 2014 at 08:15:12AM -0400, Brian Foster wrote:
 On Tue, Jul 15, 2014 at 04:20:29PM +0630, Chandan Rajendra wrote:
  All the filesystems created and used below have 4k blocksize. The
  file.bin file mentioned below is 8k in size and has two 4k
  extents. The test program used can be found at http://fpaste.org/118057/.
  
  1. First run (file range being passed is on block boundaries).
 ,
 | [root@guest0 btrfs]# for f in $(ls -1 /mnt/{btrfs,ext4,xfs}/file.bin) ;
 | do
 | echo -- File: $f ---;
 | /root/print-fiemap 0 8192 $f;
 | done
 | -- File: /mnt/btrfs/file.bin ---
 | File range: 0 - 8191.
 | Found 2 extents.
 | Fiemap information:
 |Logical: 0
 |Physical: 12582912
 |Length: 4096
 |Flags:
 | 
 |Logical: 4096
 |Physical: 12656640
 |Length: 4096
 |Flags: FIEMAP_EXTENT_LAST |
 | 
 | -- File: /mnt/ext4/file.bin ---
 | File range: 0 - 8191.
 | Found 2 extents.
 | Fiemap information:
 |Logical: 0
 |Physical: 135270400
 |Length: 4096
 |Flags:
 | 
 |Logical: 4096
 |Physical: 135278592
 |Length: 4096
 |Flags: FIEMAP_EXTENT_LAST |
 | 
 | -- File: /mnt/xfs/file.bin ---
 | File range: 0 - 8191.
 | Found 2 extents.
 | Fiemap information:
 |Logical: 0
 |Physical: 49152
 |Length: 4096
 |Flags:
 | 
 |Logical: 4096
 |Physical: 57344
 |Length: 4096
 |Flags: FIEMAP_EXTENT_LAST |
 `
  
  2. If the file offset range being passed as input to fiemap ioctl is
 not on block boundaries and falls within an extent's range then that
 extent is skipped.
 ,
 | [root@guest0 btrfs]# for f in $(ls -1 /mnt/{btrfs,ext4,xfs}/file.bin) ;
 |  do
 |  echo -- File: $f ---;
 |  /root/print-fiemap 1 4095 $f;
 |  done
 | -- File: /mnt/btrfs/file.bin ---
 | File range: 1 - 4095.
 | Found 0 extents.
 | 
 | 
 | -- File: /mnt/ext4/file.bin ---
 | File range: 1 - 4095.
 | Found 1 extents.
 | Fiemap information:
 |Logical: 0
 |Physical: 135270400
 |Length: 4096
 |Flags:
 | 
 | -- File: /mnt/xfs/file.bin ---
 | File range: 1 - 4095.
 | Found 2 extents.
 | Fiemap information:
 |Logical: 0
 |Physical: 49152
 |Length: 4096
 |Flags:
 | 
 |Logical: 4096
 |Physical: 57344
 |Length: 4096
 |Flags: FIEMAP_EXTENT_LAST |
 `
  
 From linux/Documentation/filesystems/fiemap.txt, fm_start, and
 fm_length specify the logical range within the file which the
 process would like mappings for. Extents returned mirror those on
 disk - that is, the logical offset of the 1st returned extent may
 start before fm_start, and the range covered by the last returned
 extent may end after fm_length. All offsets and lengths are in
 bytes.
  
 So IMHO, the above would mean that all the extents that map the
 file range [fm_start, fm_start + fm_length - 1] should be returned
 by a fiemap ioctl call (as done by ext4).
  
 In the case of Btrfs, the commit
 ea8efc74bd0402b4d5f663d007b4e25fa29ea778 i.e. Btrfs: make sure not
 to return overlapping extents to fiemap, forces the first extent
 returned by btrfs_fiemap() to start from fm_start (if fm_start is
 greater than the file offset mapped by the containing extent's
 first byte). Can somebody please list some example scenarios where
 extent_fiemap() ends up returning dupclicate and overlapping
 extents?
 Also, the commit 4d479cf010d56ec9c54f302d039918f1296b
 i.e. Btrfs: sectorsize align offsets in fiemap, rounds up first
 byte of the file offset range to the next block. Shouldn't it be
 rounded down instead?
  
 XFS lists both the extents even though the first one encompasses the
 file range specified in the input.
   
 
 I gave this a test on XFS with a file that looks like this:
 
 # xfs_bmap -v /mnt/file 
 /mnt/file:
  EXT: FILE-OFFSET  BLOCK-RANGE  AG AG-OFFSETTOTAL FLAGS
0: [0..15]: 102368..1023831 (51168..51183)  16 1
1: [16..63]:1832..18790 (1832..1879