Re: [RFC PATCH v4 6/6] Btrfs: support swap files

2018-05-25 Thread Omar Sandoval
On Fri, May 25, 2018 at 01:07:00PM +0300, Nikolay Borisov wrote:
> 
> 
> On 25.05.2018 00:41, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Implement the swap file a_ops on Btrfs. Activation needs to make sure
> > that the file can be used as a swap file, which currently means it must
> > be fully allocated as nocow with no compression on one device. It also
> > sets up the swap extents directly with add_swap_extent(), so export it.
> > 
> > Signed-off-by: Omar Sandoval 
> 
> What testing (apart form the xfstest patches you sent) have this code
> seen?

Light testing with my swapme script [1] and btrfsck to make sure I
didn't swap to the wrong place. I was meaning to put this through
something more intensive like a kernel build, thanks for the reminder.
As opposed to the previous approach, the swapin/swapout paths are
simple, core code, so the edge cases I'm worried about are really in
activate/deactivate and other ioctls breaking things.

1: https://github.com/osandov/osandov-linux/blob/master/scripts/swapme.c

> Have you run it with lockdep enabled (I'm asking because when I
> picked up your v3 there was quite a bunch of deadlock warnings). Also
> see some inline questions below.

Yup, I've been running it with lockdep, no warnings. The nice part of
this approach is that there's no new locking involved, just whatever the
swap code does itself.

> > ---
> >  fs/btrfs/inode.c | 220 +++
> >  mm/swapfile.c|   1 +
> >  2 files changed, 221 insertions(+)
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 9228cb866115..6cca8529e307 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -10526,6 +10526,224 @@ void btrfs_set_range_writeback(void 
> > *private_data, u64 start, u64 end)
> > }
> >  }
> >  
> 
> > +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file 
> > *file,
> > +  sector_t *span)
> > +{
> > +   struct inode *inode = file_inode(file);
> > +   struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> > +   struct extent_io_tree *io_tree = _I(inode)->io_tree;
> > +   struct extent_state *cached_state = NULL;
> > +   struct extent_map *em = NULL;
> > +   struct btrfs_device *device = NULL;
> > +   struct btrfs_swap_info bsi = {
> > +   .lowest_ppage = (sector_t)-1ULL,
> > +   };
> > +   int ret = 0;
> > +   u64 isize = inode->i_size;
> > +   u64 start;
> > +
> > +   /*
> > +* The inode is locked, so these flags won't change after we check them.
> > +*/
> > +   if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
> > +   btrfs_err(fs_info, "swapfile is compressed");
> > +   return -EINVAL;
> > +   }
> > +   if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
> > +   btrfs_err(fs_info, "swapfile is copy-on-write");
> > +   return -EINVAL;
> > +   }
> > +
> > +   /*
> > +* Balance or device remove/replace/resize can move stuff around from
> > +* under us. The EXCL_OP flag makes sure they aren't running/won't run
> > +* concurrently while we are mapping the swap extents, and the fs_info
> > +* nr_swapfiles counter prevents them from running while the swap file
> > +* is active and moving the extents. Note that this also prevents a
> > +* concurrent device add which isn't actually necessary, but it's not
> > +* really worth the trouble to allow it.
> > +*/
> > +   if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags))
> > +   return -EBUSY;
> > +   atomic_inc(_info->nr_swapfiles);
> > +   /*
> > +* Snapshots can create extents which require COW even if NODATACOW is
> > +* set. We use this counter to prevent snapshots. We must increment it
> > +* before walking the extents because we don't want a concurrent
> > +* snapshot to run after we've already checked the extents.
> > +*/
> > +   atomic_inc(_I(inode)->root->nr_swapfiles);
> > +
> > +   lock_extent_bits(io_tree, 0, isize - 1, _state);
> > +   start = 0;
> > +   while (start < isize) {
> > +   u64 end, logical_block_start, physical_block_start;
> > +   u64 len = isize - start;
> > +
> > +   em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, len, 0);
> > +   if (IS_ERR(em)) {
> > +   ret = PTR_ERR(em);
> > +   goto out;
> > +   }
> > +   end = extent_map_end(em);
> > +
> > +   if (em->block_start == EXTENT_MAP_HOLE) {
> > +   btrfs_err(fs_info, "swapfile has holes");
> > +   ret = -EINVAL;
> > +   goto out;
> > +   }
> > +   if (em->block_start == EXTENT_MAP_INLINE) {
> > +   /*
> > +* It's unlikely we'll ever actually find ourselves
> > +* here, as a file small enough to fit inline won't be
> > +* big enough to store more than the swap header, 

Re: [RFC PATCH v4 4/6] Btrfs: prevent ioctls from interfering with a swap file

2018-05-25 Thread David Sterba
On Fri, May 25, 2018 at 09:00:58AM -0700, Omar Sandoval wrote:
> On Fri, May 25, 2018 at 04:50:55PM +0200, David Sterba wrote:
> > On Thu, May 24, 2018 at 02:41:28PM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval 
> > > 
> > > When a swap file is active, we must make sure that the extents of the
> > > file are not moved and that they don't become shared. That means that
> > > the following are not safe:
> > > 
> > > - chattr +c (enable compression)
> > > - reflink
> > > - dedupe
> > > - snapshot
> > > - defrag
> > > - balance
> > > - device remove/replace/resize
> > > 
> > > Don't allow those to happen on an active swap file. Balance and device
> > > remove/replace/resize in particular are disallowed entirely; in the
> > > future, we can relax this so that relocation skips/errors out only on
> > > chunks containing an active swap file.
> > 
> > Hm, disabling the entire balance is too intrusive. It's clear that the
> > swapfile causes a lot of trouble when it goes against the dynamic
> > capabilities of btrfs (relocation and the functionality that builds on
> > it).
> > 
> > Skipping the swapfile extents should be implemented at minimum.
> 
> Sure thing, this should definitely be possible. For balance, we can skip
> them; for resize or delete, it of course has to fail if it encounters
> swap extents. I'll take a stab at it.

We can detect if there's an active swap file on the filesystem before
shrink, delete or replace is started so the user is not surprised if it
fails in the end, or not start the operations at all and give some hints
what to do.

> > We can
> > add some heuristics that will group the swap extents to a small number
> > of chunks so the impact of unmovable chunks is limited.
> > 
> > I haven't looked at the implementation, but it might be possible to
> > internally find a different location for the swap extent once it's not
> > used for the actual paged data.
> > 
> > In an ideal case, the swap extents could span entire chunks (1G) and not
> > mix with regular data/metadata.
> > 
> > > Note that we don't have to worry about chattr -C (disable nocow), which
> > > we ignore for non-empty files, because an active swapfile must be
> > > non-empty and can't be truncated. We also don't have to worry about
> > > autodefrag because it's only done on COW files. Truncate and fallocate
> > > are already taken care of by the generic code. Device add doesn't do
> > > relocation so it's not an issue, either.
> > 
> > Ok, fine the remaining easy cases are covered.
> > 
> > I don't know if you mentioned that elsewhere (as design questions are
> > in this patch), the allocation profile is single, or is it also possible
> > to have striped or duplicated swap extents?
> 
> That's briefly mentioned in the last patch, only single data is
> supported, although I think I can easily relax that to also allow RAID0.
> Anything else is much harder to support, but we need to start somewhere.

Of course, support for single is absolutelly fine for the first
implementation.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: add parent_transid parameter to veirfy_level_key

2018-05-25 Thread David Sterba
On Fri, May 18, 2018 at 01:10:59PM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年05月18日 13:02, Nikolay Borisov wrote:
> > 
> > 
> > On 18.05.2018 05:59, Liu Bo wrote:
> >> As verify_level_key() is checked after verify_parent_transid(), i.e.
> >>
> >> if (verify_parent_transid())
> >>ret = -EIO;
> >> else if (verify_level_key())
> >>ret = -EUCLEAN;
> >>
> >> if parent_transid is 0, verify_parent_transid() skips verifying
> >> parent_transid and considers eb as valid, and if verify_level_key()
> >> reports something wrong, we're not going to know if it's caused by
> >> corrupted metadata or non-checkecd eb (e.g. stale eb).
> > 
> > It's really unclear (from the changelog) how the stale eb ties with
> > parent_transid. You should have explained the relationship between
> > checking parenttransid and stale/non-stale ebs
> 
> It's in fact related to the fixing patch:
> Btrfs: fix the corruption by reading stale btree blocks
> 
> But whatever, extra mention won't hurt.

I've added reference to the patch as it has more details so we don't
have to repeat it in the changelog.

Patch added to misc-next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v4 5/6] Btrfs: rename get_chunk_map() and make it non-static

2018-05-25 Thread Omar Sandoval
On Fri, May 25, 2018 at 12:21:41PM +0300, Nikolay Borisov wrote:
> 
> 
> On 25.05.2018 00:41, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
> > make it non-static.
> > 
> > Signed-off-by: Omar Sandoval 
> 
> Reviewed-by: Nikolay Borisov 
> 
> nit: How about introducing proper kernel doc for this function, now that
> it becomes public just as good practice so that eventually we will have
> proper kernel doc for all public interfaces. You could also mention that
> it needs a paired free_extent_map

Will do, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v4 4/6] Btrfs: prevent ioctls from interfering with a swap file

2018-05-25 Thread Omar Sandoval
On Fri, May 25, 2018 at 04:50:55PM +0200, David Sterba wrote:
> On Thu, May 24, 2018 at 02:41:28PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > When a swap file is active, we must make sure that the extents of the
> > file are not moved and that they don't become shared. That means that
> > the following are not safe:
> > 
> > - chattr +c (enable compression)
> > - reflink
> > - dedupe
> > - snapshot
> > - defrag
> > - balance
> > - device remove/replace/resize
> > 
> > Don't allow those to happen on an active swap file. Balance and device
> > remove/replace/resize in particular are disallowed entirely; in the
> > future, we can relax this so that relocation skips/errors out only on
> > chunks containing an active swap file.
> 
> Hm, disabling the entire balance is too intrusive. It's clear that the
> swapfile causes a lot of trouble when it goes against the dynamic
> capabilities of btrfs (relocation and the functionality that builds on
> it).
> 
> Skipping the swapfile extents should be implemented at minimum.

Sure thing, this should definitely be possible. For balance, we can skip
them; for resize or delete, it of course has to fail if it encounters
swap extents. I'll take a stab at it.

> We can
> add some heuristics that will group the swap extents to a small number
> of chunks so the impact of unmovable chunks is limited.
> 
> I haven't looked at the implementation, but it might be possible to
> internally find a different location for the swap extent once it's not
> used for the actual paged data.
> 
> In an ideal case, the swap extents could span entire chunks (1G) and not
> mix with regular data/metadata.
> 
> > Note that we don't have to worry about chattr -C (disable nocow), which
> > we ignore for non-empty files, because an active swapfile must be
> > non-empty and can't be truncated. We also don't have to worry about
> > autodefrag because it's only done on COW files. Truncate and fallocate
> > are already taken care of by the generic code. Device add doesn't do
> > relocation so it's not an issue, either.
> 
> Ok, fine the remaining easy cases are covered.
> 
> I don't know if you mentioned that elsewhere (as design questions are
> in this patch), the allocation profile is single, or is it also possible
> to have striped or duplicated swap extents?

That's briefly mentioned in the last patch, only single data is
supported, although I think I can easily relax that to also allow RAID0.
Anything else is much harder to support, but we need to start somewhere.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs: drop uuid_mutex in btrfs_free_extra_devids()

2018-05-25 Thread David Sterba
On Wed, May 23, 2018 at 10:54:22AM +0800, Anand Jain wrote:
> btrfs_free_extra_devids() is called only in the mount context which
> traverses through the fs_devices::devices and frees the orphan devices
> devices in the given %fs_devices if any. As the search for the orphan
> device is limited to fs_devices::devices so we don't need the global
> uuid_mutex.
> 
> There can't be any mount-point based ioctl threads in this context as
> the mount thread is not yet returned. But there can be the btrfs-control
> based scan ioctls thread which calls device_list_add().
> 
> Here in the mount thread the fs_devices::opened is incremented way before
> btrfs_free_extra_devids() is called and in the scan context the fs_devices
> which are already opened neither be freed or alloc-able at
> device_list_add().
> 
> But lets say you change the device-path and call the scan again, then scan
> would update the new device path and this operation could race against the
> btrfs_free_extra_devids() thread, which might be in the process of
> free-ing the same device. So synchronize it by using the
> device_list_mutex.
> 
> This scenario is a very corner case, and practically the scan and mount
> are anyway serialized by the usage so unless the race is instrumented its
> very difficult to achieve.
> 
> Signed-off-by: Anand Jain 

Thanks, this explanation is much better and addresses the questions I
have while reading the code.

Reviewed-by: David Sterba 

> ---
> Currently device_list_add() is very lean on its device_list_mutex usage,
> a cleanup and fix is wip.

I also have a WIP patch to rewrite device_list_add. There were quite
some changes around the device locking so I'd need to refresh it on top
of current code first, it's not in a shape to be posted yet.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 3/3] btrfs: Do super block verification before writing it to disk

2018-05-25 Thread David Sterba
On Fri, May 25, 2018 at 12:43:25PM +0800, Qu Wenruo wrote:
> Reported-by: Ken Swenson 
> Reported-by: Ben Parsons <9parso...@gmail.com>
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/disk-io.c | 43 +++
>  1 file changed, 43 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b981ecc4b6f9..d6c0cee627d9 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2610,6 +2610,41 @@ static int btrfs_validate_mount_super(struct 
> btrfs_fs_info *fs_info)
>   return __validate_super(fs_info, fs_info->super_copy, 0);
>  }
>  
> +/*
> + * Check the validation of super block at write time.
> + * Some checks like bytenr check will be skipped as their values will be
> + * overwritten soon.
> + * Extra checks like csum type and incompact flags will be executed here.
> + */
> +static int btrfs_validate_write_super(struct btrfs_fs_info *fs_info,
> +   struct btrfs_super_block *sb)
> +{
> + int ret;
> +
> + ret = __validate_super(fs_info, sb, -1);

Please note that the patches have been slightly updated in the committed
version. Updates to patches that come that late after the initial
iterations are more convenient as incremental updates.

> + if (ret < 0)
> + goto out;
> + if (btrfs_super_csum_type(sb) != BTRFS_CSUM_TYPE_CRC32) {
> + ret = -EUCLEAN;
> + btrfs_err(fs_info, "invalid csum type, has %u want %u",
> +   btrfs_super_csum_type(sb), BTRFS_CSUM_TYPE_CRC32);
> + goto out;
> + }
> + if (btrfs_super_incompat_flags(sb) & ~BTRFS_FEATURE_INCOMPAT_SUPP) {
> + ret = -EUCLEAN;
> + btrfs_err(fs_info,
> + "invalid incompact flags, has 0x%llu valid mask 0x%llu",
> +   btrfs_super_incompat_flags(sb),
> +   BTRFS_FEATURE_INCOMPAT_SUPP);
> + goto out;
> + }
> +out:
> + if (ret < 0)
> + btrfs_err(fs_info,
> + "super block corruption detected before writing it to disk");
> + return ret;
> +}
> +
>  int open_ctree(struct super_block *sb,
>  struct btrfs_fs_devices *fs_devices,
>  char *options)
> @@ -3770,6 +3805,14 @@ int write_all_supers(struct btrfs_fs_info *fs_info, 
> int max_mirrors)
>   flags = btrfs_super_flags(sb);
>   btrfs_set_super_flags(sb, flags | BTRFS_HEADER_FLAG_WRITTEN);
>  
> + ret = btrfs_validate_write_super(fs_info, sb);
> + if (ret < 0) {
> + mutex_unlock(_info->fs_devices->device_list_mutex);
> + btrfs_handle_fs_error(fs_info, -EUCLEAN,
> + "unexpected superblock corruption detected");
> + return -EUCLEAN;

The shortcut from here makes sense but we still have to wait for any
previous superblock writes that could have happened (write_dev_supers).

The sequence would be:

at device 1
check sb
write

at device 2
check sb
failed write

now the bh from device 1 write would have one more reference.

As the change is only in the 3rd patch, please send only that one and
based on the patch in misc-next. Thanks.

> + }
> +
>   ret = write_dev_supers(dev, sb, max_mirrors);

(Here the reference gets incremented.)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v4 4/6] Btrfs: prevent ioctls from interfering with a swap file

2018-05-25 Thread David Sterba
On Thu, May 24, 2018 at 02:41:28PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> When a swap file is active, we must make sure that the extents of the
> file are not moved and that they don't become shared. That means that
> the following are not safe:
> 
> - chattr +c (enable compression)
> - reflink
> - dedupe
> - snapshot
> - defrag
> - balance
> - device remove/replace/resize
> 
> Don't allow those to happen on an active swap file. Balance and device
> remove/replace/resize in particular are disallowed entirely; in the
> future, we can relax this so that relocation skips/errors out only on
> chunks containing an active swap file.

Hm, disabling the entire balance is too intrusive. It's clear that the
swapfile causes a lot of trouble when it goes against the dynamic
capabilities of btrfs (relocation and the functionality that builds on
it).

Skipping the swapfile extents should be implemented at minimum. We can
add some heuristics that will group the swap extents to a small number
of chunks so the impact of unmovable chunks is limited.

I haven't looked at the implementation, but it might be possible to
internally find a different location for the swap extent once it's not
used for the actual paged data.

In an ideal case, the swap extents could span entire chunks (1G) and not
mix with regular data/metadata.

> Note that we don't have to worry about chattr -C (disable nocow), which
> we ignore for non-empty files, because an active swapfile must be
> non-empty and can't be truncated. We also don't have to worry about
> autodefrag because it's only done on COW files. Truncate and fallocate
> are already taken care of by the generic code. Device add doesn't do
> relocation so it's not an issue, either.

Ok, fine the remaining easy cases are covered.

I don't know if you mentioned that elsewhere (as design questions are
in this patch), the allocation profile is single, or is it also possible
to have striped or duplicated swap extents?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[josef-btrfs:blk-iolatency 6/13] common.c:undefined reference to `blkcg_maybe_throttle_current'

2018-05-25 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
blk-iolatency
head:   6bbc4451bcc7359b14e190615ff671fd63e34d62
commit: 605adede5ce998f9e834c1783e74dc0660b346db [6/13] blkcg: add generic 
throttling mechanism
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
git checkout 605adede5ce998f9e834c1783e74dc0660b346db
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   arch/x86/entry/common.o: In function `prepare_exit_to_usermode':
>> common.c:(.text+0x139): undefined reference to `blkcg_maybe_throttle_current'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[josef-btrfs:blk-iolatency 7/13] include/linux/blkdev.h:2092:9: note: in expansion of macro 'NULL'

2018-05-25 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
blk-iolatency
head:   6bbc4451bcc7359b14e190615ff671fd63e34d62
commit: e13e26d1ad161949ae1fdd02623c2d81a666328e [7/13] memcontrol: schedule 
throttling if we are congested
config: x86_64-randconfig-x018-201820 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
git checkout e13e26d1ad161949ae1fdd02623c2d81a666328e
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/uapi/linux/posix_types.h:5:0,
from include/uapi/linux/types.h:14,
from include/linux/compiler.h:169,
from include/asm-generic/bug.h:5,
from arch/x86/include/asm/bug.h:83,
from include/linux/bug.h:5,
from include/linux/mmdebug.h:5,
from include/linux/mm.h:9,
from include/linux/pagemap.h:8,
from fs/configfs/inode.c:32:
   include/linux/blkdev.h: In function 'bdev_get_queue':
   include/linux/stddef.h:8:14: warning: return makes integer from pointer 
without a cast [-Wint-conversion]
#define NULL ((void *)0)
 ^
>> include/linux/blkdev.h:2092:9: note: in expansion of macro 'NULL'
 return NULL;
^~~~
   Cyclomatic Complexity 5 include/linux/compiler.h:__read_once_size
   Cyclomatic Complexity 5 include/linux/compiler.h:__write_once_size
   Cyclomatic Complexity 1 include/linux/kasan-checks.h:kasan_check_read
   Cyclomatic Complexity 1 include/linux/kasan-checks.h:kasan_check_write
   Cyclomatic Complexity 1 arch/x86/include/asm/bitops.h:fls64
   Cyclomatic Complexity 1 include/linux/log2.h:__ilog2_u64
   Cyclomatic Complexity 1 include/linux/list.h:INIT_LIST_HEAD
   Cyclomatic Complexity 1 include/linux/list.h:__list_del
   Cyclomatic Complexity 2 include/linux/list.h:__list_del_entry
   Cyclomatic Complexity 1 include/linux/list.h:list_del_init
   Cyclomatic Complexity 1 include/asm-generic/getorder.h:__get_order
   Cyclomatic Complexity 1 arch/x86/include/asm/atomic.h:arch_atomic_read
   Cyclomatic Complexity 1 
arch/x86/include/asm/atomic.h:arch_atomic_dec_and_test
   Cyclomatic Complexity 1 include/asm-generic/atomic-instrumented.h:atomic_read
   Cyclomatic Complexity 1 
include/asm-generic/atomic-instrumented.h:atomic_dec_and_test
   Cyclomatic Complexity 1 include/linux/spinlock.h:spin_lock
   Cyclomatic Complexity 1 include/linux/spinlock.h:spin_unlock
   Cyclomatic Complexity 1 include/linux/list_bl.h:hlist_bl_unhashed
   Cyclomatic Complexity 3 include/linux/dcache.h:dget_dlock
   Cyclomatic Complexity 3 include/linux/dcache.h:dget
   Cyclomatic Complexity 1 include/linux/dcache.h:d_unhashed
   Cyclomatic Complexity 1 include/linux/dcache.h:d_really_is_negative
   Cyclomatic Complexity 1 include/linux/dcache.h:d_really_is_positive
   Cyclomatic Complexity 3 include/linux/dcache.h:simple_positive
   Cyclomatic Complexity 1 include/linux/dcache.h:d_inode
   Cyclomatic Complexity 1 include/linux/fs.h:inode_lock
   Cyclomatic Complexity 1 include/linux/fs.h:inode_unlock
   Cyclomatic Complexity 56 include/linux/slab.h:kmalloc_index
   Cyclomatic Complexity 68 include/linux/slab.h:kmalloc_large
   Cyclomatic Complexity 9 include/linux/slab.h:kmalloc
   Cyclomatic Complexity 1 include/linux/slab.h:kzalloc
   Cyclomatic Complexity 3 
fs/configfs/configfs_internal.h:release_configfs_dirent
   Cyclomatic Complexity 3 fs/configfs/configfs_internal.h:configfs_put
   Cyclomatic Complexity 1 fs/configfs/inode.c:set_default_inode_attr
   Cyclomatic Complexity 1 fs/configfs/inode.c:set_inode_attr
   Cyclomatic Complexity 24 fs/configfs/inode.c:configfs_setattr
   Cyclomatic Complexity 5 fs/configfs/inode.c:configfs_set_inode_lock_class
   Cyclomatic Complexity 5 fs/configfs/inode.c:configfs_new_inode
   Cyclomatic Complexity 7 fs/configfs/inode.c:configfs_create
   Cyclomatic Complexity 9 fs/configfs/inode.c:configfs_get_name
   Cyclomatic Complexity 4 fs/configfs/inode.c:configfs_drop_dentry
   Cyclomatic Complexity 10 fs/configfs/inode.c:configfs_hash_and_remove
--
   In file included from include/uapi/linux/posix_types.h:5:0,
from include/uapi/linux/types.h:14,
from include/linux/types.h:6,
from include/linux/list.h:5,
from include/linux/module.h:9,
from drivers/mmc/core/core.c:13:
   include/linux/blkdev.h: In function 'bdev_get_queue':
   include/linux/stddef.h:8:14: warning: return makes integer from pointer 
without a cast [-Wint-conversion]
#define NULL ((void *)0)
 ^
>> include/linux/blkdev.h:2092:9: note: in expansion of macro 'NULL'
 return NULL;
^~~~
   Cyclomatic Complexity 5 include/linux/compiler.h:__read_once_size
   Cyclomatic Complexity 5 include/linux/compiler.h:__write_once_size
   

Re: [PATCH] btrfs: qgroup: More meaningful qgroup_rescan_init error message

2018-05-25 Thread David Sterba
On Wed, May 02, 2018 at 01:28:03PM +0800, Qu Wenruo wrote:
> Error message from qgroup_rescan_init() mostly looks like:
> 
> --
> BTRFS info (device nvme0n1p1): qgroup_rescan_init failed with -115
> --
> 
> Which is far from meaningful, and sometimes confusing as for above
> -EINPROGRESS it's mostly (despite the init race) harmless, but sometimes
> it can also indicates problem if return value is -EINVAL.
> 
> Change it to some more meaningful messages like:
> 
> --
> BTRFS info (device nvme0n1p1): qgroup rescan is already in progress
> --
> 
> And
> 
> --
> BTRFS err(device nvme0n1p1): qgroup_rescan_init failed, qgroup is not enabled
> --
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/qgroup.c | 34 +++---
>  1 file changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index ec2339a49ec3..a5742e9e9a14 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2760,26 +2760,37 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 
> progress_objectid,
>  {
>   int ret = 0;
>  
> - if (!init_flags &&
> - (!(fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_RESCAN) ||
> -  !(fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_ON))) {
> - ret = -EINVAL;
> - goto err;
> + if (!init_flags) {
> + /* we're resuming qgroup rescan at mount time */
> + if (!(fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_RESCAN))
> + btrfs_err(fs_info, "%s failed, qgroup is not enabled",
> +   __func__);

I've updated this to avoid the function name as it's supposed to be a
user message, but it's the same string without the _. Also the error
level was originally info, but as you now filter the important messages,
from the wording I'd say they're warnings not really errors.

Anyway, updated and added to misc-next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-25 Thread Matthew Wilcox
On Thu, May 24, 2018 at 05:29:43PM +0200, Michal Hocko wrote:
> > ie if we had more,
> > could we solve our pain by making them more generic?
> 
> Well, if you have more you will consume more bits in the struct pages,
> right?

Not necessarily ... the zone number is stored in the struct page
currently, so either two or three bits are used right now.  In my
proposal, one can infer the zone of a page from its PFN, except for
ZONE_MOVABLE.  So we could trim down to just one bit per struct page
for 32-bit machines while using 3 bits on 64-bit machines, where there
is plenty of space.

> > it more-or-less sucks that the devices with 28-bit DMA limits are forced
> > to allocate from the low 16MB when they're perfectly capable of using the
> > low 256MB.
> 
> Do we actually care all that much about those? If yes then we should
> probably follow the ZONE_DMA (x86) path and use a CMA region for them.
> I mean most devices should be good with very limited addressability or
> below 4G, no?

Sure.  One other thing I meant to mention was the media devices
(TV capture cards and so on) which want a vmalloc_32() allocation.
On 32-bit machines right now, we allocate from LOWMEM, when we really
should be allocating from the 1GB-4GB region.  32-bit machines generally
don't have a ZONE_DMA32 today.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v4 6/6] Btrfs: support swap files

2018-05-25 Thread Nikolay Borisov


On 25.05.2018 00:41, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Implement the swap file a_ops on Btrfs. Activation needs to make sure
> that the file can be used as a swap file, which currently means it must
> be fully allocated as nocow with no compression on one device. It also
> sets up the swap extents directly with add_swap_extent(), so export it.
> 
> Signed-off-by: Omar Sandoval 

What testing (apart form the xfstest patches you sent) have this code
seen? Have you run it with lockdep enabled (I'm asking because when I
picked up your v3 there was quite a bunch of deadlock warnings). Also
see some inline questions below.

> ---
>  fs/btrfs/inode.c | 220 +++
>  mm/swapfile.c|   1 +
>  2 files changed, 221 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9228cb866115..6cca8529e307 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -10526,6 +10526,224 @@ void btrfs_set_range_writeback(void *private_data, 
> u64 start, u64 end)
>   }
>  }
>  

> +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file 
> *file,
> +sector_t *span)
> +{
> + struct inode *inode = file_inode(file);
> + struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> + struct extent_io_tree *io_tree = _I(inode)->io_tree;
> + struct extent_state *cached_state = NULL;
> + struct extent_map *em = NULL;
> + struct btrfs_device *device = NULL;
> + struct btrfs_swap_info bsi = {
> + .lowest_ppage = (sector_t)-1ULL,
> + };
> + int ret = 0;
> + u64 isize = inode->i_size;
> + u64 start;
> +
> + /*
> +  * The inode is locked, so these flags won't change after we check them.
> +  */
> + if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
> + btrfs_err(fs_info, "swapfile is compressed");
> + return -EINVAL;
> + }
> + if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
> + btrfs_err(fs_info, "swapfile is copy-on-write");
> + return -EINVAL;
> + }
> +
> + /*
> +  * Balance or device remove/replace/resize can move stuff around from
> +  * under us. The EXCL_OP flag makes sure they aren't running/won't run
> +  * concurrently while we are mapping the swap extents, and the fs_info
> +  * nr_swapfiles counter prevents them from running while the swap file
> +  * is active and moving the extents. Note that this also prevents a
> +  * concurrent device add which isn't actually necessary, but it's not
> +  * really worth the trouble to allow it.
> +  */
> + if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags))
> + return -EBUSY;
> + atomic_inc(_info->nr_swapfiles);
> + /*
> +  * Snapshots can create extents which require COW even if NODATACOW is
> +  * set. We use this counter to prevent snapshots. We must increment it
> +  * before walking the extents because we don't want a concurrent
> +  * snapshot to run after we've already checked the extents.
> +  */
> + atomic_inc(_I(inode)->root->nr_swapfiles);
> +
> + lock_extent_bits(io_tree, 0, isize - 1, _state);
> + start = 0;
> + while (start < isize) {
> + u64 end, logical_block_start, physical_block_start;
> + u64 len = isize - start;
> +
> + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, len, 0);
> + if (IS_ERR(em)) {
> + ret = PTR_ERR(em);
> + goto out;
> + }
> + end = extent_map_end(em);
> +
> + if (em->block_start == EXTENT_MAP_HOLE) {
> + btrfs_err(fs_info, "swapfile has holes");
> + ret = -EINVAL;
> + goto out;
> + }
> + if (em->block_start == EXTENT_MAP_INLINE) {
> + /*
> +  * It's unlikely we'll ever actually find ourselves
> +  * here, as a file small enough to fit inline won't be
> +  * big enough to store more than the swap header, but in
> +  * case something changes in the future, let's catch it
> +  * here rather than later.
> +  */
> + btrfs_err(fs_info, "swapfile is inline");
> + ret = -EINVAL;
> + goto out;
> + }
> + if (test_bit(EXTENT_FLAG_COMPRESSED, >flags)) {
> + btrfs_err(fs_info, "swapfile is compressed");
> + ret = -EINVAL;
> + goto out;
> + }

Isn't this implied by the earlier BTRFS_I(inode)->flags &
BTRFS_INODE_COMPRESS check ?
> +
> + logical_block_start = em->block_start + (start - em->start);

So which offset are you calculating by doing the start - 

RE: [External] Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-25 Thread Huaisheng HS1 Ye
From: Michal Hocko [mailto:mho...@kernel.org]
Sent: Thursday, May 24, 2018 8:19 PM> 
> > Let me try to reply your questions.
> > Exactly, GFP_ZONE_TABLE is too complicated. I think there are two advantages
> > from the series of patches.
> >
> > 1. XOR operation is simple and efficient, GFP_ZONE_TABLE/BAD need to do 
> > twice
> > shift operations, the first is for getting a zone_type and the second is for
> > checking the to be returned type is a correct or not. But with these patch 
> > XOR
> > operation just needs to use once. Because the bottom 3 bits of GFP bitmask 
> > have
> > been used to represent the encoded zone number, we can say there is no bad 
> > zone
> > number if all callers could use it without buggy way. Of course, the 
> > returned
> > zone type in gfp_zone needs to be no more than ZONE_MOVABLE.
> 
> But you are losing the ability to check for wrong usage. And it seems
> that the sad reality is that the existing code do screw up.

In my opinion, originally there shouldn't be such many wrong combinations of 
these bottom 3 bits. For any user, whether or driver and fs, they should make a 
decision that which zone is they preferred. Matthew's idea is great, because 
with it the user must offer an unambiguous flag to gfp zone bits.

Ideally, before any user wants to modify the address zone modifier, they should 
clear it firstly, then ORing the GFP zone flag which comes from the zone they 
prefer.
With these patches, we can loudly announce that, the bottom 3 bits of zone mask 
couldn't accept internal ORing operations.
The operations like __GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM is illegal. The 
current GFP_ZONE_TABLE is precisely the root of this problem, that is 
__GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM are formatted as 0x1, 0x2 and 0x4.

> 
> > 2. GFP_ZONE_TABLE has limit with the amount of zone types. Current 
> > GFP_ZONE_TABLE
> > is 32 bits, in general, there are 4 zone types for most ofX86_64 platform, 
> > they
> > are ZONE_DMA, ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE. If we want to 
> > expand the
> > amount of zone types to larger than 4, the zone shift should be 3.
> 
> But we do not want to expand the number of zones IMHO. The existing zoo
> is quite a maint. pain.
> 
> That being said. I am not saying that I am in love with GFP_ZONE_TABLE.
> It always makes my head explode when I look there but it seems to work
> with the current code and it is optimized for it. If you want to change
> this then you should make sure you describe reasons _why_ this is an
> improvement. And I would argue that "we can have more zones" is a
> relevant one.

Yes, GFP_ZONE_TABLE is too complicated. The patches have 4 advantages as below.

* The address zone modifiers have new operation method, that is, user should 
decide which zone is preferred at first, then give the encoded zone number to 
bottom 3 bits in GFP mask. That is much direct and clear than before.

* No bad zone combination, because user should choose just one address zone 
modifier always.
* Better performance and efficiency, current gfp_zone has to take shifting 
operation twice for GFP_ZONE_TABLE and GFP_ZONE_BAD. With these patches, 
gfp_zone() just needs one XOR.
* Up to 8 zones can be used. At least it isn't a disadvantage, right?

Sincerely,
Huaisheng Ye
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v4 5/6] Btrfs: rename get_chunk_map() and make it non-static

2018-05-25 Thread Nikolay Borisov


On 25.05.2018 00:41, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
> make it non-static.
> 
> Signed-off-by: Omar Sandoval 

Reviewed-by: Nikolay Borisov 

nit: How about introducing proper kernel doc for this function, now that
it becomes public just as good practice so that eventually we will have
proper kernel doc for all public interfaces. You could also mention that
it needs a paired free_extent_map
> ---
>  fs/btrfs/volumes.c | 22 +++---
>  fs/btrfs/volumes.h |  2 ++
>  2 files changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 33b3d329ebb9..6e1a89c6b362 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2789,8 +2789,8 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info 
> *fs_info, u64 chunk_offset)
>   return ret;
>  }
>  
> -static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> - u64 logical, u64 length)
> +struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
> +u64 logical, u64 length)
>  {
>   struct extent_map_tree *em_tree;
>   struct extent_map *em;
> @@ -2827,7 +2827,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
>   int i, ret = 0;
>   struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>  
> - em = get_chunk_map(fs_info, chunk_offset, 1);
> + em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
>   if (IS_ERR(em)) {
>   /*
>* This is a logic error, but we don't want to just rely on the
> @@ -4962,7 +4962,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle 
> *trans,
>   int i = 0;
>   int ret = 0;
>  
> - em = get_chunk_map(fs_info, chunk_offset, chunk_size);
> + em = btrfs_get_chunk_map(fs_info, chunk_offset, chunk_size);
>   if (IS_ERR(em))
>   return PTR_ERR(em);
>  
> @@ -5105,7 +5105,7 @@ int btrfs_chunk_readonly(struct btrfs_fs_info *fs_info, 
> u64 chunk_offset)
>   int miss_ndevs = 0;
>   int i;
>  
> - em = get_chunk_map(fs_info, chunk_offset, 1);
> + em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
>   if (IS_ERR(em))
>   return 1;
>  
> @@ -5165,7 +5165,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 
> logical, u64 len)
>   struct map_lookup *map;
>   int ret;
>  
> - em = get_chunk_map(fs_info, logical, len);
> + em = btrfs_get_chunk_map(fs_info, logical, len);
>   if (IS_ERR(em))
>   /*
>* We could return errors for these cases, but that could get
> @@ -5211,7 +5211,7 @@ unsigned long btrfs_full_stripe_len(struct 
> btrfs_fs_info *fs_info,
>   struct map_lookup *map;
>   unsigned long len = fs_info->sectorsize;
>  
> - em = get_chunk_map(fs_info, logical, len);
> + em = btrfs_get_chunk_map(fs_info, logical, len);
>  
>   if (!WARN_ON(IS_ERR(em))) {
>   map = em->map_lookup;
> @@ -5228,7 +5228,7 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
> *fs_info, u64 logical, u64 len)
>   struct map_lookup *map;
>   int ret = 0;
>  
> - em = get_chunk_map(fs_info, logical, len);
> + em = btrfs_get_chunk_map(fs_info, logical, len);
>  
>   if(!WARN_ON(IS_ERR(em))) {
>   map = em->map_lookup;
> @@ -5387,7 +5387,7 @@ static int __btrfs_map_block_for_discard(struct 
> btrfs_fs_info *fs_info,
>   /* discard always return a bbio */
>   ASSERT(bbio_ret);
>  
> - em = get_chunk_map(fs_info, logical, length);
> + em = btrfs_get_chunk_map(fs_info, logical, length);
>   if (IS_ERR(em))
>   return PTR_ERR(em);
>  
> @@ -5713,7 +5713,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
> *fs_info,
>   return __btrfs_map_block_for_discard(fs_info, logical,
>*length, bbio_ret);
>  
> - em = get_chunk_map(fs_info, logical, *length);
> + em = btrfs_get_chunk_map(fs_info, logical, *length);
>   if (IS_ERR(em))
>   return PTR_ERR(em);
>  
> @@ -6012,7 +6012,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 
> chunk_start,
>   u64 rmap_len;
>   int i, j, nr = 0;
>  
> - em = get_chunk_map(fs_info, chunk_start, 1);
> + em = btrfs_get_chunk_map(fs_info, chunk_start, 1);
>   if (IS_ERR(em))
>   return -EIO;
>  
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 5139ec8daf4c..d3dedfd1324b 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -467,6 +467,8 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info 
> *fs_info,
>  int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
>   struct btrfs_fs_info *fs_info,
>   u64 chunk_offset, u64 chunk_size);

Re: [RFC PATCH v4 3/6] Btrfs: push EXCL_OP set into btrfs_rm_device()

2018-05-25 Thread Nikolay Borisov


On 25.05.2018 00:41, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> btrfs_ioctl_rm_dev() and btrfs_ioctl_rm_dev_v2() both manipulate this
> bit. Let's move it into the common btrfs_rm_device(), which also makes
> the following change to deal with swap files easier.
> 
> Signed-off-by: Omar Sandoval 

Reviewed-by: Nikolay Borisov 
> ---
>  fs/btrfs/ioctl.c   | 13 -
>  fs/btrfs/volumes.c |  4 
>  2 files changed, 4 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 9af3be96099f..82be4a94334b 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3091,18 +3091,12 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, 
> void __user *arg)
>   goto out;
>   }
>  
> - if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags)) {
> - ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
> - goto out;
> - }
> -
>   if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID) {
>   ret = btrfs_rm_device(fs_info, NULL, vol_args->devid);
>   } else {
>   vol_args->name[BTRFS_SUBVOL_NAME_MAX] = '\0';
>   ret = btrfs_rm_device(fs_info, vol_args->name, 0);
>   }
> - clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
>  
>   if (!ret) {
>   if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID)
> @@ -3133,11 +3127,6 @@ static long btrfs_ioctl_rm_dev(struct file *file, void 
> __user *arg)
>   if (ret)
>   return ret;
>  
> - if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags)) {
> - ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
> - goto out_drop_write;
> - }
> -
>   vol_args = memdup_user(arg, sizeof(*vol_args));
>   if (IS_ERR(vol_args)) {
>   ret = PTR_ERR(vol_args);
> @@ -3151,8 +3140,6 @@ static long btrfs_ioctl_rm_dev(struct file *file, void 
> __user *arg)
>   btrfs_info(fs_info, "disk deleted %s", vol_args->name);
>   kfree(vol_args);
>  out:
> - clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
> -out_drop_write:
>   mnt_drop_write_file(file);
>  
>   return ret;
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index b6757b53c297..9cfac177214f 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1951,6 +1951,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, 
> const char *device_path,
>   u64 num_devices;
>   int ret = 0;
>  
> + if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags))
> + return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
> +
>   mutex_lock(_mutex);
>  
>   num_devices = fs_devices->num_devices;
> @@ -2069,6 +2072,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, 
> const char *device_path,
>  
>  out:
>   mutex_unlock(_mutex);
> + clear_bit(BTRFS_FS_EXCL_OP, _info->flags);
>   return ret;
>  
>  error_undo:
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v4 2/6] vfs: update swap_{,de}activate documentation

2018-05-25 Thread Nikolay Borisov


On 25.05.2018 00:41, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> The documentation for these functions is wrong in several ways:
> 
> - swap_activate() is called with the inode locked
> - swap_activate() takes a swap_info_struct * and a sector_t *
> - swap_activate() can also return a positive number of extents it added
>   itself
> - swap_deactivate() does not return anything
> 
> Signed-off-by: Omar Sandoval 

Reviewed-by: Nikolay Borisov 

> ---
>  Documentation/filesystems/Locking | 17 +++--
>  Documentation/filesystems/vfs.txt | 12 
>  2 files changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/filesystems/Locking 
> b/Documentation/filesystems/Locking
> index 75d2d57e2c44..7f009e98fa3c 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -211,8 +211,9 @@ prototypes:
>   int (*launder_page)(struct page *);
>   int (*is_partially_uptodate)(struct page *, unsigned long, unsigned 
> long);
>   int (*error_remove_page)(struct address_space *, struct page *);
> - int (*swap_activate)(struct file *);
> - int (*swap_deactivate)(struct file *);
> + int (*swap_activate)(struct swap_info_struct *, struct file *,
> +  sector_t *);
> + void (*swap_deactivate)(struct file *);
>  
>  locking rules:
>   All except set_page_dirty and freepage may block
> @@ -236,8 +237,8 @@ putback_page: yes
>  launder_page:yes
>  is_partially_uptodate:   yes
>  error_remove_page:   yes
> -swap_activate:   no
> -swap_deactivate: no
> +swap_activate:   yes
> +swap_deactivate: no
>  
>   ->write_begin(), ->write_end() and ->readpage() may be called from
>  the request handler (/dev/loop).
> @@ -334,14 +335,10 @@ cleaned, or an error value if not. Note that in order 
> to prevent the page
>  getting mapped back in and redirtied, it needs to be kept locked
>  across the entire operation.
>  
> - ->swap_activate will be called with a non-zero argument on
> -files backing (non block device backed) swapfiles. A return value
> -of zero indicates success, in which case this file can be used for
> -backing swapspace. The swapspace operations will be proxied to the
> -address space operations.
> + ->swap_activate is called from sys_swapon() with the inode locked.
>  
>   ->swap_deactivate() will be called in the sys_swapoff()
> -path after ->swap_activate() returned success.
> +path after ->swap_activate() returned success. The inode is not locked.
>  
>  --- file_lock_operations --
>  prototypes:
> diff --git a/Documentation/filesystems/vfs.txt 
> b/Documentation/filesystems/vfs.txt
> index 5fd325df59e2..0149109d94d1 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -650,8 +650,9 @@ struct address_space_operations {
>   unsigned long);
>   void (*is_dirty_writeback) (struct page *, bool *, bool *);
>   int (*error_remove_page) (struct mapping *mapping, struct page *page);
> - int (*swap_activate)(struct file *);
> - int (*swap_deactivate)(struct file *);
> + int (*swap_activate)(struct swap_info_struct *, struct file *,
> +  sector_t *);
> + void (*swap_deactivate)(struct file *);
>  };
>  
>writepage: called by the VM to write a dirty page to backing store.
> @@ -828,8 +829,11 @@ struct address_space_operations {
>  
>swap_activate: Called when swapon is used on a file to allocate
>   space if necessary and pin the block lookup information in
> - memory. A return value of zero indicates success,
> - in which case this file can be used to back swapspace.
> + memory. If this returns zero, the swap system will call the address
> + space operations ->readpage() and ->direct_IO(). Alternatively, this
> + may call add_swap_extent() and return the number of extents added, in
> + which case the swap system will use the provided blocks directly
> + instead of going through the filesystem.
>  
>swap_deactivate: Called during swapoff on files where swap_activate
>   was successful.
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v4 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS

2018-05-25 Thread Nikolay Borisov


On 25.05.2018 00:41, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> The SWP_FILE flag serves two purposes: to make swap_{read,write}page()
> go through the filesystem, and to make swapoff() call
> ->swap_deactivate(). For Btrfs, we want the latter but not the former,
> so split this flag into two. This makes us always call
> ->swap_deactivate() if ->swap_activate() succeeded, not just if it
> didn't add any swap extents itself.
> 
> This also resolves the issue of the very misleading name of SWP_FILE,
> which is only used for swap files over NFS.
> 
> Signed-off-by: Omar Sandoval 

Generally looks good:

Reviewed-by: Nikolay Borisov 

just one cleanup suggestion, see below.
> ---
>  include/linux/swap.h | 13 +++--
>  mm/page_io.c |  6 +++---
>  mm/swapfile.c| 13 -
>  3 files changed, 18 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2417d288e016..29dfd436435c 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -167,13 +167,14 @@ enum {
>   SWP_SOLIDSTATE  = (1 << 4), /* blkdev seeks are cheap */
>   SWP_CONTINUED   = (1 << 5), /* swap_map has count continuation */
>   SWP_BLKDEV  = (1 << 6), /* its a block device */
> - SWP_FILE= (1 << 7), /* set after swap_activate success */
> - SWP_AREA_DISCARD = (1 << 8),/* single-time swap area discards */
> - SWP_PAGE_DISCARD = (1 << 9),/* freed swap page-cluster discards */
> - SWP_STABLE_WRITES = (1 << 10),  /* no overwrite PG_writeback pages */
> - SWP_SYNCHRONOUS_IO = (1 << 11), /* synchronous IO is efficient */
> + SWP_ACTIVATED   = (1 << 7), /* set after swap_activate success */
> + SWP_FS  = (1 << 8), /* swap file goes through fs */
> + SWP_AREA_DISCARD = (1 << 9),/* single-time swap area discards */
> + SWP_PAGE_DISCARD = (1 << 10),   /* freed swap page-cluster discards */
> + SWP_STABLE_WRITES = (1 << 11),  /* no overwrite PG_writeback pages */
> + SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */
>   /* add others here before... */
> - SWP_SCANNING= (1 << 12),/* refcount in scan_swap_map */
> + SWP_SCANNING= (1 << 13),/* refcount in scan_swap_map */
>  };
>  
>  #define SWAP_CLUSTER_MAX 32UL
> diff --git a/mm/page_io.c b/mm/page_io.c
> index b41cf9644585..f2d06c1d0cc1 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -283,7 +283,7 @@ int __swap_writepage(struct page *page, struct 
> writeback_control *wbc,
>   struct swap_info_struct *sis = page_swap_info(page);
>  
>   VM_BUG_ON_PAGE(!PageSwapCache(page), page);
> - if (sis->flags & SWP_FILE) {
> + if (sis->flags & SWP_FS) {

Not necessarily for this series but something to keep in mind:

nit: The way the fs case is tucked onto the __swap_writepage is a bit
ugly. How about factoring out the filesystem/blockdev casae into two
functions :

__swap_writepage_fs/__swap_writepage_bdev

then in swap_writepage we just have the if (sis->flags & SWP_FS) check
and dispatch to either write_page_fs or writepage_bdev.


>   struct kiocb kiocb;
>   struct file *swap_file = sis->swap_file;
>   struct address_space *mapping = swap_file->f_mapping;
> @@ -364,7 +364,7 @@ int swap_readpage(struct page *page, bool synchronous)
>   goto out;
>   }
>  
> - if (sis->flags & SWP_FILE) {
> + if (sis->flags & SWP_FS) {
>   struct file *swap_file = sis->swap_file;
>   struct address_space *mapping = swap_file->f_mapping;
>  
> @@ -422,7 +422,7 @@ int swap_set_page_dirty(struct page *page)
>  {
>   struct swap_info_struct *sis = page_swap_info(page);
>  
> - if (sis->flags & SWP_FILE) {
> + if (sis->flags & SWP_FS) {
>   struct address_space *mapping = sis->swap_file->f_mapping;
>  
>   VM_BUG_ON_PAGE(!PageSwapCache(page), page);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index cc2cf04d9018..886c9d89b144 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -973,7 +973,7 @@ int get_swap_pages(int n_goal, bool cluster, swp_entry_t 
> swp_entries[])
>   goto nextsi;
>   }
>   if (cluster) {
> - if (!(si->flags & SWP_FILE))
> + if (!(si->flags & SWP_FS))
>   n_ret = swap_alloc_cluster(si, swp_entries);
>   } else
>   n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> @@ -2327,12 +2327,13 @@ static void destroy_swap_extents(struct 
> swap_info_struct *sis)
>   kfree(se);
>   }
>  
> - if (sis->flags & SWP_FILE) {
> + if (sis->flags & SWP_ACTIVATED) {
>   struct file *swap_file = sis->swap_file;
>   struct address_space *mapping = swap_file->f_mapping;
>  
> - 

Re: [PATCH v4 3/3] btrfs: Do super block verification before writing it to disk

2018-05-25 Thread Qu Wenruo


On 2018年05月25日 14:33, Nikolay Borisov wrote:
> 
> 
> On 25.05.2018 07:43, Qu Wenruo wrote:
>> There are already 2 reports about strangely corrupted super blocks,
>> where csum still matches but extra garbage gets slipped into super block.
>>
>> The corruption would looks like:
>> --
>> superblock: bytenr=65536, device=/dev/sdc1
>> -
>> csum_type   41700 (INVALID)
>> csum0x3b252d3a [match]
>> bytenr  65536
>> flags   0x1
>> ( WRITTEN )
>> magic   _BHRfS_M [match]
>> ...
>> incompat_flags  0x5b224169
>> ( MIXED_BACKREF |
>>   COMPRESS_LZO |
>>   BIG_METADATA |
>>   EXTENDED_IREF |
>>   SKINNY_METADATA |
>>   unknown flag: 0x5b224000 )
>> ...
>> --
>> Or
>> --
>> superblock: bytenr=65536, device=/dev/mapper/x
>> -
>> csum_type  35355 (INVALID)
>> csum_size  32
>> csum   0xf0dbeddd [match]
>> bytenr 65536
>> flags  0x1
>>( WRITTEN )
>> magic  _BHRfS_M [match]
>> ...
>> incompat_flags 0x176d2169
>>( MIXED_BACKREF |
>>  COMPRESS_LZO |
>>  BIG_METADATA |
>>  EXTENDED_IREF |
>>  SKINNY_METADATA |
>>  unknown flag: 0x176d2000 )
>> --
>>
>> Obviously, csum_type and incompat_flags get some garbage, but its csum
>> still matches, which means kernel calculates the csum based on corrupted
>> super block memory.
>> And after manually fixing these values, the filesystem is completely
>> healthy without any problem exposed by btrfs check.
>>
>> Although the cause is still unknown, at least detect it and prevent further
>> corruption.
>>
>> Reported-by: Ken Swenson 
>> Reported-by: Ben Parsons <9parso...@gmail.com>
>> Signed-off-by: Qu Wenruo 
> 
> Reviewed-by: Nikolay Borisov  , however 1 question
> see below
> 
>> ---
>>  fs/btrfs/disk-io.c | 43 +++
>>  1 file changed, 43 insertions(+)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index b981ecc4b6f9..d6c0cee627d9 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2610,6 +2610,41 @@ static int btrfs_validate_mount_super(struct 
>> btrfs_fs_info *fs_info)
>>  return __validate_super(fs_info, fs_info->super_copy, 0);
>>  }
>>  
>> +/*
>> + * Check the validation of super block at write time.
>> + * Some checks like bytenr check will be skipped as their values will be
>> + * overwritten soon.
>> + * Extra checks like csum type and incompact flags will be executed here.
>> + */
> 
> Is this comment correct though, since this function is called right
> before write_dev_supers which is performing the actual write and doesn't
> really modify anything on the superblock ?

It still does extra modification on bytenr and csum.

> Why would they be
> overwritten? Perhaps it's somewhat stale since I see in the old version
> (the one which is in misc-next now) we call btrfs_validate_write_super
> before going into the loop which writes the super on each dev.

Sorry I forgot to mention in this patch (only mentioned in cover letter).
There is one case that we didn't cover before.

For seed sprout case (mount seed device, and then add a new device), the
fs_info->super_copy is still the old seed device one, thus fsid won't
match with newly generated fsid.

So we still need to do the check later on, other than just using super_copy.

Thanks,
Qu

> 
>> +static int btrfs_validate_write_super(struct btrfs_fs_info *fs_info,
>> +  struct btrfs_super_block *sb)
>> +{
>> +int ret;
>> +
>> +ret = __validate_super(fs_info, sb, -1);
>> +if (ret < 0)
>> +goto out;
>> +if (btrfs_super_csum_type(sb) != BTRFS_CSUM_TYPE_CRC32) {
>> +ret = -EUCLEAN;
>> +btrfs_err(fs_info, "invalid csum type, has %u want %u",
>> +  btrfs_super_csum_type(sb), BTRFS_CSUM_TYPE_CRC32);
>> +goto out;
>> +}
>> +if (btrfs_super_incompat_flags(sb) & ~BTRFS_FEATURE_INCOMPAT_SUPP) {
>> +ret = -EUCLEAN;
>> +btrfs_err(fs_info,
>> +"invalid incompact flags, has 0x%llu valid mask 0x%llu",
>> +  btrfs_super_incompat_flags(sb),
>> +  BTRFS_FEATURE_INCOMPAT_SUPP);
>> +goto out;
>> +}
>> +out:
>> +if (ret < 0)
>> +btrfs_err(fs_info,
>> +"super block corruption detected before writing it to disk");
>> +

Re: [PATCH v4 3/3] btrfs: Do super block verification before writing it to disk

2018-05-25 Thread Nikolay Borisov


On 25.05.2018 07:43, Qu Wenruo wrote:
> There are already 2 reports about strangely corrupted super blocks,
> where csum still matches but extra garbage gets slipped into super block.
> 
> The corruption would looks like:
> --
> superblock: bytenr=65536, device=/dev/sdc1
> -
> csum_type   41700 (INVALID)
> csum0x3b252d3a [match]
> bytenr  65536
> flags   0x1
> ( WRITTEN )
> magic   _BHRfS_M [match]
> ...
> incompat_flags  0x5b224169
> ( MIXED_BACKREF |
>   COMPRESS_LZO |
>   BIG_METADATA |
>   EXTENDED_IREF |
>   SKINNY_METADATA |
>   unknown flag: 0x5b224000 )
> ...
> --
> Or
> --
> superblock: bytenr=65536, device=/dev/mapper/x
> -
> csum_type  35355 (INVALID)
> csum_size  32
> csum   0xf0dbeddd [match]
> bytenr 65536
> flags  0x1
>( WRITTEN )
> magic  _BHRfS_M [match]
> ...
> incompat_flags 0x176d2169
>( MIXED_BACKREF |
>  COMPRESS_LZO |
>  BIG_METADATA |
>  EXTENDED_IREF |
>  SKINNY_METADATA |
>  unknown flag: 0x176d2000 )
> --
> 
> Obviously, csum_type and incompat_flags get some garbage, but its csum
> still matches, which means kernel calculates the csum based on corrupted
> super block memory.
> And after manually fixing these values, the filesystem is completely
> healthy without any problem exposed by btrfs check.
> 
> Although the cause is still unknown, at least detect it and prevent further
> corruption.
> 
> Reported-by: Ken Swenson 
> Reported-by: Ben Parsons <9parso...@gmail.com>
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov  , however 1 question
see below

> ---
>  fs/btrfs/disk-io.c | 43 +++
>  1 file changed, 43 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b981ecc4b6f9..d6c0cee627d9 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2610,6 +2610,41 @@ static int btrfs_validate_mount_super(struct 
> btrfs_fs_info *fs_info)
>   return __validate_super(fs_info, fs_info->super_copy, 0);
>  }
>  
> +/*
> + * Check the validation of super block at write time.
> + * Some checks like bytenr check will be skipped as their values will be
> + * overwritten soon.
> + * Extra checks like csum type and incompact flags will be executed here.
> + */

Is this comment correct though, since this function is called right
before write_dev_supers which is performing the actual write and doesn't
really modify anything on the superblock ? Why would they be
overwritten? Perhaps it's somewhat stale since I see in the old version
(the one which is in misc-next now) we call btrfs_validate_write_super
before going into the loop which writes the super on each dev.

> +static int btrfs_validate_write_super(struct btrfs_fs_info *fs_info,
> +   struct btrfs_super_block *sb)
> +{
> + int ret;
> +
> + ret = __validate_super(fs_info, sb, -1);
> + if (ret < 0)
> + goto out;
> + if (btrfs_super_csum_type(sb) != BTRFS_CSUM_TYPE_CRC32) {
> + ret = -EUCLEAN;
> + btrfs_err(fs_info, "invalid csum type, has %u want %u",
> +   btrfs_super_csum_type(sb), BTRFS_CSUM_TYPE_CRC32);
> + goto out;
> + }
> + if (btrfs_super_incompat_flags(sb) & ~BTRFS_FEATURE_INCOMPAT_SUPP) {
> + ret = -EUCLEAN;
> + btrfs_err(fs_info,
> + "invalid incompact flags, has 0x%llu valid mask 0x%llu",
> +   btrfs_super_incompat_flags(sb),
> +   BTRFS_FEATURE_INCOMPAT_SUPP);
> + goto out;
> + }
> +out:
> + if (ret < 0)
> + btrfs_err(fs_info,
> + "super block corruption detected before writing it to disk");
> + return ret;
> +}
> +
>  int open_ctree(struct super_block *sb,
>  struct btrfs_fs_devices *fs_devices,
>  char *options)
> @@ -3770,6 +3805,14 @@ int write_all_supers(struct btrfs_fs_info *fs_info, 
> int max_mirrors)
>   flags = btrfs_super_flags(sb);
>   btrfs_set_super_flags(sb, flags | BTRFS_HEADER_FLAG_WRITTEN);
>  
> + ret = btrfs_validate_write_super(fs_info, sb);
> + if (ret < 0) {
> + mutex_unlock(_info->fs_devices->device_list_mutex);
> + btrfs_handle_fs_error(fs_info,