Re: [PATCH v3 1/6] btrfs-progs: image: Use correct device size when restoring

2018-10-11 Thread Qu Wenruo



On 2018/10/11 8:07 PM, Nikolay Borisov wrote:
> 
> 
> On  8.10.2018 15:30, Qu Wenruo wrote:
>> When restoring a btrfs image, the total_bytes of the device item is not
>> updated correctly. In fact total_bytes can be left 0 for the restored
>> image.
>>
>> This doesn't trigger any error because btrfs check never verifies the
>> total_bytes of the dev item.
>> However this is going to change.
>>
>> Fix it by populating total_bytes of the device item with the end
>> position of the last dev extent, to make later btrfs check happy.
> 
> 'Setting it to the end position' really means "setting the total device
> size to the allocated space on the device". Is it clearer if stated the
> second way?

Not exactly.

Don't forget that we have 0~1M reserved, and it's possible to have dev
extent holes.
So it's not "allocated space" but really "the end position of the last
dev extent".

Thanks,
Qu

> 
> In any case:
> 
> Reviewed-by: Nikolay Borisov 
> 
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  image/main.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 45 insertions(+), 3 deletions(-)
>>
>> diff --git a/image/main.c b/image/main.c
>> index 351c5a256938..d5b89bc3149f 100644
>> --- a/image/main.c
>> +++ b/image/main.c
>> @@ -2082,15 +2082,17 @@ static void remap_overlapping_chunks(struct mdrestore_struct *mdres)
>>  }
>>  
>>  static int fixup_devices(struct btrfs_fs_info *fs_info,
>> - struct mdrestore_struct *mdres, off_t dev_size)
>> + struct mdrestore_struct *mdres, int out_fd)
>>  {
>>  struct btrfs_trans_handle *trans;
>>  struct btrfs_dev_item *dev_item;
>> +struct btrfs_dev_extent *dev_ext;
>>  struct btrfs_path path;
>>  struct extent_buffer *leaf;
>>  struct btrfs_root *root = fs_info->chunk_root;
>>  struct btrfs_key key;
>>  u64 devid, cur_devid;
>> +u64 dev_size; /* Get from last dev extents */
>>  int ret;
>>  
>>  trans = btrfs_start_transaction(fs_info->tree_root, 1);
>> @@ -2101,16 +2103,56 @@ static int fixup_devices(struct btrfs_fs_info *fs_info,
>>  
>> dev_item = &fs_info->super_copy->dev_item;
>>  
>> +btrfs_init_path(&path);
>>  devid = btrfs_stack_device_id(dev_item);
>>  
>> +key.objectid = devid;
>> +key.type = BTRFS_DEV_EXTENT_KEY;
>> +key.offset = (u64)-1;
>> +
>> +ret = btrfs_search_slot(NULL, fs_info->dev_root, &key, &path, 0, 0);
>> +if (ret < 0) {
>> +error("failed to locate last dev extent of devid %llu: %s",
>> +devid, strerror(-ret));
>> +btrfs_release_path(&path);
>> +return ret;
>> +}
>> +if (ret == 0) {
> 
> nit: I'd prefer if this is an else if since it emphasizes the fact that
> the if/elseif construct works on a single value.
> 
>> +error("found invalid dev extent devid %llu offset -1",
>> +devid);
>> +btrfs_release_path(&path);
>> +return -EUCLEAN;
>> +}
>> +ret = btrfs_previous_item(fs_info->dev_root, &path, devid,
>> +  BTRFS_DEV_EXTENT_KEY);
>> +if (ret > 0)
>> +ret = -ENOENT;
>> +if (ret < 0) {
> 
> ditto
> 
>> +error("failed to locate last dev extent of devid %llu: %s",
>> +devid, strerror(-ret));
>> +btrfs_release_path(&path);
>> +return ret;
>> +}
>> +
>> +btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
>> +dev_ext = btrfs_item_ptr(path.nodes[0], path.slots[0],
>> + struct btrfs_dev_extent);
>> +dev_size = key.offset + btrfs_dev_extent_length(path.nodes[0], dev_ext);
>> +btrfs_release_path(&path);
>> +
>>  btrfs_set_stack_device_total_bytes(dev_item, dev_size);
>>  btrfs_set_stack_device_bytes_used(dev_item, mdres->alloced_chunks);
>> +/* Don't forget to enlarge the real file */
>> +ret = ftruncate64(out_fd, dev_size);
>> +if (ret < 0) {
>> +error("failed to enlarge result image: %s", strerror(errno));
>> +return -errno;
>> +}
>>  
>>  key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
>>  key.type = BTRFS_DEV_ITEM_KEY;
>>  key.offset = 0;
>>  
>> -btrfs_init_path(&path);
>>  
>>  again:
>>  ret = btrfs_search_slot(trans, root, &key, &path, -1, 1);
>> @@ -2275,7 +2317,7 @@ static int restore_metadump(const char *input, FILE *out, int old_restore,
>>  return 1;
>>  }
>>  
>> -ret = fixup_devices(info, &mdrestore, st.st_size);
>> +ret = fixup_devices(info, &mdrestore, fileno(out));
>>  close_ctree(info->chunk_root);
>>  if (ret)
>>  goto out;
>>


[PATCH] fstests: btrfs use forget if not reload

2018-10-11 Thread Anand Jain
btrfs reload was introduced to cleanup the device list inside the btrfs
kernel module.

The problem with the reload approach is that you can't run btrfs test
cases 124, 125, 154 and 164 on a system with btrfs as the root fs.

Now that we are introducing the btrfs forget feature as a btrfs device
scan option [1], here are the corresponding changes in the fstests.

These changes ensure we use the forget feature where available, and
fall back to the reload approach where it is not.

[1]
btrfs-progs: add cli to forget one or all scanned devices
btrfs: introduce feature to forget a btrfs device

Signed-off-by: Anand Jain 
---
 common/btrfs    | 20 ++++++++++++++++++++
 tests/btrfs/124 |  6 +++---
 tests/btrfs/125 |  6 +++---
 tests/btrfs/154 |  6 +++---
 tests/btrfs/164 |  4 ++--
 5 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/common/btrfs b/common/btrfs
index 26dc0bb9600f..c7fbec11c8c1 100644
--- a/common/btrfs
+++ b/common/btrfs
@@ -374,3 +374,23 @@ _scratch_btrfs_sectorsize()
$BTRFS_UTIL_PROG inspect-internal dump-super $SCRATCH_DEV |\
grep sectorsize | awk '{print $2}'
 }
+
+_btrfs_supports_forget()
+{
+   $BTRFS_UTIL_PROG device scan --help | grep -wq forget && \
+   $BTRFS_UTIL_PROG device scan --forget > /dev/null 2>&1
+}
+
+_require_btrfs_forget_if_not_fs_loadable()
+{
+   _btrfs_supports_forget && return
+
+   _require_loadable_fs_module "btrfs"
+}
+
+_btrfs_forget_if_not_fs_reload()
+{
+   _btrfs_supports_forget && return
+
+   _reload_fs_module "btrfs"
+}
diff --git a/tests/btrfs/124 b/tests/btrfs/124
index ce3ad6aa3a58..edbeff1443f5 100755
--- a/tests/btrfs/124
+++ b/tests/btrfs/124
@@ -51,7 +51,7 @@ _supported_fs btrfs
 _supported_os Linux
 _require_scratch_dev_pool 2
 _test_unmount
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 
 _scratch_dev_pool_get 2
 
@@ -86,7 +86,7 @@ echo "clean btrfs ko" >> $seqres.full
 _scratch_unmount
 
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 
 echo >> $seqres.full
 echo "-Write degraded mount fill upto $max_fs_sz bytes-" >> $seqres.full
@@ -125,7 +125,7 @@ echo
 echo "Mount degraded with the other dev"
 _scratch_unmount
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 _mount -o degraded $dev2 $SCRATCH_MNT >>$seqres.full 2>&1
 _run_btrfs_util_prog filesystem show
 checkpoint3=`md5sum $SCRATCH_MNT/tf2`
diff --git a/tests/btrfs/125 b/tests/btrfs/125
index e38de264b28e..9161a2ddeaad 100755
--- a/tests/btrfs/125
+++ b/tests/btrfs/125
@@ -50,7 +50,7 @@ _supported_fs btrfs
 _supported_os Linux
 _require_scratch_dev_pool 3
 _test_unmount
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 
 _scratch_dev_pool_get 3
 
@@ -102,7 +102,7 @@ echo "unmount" >> $seqres.full
 _scratch_unmount
 echo "clean btrfs ko" >> $seqres.full
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 _mount -o degraded,device=$dev2 $dev1 $SCRATCH_MNT >>$seqres.full 2>&1
 dd if=/dev/zero of="$SCRATCH_MNT"/tf2 bs=$bs count=$count \
>>$seqres.full 2>&1
@@ -138,7 +138,7 @@ echo "Mount degraded but with other dev"
 
 _scratch_unmount
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 
 _mount -o degraded,device=${dev2} $dev3 $SCRATCH_MNT >>$seqres.full 2>&1
 
diff --git a/tests/btrfs/154 b/tests/btrfs/154
index 99ea232aba4c..01745d20e9fc 100755
--- a/tests/btrfs/154
+++ b/tests/btrfs/154
@@ -36,7 +36,7 @@ rm -f $seqres.full
 _supported_fs btrfs
 _supported_os Linux
 _require_scratch_dev_pool 2
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 
 _scratch_dev_pool_get 2
 
@@ -90,7 +90,7 @@ degrade_mount_write()
 
echo "clean btrfs ko" >> $seqres.full
# un-scan the btrfs devices
-   _reload_fs_module "btrfs"
+   _btrfs_forget_if_not_fs_reload
_mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1
cnt=$(( $COUNT/10 ))
dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \
@@ -142,7 +142,7 @@ verify()
echo "unmount" >> $seqres.full
 
_scratch_unmount
-   _reload_fs_module "btrfs"
+   _btrfs_forget_if_not_fs_reload
_mount -o degraded $DEV2 $SCRATCH_MNT >>$seqres.full 2>&1
verify_checkpoint1=`md5sum $SCRATCH_MNT/tf1`
verify_checkpoint2=`md5sum $SCRATCH_MNT/tf2`
diff --git a/tests/btrfs/164 b/tests/btrfs/164
index 097191a0e493..55042c4035e0 100755
--- a/tests/btrfs/164
+++ b/tests/btrfs/164
@@ -36,7 +36,7 @@ rm -f $seqres.full
 # Modify as appropriate.
 _supported_fs btrfs
 _supported_os Linux
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 _require_scratch_dev_pool 2
 
 _scratch_dev_pool_get 2
@@ -69,7 +69,7 @@ delete_seed()
 {
_run_btrfs_util_prog device delete $dev_seed $SCRATCH_MNT

[PATCH v9] Add cli and ioctl to forget scanned device(s)

2018-10-11 Thread Anand Jain
v9:
 Make forget a 'btrfs device scan' option.
 Use forget in the fstests; now you can run fstests with btrfs as the
  rootfs, which helps to exercise the uuid_mutex lock.

v8:
 Change log update in the kernel patch.

v7:
 Use struct btrfs_ioctl_vol_args (instead of struct
  btrfs_ioctl_vol_args_v2) as it is in line with the other
  btrfs-control ioctls.
 The CLI usage/features remain the same. However, internally the ioctl
  flag is not required to delete all the unmounted devices. Instead,
  leave btrfs_ioctl_vol_args::name NULL.

v6:
 Use the changed fn name btrfs_free_stale_devices().

 Change in title:
 Old v5:
 Cover-letter:
  [PATCH v5] Add cli and ioctl to ignore a scanned device
 Kernel:
  [PATCH v5] btrfs: introduce feature to ignore a btrfs device
 Progs:
  [PATCH v5] btrfs-progs: add 'btrfs device ignore' cli

v5:
  Adds feature to delete all stale devices
  Reuses btrfs_free_stale_devices() fn and so depends on the
patch-set [1] in the ML.
  Uses struct btrfs_ioctl_vol_args_v2 instead of
struct btrfs_ioctl_vol_args as arg
  Does the device path matching instead of btrfs_device matching
(we won't delete the mounted device as btrfs_free_stale_devices()
checks for it)
v4:
  No change, but resent as the ML thread may be confusing.
v3:
  No change. Send to correct ML.
v2:
  Incorporates review feedback from Nikolay; details are in the specific patch.
  Patch 1/2 is renamed from
[PATCH 1/2] btrfs: refactor btrfs_free_stale_device() to get device list 
delete
  to
[PATCH 1/2] btrfs: add function to device list delete

Adds cli and ioctl to forget a scanned device or forget all stale
devices in the kernel.


Anand Jain (1):
  btrfs: introduce feature to forget a btrfs device

 fs/btrfs/super.c   | 3 +++
 fs/btrfs/volumes.c | 9 +++++++++
 fs/btrfs/volumes.h | 1 +
 include/uapi/linux/btrfs.h | 2 ++
 4 files changed, 15 insertions(+)

Anand Jain (1):
  btrfs-progs: add cli to forget one or all scanned devices

 cmds-device.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
 ioctl.h   |  2 ++
 2 files changed, 56 insertions(+), 9 deletions(-)

[1]
Anand Jain (1):
  fstests: btrfs use forget if not reload

 common/btrfs    | 20 ++++++++++++++++++++
 tests/btrfs/124 |  6 +++---
 tests/btrfs/125 |  6 +++---
 tests/btrfs/154 |  6 +++---
 tests/btrfs/164 |  4 ++--
 5 files changed, 31 insertions(+), 11 deletions(-)

-- 
1.8.3.1



[PATCH] btrfs: introduce feature to forget a btrfs device

2018-10-11 Thread Anand Jain
Support for a new command 'btrfs device forget [dev]' is proposed here
to undo the effects of 'btrfs device scan [dev]'. For this purpose
this patch uses ioctl #5, as it was unused:
IOW(BTRFS_IOCTL_MAGIC, 5, ..)
This patch adds the new ioctl BTRFS_IOC_FORGET_DEV, which can be sent
through /dev/btrfs-control to forget one or all devices (devices which
are not mounted) from the btrfs kernel module.

The argument it takes is struct btrfs_ioctl_vol_args, whose ::name can
be set to specify the device path. If no device path is provided, all
unmounted devices are removed from the kernel.

Again, a device is removed only if the relevant fsid isn't mounted.

This new cli provides:
 . Release of unwanted btrfs_fs_devices and btrfs_device memory if the
   device is not going to be mounted.
 . The ability to mount a device in degraded mode when one of the other
   devices is corrupted, as in a split-brain raid1.
 . Running test cases which require a btrfs.ko reload when the rootfs
   is btrfs.

Signed-off-by: Anand Jain 
---
 fs/btrfs/super.c   | 3 +++
 fs/btrfs/volumes.c | 9 +++++++++
 fs/btrfs/volumes.h | 1 +
 include/uapi/linux/btrfs.h | 2 ++
 4 files changed, 15 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 6601c9aa5e35..f14610ef1c66 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2244,6 +2244,9 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
ret = PTR_ERR_OR_ZERO(device);
	mutex_unlock(&uuid_mutex);
break;
+   case BTRFS_IOC_FORGET_DEV:
+   ret = btrfs_forget_devices(vol->name);
+   break;
case BTRFS_IOC_DEVICES_READY:
	mutex_lock(&uuid_mutex);
device = btrfs_scan_one_device(vol->name, FMODE_READ,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f4405e430da6..e2ab203d9c1d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1208,6 +1208,15 @@ static int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr,
return 0;
 }
 
+int btrfs_forget_devices(const char *path)
+{
+   mutex_lock(&uuid_mutex);
+   btrfs_free_stale_devices(strlen(path) ? path : NULL, NULL);
+   mutex_unlock(&uuid_mutex);
+
+   return 0;
+}
+
 /*
 * Look for a btrfs signature on a device. This may be called out of the mount path
  * and we are not allowed to call set_blocksize during the scan. The superblock
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 23e9285d88de..394079d5d08d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -406,6 +406,7 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
   fmode_t flags, void *holder);
 struct btrfs_device *btrfs_scan_one_device(const char *path,
   fmode_t flags, void *holder);
+int btrfs_forget_devices(const char *path);
 int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
 void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step);
 void btrfs_assign_next_active_device(struct btrfs_device *device,
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 5ca1d21fc4a7..b1be7f828cb4 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -836,6 +836,8 @@ enum btrfs_err_code {
   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \
   struct btrfs_ioctl_vol_args)
+#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \
+  struct btrfs_ioctl_vol_args)
 /* trans start and trans end are dangerous, and only for
  * use by applications that know how to avoid the
  * resulting deadlocks
-- 
1.8.3.1



[PATCH] btrfs-progs: add cli to forget one or all scanned devices

2018-10-11 Thread Anand Jain
This patch adds the cli
  btrfs device forget [dev]
to remove the given device structure in the kernel if the device
is unmounted. If no argument is given, it removes all stale devices
(devices which are not mounted) from the kernel.

Signed-off-by: Anand Jain 
---
 cmds-device.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
 ioctl.h   |  2 ++
 2 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/cmds-device.c b/cmds-device.c
index 2a05f70a76a9..ecc391ea01d8 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -254,10 +254,32 @@ static int cmd_device_delete(int argc, char **argv)
return _cmd_device_remove(argc, argv, cmd_device_delete_usage);
 }
 
+static int btrfs_forget_devices(char *path)
+{
+   struct btrfs_ioctl_vol_args args;
+   int ret;
+   int fd;
+
+   fd = open("/dev/btrfs-control", O_RDWR);
+   if (fd < 0)
+   return -errno;
+
+   memset(&args, 0, sizeof(args));
+   if (path)
+   strncpy_null(args.name, path);
+   ret = ioctl(fd, BTRFS_IOC_FORGET_DEV, &args);
+   if (ret)
+   ret = -errno;
+   close(fd);
+   return ret;
+}
+
 static const char * const cmd_device_scan_usage[] = {
-   "btrfs device scan [(-d|--all-devices)|<device> [<device>...]]",
-   "Scan devices for a btrfs filesystem",
+   "btrfs device scan [(-d|--all-devices)|-u|--forget|<device> "\
+   "[<device>...]]",
+   "Scan or forget (deregister) devices for a btrfs filesystem",
" -d|--all-devices (deprecated)",
+   " -u|--forget [<device> ..]",
NULL
 };
 
@@ -267,32 +289,40 @@ static int cmd_device_scan(int argc, char **argv)
int devstart;
int all = 0;
int ret = 0;
+   int forget = 0;
 
optind = 0;
while (1) {
int c;
static const struct option long_options[] = {
{ "all-devices", no_argument, NULL, 'd'},
+   { "forget", no_argument, NULL, 'u'},
{ NULL, 0, NULL, 0}
};
 
-   c = getopt_long(argc, argv, "d", long_options, NULL);
+   c = getopt_long(argc, argv, "du", long_options, NULL);
if (c < 0)
break;
switch (c) {
case 'd':
all = 1;
break;
+   case 'u':
+   forget = 1;
+   break;
default:
usage(cmd_device_scan_usage);
}
}
devstart = optind;
 
+   if (all && forget)
+   usage(cmd_device_scan_usage);
+
if (all && check_argc_max(argc - optind, 1))
usage(cmd_device_scan_usage);
 
-   if (all || argc - optind == 0) {
+   if (!forget && (all || argc - optind == 0)) {
printf("Scanning for Btrfs filesystems\n");
ret = btrfs_scan_devices();
error_on(ret, "error %d while scanning", ret);
@@ -301,6 +331,13 @@ static int cmd_device_scan(int argc, char **argv)
goto out;
}
 
+   if (forget && (all || argc - optind == 0)) {
+   ret = btrfs_forget_devices(NULL);
+   if (ret)
+   error("Can't forget: %s", strerror(-ret));
+   goto out;
+   }
+
for( i = devstart ; i < argc ; i++ ){
char *path;
 
@@ -315,11 +352,19 @@ static int cmd_device_scan(int argc, char **argv)
ret = 1;
goto out;
}
-   printf("Scanning for Btrfs filesystems in '%s'\n", path);
-   if (btrfs_register_one_device(path) != 0) {
-   ret = 1;
-   free(path);
-   goto out;
+   if (forget) {
+   ret = btrfs_forget_devices(path);
+   if (ret)
+   error("Can't forget '%s': %s",
+   path, strerror(-ret));
+   } else {
+   printf("Scanning for Btrfs filesystems in '%s'\n",
+   path);
+   if (btrfs_register_one_device(path) != 0) {
+   ret = 1;
+   free(path);
+   goto out;
+   }
}
free(path);
}
diff --git a/ioctl.h b/ioctl.h
index 709e996f401c..e27d80e09392 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -721,6 +721,8 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code)
   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \
   struct btrfs_ioctl_vol_args)
+#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \
+  struct btrfs_ioctl_vol_args)

Re: errors reported by btrfs-check

2018-10-11 Thread Qu Wenruo


On 2018/10/12 4:30 AM, Jürgen Herrmann wrote:
> Hi!
> 
> I just did a btrfs check on my laptop's btrfs filesystem while I was
> booted from the USB-stick rescue system.
> 
> The following errors were reported:
> root@mint:/home/mint# btrfs check /dev/mapper/sda3crypt
> Checking filesystem on /dev/mapper/sda3crypt
> UUID: a914c141-72bf-448b-847f-d64ee82d8b7b
> checking extents
> checking free space cache
> checking fs roots
> root 258 inode 3082368 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 258 inode 3082370 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 258 inode 3082371 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 258 inode 3082373 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 258 inode 3082414 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 258 inode 3082415 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 258 inode 3082421 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4096
> root 1387 inode 3082368 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1387 inode 3082370 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1387 inode 3082371 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1387 inode 3082372 errors 100, file extent discount
> Found file extent holes:
> start: 8192, len: 4096
> start: 16384, len: 4096
> start: 24576, len: 4096
> start: 32768, len: 4096
> start: 40960, len: 4096
> start: 49152, len: 20480
> start: 73728, len: 4096
> start: 81920, len: 4096
> start: 90112, len: 8192
> root 1387 inode 3082373 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1387 inode 3082374 errors 100, file extent discount
> Found file extent holes:
> start: 8192, len: 4096
> start: 16384, len: 20480
> start: 40960, len: 12288
> start: 57344, len: 4096
> start: 65536, len: 8192
> root 1387 inode 3082380 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 233472
> root 1387 inode 3082386 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4096
> root 1387 inode 3082398 errors 100, file extent discount
> Found file extent holes:
> start: 20480, len: 16384
> root 1387 inode 3082414 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1387 inode 3082415 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1387 inode 3082421 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4096
> root 1391 inode 3082368 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1391 inode 3082370 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1391 inode 3082371 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1391 inode 3082373 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1391 inode 3082386 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4096
> root 1391 inode 3082414 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1391 inode 3082415 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1391 inode 3082421 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4096
> root 1394 inode 3082368 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1394 inode 3082370 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1394 inode 3082371 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1394 inode 3082373 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1394 inode 3082386 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4096
> root 1394 inode 3082414 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1394 inode 3082415 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 8192
> root 1394 inode 3082421 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4096
> ERROR: errors found in fs roots
> found 469458231296 bytes used, error(s) found
> total csum bytes: 451180560
> total tree bytes: 4558831616
> total fs tree bytes: 3802955776
> total extent tree bytes: 245055488
> btree space waste bytes: 842802897
> file data blocks allocated: 

Re: errors reported by btrfs-check

2018-10-11 Thread Jürgen Herrmann

4.17.1 compiled from source.

Best regards,
Jürgen

On 12 October 2018 at 00:36:26, Chris Murphy  wrote:


What version of btrfs-progs?



Sent with AquaMail for Android
https://www.mobisystems.com/aqua-mail




[PATCH] btrfs-progs: fix compile warning when using gcc8 to compile btrfs-progs

2018-10-11 Thread Su Yanjun
When gcc 8 compiles utils.c, it complains as below:

utils.c:852:45: warning: '%s' directive output may be truncated writing
up to 4095 bytes into a region of size 4084 [-Wformat-truncation=]
   snprintf(path, sizeof(path), "/dev/mapper/%s", name);
 ^~   
In file included from /usr/include/stdio.h:873,
 from utils.c:20:
/usr/include/bits/stdio2.h:67:10: note: '__builtin___snprintf_chk'
output between 13 and 4108 bytes into a destination of size 4096
   return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1,
  ^~~~
__bos (__s), __fmt, __va_arg_pack ());
~

This isn't a type of warning we care about here, since both the source
and the destination buffer are bounded by PATH_MAX.

Use the GCC option -Wno-format-truncation to disable this warning.

Signed-off-by: Su Yanjun 
---
 configure.ac | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/configure.ac b/configure.ac
index df02f20655d9..c626beca8b77 100644
--- a/configure.ac
+++ b/configure.ac
@@ -12,7 +12,7 @@ LIBBTRFS_MAJOR=0
 LIBBTRFS_MINOR=1
 LIBBTRFS_PATCHLEVEL=2
 
-CFLAGS=${CFLAGS:-"-g -O1 -Wall -D_FORTIFY_SOURCE=2"}
+CFLAGS=${CFLAGS:-"-g -O1 -Wall -D_FORTIFY_SOURCE=2 -Wno-format-truncation"}
 AC_SUBST([CFLAGS])
 
 AC_PREREQ([2.60])
-- 
2.19.1





Re: errors reported by btrfs-check

2018-10-11 Thread Chris Murphy
What version of btrfs-progs?


Re: [PATCH 32/42] btrfs: only free reserved extent if we didn't insert it

2018-10-11 Thread Filipe Manana
On Thu, Oct 11, 2018 at 8:58 PM Josef Bacik  wrote:
>
> When we insert the file extent once the ordered extent completes we free
> the reserved extent reservation as it'll have been migrated to the
> bytes_used counter.  However if we error out after this step we'll still
> clear the reserved extent reservation, resulting in a negative
> accounting of the reserved bytes for the block group and space info.
> Fix this by only doing the free if we didn't successfully insert a file
> extent for this extent.
>
> Signed-off-by: Josef Bacik 
> Reviewed-by: Omar Sandoval 
Reviewed-by: Filipe Manana 

> ---
>  fs/btrfs/inode.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 5a91055a13b2..2b257d14bd3d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2992,6 +2992,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
> bool truncated = false;
> bool range_locked = false;
> bool clear_new_delalloc_bytes = false;
> +   bool clear_reserved_extent = true;
>
> if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
> !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) &&
> @@ -3095,10 +3096,12 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
> logical_len, logical_len,
> compress_type, 0, 0,
> BTRFS_FILE_EXTENT_REG);
> -   if (!ret)
> +   if (!ret) {
> +   clear_reserved_extent = false;
> btrfs_release_delalloc_bytes(fs_info,
>  ordered_extent->start,
>  ordered_extent->disk_len);
> +   }
> }
> unpin_extent_cache(&BTRFS_I(inode)->extent_tree,
>ordered_extent->file_offset, ordered_extent->len,
> @@ -3159,8 +3162,13 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
>  * wrong we need to return the space for this ordered extent
>  * back to the allocator.  We only free the extent in the
>  * truncated case if we didn't write out the extent at all.
> +*
> +* If we made it past insert_reserved_file_extent before we
> +* errored out then we don't need to do this as the accounting
> +* has already been done.
>  */
> if ((ret || !logical_len) &&
> +   clear_reserved_extent &&
> !test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
> !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags))
> btrfs_free_reserved_extent(fs_info,
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 42/42] btrfs: don't run delayed_iputs in commit

2018-10-11 Thread Filipe Manana
On Thu, Oct 11, 2018 at 8:58 PM Josef Bacik  wrote:
>
> This could result in a really bad case where we do something like
>
> evict
>   evict_refill_and_join
> btrfs_commit_transaction
>   btrfs_run_delayed_iputs
> evict
>   evict_refill_and_join
> btrfs_commit_transaction
> ... forever
>
> We have plenty of other places where we run delayed iputs that are much
> safer, let those do the work.
>
> Signed-off-by: Josef Bacik 
Reviewed-by: Filipe Manana 

Great catch!

> ---
>  fs/btrfs/transaction.c | 9 -
>  1 file changed, 9 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 9168efaca37e..c91dc36fccae 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -2265,15 +2265,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>
> kmem_cache_free(btrfs_trans_handle_cachep, trans);
>
> -   /*
> -* If fs has been frozen, we can not handle delayed iputs, otherwise
> -* it'll result in deadlock about SB_FREEZE_FS.
> -*/
> -   if (current != fs_info->transaction_kthread &&
> -   current != fs_info->cleaner_kthread &&
> -   !test_bit(BTRFS_FS_FROZEN, &fs_info->flags))
> -   btrfs_run_delayed_iputs(fs_info);
> -
> return ret;
>
>  scrub_continue:
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups

2018-10-11 Thread Hans van Kranenburg
On 10/11/2018 09:40 PM, Hans van Kranenburg wrote:
> On 10/11/2018 05:13 PM, David Sterba wrote:
>> On Thu, Oct 04, 2018 at 11:24:37PM +0200, Hans van Kranenburg wrote:
>>> This patch set contains an additional fix for a newly exposed bug after
>>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>>
>>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
>>>
>>> The DUP fix is "fix more DUP stripe size handling". I did that one
>>> before starting to change more things so it can be applied to earlier
>>> LTS kernels.
>>>
>>> Besides that patch, which is fixing the bug in a way that is least
>>> intrusive, I added a bunch of other patches to help getting the chunk
>>> allocator code in a state that is a bit less error-prone and
>>> bug-attracting.
>>>
>>> When running this and trying the reproduction scenario, I can now see
>>> that the created DUP device extent is 827326464 bytes long, which is
>>> good.
>>>
>>> I wrote and tested this on top of linus 4.19-rc5. I still need to create
>>> a list of related use cases and test more things to at least walk
>>> through a bunch of obvious use cases to see if there's nothing exploding
>>> too quickly with these changes. However, I'm quite confident about it,
>>> so I'm sharing all of it already.
>>>
>>> Any feedback and review is appreciated. Be gentle and keep in mind that
>>> I'm still very much in a learning stage regarding kernel development.
>>
>> The patches look good, thanks. Problem is explained, preparatory work is
>> separated from the fix itself.
> 
> \o/
> 
>>> The stable patches handling workflow is not 100% clear to me yet. I
>>> guess I have to add a Fixes: in the DUP patch which points to the
>>> previous commit 92e222df7b.
>>
>> Almost nobody does it right, no worries. If you can identify a single
>> patch that introduces a bug then it's for Fixes:, otherwise a CC: stable
>> with version where it makes sense & applies is enough. I do that check
>> myself regardless of what's in the patch.
> 
> It's 92e222df7b and the thing I'm not sure about is if it also will
> catch the same patch which was already backported to LTS kernels since
> 92e222df7b also has Fixes in it... So by now the new bug is in 4.19,
> 4.14, 4.9, 4.4, 3.16...
> 
>> I ran the patches in a VM and hit a division-by-zero in test
>> fstests/btrfs/011, stacktrace below. First guess is that it's caused by
>> patch 3/6.
> 
> Ah interesting, dev replace.
> 
> I'll play around with replace and find out how to run the tests properly
> and then reproduce this.
> 
> The code introduced in patch 3 is removed again in patch 6, so I don't
> suspect that one. :)

Actually, while writing this I realize that this means it should be
tested separately (i.e. an older kernel with only patch 3 applied),
because, well, obvious I guess.

> But, I'll find out.
> 
> Thanks for testing.
> 
> Hans
> 
>> [ 3116.065595] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 
>> 1 transid 5 /dev/vdb
>> [ 3116.071274] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 
>> 2 transid 5 /dev/vdc
>> [ 3116.087086] BTRFS info (device vdb): disk space caching is enabled
>> [ 3116.088644] BTRFS info (device vdb): has skinny extents
>> [ 3116.089796] BTRFS info (device vdb): flagging fs with big metadata feature
>> [ 3116.093971] BTRFS info (device vdb): checking UUID tree
>> [ 3125.853755] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) 
>> to /dev/vdd started
>> [ 3125.860269] divide error:  [#1] PREEMPT SMP
>> [ 3125.861264] CPU: 1 PID: 6477 Comm: btrfs Not tainted 4.19.0-rc7-default+ 
>> #288
>> [ 3125.862841] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.0.0-prebuilt.qemu-project.org 04/01/2014
>> [ 3125.865385] RIP: 0010:__btrfs_alloc_chunk+0x368/0xa70 [btrfs]
>> [ 3125.870541] RSP: 0018:a4ea0409fa48 EFLAGS: 00010206
>> [ 3125.871862] RAX: 0400 RBX: 94e074374508 RCX: 
>> 0002
>> [ 3125.873587] RDX:  RSI: 94e017818c80 RDI: 
>> 0200
>> [ 3125.874677] RBP: 8080 R08:  R09: 
>> 0002
>> [ 3125.875816] R10: 0003 R11: 8090 R12: 
>> 
>> [ 3125.876742] R13: 0001 R14: 0001 R15: 
>> 0002
>> [ 3125.877657] FS:  7f6de34208c0() GS:94e07d60() 
>> knlGS:
>> [ 3125.878862] CS:  0010 DS:  ES:  CR0: 80050033
>> [ 3125.880080] CR2: 7ffe963d5ce8 CR3: 7659b000 CR4: 
>> 06e0
>> [ 3125.881485] Call Trace:
>> [ 3125.882105]  do_chunk_alloc+0x266/0x3e0 [btrfs]
>> [ 3125.882841]  btrfs_inc_block_group_ro+0x10e/0x160 [btrfs]
>> [ 3125.883875]  scrub_enumerate_chunks+0x18b/0x5d0 [btrfs]
>> [ 3125.884658]  ? is_module_address+0x11/0x30
>> [ 3125.885271]  ? wait_for_completion+0x160/0x190
>> [ 3125.885928]  btrfs_scrub_dev+0x1b8/0x5a0 [btrfs]
>> [ 

errors reported by btrfs-check

2018-10-11 Thread Jürgen Herrmann

Hi!

I just did a btrfs check on my laptop's btrfs filesystem while I was
booted into the USB stick rescue system.

The following errors were reported:
root@mint:/home/mint# btrfs check /dev/mapper/sda3crypt
Checking filesystem on /dev/mapper/sda3crypt
UUID: a914c141-72bf-448b-847f-d64ee82d8b7b
checking extents
checking free space cache
checking fs roots
root 258 inode 3082368 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 258 inode 3082370 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 258 inode 3082371 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 258 inode 3082373 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 258 inode 3082414 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 258 inode 3082415 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 258 inode 3082421 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4096
root 1387 inode 3082368 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1387 inode 3082370 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1387 inode 3082371 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1387 inode 3082372 errors 100, file extent discount
Found file extent holes:
start: 8192, len: 4096
start: 16384, len: 4096
start: 24576, len: 4096
start: 32768, len: 4096
start: 40960, len: 4096
start: 49152, len: 20480
start: 73728, len: 4096
start: 81920, len: 4096
start: 90112, len: 8192
root 1387 inode 3082373 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1387 inode 3082374 errors 100, file extent discount
Found file extent holes:
start: 8192, len: 4096
start: 16384, len: 20480
start: 40960, len: 12288
start: 57344, len: 4096
start: 65536, len: 8192
root 1387 inode 3082380 errors 100, file extent discount
Found file extent holes:
start: 0, len: 233472
root 1387 inode 3082386 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4096
root 1387 inode 3082398 errors 100, file extent discount
Found file extent holes:
start: 20480, len: 16384
root 1387 inode 3082414 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1387 inode 3082415 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1387 inode 3082421 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4096
root 1391 inode 3082368 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1391 inode 3082370 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1391 inode 3082371 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1391 inode 3082373 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1391 inode 3082386 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4096
root 1391 inode 3082414 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1391 inode 3082415 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1391 inode 3082421 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4096
root 1394 inode 3082368 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1394 inode 3082370 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1394 inode 3082371 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1394 inode 3082373 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1394 inode 3082386 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4096
root 1394 inode 3082414 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1394 inode 3082415 errors 100, file extent discount
Found file extent holes:
start: 0, len: 8192
root 1394 inode 3082421 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4096
ERROR: errors found in fs roots
found 469458231296 bytes used, error(s) found
total csum bytes: 451180560
total tree bytes: 4558831616
total fs tree bytes: 3802955776
total extent tree bytes: 245055488
btree space waste bytes: 842802897
file data blocks allocated: 9656815640576
 referenced 929225080832

Scrub completes ok though.

I'm prepared to wipe the fs if needed, more than one backup is ready :)


[PATCH 28/42] btrfs: handle delayed ref head accounting cleanup in abort

2018-10-11 Thread Josef Bacik
We weren't doing any of the accounting cleanup when we aborted
transactions.  Fix this by making cleanup_ref_head_accounting global and
calling it from the abort code.  This fixes the issue where our
accounting was all wrong after the fs aborts.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h   |  5 +
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/extent-tree.c | 13 ++---
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 29db902511c1..e40356ca0295 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -35,6 +35,7 @@
 struct btrfs_trans_handle;
 struct btrfs_transaction;
 struct btrfs_pending_snapshot;
+struct btrfs_delayed_ref_root;
 extern struct kmem_cache *btrfs_trans_handle_cachep;
 extern struct kmem_cache *btrfs_bit_radix_cachep;
 extern struct kmem_cache *btrfs_path_cachep;
@@ -2623,6 +2624,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle 
*trans,
   unsigned long count);
 int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info,
 unsigned long count, u64 transid, int wait);
+void
+btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
+ struct btrfs_delayed_ref_root *delayed_refs,
+ struct btrfs_delayed_ref_head *head);
 int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 
len);
 int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info, u64 bytenr,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index fe1f229320ef..54fbdc944a3f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4240,6 +4240,7 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
if (pin_bytes)
btrfs_pin_extent(fs_info, head->bytenr,
 head->num_bytes, 1);
+   btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
btrfs_put_delayed_ref_head(head);
cond_resched();
spin_lock(&delayed_refs->lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 609f49c69c8d..3ca42f4cd462 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2475,12 +2475,11 @@ static int run_and_cleanup_extent_op(struct 
btrfs_trans_handle *trans,
return ret ? ret : 1;
 }
 
-static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
-   struct btrfs_delayed_ref_head *head)
+void
+btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
+ struct btrfs_delayed_ref_root *delayed_refs,
+ struct btrfs_delayed_ref_head *head)
 {
-   struct btrfs_fs_info *fs_info = trans->fs_info;
-   struct btrfs_delayed_ref_root *delayed_refs =
-   &trans->transaction->delayed_refs;
int nr_items = 1;
 
if (head->total_ref_mod < 0) {
@@ -2558,7 +2557,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
}
}
 
-   cleanup_ref_head_accounting(trans, head);
+   btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
 
trace_run_delayed_ref_head(fs_info, head, 0);
btrfs_delayed_ref_unlock(head);
@@ -7227,7 +7226,7 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (head->must_insert_reserved)
ret = 1;
 
-   cleanup_ref_head_accounting(trans, head);
+   btrfs_cleanup_ref_head_accounting(trans->fs_info, delayed_refs, head);
mutex_unlock(&head->mutex);
btrfs_put_delayed_ref_head(head);
return ret;
-- 
2.14.3



[PATCH 40/42] btrfs: drop min_size from evict_refill_and_join

2018-10-11 Thread Josef Bacik
We don't need it; rsv->size is set once and never changes throughout
its lifetime, so just use that for the reserve size.

Reviewed-by: David Sterba 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ab8242b10601..dbcca915e681 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5339,8 +5339,7 @@ static void evict_inode_truncate_pages(struct inode 
*inode)
 }
 
 static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root,
-   struct btrfs_block_rsv *rsv,
-   u64 min_size)
+   struct btrfs_block_rsv *rsv)
 {
struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
@@ -5350,7 +5349,7 @@ static struct btrfs_trans_handle 
*evict_refill_and_join(struct btrfs_root *root,
struct btrfs_trans_handle *trans;
int ret;
 
-   ret = btrfs_block_rsv_refill(root, rsv, min_size,
+   ret = btrfs_block_rsv_refill(root, rsv, rsv->size,
 BTRFS_RESERVE_FLUSH_LIMIT);
 
if (ret && ++failures > 2) {
@@ -5368,7 +5367,7 @@ static struct btrfs_trans_handle 
*evict_refill_and_join(struct btrfs_root *root,
 * it.
 */
if (!btrfs_check_space_for_delayed_refs(fs_info) &&
-   !btrfs_block_rsv_migrate(global_rsv, rsv, min_size, 0))
+   !btrfs_block_rsv_migrate(global_rsv, rsv, rsv->size, 0))
return trans;
 
/* If not, commit and try again. */
@@ -5384,7 +5383,6 @@ void btrfs_evict_inode(struct inode *inode)
struct btrfs_trans_handle *trans;
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_block_rsv *rsv;
-   u64 min_size;
int ret;
 
trace_btrfs_inode_evict(inode);
@@ -5394,8 +5392,6 @@ void btrfs_evict_inode(struct inode *inode)
return;
}
 
-   min_size = btrfs_calc_trunc_metadata_size(fs_info, 1);
-
evict_inode_truncate_pages(inode);
 
if (inode->i_nlink &&
@@ -5428,13 +5424,13 @@ void btrfs_evict_inode(struct inode *inode)
rsv = btrfs_alloc_block_rsv(fs_info, BTRFS_BLOCK_RSV_TEMP);
if (!rsv)
goto no_delete;
-   rsv->size = min_size;
+   rsv->size = btrfs_calc_trunc_metadata_size(fs_info, 1);
rsv->failfast = 1;
 
btrfs_i_size_write(BTRFS_I(inode), 0);
 
while (1) {
-   trans = evict_refill_and_join(root, rsv, min_size);
+   trans = evict_refill_and_join(root, rsv);
if (IS_ERR(trans))
goto free_rsv;
 
@@ -5459,7 +5455,7 @@ void btrfs_evict_inode(struct inode *inode)
 * If it turns out that we are dropping too many of these, we might want
 * to add a mechanism for retrying these after a commit.
 */
-   trans = evict_refill_and_join(root, rsv, min_size);
+   trans = evict_refill_and_join(root, rsv);
if (!IS_ERR(trans)) {
trans->block_rsv = rsv;
btrfs_orphan_del(trans, BTRFS_I(inode));
-- 
2.14.3



[PATCH 39/42] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue

2018-10-11 Thread Josef Bacik
The throttle path doesn't take cleaner_delayed_iput_mutex, which means
we could think we're done flushing iputs in the data space reservation
path when we could have a throttler doing an iput.  There's no real
reason to serialize the delayed iput flushing, so instead of taking the
cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
replace it with an atomic counter and a waitqueue.  This removes the
short (or long depending on how big the inode is) window where we think
there are no more pending iputs when there really are some.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h   |  4 +++-
 fs/btrfs/disk-io.c |  5 ++---
 fs/btrfs/extent-tree.c |  9 +
 fs/btrfs/inode.c   | 21 +
 4 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e40356ca0295..1ef0b1649cad 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -894,7 +894,8 @@ struct btrfs_fs_info {
 
spinlock_t delayed_iput_lock;
struct list_head delayed_iputs;
-   struct mutex cleaner_delayed_iput_mutex;
+   atomic_t nr_delayed_iputs;
+   wait_queue_head_t delayed_iputs_wait;
 
/* this protects tree_mod_seq_list */
spinlock_t tree_mod_seq_lock;
@@ -3212,6 +3213,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
 void btrfs_add_delayed_iput(struct inode *inode);
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
+int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
 int btrfs_prealloc_file_range(struct inode *inode, int mode,
  u64 start, u64 num_bytes, u64 min_size,
  loff_t actual_len, u64 *alloc_hint);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 51b2a5bf25e5..3dce9ff72e41 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1692,9 +1692,7 @@ static int cleaner_kthread(void *arg)
goto sleep;
}
 
-   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
 btrfs_run_delayed_iputs(fs_info);
-   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
 
again = btrfs_clean_one_deleted_snapshot(root);
mutex_unlock(&fs_info->cleaner_mutex);
@@ -2677,7 +2675,6 @@ int open_ctree(struct super_block *sb,
 mutex_init(&fs_info->delete_unused_bgs_mutex);
 mutex_init(&fs_info->reloc_mutex);
 mutex_init(&fs_info->delalloc_root_mutex);
-   mutex_init(&fs_info->cleaner_delayed_iput_mutex);
 seqlock_init(&fs_info->profiles_lock);
 
 INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
@@ -2699,6 +2696,7 @@ int open_ctree(struct super_block *sb,
atomic_set(&fs_info->defrag_running, 0);
 atomic_set(&fs_info->qgroup_op_seq, 0);
 atomic_set(&fs_info->reada_works_cnt, 0);
+   atomic_set(&fs_info->nr_delayed_iputs, 0);
 atomic64_set(&fs_info->tree_mod_seq, 0);
fs_info->sb = sb;
fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
@@ -2776,6 +2774,7 @@ int open_ctree(struct super_block *sb,
init_waitqueue_head(&fs_info->transaction_wait);
 init_waitqueue_head(&fs_info->transaction_blocked_wait);
 init_waitqueue_head(&fs_info->async_submit_wait);
+   init_waitqueue_head(&fs_info->delayed_iputs_wait);
 
 INIT_LIST_HEAD(&fs_info->pinned_chunks);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index be18b40d2d48..882b55b79497 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4258,8 +4258,9 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode 
*inode, u64 bytes)
 * operations. Wait for it to finish so that
 * more space is released.
 */
-   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
-   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+   ret = btrfs_wait_on_delayed_iputs(fs_info);
+   if (ret)
+   return ret;
goto again;
} else {
btrfs_end_transaction(trans);
@@ -4829,9 +4830,9 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 * pinned space, so make sure we run the iputs before we do our pinned
 * bytes check below.
 */
-   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
 btrfs_run_delayed_iputs(fs_info);
-   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+   wait_event(fs_info->delayed_iputs_wait,
+  atomic_read(&fs_info->nr_delayed_iputs) == 0);
 
trans = btrfs_join_transaction(fs_info->extent_root);
if (IS_ERR(trans))
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0a1671fb03bf..ab8242b10601 100644
--- a/fs/btrfs/inode.c
+++ 

[PATCH 26/42] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock

2018-10-11 Thread Josef Bacik
We have this open coded in btrfs_destroy_delayed_refs, use the helper
instead.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 39bd158466cd..121ab180a78a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4214,16 +4214,9 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
 
head = rb_entry(node, struct btrfs_delayed_ref_head,
href_node);
-   if (!mutex_trylock(&head->mutex)) {
-   refcount_inc(&head->refs);
-   spin_unlock(&delayed_refs->lock);
-
-   mutex_lock(&head->mutex);
-   mutex_unlock(&head->mutex);
-   btrfs_put_delayed_ref_head(head);
-   spin_lock(&delayed_refs->lock);
+   if (btrfs_delayed_ref_lock(delayed_refs, head))
continue;
-   }
+
spin_lock(&head->lock);
while ((n = rb_first(&head->ref_tree)) != NULL) {
ref = rb_entry(n, struct btrfs_delayed_ref_node,
-- 
2.14.3



[PATCH 38/42] btrfs: be more explicit about allowed flush states

2018-10-11 Thread Josef Bacik
For FLUSH_LIMIT flushers we really can only allocate chunks and flush
delayed inode items, everything else is problematic.  I added a bunch of
new states and it led to weirdness in the FLUSH_LIMIT case because I
forgot about how it worked.  So instead explicitly declare the states
that are ok for flushing with FLUSH_LIMIT and use that for our state
machine.  Then as we add new things that are safe we can just add them
to this list.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4b74d8a97f7c..be18b40d2d48 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5110,12 +5110,18 @@ void btrfs_init_async_reclaim_work(struct work_struct 
*work)
INIT_WORK(work, btrfs_async_reclaim_metadata_space);
 }
 
+static const enum btrfs_flush_state priority_flush_states[] = {
+   FLUSH_DELAYED_ITEMS_NR,
+   FLUSH_DELAYED_ITEMS,
+   ALLOC_CHUNK,
+};
+
 static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
struct btrfs_space_info *space_info,
struct reserve_ticket *ticket)
 {
u64 to_reclaim;
-   int flush_state = FLUSH_DELAYED_ITEMS_NR;
+   int flush_state = 0;
 
spin_lock(&space_info->lock);
to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info,
@@ -5127,7 +5133,8 @@ static void priority_reclaim_metadata_space(struct 
btrfs_fs_info *fs_info,
spin_unlock(&space_info->lock);
 
do {
-   flush_space(fs_info, space_info, to_reclaim, flush_state);
+   flush_space(fs_info, space_info, to_reclaim,
+   priority_flush_states[flush_state]);
flush_state++;
spin_lock(&space_info->lock);
if (ticket->bytes == 0) {
@@ -5135,15 +5142,7 @@ static void priority_reclaim_metadata_space(struct 
btrfs_fs_info *fs_info,
return;
}
spin_unlock(&space_info->lock);
-
-   /*
-* Priority flushers can't wait on delalloc without
-* deadlocking.
-*/
-   if (flush_state == FLUSH_DELALLOC ||
-   flush_state == FLUSH_DELALLOC_WAIT)
-   flush_state = ALLOC_CHUNK;
-   } while (flush_state < COMMIT_TRANS);
+   } while (flush_state < ARRAY_SIZE(priority_flush_states));
 }
 
 static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
-- 
2.14.3



[PATCH 30/42] btrfs: just delete pending bgs if we are aborted

2018-10-11 Thread Josef Bacik
We still need to do all of the accounting cleanup for pending block
groups if we abort.  So set ret to trans->aborted so that, if we
aborted, the cleanup still happens and everybody is happy.

Reviewed-by: Omar Sandoval 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6fa7d19bcbb1..7e66a6351aad 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10363,9 +10363,15 @@ void btrfs_create_pending_block_groups(struct 
btrfs_trans_handle *trans)
struct btrfs_root *extent_root = fs_info->extent_root;
struct btrfs_block_group_item item;
struct btrfs_key key;
-   int ret = 0;
+   int ret;
bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
 
+   /*
+* If we aborted the transaction with pending bg's we need to just
+* cleanup the list and carry on.
+*/
+   ret = trans->aborted;
+
trans->can_flush_pending_bgs = false;
while (!list_empty(&trans->new_bgs)) {
 block_group = list_first_entry(&trans->new_bgs,
-- 
2.14.3



[PATCH 36/42] btrfs: wait on caching when putting the bg cache

2018-10-11 Thread Josef Bacik
While testing my backport I noticed there was a panic if I ran
generic/416 generic/417 generic/418 all in a row.  This just happened to
uncover a race where we had outstanding IO after we destroy all of our
workqueues, and then we'd go to queue the endio work on those freed
workqueues.  This is because we aren't waiting for the caching threads
to be done before freeing everything up, so to fix this make sure we
wait on any outstanding caching that's being done before we free up the
block group, so we're sure to be done with all IO by the time we get to
btrfs_stop_all_workers().  This fixes the panic I was seeing
consistently in testing.

[ cut here ]
kernel BUG at fs/btrfs/volumes.c:6112!
SMP PTI
Modules linked in:
CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 
4.16.0-02155-g3553e54a578d-dirty #875
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 
04/01/2014
Workqueue: btrfs-cache btrfs_cache_helper
RIP: 0010:btrfs_map_bio+0x346/0x370
RSP: :c900061e79d0 EFLAGS: 00010202
RAX:  RBX: 880071542e00 RCX: 00533000
RDX: 88006bb74380 RSI: 0008 RDI: 88007816
RBP: 0001 R08: 8800781cd200 R09: 00503000
R10: 88006cd21200 R11:  R12: 
R13:  R14: 8800781cd200 R15: 880071542e00
FS:  () GS:88007fd0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0817ffc4 CR3: 78314000 CR4: 06e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 btree_submit_bio_hook+0x8a/0xd0
 submit_one_bio+0x5d/0x80
 read_extent_buffer_pages+0x18a/0x320
 btree_read_extent_buffer_pages+0xbc/0x200
 ? alloc_extent_buffer+0x359/0x3e0
 read_tree_block+0x3d/0x60
 read_block_for_search.isra.30+0x1a5/0x360
 btrfs_search_slot+0x41b/0xa10
 btrfs_next_old_leaf+0x212/0x470
 caching_thread+0x323/0x490
 normal_work_helper+0xc5/0x310
 process_one_work+0x141/0x340
 worker_thread+0x44/0x3c0
 kthread+0xf8/0x130
 ? process_one_work+0x340/0x340
 ? kthread_bind+0x10/0x10
 ret_from_fork+0x35/0x40
Code: ff ff 48 8b 4c 24 28 48 89 de 48 8b 7c 24 08 e8 d1 e5 04 00 89 c3 e9 08 
ff ff ff 4d 89 c6 49 89 df e9 27 fe ff ff e8 5a 3a bb ff <0f> 0b 0f 0b e9 57 ff 
ff ff 48 8b 7c 24 08 4c 89 f9 4c 89 ea 48
RIP: btrfs_map_bio+0x346/0x370 RSP: c900061e79d0
---[ end trace 827eb13e50846033 ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
---
 fs/btrfs/extent-tree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6174d1b7875b..4b74d8a97f7c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9894,6 +9894,7 @@ void btrfs_put_block_group_cache(struct btrfs_fs_info 
*info)
 
block_group = btrfs_lookup_first_block_group(info, last);
while (block_group) {
+   wait_block_group_cache_done(block_group);
spin_lock(&block_group->lock);
if (block_group->iref)
break;
-- 
2.14.3



[PATCH 34/42] btrfs: wait on ordered extents on abort cleanup

2018-10-11 Thread Josef Bacik
If we flip read-only before we initiate writeback on all dirty pages for
ordered extents we've created then we'll have ordered extents left over
on umount, which results in all sorts of bad things happening.  Fix this
by making sure we wait on ordered extents if we have to do the aborted
transaction cleanup stuff.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 54fbdc944a3f..51b2a5bf25e5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4188,6 +4188,14 @@ static void btrfs_destroy_all_ordered_extents(struct 
btrfs_fs_info *fs_info)
spin_lock(&fs_info->ordered_root_lock);
 }
 spin_unlock(&fs_info->ordered_root_lock);
+
+   /*
+* We need this here because if we've been flipped read-only we won't
+* get sync() from the umount, so we need to make sure any ordered
+* extents that haven't had their dirty pages IO start writeout yet
+* actually get run and error out properly.
+*/
+   btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 }
 
 static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
-- 
2.14.3



[PATCH 33/42] btrfs: fix insert_reserved error handling

2018-10-11 Thread Josef Bacik
We were not handling the reserved byte accounting properly for data
references.  Metadata was fine: if it errored out, the error paths would
free the bytes_reserved count and pin the extent, though even that missed
one of the error cases.  So instead move this handling up into
run_one_delayed_ref so we are sure that both cases are properly cleaned
up in case of a transaction abort.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7e66a6351aad..6174d1b7875b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2405,6 +2405,9 @@ static int run_one_delayed_ref(struct btrfs_trans_handle 
*trans,
   insert_reserved);
else
BUG();
+   if (ret && insert_reserved)
+   btrfs_pin_extent(trans->fs_info, node->bytenr,
+node->num_bytes, 1);
return ret;
 }
 
@@ -8257,21 +8260,14 @@ static int alloc_reserved_tree_block(struct 
btrfs_trans_handle *trans,
}
 
path = btrfs_alloc_path();
-   if (!path) {
-   btrfs_free_and_pin_reserved_extent(fs_info,
-  extent_key.objectid,
-  fs_info->nodesize);
+   if (!path)
return -ENOMEM;
-   }
 
path->leave_spinning = 1;
ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
  &extent_key, size);
if (ret) {
btrfs_free_path(path);
-   btrfs_free_and_pin_reserved_extent(fs_info,
-  extent_key.objectid,
-  fs_info->nodesize);
return ret;
}
 
-- 
2.14.3



[PATCH 42/42] btrfs: don't run delayed_iputs in commit

2018-10-11 Thread Josef Bacik
This could result in a really bad case where we do something like

evict
  evict_refill_and_join
btrfs_commit_transaction
  btrfs_run_delayed_iputs
evict
  evict_refill_and_join
btrfs_commit_transaction
... forever

We have plenty of other places where we run delayed iputs that are much
safer, let those do the work.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/transaction.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 9168efaca37e..c91dc36fccae 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2265,15 +2265,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
 
kmem_cache_free(btrfs_trans_handle_cachep, trans);
 
-   /*
-* If fs has been frozen, we can not handle delayed iputs, otherwise
-* it'll result in deadlock about SB_FREEZE_FS.
-*/
-   if (current != fs_info->transaction_kthread &&
-   current != fs_info->cleaner_kthread &&
-   !test_bit(BTRFS_FS_FROZEN, &fs_info->flags))
-   btrfs_run_delayed_iputs(fs_info);
-
return ret;
 
 scrub_continue:
-- 
2.14.3



[PATCH 35/42] MAINTAINERS: update my email address for btrfs

2018-10-11 Thread Josef Bacik
My work email is completely useless, switch it to my personal address so
I get emails on an account I actually pay attention to.

Signed-off-by: Josef Bacik 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 32fbc6f732d4..7723dc958e99 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3095,7 +3095,7 @@ F:drivers/gpio/gpio-bt8xx.c
 
 BTRFS FILE SYSTEM
 M: Chris Mason 
-M: Josef Bacik 
+M: Josef Bacik 
 M: David Sterba 
 L: linux-btrfs@vger.kernel.org
 W: http://btrfs.wiki.kernel.org/
-- 
2.14.3



[PATCH 37/42] btrfs: wakeup cleaner thread when adding delayed iput

2018-10-11 Thread Josef Bacik
The cleaner thread usually takes care of delayed iputs, with the
exception of the btrfs_end_transaction_throttle path.  The cleaner
thread only gets woken up every 30 seconds, so instead wake it up to do
its work so that we can free up that space as quickly as possible.

Reviewed-by: Filipe Manana 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2b257d14bd3d..0a1671fb03bf 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3323,6 +3323,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
ASSERT(list_empty(&binode->delayed_iput));
 list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
 spin_unlock(&fs_info->delayed_iput_lock);
+   wake_up_process(fs_info->cleaner_kthread);
 }
 
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
-- 
2.14.3



[PATCH 41/42] btrfs: reserve extra space during evict()

2018-10-11 Thread Josef Bacik
We could generate a lot of delayed refs in evict but never have any left
over space from our block rsv to make up for that fact.  So reserve some
extra space and give it to the transaction so it can be used to refill
the delayed refs rsv every loop through the truncate path.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index dbcca915e681..9f7da5e3c741 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5343,13 +5343,15 @@ static struct btrfs_trans_handle 
*evict_refill_and_join(struct btrfs_root *root,
 {
struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+   u64 delayed_refs_extra = btrfs_calc_trans_metadata_size(fs_info, 1);
int failures = 0;
 
for (;;) {
struct btrfs_trans_handle *trans;
int ret;
 
-   ret = btrfs_block_rsv_refill(root, rsv, rsv->size,
+   ret = btrfs_block_rsv_refill(root, rsv,
+rsv->size + delayed_refs_extra,
 BTRFS_RESERVE_FLUSH_LIMIT);
 
if (ret && ++failures > 2) {
@@ -5358,9 +5360,28 @@ static struct btrfs_trans_handle 
*evict_refill_and_join(struct btrfs_root *root,
return ERR_PTR(-ENOSPC);
}
 
+   /*
+* Evict can generate a large amount of delayed refs without
+* having a way to add space back since we exhaust our temporary
+* block rsv.  We aren't allowed to do FLUSH_ALL in this case
+* because we could deadlock with so many things in the flushing
+* code, so we have to try and hold some extra space to
+* compensate for our delayed ref generation.  If we can't get
+* that space then we need to see if we can steal our minimum from
+* the global reserve.  We will be ratelimited by the amount of
+* space we have for the delayed refs rsv, so we'll end up
+* committing and trying again.
+*/
trans = btrfs_join_transaction(root);
-   if (IS_ERR(trans) || !ret)
+   if (IS_ERR(trans) || !ret) {
+   if (!IS_ERR(trans)) {
+   trans->block_rsv = &fs_info->trans_block_rsv;
+   trans->bytes_reserved = delayed_refs_extra;
+   btrfs_block_rsv_migrate(rsv, trans->block_rsv,
+   delayed_refs_extra, 1);
+   }
return trans;
+   }
 
/*
 * Try to steal from the global reserve if there is space for
-- 
2.14.3



[PATCH 22/42] btrfs: only run delayed refs if we're committing

2018-10-11 Thread Josef Bacik
I noticed in a giant dbench run that we spent a lot of time on lock
contention while running transaction commit.  This is because dbench
results in a lot of fsync()'s that do a btrfs_commit_transaction(), and
they all run the delayed refs first thing, so they all contend with
each other.  This leads to seconds of 0 throughput.  Change this to only
run the delayed refs if we're the ones committing the transaction.  This
makes the latency go away and we get no more lock contention.

Reviewed-by: Omar Sandoval 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/transaction.c | 24 +---
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index a0f19ca0bd6c..39a2bddb0b29 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1925,15 +1925,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
btrfs_trans_release_metadata(trans);
trans->block_rsv = NULL;
 
-   /* make a pass through all the delayed refs we have so far
-* any runnings procs may add more while we are here
-*/
-   ret = btrfs_run_delayed_refs(trans, 0);
-   if (ret) {
-   btrfs_end_transaction(trans);
-   return ret;
-   }
-
cur_trans = trans->transaction;
 
/*
@@ -1946,12 +1937,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
	if (!list_empty(&trans->new_bgs))
btrfs_create_pending_block_groups(trans);
 
-   ret = btrfs_run_delayed_refs(trans, 0);
-   if (ret) {
-   btrfs_end_transaction(trans);
-   return ret;
-   }
-
	if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) {
int run_it = 0;
 
@@ -2022,6 +2007,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
	spin_unlock(&fs_info->trans_lock);
}
 
+   /*
+* We are now the only one in the commit area, we can run delayed refs
+* without hitting a bunch of lock contention from a lot of people
+* trying to commit the transaction at once.
+*/
+   ret = btrfs_run_delayed_refs(trans, 0);
+   if (ret)
+   goto cleanup_transaction;
+
extwriter_counter_dec(cur_trans, trans->type);
 
ret = btrfs_start_delalloc_flush(fs_info);
-- 
2.14.3



[PATCH 25/42] btrfs: pass delayed_refs_root to btrfs_delayed_ref_lock

2018-10-11 Thread Josef Bacik
We don't need the trans except to get the delayed_refs_root, so just
pass the delayed_refs_root into btrfs_delayed_ref_lock and call it a
day.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/delayed-ref.c | 5 +
 fs/btrfs/delayed-ref.h | 2 +-
 fs/btrfs/extent-tree.c | 2 +-
 3 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 96ce087747b2..87778645bf4a 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -197,12 +197,9 @@ find_ref_head(struct rb_root *root, u64 bytenr,
return NULL;
 }
 
-int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
+int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs,
   struct btrfs_delayed_ref_head *head)
 {
-   struct btrfs_delayed_ref_root *delayed_refs;
-
-   delayed_refs = &trans->transaction->delayed_refs;
	lockdep_assert_held(&delayed_refs->lock);
	if (mutex_trylock(&head->mutex))
return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 7769177b489e..ee636d7a710a 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -255,7 +255,7 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle 
*trans,
 struct btrfs_delayed_ref_head *
 btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
u64 bytenr);
-int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
+int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs,
   struct btrfs_delayed_ref_head *head);
 static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head 
*head)
 {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ec8ede2b1dec..609f49c69c8d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2600,7 +2600,7 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
 
/* grab the lock that says we are going to process
 * all the refs for this head */
-   ret = btrfs_delayed_ref_lock(trans, locked_ref);
+   ret = btrfs_delayed_ref_lock(delayed_refs, locked_ref);
	spin_unlock(&delayed_refs->lock);
/*
 * we may have dropped the spin lock to get the head
-- 
2.14.3



[PATCH 20/42] btrfs: don't use ctl->free_space for max_extent_size

2018-10-11 Thread Josef Bacik
From: Josef Bacik 

max_extent_size is supposed to be the largest contiguous range for the
space info, and ctl->free_space is the total free space in the block
group.  We need to keep track of these separately and _only_ use the
max_free_space if we don't have a max_extent_size, as that means our
original request was too large for any of the block groups we searched
and therefore no max_extent_size was ever set.
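A minimal sketch of the fallback rule (a hypothetical helper, not a kernel function):

```c
/*
 * Sketch of the fallback described above: report the largest
 * contiguous extent seen, and only fall back to the largest total
 * free space when no contiguous candidate was ever recorded.
 */
typedef unsigned long long u64;

u64 pick_max_extent_size(u64 max_extent_size, u64 max_free_space)
{
	/* Only meaningful when the search failed with -ENOSPC. */
	if (!max_extent_size)
		return max_free_space;
	return max_extent_size;
}
```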

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6e7bc3197737..4f48d047a1ec 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7496,6 +7496,7 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
struct btrfs_block_group_cache *block_group = NULL;
u64 search_start = 0;
u64 max_extent_size = 0;
+   u64 max_free_space = 0;
u64 empty_cluster = 0;
struct btrfs_space_info *space_info;
int loop = 0;
@@ -7791,8 +7792,8 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
	spin_lock(&ctl->tree_lock);
if (ctl->free_space <
num_bytes + empty_cluster + empty_size) {
-   if (ctl->free_space > max_extent_size)
-   max_extent_size = ctl->free_space;
+   max_free_space = max(max_free_space,
+ctl->free_space);
	spin_unlock(&ctl->tree_lock);
goto loop;
}
@@ -7959,6 +7960,8 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
}
 out:
if (ret == -ENOSPC) {
+   if (!max_extent_size)
+   max_extent_size = max_free_space;
	spin_lock(&space_info->lock);
	space_info->max_extent_size = max_extent_size;
	spin_unlock(&space_info->lock);
-- 
2.14.3



[PATCH 16/42] btrfs: loop in inode_rsv_refill

2018-10-11 Thread Josef Bacik
With severe fragmentation we can end up with our inode rsv size being
huge during writeout, which would cause us to need to make very large
metadata reservations.  However we may not actually need that much once
writeout is complete.  So instead try to make our reservation, and if we
can't make it, re-calculate our new reservation size and try again.  If
our reservation size doesn't change between tries then we know we are
actually out of space and can error out.
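The retry loop can be modeled roughly like this (toy types and a fake allocator, purely illustrative):

```c
/*
 * Toy model of the retry loop: each failed attempt re-reads the
 * (possibly smaller) reservation target, and we only give up when a
 * retry would ask for the same amount that just failed.
 */
typedef unsigned long long u64;

struct toy_rsv { u64 size; u64 reserved; };

/* Fake reserve_metadata_bytes(): succeed once the request fits an
 * arbitrary 4K budget. */
static int toy_try_reserve(u64 bytes)
{
	return bytes <= 4096 ? 0 : -1;
}

int toy_refill(struct toy_rsv *rsv)
{
	u64 num_bytes, last = 0;
	int ret;

	do {
		num_bytes = rsv->size - rsv->reserved;
		if (num_bytes == last)	/* target didn't change: really out of space */
			return -1;
		ret = toy_try_reserve(num_bytes);
		if (ret) {
			last = num_bytes;
			rsv->size /= 2;	/* pretend flushing shrank our need */
		}
	} while (ret);
	rsv->reserved += num_bytes;
	return 0;
}
```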

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 56 --
 1 file changed, 41 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7a53f6a29ebc..3aba96442472 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5766,6 +5766,21 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
return ret;
 }
 
+static inline void __get_refill_bytes(struct btrfs_block_rsv *block_rsv,
+ u64 *metadata_bytes, u64 *qgroup_bytes)
+{
+   *metadata_bytes = 0;
+   *qgroup_bytes = 0;
+
+   spin_lock(&block_rsv->lock);
+   if (block_rsv->reserved < block_rsv->size)
+   *metadata_bytes = block_rsv->size - block_rsv->reserved;
+   if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
+   *qgroup_bytes = block_rsv->qgroup_rsv_size -
+   block_rsv->qgroup_rsv_reserved;
+   spin_unlock(&block_rsv->lock);
+}
+
 /**
  * btrfs_inode_rsv_refill - refill the inode block rsv.
  * @inode - the inode we are refilling.
@@ -5781,25 +5796,37 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode 
*inode,
 {
struct btrfs_root *root = inode->root;
	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
-   u64 num_bytes = 0;
+   u64 num_bytes = 0, last = 0;
u64 qgroup_num_bytes = 0;
int ret = -ENOSPC;
 
-   spin_lock(&block_rsv->lock);
-   if (block_rsv->reserved < block_rsv->size)
-   num_bytes = block_rsv->size - block_rsv->reserved;
-   if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
-   qgroup_num_bytes = block_rsv->qgroup_rsv_size -
-  block_rsv->qgroup_rsv_reserved;
-   spin_unlock(&block_rsv->lock);
-
+   __get_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes);
if (num_bytes == 0)
return 0;
 
-   ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
-   if (ret)
-   return ret;
-   ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
+   do {
+   ret = btrfs_qgroup_reserve_meta_prealloc(root, 
qgroup_num_bytes, true);
+   if (ret)
+   return ret;
+   ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
+   if (ret) {
+   btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+   last = num_bytes;
+   /*
+* If we are fragmented we can end up with a lot of
+* outstanding extents which will make our size be much
+* larger than our reserved amount.  If we happen to
+* try to do a reservation here that may result in us
+* trying to do a pretty hefty reservation, which we may
+* not need once delalloc flushing happens.  If this is
+* the case try and do the reserve again.
+*/
+   if (flush == BTRFS_RESERVE_FLUSH_ALL)
+   __get_refill_bytes(block_rsv, &num_bytes,
+  &qgroup_num_bytes);
+   }
+   } while (ret && last != num_bytes);
+
if (!ret) {
block_rsv_add_bytes(block_rsv, num_bytes, 0);
trace_btrfs_space_reservation(root->fs_info, "delalloc",
@@ -5809,8 +5836,7 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode 
*inode,
	spin_lock(&block_rsv->lock);
	block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
	spin_unlock(&block_rsv->lock);
-   } else
-   btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+   }
return ret;
 }
 
-- 
2.14.3



[PATCH 23/42] btrfs: make sure we create all new bgs

2018-10-11 Thread Josef Bacik
Allocating new chunks modifies both the extent and chunk tree, which can
trigger new chunk allocations.  So instead of doing list_for_each_safe,
just do while (!list_empty()) so we make sure we don't exit with other
pending bg's still on our list.
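The difference between snapshot iteration and draining can be shown with a toy singly linked list (hypothetical types, not the kernel's list_head):

```c
/*
 * Toy demonstration: processing an entry can queue another entry, the
 * way chunk allocation can add more pending block groups.  Draining
 * with "while not empty" still picks up the late arrival.
 */
#include <stdlib.h>

struct toy_bg { struct toy_bg *next; int id; };

static struct toy_bg *pending;

static void queue_bg(int id)
{
	struct toy_bg *bg = malloc(sizeof(*bg));

	bg->id = id;
	bg->next = pending;
	pending = bg;
}

/* Creating bg 1 "allocates a chunk", which queues bg 2. */
static int create_bg(struct toy_bg *bg)
{
	if (bg->id == 1)
		queue_bg(2);
	return bg->id;
}

/* Drain the list until it is empty; returns the sum of ids created. */
int drain_pending_bgs(void)
{
	int seen = 0;

	queue_bg(1);
	while (pending) {
		struct toy_bg *bg = pending;

		pending = bg->next;
		seen += create_bg(bg);
		free(bg);
	}
	return seen;
}
```

A snapshot of the initial list would have processed only bg 1 and exited with bg 2 still queued.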

Reviewed-by: Omar Sandoval 
Reviewed-by: Liu Bo 
Reviewed-by: David Sterba 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4f48d047a1ec..ec8ede2b1dec 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10361,7 +10361,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 {
struct btrfs_fs_info *fs_info = trans->fs_info;
-   struct btrfs_block_group_cache *block_group, *tmp;
+   struct btrfs_block_group_cache *block_group;
struct btrfs_root *extent_root = fs_info->extent_root;
struct btrfs_block_group_item item;
struct btrfs_key key;
@@ -10369,7 +10369,10 @@ void btrfs_create_pending_block_groups(struct 
btrfs_trans_handle *trans)
bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
 
trans->can_flush_pending_bgs = false;
-   list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
+   while (!list_empty(&trans->new_bgs)) {
+   block_group = list_first_entry(&trans->new_bgs,
+  struct btrfs_block_group_cache,
+  bg_list);
if (ret)
goto next;
 
-- 
2.14.3



[PATCH 10/42] btrfs: protect space cache inode alloc with nofs

2018-10-11 Thread Josef Bacik
If we're allocating a new space cache inode it's likely going to be
under a transaction handle, so we need to use memalloc_nofs_save() in
order to avoid deadlocks, and more importantly lockdep messages that
make xfstests fail.

Reviewed-by: Omar Sandoval 
Signed-off-by: Josef Bacik 
Reviewed-by: David Sterba 
---
 fs/btrfs/free-space-cache.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index c3888c113d81..e077ad3b4549 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include <linux/sched/mm.h>
 #include "ctree.h"
 #include "free-space-cache.h"
 #include "transaction.h"
@@ -47,6 +48,7 @@ static struct inode *__lookup_free_space_inode(struct 
btrfs_root *root,
struct btrfs_free_space_header *header;
struct extent_buffer *leaf;
struct inode *inode = NULL;
+   unsigned nofs_flag;
int ret;
 
key.objectid = BTRFS_FREE_SPACE_OBJECTID;
@@ -68,7 +70,13 @@ static struct inode *__lookup_free_space_inode(struct 
btrfs_root *root,
	btrfs_disk_key_to_cpu(&location, &disk_key);
btrfs_release_path(path);
 
+   /*
+* We are often under a trans handle at this point, so we need to make
+* sure NOFS is set to keep us from deadlocking.
+*/
+   nofs_flag = memalloc_nofs_save();
	inode = btrfs_iget(fs_info->sb, &location, root, NULL);
+   memalloc_nofs_restore(nofs_flag);
if (IS_ERR(inode))
return inode;
 
-- 
2.14.3



[PATCH 24/42] btrfs: assert on non-empty delayed iputs

2018-10-11 Thread Josef Bacik
I ran into an issue where there was some reference being held on an
inode that I couldn't track.  This assert wasn't triggered, but it at
least rules out that we're doing something stupid.

Reviewed-by: Omar Sandoval 
Reviewed-by: David Sterba 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 377ad9c1cb17..39bd158466cd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3979,6 +3979,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
kthread_stop(fs_info->transaction_kthread);
kthread_stop(fs_info->cleaner_kthread);
 
+   ASSERT(list_empty(&fs_info->delayed_iputs));
	set_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);
 
btrfs_free_qgroup_config(fs_info);
-- 
2.14.3



[PATCH 32/42] btrfs: only free reserved extent if we didn't insert it

2018-10-11 Thread Josef Bacik
When we insert the file extent once the ordered extent completes we free
the reserved extent reservation as it'll have been migrated to the
bytes_used counter.  However if we error out after this step we'll still
clear the reserved extent reservation, resulting in a negative
accounting of the reserved bytes for the block group and space info.
Fix this by only doing the free if we didn't successfully insert a file
extent for this extent.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5a91055a13b2..2b257d14bd3d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2992,6 +2992,7 @@ static int btrfs_finish_ordered_io(struct 
btrfs_ordered_extent *ordered_extent)
bool truncated = false;
bool range_locked = false;
bool clear_new_delalloc_bytes = false;
+   bool clear_reserved_extent = true;
 
	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
	!test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) &&
@@ -3095,10 +3096,12 @@ static int btrfs_finish_ordered_io(struct 
btrfs_ordered_extent *ordered_extent)
logical_len, logical_len,
compress_type, 0, 0,
BTRFS_FILE_EXTENT_REG);
-   if (!ret)
+   if (!ret) {
+   clear_reserved_extent = false;
btrfs_release_delalloc_bytes(fs_info,
 ordered_extent->start,
 ordered_extent->disk_len);
+   }
}
	unpin_extent_cache(&BTRFS_I(inode)->extent_tree,
   ordered_extent->file_offset, ordered_extent->len,
@@ -3159,8 +3162,13 @@ static int btrfs_finish_ordered_io(struct 
btrfs_ordered_extent *ordered_extent)
 * wrong we need to return the space for this ordered extent
 * back to the allocator.  We only free the extent in the
 * truncated case if we didn't write out the extent at all.
+*
+* If we made it past insert_reserved_file_extent before we
+* errored out then we don't need to do this as the accounting
+* has already been done.
 */
if ((ret || !logical_len) &&
+   clear_reserved_extent &&
	!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
	!test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags))
btrfs_free_reserved_extent(fs_info,
-- 
2.14.3



[PATCH 21/42] btrfs: reset max_extent_size on clear in a bitmap

2018-10-11 Thread Josef Bacik
From: Josef Bacik 

We need to clear the max_extent_size when we clear bits from a bitmap
since it could have been from the range that contains the
max_extent_size.

Reviewed-by: Liu Bo 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/free-space-cache.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 2e96ee7da3ec..d2a863a2ee24 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1687,6 +1687,8 @@ static inline void __bitmap_clear_bits(struct 
btrfs_free_space_ctl *ctl,
bitmap_clear(info->bitmap, start, count);
 
info->bytes -= bytes;
+   if (info->max_extent_size > ctl->unit)
+   info->max_extent_size = 0;
 }
 
 static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
-- 
2.14.3



[PATCH 18/42] btrfs: move the dio_sem higher up the callchain

2018-10-11 Thread Josef Bacik
We're getting a lockdep splat because we take the dio_sem under the
log_mutex.  What we really need is to protect fsync() from logging an
extent map for an extent we never waited on higher up, so just guard the
whole thing with dio_sem.

==
WARNING: possible circular locking dependency detected
4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411 Not tainted
--
aio-dio-invalid/30928 is trying to acquire lock:
92621cfd (&mm->mmap_sem){++++}, at: get_user_pages_unlocked+0x5a/0x1e0

but task is already holding lock:
cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #5 (&ei->dio_sem){++++}:
   lock_acquire+0xbd/0x220
   down_write+0x51/0xb0
   btrfs_log_changed_extents+0x80/0xa40
   btrfs_log_inode+0xbaf/0x1000
   btrfs_log_inode_parent+0x26f/0xa80
   btrfs_log_dentry_safe+0x50/0x70
   btrfs_sync_file+0x357/0x540
   do_fsync+0x38/0x60
   __ia32_sys_fdatasync+0x12/0x20
   do_fast_syscall_32+0x9a/0x2f0
   entry_SYSENTER_compat+0x84/0x96

-> #4 (&ei->log_mutex){+.+.}:
   lock_acquire+0xbd/0x220
   __mutex_lock+0x86/0xa10
   btrfs_record_unlink_dir+0x2a/0xa0
   btrfs_unlink+0x5a/0xc0
   vfs_unlink+0xb1/0x1a0
   do_unlinkat+0x264/0x2b0
   do_fast_syscall_32+0x9a/0x2f0
   entry_SYSENTER_compat+0x84/0x96

-> #3 (sb_internal#2){.+.+}:
   lock_acquire+0xbd/0x220
   __sb_start_write+0x14d/0x230
   start_transaction+0x3e6/0x590
   btrfs_evict_inode+0x475/0x640
   evict+0xbf/0x1b0
   btrfs_run_delayed_iputs+0x6c/0x90
   cleaner_kthread+0x124/0x1a0
   kthread+0x106/0x140
   ret_from_fork+0x3a/0x50

-> #2 (&fs_info->cleaner_delayed_iput_mutex){+.+.}:
   lock_acquire+0xbd/0x220
   __mutex_lock+0x86/0xa10
   btrfs_alloc_data_chunk_ondemand+0x197/0x530
   btrfs_check_data_free_space+0x4c/0x90
   btrfs_delalloc_reserve_space+0x20/0x60
   btrfs_page_mkwrite+0x87/0x520
   do_page_mkwrite+0x31/0xa0
   __handle_mm_fault+0x799/0xb00
   handle_mm_fault+0x7c/0xe0
   __do_page_fault+0x1d3/0x4a0
   async_page_fault+0x1e/0x30

-> #1 (sb_pagefaults){.+.+}:
   lock_acquire+0xbd/0x220
   __sb_start_write+0x14d/0x230
   btrfs_page_mkwrite+0x6a/0x520
   do_page_mkwrite+0x31/0xa0
   __handle_mm_fault+0x799/0xb00
   handle_mm_fault+0x7c/0xe0
   __do_page_fault+0x1d3/0x4a0
   async_page_fault+0x1e/0x30

-> #0 (&mm->mmap_sem){++++}:
   __lock_acquire+0x42e/0x7a0
   lock_acquire+0xbd/0x220
   down_read+0x48/0xb0
   get_user_pages_unlocked+0x5a/0x1e0
   get_user_pages_fast+0xa4/0x150
   iov_iter_get_pages+0xc3/0x340
   do_direct_IO+0xf93/0x1d70
   __blockdev_direct_IO+0x32d/0x1c20
   btrfs_direct_IO+0x227/0x400
   generic_file_direct_write+0xcf/0x180
   btrfs_file_write_iter+0x308/0x58c
   aio_write+0xf8/0x1d0
   io_submit_one+0x3a9/0x620
   __ia32_compat_sys_io_submit+0xb2/0x270
   do_int80_syscall_32+0x5b/0x1a0
   entry_INT80_compat+0x88/0xa0

other info that might help us debug this:

Chain exists of:
  &mm->mmap_sem --> &ei->log_mutex --> &ei->dio_sem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ei->dio_sem);
                               lock(&ei->log_mutex);
                               lock(&ei->dio_sem);
  lock(&mm->mmap_sem);

 *** DEADLOCK ***

1 lock held by aio-dio-invalid/30928:
 #0: cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400

stack backtrace:
CPU: 0 PID: 30928 Comm: aio-dio-invalid Not tainted 
4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 
04/01/2014
Call Trace:
 dump_stack+0x7c/0xbb
 print_circular_bug.isra.37+0x297/0x2a4
 check_prev_add.constprop.45+0x781/0x7a0
 ? __lock_acquire+0x42e/0x7a0
 validate_chain.isra.41+0x7f0/0xb00
 __lock_acquire+0x42e/0x7a0
 lock_acquire+0xbd/0x220
 ? get_user_pages_unlocked+0x5a/0x1e0
 down_read+0x48/0xb0
 ? get_user_pages_unlocked+0x5a/0x1e0
 get_user_pages_unlocked+0x5a/0x1e0
 get_user_pages_fast+0xa4/0x150
 iov_iter_get_pages+0xc3/0x340
 do_direct_IO+0xf93/0x1d70
 ? __alloc_workqueue_key+0x358/0x490
 ? __blockdev_direct_IO+0x14b/0x1c20
 __blockdev_direct_IO+0x32d/0x1c20
 ? btrfs_run_delalloc_work+0x40/0x40
 ? can_nocow_extent+0x490/0x490
 ? kvm_clock_read+0x1f/0x30
 ? can_nocow_extent+0x490/0x490
 ? btrfs_run_delalloc_work+0x40/0x40
 btrfs_direct_IO+0x227/0x400
 ? btrfs_run_delalloc_work+0x40/0x40
 generic_file_direct_write+0xcf/0x180
 btrfs_file_write_iter+0x308/0x58c
 aio_write+0xf8/0x1d0
 ? kvm_clock_read+0x1f/0x30
 ? __might_fault+0x3e/0x90
 io_submit_one+0x3a9/0x620
 ? io_submit_one+0xe5/0x620
 __ia32_compat_sys_io_submit+0xb2/0x270
 do_int80_syscall_32+0x5b/0x1a0
 entry_INT80_compat+0x88/0xa0


[PATCH 15/42] btrfs: don't enospc all tickets on flush failure

2018-10-11 Thread Josef Bacik
With the introduction of the per-inode block_rsv it became possible to
have really really large reservation requests made because of data
fragmentation.  Since the ticket stuff assumed that we'd always have
relatively small reservation requests it just killed all tickets if we
were unable to satisfy the current request.  However this is generally
not the case anymore.  So fix this logic to instead see if we had a
ticket that we were able to give some reservation to, and if we were
continue the flushing loop again.  Likewise we make the tickets use the
space_info_add_old_bytes() method of returning what reservation they did
receive in hopes that it could satisfy reservations down the line.
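The new "did anyone make progress" check can be sketched as a standalone predicate (hypothetical types, purely illustrative):

```c
/*
 * Sketch of the heuristic above: if any ticket's remaining bytes
 * shrank below what it originally asked for, flushing is making
 * progress and is worth another cycle instead of failing everyone.
 */
#include <stdbool.h>

typedef unsigned long long u64;

struct toy_ticket { u64 orig_bytes; u64 bytes; };

bool any_ticket_made_progress(const struct toy_ticket *t, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (t[i].bytes != t[i].orig_bytes)
			return true;
	return false;
}
```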

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 45 +
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f84537a1d7eb..7a53f6a29ebc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4779,6 +4779,7 @@ static void shrink_delalloc(struct btrfs_fs_info 
*fs_info, u64 to_reclaim,
 }
 
 struct reserve_ticket {
+   u64 orig_bytes;
u64 bytes;
int error;
struct list_head list;
@@ -5000,7 +5001,7 @@ static inline int need_do_async_reclaim(struct 
btrfs_fs_info *fs_info,
!test_bit(BTRFS_FS_STATE_REMOUNTING, _info->fs_state));
 }
 
-static void wake_all_tickets(struct list_head *head)
+static bool wake_all_tickets(struct list_head *head)
 {
struct reserve_ticket *ticket;
 
@@ -5009,7 +5010,10 @@ static void wake_all_tickets(struct list_head *head)
	list_del_init(&ticket->list);
	ticket->error = -ENOSPC;
	wake_up(&ticket->wait);
+   if (ticket->bytes != ticket->orig_bytes)
+   return true;
}
+   return false;
 }
 
 /*
@@ -5077,8 +5081,12 @@ static void btrfs_async_reclaim_metadata_space(struct 
work_struct *work)
if (flush_state > COMMIT_TRANS) {
commit_cycles++;
if (commit_cycles > 2) {
-   wake_all_tickets(&space_info->tickets);
-   space_info->flush = 0;
+   if (wake_all_tickets(&space_info->tickets)) {
+   flush_state = FLUSH_DELAYED_ITEMS_NR;
+   commit_cycles--;
+   } else {
+   space_info->flush = 0;
+   }
} else {
flush_state = FLUSH_DELAYED_ITEMS_NR;
}
@@ -5130,10 +5138,11 @@ static void priority_reclaim_metadata_space(struct 
btrfs_fs_info *fs_info,
 
 static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
   struct btrfs_space_info *space_info,
-  struct reserve_ticket *ticket, u64 orig_bytes)
+  struct reserve_ticket *ticket)
 
 {
DEFINE_WAIT(wait);
+   u64 reclaim_bytes = 0;
int ret = 0;
 
	spin_lock(&space_info->lock);
@@ -5154,14 +5163,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info 
*fs_info,
ret = ticket->error;
	if (!list_empty(&ticket->list))
	list_del_init(&ticket->list);
-   if (ticket->bytes && ticket->bytes < orig_bytes) {
-   u64 num_bytes = orig_bytes - ticket->bytes;
-   space_info->bytes_may_use -= num_bytes;
-   trace_btrfs_space_reservation(fs_info, "space_info",
- space_info->flags, num_bytes, 0);
-   }
+   if (ticket->bytes && ticket->bytes < ticket->orig_bytes)
+   reclaim_bytes = ticket->orig_bytes - ticket->bytes;
	spin_unlock(&space_info->lock);
 
+   if (reclaim_bytes)
+   space_info_add_old_bytes(fs_info, space_info, reclaim_bytes);
return ret;
 }
 
@@ -5187,6 +5194,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info 
*fs_info,
 {
struct reserve_ticket ticket;
u64 used;
+   u64 reclaim_bytes = 0;
int ret = 0;
 
ASSERT(orig_bytes);
@@ -5222,6 +5230,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info 
*fs_info,
 * the list and we will do our own flushing further down.
 */
if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
+   ticket.orig_bytes = orig_bytes;
ticket.bytes = orig_bytes;
ticket.error = 0;
	init_waitqueue_head(&ticket.wait);
@@ -5262,25 +5271,21 @@ static int __reserve_metadata_bytes(struct 
btrfs_fs_info *fs_info,
return ret;
 
if (flush == BTRFS_RESERVE_FLUSH_ALL)
-   return wait_reserve_ticket(fs_info, space_info, &ticket,
-  orig_bytes);
+   return wait_reserve_ticket(fs_info, space_info, &ticket);

[PATCH 29/42] btrfs: call btrfs_create_pending_block_groups unconditionally

2018-10-11 Thread Josef Bacik
The first thing we do is loop through the list, this

if (!list_empty(&trans->new_bgs))
	btrfs_create_pending_block_groups(trans);

thing is just wasted space.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 3 +--
 fs/btrfs/transaction.c | 6 ++
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3ca42f4cd462..6fa7d19bcbb1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2978,8 +2978,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle 
*trans,
}
 
if (run_all) {
-   if (!list_empty(&trans->new_bgs))
-   btrfs_create_pending_block_groups(trans);
+   btrfs_create_pending_block_groups(trans);
 
	spin_lock(&delayed_refs->lock);
	node = rb_first(&delayed_refs->href_root);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 39a2bddb0b29..46ca775a709e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -846,8 +846,7 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
btrfs_trans_release_metadata(trans);
trans->block_rsv = NULL;
 
-   if (!list_empty(&trans->new_bgs))
-   btrfs_create_pending_block_groups(trans);
+   btrfs_create_pending_block_groups(trans);
 
btrfs_trans_release_chunk_metadata(trans);
 
@@ -1934,8 +1933,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
cur_trans->delayed_refs.flushing = 1;
smp_wmb();
 
-   if (!list_empty(&trans->new_bgs))
-   btrfs_create_pending_block_groups(trans);
+   btrfs_create_pending_block_groups(trans);
 
	if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) {
int run_it = 0;
-- 
2.14.3



[PATCH 17/42] btrfs: run delayed iputs before committing

2018-10-11 Thread Josef Bacik
Delayed iputs means we can have final iputs of deleted inodes in the
queue, which could potentially generate a lot of pinned space that could
be free'd.  So before we decide to commit the transaction for ENOSPC
reasons, run the delayed iputs so that any potential space is free'd up.
If there is, and we freed enough, we can then commit the transaction and
potentially be able to make our reservation.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
---
 fs/btrfs/extent-tree.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3aba96442472..6e7bc3197737 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4823,6 +4823,15 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
if (!bytes)
return 0;
 
+   /*
+* If we have pending delayed iputs then we could free up a bunch of
+* pinned space, so make sure we run the iputs before we do our pinned
+* bytes check below.
+*/
+   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
+   btrfs_run_delayed_iputs(fs_info);
+   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+
trans = btrfs_join_transaction(fs_info->extent_root);
if (IS_ERR(trans))
return -ENOSPC;
-- 
2.14.3



[PATCH 31/42] btrfs: cleanup pending bgs on transaction abort

2018-10-11 Thread Josef Bacik
We may abort the transaction during a commit and not have a chance to
run the pending bgs stuff, which will leave block groups on our list and
cause us accounting issues and leaked memory.  Fix this by running the
pending bgs when we cleanup a transaction.

Reviewed-by: Omar Sandoval 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/transaction.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 46ca775a709e..9168efaca37e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2280,6 +2280,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
btrfs_scrub_continue(fs_info);
 cleanup_transaction:
btrfs_trans_release_metadata(trans);
+   /* This cleans up the pending block groups list properly. */
+   if (!trans->aborted)
+   trans->aborted = ret;
+   btrfs_create_pending_block_groups(trans);
btrfs_trans_release_chunk_metadata(trans);
trans->block_rsv = NULL;
btrfs_warn(fs_info, "Skipping commit of aborted transaction.");
-- 
2.14.3



[PATCH 12/42] btrfs: don't use global rsv for chunk allocation

2018-10-11 Thread Josef Bacik
We've done this forever because of the voodoo around knowing how much
space we have.  However we have better ways of doing this now, and on
normal file systems we'll easily have a global reserve of 512MiB, and
since metadata chunks are usually 1GiB that means we'll allocate
metadata chunks more readily.  Instead use the actual used amount when
determining if we need to allocate a chunk or not.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c9913c59686b..c0f6110419b2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4374,21 +4374,12 @@ static inline u64 calc_global_rsv_need_space(struct 
btrfs_block_rsv *global)
 static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
  struct btrfs_space_info *sinfo, int force)
 {
-   struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
u64 bytes_used = btrfs_space_info_used(sinfo, false);
u64 thresh;
 
if (force == CHUNK_ALLOC_FORCE)
return 1;
 
-   /*
-* We need to take into account the global rsv because for all intents
-* and purposes it's used space.  Don't worry about locking the
-* global_rsv, it doesn't change except when the transaction commits.
-*/
-   if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
-   bytes_used += calc_global_rsv_need_space(global_rsv);
-
/*
 * in limited mode, we want to have some free space up to
 * about 1% of the FS size.
-- 
2.14.3



[PATCH 27/42] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head

2018-10-11 Thread Josef Bacik
Instead of open coding this stuff, use the helper.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 121ab180a78a..fe1f229320ef 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4232,12 +4232,7 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
if (head->must_insert_reserved)
pin_bytes = true;
btrfs_free_delayed_extent_op(head->extent_op);
-   delayed_refs->num_heads--;
-   if (head->processing == 0)
-   delayed_refs->num_heads_ready--;
-   atomic_dec(&delayed_refs->num_entries);
-   rb_erase(&head->href_node, &delayed_refs->href_root);
-   RB_CLEAR_NODE(&head->href_node);
+   btrfs_delete_ref_head(delayed_refs, head);
spin_unlock(&head->lock);
spin_unlock(&delayed_refs->lock);
mutex_unlock(&head->mutex);
-- 
2.14.3



[PATCH 14/42] btrfs: reset max_extent_size properly

2018-10-11 Thread Josef Bacik
If we use up our block group before allocating a new one we'll easily
get a max_extent_size that's set really really low, which will result in
a lot of fragmentation.  We need to make sure we're resetting the
max_extent_size when we add a new chunk or add new space.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cd2280962c8c..f84537a1d7eb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4573,6 +4573,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
*trans, u64 flags,
goto out;
} else {
ret = 1;
+   space_info->max_extent_size = 0;
}
 
space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
@@ -6671,6 +6672,7 @@ static int btrfs_free_reserved_bytes(struct 
btrfs_block_group_cache *cache,
space_info->bytes_readonly += num_bytes;
cache->reserved -= num_bytes;
space_info->bytes_reserved -= num_bytes;
+   space_info->max_extent_size = 0;
 
if (delalloc)
cache->delalloc_bytes -= num_bytes;
-- 
2.14.3



[PATCH 19/42] btrfs: set max_extent_size properly

2018-10-11 Thread Josef Bacik
From: Josef Bacik 

We can't use entry->bytes if our entry is a bitmap entry; we need to use
entry->max_extent_size in that case.  Fix up all the logic to make this
consistent.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/free-space-cache.c | 29 +++--
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index e077ad3b4549..2e96ee7da3ec 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1770,6 +1770,18 @@ static int search_bitmap(struct btrfs_free_space_ctl 
*ctl,
return -1;
 }
 
+static void set_max_extent_size(struct btrfs_free_space *entry,
+   u64 *max_extent_size)
+{
+   if (entry->bitmap) {
+   if (entry->max_extent_size > *max_extent_size)
+   *max_extent_size = entry->max_extent_size;
+   } else {
+   if (entry->bytes > *max_extent_size)
+   *max_extent_size = entry->bytes;
+   }
+}
+
 /* Cache the size of the max extent in bytes */
 static struct btrfs_free_space *
 find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
@@ -1791,8 +1803,7 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 
*offset, u64 *bytes,
for (node = &entry->offset_index; node; node = rb_next(node)) {
entry = rb_entry(node, struct btrfs_free_space, offset_index);
if (entry->bytes < *bytes) {
-   if (entry->bytes > *max_extent_size)
-   *max_extent_size = entry->bytes;
+   set_max_extent_size(entry, max_extent_size);
continue;
}
 
@@ -1810,8 +1821,7 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 
*offset, u64 *bytes,
}
 
if (entry->bytes < *bytes + align_off) {
-   if (entry->bytes > *max_extent_size)
-   *max_extent_size = entry->bytes;
+   set_max_extent_size(entry, max_extent_size);
continue;
}
 
@@ -1823,8 +1833,8 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 
*offset, u64 *bytes,
*offset = tmp;
*bytes = size;
return entry;
-   } else if (size > *max_extent_size) {
-   *max_extent_size = size;
+   } else {
+   set_max_extent_size(entry, max_extent_size);
}
continue;
}
@@ -2684,8 +2694,7 @@ static u64 btrfs_alloc_from_bitmap(struct 
btrfs_block_group_cache *block_group,
 
err = search_bitmap(ctl, entry, &search_start, &search_bytes, true);
if (err) {
-   if (search_bytes > *max_extent_size)
-   *max_extent_size = search_bytes;
+   set_max_extent_size(entry, max_extent_size);
return 0;
}
 
@@ -2722,8 +2731,8 @@ u64 btrfs_alloc_from_cluster(struct 
btrfs_block_group_cache *block_group,
 
entry = rb_entry(node, struct btrfs_free_space, offset_index);
while (1) {
-   if (entry->bytes < bytes && entry->bytes > *max_extent_size)
-   *max_extent_size = entry->bytes;
+   if (entry->bytes < bytes)
+   set_max_extent_size(entry, max_extent_size);
 
if (entry->bytes < bytes ||
(!entry->bitmap && entry->offset < min_start)) {
-- 
2.14.3



[PATCH 13/42] btrfs: add ALLOC_CHUNK_FORCE to the flushing code

2018-10-11 Thread Josef Bacik
With my change to no longer take into account the global reserve for
metadata allocation chunks we have this side-effect for mixed block
group fs'es where we are no longer allocating enough chunks for the
data/metadata requirements.  To deal with this, add an ALLOC_CHUNK_FORCE
step to the flushing state machine.  This will only get used if we've
already made a full loop through the flushing machinery and tried
committing the transaction.  If we have then we can try and force a
chunk allocation since we likely need it to make progress.  This
resolves the issues I was seeing with the mixed bg tests in xfstests
with my previous patch.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h |  3 ++-
 fs/btrfs/extent-tree.c   | 18 +-
 include/trace/events/btrfs.h |  1 +
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1a2c3b629af2..29db902511c1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2730,7 +2730,8 @@ enum btrfs_flush_state {
FLUSH_DELALLOC  =   5,
FLUSH_DELALLOC_WAIT =   6,
ALLOC_CHUNK =   7,
-   COMMIT_TRANS=   8,
+   ALLOC_CHUNK_FORCE   =   8,
+   COMMIT_TRANS=   9,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c0f6110419b2..cd2280962c8c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4914,6 +4914,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
btrfs_end_transaction(trans);
break;
case ALLOC_CHUNK:
+   case ALLOC_CHUNK_FORCE:
trans = btrfs_join_transaction(root);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
@@ -4921,7 +4922,9 @@ static void flush_space(struct btrfs_fs_info *fs_info,
}
ret = do_chunk_alloc(trans,
 btrfs_metadata_alloc_profile(fs_info),
-CHUNK_ALLOC_NO_FORCE);
+(state == ALLOC_CHUNK) ?
+CHUNK_ALLOC_NO_FORCE :
+CHUNK_ALLOC_FORCE);
btrfs_end_transaction(trans);
if (ret > 0 || ret == -ENOSPC)
ret = 0;
@@ -5057,6 +5060,19 @@ static void btrfs_async_reclaim_metadata_space(struct 
work_struct *work)
commit_cycles--;
}
 
+   /*
+* We don't want to force a chunk allocation until we've tried
+* pretty hard to reclaim space.  Think of the case where we
+* free'd up a bunch of space and so have a lot of pinned space
+* to reclaim.  We would rather use that than possibly create an
+* underutilized metadata chunk.  So if this is our first run
+* through the flushing state machine skip ALLOC_CHUNK_FORCE and
+* commit the transaction.  If nothing has changed the next go
+* around then we can force a chunk allocation.
+*/
+   if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
+   flush_state++;
+
if (flush_state > COMMIT_TRANS) {
commit_cycles++;
if (commit_cycles > 2) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 7d205e50b09c..fdb23181b5b7 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush,
{ FLUSH_DELAYED_REFS_NR,"FLUSH_DELAYED_REFS_NR"},   
\
{ FLUSH_DELAYED_REFS,   "FLUSH_ELAYED_REFS"},   
\
{ ALLOC_CHUNK,  "ALLOC_CHUNK"}, 
\
+   { ALLOC_CHUNK_FORCE,"ALLOC_CHUNK_FORCE"},   
\
{ COMMIT_TRANS, "COMMIT_TRANS"})
 
 TRACE_EVENT(btrfs_flush_space,
-- 
2.14.3



[PATCH 09/42] btrfs: release metadata before running delayed refs

2018-10-11 Thread Josef Bacik
We want to release the unused reservation we have since it refills the
delayed refs reserve, which will make everything go smoother when
running the delayed refs if we're short on our reservation.

Reviewed-by: Omar Sandoval 
Reviewed-by: Liu Bo 
Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/transaction.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 117e0c4a914a..a0f19ca0bd6c 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1922,6 +1922,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
return ret;
}
 
+   btrfs_trans_release_metadata(trans);
+   trans->block_rsv = NULL;
+
/* make a pass through all the delayed refs we have so far
 * any runnings procs may add more while we are here
 */
@@ -1931,9 +1934,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
return ret;
}
 
-   btrfs_trans_release_metadata(trans);
-   trans->block_rsv = NULL;
-
cur_trans = trans->transaction;
 
/*
-- 
2.14.3



[PATCH 11/42] btrfs: fix truncate throttling

2018-10-11 Thread Josef Bacik
We have a bunch of magic to make sure we're throttling delayed refs when
truncating a file.  Now that we have a delayed refs rsv and a mechanism
for refilling that reserve, simply use that instead of all of this magic.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 79 
 1 file changed, 17 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cd00ec869c96..5a91055a13b2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4493,31 +4493,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry 
*dentry)
return err;
 }
 
-static int truncate_space_check(struct btrfs_trans_handle *trans,
-   struct btrfs_root *root,
-   u64 bytes_deleted)
-{
-   struct btrfs_fs_info *fs_info = root->fs_info;
-   int ret;
-
-   /*
-* This is only used to apply pressure to the enospc system, we don't
-* intend to use this reservation at all.
-*/
-   bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted);
-   bytes_deleted *= fs_info->nodesize;
-   ret = btrfs_block_rsv_add(root, &fs_info->trans_block_rsv,
- bytes_deleted, BTRFS_RESERVE_NO_FLUSH);
-   if (!ret) {
-   trace_btrfs_space_reservation(fs_info, "transaction",
- trans->transid,
- bytes_deleted, 1);
-   trans->bytes_reserved += bytes_deleted;
-   }
-   return ret;
-
-}
-
 /*
  * Return this if we need to call truncate_block for the last bit of the
  * truncate.
@@ -4562,7 +4537,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
u64 bytes_deleted = 0;
bool be_nice = false;
bool should_throttle = false;
-   bool should_end = false;
 
BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
 
@@ -4775,15 +4749,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
btrfs_abort_transaction(trans, ret);
break;
}
-   if (btrfs_should_throttle_delayed_refs(trans, fs_info))
-   btrfs_async_run_delayed_refs(fs_info,
-   trans->delayed_ref_updates * 2,
-   trans->transid, 0);
if (be_nice) {
-   if (truncate_space_check(trans, root,
-extent_num_bytes)) {
-   should_end = true;
-   }
if (btrfs_should_throttle_delayed_refs(trans,
   fs_info))
should_throttle = true;
@@ -4795,7 +4761,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
 
if (path->slots[0] == 0 ||
path->slots[0] != pending_del_slot ||
-   should_throttle || should_end) {
+   should_throttle) {
if (pending_del_nr) {
ret = btrfs_del_items(trans, root, path,
pending_del_slot,
@@ -4807,23 +4773,24 @@ int btrfs_truncate_inode_items(struct 
btrfs_trans_handle *trans,
pending_del_nr = 0;
}
btrfs_release_path(path);
-   if (should_throttle) {
-   unsigned long updates = trans->delayed_ref_updates;
-   if (updates) {
-   trans->delayed_ref_updates = 0;
-   ret = btrfs_run_delayed_refs(trans,
-  updates * 2);
-   if (ret)
-   break;
-   }
-   }
+
/*
-* if we failed to refill our space rsv, bail out
-* and let the transaction restart
+* We can generate a lot of delayed refs, so we need to
+* throttle every once in a while and make sure we're
+* adding enough space to keep up with the work we are
+* generating.  Since we hold a transaction here we
+* can't flush, and we don't want to FLUSH_LIMIT because
+* we could have generated too many delayed refs to
+* actually allocate, so just bail if we're short and
+* let the 

[PATCH 04/42] btrfs: only track ref_heads in delayed_ref_updates

2018-10-11 Thread Josef Bacik
From: Josef Bacik 

We use this number to figure out how many delayed refs to run, but
__btrfs_run_delayed_refs really only checks every time we need a new
delayed ref head, so we always run at least one ref head completely no
matter what the number of items on it.  Fix the accounting to only be
adjusted when we add/remove a ref head.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/delayed-ref.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 3a9e4ac21794..27f7dd4e3d52 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -234,8 +234,6 @@ static inline void drop_delayed_ref(struct 
btrfs_trans_handle *trans,
ref->in_tree = 0;
btrfs_put_delayed_ref(ref);
atomic_dec(&delayed_refs->num_entries);
-   if (trans->delayed_ref_updates)
-   trans->delayed_ref_updates--;
 }
 
 static bool merge_ref(struct btrfs_trans_handle *trans,
@@ -460,7 +458,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle 
*trans,
if (ref->action == BTRFS_ADD_DELAYED_REF)
list_add_tail(&ref->add_list, &href->ref_add_list);
atomic_inc(&root->num_entries);
-   trans->delayed_ref_updates++;
spin_unlock(&href->lock);
return ret;
 }
-- 
2.14.3



[PATCH 03/42] btrfs: cleanup extent_op handling

2018-10-11 Thread Josef Bacik
From: Josef Bacik 

The cleanup_extent_op function actually would run the extent_op if it
needed running, which made the name sort of a misnomer.  Change it to
run_and_cleanup_extent_op, and move the actual cleanup work to
cleanup_extent_op so it can be used by check_ref_cleanup() in order to
unify the extent op handling.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 36 +++-
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a44d55e36e11..98f36dfeccb0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2442,19 +2442,33 @@ static void unselect_delayed_ref_head(struct 
btrfs_delayed_ref_root *delayed_ref
btrfs_delayed_ref_unlock(head);
 }
 
-static int cleanup_extent_op(struct btrfs_trans_handle *trans,
-struct btrfs_delayed_ref_head *head)
+static struct btrfs_delayed_extent_op *
+cleanup_extent_op(struct btrfs_trans_handle *trans,
+ struct btrfs_delayed_ref_head *head)
 {
struct btrfs_delayed_extent_op *extent_op = head->extent_op;
-   int ret;
 
if (!extent_op)
-   return 0;
-   head->extent_op = NULL;
+   return NULL;
+
if (head->must_insert_reserved) {
+   head->extent_op = NULL;
btrfs_free_delayed_extent_op(extent_op);
-   return 0;
+   return NULL;
}
+   return extent_op;
+}
+
+static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans,
+struct btrfs_delayed_ref_head *head)
+{
+   struct btrfs_delayed_extent_op *extent_op =
+   cleanup_extent_op(trans, head);
+   int ret;
+
+   if (!extent_op)
+   return 0;
+   head->extent_op = NULL;
spin_unlock(&head->lock);
ret = run_delayed_extent_op(trans, head, extent_op);
btrfs_free_delayed_extent_op(extent_op);
@@ -2506,7 +2520,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
 
delayed_refs = &trans->transaction->delayed_refs;
 
-   ret = cleanup_extent_op(trans, head);
+   ret = run_and_cleanup_extent_op(trans, head);
if (ret < 0) {
unselect_delayed_ref_head(delayed_refs, head);
btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
@@ -6977,12 +6991,8 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (!RB_EMPTY_ROOT(&head->ref_tree))
goto out;
 
-   if (head->extent_op) {
-   if (!head->must_insert_reserved)
-   goto out;
-   btrfs_free_delayed_extent_op(head->extent_op);
-   head->extent_op = NULL;
-   }
+   if (cleanup_extent_op(trans, head) != NULL)
+   goto out;
 
/*
 * waiting for the lock here would deadlock.  If someone else has it
-- 
2.14.3



[PATCH 08/42] btrfs: dump block_rsv when dumping space info

2018-10-11 Thread Josef Bacik
For enospc_debug, having the block rsvs is super helpful to see if we've
done something wrong.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
Reviewed-by: David Sterba 
---
 fs/btrfs/extent-tree.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index da73b3e5bc39..c9913c59686b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7918,6 +7918,15 @@ static noinline int find_free_extent(struct 
btrfs_fs_info *fs_info,
return ret;
 }
 
+#define DUMP_BLOCK_RSV(fs_info, rsv_name)  \
+do {   \
+   struct btrfs_block_rsv *__rsv = &(fs_info)->rsv_name;   \
+   spin_lock(&__rsv->lock);\
+   btrfs_info(fs_info, #rsv_name ": size %llu reserved %llu",  \
+  __rsv->size, __rsv->reserved);   \
+   spin_unlock(&__rsv->lock);  \
+} while (0)
+
 static void dump_space_info(struct btrfs_fs_info *fs_info,
struct btrfs_space_info *info, u64 bytes,
int dump_block_groups)
@@ -7937,6 +7946,12 @@ static void dump_space_info(struct btrfs_fs_info 
*fs_info,
info->bytes_readonly);
spin_unlock(&info->lock);
 
+   DUMP_BLOCK_RSV(fs_info, global_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, chunk_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, delayed_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv);
+
if (!dump_block_groups)
return;
 
-- 
2.14.3



[PATCH 02/42] btrfs: add cleanup_ref_head_accounting helper

2018-10-11 Thread Josef Bacik
From: Josef Bacik 

We were missing some quota cleanups in check_ref_cleanup, so break the
ref head accounting cleanup into a helper and call that from both
check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
we don't screw up accounting in the future for other things that we add.

Reviewed-by: Omar Sandoval 
Reviewed-by: Liu Bo 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 67 +-
 1 file changed, 39 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d24a0de4a2e7..a44d55e36e11 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2461,6 +2461,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
*trans,
return ret ? ret : 1;
 }
 
+static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_ref_head *head)
+{
+   struct btrfs_fs_info *fs_info = trans->fs_info;
+   struct btrfs_delayed_ref_root *delayed_refs =
+   &trans->transaction->delayed_refs;
+
+   if (head->total_ref_mod < 0) {
+   struct btrfs_space_info *space_info;
+   u64 flags;
+
+   if (head->is_data)
+   flags = BTRFS_BLOCK_GROUP_DATA;
+   else if (head->is_system)
+   flags = BTRFS_BLOCK_GROUP_SYSTEM;
+   else
+   flags = BTRFS_BLOCK_GROUP_METADATA;
+   space_info = __find_space_info(fs_info, flags);
+   ASSERT(space_info);
+   percpu_counter_add_batch(&space_info->total_bytes_pinned,
+  -head->num_bytes,
+  BTRFS_TOTAL_BYTES_PINNED_BATCH);
+
+   if (head->is_data) {
+   spin_lock(&delayed_refs->lock);
+   delayed_refs->pending_csums -= head->num_bytes;
+   spin_unlock(&delayed_refs->lock);
+   }
+   }
+
+   /* Also free its reserved qgroup space */
+   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
+ head->qgroup_reserved);
+}
+
 static int cleanup_ref_head(struct btrfs_trans_handle *trans,
struct btrfs_delayed_ref_head *head)
 {
@@ -2496,31 +2531,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
spin_unlock(&head->lock);
spin_unlock(&delayed_refs->lock);
 
-   trace_run_delayed_ref_head(fs_info, head, 0);
-
-   if (head->total_ref_mod < 0) {
-   struct btrfs_space_info *space_info;
-   u64 flags;
-
-   if (head->is_data)
-   flags = BTRFS_BLOCK_GROUP_DATA;
-   else if (head->is_system)
-   flags = BTRFS_BLOCK_GROUP_SYSTEM;
-   else
-   flags = BTRFS_BLOCK_GROUP_METADATA;
-   space_info = __find_space_info(fs_info, flags);
-   ASSERT(space_info);
-   percpu_counter_add_batch(&space_info->total_bytes_pinned,
-  -head->num_bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH);
-
-   if (head->is_data) {
-   spin_lock(&delayed_refs->lock);
-   delayed_refs->pending_csums -= head->num_bytes;
-   spin_unlock(&delayed_refs->lock);
-   }
-   }
-
if (head->must_insert_reserved) {
btrfs_pin_extent(fs_info, head->bytenr,
 head->num_bytes, 1);
@@ -2530,9 +2540,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
}
}
 
-   /* Also free its reserved qgroup space */
-   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
- head->qgroup_reserved);
+   cleanup_ref_head_accounting(trans, head);
+
+   trace_run_delayed_ref_head(fs_info, head, 0);
btrfs_delayed_ref_unlock(head);
btrfs_put_delayed_ref_head(head);
return 0;
@@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (head->must_insert_reserved)
ret = 1;
 
+   cleanup_ref_head_accounting(trans, head);
mutex_unlock(&head->mutex);
btrfs_put_delayed_ref_head(head);
return ret;
-- 
2.14.3



[PATCH 05/42] btrfs: only count ref heads run in __btrfs_run_delayed_refs

2018-10-11 Thread Josef Bacik
We pick the number of refs to run based on the number of ref heads, and
only make the decision to stop once we've processed entire ref heads, so
only count the ref heads we've run and bail once we've hit the number of
ref heads we wanted to process.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 98f36dfeccb0..b32bd38390dd 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2592,6 +2592,7 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
spin_unlock(&delayed_refs->lock);
break;
}
+   count++;
 
/* grab the lock that says we are going to process
 * all the refs for this head */
@@ -2605,7 +2606,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
 */
if (ret == -EAGAIN) {
locked_ref = NULL;
-   count++;
continue;
}
}
@@ -2633,7 +2633,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
unselect_delayed_ref_head(delayed_refs, locked_ref);
locked_ref = NULL;
cond_resched();
-   count++;
continue;
}
 
@@ -2651,7 +2650,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
return ret;
}
locked_ref = NULL;
-   count++;
continue;
}
 
@@ -2702,7 +2700,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
}
 
btrfs_put_delayed_ref(ref);
-   count++;
cond_resched();
}
 
-- 
2.14.3



[PATCH 07/42] btrfs: check if free bgs for commit

2018-10-11 Thread Josef Bacik
may_commit_transaction will skip committing the transaction if we don't
have enough pinned space or if we're trying to find space for a SYSTEM
chunk.  However if we have pending free block groups in this transaction
we still want to commit as we may be able to allocate a chunk to make
our reservation.  So instead of just returning ENOSPC, check if we have
free block groups pending, and if so commit the transaction to allow us
to use that free space.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
---
 fs/btrfs/extent-tree.c | 33 +++--
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1213f573eea2..da73b3e5bc39 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4830,10 +4830,18 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
if (!bytes)
return 0;
 
-   /* See if there is enough pinned space to make this reservation */
-   if (__percpu_counter_compare(&fs_info->total_bytes_pinned,
-  bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
+   trans = btrfs_join_transaction(fs_info->extent_root);
+   if (IS_ERR(trans))
+   return -ENOSPC;
+
+   /*
+* See if there is enough pinned space to make this reservation, or if
+* we have bg's that are going to be freed, allowing us to possibly do a
+* chunk allocation the next loop through.
+*/
+   if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags) ||
+   __percpu_counter_compare(&fs_info->total_bytes_pinned, bytes,
+BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
goto commit;
 
/*
@@ -4841,7 +4849,7 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 * this reservation.
 */
if (space_info != delayed_rsv->space_info)
-   return -ENOSPC;
+   goto enospc;
 
spin_lock(&delayed_rsv->lock);
reclaim_bytes += delayed_rsv->reserved;
@@ -4855,17 +4863,14 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
bytes -= reclaim_bytes;
 
if (__percpu_counter_compare(&fs_info->total_bytes_pinned,
-  bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
-   return -ENOSPC;
-   }
-
+bytes,
+BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
+   goto enospc;
 commit:
-   trans = btrfs_join_transaction(fs_info->extent_root);
-   if (IS_ERR(trans))
-   return -ENOSPC;
-
return btrfs_commit_transaction(trans);
+enospc:
+   btrfs_end_transaction(trans);
+   return -ENOSPC;
 }
 
 /*
-- 
2.14.3



[PATCH 06/42] btrfs: introduce delayed_refs_rsv

2018-10-11 Thread Josef Bacik
From: Josef Bacik 

Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv.  This works most
of the time, except when it doesn't.  We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction.  Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.

So instead give them their own block_rsv.  This way we can always know
exactly how much outstanding space we need for delayed refs.  This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system.  Instead of doing math to see if its a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.

For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums.  We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h |  27 +++--
 fs/btrfs/delayed-ref.c   |  28 -
 fs/btrfs/disk-io.c   |   4 +
 fs/btrfs/extent-tree.c   | 279 +++
 fs/btrfs/inode.c |   2 +-
 fs/btrfs/transaction.c   |  77 ++--
 include/trace/events/btrfs.h |   2 +
 7 files changed, 312 insertions(+), 107 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 66f1d3895bca..1a2c3b629af2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -452,8 +452,9 @@ struct btrfs_space_info {
 #define BTRFS_BLOCK_RSV_TRANS   3
 #define BTRFS_BLOCK_RSV_CHUNK   4
 #define BTRFS_BLOCK_RSV_DELOPS  5
-#define BTRFS_BLOCK_RSV_EMPTY   6
-#define BTRFS_BLOCK_RSV_TEMP    7
+#define BTRFS_BLOCK_RSV_DELREFS 6
+#define BTRFS_BLOCK_RSV_EMPTY   7
+#define BTRFS_BLOCK_RSV_TEMP    8
 
 struct btrfs_block_rsv {
u64 size;
@@ -794,6 +795,8 @@ struct btrfs_fs_info {
struct btrfs_block_rsv chunk_block_rsv;
/* block reservation for delayed operations */
struct btrfs_block_rsv delayed_block_rsv;
+   /* block reservation for delayed refs */
+   struct btrfs_block_rsv delayed_refs_rsv;
 
struct btrfs_block_rsv empty_block_rsv;
 
@@ -2608,8 +2611,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct 
btrfs_fs_info *fs_info,
 
 int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans,
   struct btrfs_fs_info *fs_info);
-int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans,
-  struct btrfs_fs_info *fs_info);
+bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
 void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
 const u64 start);
 void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg);
@@ -2723,10 +2725,12 @@ enum btrfs_reserve_flush_enum {
 enum btrfs_flush_state {
FLUSH_DELAYED_ITEMS_NR  =   1,
FLUSH_DELAYED_ITEMS =   2,
-   FLUSH_DELALLOC  =   3,
-   FLUSH_DELALLOC_WAIT =   4,
-   ALLOC_CHUNK =   5,
-   COMMIT_TRANS=   6,
+   FLUSH_DELAYED_REFS_NR   =   3,
+   FLUSH_DELAYED_REFS  =   4,
+   FLUSH_DELALLOC  =   5,
+   FLUSH_DELALLOC_WAIT =   6,
+   ALLOC_CHUNK =   7,
+   COMMIT_TRANS=   8,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
@@ -2777,6 +2781,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info 
*fs_info,
 void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
 struct btrfs_block_rsv *block_rsv,
 u64 num_bytes);
+void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr);
+void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans);
+int btrfs_throttle_delayed_refs(struct btrfs_fs_info *fs_info,
+   enum btrfs_reserve_flush_enum flush);
+void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
+  struct btrfs_block_rsv *src,
+  u64 num_bytes);
 int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache);
 void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache);
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
diff --git 

[PATCH 00/42][v4] My current patch queue

2018-10-11 Thread Josef Bacik
v3->v4:
- added stacktraces to all the changelogs
- added the various reviewed-by's.
- fixed the loop in inode_rsv_refill to not use goto again;

v2->v3:
- reworked the truncate/evict throttling, we were still occasionally hitting
  enospc aborts in production in these paths because we were too aggressive with
  space usage.
- reworked the delayed iput stuff to be a little less racey and less deadlocky.
- Addressed the comments from Dave and Omar.
- A lot of production testing.

v1->v2:
- addressed all of the issues brought up.
- added more comments.
- split up some patches.

original message:

This is the current queue of things that I've been working on.  The main thing
these patches are doing is separating out the delayed refs reservations from the
global reserve into their own block rsv.  We have been consistently hitting
issues in production where we abort a transaction because we run out of the
global reserve either while running delayed refs or while updating dirty block
groups.  This is because the math around global reserves is made up bullshit
magic that has been tweaked more and more throughout the years.  The result is
something that is inconsistent across the board and sometimes wrong.  So instead
we need a way to know exactly how much space we need to keep around in order to
satisfy our outstanding delayed refs and our dirty block groups.

Since we don't know how many delayed refs we need at the start of any
modification we simply use the nr_items passed into btrfs_start_transaction() as
a guess for what we may need.  This has the side effect of putting more pressure
on the ENOSPC system, but it's pressure we can deal with more intelligently
because we always know how much space we have outstanding, instead of guessing
with weird global reserve math.

This works similar to every other reservation we have, we reserve the worst case
up front, and then at transaction end time we free up any space we didn't
actually use for delayed refs.

My performance tests show that we are bit faster now since we can do more
intelligent flushing and don't have to fall back on simply committing the
transaction in hopes that we have enough space for everything we need to do.

That leads me to the 2nd part of this pull, there's a bunch of fixes around
ENOSPC.  Because we are a bit faster now there were a bunch of things uncovered
in testing, but they seem to be all resolved now.

The final chunk of fixes are around transaction aborts.  There were a lot of
accounting bugs I was running into while running generic/435, so I fixed a bunch
of those up so now it runs cleanly.

I have been running these patches through xfstests on multiple machines for a
while, they are pretty solid and ready for wider testing and review.  Thanks,

Josef


[PATCH 01/42] btrfs: add btrfs_delete_ref_head helper

2018-10-11 Thread Josef Bacik
From: Josef Bacik 

We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
into a helper and clean up the calling functions.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
---
 fs/btrfs/delayed-ref.c | 14 ++
 fs/btrfs/delayed-ref.h |  3 ++-
 fs/btrfs/extent-tree.c | 22 +++---
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 62ff545ba1f7..3a9e4ac21794 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -393,6 +393,20 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans)
return head;
 }
 
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+  struct btrfs_delayed_ref_head *head)
+{
+   lockdep_assert_held(&delayed_refs->lock);
+   lockdep_assert_held(&head->lock);
+
+   rb_erase(&head->href_node, &delayed_refs->href_root);
+   RB_CLEAR_NODE(&head->href_node);
+   atomic_dec(&delayed_refs->num_entries);
+   delayed_refs->num_heads--;
+   if (head->processing == 0)
+   delayed_refs->num_heads_ready--;
+}
+
 /*
  * Helper to insert the ref_node to the tail or merge with tail.
  *
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index d9f2a4ebd5db..7769177b489e 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
btrfs_delayed_ref_head *head)
 {
mutex_unlock(&head->mutex);
 }
-
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+  struct btrfs_delayed_ref_head *head);
 
 struct btrfs_delayed_ref_head *
 btrfs_select_ref_head(struct btrfs_trans_handle *trans);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f77226d8020a..d24a0de4a2e7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2492,12 +2492,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
spin_unlock(&delayed_refs->lock);
return 1;
}
-   delayed_refs->num_heads--;
-   rb_erase(&head->href_node, &delayed_refs->href_root);
-   RB_CLEAR_NODE(&head->href_node);
+   btrfs_delete_ref_head(delayed_refs, head);
spin_unlock(&head->lock);
spin_unlock(&delayed_refs->lock);
-   atomic_dec(&delayed_refs->num_entries);
 
trace_run_delayed_ref_head(fs_info, head, 0);
 
@@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (!mutex_trylock(&head->mutex))
goto out;
 
-   /*
-* at this point we have a head with no other entries.  Go
-* ahead and process it.
-*/
-   rb_erase(&head->href_node, &delayed_refs->href_root);
-   RB_CLEAR_NODE(&head->href_node);
-   atomic_dec(&delayed_refs->num_entries);
-
-   /*
-* we don't take a ref on the node because we're removing it from the
-* tree, so we just steal the ref the tree was holding.
-*/
-   delayed_refs->num_heads--;
-   if (head->processing == 0)
-   delayed_refs->num_heads_ready--;
+   btrfs_delete_ref_head(delayed_refs, head);
head->processing = 0;
+
spin_unlock(&head->lock);
spin_unlock(&delayed_refs->lock);
 
-- 
2.14.3



Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups

2018-10-11 Thread Hans van Kranenburg
On 10/11/2018 05:13 PM, David Sterba wrote:
> On Thu, Oct 04, 2018 at 11:24:37PM +0200, Hans van Kranenburg wrote:
>> This patch set contains an additional fix for a newly exposed bug after
>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>
>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
>>
>> The DUP fix is "fix more DUP stripe size handling". I did that one
>> before starting to change more things so it can be applied to earlier
>> LTS kernels.
>>
>> Besides that patch, which is fixing the bug in a way that is least
>> intrusive, I added a bunch of other patches to help getting the chunk
>> allocator code in a state that is a bit less error-prone and
>> bug-attracting.
>>
>> When running this and trying the reproduction scenario, I can now see
>> that the created DUP device extent is 827326464 bytes long, which is
>> good.
>>
>> I wrote and tested this on top of linus 4.19-rc5. I still need to create
>> a list of related use cases and test more things to at least walk
>> through a bunch of obvious use cases to see if there's nothing exploding
>> too quickly with these changes. However, I'm quite confident about it,
>> so I'm sharing all of it already.
>>
>> Any feedback and review is appreciated. Be gentle and keep in mind that
>> I'm still very much in a learning stage regarding kernel development.
> 
> The patches look good, thanks. Problem is explained, preparatory work is
> separated from the fix itself.

\o/

>> The stable patches handling workflow is not 100% clear to me yet. I
>> guess I have to add a Fixes: in the DUP patch which points to the
>> previous commit 92e222df7b.
> 
> Almost nobody does it right, no worries. If you can identify a single
> patch that introduces a bug then it's for Fixes:, otherwise a CC: stable
> with version where it makes sense & applies is enough. I do that check
> myself regardless of what's in the patch.

It's 92e222df7b, and the thing I'm not sure about is whether it will also
catch the same patch where it was already backported to LTS kernels, since
92e222df7b also has Fixes in it... So by now the new bug is in 4.19,
4.14, 4.9, 4.4, 3.16...

> I ran the patches in a VM and hit a division-by-zero in test
> fstests/btrfs/011, stacktrace below. First guess is that it's caused by
> patch 3/6.

Ah interesting, dev replace.

I'll play around with replace and find out how to run the tests properly
and then reproduce this.

The code introduced in patch 3 is removed again in patch 6, so I don't
suspect that one. :)

But, I'll find out.

Thanks for testing.

Hans

> [ 3116.065595] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 
> 1 transid 5 /dev/vdb
> [ 3116.071274] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 
> 2 transid 5 /dev/vdc
> [ 3116.087086] BTRFS info (device vdb): disk space caching is enabled
> [ 3116.088644] BTRFS info (device vdb): has skinny extents
> [ 3116.089796] BTRFS info (device vdb): flagging fs with big metadata feature
> [ 3116.093971] BTRFS info (device vdb): checking UUID tree
> [ 3125.853755] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) 
> to /dev/vdd started
> [ 3125.860269] divide error:  [#1] PREEMPT SMP
> [ 3125.861264] CPU: 1 PID: 6477 Comm: btrfs Not tainted 4.19.0-rc7-default+ 
> #288
> [ 3125.862841] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.0.0-prebuilt.qemu-project.org 04/01/2014
> [ 3125.865385] RIP: 0010:__btrfs_alloc_chunk+0x368/0xa70 [btrfs]
> [ 3125.870541] RSP: 0018:a4ea0409fa48 EFLAGS: 00010206
> [ 3125.871862] RAX: 0400 RBX: 94e074374508 RCX: 
> 0002
> [ 3125.873587] RDX:  RSI: 94e017818c80 RDI: 
> 0200
> [ 3125.874677] RBP: 8080 R08:  R09: 
> 0002
> [ 3125.875816] R10: 0003 R11: 8090 R12: 
> 
> [ 3125.876742] R13: 0001 R14: 0001 R15: 
> 0002
> [ 3125.877657] FS:  7f6de34208c0() GS:94e07d60() 
> knlGS:
> [ 3125.878862] CS:  0010 DS:  ES:  CR0: 80050033
> [ 3125.880080] CR2: 7ffe963d5ce8 CR3: 7659b000 CR4: 
> 06e0
> [ 3125.881485] Call Trace:
> [ 3125.882105]  do_chunk_alloc+0x266/0x3e0 [btrfs]
> [ 3125.882841]  btrfs_inc_block_group_ro+0x10e/0x160 [btrfs]
> [ 3125.883875]  scrub_enumerate_chunks+0x18b/0x5d0 [btrfs]
> [ 3125.884658]  ? is_module_address+0x11/0x30
> [ 3125.885271]  ? wait_for_completion+0x160/0x190
> [ 3125.885928]  btrfs_scrub_dev+0x1b8/0x5a0 [btrfs]
> [ 3125.887767]  ? start_transaction+0xa1/0x470 [btrfs]
> [ 3125.888648]  btrfs_dev_replace_start.cold.19+0x155/0x17e [btrfs]
> [ 3125.889459]  btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
> [ 3125.890251]  btrfs_ioctl+0x2a94/0x31d0 [btrfs]
> [ 3125.890885]  ? do_sigaction+0x7c/0x210
> [ 3125.891731]  ? do_vfs_ioctl+0xa2/0x6b0
> [ 

Re: [PATCH 07/42] btrfs: check if free bgs for commit

2018-10-11 Thread Josef Bacik
On Thu, Oct 04, 2018 at 01:24:24PM +0200, David Sterba wrote:
> On Fri, Sep 28, 2018 at 07:17:46AM -0400, Josef Bacik wrote:
> > may_commit_transaction will skip committing the transaction if we don't
> > have enough pinned space or if we're trying to find space for a SYSTEM
> > chunk.  However if we have pending free block groups in this transaction
> > we still want to commit as we may be able to allocate a chunk to make
> > our reservation.  So instead of just returning ENOSPC, check if we have
> > free block groups pending, and if so commit the transaction to allow us
> > to use that free space.
> > 
> > Signed-off-by: Josef Bacik 
> > Reviewed-by: Omar Sandoval 
> > ---
> >  fs/btrfs/extent-tree.c | 33 +++--
> >  1 file changed, 19 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 1213f573eea2..da73b3e5bc39 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -4830,10 +4830,18 @@ static int may_commit_transaction(struct 
> > btrfs_fs_info *fs_info,
> > if (!bytes)
> > return 0;
> >  
> > -   /* See if there is enough pinned space to make this reservation */
> > -   if (__percpu_counter_compare(&fs_info->total_bytes_pinned,
> > -  bytes,
> > -  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
> > +   trans = btrfs_join_transaction(fs_info->extent_root);
> > +   if (IS_ERR(trans))
> > +   return -ENOSPC;
> 
> Why do you set override the error to ENOSPC, instead of passing
> PTR_ERR(trans)? There are more reasons why btrfs_join_transaction could
> fail: EROFS after error, EDQUOT when quotas say no, ENOMEM from
> ulist_add, that's just a sample I quickly found.

(argh I hit r instead of g, sending again)

Because may_commit_transaction is only called during flushing, we don't actually
care about the value, just that it failed.  Thanks,

Josef


Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups

2018-10-11 Thread David Sterba
On Thu, Oct 04, 2018 at 11:24:37PM +0200, Hans van Kranenburg wrote:
> This patch set contains an additional fix for a newly exposed bug after
> the previous attempt to fix a chunk allocator bug for new DUP chunks:
> 
> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
> 
> The DUP fix is "fix more DUP stripe size handling". I did that one
> before starting to change more things so it can be applied to earlier
> LTS kernels.
> 
> Besides that patch, which is fixing the bug in a way that is least
> intrusive, I added a bunch of other patches to help getting the chunk
> allocator code in a state that is a bit less error-prone and
> bug-attracting.
> 
> When running this and trying the reproduction scenario, I can now see
> that the created DUP device extent is 827326464 bytes long, which is
> good.
> 
> I wrote and tested this on top of linus 4.19-rc5. I still need to create
> a list of related use cases and test more things to at least walk
> through a bunch of obvious use cases to see if there's nothing exploding
> too quickly with these changes. However, I'm quite confident about it,
> so I'm sharing all of it already.
> 
> Any feedback and review is appreciated. Be gentle and keep in mind that
> I'm still very much in a learning stage regarding kernel development.

The patches look good, thanks. Problem is explained, preparatory work is
separated from the fix itself.

> The stable patches handling workflow is not 100% clear to me yet. I
> guess I have to add a Fixes: in the DUP patch which points to the
> previous commit 92e222df7b.

Almost nobody does it right, no worries. If you can identify a single
patch that introduces a bug then it's for Fixes:, otherwise a CC: stable
with version where it makes sense & applies is enough. I do that check
myself regardless of what's in the patch.

I ran the patches in a VM and hit a division-by-zero in test
fstests/btrfs/011, stacktrace below. First guess is that it's caused by
patch 3/6.

[ 3116.065595] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 1 
transid 5 /dev/vdb
[ 3116.071274] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 2 
transid 5 /dev/vdc
[ 3116.087086] BTRFS info (device vdb): disk space caching is enabled
[ 3116.088644] BTRFS info (device vdb): has skinny extents
[ 3116.089796] BTRFS info (device vdb): flagging fs with big metadata feature
[ 3116.093971] BTRFS info (device vdb): checking UUID tree
[ 3125.853755] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to 
/dev/vdd started
[ 3125.860269] divide error:  [#1] PREEMPT SMP
[ 3125.861264] CPU: 1 PID: 6477 Comm: btrfs Not tainted 4.19.0-rc7-default+ #288
[ 3125.862841] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.0.0-prebuilt.qemu-project.org 04/01/2014
[ 3125.865385] RIP: 0010:__btrfs_alloc_chunk+0x368/0xa70 [btrfs]
[ 3125.870541] RSP: 0018:a4ea0409fa48 EFLAGS: 00010206
[ 3125.871862] RAX: 0400 RBX: 94e074374508 RCX: 0002
[ 3125.873587] RDX:  RSI: 94e017818c80 RDI: 0200
[ 3125.874677] RBP: 8080 R08:  R09: 0002
[ 3125.875816] R10: 0003 R11: 8090 R12: 
[ 3125.876742] R13: 0001 R14: 0001 R15: 0002
[ 3125.877657] FS:  7f6de34208c0() GS:94e07d60() 
knlGS:
[ 3125.878862] CS:  0010 DS:  ES:  CR0: 80050033
[ 3125.880080] CR2: 7ffe963d5ce8 CR3: 7659b000 CR4: 06e0
[ 3125.881485] Call Trace:
[ 3125.882105]  do_chunk_alloc+0x266/0x3e0 [btrfs]
[ 3125.882841]  btrfs_inc_block_group_ro+0x10e/0x160 [btrfs]
[ 3125.883875]  scrub_enumerate_chunks+0x18b/0x5d0 [btrfs]
[ 3125.884658]  ? is_module_address+0x11/0x30
[ 3125.885271]  ? wait_for_completion+0x160/0x190
[ 3125.885928]  btrfs_scrub_dev+0x1b8/0x5a0 [btrfs]
[ 3125.887767]  ? start_transaction+0xa1/0x470 [btrfs]
[ 3125.888648]  btrfs_dev_replace_start.cold.19+0x155/0x17e [btrfs]
[ 3125.889459]  btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
[ 3125.890251]  btrfs_ioctl+0x2a94/0x31d0 [btrfs]
[ 3125.890885]  ? do_sigaction+0x7c/0x210
[ 3125.891731]  ? do_vfs_ioctl+0xa2/0x6b0
[ 3125.892652]  do_vfs_ioctl+0xa2/0x6b0
[ 3125.893642]  ? do_sigaction+0x1a7/0x210
[ 3125.894665]  ksys_ioctl+0x3a/0x70
[ 3125.895523]  __x64_sys_ioctl+0x16/0x20
[ 3125.896339]  do_syscall_64+0x5a/0x1a0
[ 3125.896949]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3125.897638] RIP: 0033:0x7f6de28ecaa7
[ 3125.901313] RSP: 002b:7ffe963da9c8 EFLAGS: 0246 ORIG_RAX: 
0010
[ 3125.902486] RAX: ffda RBX: 0003 RCX: 7f6de28ecaa7
[ 3125.903538] RDX: 7ffe963dae00 RSI: ca289435 RDI: 0003
[ 3125.904878] RBP:  R08:  R09: 
[ 3125.905788] R10: 0008 R11: 0246 R12: 7ffe963de26f

[PATCH 0/6] FSID change kernel support

2018-10-11 Thread Nikolay Borisov
Hello, 

Here is the second posting of the fsid change support for the kernel. For 
background information you can refer to v1 [0]. The main changes in this version
are around the handling of possible split-brain scenarios. I've changed a bit
how the userspace code works: the process is now split across two transactions.
The first one sets a "we are about to change fsid" flag, and once it's persisted
on all disks, a second one does the actual change. This of course is not enough
to guarantee full consistency, so I had to extend the device scanning to
gracefully handle such cases. I believe I have covered everything, but more
review will be appreciated.

So patch 1 implements the basic functionality with no split-brain handling 
whatsoever. Patch 2 is a minor cleanup. Patch 3 deals with a split-brain that 
can occur if power loss happens during the initial transaction (the one setting
the beginning flag). Patch 4 adds some information that is needed in the last 2 
patches. Patch 5 handles failure between transaction 1 and transaction 2 and 
finally patch 6 handles the case of power loss during transaction 1 but for an 
fs which has already undergone at least one successful fsid change. More 
details about the exact failure modes are in each respective patch.

One thing which could be improved but I ran out of ideas is the naming of the 
ancillary functions - find_fsid_inprogress and find_fsid_changed. 


I've actually tested the split-brain handling code with specially crafted
images. They will be part of the user-space submissions, and I believe I have
full coverage for that.

[0] 
https://lore.kernel.org/linux-btrfs/1535531774-29830-1-git-send-email-nbori...@suse.com/

Nikolay Borisov (6):
  btrfs: Introduce support for FSID change without metadata rewrite
  btrfs: Remove fsid/metadata_fsid fields from btrfs_info
  btrfs: Add handling for disk split-brain scenario during fsid change
  btrfs: Introduce 2 more members to struct btrfs_fs_devices
  btrfs: Handle one more split-brain scenario during fsid change
  btrfs: Handle final split-brain possibility during fsid change

 fs/btrfs/check-integrity.c  |   2 +-
 fs/btrfs/ctree.c|   5 +-
 fs/btrfs/ctree.h|  10 +-
 fs/btrfs/disk-io.c  |  53 ---
 fs/btrfs/extent-tree.c  |   2 +-
 fs/btrfs/ioctl.c|   2 +-
 fs/btrfs/super.c|   2 +-
 fs/btrfs/volumes.c  | 196 
 fs/btrfs/volumes.h  |   6 ++
 include/trace/events/btrfs.h|   2 +-
 include/uapi/linux/btrfs.h  |   1 +
 include/uapi/linux/btrfs_tree.h |   1 +
 12 files changed, 241 insertions(+), 41 deletions(-)

-- 
2.7.4



[PATCH 0/8] FSID change userspace v2

2018-10-11 Thread Nikolay Borisov
Here is the second posting of the FSID change support for user space. For 
background information refer to the initial posting [0]. The major changes
in this version are: 
 - Modified the sequence of operations when changing the fsid. Now it's split
 among 2 transactions with the first one setting a flag (similarly to what 
 the old fsid change code does) and the second transaction applying the new 
 FSID and incompat flag

 - Expanded the test coverage with several crafted images which simulate 
 failure scenarios that could occur while fsid change is in progress. 

 - Also added the last 2 clean up patches which can be merged independently of
 the fsid changes.



[0] 
https://lore.kernel.org/linux-btrfs/1535531754-29774-1-git-send-email-nbori...@suse.com/
Nikolay Borisov (8):
  btrfstune: Remove fs_info arg from change_device_uuid
  btrfstune: Rename change_header_uuid to change_buffer_header_uuid
  btrfs-progs: Add support for metadata_uuid field.
  btrfstune: Add support for changing the user uuid
  btrfs-progs: tests: Add tests for changing fsid feature
  btrfs-progs: Remove fsid/metdata_uuid fields from fs_info
  btrfs-progs: Remove btrfs_fs_info::new_fsid
  btrfs-progs: Directly pass root to change_devices_uuid

 btrfstune.c | 237 ++--
 check/main.c|   2 +-
 chunk-recover.c |  17 +-
 cmds-filesystem.c   |   2 +
 cmds-inspect-dump-super.c   |  22 ++-
 convert/common.c|   2 +
 ctree.c |  15 +-
 ctree.h |  10 +-
 disk-io.c   |  62 +--
 image/main.c|  25 ++-
 tests/misc-tests/033-metadata-uuid/disk1.raw.xz | Bin 0 -> 78336 bytes
 tests/misc-tests/033-metadata-uuid/disk2.raw.xz | Bin 0 -> 77664 bytes
 tests/misc-tests/033-metadata-uuid/disk3.raw.xz | Bin 0 -> 78328 bytes
 tests/misc-tests/033-metadata-uuid/disk4.raw.xz | Bin 0 -> 77592 bytes
 tests/misc-tests/033-metadata-uuid/disk5.raw.xz | Bin 0 -> 78348 bytes
 tests/misc-tests/033-metadata-uuid/disk6.raw.xz | Bin 0 -> 77552 bytes
 tests/misc-tests/033-metadata-uuid/test.sh  | 225 ++
 volumes.c   |  37 +++-
 volumes.h   |   1 +
 19 files changed, 557 insertions(+), 100 deletions(-)
 create mode 100644 tests/misc-tests/033-metadata-uuid/disk1.raw.xz
 create mode 100644 tests/misc-tests/033-metadata-uuid/disk2.raw.xz
 create mode 100644 tests/misc-tests/033-metadata-uuid/disk3.raw.xz
 create mode 100644 tests/misc-tests/033-metadata-uuid/disk4.raw.xz
 create mode 100644 tests/misc-tests/033-metadata-uuid/disk5.raw.xz
 create mode 100644 tests/misc-tests/033-metadata-uuid/disk6.raw.xz
 create mode 100755 tests/misc-tests/033-metadata-uuid/test.sh

-- 
2.7.4



[PATCH 1/8] btrfstune: Remove fs_info arg from change_device_uuid

2018-10-11 Thread Nikolay Borisov
This function already takes an extent buffer that contains a reference
to the fs_info. Use that and reduce argument count. No functional
changes.

Signed-off-by: Nikolay Borisov 
---
 btrfstune.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/btrfstune.c b/btrfstune.c
index 889b931c55d8..723317a00f3a 100644
--- a/btrfstune.c
+++ b/btrfstune.c
@@ -179,10 +179,10 @@ static int change_extents_uuid(struct btrfs_fs_info 
*fs_info)
return ret;
 }
 
-static int change_device_uuid(struct btrfs_fs_info *fs_info, struct 
extent_buffer *eb,
- int slot)
+static int change_device_uuid(struct extent_buffer *eb, int slot)
 {
struct btrfs_dev_item *di;
+   struct btrfs_fs_info *fs_info = eb->fs_info;
int ret = 0;
 
di = btrfs_item_ptr(eb, slot, struct btrfs_dev_item);
@@ -217,7 +217,7 @@ static int change_devices_uuid(struct btrfs_fs_info 
*fs_info)
if (key.type != BTRFS_DEV_ITEM_KEY ||
key.objectid != BTRFS_DEV_ITEMS_OBJECTID)
goto next;
-   ret = change_device_uuid(fs_info, path.nodes[0], path.slots[0]);
+   ret = change_device_uuid(path.nodes[0], path.slots[0]);
if (ret < 0)
goto out;
 next:
-- 
2.7.4



[PATCH 5/6] btrfs: Handle one more split-brain scenario during fsid change

2018-10-11 Thread Nikolay Borisov
This commit continues hardening the scanning code to handle cases where
power loss could have caused disks in a multi-disk filesystem to be
in an inconsistent state. Namely, handle the situation that can occur when
some of the disks in a multi-disk fs have completed their fsid change, i.e.
they have the METADATA_UUID incompat flag set, have cleared the
FSID_CHANGING_V2 flag, and their fsid/metadata_uuid are different. At
the same time the other half of the disks will have their
fsid/metadata_uuid unchanged and will only have the FSID_CHANGING_V2 flag.

This is handled by adding additional code in the scan path which:

 a) In case a device with the FSID_CHANGING_V2 flag is scanned first and
 a btrfs_fs_devices is created with matching fsid/metadata_uuid, then when
 a device with completed fsid change is scanned it will detect this
 via the new code in find_fsid, i.e. that such an fs_devices exists whose
 fsid_change flag is set to true, whose metadata_uuid/fsid match, and
 whose metadata_uuid matches that of the scanned device.
 In this case, it's important to note that the device which has its
 fsid change completed will have a higher generation number than the
 device with the FSID_CHANGING_V2 flag set, so its superblock will
 be used during mount. To prevent an assertion triggering because
 the sb used for mounting will have differing fsid/metadata_uuid than
 the ones in the fs_devices struct, also add code in device_list_add
 which overwrites the values in fs_devices.

 b) Alternatively we can end up with a device that completed its
 fsid change to be scanned first which will create the respective
 btrfs_fs_devices struct with differing fsid/metadata_uuid. In this
 case when a device with FSID_CHANGING_V2 flag set is scanned it will
 call the newly added find_fsid_inprogress function which will return
 the correct fs_devices.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/volumes.c | 78 +++---
 1 file changed, 74 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c2b66d15e08d..2c9879a81884 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -382,6 +382,26 @@ find_fsid(const u8 *fsid, const u8 *metadata_fsid)
 
ASSERT(fsid);
 
+   if (metadata_fsid) {
+
+   /*
+* Handle scanned device having completed its fsid change but
+* belonging to a fs_devices that was created by first scanning
+* a device which didn't have its fsid/metadata_uuid changed
+* at all and the CHANGING_FSID flag set. 4/a
+*/
+   list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
+   if (fs_devices->fsid_change &&
+   memcmp(metadata_fsid, fs_devices->fsid,
+  BTRFS_FSID_SIZE) == 0 &&
+   memcmp(fs_devices->fsid, fs_devices->metadata_uuid,
+  BTRFS_FSID_SIZE) == 0) {
+   return fs_devices;
+   }
+   }
+   }
+
+   /* Handle non-split brain cases */
list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
if (metadata_fsid) {
if (memcmp(fsid, fs_devices->fsid, BTRFS_FSID_SIZE) == 0
@@ -768,6 +788,27 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
*fs_devices,
 }
 
 /*
+ * Handle scanned device having its FSID_CHANGING flag set and the fs_devices
+ * being created with a disk that has already completed its fsid change.
+ */
+static struct btrfs_fs_devices *find_fsid_inprogress(
+   struct btrfs_super_block *disk_super)
+{
+   struct btrfs_fs_devices *fs_devices;
+
+   list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
+   if (memcmp(fs_devices->metadata_uuid, fs_devices->fsid,
+  BTRFS_FSID_SIZE) != 0 &&
+   memcmp(fs_devices->metadata_uuid, disk_super->fsid,
+  BTRFS_FSID_SIZE) == 0 && !fs_devices->fsid_change) {
+   return fs_devices;
+   }
+   }
+
+   return NULL;
+}
+
+/*
  * Add new device to list of registered devices
  *
  * Returns:
@@ -779,7 +820,7 @@ static noinline struct btrfs_device *device_list_add(const 
char *path,
   bool *new_device_added)
 {
struct btrfs_device *device;
-   struct btrfs_fs_devices *fs_devices;
+   struct btrfs_fs_devices *fs_devices = NULL;
struct rcu_string *name;
u64 found_transid = btrfs_super_generation(disk_super);
u64 devid = btrfs_stack_device_id(&disk_super->dev_item);
@@ -788,10 +829,24 @@ static noinline struct btrfs_device 
*device_list_add(const char *path,
bool fsid_change_in_progress = (btrfs_super_flags(disk_super) &
BTRFS_SUPER_FLAG_CHANGING_FSID_v2);
 
-   if 

[PATCH 2/8] btrfstune: Rename change_header_uuid to change_buffer_header_uuid

2018-10-11 Thread Nikolay Borisov
Rename the function so its name falls in line with the rest of the btrfs
nomenclature: when we change the uuid of the header, it is in fact the
uuid of the buffer header. Also remove the root argument, since it's used
solely to take a reference to fs_info, which can be done via the
mandatory eb argument. No functional changes.

Signed-off-by: Nikolay Borisov 
---
 btrfstune.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/btrfstune.c b/btrfstune.c
index 723317a00f3a..3745a00c46ae 100644
--- a/btrfstune.c
+++ b/btrfstune.c
@@ -91,9 +91,9 @@ static int set_super_incompat_flags(struct btrfs_root *root, 
u64 flags)
return ret;
 }
 
-static int change_header_uuid(struct btrfs_root *root, struct extent_buffer 
*eb)
+static int change_buffer_header_uuid(struct extent_buffer *eb)
 {
-   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_fs_info *fs_info = eb->fs_info;
int same_fsid = 1;
int same_chunk_tree_uuid = 1;
int ret;
@@ -157,7 +157,7 @@ static int change_extents_uuid(struct btrfs_fs_info 
*fs_info)
ret = PTR_ERR(eb);
goto out;
}
-   ret = change_header_uuid(root, eb);
+   ret = change_buffer_header_uuid(eb);
free_extent_buffer(eb);
if (ret < 0) {
error("failed to change uuid of tree block: %llu",
-- 
2.7.4



[PATCH 6/8] btrfs-progs: Remove fsid/metdata_uuid fields from fs_info

2018-10-11 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov 
---
 btrfstune.c |  5 +++--
 check/main.c|  2 +-
 chunk-recover.c | 11 +--
 ctree.c | 11 ++-
 ctree.h |  2 --
 disk-io.c   | 18 +-
 volumes.c   |  5 +++--
 7 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/btrfstune.c b/btrfstune.c
index 29610476cd99..c1e4fa8067e8 100644
--- a/btrfstune.c
+++ b/btrfstune.c
@@ -360,7 +360,8 @@ static int change_fsid_prepare(struct btrfs_fs_info 
*fs_info)
return ret;
 
/* Also need to change the metadatauuid of the fs info */
-   memcpy(fs_info->metadata_uuid, fs_info->new_fsid, BTRFS_FSID_SIZE);
+   memcpy(fs_info->fs_devices->metadata_uuid, fs_info->new_fsid,
+  BTRFS_FSID_SIZE);
 
/* also restore new chunk_tree_id into tree_root for restore */
write_extent_buffer(tree_root->node, fs_info->new_chunk_tree_uuid,
@@ -417,7 +418,7 @@ static int change_uuid(struct btrfs_fs_info *fs_info, const 
char *new_fsid_str)
fs_info->new_fsid = new_fsid;
fs_info->new_chunk_tree_uuid = new_chunk_id;
 
-   memcpy(old_fsid, (const char*)fs_info->fsid, BTRFS_UUID_SIZE);
+   memcpy(old_fsid, (const char*)fs_info->fs_devices->fsid, 
BTRFS_UUID_SIZE);
uuid_unparse(old_fsid, uuid_buf);
printf("Current fsid: %s\n", uuid_buf);
 
diff --git a/check/main.c b/check/main.c
index 0790264190f2..0aede2742dcf 100644
--- a/check/main.c
+++ b/check/main.c
@@ -8418,7 +8418,7 @@ static int btrfs_fsck_reinit_root(struct 
btrfs_trans_handle *trans,
btrfs_set_header_backref_rev(c, BTRFS_MIXED_BACKREF_REV);
btrfs_set_header_owner(c, root->root_key.objectid);
 
-   write_extent_buffer(c, root->fs_info->metadata_uuid,
+   write_extent_buffer(c, root->fs_info->fs_devices->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE);
 
write_extent_buffer(c, root->fs_info->chunk_tree_uuid,
diff --git a/chunk-recover.c b/chunk-recover.c
index 31325bfc54e0..959c169f79a4 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -1155,7 +1155,7 @@ static int __rebuild_chunk_root(struct btrfs_trans_handle 
*trans,
btrfs_set_header_level(cow, 0);
btrfs_set_header_backref_rev(cow, BTRFS_MIXED_BACKREF_REV);
btrfs_set_header_owner(cow, BTRFS_CHUNK_TREE_OBJECTID);
-   write_extent_buffer(cow, root->fs_info->metadata_uuid,
+   write_extent_buffer(cow, root->fs_info->fs_devices->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE);
 
write_extent_buffer(cow, root->fs_info->chunk_tree_uuid,
@@ -1457,7 +1457,7 @@ open_ctree_with_broken_chunk(struct recover_control *rc)
goto out_devices;
}
 
-   memcpy(fs_info->fsid, &disk_super->fsid, BTRFS_FSID_SIZE);
+   ASSERT(!memcmp(disk_super->fsid, rc->fs_devices->fsid, 
BTRFS_FSID_SIZE));
fs_info->sectorsize = btrfs_super_sectorsize(disk_super);
fs_info->nodesize = btrfs_super_nodesize(disk_super);
fs_info->stripesize = btrfs_super_stripesize(disk_super);
@@ -1469,10 +1469,9 @@ open_ctree_with_broken_chunk(struct recover_control *rc)
features = btrfs_super_incompat_flags(disk_super);
 
if (features & BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
-   memcpy(fs_info->metadata_uuid, disk_super->metadata_uuid,
-  BTRFS_FSID_SIZE);
-   else
-   memcpy(fs_info->metadata_uuid, fs_info->fsid, BTRFS_FSID_SIZE);
+   ASSERT(!memcmp(disk_super->metadata_uuid,
+  fs_info->fs_devices->metadata_uuid,
+  BTRFS_FSID_SIZE));
 
btrfs_setup_root(fs_info->chunk_root, fs_info,
 BTRFS_CHUNK_TREE_OBJECTID);
diff --git a/ctree.c b/ctree.c
index 883c2ae1861d..c79837e0206f 100644
--- a/ctree.c
+++ b/ctree.c
@@ -23,6 +23,7 @@
 #include "internal.h"
 #include "sizes.h"
 #include "messages.h"
+#include "volumes.h"
 
 static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
  *root, struct btrfs_path *path, int level);
@@ -134,7 +135,7 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, new_root_objectid);
 
-   write_extent_buffer(cow, root->fs_info->metadata_uuid,
+   write_extent_buffer(cow, root->fs_info->fs_devices->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE);
 
WARN_ON(btrfs_header_generation(buf) > trans->transid);
@@ -308,7 +309,7 @@ int __btrfs_cow_block(struct btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, root->root_key.objectid);
 
-   write_extent_buffer(cow, root->fs_info->metadata_uuid,
+   write_extent_buffer(cow, root->fs_info->fs_devices->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE);
 
WARN_ON(!(buf->flags & EXTENT_BAD_TRANSID) &&
@@ -1474,7 

[PATCH 4/8] btrfstune: Add support for changing the user uuid

2018-10-11 Thread Nikolay Borisov
This allows us to change the user-visible UUID on filesystems from
userspace if desired, by copying the existing UUID to the new location
for metadata comparisons. If this is done, an incompat flag must be
set to prevent older kernels from mounting the filesystem, but
the original UUID can be restored, and the incompat flag removed.

This introduces the new -m|-M UUID options, similar to the current
-u|-U UUID ones, with the difference that we don't rewrite the fsid but
just copy the old uuid and set a new one. Additionally, running with
[-M old-uuid] clears the incompat flag and retains only the fsid on-disk.

Additionally, it's not allowed to intermix the -m/-u/-U/-M options in a
single invocation of btrfstune, nor is it allowed to change the uuid
while a uuid rewrite is in progress. Changing the uuid of a seed device
is also not currently allowed (this may change in the future).

Example:

btrfstune -m /dev/loop1
btrfs inspect-internal dump-super /dev/loop1

superblock: bytenr=65536, device=/dev/loop1
-
csum_type   0 (crc32c)
csum_size   4
csum0x4b7ea749 [match]



fsid0efc41d3-4451-49f3-8108-7b8bdbcf5ae8
metadata_uuid   352715e7-62cf-4ae0-92ee-85a574adc318



incompat_flags  0x541
( MIXED_BACKREF |
  EXTENDED_IREF |
  SKINNY_METADATA |
  METADATA_UUID )



dev_item.uuid   0610deee-dfc3-498b-9449-a06533cdec98
dev_item.fsid   352715e7-62cf-4ae0-92ee-85a574adc318 [match]



mount /dev/loop1 btrfs-mnt/
btrfs fi show btrfs-mnt/

Label: none  uuid: 0efc41d3-4451-49f3-8108-7b8bdbcf5ae8
Total devices 1 FS bytes used 128.00KiB
devid1 size 5.00GiB used 536.00MiB path /dev/loop1

In this case a new btrfs filesystem was created and the original uuid
was 352715e7-62cf-4ae0-92ee-85a574adc318; then btrfstune was run, which
copied that value over to the metadata_uuid field and set the current
fsid to 0efc41d3-4451-49f3-8108-7b8bdbcf5ae8. As far as userspace is
concerned, this is the fsid of the fs.

Signed-off-by: Nikolay Borisov 
---
 btrfstune.c | 184 +++-
 ctree.h |   1 +
 2 files changed, 160 insertions(+), 25 deletions(-)

diff --git a/btrfstune.c b/btrfstune.c
index 62a075b2defc..29610476cd99 100644
--- a/btrfstune.c
+++ b/btrfstune.c
@@ -73,6 +73,117 @@ static int update_seeding_flag(struct btrfs_root *root, int 
set_flag)
return ret;
 }
 
+/*
+ * Return 0 for no unfinished fsid change.
+ * Return >0 for an unfinished fsid change, and store the unfinished fsid/
+ * chunk_tree_id into fsid_ret/chunk_id_ret.
+ */
+static int check_unfinished_fsid_change(struct btrfs_fs_info *fs_info,
+   uuid_t fsid_ret, uuid_t chunk_id_ret)
+{
+   struct btrfs_root *tree_root = fs_info->tree_root;
+   u64 flags = btrfs_super_flags(fs_info->super_copy);
+
+   if (flags & (BTRFS_SUPER_FLAG_CHANGING_FSID |
+BTRFS_SUPER_FLAG_CHANGING_FSID_V2)) {
+   memcpy(fsid_ret, fs_info->super_copy->fsid, BTRFS_FSID_SIZE);
+   read_extent_buffer(tree_root->node, chunk_id_ret,
+   btrfs_header_chunk_tree_uuid(tree_root->node),
+   BTRFS_UUID_SIZE);
+   return 1;
+   }
+   return 0;
+}
+
+static int set_metadata_uuid(struct btrfs_root *root, const char *uuid_string)
+{
+   struct btrfs_super_block *disk_super;
+   uuid_t new_fsid, unused1, unused2;
+   struct btrfs_trans_handle *trans;
+   bool new_uuid = true;
+   u64 incompat_flags;
+   bool uuid_changed;
+   u64 super_flags;
+   int ret;
+
+   disk_super = root->fs_info->super_copy;
+   super_flags = btrfs_super_flags(disk_super);
+   incompat_flags = btrfs_super_incompat_flags(disk_super);
+   uuid_changed = incompat_flags & BTRFS_FEATURE_INCOMPAT_METADATA_UUID;
+
+   if (super_flags & BTRFS_SUPER_FLAG_SEEDING) {
+   fprintf(stderr, "Cannot set metadata UUID on a seed device\n");
+   return 1;
+   }
+
+   if (check_unfinished_fsid_change(root->fs_info, unused1, unused2)) {
+   fprintf(stderr, "UUID rewrite in progress, cannot change "
+   "fsid\n");
+   return 1;
+   }
+
+   if (uuid_string)
+   uuid_parse(uuid_string, new_fsid);
+   else
+   uuid_generate(new_fsid);
+
+   new_uuid = (memcmp(new_fsid, disk_super->fsid, BTRFS_FSID_SIZE) != 0);
+
+   /* Step 1: set the in-progress flag */
+   trans = btrfs_start_transaction(root, 1);
+   super_flags |= BTRFS_SUPER_FLAG_CHANGING_FSID_V2;
+   btrfs_set_super_flags(disk_super, super_flags);
+   ret = btrfs_commit_transaction(trans, root);
+   if (ret < 0)
+   return ret;
+

[PATCH 8/8] btrfs-progs: Directly pass root to change_devices_uuid

2018-10-11 Thread Nikolay Borisov
This function currently takes an fs_info only to reference the chunk
root. Just pass the root directly. No functional changes.

Signed-off-by: Nikolay Borisov 
---
 btrfstune.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/btrfstune.c b/btrfstune.c
index d093d26f288c..81af11e9be60 100644
--- a/btrfstune.c
+++ b/btrfstune.c
@@ -310,9 +310,8 @@ static int change_device_uuid(struct extent_buffer *eb, int 
slot,
return ret;
 }
 
-static int change_devices_uuid(struct btrfs_fs_info *fs_info, uuid_t new_fsid)
+static int change_devices_uuid(struct btrfs_root *root, uuid_t new_fsid)
 {
-   struct btrfs_root *root = fs_info->chunk_root;
struct btrfs_path path;
struct btrfs_key key = {0, 0, 0};
int ret = 0;
@@ -439,7 +438,7 @@ static int change_uuid(struct btrfs_fs_info *fs_info, const 
char *new_fsid_str)
 
/* Then devices */
printf("Change fsid on devices\n");
-   ret = change_devices_uuid(fs_info, new_fsid);
+   ret = change_devices_uuid(fs_info->chunk_root, new_fsid);
if (ret < 0) {
error("failed to change UUID of devices: %d", ret);
goto out;
-- 
2.7.4



[PATCH 3/8] btrfs-progs: Add support for metadata_uuid field.

2018-10-11 Thread Nikolay Borisov
Add support for a new metadata_uuid field. This is just a preparatory
commit which switches all users of the fsid field for metadata comparison
purposes to utilize the new field. This more or less mirrors the
kernel patch, additionally:

 * Update 'btrfs inspect-internal dump-super' to account for the new
 field. This involves introducing the 'metadata_uuid' line to the
 output and updating the logic for comparing the fs uuid to the
 dev_item uuid.

Signed-off-by: Nikolay Borisov 
---
 btrfstune.c   |  3 +++
 check/main.c  |  2 +-
 chunk-recover.c   | 16 ++---
 cmds-filesystem.c |  2 ++
 cmds-inspect-dump-super.c | 22 ++---
 convert/common.c  |  2 ++
 ctree.c   | 14 +--
 ctree.h   |  8 +--
 disk-io.c | 60 ++-
 image/main.c  | 25 +---
 volumes.c | 34 +--
 volumes.h |  1 +
 12 files changed, 146 insertions(+), 43 deletions(-)

diff --git a/btrfstune.c b/btrfstune.c
index 3745a00c46ae..62a075b2defc 100644
--- a/btrfstune.c
+++ b/btrfstune.c
@@ -248,6 +248,9 @@ static int change_fsid_prepare(struct btrfs_fs_info 
*fs_info)
if (ret < 0)
return ret;
 
+   /* Also need to change the metadata_uuid of the fs info */
+   memcpy(fs_info->metadata_uuid, fs_info->new_fsid, BTRFS_FSID_SIZE);
+
/* also restore new chunk_tree_id into tree_root for restore */
write_extent_buffer(tree_root->node, fs_info->new_chunk_tree_uuid,
btrfs_header_chunk_tree_uuid(tree_root->node),
diff --git a/check/main.c b/check/main.c
index bc2ee22f7943..0790264190f2 100644
--- a/check/main.c
+++ b/check/main.c
@@ -8418,7 +8418,7 @@ static int btrfs_fsck_reinit_root(struct 
btrfs_trans_handle *trans,
btrfs_set_header_backref_rev(c, BTRFS_MIXED_BACKREF_REV);
btrfs_set_header_owner(c, root->root_key.objectid);
 
-   write_extent_buffer(c, root->fs_info->fsid,
+   write_extent_buffer(c, root->fs_info->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE);
 
write_extent_buffer(c, root->fs_info->chunk_tree_uuid,
diff --git a/chunk-recover.c b/chunk-recover.c
index 1d30db51d8ed..31325bfc54e0 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -759,7 +759,7 @@ static int scan_one_device(void *dev_scan_struct)
rc->nodesize)
break;
 
-   if (memcmp_extent_buffer(buf, rc->fs_devices->fsid,
+   if (memcmp_extent_buffer(buf, rc->fs_devices->metadata_uuid,
 btrfs_header_fsid(),
 BTRFS_FSID_SIZE)) {
bytenr += rc->sectorsize;
@@ -1155,7 +1155,7 @@ static int __rebuild_chunk_root(struct btrfs_trans_handle 
*trans,
btrfs_set_header_level(cow, 0);
btrfs_set_header_backref_rev(cow, BTRFS_MIXED_BACKREF_REV);
btrfs_set_header_owner(cow, BTRFS_CHUNK_TREE_OBJECTID);
-   write_extent_buffer(cow, root->fs_info->fsid,
+   write_extent_buffer(cow, root->fs_info->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE);
 
write_extent_buffer(cow, root->fs_info->chunk_tree_uuid,
@@ -1192,7 +1192,8 @@ static int __rebuild_device_items(struct 
btrfs_trans_handle *trans,
btrfs_set_stack_device_io_width(dev_item, dev->io_width);
btrfs_set_stack_device_sector_size(dev_item, dev->sector_size);
memcpy(dev_item->uuid, dev->uuid, BTRFS_UUID_SIZE);
-   memcpy(dev_item->fsid, dev->fs_devices->fsid, BTRFS_UUID_SIZE);
+   memcpy(dev_item->fsid, dev->fs_devices->metadata_uuid,
+  BTRFS_FSID_SIZE);
 
ret = btrfs_insert_item(trans, root, ,
dev_item, sizeof(*dev_item));
@@ -1432,6 +1433,7 @@ open_ctree_with_broken_chunk(struct recover_control *rc)
struct btrfs_fs_info *fs_info;
struct btrfs_super_block *disk_super;
struct extent_buffer *eb;
+   u64 features;
int ret;
 
fs_info = btrfs_new_fs_info(1, BTRFS_SUPER_INFO_OFFSET);
@@ -1464,6 +1466,14 @@ open_ctree_with_broken_chunk(struct recover_control *rc)
if (ret)
goto out_devices;
 
+   features = btrfs_super_incompat_flags(disk_super);
+
+   if (features & BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
+   memcpy(fs_info->metadata_uuid, disk_super->metadata_uuid,
+  BTRFS_FSID_SIZE);
+   else
+   memcpy(fs_info->metadata_uuid, fs_info->fsid, BTRFS_FSID_SIZE);
+
btrfs_setup_root(fs_info->chunk_root, fs_info,
 BTRFS_CHUNK_TREE_OBJECTID);
 
diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 06c8311bdfe3..10946b31976f 100644
--- 

[PATCH 7/8] btrfs-progs: Remove btrfs_fs_info::new_fsid

2018-10-11 Thread Nikolay Borisov
This member was used only by btrfstune when using the old method to
change fsid. It's only an in-memory value with a very specific
purpose so it makes no sense to pollute a generic structure such as
btrfs_fs_info with it. Just remove it and pass it as a function
argument where pertinent. No functional changes.

Signed-off-by: Nikolay Borisov 
---
 btrfstune.c | 46 +-
 ctree.h |  1 -
 2 files changed, 21 insertions(+), 26 deletions(-)

diff --git a/btrfstune.c b/btrfstune.c
index c1e4fa8067e8..d093d26f288c 100644
--- a/btrfstune.c
+++ b/btrfstune.c
@@ -202,15 +202,15 @@ static int set_super_incompat_flags(struct btrfs_root 
*root, u64 flags)
return ret;
 }
 
-static int change_buffer_header_uuid(struct extent_buffer *eb)
+static int change_buffer_header_uuid(struct extent_buffer *eb, uuid_t new_fsid)
 {
struct btrfs_fs_info *fs_info = eb->fs_info;
int same_fsid = 1;
int same_chunk_tree_uuid = 1;
int ret;
 
-   same_fsid = !memcmp_extent_buffer(eb, fs_info->new_fsid,
-   btrfs_header_fsid(), BTRFS_FSID_SIZE);
+   same_fsid = !memcmp_extent_buffer(eb, new_fsid, btrfs_header_fsid(),
+ BTRFS_FSID_SIZE);
same_chunk_tree_uuid =
!memcmp_extent_buffer(eb, fs_info->new_chunk_tree_uuid,
btrfs_header_chunk_tree_uuid(eb),
@@ -218,7 +218,7 @@ static int change_buffer_header_uuid(struct extent_buffer 
*eb)
if (same_fsid && same_chunk_tree_uuid)
return 0;
if (!same_fsid)
-   write_extent_buffer(eb, fs_info->new_fsid, btrfs_header_fsid(),
+   write_extent_buffer(eb, new_fsid, btrfs_header_fsid(),
BTRFS_FSID_SIZE);
if (!same_chunk_tree_uuid)
write_extent_buffer(eb, fs_info->new_chunk_tree_uuid,
@@ -229,7 +229,7 @@ static int change_buffer_header_uuid(struct extent_buffer 
*eb)
return ret;
 }
 
-static int change_extents_uuid(struct btrfs_fs_info *fs_info)
+static int change_extents_uuid(struct btrfs_fs_info *fs_info, uuid_t new_fsid)
 {
struct btrfs_root *root = fs_info->extent_root;
struct btrfs_path path;
@@ -268,7 +268,7 @@ static int change_extents_uuid(struct btrfs_fs_info 
*fs_info)
ret = PTR_ERR(eb);
goto out;
}
-   ret = change_buffer_header_uuid(eb);
+   ret = change_buffer_header_uuid(eb, new_fsid);
free_extent_buffer(eb);
if (ret < 0) {
error("failed to change uuid of tree block: %llu",
@@ -290,27 +290,27 @@ static int change_extents_uuid(struct btrfs_fs_info 
*fs_info)
return ret;
 }
 
-static int change_device_uuid(struct extent_buffer *eb, int slot)
+static int change_device_uuid(struct extent_buffer *eb, int slot,
+ uuid_t new_fsid)
 {
struct btrfs_dev_item *di;
struct btrfs_fs_info *fs_info = eb->fs_info;
int ret = 0;
 
di = btrfs_item_ptr(eb, slot, struct btrfs_dev_item);
-   if (!memcmp_extent_buffer(eb, fs_info->new_fsid,
+   if (!memcmp_extent_buffer(eb, new_fsid,
  (unsigned long)btrfs_device_fsid(di),
  BTRFS_FSID_SIZE))
return ret;
 
-   write_extent_buffer(eb, fs_info->new_fsid,
-   (unsigned long)btrfs_device_fsid(di),
+   write_extent_buffer(eb, new_fsid, (unsigned long)btrfs_device_fsid(di),
BTRFS_FSID_SIZE);
ret = write_tree_block(NULL, fs_info, eb);
 
return ret;
 }
 
-static int change_devices_uuid(struct btrfs_fs_info *fs_info)
+static int change_devices_uuid(struct btrfs_fs_info *fs_info, uuid_t new_fsid)
 {
struct btrfs_root *root = fs_info->chunk_root;
struct btrfs_path path;
@@ -328,7 +328,8 @@ static int change_devices_uuid(struct btrfs_fs_info 
*fs_info)
if (key.type != BTRFS_DEV_ITEM_KEY ||
key.objectid != BTRFS_DEV_ITEMS_OBJECTID)
goto next;
-   ret = change_device_uuid(path.nodes[0], path.slots[0]);
+   ret = change_device_uuid(path.nodes[0], path.slots[0],
+new_fsid);
if (ret < 0)
goto out;
 next:
@@ -345,7 +346,7 @@ static int change_devices_uuid(struct btrfs_fs_info 
*fs_info)
return ret;
 }
 
-static int change_fsid_prepare(struct btrfs_fs_info *fs_info)
+static int change_fsid_prepare(struct btrfs_fs_info *fs_info, uuid_t new_fsid)
 {
struct btrfs_root *tree_root = fs_info->tree_root;
u64 flags = btrfs_super_flags(fs_info->super_copy);
@@ -354,14 +355,13 @@ static int change_fsid_prepare(struct btrfs_fs_info 
*fs_info)
flags |= 

[PATCH 5/8] btrfs-progs: tests: Add tests for changing fsid feature

2018-10-11 Thread Nikolay Borisov
Add a bunch of tests exercising the new btrfstune functionality. In
particular, check that various restrictions are implemented correctly,
test that btrfs-image works as expected, and also verify that the
output of btrfs inspect-internal dump-super is correct.

Signed-off-by: Nikolay Borisov 
---
 tests/misc-tests/033-metadata-uuid/disk1.raw.xz | Bin 0 -> 78336 bytes
 tests/misc-tests/033-metadata-uuid/disk2.raw.xz | Bin 0 -> 77664 bytes
 tests/misc-tests/033-metadata-uuid/disk3.raw.xz | Bin 0 -> 78328 bytes
 tests/misc-tests/033-metadata-uuid/disk4.raw.xz | Bin 0 -> 77592 bytes
 tests/misc-tests/033-metadata-uuid/disk5.raw.xz | Bin 0 -> 78348 bytes
 tests/misc-tests/033-metadata-uuid/disk6.raw.xz | Bin 0 -> 77552 bytes
 tests/misc-tests/033-metadata-uuid/test.sh  | 225 
 7 files changed, 225 insertions(+)
 create mode 100644 tests/misc-tests/033-metadata-uuid/disk1.raw.xz
 create mode 100644 tests/misc-tests/033-metadata-uuid/disk2.raw.xz
 create mode 100644 tests/misc-tests/033-metadata-uuid/disk3.raw.xz
 create mode 100644 tests/misc-tests/033-metadata-uuid/disk4.raw.xz
 create mode 100644 tests/misc-tests/033-metadata-uuid/disk5.raw.xz
 create mode 100644 tests/misc-tests/033-metadata-uuid/disk6.raw.xz
 create mode 100755 tests/misc-tests/033-metadata-uuid/test.sh

diff --git a/tests/misc-tests/033-metadata-uuid/disk1.raw.xz 
b/tests/misc-tests/033-metadata-uuid/disk1.raw.xz
new file mode 100644
index 
..24f47d2b07553b394f273df8794f503935ab658e
GIT binary patch
literal 78336
[base85-encoded binary patch data elided]
[PATCH 3/6] btrfs: Add handling for disk split-brain scenario during fsid change

2018-10-11 Thread Nikolay Borisov
Even though an FSID change without a metadata rewrite is a very quick
operation, it's still possible to experience a split-brain scenario if
power loss occurs at just the right time. This patch handles the case
where power failure occurs while the first transaction (the one setting
the FSID_CHANGING_V2 flag) is being persisted on disk. This can cause
the btrfs_fs_devices of this filesystem to be created by a device which:

 a) has the FSID_CHANGING_V2 flag set but its fsid value is intact

 b) or a device which doesn't have the FSID_CHANGING_V2 flag set and its
 fsid value is intact

This situation is trivially handled by the current find_fsid code, since
in both cases the devices are going to be treated like ordinary devices.
Since btrfs is always mounted using the superblock of the latest
device (the one with the highest generation number), meaning it will have
the FSID_CHANGING_V2 flag set, ensure the flag is cleared. On the first
transaction commit following the mount, all disks will have it cleared.
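The mount-time fixup above amounts to a pure function over the super flags; a sketch, with an illustrative bit value standing in for the real BTRFS_SUPER_FLAG_CHANGING_FSID_V2 and no claim to match the kernel's actual code paths:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit; the real value lives in btrfs_tree.h as
 * BTRFS_SUPER_FLAG_CHANGING_FSID_V2. */
#define SUPER_FLAG_CHANGING_FSID_V2 (1ULL << 36)

/* The flag may survive a power loss mid-change; stripping it from the
 * in-memory superblock at mount, before super_for_commit is copied,
 * guarantees the next transaction commit persists a clean superblock
 * on every disk. */
static uint64_t clear_stale_fsid_change(uint64_t super_flags)
{
	return super_flags & ~SUPER_FLAG_CHANGING_FSID_V2;
}
```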

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/disk-io.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index be2caf513e2f..9c2f46f8421a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2831,10 +2831,10 @@ int open_ctree(struct super_block *sb,
 * the whole block of INFO_SIZE
 */
memcpy(fs_info->super_copy, bh->b_data, sizeof(*fs_info->super_copy));
-   memcpy(fs_info->super_for_commit, fs_info->super_copy,
-  sizeof(*fs_info->super_for_commit));
brelse(bh);
 
+   disk_super = fs_info->super_copy;
+
ASSERT(!memcmp(fs_info->fs_devices->fsid, fs_info->super_copy->fsid,
   BTRFS_FSID_SIZE));
 
@@ -2844,6 +2844,15 @@ int open_ctree(struct super_block *sb,
BTRFS_FSID_SIZE));
}
 
+   features = btrfs_super_flags(disk_super);
+   if (features & BTRFS_SUPER_FLAG_CHANGING_FSID_v2) {
+   features &= ~BTRFS_SUPER_FLAG_CHANGING_FSID_v2;
+   btrfs_set_super_flags(disk_super, features);
+   btrfs_info(fs_info, "Found metadata uuid in progress flag, clearing");
+   }
+
+   memcpy(fs_info->super_for_commit, fs_info->super_copy,
+  sizeof(*fs_info->super_for_commit));
 
ret = btrfs_validate_mount_super(fs_info);
if (ret) {
@@ -2852,7 +2861,6 @@ int open_ctree(struct super_block *sb,
goto fail_alloc;
}
 
-   disk_super = fs_info->super_copy;
if (!btrfs_super_root(disk_super))
goto fail_alloc;
 
-- 
2.7.4



[PATCH 2/6] btrfs: Remove fsid/metadata_fsid fields from btrfs_info

2018-10-11 Thread Nikolay Borisov
Currently btrfs_fs_info structure contains a copy of the
fsid/metadata_uuid fields. Same values are also contained in the
btrfs_fs_devices structure which fs_info has a reference to. Let's
reduce duplication by removing the fields from fs_info and always refer
to the ones in fs_devices. No functional changes.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/check-integrity.c   |  2 +-
 fs/btrfs/ctree.c |  5 +++--
 fs/btrfs/ctree.h |  2 --
 fs/btrfs/disk-io.c   | 21 -
 fs/btrfs/extent-tree.c   |  2 +-
 fs/btrfs/ioctl.c |  2 +-
 fs/btrfs/super.c |  2 +-
 fs/btrfs/volumes.c   | 10 --
 include/trace/events/btrfs.h |  2 +-
 9 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 2e43fba44035..781cae168d2a 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1720,7 +1720,7 @@ static int btrfsic_test_for_metadata(struct btrfsic_state 
*state,
num_pages = state->metablock_size >> PAGE_SHIFT;
h = (struct btrfs_header *)datav[0];
 
-   if (memcmp(h->fsid, fs_info->fsid, BTRFS_FSID_SIZE))
+   if (memcmp(h->fsid, fs_info->fs_devices->fsid, BTRFS_FSID_SIZE))
return 1;
 
for (i = 0; i < num_pages; i++) {
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 11b5c2abeddc..9de05de8887d 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -12,6 +12,7 @@
 #include "transaction.h"
 #include "print-tree.h"
 #include "locking.h"
+#include "volumes.h"
 
 static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
  *root, struct btrfs_path *path, int level);
@@ -224,7 +225,7 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, new_root_objectid);
 
-   write_extent_buffer_fsid(cow, fs_info->metadata_fsid);
+   write_extent_buffer_fsid(cow, fs_info->fs_devices->metadata_uuid);
 
WARN_ON(btrfs_header_generation(buf) > trans->transid);
if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID)
@@ -1033,7 +1034,7 @@ static noinline int __btrfs_cow_block(struct 
btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, root->root_key.objectid);
 
-   write_extent_buffer_fsid(cow, fs_info->metadata_fsid);
+   write_extent_buffer_fsid(cow, fs_info->fs_devices->metadata_uuid);
 
ret = update_ref_for_cow(trans, root, buf, cow, _ref);
if (ret) {
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index afa55f524a49..ad797ee42894 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -749,8 +749,6 @@ struct btrfs_delayed_root;
 #define BTRFS_FS_BALANCE_RUNNING   18
 
 struct btrfs_fs_info {
-   u8 fsid[BTRFS_FSID_SIZE]; /* User-visible fs UUID */
-   u8 metadata_fsid[BTRFS_FSID_SIZE]; /* UUID written to btree blocks */
u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
unsigned long flags;
struct btrfs_root *extent_root;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b61e4d47e316..be2caf513e2f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -551,7 +551,7 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, 
struct page *page)
if (WARN_ON(!PageUptodate(page)))
return -EUCLEAN;
 
-   ASSERT(memcmp_extent_buffer(eb, fs_info->metadata_fsid,
+   ASSERT(memcmp_extent_buffer(eb, fs_info->fs_devices->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
 
return csum_tree_block(fs_info, eb, 0);
@@ -2490,11 +2490,12 @@ static int validate_super(struct btrfs_fs_info *fs_info,
ret = -EINVAL;
}
 
-   if (memcmp(fs_info->metadata_fsid, sb->dev_item.fsid,
+   if (memcmp(fs_info->fs_devices->metadata_uuid, sb->dev_item.fsid,
   BTRFS_FSID_SIZE) != 0) {
btrfs_err(fs_info,
   "dev_item UUID does not match metadata fsid: %pU != 
%pU",
-  fs_info->metadata_fsid, sb->dev_item.fsid);
+  fs_info->fs_devices->metadata_uuid,
+  sb->dev_item.fsid);
ret = -EINVAL;
}
 
@@ -2834,14 +2835,16 @@ int open_ctree(struct super_block *sb,
   sizeof(*fs_info->super_for_commit));
brelse(bh);
 
-   memcpy(fs_info->fsid, fs_info->super_copy->fsid, BTRFS_FSID_SIZE);
+   ASSERT(!memcmp(fs_info->fs_devices->fsid, fs_info->super_copy->fsid,
+  BTRFS_FSID_SIZE));
+
if (btrfs_fs_incompat(fs_info, METADATA_UUID)) {
-   memcpy(fs_info->metadata_fsid,
-  fs_info->super_copy->metadata_uuid, BTRFS_FSID_SIZE);
-   } else {
-   memcpy(fs_info->metadata_fsid, fs_info->fsid, BTRFS_FSID_SIZE);
+   ASSERT(!memcmp(fs_info->fs_devices->metadata_uuid,
+ 

[PATCH 6/6] btrfs: Handle final split-brain possibility during fsid change

2018-10-11 Thread Nikolay Borisov
This patch lands the last case which needs to be handled by the fsid
change code. Namely, this is the case where a multidisk filesystem has
already undergone at least one successful fsid change i.e. all disks
have the METADATA_UUID incompat bit, and power failure occurs while
another fsid change is in progress. When such an event occurs, disks
should be split into 2 groups. One of the groups will have both the
METADATA_UUID and FSID_CHANGING_V2 flags set, coupled with the old
fsid/metadata_uuid pair. The other group of disks will have only the
METADATA_UUID bit set, and their fsid will be different from the one on
disks in the first group. Two cases follow:
  a) A disk from the first group is scanned first, so fs_devices is
  created with a stale fsid/metadata_uuid pair. Then when a disk from the
  second group is scanned it needs to first check whether there exists
  an fs_devices that has fsid_change set to true (because it was
  created from a disk having the FSID_CHANGING_V2 flag); the
  metadata_uuid and fsid of that fs_devices will differ (since it was
  created by a disk which already had at least 1 successful fsid change)
  and finally the metadata_uuid of the fs_devices will equal that of the
  currently scanned disk (because metadata_uuid never really changes).
  When the correct fs_devices is found, the information from the scanned
  disk will replace the current one in fs_devices, since the scanned disk
  will have a higher generation number.

  b) A disk from the second group is scanned, so fs_devices is created
  as usual with differing fsid/metadata_uuid. Then when a disk from the
  first group is scanned, the code detects that it has both the
  FSID_CHANGING_V2 and METADATA_UUID flags set and will scan for an
  fs_devices that has differing metadata_uuid/fsid and whose
  metadata_uuid is the same as that of the scanned device.
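Case (b) above boils down to a predicate evaluated against each registered fs_devices. A simplified sketch, with stand-in types rather than the kernel's structures:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FSID_SIZE 16

struct fake_fs_devices {
	uint8_t fsid[FSID_SIZE];
	uint8_t metadata_uuid[FSID_SIZE];
	int fsid_change; /* created from a disk with FSID_CHANGING_V2 set */
};

/* A scanned disk from the already-changed group belongs to a
 * mid-change fs_devices when that fs_devices was created mid-change,
 * its fsid and metadata_uuid differ (at least one fsid change has
 * completed), and its metadata_uuid equals the scanned disk's, since
 * metadata_uuid never changes across fsid changes. */
static int matches_changed_fs(const struct fake_fs_devices *fs,
			      const uint8_t *scanned_metadata_uuid)
{
	return fs->fsid_change &&
	       memcmp(fs->metadata_uuid, fs->fsid, FSID_SIZE) != 0 &&
	       memcmp(scanned_metadata_uuid, fs->metadata_uuid,
		      FSID_SIZE) == 0;
}
```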

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/volumes.c | 65 --
 1 file changed, 53 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2c9879a81884..d08667cea189 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -383,7 +383,6 @@ find_fsid(const u8 *fsid, const u8 *metadata_fsid)
ASSERT(fsid);
 
if (metadata_fsid) {
-
/*
 * Handle scanned device having completed its fsid change but
 * belonging to a fs_devices that was created by first scanning
@@ -399,6 +398,21 @@ find_fsid(const u8 *fsid, const u8 *metadata_fsid)
return fs_devices;
}
}
+   /*
+* Handle scanned device having completed its fsid change but
+* belonging to a fs_devices that was created by a device that
+* has an outdated pair of fsid/metadata_uuid and CHANGING_FSID
+* flag set. 6/b
+*/
+   list_for_each_entry(fs_devices, _uuids, fs_list) {
+   if (fs_devices->fsid_change &&
+   memcmp(fs_devices->metadata_uuid,
+  fs_devices->fsid, BTRFS_FSID_SIZE) != 0 &&
+   memcmp(metadata_fsid, fs_devices->metadata_uuid,
+  BTRFS_FSID_SIZE) == 0) {
+   return fs_devices;
+   }
+   }
}
 
/* Handle non-split brain cases */
@@ -808,6 +822,30 @@ static struct btrfs_fs_devices *find_fsid_inprogress(
return NULL;
 }
 
+
+static struct btrfs_fs_devices *find_fsid_changed(
+   struct btrfs_super_block *disk_super)
+{
+   struct btrfs_fs_devices *fs_devices;
+
+   /*
+* Handles the case where scanned device is part of an fs that had
+* multiple successful changes of FSID but curently device didn't
+* observe it. Meaning our fsid will be different than theirs. 6/b
+*/
+   list_for_each_entry(fs_devices, _uuids, fs_list) {
+   if (memcmp(fs_devices->metadata_uuid, fs_devices->fsid,
+  BTRFS_FSID_SIZE) != 0 &&
+   memcmp(fs_devices->metadata_uuid, disk_super->metadata_uuid,
+  BTRFS_FSID_SIZE) == 0 &&
+   memcmp(fs_devices->fsid, disk_super->fsid,
+  BTRFS_FSID_SIZE) != 0) {
+   return fs_devices;
+   }
+   }
+
+   return NULL;
+}
 /*
  * Add new device to list of registered devices
  *
@@ -829,17 +867,20 @@ static noinline struct btrfs_device 
*device_list_add(const char *path,
bool fsid_change_in_progress = (btrfs_super_flags(disk_super) &
BTRFS_SUPER_FLAG_CHANGING_FSID_v2);
 
-   if (fsid_change_in_progress && !has_metadata_uuid) {
-   /*
-* When we have an image which has FSID_CHANGE set
-  

[PATCH 1/6] btrfs: Introduce support for FSID change without metadata rewrite

2018-10-11 Thread Nikolay Borisov
This field is going to be used when the user wants to change the UUID
of the filesystem without having to rewrite all metadata blocks. This
field adds another level of indirection such that when the FSID is
changed what really happens is the current uuid (the one with which the
fs was created) is copied to the 'metadata_uuid' field in the superblock
as well as a new incompat flag is set METADATA_UUID. When the kernel
detects this flag is set it knows that the superblock in fact has 2
uuids:

1. Is the UUID which is user-visible, currently known as FSID.
2. Metadata UUID - this is the UUID which is stamped into all on-disk
datastructures belonging to this file system.

When the new incompat flag is present, device scanning checks whether
both the fsid/metadata_uuid of the scanned device match any of the
registered filesystems. When the flag is not set then both UUIDs are
equal and only the FSID is retained on disk, metadata_uuid is set only
in-memory during mount.

Additionally a new metadata_uuid field is also added to the fs_info
struct. It's initialised either with the FSID in case METADATA_UUID
incompat flag is not set or with the metdata_uuid of the superblock
otherwise.

This commit introduces the new fields as well as the new incompat flag
and switches all users of the fsid to the new logic.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ctree.c|  4 +--
 fs/btrfs/ctree.h| 12 ---
 fs/btrfs/disk-io.c  | 32 ++
 fs/btrfs/extent-tree.c  |  2 +-
 fs/btrfs/volumes.c  | 72 -
 fs/btrfs/volumes.h  |  1 +
 include/uapi/linux/btrfs.h  |  1 +
 include/uapi/linux/btrfs_tree.h |  1 +
 8 files changed, 97 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 2ee43b6a4f09..11b5c2abeddc 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -224,7 +224,7 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, new_root_objectid);
 
-   write_extent_buffer_fsid(cow, fs_info->fsid);
+   write_extent_buffer_fsid(cow, fs_info->metadata_fsid);
 
WARN_ON(btrfs_header_generation(buf) > trans->transid);
if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID)
@@ -1033,7 +1033,7 @@ static noinline int __btrfs_cow_block(struct 
btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, root->root_key.objectid);
 
-   write_extent_buffer_fsid(cow, fs_info->fsid);
+   write_extent_buffer_fsid(cow, fs_info->metadata_fsid);
 
ret = update_ref_for_cow(trans, root, buf, cow, _ref);
if (ret) {
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 15c659f23411..afa55f524a49 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -197,7 +197,7 @@ struct btrfs_root_backup {
 struct btrfs_super_block {
u8 csum[BTRFS_CSUM_SIZE];
/* the first 4 fields must match struct btrfs_header */
-   u8 fsid[BTRFS_FSID_SIZE]; /* FS specific uuid */
+   u8 fsid[BTRFS_FSID_SIZE]; /* userfacing FS specific uuid */
__le64 bytenr; /* this block number */
__le64 flags;
 
@@ -234,8 +234,10 @@ struct btrfs_super_block {
__le64 cache_generation;
__le64 uuid_tree_generation;
 
+   u8 metadata_uuid[BTRFS_FSID_SIZE]; /* The uuid written into btree blocks */
+
/* future expansion */
-   __le64 reserved[30];
+   __le64 reserved[28];
u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS];
 } __attribute__ ((__packed__));
@@ -265,7 +267,8 @@ struct btrfs_super_block {
 BTRFS_FEATURE_INCOMPAT_RAID56 |\
 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \
 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |   \
-BTRFS_FEATURE_INCOMPAT_NO_HOLES)
+BTRFS_FEATURE_INCOMPAT_NO_HOLES|   \
+BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET\
(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
@@ -746,7 +749,8 @@ struct btrfs_delayed_root;
 #define BTRFS_FS_BALANCE_RUNNING   18
 
 struct btrfs_fs_info {
-   u8 fsid[BTRFS_FSID_SIZE];
+   u8 fsid[BTRFS_FSID_SIZE]; /* User-visible fs UUID */
+   u8 metadata_fsid[BTRFS_FSID_SIZE]; /* UUID written to btree blocks */
u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
unsigned long flags;
struct btrfs_root *extent_root;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 27f6a3348f94..b61e4d47e316 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -551,7 +551,7 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
if (WARN_ON(!PageUptodate(page)))
return -EUCLEAN;
 
-   ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
+   ASSERT(memcmp_extent_buffer(eb, 

[PATCH 4/6] btrfs: Introduce 2 more members to struct btrfs_fs_devices

2018-10-11 Thread Nikolay Borisov
In order to gracefully handle split-brain scenarios, which are very
unlikely yet still possible while performing the FSID change, I'm
going to need two more pieces of information:
 1. The highest generation number among all devices registered to a
 particular btrfs_fs_devices

 2. A boolean flag whether a given btrfs_fs_devices was created by a
 device which had the FSID_CHANGING_V2 flag set.

This is a preparatory patch and just introduces the variables as well
as the code which sets them; their actual use is going to happen in a
later patch.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/volumes.c | 9 -
 fs/btrfs/volumes.h | 5 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bf0aa900f22c..c2b66d15e08d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -785,6 +785,8 @@ static noinline struct btrfs_device *device_list_add(const char *path,
u64 devid = btrfs_stack_device_id(&disk_super->dev_item);
bool has_metadata_uuid = (btrfs_super_incompat_flags(disk_super) &
BTRFS_FEATURE_INCOMPAT_METADATA_UUID);
+   bool fsid_change_in_progress = (btrfs_super_flags(disk_super) &
+   BTRFS_SUPER_FLAG_CHANGING_FSID_v2);
 
if (has_metadata_uuid)
fs_devices = find_fsid(disk_super->fsid, disk_super->metadata_uuid);
@@ -798,6 +800,8 @@ static noinline struct btrfs_device *device_list_add(const char *path,
else
fs_devices = alloc_fs_devices(disk_super->fsid, NULL);
 
+   fs_devices->fsid_change = fsid_change_in_progress;
+
if (IS_ERR(fs_devices))
return ERR_CAST(fs_devices);
 
@@ -904,8 +908,11 @@ static noinline struct btrfs_device *device_list_add(const char *path,
 * it back. We need it to pick the disk with largest generation
 * (as above).
 */
-   if (!fs_devices->opened)
+   if (!fs_devices->opened) {
device->generation = found_transid;
+   fs_devices->latest_generation = max(found_transid,
+   fs_devices->latest_generation);
+   }
 
fs_devices->total_devices = btrfs_super_num_devices(disk_super);
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 04860497b33c..6b2a01c55426 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -211,6 +211,7 @@ BTRFS_DEVICE_GETSET_FUNCS(bytes_used);
 struct btrfs_fs_devices {
u8 fsid[BTRFS_FSID_SIZE]; /* FS specific uuid */
u8 metadata_uuid[BTRFS_FSID_SIZE];
+   bool fsid_change;
struct list_head fs_list;
 
u64 num_devices;
@@ -219,6 +220,10 @@ struct btrfs_fs_devices {
u64 missing_devices;
u64 total_rw_bytes;
u64 total_devices;
+
+   /* Highest generation number of seen devices */
+   u64 latest_generation;
+
struct block_device *latest_bdev;
 
/* all of the devices in the FS, protected by a mutex
-- 
2.7.4



Re: [PATCH 03/25] vfs: check file ranges before cloning files

2018-10-11 Thread Amir Goldstein
On Thu, Oct 11, 2018 at 4:43 PM Christoph Hellwig  wrote:
>
> > -EXPORT_SYMBOL(vfs_clone_file_prep_inodes);
> > +EXPORT_SYMBOL(vfs_clone_file_prep);
>
> Btw, why isn't this EXPORT_SYMBOL_GPL?  It is rather Linux internal
> code, including some that I wrote which you lifted into the core
> in "vfs: refactor clone/dedupe_file_range common functions".

Because Al will shoot down any attempt at those in vfs code:
https://lkml.org/lkml/2018/6/10/4

Thanks,
Amir.


Re: [PATCH 04/25] vfs: strengthen checking of file range inputs to generic_remap_checks

2018-10-11 Thread Christoph Hellwig
On Wed, Oct 10, 2018 at 09:12:46PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> File range remapping, if allowed to run past the destination file's EOF,
> is an optimization on a regular file write.  Regular file writes that
> extend the file length are subject to various constraints which are not
> checked by range cloning.
> 
> This is a correctness problem because we're never allowed to touch
> ranges that the page cache can't support (s_maxbytes); we're not
> supposed to deal with large offsets (MAX_NON_LFS) if O_LARGEFILE isn't
> set; and we must obey resource limits (RLIMIT_FSIZE).
> 
> Therefore, add these checks to the new generic_remap_checks function so
> that we curtail unexpected behavior.
> 
> Signed-off-by: Darrick J. Wong 
> Reviewed-by: Amir Goldstein 

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 03/25] vfs: check file ranges before cloning files

2018-10-11 Thread Christoph Hellwig
> -EXPORT_SYMBOL(vfs_clone_file_prep_inodes);
> +EXPORT_SYMBOL(vfs_clone_file_prep);

Btw, why isn't this EXPORT_SYMBOL_GPL?  It is rather Linux internal
code, including some that I wrote which you lifted into the core
in "vfs: refactor clone/dedupe_file_range common functions".


Re: [PATCH 02/25] vfs: vfs_clone_file_prep_inodes should return EINVAL for a clone from beyond EOF

2018-10-11 Thread Christoph Hellwig
On Wed, Oct 10, 2018 at 09:12:30PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> vfs_clone_file_prep_inodes cannot return 0 if it is asked to remap from
> a zero byte file because that's what btrfs does.
> 
> Signed-off-by: Darrick J. Wong 
> ---

Maybe it would be a good time to switch btrfs to use
vfs_clone_file_prep_inodes so that we don't have any discrepancies?

Otherwise looks good:

Reviewed-by: Christoph Hellwig 


Re: [PATCH 01/25] xfs: add a per-xfs trace_printk macro

2018-10-11 Thread Christoph Hellwig
On Wed, Oct 10, 2018 at 09:12:23PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> Add a "xfs_tprintk" macro so that developers can use trace_printk to
> print out arbitrary debugging information with the XFS device name
> attached to the trace output.

I can't say I'm a fan of this.  trace_printk is a debugging aid,
and opencoding the file system name really isn't much of a burden.


Re: [PATCH 1/9] btrfs: Add support for reading a filesystem with a RAID 5 or RAID 6 profile.

2018-10-11 Thread Daniel Kiper
On Tue, Oct 09, 2018 at 07:51:01PM +0200, Daniel Kiper wrote:
> On Thu, Sep 27, 2018 at 08:34:56PM +0200, Goffredo Baroncelli wrote:
> > From: Goffredo Baroncelli 
> >
> > Signed-off-by: Goffredo Baroncelli 
> 
> Code LGTM. Though comment begs improvement. I will send you updated
> comment for approval shortly.

Below you can find updated patch. Please check I have not messed up something.

Daniel

From ecefb12a10d39bdd09e1d2b8fbbcbdb1b35274f8 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli 
Date: Thu, 27 Sep 2018 20:34:56 +0200
Subject: [PATCH 1/1] btrfs: Add support for reading a filesystem with a RAID
 5 or RAID 6 profile.

Signed-off-by: Goffredo Baroncelli 
Signed-off-by: Daniel Kiper 
---
 grub-core/fs/btrfs.c |   73 ++
 1 file changed, 73 insertions(+)

diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
index be19544..933a57d 100644
--- a/grub-core/fs/btrfs.c
+++ b/grub-core/fs/btrfs.c
@@ -119,6 +119,8 @@ struct grub_btrfs_chunk_item
 #define GRUB_BTRFS_CHUNK_TYPE_RAID1 0x10
#define GRUB_BTRFS_CHUNK_TYPE_DUPLICATED 0x20
#define GRUB_BTRFS_CHUNK_TYPE_RAID10 0x40
+#define GRUB_BTRFS_CHUNK_TYPE_RAID5 0x80
+#define GRUB_BTRFS_CHUNK_TYPE_RAID6 0x100
   grub_uint8_t dummy2[0xc];
   grub_uint16_t nstripes;
   grub_uint16_t nsubstripes;
@@ -766,6 +768,77 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, grub_disk_addr_t addr,
  csize = chunk_stripe_length - low;
  break;
}
+ case GRUB_BTRFS_CHUNK_TYPE_RAID5:
+ case GRUB_BTRFS_CHUNK_TYPE_RAID6:
+   {
+ grub_uint64_t nparities, stripe_nr, high, low;
+
+ redundancy = 1;   /* no redundancy for now */
+
+ if (grub_le_to_cpu64 (chunk->type) & GRUB_BTRFS_CHUNK_TYPE_RAID5)
+   {
+ grub_dprintf ("btrfs", "RAID5\n");
+ nparities = 1;
+   }
+ else
+   {
+ grub_dprintf ("btrfs", "RAID6\n");
+ nparities = 2;
+   }
+
+ /*
+  * RAID 6 layout consists of several stripes spread over
+  * the disks, e.g.:
+  *
+  *   Disk_0  Disk_1  Disk_2  Disk_3
+  * A0  B0  P0  Q0
+  * Q1  A1  B1  P1
+  * P2  Q2  A2  B2
+  *
+  * Note: placement of the parities depend on row number.
+  *
+  * Pay attention that the btrfs terminology may differ from
+  * terminology used in other RAID implementations, e.g. LVM,
+  * dm or md. The main difference is that btrfs calls contiguous
+  * block of data on a given disk, e.g. A0, stripe instead of chunk.
+  *
+  * The variables listed below have following meaning:
+  *   - stripe_nr is the stripe number excluding the parities
+  * (A0 = 0, B0 = 1, A1 = 2, B1 = 3, etc.),
+  *   - high is the row number (0 for A0...Q0, 1 for Q1...P1, etc.),
+  *   - stripen is the disk number in a row (0 for A0, Q1, P2,
+  * 1 for B0, A1, Q2, etc.),
+  *   - off is the logical address to read,
+  *   - chunk_stripe_length is the size of a stripe (typically 64 KiB),
+  *   - nstripes is the number of disks in a row,
+  *   - low is the offset of the data inside a stripe,
+  *   - stripe_offset is the data offset in an array,
+  *   - csize is the "potential" data to read; it will be reduced
+  * to size if the latter is smaller,
+  *   - nparities is the number of parities (1 for RAID 5, 2 for
+  * RAID 6); used only in RAID 5/6 code.
+  */
+ stripe_nr = grub_divmod64 (off, chunk_stripe_length, &low);
+
+ /*
+  * stripen is computed without the parities
+  * (0 for A0, A1, A2, 1 for B0, B1, B2, etc.).
+  */
+ high = grub_divmod64 (stripe_nr, nstripes - nparities, &stripen);
+
+ /*
+  * The stripes are spread over the disks. In every row their
+  * positions are shifted by 1 place, so the real disk numbers
+  * change. Hence, we have to take the current row number modulo
+  * nstripes into account (0 for A0, 1 for A1, 2 for A2, etc.).
+  */
+ grub_divmod64 (high + stripen, nstripes, &stripen);
+
+ stripe_offset = low + chunk_stripe_length * high;
+ csize = chunk_stripe_length - low;
+
+ break;
+   }
  default:
grub_dprintf ("btrfs", "unsupported RAID\n");
return grub_error (GRUB_ERR_NOT_IMPLEMENTED_YET,
-- 
1.7.10.4


Re: [PATCH 1/2] btrfs: Populate btrfs_super_block::log_root_transid

2018-10-11 Thread Qu Wenruo



On 2018/10/11 下午9:11, Nikolay Borisov wrote:
> 
> 
> On 11.10.2018 15:45, Qu Wenruo wrote:
>>
>>
>> On 2018/10/11 下午8:31, David Sterba wrote:
>>> On Thu, Oct 04, 2018 at 01:49:49PM +0800, Qu Wenruo wrote:
 We have btrfs_super_block::log_root_transid to record the transid of log
 tree root.

 However it's never populated and it's always 0, just as the following
 dump-super:
 log_root 30572544
 log_root_transid 0
 log_root_level 0

 This patch will populate it with log tree root correctly, so the result
 will be:
 log_root 30572544
 log_root_transid 6
 log_root_level 0

 This won't affect current kernel behavior or btrfs check result as we
 already expect log tree root generation always to be super block
 generation + 1.

 But it could be later used to detect log tree corruption early.
>>>
>>> The backward compatibility seems to be ok in general, I found one
>>> scenario where the check will fail:
>>>
>>> * mount with unpatched kernel, log_root_transid = 0
>>> * mount with patched kernel, log_root_transid 100, generation 101
>>> * mount with unpatched kernel, log_root_transid 100 (unchanged), generation 201
>>> * mount with patched kernel -> check fails
>>
>> Indeed, this is a problem I missed.
>>
>> It proves the old principle: if we're going to change how the kernel
>> uses/interprets an on-disk member, we have to introduce a new
>> incompatible flag.
>>
>> Please just drop the series of patch.
> 
> Since this member really doesn't have any purpose now I think the
> prudent thing to do is to rename it to "reserved1" or "padding" or
> whatever but basically signal that it's no longer in use. In the future
> when a new incompat bit is required for some feature it can be
> re-used.

Makes sense.

I'll craft a patch for that.

Thanks,
Qu

> 
>>
>> Thanks,
>> Qu
>>>
>>> So the problem is when log_root_transid is not 0 and changes out of sync
>>> with the generation. The above sequence of kernels can simply happen
>>> when switching between old stable and current releases.
>>>
>>


Re: [PATCH 1/2] btrfs: Populate btrfs_super_block::log_root_transid

2018-10-11 Thread Nikolay Borisov



On 11.10.2018 15:45, Qu Wenruo wrote:
> 
> 
> On 2018/10/11 下午8:31, David Sterba wrote:
>> On Thu, Oct 04, 2018 at 01:49:49PM +0800, Qu Wenruo wrote:
>>> We have btrfs_super_block::log_root_transid to record the transid of log
>>> tree root.
>>>
>>> However it's never populated and it's always 0, just as the following
>>> dump-super:
>>>  log_root 30572544
>>>  log_root_transid 0
>>>  log_root_level 0
>>>
>>> This patch will populate it with log tree root correctly, so the result
>>> will be:
>>>  log_root 30572544
>>>  log_root_transid 6
>>>  log_root_level 0
>>>
>>> This won't affect current kernel behavior or btrfs check result as we
>>> already expect log tree root generation always to be super block
>>> generation + 1.
>>>
>>> But it could be later used to detect log tree corruption early.
>>
>> The backward compatibility seems to be ok in general, I found one
>> scenario where the check will fail:
>>
>> * mount with unpatched kernel, log_root_transid = 0
>> * mount with patched kernel, log_root_transid 100, generation 101
>> * mount with unpatched kernel, log_root_transid 100 (unchanged), generation 201
>> * mount with patched kernel -> check fails
> 
> Indeed, this is a problem I missed.
> 
> It proves the old principle: if we're going to change how the kernel
> uses/interprets an on-disk member, we have to introduce a new
> incompatible flag.
> 
> Please just drop the series of patch.

Since this member really doesn't have any purpose now I think the
prudent thing to do is to rename it to "reserved1" or "padding" or
whatever but basically signal that it's no longer in use. In the future
when a new incompat bit is required for some feature it can be
re-used.

> 
> Thanks,
> Qu
>>
>> So the problem is when log_root_transid is not 0 and changes out of sync
>> with the generation. The above sequence of kernels can simply happen
>> when switching between old stable and current releases.
>>
> 


Re: [PATCH 1/2] btrfs: Populate btrfs_super_block::log_root_transid

2018-10-11 Thread Qu Wenruo


On 2018/10/11 下午8:31, David Sterba wrote:
> On Thu, Oct 04, 2018 at 01:49:49PM +0800, Qu Wenruo wrote:
>> We have btrfs_super_block::log_root_transid to record the transid of log
>> tree root.
>>
>> However it's never populated and it's always 0, just as the following
>> dump-super:
>>  log_root 30572544
>>  log_root_transid 0
>>  log_root_level 0
>>
>> This patch will populate it with log tree root correctly, so the result
>> will be:
>>  log_root 30572544
>>  log_root_transid 6
>>  log_root_level 0
>>
>> This won't affect current kernel behavior or btrfs check result as we
>> already expect log tree root generation always to be super block
>> generation + 1.
>>
>> But it could be later used to detect log tree corruption early.
> 
> The backward compatibility seems to be ok in general, I found one
> scenario where the check will fail:
> 
> * mount with unpatched kernel, log_root_transid = 0
> * mount with patched kernel, log_root_transid 100, generation 101
> * mount with unpatched kernel, log_root_transid 100 (unchanged), generation 201
> * mount with patched kernel -> check fails

Indeed, this is a problem I missed.

It proves the old principle: if we're going to change how the kernel
uses/interprets an on-disk member, we have to introduce a new
incompatible flag.

Please just drop the series of patch.

Thanks,
Qu
> 
> So the problem is when log_root_transid is not 0 and changes out of sync
> with the generation. The above sequence of kernels can simply happen
> when switching between old stable and current releases.
> 





Re: [PATCH 1/2] btrfs: Populate btrfs_super_block::log_root_transid

2018-10-11 Thread David Sterba
On Thu, Oct 04, 2018 at 01:49:49PM +0800, Qu Wenruo wrote:
> We have btrfs_super_block::log_root_transid to record the transid of log
> tree root.
> 
> However it's never populated and it's always 0, just as the following
> dump-super:
>  log_root 30572544
>  log_root_transid 0
>  log_root_level 0
> 
> This patch will populate it with log tree root correctly, so the result
> will be:
>  log_root 30572544
>  log_root_transid 6
>  log_root_level 0
> 
> This won't affect current kernel behavior or btrfs check result as we
> already expect log tree root generation always to be super block
> generation + 1.
> 
> But it could be later used to detect log tree corruption early.

The backward compatibility seems to be ok in general, I found one
scenario where the check will fail:

* mount with unpatched kernel, log_root_transid = 0
* mount with patched kernel, log_root_transid 100, generation 101
* mount with unpatched kernel, log_root_transid 100 (unchanged), generation 201
* mount with patched kernel -> check fails

So the problem is when log_root_transid is not 0 and changes out of sync
with the generation. The above sequence of kernels can simply happen
when switching between old stable and current releases.


Re: [PATCH 5/6] btrfs: simplify btrfs_select_ref_head and cleanup some local variables

2018-10-11 Thread Nikolay Borisov



On 11.10.2018 15:15, Lu Fengqi wrote:
> On Thu, Oct 11, 2018 at 09:40:52AM +0300, Nikolay Borisov wrote:
>>
>>
>> On 11.10.2018 08:40, Lu Fengqi wrote:
>>> If the return value of find_ref_head() is NULL, the only possibility is
>>> that delayed_refs' head ref rbtree is empty. Hence, the second
>>> find_ref_head() is pointless.
>>> Besides, the local variables loop and start are unnecessary, just remove
>>> them.
>>
>> So the objective of that function is to get a reference to the first
>> delayed head which is not processed. This is done by essentially keeping
>> track of the last range that was processed in
>> delayed_refs->run_delayed_start
>>>
>>> Signed-off-by: Lu Fengqi 
>>> ---
>>>  fs/btrfs/delayed-ref.c | 17 +++--
>>>  1 file changed, 3 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>>> index 885581852bea..2726d2fb4bbe 100644
>>> --- a/fs/btrfs/delayed-ref.c
>>> +++ b/fs/btrfs/delayed-ref.c
>>> @@ -354,20 +354,11 @@ struct btrfs_delayed_ref_head *
>>>  btrfs_select_ref_head(struct btrfs_delayed_ref_root *delayed_refs)
>>>  {
>>> struct btrfs_delayed_ref_head *head;
>>> -   u64 start;
>>> -   bool loop = false;
>>>  
>>>  again:
>>> -   start = delayed_refs->run_delayed_start;
>>> -   head = find_ref_head(delayed_refs, start, 1);
>>> -   if (!head && !loop) {
>>> +   head = find_ref_head(delayed_refs, delayed_refs->run_delayed_start, 1);
>>> +   if (!head) {
>>> delayed_refs->run_delayed_start = 0;
>>> -   start = 0;
>>> -   loop = true;
>>> -   head = find_ref_head(delayed_refs, start, 1);
>>> -   if (!head)
>>> -   return NULL;
>>> -   } else if (!head && loop) {
>>
>> I believe this will have a negative impact since it actually will
>> prevent finding a head which was added BEFORE the last processed head.
>> So when a ref head is selected in btrfs_obtain_ref_head then the
>> delayed_refs->lock is dropped and the given head is locked and
>> delayed_refs->run_delayed_start points to the end of the selected range
>> that the head represents. At this point it's possible that another
>> thread modifies a different range which is before the one we have
>> selected so graphically it will be something like:
>>
>>
>> ---[HEAD2]->[HEAD1]--
>> 0                                   N
>>
>> Where HEAD1 is the head returned from first invocation of
>> btrfs_obtain_ref_head. Once  btrfs_obtain_ref_head is called the 2nd
>> time it will not find HEAD2 so will just reset run_delayed_start to 0
>> and return. So it will be up to another run of the delayed refs to
>> actually find head2. Essentially you made btrfs_obtain_ref_head less
> 
> Not exactly. In fact, find_ref_head hides such a logic. When
> return_bigger is set, if there is no larger entry to return, the first
> entry will be returned. Please see the comment I add in the PATCH 6.
> 
> Hence, the 2nd invocation of btrfs_obtain_ref_head still will return
> HEAD2. There is no functional change here.
> 
> However, your question makes me consider whether such hidden logic
> should be extracted from find_ref_head to btrfs_select_ref_head.

Right, I agree with you. As it stands I would expect that if
return_bigger is true it specifically returns a bigger entry, or
returns NULL if nothing is found. IMO this behavior is higher level and
belongs in btrfs_select_ref_head.

> 
>> greedy. Have you characterized what kind of performance impact this have?
> 
> I noticed that there is a macro called SCRAMBLE_DELAYED_REFS in the
> extent-tree.c. I am a bit curious whether it has been forgotten by
> everyone, I have not found any test results about its performance impact.

I guess it was used during testing but nothing currently sets it. I.e it
might make sense to enable it if BTRFS_DEBUG is set.


Re: [PATCH 2/2] btrfs: Validate btrfs_super_block::log_root_transid

2018-10-11 Thread David Sterba
On Thu, Oct 04, 2018 at 09:57:22AM +0300, Nikolay Borisov wrote:
> > +   /*
> > +* Check log root transid against super block generation.
> > +*
> > +* Older kernel doesn't populate log_root_transid, only need to check
> > +* it when it's not zero.
> > +* Since replaying suspicious log tree can cause serious problem, refuse
> > +* to mount if it doesn't match.
> > +*/
> > +   if (btrfs_super_log_root_transid(sb) &&
> > +   btrfs_super_log_root_transid(sb) != generation + 1) {
> 
> nit: in the same vein as you do for btrfs_super_generation just assign
> super_log_root_Transid to a local var and use that. IMO it's more
> consistent with what's happening in this function.

While I see your point I think the way Qu did it is also fine, there's
not much duplication and I find it readable.


Re: [PATCH 5/6] btrfs: simplify btrfs_select_ref_head and cleanup some local variables

2018-10-11 Thread Lu Fengqi
On Thu, Oct 11, 2018 at 09:40:52AM +0300, Nikolay Borisov wrote:
>
>
>On 11.10.2018 08:40, Lu Fengqi wrote:
>> If the return value of find_ref_head() is NULL, the only possibility is
>> that delayed_refs' head ref rbtree is empty. Hence, the second
>> find_ref_head() is pointless.
>> Besides, the local variables loop and start are unnecessary, just remove
>> them.
>
>So the objective of that function is to get a reference to the first
>delayed head which is not processed. This is done by essentially keeping
>track of the last range that was processed in
>delayed_refs->run_delayed_start
>> 
>> Signed-off-by: Lu Fengqi 
>> ---
>>  fs/btrfs/delayed-ref.c | 17 +++--
>>  1 file changed, 3 insertions(+), 14 deletions(-)
>> 
>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>> index 885581852bea..2726d2fb4bbe 100644
>> --- a/fs/btrfs/delayed-ref.c
>> +++ b/fs/btrfs/delayed-ref.c
>> @@ -354,20 +354,11 @@ struct btrfs_delayed_ref_head *
>>  btrfs_select_ref_head(struct btrfs_delayed_ref_root *delayed_refs)
>>  {
>>  struct btrfs_delayed_ref_head *head;
>> -u64 start;
>> -bool loop = false;
>>  
>>  again:
>> -start = delayed_refs->run_delayed_start;
>> -head = find_ref_head(delayed_refs, start, 1);
>> -if (!head && !loop) {
>> +head = find_ref_head(delayed_refs, delayed_refs->run_delayed_start, 1);
>> +if (!head) {
>>  delayed_refs->run_delayed_start = 0;
>> -start = 0;
>> -loop = true;
>> -head = find_ref_head(delayed_refs, start, 1);
>> -if (!head)
>> -return NULL;
>> -} else if (!head && loop) {
>
>I believe this will have a negative impact since it actually will
>prevent finding a head which was added BEFORE the last processed head.
>So when a ref head is selected in btrfs_obtain_ref_head then the
>delayed_refs->lock is dropped and the given head is locked and
>delayed_refs->run_delayed_start points to the end of the selected range
>that the head represents. At this point it's possible that another
>thread modifies a different range which is before the one we have
>selected so graphically it will be something like:
>
>
>---[HEAD2]->[HEAD1]--
>0                                   N
>
>Where HEAD1 is the head returned from first invocation of
>btrfs_obtain_ref_head. Once  btrfs_obtain_ref_head is called the 2nd
>time it will not find HEAD2 so will just reset run_delayed_start to 0
>and return. So it will be up to another run of the delayed refs to
>actually find head2. Essentially you made btrfs_obtain_ref_head less

Not exactly. In fact, find_ref_head hides such a logic. When
return_bigger is set, if there is no larger entry to return, the first
entry will be returned. Please see the comment I add in the PATCH 6.

Hence, the 2nd invocation of btrfs_obtain_ref_head still will return
HEAD2. There is no functional change here.

However, your question makes me consider whether such hidden logic
should be extracted from find_ref_head to btrfs_select_ref_head.

>greedy. Have you characterized what kind of performance impact this have?

I noticed that there is a macro called SCRAMBLE_DELAYED_REFS in the
extent-tree.c. I am a bit curious whether it has been forgotten by
everyone, I have not found any test results about its performance impact.

-- 
Thanks,
Lu

>
>
>
>
>>  return NULL;
>>  }
>>  
>> @@ -376,11 +367,9 @@ btrfs_select_ref_head(struct btrfs_delayed_ref_root *delayed_refs)
>>  
>>  node = rb_next(&head->href_node);
>>  if (!node) {
>> -if (loop)
>> +if (delayed_refs->run_delayed_start == 0)
>>  return NULL;
>>  delayed_refs->run_delayed_start = 0;
>> -start = 0;
>> -loop = true;
>>  goto again;
>>  }
>>  head = rb_entry(node, struct btrfs_delayed_ref_head,
>> 
>
>




Re: [PATCH v3 1/6] btrfs-progs: image: Use correct device size when restoring

2018-10-11 Thread Nikolay Borisov



On  8.10.2018 15:30, Qu Wenruo wrote:
> When restoring btrfs image, the total_bytes of device item is not
> updated correctly. In fact total_bytes can be left 0 for restored image.
> 
> It doesn't trigger any error because btrfs check never checks
> total_bytes of dev item.
> However this is going to change.
> 
> Fix it by populating total_bytes of device item with the end position of
> last dev extent to make later btrfs check happy.

'Setting it to the end position' really means "setting the total device
size to the allocated space on the device". Is it clearer if it is
stated the second way?

In any case:

Reviewed-by: Nikolay Borisov 

> 
> Signed-off-by: Qu Wenruo 
> ---
>  image/main.c | 48 +---
>  1 file changed, 45 insertions(+), 3 deletions(-)
> 
> diff --git a/image/main.c b/image/main.c
> index 351c5a256938..d5b89bc3149f 100644
> --- a/image/main.c
> +++ b/image/main.c
> @@ -2082,15 +2082,17 @@ static void remap_overlapping_chunks(struct mdrestore_struct *mdres)
>  }
>  
>  static int fixup_devices(struct btrfs_fs_info *fs_info,
> -  struct mdrestore_struct *mdres, off_t dev_size)
> +  struct mdrestore_struct *mdres, int out_fd)
>  {
>   struct btrfs_trans_handle *trans;
>   struct btrfs_dev_item *dev_item;
> + struct btrfs_dev_extent *dev_ext;
>   struct btrfs_path path;
>   struct extent_buffer *leaf;
>   struct btrfs_root *root = fs_info->chunk_root;
>   struct btrfs_key key;
>   u64 devid, cur_devid;
> + u64 dev_size; /* Taken from the last dev extent */
>   int ret;
>  
>   trans = btrfs_start_transaction(fs_info->tree_root, 1);
> @@ -2101,16 +2103,56 @@ static int fixup_devices(struct btrfs_fs_info *fs_info,
>  
>   dev_item = &fs_info->super_copy->dev_item;
>  
> + btrfs_init_path(&path);
>   devid = btrfs_stack_device_id(dev_item);
>  
> + key.objectid = devid;
> + key.type = BTRFS_DEV_EXTENT_KEY;
> + key.offset = (u64)-1;
> +
> + ret = btrfs_search_slot(NULL, fs_info->dev_root, &key, &path, 0, 0);
> + if (ret < 0) {
> + error("failed to locate last dev extent of devid %llu: %s",
> + devid, strerror(-ret));
> + btrfs_release_path(&path);
> + return ret;
> + }
> + if (ret == 0) {

nit: I'd prefer if this is an else if since it emphasizes the fact that
the if/elseif construct works on a single value.

> + error("found invalid dev extent devid %llu offset -1",
> + devid);
> + btrfs_release_path(&path);
> + return -EUCLEAN;
> + }
> + ret = btrfs_previous_item(fs_info->dev_root, &path, devid,
> +   BTRFS_DEV_EXTENT_KEY);
> + if (ret > 0)
> + ret = -ENOENT;
> + if (ret < 0) {

ditto

> + error("failed to locate last dev extent of devid %llu: %s",
> + devid, strerror(-ret));
> + btrfs_release_path(&path);
> + return ret;
> + }
> +
> + btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
> + dev_ext = btrfs_item_ptr(path.nodes[0], path.slots[0],
> +  struct btrfs_dev_extent);
> + dev_size = key.offset + btrfs_dev_extent_length(path.nodes[0], dev_ext);
> + btrfs_release_path(&path);
> +
>   btrfs_set_stack_device_total_bytes(dev_item, dev_size);
>   btrfs_set_stack_device_bytes_used(dev_item, mdres->alloced_chunks);
> + /* Don't forget to enlarge the real file */
> + ret = ftruncate64(out_fd, dev_size);
> + if (ret < 0) {
> + error("failed to enlarge result image: %s", strerror(errno));
> + return -errno;
> + }
>  
>   key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
>   key.type = BTRFS_DEV_ITEM_KEY;
>   key.offset = 0;
>  
> - btrfs_init_path(&path);
>  
>  again:
>   ret = btrfs_search_slot(trans, root, &key, &path, -1, 1);
> @@ -2275,7 +2317,7 @@ static int restore_metadump(const char *input, FILE *out, int old_restore,
>   return 1;
>   }
>  
> - ret = fixup_devices(info, &mdrestore, st.st_size);
> + ret = fixup_devices(info, &mdrestore, fileno(out));
>   close_ctree(info->chunk_root);
>   if (ret)
>   goto out;
> 


Re: [PATCH 0/6] Some trivial cleanup about delayed-refs

2018-10-11 Thread David Sterba
On Thu, Oct 11, 2018 at 01:40:32PM +0800, Lu Fengqi wrote:
> There is no functional change. Just improve readability.
> 
> PATCH 1-4 parameter cleanup patches
> PATCH 5 cleanup about btrfs_select_ref_head
> PATCH 6 switch int to bool; add some comment
> 
> Lu Fengqi (6):
>   btrfs: delayed-ref: pass delayed_refs directly to
> btrfs_select_ref_head()
>   btrfs: delayed-ref: pass delayed_refs directly to
> btrfs_delayed_ref_lock()
>   btrfs: remove fs_info from btrfs_check_space_for_delayed_refs
>   btrfs: remove fs_info from btrfs_should_throttle_delayed_refs
>   btrfs: simplify btrfs_select_ref_head and cleanup some local variables
>   btrfs: switch return_bigger to bool in find_ref_head

1-4 and 6 added to misc-next, thanks.


Re: [PATCH v2 3/4] btrfs: Refactor unclustered extent allocation into find_free_extent_unclustered()

2018-10-11 Thread Qu Wenruo



On 2018/10/11 上午10:15, Su Yue wrote:
> 
> 
> On 8/21/18 4:44 PM, Qu Wenruo wrote:
>> This patch will extract unclustered extent allocation code into
>> find_free_extent_unclustered().
>>
>> And this helper function will use return value to indicate what to do
>> next.
>>
>> This should make find_free_extent() a little easier to read.
>>
>> Signed-off-by: Qu Wenruo 
> 
> Reviewed-by: Su Yue 
>> ---
>>   fs/btrfs/extent-tree.c | 112 -
>>   1 file changed, 66 insertions(+), 46 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index a603900e0eb8..5bc8919edac2 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -7546,6 +7546,67 @@ static int find_free_extent_clustered(struct
>> btrfs_block_group_cache *bg,
>>   return 1;
>>   }
>>   +/*
>> + * Return >0 to inform caller that we find nothing
>> + * Return -EAGAIN to inform caller that we need to re-search this
>> block group
> 
> How about adding description of return 0?

Although 0 is pretty boring, it still makes sense.
Will be in next update

Thanks,
Qu
> 
> Thanks,
> Su
>> + */
>> +static int find_free_extent_unclustered(struct
>> btrfs_block_group_cache *bg,
>> +    struct btrfs_free_cluster *last_ptr,
>> +    struct find_free_extent_ctrl *ctrl)
>> +{
>> +    u64 offset;
>> +
>> +    /*
>> + * We are doing an unclustered alloc, set the fragmented flag so we
>> + * don't bother trying to setup a cluster again until we get more
>> space.
>> + */
>> +    if (unlikely(last_ptr)) {
>> +    spin_lock(&last_ptr->lock);
>> +    last_ptr->fragmented = 1;
>> +    spin_unlock(&last_ptr->lock);
>> +    }
>> +    if (ctrl->cached) {
>> +    struct btrfs_free_space_ctl *free_space_ctl;
>> +
>> +    free_space_ctl = bg->free_space_ctl;
>> +    spin_lock(&free_space_ctl->tree_lock);
>> +    if (free_space_ctl->free_space <
>> +    ctrl->num_bytes + ctrl->empty_cluster + ctrl->empty_size) {
>> +    ctrl->max_extent_size = max_t(u64,
>> +    ctrl->max_extent_size,
>> +    free_space_ctl->free_space);
>> +    spin_unlock(&free_space_ctl->tree_lock);
>> +    return 1;
>> +    }
>> +    spin_unlock(&free_space_ctl->tree_lock);
>> +    }
>> +
>> +    offset = btrfs_find_space_for_alloc(bg, ctrl->search_start,
>> +    ctrl->num_bytes, ctrl->empty_size,
>> +    &ctrl->max_extent_size);
>> +
>> +    /*
>> + * If we didn't find a chunk, and we haven't failed on this block
>> group
>> + * before, and this block group is in the middle of caching and
>> we are
>> + * ok with waiting, then go ahead and wait for progress to be
>> made, and
>> + * set @retry_unclustered to true.
>> + *
>> + * If @retry_unclustered is true then we've already waited on
>> this block
>> + * group once and should move on to the next block group.
>> + */
>> +    if (!offset && !ctrl->retry_unclustered && !ctrl->cached &&
>> +    ctrl->loop > LOOP_CACHING_NOWAIT) {
>> +    wait_block_group_cache_progress(bg, ctrl->num_bytes +
>> +    ctrl->empty_size);
>> +    ctrl->retry_unclustered = true;
>> +    return -EAGAIN;
>> +    } else if (!offset) {
>> +    return 1;
>> +    }
>> +    ctrl->found_offset = offset;
>> +    return 0;
>> +}
>> +
>>   /*
>>    * walks the btree of allocated extents and find a hole of a given
>> size.
>>    * The key ins is changed to record the hole:
>> @@ -7747,54 +7808,13 @@ static noinline int find_free_extent(struct
>> btrfs_fs_info *fs_info,
>>   /* ret == -ENOENT case falss through */
>>   }
>>   -    /*
>> - * We are doing an unclustered alloc, set the fragmented flag so
>> - * we don't bother trying to setup a cluster again until we get
>> - * more space.
>> - */
>> -    if (unlikely(last_ptr)) {
>> -    spin_lock(&last_ptr->lock);
>> -    last_ptr->fragmented = 1;
>> -    spin_unlock(&last_ptr->lock);
>> -    }
>> -    if (ctrl.cached) {
>> -    struct btrfs_free_space_ctl *ctl =
>> -    block_group->free_space_ctl;
>> -
>> -    spin_lock(&ctl->tree_lock);
>> -    if (ctl->free_space <
>> -    num_bytes + ctrl.empty_cluster + empty_size) {
>> -    if (ctl->free_space > ctrl.max_extent_size)
>> -    ctrl.max_extent_size = ctl->free_space;
>> -    spin_unlock(&ctl->tree_lock);
>> -    goto loop;
>> -    }
>> -    spin_unlock(&ctl->tree_lock);
>> -    }
>> -
>> -    ctrl.found_offset = btrfs_find_space_for_alloc(block_group,
>> -    ctrl.search_start, num_bytes, empty_size,
>> -    &ctrl.max_extent_size);
>> -    /*
>> - * If we didn't find a chunk, and we haven't failed on this
>> - * block group before, and this block group is in the middle of
>> - * caching and we are ok 

Re: [PATCH v2 2/4] btrfs: Refactor clustered extent allocation into find_free_extent_clustered()

2018-10-11 Thread Qu Wenruo
[snip]
>> +    /* ret == -ENOENT case falss through */
> 
> falls?

Nice catch.

Thanks,
Qu

> 
> Thanks,
> Su
>>   }
>>   -unclustered_alloc:
>>   /*
>>    * We are doing an unclustered alloc, set the fragmented
>> flag so
>>    * we don't bother trying to setup a cluster again until we get
>>
> 
> 


Re: [PATCH] btrfs: qgroup: move the qgroup->members check out from (!qgroup)'s else branch

2018-10-11 Thread David Sterba
On Thu, Oct 11, 2018 at 01:42:56PM +0800, Lu Fengqi wrote:
> There is no reason to put this check in (!qgroup)'s else branch because if
> qgroup is null, it will goto out directly. So move it out to reduce
> indent.
> 
> No Functional Change.
> 
> Signed-off-by: Lu Fengqi 

Added to misc-next, thanks.


Re: btrfs send receive ERROR: chown failed: No such file or directory

2018-10-11 Thread Leonard Lausen
Hey Filipe,

thanks for the feedback. I ran the command again with -vv. Below are
the last commands logged by btrfs receive to stderr:

  mkfile o2138798-5016457-0
  rename 
leonard/mail/lists/emacs-orgmode/new/1530428589.M675528862P21583Q6R28ec1af3.leonard-xps13
 -> o2138802-5207521-0
  rename o2138798-5016457-0 -> 
leonard/mail/lists/emacs-orgmode/new/1530428589.M675528862P21583Q6R28ec1af3.leonard-xps13
  utimes leonard/mail/lists/emacs-orgmode/new
  ERROR: cannot open 
/mnt/_btrbk_backups/leonard-xps13/@home.20180902T0045/o2138798-5016457-0: No 
such file or directory

The file which is mentioned above has in fact not changed between the
parent and current snapshot. I computed further a hash of the file on
source volume for parent and current snapshot and target on target
volume for parent snapshot. I get the same hash value in all 3 cases, so
the parent snapshot was apparently transferred correctly.

Note that the folder in which this file is placed contains 96558 files
whose sizes sum to 774M.

I further observed that the snapshot that previously failed to transfer
via send receive could now be sent / received without error. However,
the error then occurred again for the next snapshot (which has the now
successfully transferred one as parent). There may be a race condition
that leads to non-deterministic failures.

Please let me know if I can help with any further information.

Best regards
Leonard

Filipe Manana  writes:

>> > 1) Machine 1 takes regular snapshots and sends them to machine 2. btrfs
>> >btrfs send ... | ssh user@machine2 "btrfs receive /path1"
>> > 2) Machine 2 backups all subvolumes stored at /path1 to a second
>> >independent btrfs filesystem. Let /path1/rootsnapshot be the first
>> >snapshot stored at /path1 (ie. it has no Parent UUID). Let
>> >/path1/incrementalsnapshot be a snapshot that has /path1/rootsnapshot
>> >as a parent. Then
>> >btrfs send -v /path1/rootsnapshot | btrfs receive /path2
>> >works without issues, but
>> >btrfs send -v -p /path1/rootsnapshot /path1/incrementalsnapshot | btrfs 
>> > receive /path2
>
> -v is useless. Use -vv, which will dump all commands.
>
>> >fails as follows:
>> >ERROR: chown o257-4639416-0 failed: No such file or directory
>> >
>> > No error is shown in dmesg. /path1 and /path2 denote two independent
>> > btrfs filesystems.
>> >
>> > Note that there was no issue with transferring incrementalsnapshot from
>> > machine 1 to machine 2. No error is shown in dmesg.


Re: [PATCH v3 00/25] fs: fixes for serious clone/dedupe problems

2018-10-11 Thread Amir Goldstein
On Thu, Oct 11, 2018 at 7:12 AM Darrick J. Wong  wrote:
>
> Hi all,
>
> Dave, Eric, and I have been chasing a stale data exposure bug in the XFS
> reflink implementation, and tracked it down to reflink forgetting to do
> some of the file-extending activities that must happen for regular
> writes.
>
> We then started auditing the clone, dedupe, and copyfile code and
> realized that from a file contents perspective, clonerange isn't any
> different from a regular file write.  Unfortunately, we also noticed
> that *unlike* a regular write, clonerange skips a ton of overflow
> checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS,
> and RLIMIT_FSIZE.  We also observed that cloning into a file did not
> strip security privileges (suid, capabilities) like a regular write
> would.  I also noticed that xfs and ocfs2 need to dump the page cache
> before remapping blocks, not after.
>
> In fixing the range checking problems I also realized that both dedupe
> and copyfile tell userspace how much of the requested operation was
> acted upon.  Since the range validation can shorten a clone request (or
> we can ENOSPC midway through), we might as well plumb the short
> operation reporting back through the VFS indirection code to userspace.
>
> So, here's the whole giant pile of patches[1] that fix all the problems.
> This branch is against 4.19-rc7 with Dave Chinner's XFS for-next branch.
> The patch "generic: test reflink side effects" recently sent to fstests
> exercises the fixes in this series.  Tests are in [2].
>
> --D
>
> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
> [2] 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel

I tested your branch with overlayfs over xfs.
I did not observe any failures with -g clone except for test generic/937
which also failed on xfs in my test.

I thought that you forgot to mention I needed to grab xfsprogs from djwong-devel
for commit e84a9e93 ("xfs_io: dedupe command should only complain
if we don't dedupe anything"), but even with this change the test still fails:

generic/937 - output mismatch (see
/old/home/amir/src/fstests/xfstests-dev/results//generic/937.out.bad)
--- tests/generic/937.out   2018-10-11 08:23:00.630938364 +0300
+++ /old/home/amir/src/fstests/xfstests-dev/results//generic/937.out.bad
   2018-10-11 10:54:40.448134832 +0300
@@ -4,8 +4,7 @@
 39578c21e2cb9f6049b1cf7fc7be12a6  TEST_DIR/test-937/file2
 Files 1-2 do not match (intentional)
 (partial) dedupe the middle blocks together
-deduped / bytes at offset 
-XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+XFS_IOC_FILE_EXTENT_SAME: Extents did not match.
 Compare sections

One thing that *is* different with overlayfs test is that filefrag crashes
on this same test:

QA output created by 937
Create the original files
35ac8d7917305c385c30f3d82c30a8f6  TEST_DIR/test-937/file1
39578c21e2cb9f6049b1cf7fc7be12a6  TEST_DIR/test-937/file2
Files 1-2 do not match (intentional)
(partial) dedupe the middle blocks together
XFS_IOC_FILE_EXTENT_SAME: Extents did not match.
./tests/generic/937: line 59: 19242 Floating point exception (core dumped) ${FILEFRAG_PROG} -v $testdir/file1 >> $seqres.full
./tests/generic/937: line 60: 19244 Floating point exception (core dumped) ${FILEFRAG_PROG} -v $testdir/file2 >> $seqres.full

It looks like an overlayfs v4.19-rc1 regression - FIGETBSZ returns zero.
I never noticed this regression before, because none of the generic tests
are using filefrag.

Thanks,
Amir.


Re: [PATCH v3 0/2] btrfs: Add zstd support to grub btrfs

2018-10-11 Thread Paul Menzel

Dear Nick,


Am 10.10.2018 um 22:28 schrieb Nick Terrell:


On Oct 10, 2018, at 12:34 AM, Paul Menzel wrote:

Sorry for being ignorant, but you explain, why the library needs to
be imported and it is not enough to use that library as an external
dependency?

Importing the library means, it has to be maintained in the GRUB
repository, which will result in some maintenance burden.


I've imported zstd because that's the way the rest of the
decompressors are imported.

We could potentially use libzstd as an external dependency, since its
only dependency is libc and GRUB provides the definitions of the libc
functions that zstd needs to decompress (memcpy and memmove). There's
some other stuff in the library that requires libc functionality that
GRUB doesn't provide, but that isn't used during decompression. We
strip those files out in the import.


Thank you for the explanation.


Let me know if you want me to switch to an external dependency.


I cannot make that decision. The main developers and the distribution
folks should comment on that.


Kind regards,

Paul


[PATCH v4] generic/077 fix min size for btrfs

2018-10-11 Thread Anand Jain
If btrfs needs to be tested with its default block group, which is non-mixed,
then it needs at least 256MB.

Signed-off-by: Anand Jain 
---
v3->v4: fix the check for the /usr or /lib/modules size. Thanks Darrick.
v2->v3:
separated from the patch set of 9.
notrun for the cases where filler is not big enough to fill the
fssize.
v2->v1: ref the cover-letter of the set.

 tests/generic/077 | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/tests/generic/077 b/tests/generic/077
index ef6af18c83e3..284e25253c9e 100755
--- a/tests/generic/077
+++ b/tests/generic/077
@@ -13,11 +13,18 @@ echo "QA output created by $seq"
 here=`pwd`
 tmp=/tmp/$$
 status=1
-# Something w/ enough data to fill 50M of fs...
-filler=/lib/modules/
 
-# fall back in case /lib/modules doesn't exist
-[ -d $filler ] || filler=/usr
+# Something w/ enough data to fill 256M of fs...
+filler=""
+[ -d /lib/modules ] && \
+   [ $(( $(du -h -m /lib/modules | tail -1| cut -f1) * 2 )) -ge 256 ] && \
+   filler=/lib/modules
+
+# fall back in case /lib/modules doesn't exist or smaller
+[[ -z $filler ]] && \
+   [ -d /usr ] && \
+   [ $(( $(du -h -m /usr | tail -1| cut -f1) * 2 )) -ge 256 ] && \
+   filler=/usr
 
 _cleanup()
 {
@@ -36,7 +43,7 @@ trap "_cleanup; rm -f $tmp.*; exit \$status" 0 1 2 3 15
 _supported_fs generic
 _supported_os Linux
 
-[ ! -d $filler ] && _notrun "No directory to source files from"
+[ ! -d $filler ] && _notrun "No directory at least 256MB to source files from"
 
 _require_scratch
 _require_attrs
@@ -49,9 +56,8 @@ rm -f $seqres.full
 _scratch_unmount >/dev/null 2>&1
 echo "*** MKFS ***" >>$seqres.full
 echo "" >>$seqres.full
-SIZE=`expr 50 \* 1024 \* 1024`
-_scratch_mkfs_sized $SIZE   >>$seqres.full 2>&1 \
-   || _fail "mkfs failed"
+fs_size=$((256 * 1024 * 1024))
+_scratch_mkfs_sized $fs_size >> $seqres.full 2>&1 || _fail "mkfs failed"
 _scratch_mount
 mkdir $SCRATCH_MNT/subdir
 
-- 
1.8.3.1



Re: [PATCH v3] generic/077 fix min size for btrfs

2018-10-11 Thread Anand Jain




On 10/11/2018 12:25 PM, Darrick J. Wong wrote:

On Thu, Oct 11, 2018 at 11:26:00AM +0800, Anand Jain wrote:

If btrfs needs to be tested with its default block group, which is non-mixed,
then it needs at least 256MB.

Signed-off-by: Anand Jain 
---
v2->v3:
separated from the patch set of 9.
notrun for the cases where filler is not big enough to fill the
fssize.
v2->v1: ref the cover-letter of the set.

  tests/generic/077 | 11 +++
  1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/tests/generic/077 b/tests/generic/077
index ef6af18c83e3..784afe448940 100755
--- a/tests/generic/077
+++ b/tests/generic/077
@@ -13,7 +13,7 @@ echo "QA output created by $seq"
  here=`pwd`
  tmp=/tmp/$$
  status=1
-# Something w/ enough data to fill 50M of fs...
+# Something w/ enough data to fill 256M of fs...
  filler=/lib/modules/
  
  # fall back in case /lib/modules doesn't exist

@@ -38,6 +38,11 @@ _supported_os Linux
  
  [ ! -d $filler ] && _notrun "No directory to source files from"
  
+# check if two iterations of the assigned filler is big enough to fill fssize

+fs_size=$((256 * 1024 * 1024))
+[ $(( $(du -h -m /usr | tail -1| cut -f1) * 2 )) -lt 256 ] && \


Err... what does measuring /usr have to do with /lib/modules?


Ah. Thanks Darrick. Its a bug, I fixed it in v4.


Also, /lib/modules is 58M on my test VM, which means that a 256M
filesystem isn't going to ENOSPC.


 Yep as above its only checking for /usr its a bug fixed in v4.

Thanks, Anand


(Though weirdly it doesn't fail despite the lack of ENOSPC even at the
50M size, so I'm not sure what this test is actually supposed to do...)

>

--D


+   _notrun "filler $filler isn't big enough to fill fssize $fssize"
+
  _require_scratch
  _require_attrs
  _require_acls
@@ -49,9 +54,7 @@ rm -f $seqres.full
  _scratch_unmount >/dev/null 2>&1
  echo "*** MKFS ***" >>$seqres.full
  echo "" >>$seqres.full
-SIZE=`expr 50 \* 1024 \* 1024`
-_scratch_mkfs_sized $SIZE   >>$seqres.full 2>&1 \
-   || _fail "mkfs failed"
+_scratch_mkfs_sized $fs_size >> $seqres.full 2>&1 || _fail "mkfs failed"
  _scratch_mount
  mkdir $SCRATCH_MNT/subdir
  
--

1.8.3.1



Re: [PATCH] btrfs: qgroup: move the qgroup->members check out from (!qgroup)'s else branch

2018-10-11 Thread Nikolay Borisov



On 11.10.2018 08:42, Lu Fengqi wrote:
> There is no reason to put this check in (!qgroup)'s else branch because if
> qgroup is null, it will goto out directly. So move it out to reduce
> indent.
> 
> No Functional Change.
> 
> Signed-off-by: Lu Fengqi 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/qgroup.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 27f517315388..af65ab1640b0 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1416,13 +1416,14 @@ int btrfs_remove_qgroup(struct btrfs_trans_handle 
> *trans, u64 qgroupid)
>   if (!qgroup) {
>   ret = -ENOENT;
>   goto out;
> - } else {
> - /* check if there are no children of this qgroup */
> - if (!list_empty(&qgroup->members)) {
> - ret = -EBUSY;
> - goto out;
> - }
>   }
> +
> + /* check if there are no children of this qgroup */
> + if (!list_empty(&qgroup->members)) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
>   ret = del_qgroup_item(trans, qgroupid);
>   if (ret && ret != -ENOENT)
>   goto out;
> 


Re: [PATCH 5/6] btrfs: simplify btrfs_select_ref_head and cleanup some local variables

2018-10-11 Thread Nikolay Borisov



On 11.10.2018 08:40, Lu Fengqi wrote:
> If the return value of find_ref_head() is NULL, the only possibility is
> that delayed_refs' head ref rbtree is empty. Hence, the second
> find_ref_head() is pointless.
> 
> Besides, the local variables loop and start are unnecessary, just remove
> them.

So the objective of that function is to get a reference to the first
delayed head which is not processed. This is done by essentially keeping
track of the last range that was processed in
delayed_refs->run_delayed_start
> 
> Signed-off-by: Lu Fengqi 
> ---
>  fs/btrfs/delayed-ref.c | 17 +++--
>  1 file changed, 3 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 885581852bea..2726d2fb4bbe 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -354,20 +354,11 @@ struct btrfs_delayed_ref_head *
>  btrfs_select_ref_head(struct btrfs_delayed_ref_root *delayed_refs)
>  {
>   struct btrfs_delayed_ref_head *head;
> - u64 start;
> - bool loop = false;
>  
>  again:
> - start = delayed_refs->run_delayed_start;
> - head = find_ref_head(delayed_refs, start, 1);
> - if (!head && !loop) {
> + head = find_ref_head(delayed_refs, delayed_refs->run_delayed_start, 1);
> + if (!head) {
>   delayed_refs->run_delayed_start = 0;
> - start = 0;
> - loop = true;
> - head = find_ref_head(delayed_refs, start, 1);
> - if (!head)
> - return NULL;
> - } else if (!head && loop) {

I believe this will have a negative impact since it actually will
prevent finding a head which was added BEFORE the last processed head.
So when a ref head is selected in btrfs_obtain_ref_head then the
delayed_refs->lock is dropped and the given head is locked and
delayed_refs->run_delayed_start points to the end of the selected range
that the head represents. At this point it's possible that another
thread modifies a different range which is before the one we have
selected so graphically it will be something like:


0----[HEAD2]---->[HEAD1]----N

Where HEAD1 is the head returned from first invocation of
btrfs_obtain_ref_head. Once  btrfs_obtain_ref_head is called the 2nd
time it will not find HEAD2 so will just reset run_delayed_start to 0
and return. So it will be up to another run of the delayed refs to
actually find head2. Essentially you made btrfs_obtain_ref_head less
greedy. Have you characterized what kind of performance impact this has?




>   return NULL;
>   }
>  
> @@ -376,11 +367,9 @@ btrfs_select_ref_head(struct btrfs_delayed_ref_root 
> *delayed_refs)
>  
>   node = rb_next(&head->href_node);
>   if (!node) {
> - if (loop)
> + if (delayed_refs->run_delayed_start == 0)
>   return NULL;
>   delayed_refs->run_delayed_start = 0;
> - start = 0;
> - loop = true;
>   goto again;
>   }
>   head = rb_entry(node, struct btrfs_delayed_ref_head,
> 

