[PULL] uuid_mutex fixes and cleanups part2

2018-07-17 Thread Anand Jain
These patches we sent independently before and are in the mailing list.
They have been tested successfully using the xfstests. The cyclical
lockdep warning aren't due to these set of patches and they (2) can be
ignored as they are harmless because the threads involved are
ioctl/commit and mount thread which lockdep thinks is warn-able can't
co-exist in the same time space. So this set is ready to be pulled in.
Thanks.

  g...@github.com:asj/btrfs-devel.git misc-next-jul18

1/7 has been discussed and reviewed at length, which is drop the
unnecessary uuid_mutex in the btrfs_free_extra_devids().

2/7 fixes the race which syzbot reported.

3/7 as when we sprout we hijack the seed fs_devices, we are bringing
back the seed fs_devices using the clone_fs_devices() instead we could
use our btrfs_scan_one_device() which makes the sprout-ing operation
much cleaner.

4-7/7 are a simple cleanup patches.

Anand Jain (7):
1. btrfs: drop uuid_mutex in btrfs_free_extra_devids()
2. btrfs: fix race between free_stale_devices and close_fs_devices
3. btrfs: do device clone using the btrfs_scan_one_device
4. btrfs: use the assigned fs_devices instead of the dereference
5. btrfs: warn for num_devices below 0
6. btrfs: add helper btrfs_num_devices() to deduce num_devices
7. btrfs: add helper function check device delete able

 fs/btrfs/volumes.c | 106 +++--
 1 file changed, 62 insertions(+), 44 deletions(-)

-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH FIXED v7] Add cli and ioctl to forget scanned device(s)

2018-07-17 Thread Anand Jain
[ Left out changes is now commited ].

v7:
 Availalbe for pull from
  btrfs-progs:
   g...@github.com:asj/btrfs-progs.git forget
  btrfs.ko:
   g...@github.com:asj/btrfs-devel.git misc-next-jul18

 Use struct btrfs_ioctl_vol_args (instead of struct
  btrfs_ioctl_vol_args_v2) as its inline with other ioctl
  btrfs-control
 The CLI usage remains same. However internally the ioctl flag is not
  required to delete all the unmounted devices. Instead leave
  btrfs_ioctl_vol_args::name NULL.

v6:
 Use the changed fn name btrfs_free_stale_devices().

 Change in title:
 Old v5:
 Cover-letter:
  [PATCH v5] Add cli and ioctl to ignore a scanned device
 Kernel:
  [PATCH v5] btrfs: introduce feature to ignore a btrfs device
 Progs:
  [PATCH v5] btrfs-progs: add 'btrfs device ignore' cli

v5:
  Adds feature to delete all stale devices
  Reuses btrfs_free_stale_devices() fn and so depends on the
patch-set [1] in the ML.
  Uses struct btrfs_ioctl_vol_args_v2 instead of
struct btrfs_ioctl_vol_args as arg
  Does the device path matching instead of btrfs_device matching
(we won't delete the mounted device as btrfs_free_stale_devices()
checks for it)
v4:
  No change. But as the ML thread may be confusing, so resend.
v3:
  No change. Send to correct ML.
v2:
  Accepts review from Nikolay, details are in the specific patch.
  Patch 1/2 is renamed from
[PATCH 1/2] btrfs: refactor btrfs_free_stale_device() to get device list 
delete
  to
[PATCH 1/2] btrfs: add function to device list delete

Adds cli and ioctl to forget a scanned device or forget all stale
devices in the kernel.

Anand Jain (1):
  btrfs: introduce feature to forget a btrfs device

 fs/btrfs/super.c   | 3 +++
 fs/btrfs/volumes.c | 9 +
 fs/btrfs/volumes.h | 1 +
 include/uapi/linux/btrfs.h | 2 ++
 4 files changed, 15 insertions(+)

Anand Jain (1):
  btrfs-progs: add cli to forget one or all scanned devices

 cmds-device.c | 58 ++
 ioctl.h   |  2 ++
 2 files changed, 60 insertions(+)

-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH FIXED v7] btrfs: introduce feature to forget a btrfs device

2018-07-17 Thread Anand Jain
Support for a new command 'btrfs dev forget [dev]' is proposed here,
to undo the effects of 'btrfs dev scan [dev]'. For this purpose,
this patch proposes to use ioctl #5 as it was empty.
IOW(BTRFS_IOCTL_MAGIC, 5, ..)
This patch adds new ioctl BTRFS_IOC_FORGET_DEV which can be sent from
the /dev/btrfs-control to forget one or all devices, (devices which are
not mounted) from the btrfs kernel.

The argument it takes is struct btrfs_ioctl_vol_args, and ::name can be
set to specify the device path. And all unmounted devices can be removed
from the kernel if no device path is provided.

Again, the devices are removed only if the relevant fsid aren't mounted.

Signed-off-by: Anand Jain 
---
 fs/btrfs/super.c   | 3 +++
 fs/btrfs/volumes.c | 9 +
 fs/btrfs/volumes.h | 1 +
 include/uapi/linux/btrfs.h | 2 ++
 4 files changed, 15 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d7a54c648c5f..196b0773eedb 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2249,6 +2249,9 @@ static long btrfs_control_ioctl(struct file *file, 
unsigned int cmd,
ret = PTR_ERR_OR_ZERO(device);
mutex_unlock(_mutex);
break;
+   case BTRFS_IOC_FORGET_DEV:
+   ret = btrfs_forget_devices(vol->name);
+   break;
case BTRFS_IOC_DEVICES_READY:
mutex_lock(_mutex);
device = btrfs_scan_one_device(vol->name, FMODE_READ,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 51451c0dd8f5..73ae1c95e6f3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1208,6 +1208,15 @@ static int btrfs_read_disk_super(struct block_device 
*bdev, u64 bytenr,
return 0;
 }
 
+int btrfs_forget_devices(const char *path)
+{
+   mutex_lock(_mutex);
+   btrfs_free_stale_devices(strlen(path) ? path:NULL, NULL);
+   mutex_unlock(_mutex);
+
+   return 0;
+}
+
 /*
  * Look for a btrfs signature on a device. This may be called out of the mount 
path
  * and we are not allowed to call set_blocksize during the scan. The superblock
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 06d8bb7dd557..82502a7ff6e4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -405,6 +405,7 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
   fmode_t flags, void *holder);
 struct btrfs_device *btrfs_scan_one_device(const char *path,
   fmode_t flags, void *holder);
+int btrfs_forget_devices(const char *path);
 int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
 void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step);
 void btrfs_assign_next_active_device(struct btrfs_fs_info *fs_info,
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 5ca1d21fc4a7..b1be7f828cb4 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -836,6 +836,8 @@ enum btrfs_err_code {
   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \
   struct btrfs_ioctl_vol_args)
+#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \
+  struct btrfs_ioctl_vol_args)
 /* trans start and trans end are dangerous, and only for
  * use by applications that know how to avoid the
  * resulting deadlocks
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH FIXED v7] btrfs-progs: add cli to forget one or all scanned devices

2018-07-17 Thread Anand Jain
This patch adds cli
  btrfs device forget [dev]
to remove the given device structure in the kernel if the device
is unmounted. If no argument is given it shall remove all stale
(device which are not mounted) from the kernel.

Signed-off-by: Anand Jain 
---
 cmds-device.c | 58 ++
 ioctl.h   |  2 ++
 2 files changed, 60 insertions(+)

diff --git a/cmds-device.c b/cmds-device.c
index 86459d1b9564..49cfd4b41adb 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -326,6 +326,63 @@ out:
return !!ret;
 }
 
+static const char * const cmd_device_forget_usage[] = {
+   "btrfs device forget []",
+   "Forget a stale device or all stale devices in btrfs.ko",
+   NULL
+};
+
+static int btrfs_forget_devices(char *path)
+{
+   struct btrfs_ioctl_vol_args args;
+   int ret;
+   int fd;
+
+   fd = open("/dev/btrfs-control", O_RDWR);
+   if (fd < 0)
+   return -errno;
+
+   memset(, 0, sizeof(args));
+   if (path)
+   strncpy_null(args.name, path);
+   ret = ioctl(fd, BTRFS_IOC_FORGET_DEV, );
+   if (ret)
+   ret = -errno;
+   close(fd);
+   return ret;
+}
+
+static int cmd_device_forget(int argc, char **argv)
+{
+   char *path;
+   int ret = 0;
+
+   if (check_argc_max(argc - optind, 1))
+   usage(cmd_device_forget_usage);
+
+   if (argc == 1) {
+   ret = btrfs_forget_devices(NULL);
+   if (ret)
+   error("Can't forget: %s", strerror(-ret));
+   return ret;
+   }
+
+   path = canonicalize_path(argv[1]);
+   if (!path) {
+   error("Could not canonicalize path '%s': %s",
+   argv[1], strerror(errno));
+   return -ENOENT;
+   }
+
+   ret  = btrfs_forget_devices(path);
+   if (ret)
+   error("Can't forget '%s': %s", path, strerror(-ret));
+
+   free(path);
+
+   return ret;
+}
+
 static const char * const cmd_device_ready_usage[] = {
"btrfs device ready ",
"Check device to see if it has all of its devices in cache for 
mounting",
@@ -601,6 +658,7 @@ const struct cmd_group device_cmd_group = {
CMD_ALIAS },
{ "remove", cmd_device_remove, cmd_device_remove_usage, NULL, 0 
},
{ "scan", cmd_device_scan, cmd_device_scan_usage, NULL, 0 },
+   { "forget", cmd_device_forget, cmd_device_forget_usage, NULL, 0 
},
{ "ready", cmd_device_ready, cmd_device_ready_usage, NULL, 0 },
{ "stats", cmd_device_stats, cmd_device_stats_usage, NULL, 0 },
{ "usage", cmd_device_usage,
diff --git a/ioctl.h b/ioctl.h
index 709e996f401c..e27d80e09392 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -721,6 +721,8 @@ static inline char *btrfs_err_str(enum btrfs_err_code 
err_code)
   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \
   struct btrfs_ioctl_vol_args)
+#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \
+  struct btrfs_ioctl_vol_args)
 /* trans start and trans end are dangerous, and only for
  * use by applications that know how to avoid the
  * resulting deadlocks
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7] Add cli and ioctl to forget scanned device(s)

2018-07-17 Thread Anand Jain



 Please ignore this. There is a line of code which is un-commit.
 I am sending this series again. Sorry for the noise.

Thanks, Anand



On 07/18/2018 11:07 AM, Anand Jain wrote:

v7:
  Availalbe for pull from
   btrfs-progs:
g...@github.com:asj/btrfs-progs.git forget
   btrfs.ko:
g...@github.com:asj/btrfs-devel.git misc-next-jul18

  Use struct btrfs_ioctl_vol_args (instead of struct
   btrfs_ioctl_vol_args_v2) as its inline with other ioctl
   btrfs-control
  The CLI usage remains same. However internally the ioctl flag is not
   required to delete all the unmounted devices. Instead leave
   btrfs_ioctl_vol_args::name NULL.

v6:
  Use the changed fn name btrfs_free_stale_devices().

  Change in title:
  Old v5:
  Cover-letter:
   [PATCH v5] Add cli and ioctl to ignore a scanned device
  Kernel:
   [PATCH v5] btrfs: introduce feature to ignore a btrfs device
  Progs:
   [PATCH v5] btrfs-progs: add 'btrfs device ignore' cli

v5:
   Adds feature to delete all stale devices
   Reuses btrfs_free_stale_devices() fn and so depends on the
 patch-set [1] in the ML.
   Uses struct btrfs_ioctl_vol_args_v2 instead of
 struct btrfs_ioctl_vol_args as arg
   Does the device path matching instead of btrfs_device matching
 (we won't delete the mounted device as btrfs_free_stale_devices()
 checks for it)
v4:
   No change. But as the ML thread may be confusing, so resend.
v3:
   No change. Send to correct ML.
v2:
   Accepts review from Nikolay, details are in the specific patch.
   Patch 1/2 is renamed from
 [PATCH 1/2] btrfs: refactor btrfs_free_stale_device() to get device list 
delete
   to
 [PATCH 1/2] btrfs: add function to device list delete

Adds cli and ioctl to forget a scanned device or forget all stale
devices in the kernel.

Anand Jain (1):
   btrfs: introduce feature to forget a btrfs device

  fs/btrfs/super.c   | 3 +++
  fs/btrfs/volumes.c | 9 +
  fs/btrfs/volumes.h | 1 +
  include/uapi/linux/btrfs.h | 2 ++
  4 files changed, 15 insertions(+)

Anand Jain (1):
   btrfs-progs: add cli to forget one or all scanned devices

  cmds-device.c | 58 ++
  ioctl.h   |  2 ++
  2 files changed, 60 insertions(+)


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v7] Add cli and ioctl to forget scanned device(s)

2018-07-17 Thread Anand Jain
v7:
 Availalbe for pull from
  btrfs-progs:
   g...@github.com:asj/btrfs-progs.git forget
  btrfs.ko:
   g...@github.com:asj/btrfs-devel.git misc-next-jul18

 Use struct btrfs_ioctl_vol_args (instead of struct
  btrfs_ioctl_vol_args_v2) as its inline with other ioctl
  btrfs-control
 The CLI usage remains same. However internally the ioctl flag is not
  required to delete all the unmounted devices. Instead leave
  btrfs_ioctl_vol_args::name NULL.

v6:
 Use the changed fn name btrfs_free_stale_devices().

 Change in title:
 Old v5:
 Cover-letter:
  [PATCH v5] Add cli and ioctl to ignore a scanned device
 Kernel:
  [PATCH v5] btrfs: introduce feature to ignore a btrfs device
 Progs:
  [PATCH v5] btrfs-progs: add 'btrfs device ignore' cli

v5:
  Adds feature to delete all stale devices
  Reuses btrfs_free_stale_devices() fn and so depends on the
patch-set [1] in the ML.
  Uses struct btrfs_ioctl_vol_args_v2 instead of
struct btrfs_ioctl_vol_args as arg
  Does the device path matching instead of btrfs_device matching
(we won't delete the mounted device as btrfs_free_stale_devices()
checks for it)
v4:
  No change. But as the ML thread may be confusing, so resend.
v3:
  No change. Send to correct ML.
v2:
  Accepts review from Nikolay, details are in the specific patch.
  Patch 1/2 is renamed from
[PATCH 1/2] btrfs: refactor btrfs_free_stale_device() to get device list 
delete
  to
[PATCH 1/2] btrfs: add function to device list delete

Adds cli and ioctl to forget a scanned device or forget all stale
devices in the kernel.

Anand Jain (1):
  btrfs: introduce feature to forget a btrfs device

 fs/btrfs/super.c   | 3 +++
 fs/btrfs/volumes.c | 9 +
 fs/btrfs/volumes.h | 1 +
 include/uapi/linux/btrfs.h | 2 ++
 4 files changed, 15 insertions(+)

Anand Jain (1):
  btrfs-progs: add cli to forget one or all scanned devices

 cmds-device.c | 58 ++
 ioctl.h   |  2 ++
 2 files changed, 60 insertions(+)

-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v7] btrfs-progs: add cli to forget one or all scanned devices

2018-07-17 Thread Anand Jain
This patch adds cli
  btrfs device forget [dev]
to remove the given device structure in the kernel if the device
is unmounted. If no argument is given it shall remove all stale
(device which are not mounted) from the kernel.

Signed-off-by: Anand Jain 
---
 cmds-device.c | 58 ++
 ioctl.h   |  2 ++
 2 files changed, 60 insertions(+)

diff --git a/cmds-device.c b/cmds-device.c
index 86459d1b9564..49cfd4b41adb 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -326,6 +326,63 @@ out:
return !!ret;
 }
 
+static const char * const cmd_device_forget_usage[] = {
+   "btrfs device forget []",
+   "Forget a stale device or all stale devices in btrfs.ko",
+   NULL
+};
+
+static int btrfs_forget_devices(char *path)
+{
+   struct btrfs_ioctl_vol_args args;
+   int ret;
+   int fd;
+
+   fd = open("/dev/btrfs-control", O_RDWR);
+   if (fd < 0)
+   return -errno;
+
+   memset(, 0, sizeof(args));
+   if (path)
+   strncpy_null(args.name, path);
+   ret = ioctl(fd, BTRFS_IOC_FORGET_DEV, );
+   if (ret)
+   ret = -errno;
+   close(fd);
+   return ret;
+}
+
+static int cmd_device_forget(int argc, char **argv)
+{
+   char *path;
+   int ret = 0;
+
+   if (check_argc_max(argc - optind, 1))
+   usage(cmd_device_forget_usage);
+
+   if (argc == 1) {
+   ret = btrfs_forget_devices(NULL);
+   if (ret)
+   error("Can't forget: %s", strerror(-ret));
+   return ret;
+   }
+
+   path = canonicalize_path(argv[1]);
+   if (!path) {
+   error("Could not canonicalize path '%s': %s",
+   argv[1], strerror(errno));
+   return -ENOENT;
+   }
+
+   ret  = btrfs_forget_devices(path);
+   if (ret)
+   error("Can't forget '%s': %s", path, strerror(-ret));
+
+   free(path);
+
+   return ret;
+}
+
 static const char * const cmd_device_ready_usage[] = {
"btrfs device ready ",
"Check device to see if it has all of its devices in cache for 
mounting",
@@ -601,6 +658,7 @@ const struct cmd_group device_cmd_group = {
CMD_ALIAS },
{ "remove", cmd_device_remove, cmd_device_remove_usage, NULL, 0 
},
{ "scan", cmd_device_scan, cmd_device_scan_usage, NULL, 0 },
+   { "forget", cmd_device_forget, cmd_device_forget_usage, NULL, 0 
},
{ "ready", cmd_device_ready, cmd_device_ready_usage, NULL, 0 },
{ "stats", cmd_device_stats, cmd_device_stats_usage, NULL, 0 },
{ "usage", cmd_device_usage,
diff --git a/ioctl.h b/ioctl.h
index 709e996f401c..e27d80e09392 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -721,6 +721,8 @@ static inline char *btrfs_err_str(enum btrfs_err_code 
err_code)
   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \
   struct btrfs_ioctl_vol_args)
+#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \
+  struct btrfs_ioctl_vol_args)
 /* trans start and trans end are dangerous, and only for
  * use by applications that know how to avoid the
  * resulting deadlocks
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v7] btrfs: introduce feature to forget a btrfs device

2018-07-17 Thread Anand Jain
Support for a new command 'btrfs dev forget [dev]' is proposed here,
to undo the effects of 'btrfs dev scan [dev]'. For this purpose,
this patch proposes to use ioctl #5 as it was empty.
IOW(BTRFS_IOCTL_MAGIC, 5, ..)
This patch adds new ioctl BTRFS_IOC_FORGET_DEV which can be sent from
the /dev/btrfs-control to forget one or all devices, (devices which are
not mounted) from the btrfs kernel.

The argument it takes is struct btrfs_ioctl_vol_args, and ::name can be
set to specify the device path. And all unmounted devices can be removed
from the kernel if no device path is provided.

Again, the devices are removed only if the relevant fsid aren't mounted.

Signed-off-by: Anand Jain 
---
 fs/btrfs/super.c   | 3 +++
 fs/btrfs/volumes.c | 9 +
 fs/btrfs/volumes.h | 1 +
 include/uapi/linux/btrfs.h | 2 ++
 4 files changed, 15 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d7a54c648c5f..196b0773eedb 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2249,6 +2249,9 @@ static long btrfs_control_ioctl(struct file *file, 
unsigned int cmd,
ret = PTR_ERR_OR_ZERO(device);
mutex_unlock(_mutex);
break;
+   case BTRFS_IOC_FORGET_DEV:
+   ret = btrfs_forget_devices(vol->name);
+   break;
case BTRFS_IOC_DEVICES_READY:
mutex_lock(_mutex);
device = btrfs_scan_one_device(vol->name, FMODE_READ,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 51451c0dd8f5..bc1e19663e81 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1208,6 +1208,15 @@ static int btrfs_read_disk_super(struct block_device 
*bdev, u64 bytenr,
return 0;
 }
 
+int btrfs_forget_devices(const char *path)
+{
+   mutex_lock(_mutex);
+   btrfs_free_stale_devices(path, NULL);
+   mutex_unlock(_mutex);
+
+   return 0;
+}
+
 /*
  * Look for a btrfs signature on a device. This may be called out of the mount 
path
  * and we are not allowed to call set_blocksize during the scan. The superblock
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 06d8bb7dd557..82502a7ff6e4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -405,6 +405,7 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
   fmode_t flags, void *holder);
 struct btrfs_device *btrfs_scan_one_device(const char *path,
   fmode_t flags, void *holder);
+int btrfs_forget_devices(const char *path);
 int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
 void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step);
 void btrfs_assign_next_active_device(struct btrfs_fs_info *fs_info,
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 5ca1d21fc4a7..b1be7f828cb4 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -836,6 +836,8 @@ enum btrfs_err_code {
   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \
   struct btrfs_ioctl_vol_args)
+#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \
+  struct btrfs_ioctl_vol_args)
 /* trans start and trans end are dangerous, and only for
  * use by applications that know how to avoid the
  * resulting deadlocks
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs check (not lowmem) and OOM-like hangs (4.17.6)

2018-07-17 Thread Marc MERLIN
On Wed, Jul 18, 2018 at 08:05:51AM +0800, Qu Wenruo wrote:
> No OOM triggers? That's a little strange.
> Maybe it's related to how kernel handles memory over-commit?
 
Yes, I think you are correct.

> And for the hang, I think it's related to some memory allocation failure
> and error handler just didn't handle it well, so it's causing deadlock
> for certain page.

That indeed matches what I'm seeing.

> ENOMEM handling is pretty common but hardly verified, so it's not that
> strange, but we must locate the problem.

I seem to be getting deadlocks in the kernel, so I'm hoping that at least
it's checked there, but maybe not?

> In my system, at least I'm not using btrfs as root fs, and for the
> memory eating program I normally ensure it's eating all the memory +
> swap, so OOM killer is always triggered, maybe that's the cause.
> 
> So in your case, maybe it's btrfs not really taking up all memory, thus
> OOM killer not triggered.

Correct, the swap is not used.

> Any kernel dmesg about OOM killer triggered?
 
Nothing at all. It never gets triggered.

> > Here is my system when it virtually died:
> > ER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> > root 31006 21.2 90.7 29639020 29623180 pts/19 D+ 13:49   1:35 ./btrfs 
> > check /dev/mapper/dshelf2

See how btrs was taking 29GB in that ps output (that's before it takes
everything and I can't even type ps anymore)
Note that VSZ is almost equal to RSS. Nothing gets swapped.

Then see free output:

> >  total   used   free sharedbuffers cached
> > Mem:  32643788   32180100 463688  0  44664 119508
> > -/+ buffers/cache:   32015928 627860
> > Swap: 15616764 443676   15173088
> 
> For swap, it looks like only some other program's memory is swapped out,
> not btrfs'.

That's exactly correct. btrfs check never goes to swap, I'm not sure why,
and because there is virtual memory free, maybe that's why OOM does not
trigger?
So I guess I can probably "fix" my problem by removing swap, but ultimately
it would be useful to know why memory taken by btrfs check does not end up
in swap.

> And unfortunately, I'm not so familiar with OOM/MM code outside of
> filesystem.
> Any help from other experienced developers would definitely help to
> solve why memory of 'btrfs check' is not swapped out or why OOM killer
> is not triggered.

Do you have someone from linux-vm you might be able to ask, or should we Cc
this thread there?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs check (not lowmem) and OOM-like hangs (4.17.6)

2018-07-17 Thread Qu Wenruo



On 2018年07月18日 04:59, Marc MERLIN wrote:
> Ok, I did more testing. Qu is right that btrfs check does not crash the 
> kernel.
> It just takes all the memory until linux hangs everywhere, and somehow (no 
> idea why) 
> the OOM killer never triggers.

No OOM triggers? That's a little strange.
Maybe it's related to how kernel handles memory over-commit?

And for the hang, I think it's related to some memory allocation failure
and error handler just didn't handle it well, so it's causing deadlock
for certain page.

ENOMEM handling is pretty common but hardly verified, so it's not that
strange, but we must locate the problem.

> Details below:
> 
> On Tue, Jul 17, 2018 at 01:32:57PM -0700, Marc MERLIN wrote:
>> Here is what I got when the system was not doing well (it took minutes to 
>> run):
>>
>>  total   used   free sharedbuffers cached
>> Mem:  32643788   32070952 572836  0 1021604378772
>> -/+ buffers/cache:   275900205053768
>> Swap: 15616764 973596   14643168
> 
> ok, the reason it was not that close to 0 was due to /dev/shm it seems.
> I cleared that, and now I can get it to go to near 0 again.
> I'm wrong about the system being fully crashed, it's not, it's just very
> close to being hung.
> I can type killall -9 btrfs in the serial console and wait a few minutes.
> The system eventually recovers, but it's impossible to fix anything via ssh 
> apparently because networking does not get to run when I'm in this state.
> 
> I'm not sure why my system reproduces this easy while Qu's system does not, 
> but Qu was right that the kernel is not dead and that it's merely a problem 
> of userspace
> taking all the RAM and somehow not being killed by OOM

In my system, at least I'm not using btrfs as root fs, and for the
memory eating program I normally ensure it's eating all the memory +
swap, so OOM killer is always triggered, maybe that's the cause.

So in your case, maybe it's btrfs not really taking up all memory, thus
OOM killer not triggered.

> 
> I checked the PID and don't see why it's not being killed:
> gargamel:/proc/31006# grep . oom*
> oom_adj:0
> oom_score:221   << this increases a lot, but OOM never kills it
> oom_score_adj:0
> 
> I have these variables:
> /proc/sys/vm/oom_dump_tasks:1
> /proc/sys/vm/oom_kill_allocating_task:0
> /proc/sys/vm/overcommit_kbytes:0
> /proc/sys/vm/overcommit_memory:0
> /proc/sys/vm/overcommit_ratio:50  << is this bad (seems default)

Any kernel dmesg about OOM killer triggered?

> 
> Here is my system when it virtually died:
> ER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> root 31006 21.2 90.7 29639020 29623180 pts/19 D+ 13:49   1:35 ./btrfs 
> check /dev/mapper/dshelf2
> 
>  total   used   free sharedbuffers cached
> Mem:  32643788   32180100 463688  0  44664 119508
> -/+ buffers/cache:   32015928 627860
> Swap: 15616764 443676   15173088

For swap, it looks like only some other program's memory is swapped out,
not btrfs'.

And unfortunately, I'm not so familiar with OOM/MM code outside of
filesystem.
Any help from other experienced developers would definitely help to
solve why memory of 'btrfs check' is not swapped out or why OOM killer
is not triggered.

Thanks,
Qu

> 
> MemTotal:   32643788 kB
> MemFree:  463440 kB
> MemAvailable:  44864 kB
> Buffers:   44664 kB
> Cached:   120360 kB
> SwapCached:87064 kB
> Active: 30381404 kB
> Inactive: 585952 kB
> Active(anon):   30334696 kB
> Inactive(anon):   474624 kB
> Active(file):  46708 kB
> Inactive(file):   111328 kB
> Unevictable:5616 kB
> Mlocked:5616 kB
> SwapTotal:  15616764 kB
> SwapFree:   15173088 kB
> Dirty:  1636 kB
> Writeback: 4 kB
> AnonPages:  30734240 kB
> Mapped:67236 kB
> Shmem:  3036 kB
> Slab: 267884 kB
> SReclaimable:  51528 kB
> SUnreclaim:   216356 kB
> KernelStack:   10144 kB
> PageTables:69284 kB
> NFS_Unstable:  0 kB
> Bounce:0 kB
> WritebackTmp:  0 kB
> CommitLimit:31938656 kB
> Committed_AS:   32865492 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:   0 kB
> VmallocChunk:  0 kB
> HardwareCorrupted: 0 kB
> AnonHugePages: 0 kB
> ShmemHugePages:0 kB
> ShmemPmdMapped:0 kB
> CmaTotal:  16384 kB
> CmaFree:   0 kB
> HugePages_Total:   0
> HugePages_Free:0
> HugePages_Rsvd:0
> HugePages_Surp:0
> Hugepagesize:   2048 kB
> Hugetlb:   0 kB
> DirectMap4k:  560404 kB
> DirectMap2M:32692224 kB
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH resend 1/2] btrfs: allow defrag on a file opened ro that has rw permissions

2018-07-17 Thread Adam Borowski
Requiring a rw descriptor conflicts both ways with exec, returning ETXTBSY
whenever you try to defrag a program that's currently being run, or
causing intermittent exec failures on a live system being defragged.

As defrag doesn't change the file's contents in any way, there's no reason
to consider it a rw operation.  Thus, let's check only whether the file
could have been opened rw.  Such access control is still needed as
currently defrag can use extra disk space, and might trigger bugs.

Signed-off-by: Adam Borowski 
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 43ecbe620dea..01c150b6ab62 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2941,7 +2941,8 @@ static int btrfs_ioctl_defrag(struct file *file, void 
__user *argp)
ret = btrfs_defrag_root(root);
break;
case S_IFREG:
-   if (!(file->f_mode & FMODE_WRITE)) {
+   if (!capable(CAP_SYS_ADMIN) &&
+   inode_permission(inode, MAY_WRITE)) {
ret = -EINVAL;
goto out;
}
-- 
2.18.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH resend 2/2] btrfs: defrag: return EPERM not EINVAL when only permissions fail

2018-07-17 Thread Adam Borowski
We give EINVAL when the request is invalid; here it's ok but merely the
user has insufficient privileges.  Thus, this return value reflects the
error better -- as discussed in the identical case for dedupe.

According to codesearch.debian.net, no userspace program distinguishes
these values beyond strerror().

Signed-off-by: Adam Borowski 
---
 fs/btrfs/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 01c150b6ab62..e96e3c3caca1 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2943,7 +2943,7 @@ static int btrfs_ioctl_defrag(struct file *file, void 
__user *argp)
case S_IFREG:
if (!capable(CAP_SYS_ADMIN) &&
inode_permission(inode, MAY_WRITE)) {
-   ret = -EINVAL;
+   ret = -EPERM;
goto out;
}
 
-- 
2.18.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH resend 0/2] btrfs: fix races between exec and defrag

2018-07-17 Thread Adam Borowski
Hi!
Here's a ping for a patch to fix ETXTBSY races between defrag and exec, just
like the dedupe counterpart.  Unlike that one which is shared to multiple
filesystems and thus lives in Al Viro's land, it is btrfs only.

Attached: a simple tool to fragment a file, by ten O_SYNC rewrites of length
1 at random positions; racey vs concurrent writes or execs but shouldn't
damage the file otherwise.

Also attached: a preliminary patch for -progs; it yet lacks a check for the
kernel version, but to add such a check we'd need to know which kernels
actually permit ro defrag for non-root.

No man page patch -- there's no man page to be patched...


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static void die(const char *txt, ...) __attribute__((format (printf, 1, 2)));
static void die(const char *txt, ...)
{
fprintf(stderr, "fragme: ");

va_list ap;
va_start(ap, txt);
vfprintf(stderr, txt, ap);
va_end(ap);

exit(1);
}

static uint64_t rnd(uint64_t max)
{
__uint128_t r;
if (syscall(SYS_getrandom, , sizeof(r), 0)==-1)
die("getrandom(): %m\n");
return r%max;
}

int main(int argc, char **argv)
{
if (argc!=2)
die("Usage: fragme \n");

int fd = open(argv[1], O_RDWR|O_SYNC);
if (fd == -1)
die("open(\"%s\"): %m\n", argv[1]);
off_t size = lseek(fd, 0, SEEK_END);
if (size == -1)
die("lseek(SEEK_END): %m\n");

for (int i=0; i<10; ++i)
{
off_t off = rnd(size);
char b;
if (lseek(fd, off, SEEK_SET) != off)
die("lseek for read: %m\n");
if (read(fd, , 1) != 1)
die("read(%lu): %m\n", off);
if (lseek(fd, off, SEEK_SET) != off)
die("lseek for write: %m\n");
if (write(fd, , 1) != 1)
die("write: %m\n");
}

return 0;
}
>From d040af09adb03daadbba4336700f40425a860320 Mon Sep 17 00:00:00 2001
From: Adam Borowski 
Date: Tue, 28 Nov 2017 01:00:21 +0100
Subject: [PATCH] defrag: open files RO

NOT FOR MERGING -- requires kernel versioning

Fixes EXTXBSY races.

Signed-off-by: Adam Borowski 
---
 cmds-filesystem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 30a50bf5..7eb6b7bb 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -876,7 +876,7 @@ static int defrag_callback(const char *fpath, const struct stat *sb,
 	if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
 		if (defrag_global_verbose)
 			printf("%s\n", fpath);
-		fd = open(fpath, O_RDWR);
+		fd = open(fpath, O_RDONLY);
 		if (fd < 0) {
 			goto error;
 		}
@@ -1012,7 +1012,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
 		int defrag_err = 0;
 
 		dirstream = NULL;
-		fd = open_file_or_dir(argv[i], );
+		fd = open_file_or_dir3(argv[i], , O_RDONLY);
 		if (fd < 0) {
 			error("cannot open %s: %m", argv[i]);
 			ret = -errno;
-- 
2.18.0



Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-17 Thread Duncan
Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
excerpted:

> On 07/15/2018 04:37 PM, waxhead wrote:

> Striping and mirroring/pairing are orthogonal properties; mirror and
> parity are mutually exclusive.

I can't agree.  I don't know whether you meant that in the global sense, 
or purely in the btrfs context (which I suspect), but either way I can't 
agree.

In the pure btrfs context, while striping and mirroring/pairing are 
orthogonal today, Hugo's whole point was that btrfs is theoretically 
flexible enough to allow both together and the feature may at some point 
be added, so it makes sense to have a layout notation format flexible 
enough to allow it as well.

In the global context, just to complete things and mostly for others 
reading as I feel a bit like a simpleton explaining to the expert here, 
just as raid10 is shorthand for raid1+0, aka raid0 layered on top of 
raid1 (normally preferred to raid01 due to rebuild characteristics, and 
as opposed to raid01, aka raid0+1, aka raid1 on top of raid0, sometimes 
recommended as btrfs raid1 on top of whatever raid0 here due to btrfs' 
data integrity characteristics and less optimized performance), so 
there's also raid51 and raid15, raid61 and raid16, etc, with or without 
the + symbols, involving mirroring and parity conceptually at two 
different levels altho they can be combined in a single implementation 
just as raid10 and raid01 commonly are.  These additional layered-raid 
levels can be used for higher reliability, with differing rebuild and 
performance characteristics between the two forms depending on which is 
the top layer.

> Question #1: for "parity" profiles, does make sense to limit the maximum
> disks number where the data may be spread ? If the answer is not, we
> could omit the last S. IMHO it should.

As someone else already replied, btrfs doesn't currently have the ability 
to specify spread limit, but the idea if we're going to change the 
notation is to allow for the flexibility in the new notation so the 
feature can be added later without further notation changes.

Why might it make sense to specify spread?  At least two possible reasons:

a) (stealing an already posted example) Consider a multi-device layout 
with two or more device sizes.  Someone may want to limit the spread in 
ordered to keep performance and risk consistent as the smaller devices 
fill up, limiting further usage to a lower number of devices.  If that 
lower number is specified as the spread originally it'll make things more 
consistent between the room on all devices case and the room on only some 
devices case.

b) Limiting spread can change the risk and rebuild performance profiles.  
Stripes of full width mean all stripes have a strip on each device, so 
knock a device out and (assuming parity or mirroring) replace it, and all 
stripes are degraded and must be rebuilt.  With less than maximum spread, 
some stripes won't be stripped to the replaced device, and won't be 
degraded or need rebuilt, tho assuming the same overall fill, a larger 
percentage of stripes that /do/ need rebuilt will be on the replaced 
device.  So the risk profile is more "objects" (stripes/chunks/files) 
affected but less of each object, or less of the total affected, but more 
of each affected object.

> Question #2: historically RAID10 is requires 4 disks. However I am
> guessing if the stripe could be done on a different number of disks:
> What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is
> that every 64k, the data are stored on a different disk

As someone else pointed out, md/lvm-raid10 already work like this.  What 
btrfs calls raid10 is somewhat different, but btrfs raid1 pretty much 
works this way except with huge (gig size) chunks.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs check (not lowmem) and OOM-like hangs (4.17.6)

2018-07-17 Thread Marc MERLIN
Ok, I did more testing. Qu is right that btrfs check does not crash the kernel.
It just takes all the memory until linux hangs everywhere, and somehow (no idea 
why) 
the OOM killer never triggers.
Details below:

On Tue, Jul 17, 2018 at 01:32:57PM -0700, Marc MERLIN wrote:
> Here is what I got when the system was not doing well (it took minutes to 
> run):
> 
>  total   used   free sharedbuffers cached
> Mem:  32643788   32070952 572836  0 1021604378772
> -/+ buffers/cache:   275900205053768
> Swap: 15616764 973596   14643168

ok, the reason it was not that close to 0 was due to /dev/shm it seems.
I cleared that, and now I can get it to go to near 0 again.
I'm wrong about the system being fully crashed, it's not, it's just very
close to being hung.
I can type killall -9 btrfs in the serial console and wait a few minutes.
The system eventually recovers, but it's impossible to fix anything via ssh 
apparently because networking does not get to run when I'm in this state.

I'm not sure why my system reproduces this easy while Qu's system does not, 
but Qu was right that the kernel is not dead and that it's merely a problem of 
userspace
taking all the RAM and somehow not being killed by OOM

I checked the PID and don't see why it's not being killed:
gargamel:/proc/31006# grep . oom*
oom_adj:0
oom_score:221   << this increases a lot, but OOM never kills it
oom_score_adj:0

I have these variables:
/proc/sys/vm/oom_dump_tasks:1
/proc/sys/vm/oom_kill_allocating_task:0
/proc/sys/vm/overcommit_kbytes:0
/proc/sys/vm/overcommit_memory:0
/proc/sys/vm/overcommit_ratio:50  << is this bad (seems default)

Here is my system when it virtually died:
ER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
root 31006 21.2 90.7 29639020 29623180 pts/19 D+ 13:49   1:35 ./btrfs check 
/dev/mapper/dshelf2

 total   used   free sharedbuffers cached
Mem:  32643788   32180100 463688  0  44664 119508
-/+ buffers/cache:   32015928 627860
Swap: 15616764 443676   15173088

MemTotal:   32643788 kB
MemFree:  463440 kB
MemAvailable:  44864 kB
Buffers:   44664 kB
Cached:   120360 kB
SwapCached:87064 kB
Active: 30381404 kB
Inactive: 585952 kB
Active(anon):   30334696 kB
Inactive(anon):   474624 kB
Active(file):  46708 kB
Inactive(file):   111328 kB
Unevictable:5616 kB
Mlocked:5616 kB
SwapTotal:  15616764 kB
SwapFree:   15173088 kB
Dirty:  1636 kB
Writeback: 4 kB
AnonPages:  30734240 kB
Mapped:67236 kB
Shmem:  3036 kB
Slab: 267884 kB
SReclaimable:  51528 kB
SUnreclaim:   216356 kB
KernelStack:   10144 kB
PageTables:69284 kB
NFS_Unstable:  0 kB
Bounce:0 kB
WritebackTmp:  0 kB
CommitLimit:31938656 kB
Committed_AS:   32865492 kB
VmallocTotal:   34359738367 kB
VmallocUsed:   0 kB
VmallocChunk:  0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages:0 kB
ShmemPmdMapped:0 kB
CmaTotal:  16384 kB
CmaFree:   0 kB
HugePages_Total:   0
HugePages_Free:0
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:   2048 kB
Hugetlb:   0 kB
DirectMap4k:  560404 kB
DirectMap2M:32692224 kB


-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs check (not lowmem) and OOM-like hangs (4.17.6)

2018-07-17 Thread Marc MERLIN
On Tue, Jul 17, 2018 at 10:50:32AM -0700, Marc MERLIN wrote:
> I got the following on 4.17.6 while running btrfs check --repair on an
> unmounted filesystem (not the lowmem version)
> 
> I understand that btrfs check is userland only, although it seems that
> it caused these FS hangs on a different filesystem (the trace of course
> does not provide info on which FS)
> 
> Any idea what happened here?
> I'm going to wait a few hours without running btrfs check to see if it
> happens again and then if running btrfs check will re-create this issue,
> but other suggestions (if any), are welcome:

Hi Qu, I know we were talking about this last week and then, btrfs check
just worked for me so I wasn't able to reproduce.
Now I'm able to reproduce again.

I tried again, it's definitely triggered by btrfs check --repair

I tried to capture what happens, and memory didn't dip to 0, but the system
got very slow and things started failing.
btrfs was never killed though while ssh was.
Is there a chance that maybe btrfs is in some kernel OOM exclude list?

Here is what I got when the system was not doing well (it took minutes to run):

 total   used   free sharedbuffers cached
Mem:  32643788   32070952 572836  0 1021604378772
-/+ buffers/cache:   275900205053768
Swap: 15616764 973596   14643168

gargamel:~# cat /proc/meminfo
MemTotal:   32643788 kB
MemFree: 2726276 kB
MemAvailable:2502200 kB
Buffers:   12360 kB
Cached:  1676388 kB
SwapCached: 11048580 kB
Active: 16443004 kB
Inactive:   12010456 kB
Active(anon):   16287780 kB
Inactive(anon): 11651692 kB
Active(file): 155224 kB
Inactive(file):   358764 kB
Unevictable:5776 kB
Mlocked:5776 kB
SwapTotal:  15616764 kB
SwapFree: 294592 kB
Dirty:  3032 kB
Writeback: 76064 kB
AnonPages:  15723272 kB
Mapped:   612124 kB
Shmem:   1171032 kB
Slab: 399824 kB
SReclaimable:  84568 kB
SUnreclaim:   315256 kB
KernelStack:   20576 kB
PageTables:94268 kB
NFS_Unstable:  0 kB
Bounce:0 kB
WritebackTmp:  0 kB
CommitLimit:31938656 kB
Committed_AS:   37909452 kB
VmallocTotal:   34359738367 kB
VmallocUsed:   0 kB
VmallocChunk:  0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 98304 kB
ShmemHugePages:0 kB
ShmemPmdMapped:0 kB
CmaTotal:  16384 kB
CmaFree:   0 kB
HugePages_Total:   0
HugePages_Free:0
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:   2048 kB
Hugetlb:   0 kB
DirectMap4k:  355604 kB
DirectMap2M:32897024 kB

and console:
[ 9184.345329] INFO: task zmtrigger.pl:9981 blocked for more than 120 seconds.
[ 9184.366258]   Not tainted 4.17.6-amd64-preempt-sysrq-20180818 #4
[ 9184.385323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 9184.408803] zmtrigger.plD0  9981   9804 0x20020080
[ 9184.425249] Call Trace:
[ 9184.432580]  ? __schedule+0x53e/0x59b
[ 9184.443551]  schedule+0x7f/0x98
[ 9184.452960]  io_schedule+0x16/0x38
[ 9184.463154]  wait_on_page_bit_common+0x10c/0x199
[ 9184.476996]  ? file_check_and_advance_wb_err+0xd7/0xd7
[ 9184.493339]  shmem_getpage_gfp+0x2dd/0x975
[ 9184.506558]  shmem_fault+0x188/0x1c3
[ 9184.518199]  ? filemap_map_pages+0x6f/0x295
[ 9184.531680]  __do_fault+0x1d/0x6e
[ 9184.542505]  __handle_mm_fault+0x675/0xa61
[ 9184.555653]  ? list_move+0x21/0x3a
[ 9184.566737]  handle_mm_fault+0x11c/0x16b
[ 9184.579355]  __do_page_fault+0x324/0x41c
[ 9184.591996]  ? page_fault+0x8/0x30
[ 9184.603059]  page_fault+0x1e/0x30
[ 9184.613846] RIP: 0023:0xf7d2d022
[ 9184.624366] RSP: 002b:ffeb9fe8 EFLAGS: 00010202
[ 9184.640868] RAX: f7eed000 RBX: 567e6000 RCX: 0004
[ 9184.663095] RDX: 587fecb0 RSI: 5876538c RDI: 0004
[ 9184.685308] RBP: 58185160 R08:  R09: 
[ 9184.707524] R10:  R11: 0286 R12: 
[ 9184.729757] R13:  R14:  R15: 
[ 9184.751988] INFO: task /usr/sbin/apach:11868 blocked for more than 120 
seconds.
[ 9184.775106]   Not tainted 4.17.6-amd64-preempt-sysrq-20180818 #4
[ 9184.795072] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 9184.819423] /usr/sbin/apach D0 11868  11311 0x20020080
[ 9184.836748] Call Trace:
[ 9184.844926]  ? __schedule+0x53e/0x59b
[ 9184.856811]  schedule+0x7f/0x98
[ 9184.867075]  io_schedule+0x16/0x38
[ 9184.878114]  wait_on_page_bit_common+0x10c/0x199
[ 9184.892807]  ? file_check_and_advance_wb_err+0xd7/0xd7
[ 9184.909036]  shmem_getpage_gfp+0x2dd/0x975
[ 9184.922157]  shmem_fault+0x188/0x1c3
[ 9184.933667]  ? filemap_map_pages+0x6f/0x295
[ 9184.947504]  __do_fault+0x1d/0x6e
[ 9184.958234]  __handle_mm_fault+0x675/0xa61
[ 

Re: [RESEND][PATCH v4 0/2] vfs: better dedupe permission check

2018-07-17 Thread Mark Fasheh
CCing Michael Kerrisk and linux-api 

The patch at the end of this e-mail updates our man page for
ioctl_fideduperange to reflect the changes proposed in this patch series:

https://marc.info/?l=linux-fsdevel=153185457324037=2

Any feedback would be appreciated. Thanks in advance.
--Mark

On Tue, Jul 17, 2018 at 12:48:18PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 17, 2018 at 12:09:04PM -0700, Mark Fasheh wrote:
> > Hi Al,
> > 
> > The following patches fix a couple of issues with the permission check
> > we do in vfs_dedupe_file_range(). I sent them out for a few times now,
> > a changelog is attached. If they look ok to you, I'd appreciate them
> > being pushed upstream.
> > 
> > You can get them from git if you like:
> > 
> > git pull https://github.com/markfasheh/linux dedupe-perms
> > 
> > I also have a set of patches against 4.17 if you prefer. The code and
> > testing are identical:
> > 
> > git pull https://github.com/markfasheh/linux dedupe-perms-v4.17
> > 
> > 
> > The first patch expands our check to allow dedupe of a file if the
> > user owns it or otherwise would be allowed to write to it.
> > 
> > Current behavior is that we'll allow dedupe only if:
> > 
> > - the user is an admin (root)
> > - the user has the file open for write
> > 
> > This makes it impossible for a user to dedupe their own file set
> > unless they do it as root, or ensure that all files have write
> > permission. There's a couple of duperemove bugs open for this:
> > 
> > https://github.com/markfasheh/duperemove/issues/129
> > https://github.com/markfasheh/duperemove/issues/86
> > 
> > The other problem we have is also related to forcing the user to open
> > target files for write - A process trying to exec a file currently
> > being deduped gets ETXTBUSY. The answer (as above) is to allow them to
> > open the targets ro - root can already do this. There was a patch from
> > Adam Borowski to fix this back in 2016:
> > 
> > https://lkml.org/lkml/2016/7/17/130
> > 
> > which I have incorporated into my changes.
> > 
> > 
> > The 2nd patch fixes our return code for permission denied to be
> > EPERM. For some reason we're returning EINVAL - I think that's
> > probably my fault. At any rate, we need to be returning something
> > descriptive of the actual problem, otherwise callers see EINVAL and
> > can't really make a valid determination of what's gone wrong.
> > 
> > This has also popped up in duperemove, mostly in the form of cryptic
> > error messages. Because this is a code returned to userspace, I did
> > check the other users of extent-same that I could find. Both 'bees'
> > and 'rust-btrfs' do the same as duperemove and simply report the error
> > (as they should).
> > 
> > Please apply.
> > 
> > Thanks,
> >   --Mark
> > 
> > Changes from V3 to V4:
> > - Add a patch (below) to ioctl_fideduperange.2 explaining our
> >   changes. I will send this patch once the kernel update is
> >   accepted. Thanks to Darrick Wong for this suggestion.
> > - V3 discussion: https://www.spinics.net/lists/linux-btrfs/msg79135.html
> > 
> > Changes from V2 to V3:
> > - Return bool from allow_file_dedupe
> > - V2 discussion: https://www.spinics.net/lists/linux-btrfs/msg78421.html
> > 
> > Changes from V1 to V2:
> > - Add inode_permission check as suggested by Adam Borowski
> > - V1 discussion: https://marc.info/?l=linux-xfs=152606684017965=2
> > 
> > 
> > From: Mark Fasheh 
> > 
> > [PATCH] ioctl_fideduperange.2: clarify permission requirements
> > 
> > dedupe permission checks were recently relaxed - update our man page to
> > reflect those changes.
> > 
> > Signed-off-by: Mark Fasheh 
> > ---
> >  man2/ioctl_fideduperange.2 | 8 +---
> 
> Mmm, man page update, thank you for editing the documentation too!
> 
> Please cc linux-api and Michael Kerrisk so this can go upstream.


From: Mark Fasheh 

[PATCH] ioctl_fideduperange.2: clarify permission requirements

dedupe permission checks were recently relaxed - update our man page to
reflect those changes.

Signed-off-by: Mark Fasheh 
---
 man2/ioctl_fideduperange.2 | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/man2/ioctl_fideduperange.2 b/man2/ioctl_fideduperange.2
index 84d20a276..4040ee064 100644
--- a/man2/ioctl_fideduperange.2
+++ b/man2/ioctl_fideduperange.2
@@ -105,9 +105,12 @@ The field
 must be zero.
 During the call,
 .IR src_fd
-must be open for reading and
+must be open for reading.
 .IR dest_fd
-must be open for writing.
+can be open for writing, or reading.
+If
+.IR dest_fd
+is open for reading, the user must have write access to the file.
 The combined size of the struct
 .IR file_dedupe_range
 and the struct
@@ -185,8 +188,8 @@ This can appear if the filesystem does not support 
deduplicating either file
 descriptor, or if either file descriptor refers to special inodes.
 .TP
 .B EPERM
-.IR dest_fd
-is immutable.
+This will be returned if the user lacks permission to dedupe the file 
referenced by
+.IR dest_fd .
 .TP
 .B ETXTBSY
 

Re: [RESEND][PATCH v4 0/2] vfs: better dedupe permission check

2018-07-17 Thread Mark Fasheh
On Tue, Jul 17, 2018 at 12:48:18PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 17, 2018 at 12:09:04PM -0700, Mark Fasheh wrote:
> > From: Mark Fasheh 
> > 
> > [PATCH] ioctl_fideduperange.2: clarify permission requirements
> > 
> > dedupe permission checks were recently relaxed - update our man page to
> > reflect those changes.
> > 
> > Signed-off-by: Mark Fasheh 
> > ---
> >  man2/ioctl_fideduperange.2 | 8 +---
> 
> Mmm, man page update, thank you for editing the documentation too!

No problem, thanks for suggesting I do it.


> Please cc linux-api and Michael Kerrisk so this can go upstream.

Will do. The changes you point out below all look good to me so I'll
incorporate them and send an updated version to the appropriate folks.

Thanks
--Mark

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RESEND][PATCH v4 0/2] vfs: better dedupe permission check

2018-07-17 Thread Darrick J. Wong
On Tue, Jul 17, 2018 at 12:09:04PM -0700, Mark Fasheh wrote:
> Hi Al,
> 
> The following patches fix a couple of issues with the permission check
> we do in vfs_dedupe_file_range(). I sent them out for a few times now,
> a changelog is attached. If they look ok to you, I'd appreciate them
> being pushed upstream.
> 
> You can get them from git if you like:
> 
> git pull https://github.com/markfasheh/linux dedupe-perms
> 
> I also have a set of patches against 4.17 if you prefer. The code and
> testing are identical:
> 
> git pull https://github.com/markfasheh/linux dedupe-perms-v4.17
> 
> 
> The first patch expands our check to allow dedupe of a file if the
> user owns it or otherwise would be allowed to write to it.
> 
> Current behavior is that we'll allow dedupe only if:
> 
> - the user is an admin (root)
> - the user has the file open for write
> 
> This makes it impossible for a user to dedupe their own file set
> unless they do it as root, or ensure that all files have write
> permission. There's a couple of duperemove bugs open for this:
> 
> https://github.com/markfasheh/duperemove/issues/129
> https://github.com/markfasheh/duperemove/issues/86
> 
> The other problem we have is also related to forcing the user to open
> target files for write - A process trying to exec a file currently
> being deduped gets ETXTBUSY. The answer (as above) is to allow them to
> open the targets ro - root can already do this. There was a patch from
> Adam Borowski to fix this back in 2016:
> 
> https://lkml.org/lkml/2016/7/17/130
> 
> which I have incorporated into my changes.
> 
> 
> The 2nd patch fixes our return code for permission denied to be
> EPERM. For some reason we're returning EINVAL - I think that's
> probably my fault. At any rate, we need to be returning something
> descriptive of the actual problem, otherwise callers see EINVAL and
> can't really make a valid determination of what's gone wrong.
> 
> This has also popped up in duperemove, mostly in the form of cryptic
> error messages. Because this is a code returned to userspace, I did
> check the other users of extent-same that I could find. Both 'bees'
> and 'rust-btrfs' do the same as duperemove and simply report the error
> (as they should).
> 
> Please apply.
> 
> Thanks,
>   --Mark
> 
> Changes from V3 to V4:
> - Add a patch (below) to ioctl_fideduperange.2 explaining our
>   changes. I will send this patch once the kernel update is
>   accepted. Thanks to Darrick Wong for this suggestion.
> - V3 discussion: https://www.spinics.net/lists/linux-btrfs/msg79135.html
> 
> Changes from V2 to V3:
> - Return bool from allow_file_dedupe
> - V2 discussion: https://www.spinics.net/lists/linux-btrfs/msg78421.html
> 
> Changes from V1 to V2:
> - Add inode_permission check as suggested by Adam Borowski
> - V1 discussion: https://marc.info/?l=linux-xfs=152606684017965=2
> 
> 
> From: Mark Fasheh 
> 
> [PATCH] ioctl_fideduperange.2: clarify permission requirements
> 
> dedupe permission checks were recently relaxed - update our man page to
> reflect those changes.
> 
> Signed-off-by: Mark Fasheh 
> ---
>  man2/ioctl_fideduperange.2 | 8 +---

Mmm, man page update, thank you for editing the documentation too!

Please cc linux-api and Michael Kerrisk so this can go upstream.

>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/man2/ioctl_fideduperange.2 b/man2/ioctl_fideduperange.2
> index 84d20a276..7dea0323d 100644
> --- a/man2/ioctl_fideduperange.2
> +++ b/man2/ioctl_fideduperange.2
> @@ -105,9 +105,11 @@ The field
>  must be zero.
>  During the call,
>  .IR src_fd
> -must be open for reading and
> +must be open for reading.
>  .IR dest_fd
> -must be open for writing.
> +can be open for writing, or reading. If

Manpages usually start each new sentence on its own line (though I defer
to mkerrisk on that).

> +.IR dest_fd
> +is open for reading, the user should be have write access to the file.

"...the user must have write access..."

>  The combined size of the struct
>  .IR file_dedupe_range
>  and the struct
> @@ -185,8 +187,8 @@ This can appear if the filesystem does not support 
> deduplicating either file
>  descriptor, or if either file descriptor refers to special inodes.
>  .TP
>  .B EPERM
> +This will be returned if the user lacks permission to dedupe the file 
> referenced by
>  .IR dest_fd
> -is immutable.

(Did the period fall off the end of the sentence here?  I am bad at
reading manpage markup...)

--D

>  .TP
>  .B ETXTBSY
>  One of the files is a swap file.
> -- 
> 2.15.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 1/2] vfs: allow dedupe of user owned read-only files

2018-07-17 Thread Mark Fasheh
The permission check in vfs_dedupe_file_range() is too coarse - We
only allow dedupe of the destination file if the user is root, or
they have the file open for write.

This effectively limits a non-root user from deduping their own read-only
files. In addition, the write file descriptor that the user is forced to
hold open can prevent execution of files. As file data during a dedupe
does not change, the behavior is unexpected and this has caused a number of
issue reports. For an example, see:

https://github.com/markfasheh/duperemove/issues/129

So change the check so we allow dedupe on the target if:

- the root or admin is asking for it
- the process has write access
- the owner of the file is asking for the dedupe
- the process could get write access

That way users can open read-only and still get dedupe.

Signed-off-by: Mark Fasheh 
Acked-by: Darrick J. Wong 
---
 fs/read_write.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index e83bd9744b5d..71e9077f8bc1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1964,6 +1964,20 @@ int vfs_dedupe_file_range_compare(struct inode *src, 
loff_t srcoff,
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
 
+/* Check whether we are allowed to dedupe the destination file */
+static bool allow_file_dedupe(struct file *file)
+{
+   if (capable(CAP_SYS_ADMIN))
+   return true;
+   if (file->f_mode & FMODE_WRITE)
+   return true;
+   if (uid_eq(current_fsuid(), file_inode(file)->i_uid))
+   return true;
+   if (!inode_permission(file_inode(file), MAY_WRITE))
+   return true;
+   return false;
+}
+
 int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 {
struct file_dedupe_range_info *info;
@@ -1972,7 +1986,6 @@ int vfs_dedupe_file_range(struct file *file, struct 
file_dedupe_range *same)
u64 len;
int i;
int ret;
-   bool is_admin = capable(CAP_SYS_ADMIN);
u16 count = same->dest_count;
struct file *dst_file;
loff_t dst_off;
@@ -2036,7 +2049,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
file_dedupe_range *same)
 
if (info->reserved) {
info->status = -EINVAL;
-   } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
+   } else if (!allow_file_dedupe(dst_file)) {
info->status = -EINVAL;
} else if (file->f_path.mnt != dst_file->f_path.mnt) {
info->status = -EXDEV;
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-07-17 Thread Mark Fasheh
Right now we return EINVAL if a process does not have permission to dedupe a
file. This was an oversight on my part. EPERM gives a true description of
the nature of our error, and EINVAL is already used for the case that the
filesystem does not support dedupe.

Signed-off-by: Mark Fasheh 
Reviewed-by: Darrick J. Wong 
Acked-by: David Sterba 
---
 fs/read_write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 71e9077f8bc1..7188982e2733 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -2050,7 +2050,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
file_dedupe_range *same)
if (info->reserved) {
info->status = -EINVAL;
} else if (!allow_file_dedupe(dst_file)) {
-   info->status = -EINVAL;
+   info->status = -EPERM;
} else if (file->f_path.mnt != dst_file->f_path.mnt) {
info->status = -EXDEV;
} else if (S_ISDIR(dst->i_mode)) {
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RESEND][PATCH v4 0/2] vfs: better dedupe permission check

2018-07-17 Thread Mark Fasheh
Hi Al,

The following patches fix a couple of issues with the permission check
we do in vfs_dedupe_file_range(). I sent them out for a few times now,
a changelog is attached. If they look ok to you, I'd appreciate them
being pushed upstream.

You can get them from git if you like:

git pull https://github.com/markfasheh/linux dedupe-perms

I also have a set of patches against 4.17 if you prefer. The code and
testing are identical:

git pull https://github.com/markfasheh/linux dedupe-perms-v4.17


The first patch expands our check to allow dedupe of a file if the
user owns it or otherwise would be allowed to write to it.

Current behavior is that we'll allow dedupe only if:

- the user is an admin (root)
- the user has the file open for write

This makes it impossible for a user to dedupe their own file set
unless they do it as root, or ensure that all files have write
permission. There's a couple of duperemove bugs open for this:

https://github.com/markfasheh/duperemove/issues/129
https://github.com/markfasheh/duperemove/issues/86

The other problem we have is also related to forcing the user to open
target files for write - A process trying to exec a file currently
being deduped gets ETXTBUSY. The answer (as above) is to allow them to
open the targets ro - root can already do this. There was a patch from
Adam Borowski to fix this back in 2016:

https://lkml.org/lkml/2016/7/17/130

which I have incorporated into my changes.


The 2nd patch fixes our return code for permission denied to be
EPERM. For some reason we're returning EINVAL - I think that's
probably my fault. At any rate, we need to be returning something
descriptive of the actual problem, otherwise callers see EINVAL and
can't really make a valid determination of what's gone wrong.

This has also popped up in duperemove, mostly in the form of cryptic
error messages. Because this is a code returned to userspace, I did
check the other users of extent-same that I could find. Both 'bees'
and 'rust-btrfs' do the same as duperemove and simply report the error
(as they should).

Please apply.

Thanks,
  --Mark

Changes from V3 to V4:
- Add a patch (below) to ioctl_fideduperange.2 explaining our
  changes. I will send this patch once the kernel update is
  accepted. Thanks to Darrick Wong for this suggestion.
- V3 discussion: https://www.spinics.net/lists/linux-btrfs/msg79135.html

Changes from V2 to V3:
- Return bool from allow_file_dedupe
- V2 discussion: https://www.spinics.net/lists/linux-btrfs/msg78421.html

Changes from V1 to V2:
- Add inode_permission check as suggested by Adam Borowski
- V1 discussion: https://marc.info/?l=linux-xfs=152606684017965=2


From: Mark Fasheh 

[PATCH] ioctl_fideduperange.2: clarify permission requirements

dedupe permission checks were recently relaxed - update our man page to
reflect those changes.

Signed-off-by: Mark Fasheh 
---
 man2/ioctl_fideduperange.2 | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/man2/ioctl_fideduperange.2 b/man2/ioctl_fideduperange.2
index 84d20a276..7dea0323d 100644
--- a/man2/ioctl_fideduperange.2
+++ b/man2/ioctl_fideduperange.2
@@ -105,9 +105,11 @@ The field
 must be zero.
 During the call,
 .IR src_fd
-must be open for reading and
+must be open for reading.
 .IR dest_fd
-must be open for writing.
+can be open for writing, or reading. If
+.IR dest_fd
+is open for reading, the user should be have write access to the file.
 The combined size of the struct
 .IR file_dedupe_range
 and the struct
@@ -185,8 +187,8 @@ This can appear if the filesystem does not support 
deduplicating either file
 descriptor, or if either file descriptor refers to special inodes.
 .TP
 .B EPERM
+This will be returned if the user lacks permission to dedupe the file 
referenced by
 .IR dest_fd
-is immutable.
 .TP
 .B ETXTBSY
 One of the files is a swap file.
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Healthy amount of free space?

2018-07-17 Thread Martin Steigerwald
Nikolay Borisov - 17.07.18, 10:16:
> On 17.07.2018 11:02, Martin Steigerwald wrote:
> > Nikolay Borisov - 17.07.18, 09:20:
> >> On 16.07.2018 23:58, Wolf wrote:
> >>> Greetings,
> >>> I would like to ask what what is healthy amount of free space to
> >>> keep on each device for btrfs to be happy?
> >>> 
> >>> This is how my disk array currently looks like
> >>> 
> >>> [root@dennas ~]# btrfs fi usage /raid
> >>> 
> >>> Overall:
> >>> Device size:  29.11TiB
> >>> Device allocated: 21.26TiB
> >>> Device unallocated:7.85TiB
> >>> Device missing:  0.00B
> >>> Used: 21.18TiB
> >>> Free (estimated):  3.96TiB  (min: 3.96TiB)
> >>> Data ratio:   2.00
> >>> Metadata ratio:   2.00
> >>> Global reserve:  512.00MiB  (used: 0.00B)
> > 
> > […]
> > 
> >>> Btrfs does quite good job of evenly using space on all devices.
> >>> No,
> >>> how low can I let that go? In other words, with how much space
> >>> free/unallocated remaining space should I consider adding new
> >>> disk?
> >> 
> >> Btrfs will start running into problems when you run out of
> >> unallocated space. So the best advice will be monitor your device
> >> unallocated, once it gets really low - like 2-3 gb I will suggest
> >> you run balance which will try to free up unallocated space by
> >> rewriting data more compactly into sparsely populated block
> >> groups. If after running balance you haven't really freed any
> >> space then you should consider adding a new drive and running
> >> balance to even out the spread of data/metadata.
> > 
> > What are these issues exactly?
> 
> For example if you have plenty of data space but your metadata is full
> then you will be getting ENOSPC.

Of that one I am aware.

This just did not happen so far.

I did not yet add it explicitly to the training slides, but I just make 
myself a note to do that.

Anything else?

> > I have
> > 
> > % btrfs fi us -T /home
> > 
> > Overall:
> > Device size: 340.00GiB
> > Device allocated:340.00GiB
> > Device unallocated:2.00MiB
> > Device missing:  0.00B
> > Used:308.37GiB
> > Free (estimated): 14.65GiB  (min: 14.65GiB)
> > Data ratio:   2.00
> > Metadata ratio:   2.00
> > Global reserve:  512.00MiB  (used: 0.00B)
> > 
> >   Data  Metadata System
> > 
> > Id Path   RAID1 RAID1RAID1Unallocated
> > -- -- -   ---
> > 
> >  1 /dev/mapper/msata-home 165.89GiB  4.08GiB 32.00MiB 1.00MiB
> >  2 /dev/mapper/sata-home  165.89GiB  4.08GiB 32.00MiB 1.00MiB
> > 
> > -- -- -   ---
> > 
> >Total  165.89GiB  4.08GiB 32.00MiB 2.00MiB
> >Used   151.24GiB  2.95GiB 48.00KiB
>
> You already have only 33% of your metadata full so if your workload
> turned out to actually be making more metadata-heavy changed i.e
> snapshots you could exhaust this and get ENOSPC, despite having around
> 14gb of free data space. Furthermore this data space is spread around
> multiple data chunks, depending on how populated they are a balance
> could be able to free up unallocated space which later could be
> re-purposed for metadata (again, depending on what you are doing).

The filesystem above IMO is not fit for snapshots. It would fill up 
rather quickly, I think even when I balance metadata. Actually I tried 
this and as I remember it took at most a day until it was full.

If I read above figures currently at maximum I could gain one additional 
GiB by balancing metadata. That would not make a huge difference.

I bet I am already running this filesystem beyond recommendation, as I 
bet many would argue it is to full already for regular usage… I do not 
see the benefit of squeezing the last free space out of it just to fit 
in another GiB.

So I still do not get the point why it would make sense to balance it at 
this point in time. Especially as this 1 GiB I could regain is not even 
needed. And I do not see the point of balancing it weekly. I would 
regain about 1 GiB of metadata space every now and then, but the cost 
would be a lot of additional I/O to the SSD. They still take it very 
nicely so far, but I think, right now, there is simply no point in 
balancing, at least not regularly, unless…

there would be an performance gain. Whenever I balanced a complete 
filesystem with data and metadata I however saw a cross drop in 
performance, like doubling the boot time for example (no scientific 
measurement, just my personal observation). I admit I did not do this 
for a long time and the balancing 

task btrfs-transacti:921 blocked for more than 120 seconds during check repair

2018-07-17 Thread Marc MERLIN
I got the following on 4.17.6 while running btrfs check --repair on an
unmounted filesystem (not the lowmem version)

I understand that btrfs check is userland only, although it seems that
it caused these FS hangs on a different filesystem (the trace of course
does not provide info on which FS)

Any idea what happened here?
I'm going to wait a few hours without running btrfs check to see if it
happens again and then if running btrfs check will re-create this issue,
but other suggestions (if any), are welcome:

[ 2538.566952] Workqueue: btrfs-endio-write btrfs_endio_write_helper
[ 2538.616484] Call Trace:
[ 2538.623828]  ? __schedule+0x53e/0x59b
[ 2538.634802]  schedule+0x7f/0x98
[ 2538.644214]  wait_current_trans+0x9b/0xd8
[ 2538.656229]  ? add_wait_queue+0x3a/0x3a
[ 2538.668239]  start_transaction+0x1ce/0x325
[ 2538.680556]  btrfs_finish_ordered_io+0x240/0x5d3
[ 2538.694414]  normal_work_helper+0x118/0x277
[ 2538.706984]  process_one_work+0x19c/0x281
[ 2538.719036]  ? rescuer_thread+0x279/0x279
[ 2538.731064]  worker_thread+0x197/0x246
[ 2538.742322]  kthread+0xeb/0xf0
[ 2538.751492]  ? kthread_create_worker_on_cpu+0x66/0x66
[ 2538.76]  ret_from_fork+0x35/0x40
[ 2538.777403] INFO: task kworker/u16:11:369 blocked for more than 120 seconds.
[ 2538.799025]   Not tainted 4.17.6-amd64-preempt-sysrq-20180818 #4
[ 2538.818109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2538.841640] kworker/u16:11  D0   369  2 0x8000
[ 2538.858112] Workqueue: btrfs-endio-write btrfs_endio_write_helper
[ 2538.876401] Call Trace:
[ 2538.883770]  ? __schedule+0x53e/0x59b
[ 2538.894760]  schedule+0x7f/0x98
[ 2538.904192]  wait_current_trans+0x9b/0xd8
[ 2538.916242]  ? add_wait_queue+0x3a/0x3a
[ 2538.927772]  start_transaction+0x1ce/0x325
[ 2538.940081]  btrfs_finish_ordered_io+0x240/0x5d3
[ 2538.953973]  normal_work_helper+0x118/0x277
[ 2538.966523]  process_one_work+0x19c/0x281
[ 2538.978546]  ? rescuer_thread+0x279/0x279
[ 2538.990560]  worker_thread+0x197/0x246
[ 2539.001797]  kthread+0xeb/0xf0
[ 2539.010986]  ? kthread_create_worker_on_cpu+0x66/0x66
[ 2539.026137]  ret_from_fork+0x35/0x40
[ 2539.037666] INFO: task btrfs-transacti:921 blocked for more than 120 seconds.
[ 2539.059851]   Not tainted 4.17.6-amd64-preempt-sysrq-20180818 #4
[ 2539.079733] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2539.104007] btrfs-transacti D0   921  2 0x8000
[ 2539.121257] Call Trace:
[ 2539.129377]  ? __schedule+0x53e/0x59b
[ 2539.141171]  schedule+0x7f/0x98
[ 2539.151370]  btrfs_tree_lock+0xa6/0x19d
[ 2539.163621]  ? add_wait_queue+0x3a/0x3a
[ 2539.175876]  btrfs_search_slot+0x5aa/0x756
[ 2539.188899]  lookup_inline_extent_backref+0x11a/0x485
[ 2539.204781]  ? fixup_slab_list.isra.43+0x1b/0x72
[ 2539.219360]  __btrfs_free_extent+0xf1/0xa72
[ 2539.232597]  ? btrfs_merge_delayed_refs+0x18b/0x1a7
[ 2539.247922]  ? __mutex_trylock_or_owner+0x43/0x54
[ 2539.262708]  __btrfs_run_delayed_refs+0xad8/0xc40
[ 2539.277504]  btrfs_run_delayed_refs+0x6e/0x16a
[ 2539.291519]  btrfs_commit_transaction+0x42/0x710
[ 2539.306043]  ? start_transaction+0x295/0x325
[ 2539.319516]  transaction_kthread+0xc9/0x135
[ 2539.332757]  ? btrfs_cleanup_transaction+0x3ee/0x3ee
[ 2539.348327]  kthread+0xeb/0xf0
[ 2539.358155]  ? kthread_create_worker_on_cpu+0x66/0x66
[ 2539.373977]  ret_from_fork+0x35/0x40
[ 2539.385394] INFO: task vnstatd:6338 blocked for more than 120 seconds.
[ 2539.405667]   Not tainted 4.17.6-amd64-preempt-sysrq-20180818 #4

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] btrfs: add support for 3-copy replication (raid1c3)

2018-07-17 Thread David Sterba
On Fri, Jul 13, 2018 at 11:02:03PM +0200, Goffredo Baroncelli wrote:
> As general comment, good to hear that something is moving around raid5/6 + 
> write hole and multiple mirroring.
> However I am guessing if this is time to simplify the RAID code. There are a 
> lot of "if" which could be avoided using 
> the values stored in the array "btrfs_raid_array[]".

I absolutely agree and had the same impression during implementing the
feature. For this patchset I did only a minimal prep work, the
suggestions you give below make sense to me.

Enhancing the table would make a lot of code go away and just use one
formula to calculate the results that are now opencoded. I'll be going
through the raid code so I'll get to the cleanups eventually.

> Below some comments:

> > @@ -5075,6 +5093,8 @@ static inline int btrfs_chunk_max_errors(struct 
> > map_lookup *map)
> >  BTRFS_BLOCK_GROUP_RAID5 |
> >  BTRFS_BLOCK_GROUP_DUP)) {
> > max_errors = 1;
> > +   } else if (map->type & BTRFS_BLOCK_GROUP_RAID1C3) {
> > +   max_errors = 2;
> > } else if (map->type & BTRFS_BLOCK_GROUP_RAID6) {
> > max_errors = 2;
> > } else {
> 
> Even in this case the ifs above could be replaced with something like:
> 
>   index = btrfs_bg_flags_to_raid_index(map->type)
>   max_errors = btrfs_raid_array[index].ncopies-1;

There's .tolerated_failures that should equal ncopies - 1 in general,
but does not for DUP so the semantics of the function and caller needs
to be verified.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: rename btrfs_parse_early_options

2018-07-17 Thread David Sterba
On Mon, Jul 16, 2018 at 11:01:37PM +0800, Anand Jain wrote:
> On 07/16/2018 10:18 PM, Anand Jain wrote:
> > Rename btrfs_parse_early_options() to btrfs_parse_device_options(). As
> > btrfs_parse_early_options() parses the -o device options and scan the
> > device provided. So this rename specifies its action. Also the function
> > name is inline with btrfs_parse_subvol_options().

Yeah, much more fitting name.

> > No functional changes.
> > 
> > Signed-off-by: Anand Jain 
> > ---
> 
> Forgot to mention: This should be applied on top of
>Gu Jinxiang (2):
>btrfs: make fs_devices to be a local variable
>btrfs: get device pointer from btrfs_scan_one_device

Both are in misc-next now so your patch applied too, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: qgroup: cleanup the unused srcroot from btrfs_qgroup_inherit

2018-07-17 Thread David Sterba
On Tue, Jul 17, 2018 at 04:58:22PM +0800, Lu Fengqi wrote:
> Since commit 0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info
> convenience variables"), the srcroot is no longer used to get fs_fino.
> In fact, it can be dropped after commit 707e8a071528 ("btrfs: use nodesize
> everywhere, kill leafsize").
> 
> Signed-off-by: Lu Fengqi 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 2/2] btrfs: get device pointer from btrfs_scan_one_device

2018-07-17 Thread David Sterba
On Tue, Jul 17, 2018 at 12:54:00PM +0800, Anand Jain wrote:
> 
> ::
> 
> >>> @@ -1561,12 +1564,15 @@ static struct dentry *btrfs_mount_root(struct 
> >>> file_system_type *fs_type,
> >>>   goto error_fs_info;
> >>>   }
> >>>
> >>> - error = btrfs_scan_one_device(device_name, mode, fs_type, _devices);
> >>> - if (error) {
> >>> + device = btrfs_scan_one_device(device_name, mode, fs_type);
> >>> + if (IS_ERR(device)) {
> >>> + error = PTR_ERR(device);
> >>>   mutex_unlock(_mutex);
> >>>   goto error_fs_info;
> >>>   }
> >>>
> >>> + fs_devices = device->fs_devices;
> >>> +
> 
>   The version at misc-next is missing this assignment and it panics.

Thanks, that was my mismerge. Now added back and pushed.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 2/2] btrfs: get device pointer from btrfs_scan_one_device

2018-07-17 Thread David Sterba
On Tue, Jul 17, 2018 at 01:08:34AM +, Gu, Jinxiang wrote:
> > > - device = device_list_add(path, disk_super, _device_added);
> > > - if (IS_ERR(device)) {
> > > - ret = PTR_ERR(device);
> > > - } else {
> > > - *fs_devices_ret = device->fs_devices;
> > > - if (new_device_added)
> > > - btrfs_free_stale_devices(path, device);
> > > - }
> > > + ret = device_list_add(path, disk_super, _device_added);
> > > + if (new_device_added)
> > > + btrfs_free_stale_devices(path, ret);
> > 
> > Why is this skipping the error handling? I've resolved the conflict in a
> > different way in misc-next, have you had a look there?
> 
> Since when new_device_added set to be true, the return value of 
> device_list_add
> must not a error value.

Yeah, but this is not obvious from the context and looks like "
device_list_add fails but we'll do something anyway", so I'm for keeping
the check.

> So, I think no necessary to add the judge.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] btrfs: Verify every chunk has corresponding block group at mount time

2018-07-17 Thread Qu Wenruo


On 2018年07月17日 20:33, David Sterba wrote:
> On Mon, Jul 16, 2018 at 09:57:43PM +0800, Qu Wenruo wrote:
> -EUCLEAN ?

 Either works for me.
>>>
>>> That's not just a cosmetic change, there's a semantic difference between
>>> the error codes, I maybe make that more explicit and not expect that this
>>> is obvious.
>>>
>>> ENOENT does not make much sense in this context, the caller (mount in
>>> this case) cannot do anything about a code that says 'some internal
>>> structure not found'.
>>
>> The point here is, if every self-checker should only return -EUCLEAN, it
>> won't really indicate what's going wrong, except points to some
>> self-checker (and such self-checkers are growing larger than our
>> expectation already).
>>
>> My practice here is, put some human readable and meaningful error
>> message. No matter what we choose to return, the error message should
>> tell us what's going wrong.
>>
>> In this case, I don't really care the return value. If it's explicitly
>> needed to return -EUCLEAN, I could make all existing checker (from
>> tree-checker to chunk/bg/dev-extent checker) to return -EUCLEAN if
>> anything is wrong (and save several "ret = -EUCLEAN" lines).
>> The return value doesn't really have much meaning nowadays, it's the
>> error message important now.
> 
> Ok, I see what you mean. The message is important as it's otherwise
> almost impossible to find where exactly the mount fails.
> 
> The error messages perhaps fall into several categories:
> 
> 1) transient errors, some failure that happens before the filesystem state
>is fully examined
> 
> this is namely ENOMEM, or EINTR eg. returned by kthread_run

This standard is a little misleading, or did I misunderstand your category?

From the example error number, I could only find ENOMEM so
straightforward for end user/developer that we don't need any error
message to explain them.
Or this category is just for error no need of error message? (or can be
handled by btrfs-progs without any need of user interruption/decision?)

> 
> maybe also a failure on a multi-device filesystem when the devices
> haven't been scanned yet
> 
> 2) clearly some corruption/consistency condtion, with enough information
>available to decide
> 
> like a missing tree, most of the tree-checker would fall into this
> category

This is pretty clear.

> 
> 3) same as the previous one, but there's some external condition preventing
>a full check
> 
> that's eg. a real EIO after reading a tree block

That csum mismatch EIO with error message or really some error from
underlying layer like some ATA command failure?

> 
> 
> The error code are IMO important to see how severe the problems are and
> what's the expected solution. 2 is for 'check', 3 may need degraded
> mount, 1 needs maybe more time to mount again.

Category 2 for check is sure.
For other 2 cases it's a little hard to say.
Normally if we really hit some error we don't expect, under most case
the filesystem is already corrupted (e.g. a lot of errors of resuming
balance/mount failure finally turns out to be fs corruption).

If category is determined by the expected solution, most will just fall
into category 2), including most of errors we have in btrfs module
currently.

> 
> With the error messages in place, 2 can be completely covered by
> EUCLEAN. I briefly skimmed a few call paths and think that the 3
> categories should be enough, but I'm also expecting some exceptions that
> can be decided case by case.

For category 2), I think it's pretty clear and practically to use EUCLEAN.

For other categories I'm not really sure.

E.G what happens if we can't find certain backref when running delayed
refs? It's either a kernel bug or a corrupted fs.
Which category should it fit? Category 2? But we don't really know
what's going wrong.
For category 1/3? It won't really be fixed until we fix the bug or the fs.

More details examples would definitely help me understand the category.

> 
> The error codes are now not consistent, lots of EUCLEAN are historically
> EIO, but before we start cleaning that up we should have at least some
> guidelines. Please let me know what you think.
> 
At least for self-verification code it's pretty clear that we should
have error message for what's going wrong and what we expect, with
explicit EUCLEAN error number.

Thanks,
Qu



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 5/5] btrfs: Verify every chunk has corresponding block group at mount time

2018-07-17 Thread David Sterba
On Mon, Jul 16, 2018 at 09:57:43PM +0800, Qu Wenruo wrote:
> >>> -EUCLEAN ?
> >>
> >> Either works for me.
> > 
> > That's not just a cosmetic change, there's a semantic difference between
> > the error codes, I maybe make that more explicit and not expect that this
> > is obvious.
> > 
> > ENOENT does not make much sense in this context, the caller (mount in
> > this case) cannot do anything about a code that says 'some internal
> > structure not found'.
> 
> The point here is, if every self-checker should only return -EUCLEAN, it
> won't really indicate what's going wrong, except points to some
> self-checker (and such self-checkers are growing larger than our
> expectation already).
> 
> My practice here is, put some human readable and meaningful error
> message. No matter what we choose to return, the error message should
> tell us what's going wrong.
> 
> In this case, I don't really care the return value. If it's explicitly
> needed to return -EUCLEAN, I could make all existing checker (from
> tree-checker to chunk/bg/dev-extent checker) to return -EUCLEAN if
> anything is wrong (and save several "ret = -EUCLEAN" lines).
> The return value doesn't really have much meaning nowadays, it's the
> error message important now.

Ok, I see what you mean. The message is important as it's otherwise
almost impossible to find where exactly the mount fails.

The error messages perhaps fall into several categories:

1) transient errors, some failure that happens before the filesystem state
   is fully examined

this is namely ENOMEM, or EINTR eg. returned by kthread_run

maybe also a failure on a multi-device filesystem when the devices
haven't been scanned yet

2) clearly some corruption/consistency condtion, with enough information
   available to decide

like a missing tree, most of the tree-checker would fall into this
category

3) same as the previous one, but thre's some extenal condition preventing
   a full check

that's eg. a real EIO after reading a tree block


The error code are IMO important to see how severe the problems are and
what's the expected solution. 2 is for 'check', 3 may need degraded
mount, 1 needs maybe more time to mount again.

With the error messages in place, 2 can be completely covered by
EUCLEAN. I briefly skimmed a few call paths and think that the 3
categories should be enough, but I'm also expecting some exceptions that
can be decided case by case.

The error codes are now not consistent, lots of EUCLEAN are historically
EIO, but before we start cleaning that up we should have at least some
guidelines. Please let me know what you think.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs-progs: check: enhanced progress indicator

2018-07-17 Thread David Sterba
On Tue, Jul 17, 2018 at 10:23:47AM +0900, Misono Tomohiro wrote:
> On 2018/07/05 4:20, Stéphane Lesimple wrote:
> > @@ -9806,16 +9839,26 @@ int cmd_check(int argc, char **argv)
> > error("errors found in csum tree");
> > err |= !!ret;
> >  
> > -   fprintf(stderr, "checking root refs\n");
> > /* For low memory mode, check_fs_roots_v2 handles root refs */
> > -   if (check_mode != CHECK_MODE_LOWMEM) {
> > +if (check_mode != CHECK_MODE_LOWMEM) {
> > +   if (!ctx.progress_enabled)
> > +   fprintf(stderr, "[6/7] checking root refs\n");
> > +   else {
> > +   ctx.tp = TASK_ROOT_REFS;
> > +   task_start(ctx.info, _time, _count);
> > +   }
> > +
> > ret = check_root_refs(root, _cache);
> > +   task_stop(ctx.info);
> > err |= !!ret;
> > if (ret) {
> > error("errors found in root refs");
> > goto out;
> > }
> > }
> > +   else {
> > +   fprintf(stderr, "[6/7] checking root refs done with fs roots in 
> > lowmem mode, skipping\n");
> > +   }
> >  
> > while (repair && !list_empty(>fs_info->recow_ebs)) {
> > struct extent_buffer *eb;
> > @@ -9844,8 +9887,15 @@ int cmd_check(int argc, char **argv)
> > }
> >  
> > if (info->quota_enabled) {
> > -   fprintf(stderr, "checking quota groups\n");
> > +   if (!ctx.progress_enabled)
> > +   fprintf(stderr, "[7/7] checking quota groups\n");
> 
> 
> qgroup_set_item_count_ptr(_count) is needed here too. Otherwise
> quota check causes segfault without -p option.

Thanks, fixed as

--- a/check/main.c
+++ b/check/main.c
@@ -9969,12 +9969,12 @@ int cmd_check(int argc, char **argv)
}
 
if (info->quota_enabled) {
+   qgroup_set_item_count_ptr(_count);
if (!ctx.progress_enabled) {
fprintf(stderr, "[7/7] checking quota groups\n");
} else {
ctx.tp = TASK_QGROUPS;
task_start(ctx.info, _time, _count);
-   qgroup_set_item_count_ptr(_count);
}
ret = qgroup_verify_all(info);
task_stop(ctx.info);
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 1/2] btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block()

2018-07-17 Thread David Sterba
On Wed, Jul 11, 2018 at 01:41:21PM +0800, Qu Wenruo wrote:
> In commit ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device
> replace") we removed the branch of copy_nocow_pages() to avoid
> corruption for compressed nodatasum extents.
> 
> However above commit only solves the problem in scrub_extent(), if
> during scrub_pages() we failed to read some pages,
> sctx->no_io_error_seen will be non-zero and we go to fixup function
> scrub_handle_errored_block().
> 
> In scrub_handle_errored_block(), for sctx without csum (no matter if
> we're doing replace or scrub) we go to scrub_fixup_nodatasum() routine,
> which does the similar thing with copy_nocow_pages(), but does it
> without the extra check in copy_nocow_pages() routine.
> 
> So for test cases like btrfs/100, where we emulate read errors during
> replace/scrub, we could corrupt compressed extent data again.
> 
> This patch will fix it just by avoiding any "optimization" for
> nodatasum, just falls back to the normal fixup routine by try read from
> any good copy.
> 
> This also solves WARN_ON() or dead lock caused by lame backref iteration
> in scrub_fixup_nodatasum() routine.
> 
> The deadlock or WARN_ON() won't be triggered before commit ac0b4145d662
> ("btrfs: scrub: Don't use inode pages for device replace") since
> copy_nocow_pages() have better locking and extra check for data extent,
> and it's already doing the fixup work by try to read data from any good
> copy, so it won't go scrub_fixup_nodatasum() anyway.
> 
> Fixes: ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device replace")
> Signed-off-by: Qu Wenruo 

Thanks, I'll forward this to 4.18, there are a few more regression fixes
queued.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Healthy amount of free space?

2018-07-17 Thread Austin S. Hemmelgarn

On 2018-07-16 16:58, Wolf wrote:

Greetings,
I would like to ask what what is healthy amount of free space to keep on
each device for btrfs to be happy?

This is how my disk array currently looks like

 [root@dennas ~]# btrfs fi usage /raid
 Overall:
 Device size:  29.11TiB
 Device allocated: 21.26TiB
 Device unallocated:7.85TiB
 Device missing:  0.00B
 Used: 21.18TiB
 Free (estimated):  3.96TiB  (min: 3.96TiB)
 Data ratio:   2.00
 Metadata ratio:   2.00
 Global reserve:  512.00MiB  (used: 0.00B)

 Data,RAID1: Size:10.61TiB, Used:10.58TiB
/dev/mapper/data1   1.75TiB
/dev/mapper/data2   1.75TiB
/dev/mapper/data3 856.00GiB
/dev/mapper/data4 856.00GiB
/dev/mapper/data5   1.75TiB
/dev/mapper/data6   1.75TiB
/dev/mapper/data7   6.29TiB
/dev/mapper/data8   6.29TiB

 Metadata,RAID1: Size:15.00GiB, Used:13.00GiB
/dev/mapper/data1   2.00GiB
/dev/mapper/data2   3.00GiB
/dev/mapper/data3   1.00GiB
/dev/mapper/data4   1.00GiB
/dev/mapper/data5   3.00GiB
/dev/mapper/data6   1.00GiB
/dev/mapper/data7   9.00GiB
/dev/mapper/data8  10.00GiB
Slightly OT, but the distribution of metadata chunks across devices 
looks a bit sub-optimal here.  If you can tolerate the volume being 
somewhat slower for a while, I'd suggest balancing these (it should get 
you better performance long-term).


 System,RAID1: Size:64.00MiB, Used:1.50MiB
/dev/mapper/data2  32.00MiB
/dev/mapper/data6  32.00MiB
/dev/mapper/data7  32.00MiB
/dev/mapper/data8  32.00MiB

 Unallocated:
/dev/mapper/data11004.52GiB
/dev/mapper/data21004.49GiB
/dev/mapper/data31006.01GiB
/dev/mapper/data41006.01GiB
/dev/mapper/data51004.52GiB
/dev/mapper/data61004.49GiB
/dev/mapper/data71005.00GiB
/dev/mapper/data81005.00GiB

Btrfs does quite good job of evenly using space on all devices. No, how
low can I let that go? In other words, with how much space
free/unallocated remaining space should I consider adding new disk?

Disclaimer: What I'm about to say is based on personal experience.  YMMV.

It depends on how you use the filesystem.

Realistically, there are a couple of things I consider when trying to 
decide on this myself:


* How quickly does the total usage increase on average, and how much can 
it be expected to increase in one day in the worst case scenario?  This 
isn't really BTRFS specific, but it's worth mentioning.  I usually don't 
let an array get close enough to full that it wouldn't be able to safely 
handle at least one day of the worst case increase and another 2 of 
average increases.  In BTRFS terms, the 'safely handle' part means you 
should be adding about 5GB for a multi-TB array like you have, or about 
1GB for a sub-TB array.


* What are the typical write patterns?  Do files get rewritten in-place, 
or are they only ever rewritten with a replace-by-rename? Are writes 
mostly random, or mostly sequential?  Are writes mostly small or mostly 
large?  The more towards the first possibility listed in each of those 
question (in-place rewrites, random access, and small writes), the more 
free space you should keep on the volume.


* Does this volume see heavy usage of fallocate() either to preallocate 
space (note that this _DOES NOT WORK SANELY_ on BTRFS), or to punch 
holes or remove ranges from files.  If whatever software you're using 
does this a lot on this volume, you want even more free space.


* Do old files tend to get removed in large batches?  That is, possibly 
hundreds or thousands of files at a time.  If so, and you're running a 
reasonably recent (4.x series) kernel or regularly balance the volume to 
clean up empty chunks, you don't need quite as much free space.


* How quickly can you get a new device added, and is it critical that 
this volume always be writable?  Sounds stupid, but a lot of people 
don't consider this.  If you can trivially get a new device added 
immediately, you can generally let things go a bit further than you 
would normally, same for if the volume being read-only can be tolerated 
for a while without significant issues.


It's worth noting that I explicitly do not care about snapshot usage. 
It rarely has much impact on this other than changing how the total 
usage increases in a day.


Evaluating all of this is of course something I can't really do for you. 
 If I had to guess, with no other information that the allocations 
shown, I'd say that you're probably generically fine until you get down 
to about 5GB more than twice the average 

Re: [PATCH RESEND 2/2] btrfs-progs: make all programs and libraries optional

2018-07-17 Thread David Sterba
On Mon, Jul 16, 2018 at 10:44:52AM -0700, Omar Sandoval wrote:
> On Mon, Jul 16, 2018 at 04:56:57PM +0200, David Sterba wrote:
> > On Thu, Jul 12, 2018 at 04:11:19PM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval 
> > > 
> > > We have a build system internally which only needs to build the
> > > libraries out of a repository, not any binaries. I looked at how this
> > > works with other projects, and the best example was util-linux, which
> > > makes it possible to enable or disable everything individually. This is
> > > nice and really flexible, so let's do the same. This way, if you only
> > > want to build and install libbtrfsutil, you can simply do
> > > 
> > >   ./configure --disable-documentation --disable-all-programs 
> > > --enable-libbtrfsutil
> > >   make
> > >   make install
> > 
> > I think this is an overkill and abusing the --enable-XXX options.  You
> > want to avoid building the tools by default, so adding an option for
> > that is fine. Selectively building only certain tools can utilize that
> > option too and just follow with 'make btrfs-image' etc.
> 
> Yeah, it's easy to build stuff selectively, but `make install` will
> still try to build everything, that's the part I'm more concerned with.

Oh right, installation. What if it installs just the binaries that are
built? The default actions done by configure & make & make install would
not change, but there could be configure --disable-all, then selectively
make and final make install would be 

for p in $(progs_install); if test -f $p && $(INSTALL) ...; fi

> > The number of --enable-* will stay minimal and we don't even have to
> > discuss how to find a good naming scheme (that works for util-linux but
> > looks a bit confusing for btrfs-progs).
> 
> Ok, I can collapse these into just --disable-programs/--enable-programs,
> and --disable-libraries/--enable-libraries? That would be enough for me.

Sounds ok.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: qgroup: cleanup the unused srcroot from btrfs_qgroup_inherit

2018-07-17 Thread Lu Fengqi
Since commit 0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info
convenience variables"), the srcroot is no longer used to get fs_fino.
In fact, it can be dropped after commit 707e8a071528 ("btrfs: use nodesize
everywhere, kill leafsize").

Signed-off-by: Lu Fengqi 
---
 fs/btrfs/qgroup.c | 17 +
 1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 52e05f031ee3..06e0e9b39340 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2286,22 +2286,6 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle 
*trans,
if (ret)
goto out;
 
-   if (srcid) {
-   struct btrfs_root *srcroot;
-   struct btrfs_key srckey;
-
-   srckey.objectid = srcid;
-   srckey.type = BTRFS_ROOT_ITEM_KEY;
-   srckey.offset = (u64)-1;
-   srcroot = btrfs_read_fs_root_no_name(fs_info, );
-   if (IS_ERR(srcroot)) {
-   ret = PTR_ERR(srcroot);
-   goto out;
-   }
-
-   level_size = fs_info->nodesize;
-   }
-
/*
 * add qgroup to all inherited groups
 */
@@ -2358,6 +2342,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,
 * our counts don't go crazy, so at this point the only
 * difference between the two roots should be the root node.
 */
+   level_size = fs_info->nodesize;
dstgroup->rfer = srcgroup->rfer;
dstgroup->rfer_cmpr = srcgroup->rfer_cmpr;
dstgroup->excl = level_size;
-- 
2.18.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: extent-tree: Check if the newly reserved tree block is already in use

2018-07-17 Thread Qu Wenruo



On 2018年07月17日 16:28, Nikolay Borisov wrote:
> 
> 
> On 17.07.2018 11:24, Qu Wenruo wrote:
>> And it's causing problem for certain test cases.
>> Please ignore this (at least for now).
>>
>> But on the other hand, we indeed have a lot of reports on corrupted
>> extent tree, it's possible to hit some corrupted extent tree (Su is
>> already exhausted by the corrupted tree reported by Marc)
>>
>> So I'm not completely fine with current extent tree error handling.
>> I'll try to find some balance in next version.
> 
> 
> I agree we need a better OVERALL error handling/detection. Your
> tree-checker work IMO is a step in the right direction. What I want is
> to prevent ad-hoc checks being sprinkled in the code.

Yep, I also don't like what I'm doing.
But the problem and the limit of tree-checker is, it's static check.

For things doing tons of cross-check like extent tree, it's not really
as useful.

> Sorry, but that's
> not fine. The thing with working on a lot of corruption reports is the
> fact each one of them is looked at in isolation so it produces isolated
> fixes. Whereas if a step back is taken and the overall error
> handling/detection is considered it might turn out a whole class of
> corruption could be detected by a single change, otherwise checks upon
> checks will be added which just add technical debt.
> 
> Considering this, I'm more in favor of extending the tree-checker to be
> the central place where errors are detected (of course this is easier
> said than done).

For this report itself, tree checker can detect it indirectly by reject
the leaf for the unknown key type.

But one can easily create a valid image by just removing the valid
METADATA_ITEM, and tree-checker can't do anything to detect the problem.

So unfortunately, we will eventually need some runtime check anyway.

Thanks,
Qu

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: extent-tree: Check if the newly reserved tree block is already in use

2018-07-17 Thread Qu Wenruo



On 2018年07月17日 16:01, Nikolay Borisov wrote:
> 
> 
> On 17.07.2018 10:46, Qu Wenruo wrote:
>> [BUG]
>> For certain fuzzed btrfs image, if we create any csum data, it would
>> cause the following kernel warning and deadlock when trying to update
>> csum tree:
>> --
>> [  278.113360] WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 
>> btrfs_tree_lock+0x3e2/0x400
>> [  278.113737] CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
>> [  278.113745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> Ubuntu-1.8.2-1ubuntu1 04/01/2014
>> [  278.113753] Workqueue: btrfs-endio-write btrfs_endio_write_helper
>> [  278.113761] RIP: 0010:btrfs_tree_lock+0x3e2/0x400
>> [  278.113762] Code: 00 48 c7 40 08 00 00 00 00 48 8b 45 d0 65 48 33 04 25 
>> 28 00 00 00 75 20 48 81 c4 a0 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 
>> 0b e9 d4 fc ff ff 0f 0b e9 61 ff ff ff e8 ab f4 87 ff 90 66 2e
>> [  278.113818] RSP: 0018:8801f407f488 EFLAGS: 00010246
>> [  278.113865] Call Trace:
>> [  278.113936]  btrfs_alloc_tree_block+0x39f/0x770
>> [  278.113988]  __btrfs_cow_block+0x285/0x9e0
>> [  278.114029]  btrfs_cow_block+0x191/0x2e0
>> [  278.114035]  btrfs_search_slot+0x492/0x1160
>> [  278.114146]  btrfs_lookup_csum+0xec/0x280
>> [  278.114182]  btrfs_csum_file_blocks+0x2be/0xa60
>> [  278.114232]  add_pending_csums+0xaf/0xf0
>> [  278.114238]  btrfs_finish_ordered_io+0x74b/0xc90
>> [  278.114281]  finish_ordered_fn+0x15/0x20
>> [  278.114285]  normal_work_helper+0xf6/0x500
>> [  278.114305]  btrfs_endio_write_helper+0x12/0x20
>> [  278.114310]  process_one_work+0x302/0x770
>> [  278.114315]  worker_thread+0x81/0x6d0
>> [  278.114321]  kthread+0x180/0x1d0
>> [  278.114334]  ret_from_fork+0x35/0x40
>> [  278.114339] ---[ end trace 2e85051acb5f6dc1 ]---
>> --
>>
>> [CAUSE]
>> The fuzzed image has corrupted EXTENT_ITEM for csum tree root:
>> --
>> extent tree key (EXTENT_TREE ROOT_ITEM 0)
>>  item 4 key (29364224 METADATA_ITEM 0) itemoff 3857 itemsize 33
>>  refs 1 gen 6 flags TREE_BLOCK
>>  tree block skinny level 0
>>  tree block backref root UUID_TREE
>>  item 5 key (29376512 UNKNOWN.0 0) itemoff 3824 itemsize 33
>>   Corrupted METADATA_ITEM
>>  item 6 key (29380608 METADATA_ITEM 0) itemoff 3791 itemsize 33
>>  refs 1 gen 4 flags TREE_BLOCK
>>  tree block skinny level 0
>>  tree block backref root DATA_RELOC_TREE
>>
>> checksum tree key (CSUM_TREE ROOT_ITEM 0)
>> leaf 29376512 items 0 free space 3995 generation 4 owner CSUM_TREE
>>   bytenr matches above item.
>> --
>>
>> So when btrfs_alloc_tree_blocks() calls btrfs_reserve_extent(), since
>> there is not METADATA_ITEM/EXTENT_ITEM for bytenr 29376512, btrfs thinks
>> it's free space, and reserve it.
>>
>> However in fact it's already been used by csum tree, and later
>> btrfs_init_new_buffer() will try to call btrfs_tree_lock(), whose
>> WARN_ON() detects lock nest on the same extent buffer.
>>
>> Finally the wait_event() on the eb->read/write_lock_wq will never exit
>> since we're holding the lock by ourselves and deadlock.
>>
>> [FIX]
>> The fix here is to ensure at least the reserved extent buffer is not
>> cached.
>> Any used extent buffer should be cached in the global radix tree
>> (fs_info->buffer_radix).
>>
>> So before calling btrfs_init_new_buffer() in btrfs_alloc_tree_block(),
>> we call find_extent_buffer() explicitly to verify it's not used by
>> ourselves.
>>
>> Please note this is just a basic check, it is not and will never be as
>> good as btrfs check on detecting extent tree corruption, but at least we
>> won't dead lock so easily.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
>> Reported-by: Xu Wen 
>> Signed-off-by: Qu Wenruo 
>> ---
>>  fs/btrfs/extent-tree.c | 14 ++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 3578fa5b30ef..782dd96b7c5e 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -8435,6 +8435,20 @@ struct extent_buffer *btrfs_alloc_tree_block(struct 
>> btrfs_trans_handle *trans,
>>  if (ret)
>>  goto out_unuse;
>>  
>> +/*
>> + * Newly allocated tree block should never be cached in radix tree,
>> + * Or we have a corrupted extent tree.
>> + */
>> +buf = find_extent_buffer(fs_info, ins.objectid);
>> +if (buf) {
>> +btrfs_err_rl(fs_info,
>> +"tree block %llu is already in use, extent tree may be corrupted",
>> + ins.objectid);
>> +ret = -EUCLEAN;
>> +free_extent_buffer(buf);
>> +goto out_unuse;
>> +}
> 
> The code makes sense but I have the feeling it needs to have some sort
> of assert guard because this check will likely trigger only on severly
> corrupted filesystemd and yet we introduce it for everyone.

And it's causing problem 

Re: Healthy amount of free space?

2018-07-17 Thread Nikolay Borisov



On 17.07.2018 11:02, Martin Steigerwald wrote:
> Hi Nikolay.
> 
> Nikolay Borisov - 17.07.18, 09:20:
>> On 16.07.2018 23:58, Wolf wrote:
>>> Greetings,
>>> I would like to ask what what is healthy amount of free space to
>>> keep on each device for btrfs to be happy?
>>>
>>> This is how my disk array currently looks like
>>>
>>> [root@dennas ~]# btrfs fi usage /raid
>>> 
>>> Overall:
>>> Device size:  29.11TiB
>>> Device allocated: 21.26TiB
>>> Device unallocated:7.85TiB
>>> Device missing:  0.00B
>>> Used: 21.18TiB
>>> Free (estimated):  3.96TiB  (min: 3.96TiB)
>>> Data ratio:   2.00
>>> Metadata ratio:   2.00
>>> Global reserve:  512.00MiB  (used: 0.00B)
> […]
>>> Btrfs does quite good job of evenly using space on all devices. No,
>>> how low can I let that go? In other words, with how much space
>>> free/unallocated remaining space should I consider adding new disk?
>>
>> Btrfs will start running into problems when you run out of unallocated
>> space. So the best advice will be monitor your device unallocated,
>> once it gets really low - like 2-3 gb I will suggest you run balance
>> which will try to free up unallocated space by rewriting data more
>> compactly into sparsely populated block groups. If after running
>> balance you haven't really freed any space then you should consider
>> adding a new drive and running balance to even out the spread of
>> data/metadata.
> 
> What are these issues exactly?

For example if you have plenty of data space but your metadata is full
then you will be getting ENOSPC.

> 
> I have
> 
> % btrfs fi us -T /home
> Overall:
> Device size: 340.00GiB
> Device allocated:340.00GiB
> Device unallocated:2.00MiB
> Device missing:  0.00B
> Used:308.37GiB
> Free (estimated): 14.65GiB  (min: 14.65GiB)
> Data ratio:   2.00
> Metadata ratio:   2.00
> Global reserve:  512.00MiB  (used: 0.00B)
> 
>   Data  Metadata System  
> Id Path   RAID1 RAID1RAID1Unallocated
> -- -- -   ---
>  1 /dev/mapper/msata-home 165.89GiB  4.08GiB 32.00MiB 1.00MiB
>  2 /dev/mapper/sata-home  165.89GiB  4.08GiB 32.00MiB 1.00MiB
> -- -- -   ---
>Total  165.89GiB  4.08GiB 32.00MiB 2.00MiB
>Used   151.24GiB  2.95GiB 48.00KiB

You already have only 33% of your metadata full so if your workload
turned out to actually be making more metadata-heavy changed i.e
snapshots you could exhaust this and get ENOSPC, despite having around
14gb of free data space. Furthermore this data space is spread around
multiple data chunks, depending on how populated they are a balance
could be able to free up unallocated space which later could be
re-purposed for metadata (again, depending on what you are doing).

> 
> on a RAID-1 filesystem one, part of the time two Plasma desktops + 
> KDEPIM and Akonadi + Baloo desktop search + you name it write to like 
> mad.
> 



> 
> Thanks,
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: extent-tree: Check if the newly reserved tree block is already in use

2018-07-17 Thread Su Yue




On 07/17/2018 04:01 PM, Nikolay Borisov wrote:



On 17.07.2018 10:46, Qu Wenruo wrote:

[BUG]
For certain fuzzed btrfs image, if we create any csum data, it would
cause the following kernel warning and deadlock when trying to update
csum tree:
--
[  278.113360] WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 
btrfs_tree_lock+0x3e2/0x400
[  278.113737] CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
[  278.113745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  278.113753] Workqueue: btrfs-endio-write btrfs_endio_write_helper
[  278.113761] RIP: 0010:btrfs_tree_lock+0x3e2/0x400
[  278.113762] Code: 00 48 c7 40 08 00 00 00 00 48 8b 45 d0 65 48 33 04 25 28 00 00 
00 75 20 48 81 c4 a0 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b e9 d4 fc 
ff ff 0f 0b e9 61 ff ff ff e8 ab f4 87 ff 90 66 2e
[  278.113818] RSP: 0018:8801f407f488 EFLAGS: 00010246
[  278.113865] Call Trace:
[  278.113936]  btrfs_alloc_tree_block+0x39f/0x770
[  278.113988]  __btrfs_cow_block+0x285/0x9e0
[  278.114029]  btrfs_cow_block+0x191/0x2e0
[  278.114035]  btrfs_search_slot+0x492/0x1160
[  278.114146]  btrfs_lookup_csum+0xec/0x280
[  278.114182]  btrfs_csum_file_blocks+0x2be/0xa60
[  278.114232]  add_pending_csums+0xaf/0xf0
[  278.114238]  btrfs_finish_ordered_io+0x74b/0xc90
[  278.114281]  finish_ordered_fn+0x15/0x20
[  278.114285]  normal_work_helper+0xf6/0x500
[  278.114305]  btrfs_endio_write_helper+0x12/0x20
[  278.114310]  process_one_work+0x302/0x770
[  278.114315]  worker_thread+0x81/0x6d0
[  278.114321]  kthread+0x180/0x1d0
[  278.114334]  ret_from_fork+0x35/0x40
[  278.114339] ---[ end trace 2e85051acb5f6dc1 ]---
--

[CAUSE]
The fuzzed image has corrupted EXTENT_ITEM for csum tree root:
--
extent tree key (EXTENT_TREE ROOT_ITEM 0)
item 4 key (29364224 METADATA_ITEM 0) itemoff 3857 itemsize 33
refs 1 gen 6 flags TREE_BLOCK
tree block skinny level 0
tree block backref root UUID_TREE
item 5 key (29376512 UNKNOWN.0 0) itemoff 3824 itemsize 33
 Corrupted METADATA_ITEM
item 6 key (29380608 METADATA_ITEM 0) itemoff 3791 itemsize 33
refs 1 gen 4 flags TREE_BLOCK
tree block skinny level 0
tree block backref root DATA_RELOC_TREE

checksum tree key (CSUM_TREE ROOT_ITEM 0)
leaf 29376512 items 0 free space 3995 generation 4 owner CSUM_TREE
   bytenr matches above item.
--

So when btrfs_alloc_tree_blocks() calls btrfs_reserve_extent(), since
there is not METADATA_ITEM/EXTENT_ITEM for bytenr 29376512, btrfs thinks
it's free space, and reserve it.

However in fact it's already been used by csum tree, and later
btrfs_init_new_buffer() will try to call btrfs_tree_lock(), whose
WARN_ON() detects lock nest on the same extent buffer.

Finally the wait_event() on the eb->read/write_lock_wq will never exit
since we're holding the lock by ourselves and deadlock.

[FIX]
The fix here is to ensure at least the reserved extent buffer is not
cached.
Any used extent buffer should be cached in the global radix tree
(fs_info->buffer_radix).

So before calling btrfs_init_new_buffer() in btrfs_alloc_tree_block(),
we call find_extent_buffer() explicitly to verify it's not used by
ourselves.

Please note this is just a basic check, it is not and will never be as
good as btrfs check on detecting extent tree corruption, but at least we
won't dead lock so easily.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
Reported-by: Xu Wen 
Signed-off-by: Qu Wenruo 
---
  fs/btrfs/extent-tree.c | 14 ++
  1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3578fa5b30ef..782dd96b7c5e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8435,6 +8435,20 @@ struct extent_buffer *btrfs_alloc_tree_block(struct 
btrfs_trans_handle *trans,
if (ret)
goto out_unuse;
  
+	/*

+* Newly allocated tree block should never be cached in radix tree,
+* Or we have a corrupted extent tree.
+*/
+   buf = find_extent_buffer(fs_info, ins.objectid);
+   if (buf) {
+   btrfs_err_rl(fs_info,
+   "tree block %llu is already in use, extent tree may be corrupted",
+ins.objectid);
+   ret = -EUCLEAN;
+   free_extent_buffer(buf);
+   goto out_unuse;
+   }


The code makes sense but I have the feeling it needs to have some sort
of assert guard because this check will likely trigger only on severly
corrupted filesystemd and yet we introduce it for everyone.


Agreed, the function is called so frequently.

+
buf = btrfs_init_new_buffer(trans, root, ins.objectid, level);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a 

Re: Healthy amount of free space?

2018-07-17 Thread Martin Steigerwald
Hi Nikolay.

Nikolay Borisov - 17.07.18, 09:20:
> On 16.07.2018 23:58, Wolf wrote:
> > Greetings,
> > I would like to ask what what is healthy amount of free space to
> > keep on each device for btrfs to be happy?
> > 
> > This is how my disk array currently looks like
> > 
> > [root@dennas ~]# btrfs fi usage /raid
> > 
> > Overall:
> > Device size:  29.11TiB
> > Device allocated: 21.26TiB
> > Device unallocated:7.85TiB
> > Device missing:  0.00B
> > Used: 21.18TiB
> > Free (estimated):  3.96TiB  (min: 3.96TiB)
> > Data ratio:   2.00
> > Metadata ratio:   2.00
> > Global reserve:  512.00MiB  (used: 0.00B)
[…]
> > Btrfs does quite good job of evenly using space on all devices. No,
> > how low can I let that go? In other words, with how much space
> > free/unallocated remaining space should I consider adding new disk?
> 
> Btrfs will start running into problems when you run out of unallocated
> space. So the best advice will be monitor your device unallocated,
> once it gets really low - like 2-3 gb I will suggest you run balance
> which will try to free up unallocated space by rewriting data more
> compactly into sparsely populated block groups. If after running
> balance you haven't really freed any space then you should consider
> adding a new drive and running balance to even out the spread of
> data/metadata.

What are these issues exactly?

I have

% btrfs fi us -T /home
Overall:
Device size: 340.00GiB
Device allocated:340.00GiB
Device unallocated:2.00MiB
Device missing:  0.00B
Used:308.37GiB
Free (estimated): 14.65GiB  (min: 14.65GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

  Data  Metadata System  
Id Path   RAID1 RAID1RAID1Unallocated
-- -- -   ---
 1 /dev/mapper/msata-home 165.89GiB  4.08GiB 32.00MiB 1.00MiB
 2 /dev/mapper/sata-home  165.89GiB  4.08GiB 32.00MiB 1.00MiB
-- -- -   ---
   Total  165.89GiB  4.08GiB 32.00MiB 2.00MiB
   Used   151.24GiB  2.95GiB 48.00KiB

on a RAID-1 filesystem one, part of the time two Plasma desktops + 
KDEPIM and Akonadi + Baloo desktop search + you name it write to like 
mad.

Since kernel 4.5 or 4.6 this simply works. Before that sometimes BTRFS 
crawled to an halt on searching for free blocks, and I had to switch off 
the laptop uncleanly. If that happened, a balance helped for a while. 
But since 4.5 or 4.6 this did not happen anymore.

I found with SLES 12 SP 3 or so there is btrfsmaintenance running a 
balance weekly. Which created an issue on our Proxmox + Ceph on Intel 
NUC based opensource demo lab. This is for sure no recommended 
configuration for Ceph and Ceph is quite slow on these 2,5 inch 
harddisks and 1 GBit network link, despite albeit somewhat minimal, 
limited to 5 GiB m.2 SSD caching. What happened it that the VM crawled 
to a halt and the kernel gave task hung for more than 120 seconds 
messages. The VM was basically unusable during the balance. Sure that 
should not happen with a "proper" setup, also it also did not happen 
without the automatic balance.

Also what would happen on a hypervisor setup with several thousands of 
VMs with BTRFS, when several 100 of them decide to start the balance at 
a similar time? It could probably bring the I/O system below to an halt, 
as many enterprise storage systems are designed to sustain burst I/O 
loads, but not maximum utilization during an extended period of time.

I am really wondering what to recommend in my Linux performance tuning 
and analysis courses. On my own laptop I do not do regular balances so 
far. Due to my thinking: If it is not broken, do not fix it.

My personal opinion here also is: If the filesystem degrades that much 
that it becomes unusable without regular maintenance from user space, 
the filesystem needs to be fixed. Ideally I would not have to worry on 
whether to regularly balance an BTRFS or not. In other words: I should 
not have to visit a performance analysis and tuning course in order to 
use a computer with BTRFS filesystem.

Thanks,
-- 
Martin


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: extent-tree: Check if the newly reserved tree block is already in use

2018-07-17 Thread Nikolay Borisov



On 17.07.2018 10:46, Qu Wenruo wrote:
> [BUG]
> For certain fuzzed btrfs image, if we create any csum data, it would
> cause the following kernel warning and deadlock when trying to update
> csum tree:
> --
> [  278.113360] WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 
> btrfs_tree_lock+0x3e2/0x400
> [  278.113737] CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
> [  278.113745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [  278.113753] Workqueue: btrfs-endio-write btrfs_endio_write_helper
> [  278.113761] RIP: 0010:btrfs_tree_lock+0x3e2/0x400
> [  278.113762] Code: 00 48 c7 40 08 00 00 00 00 48 8b 45 d0 65 48 33 04 25 28 
> 00 00 00 75 20 48 81 c4 a0 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 
> e9 d4 fc ff ff 0f 0b e9 61 ff ff ff e8 ab f4 87 ff 90 66 2e
> [  278.113818] RSP: 0018:8801f407f488 EFLAGS: 00010246
> [  278.113865] Call Trace:
> [  278.113936]  btrfs_alloc_tree_block+0x39f/0x770
> [  278.113988]  __btrfs_cow_block+0x285/0x9e0
> [  278.114029]  btrfs_cow_block+0x191/0x2e0
> [  278.114035]  btrfs_search_slot+0x492/0x1160
> [  278.114146]  btrfs_lookup_csum+0xec/0x280
> [  278.114182]  btrfs_csum_file_blocks+0x2be/0xa60
> [  278.114232]  add_pending_csums+0xaf/0xf0
> [  278.114238]  btrfs_finish_ordered_io+0x74b/0xc90
> [  278.114281]  finish_ordered_fn+0x15/0x20
> [  278.114285]  normal_work_helper+0xf6/0x500
> [  278.114305]  btrfs_endio_write_helper+0x12/0x20
> [  278.114310]  process_one_work+0x302/0x770
> [  278.114315]  worker_thread+0x81/0x6d0
> [  278.114321]  kthread+0x180/0x1d0
> [  278.114334]  ret_from_fork+0x35/0x40
> [  278.114339] ---[ end trace 2e85051acb5f6dc1 ]---
> --
> 
> [CAUSE]
> The fuzzed image has corrupted EXTENT_ITEM for csum tree root:
> --
> extent tree key (EXTENT_TREE ROOT_ITEM 0)
>   item 4 key (29364224 METADATA_ITEM 0) itemoff 3857 itemsize 33
>   refs 1 gen 6 flags TREE_BLOCK
>   tree block skinny level 0
>   tree block backref root UUID_TREE
>   item 5 key (29376512 UNKNOWN.0 0) itemoff 3824 itemsize 33
>    Corrupted METADATA_ITEM
>   item 6 key (29380608 METADATA_ITEM 0) itemoff 3791 itemsize 33
>   refs 1 gen 4 flags TREE_BLOCK
>   tree block skinny level 0
>   tree block backref root DATA_RELOC_TREE
> 
> checksum tree key (CSUM_TREE ROOT_ITEM 0)
> leaf 29376512 items 0 free space 3995 generation 4 owner CSUM_TREE
>   bytenr matches above item.
> --
> 
> So when btrfs_alloc_tree_blocks() calls btrfs_reserve_extent(), since
> there is not METADATA_ITEM/EXTENT_ITEM for bytenr 29376512, btrfs thinks
> it's free space, and reserve it.
> 
> However in fact it's already been used by csum tree, and later
> btrfs_init_new_buffer() will try to call btrfs_tree_lock(), whose
> WARN_ON() detects lock nest on the same extent buffer.
> 
> Finally the wait_event() on the eb->read/write_lock_wq will never exit
> since we're holding the lock by ourselves and deadlock.
> 
> [FIX]
> The fix here is to ensure at least the reserved extent buffer is not
> cached.
> Any used extent buffer should be cached in the global radix tree
> (fs_info->buffer_radix).
> 
> So before calling btrfs_init_new_buffer() in btrfs_alloc_tree_block(),
> we call find_extent_buffer() explicitly to verify it's not used by
> ourselves.
> 
> Please note this is just a basic check, it is not and will never be as
> good as btrfs check on detecting extent tree corruption, but at least we
> won't dead lock so easily.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
> Reported-by: Xu Wen 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/extent-tree.c | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 3578fa5b30ef..782dd96b7c5e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8435,6 +8435,20 @@ struct extent_buffer *btrfs_alloc_tree_block(struct 
> btrfs_trans_handle *trans,
>   if (ret)
>   goto out_unuse;
>  
> + /*
> +  * Newly allocated tree block should never be cached in radix tree,
> +  * Or we have a corrupted extent tree.
> +  */
> + buf = find_extent_buffer(fs_info, ins.objectid);
> + if (buf) {
> + btrfs_err_rl(fs_info,
> + "tree block %llu is already in use, extent tree may be corrupted",
> +  ins.objectid);
> + ret = -EUCLEAN;
> + free_extent_buffer(buf);
> + goto out_unuse;
> + }

The code makes sense but I have the feeling it needs to have some sort
of assert guard because this check will likely trigger only on severly
corrupted filesystemd and yet we introduce it for everyone.

> +
>   buf = btrfs_init_new_buffer(trans, root, ins.objectid, level);
>   if (IS_ERR(buf)) {
>   ret = PTR_ERR(buf);
> 
--
To unsubscribe 

Re: Corrupted FS with "open_ctree failed" and "failed to recover balance: -5"

2018-07-17 Thread Udo Waechter
Thanks for the answer.

On 16/07/18 10:32, Qu Wenruo wrote:
> 
> 
> On 2018年07月16日 16:15, Udo Waechter wrote:
>>> The weird thing is that I can't really find information about the
>>> "failed to recover balance: -5" error. - There was no rebalancing
>>> running when during the crash.
> 
> Can only be determined by tree dump.
> 
> # btrfs ins dump-tree -t root 
This gives me:

btrfs-progs v4.13.3
parent transid verify failed on 321265147904 wanted 3276017 found 3273915
parent transid verify failed on 321265147904 wanted 3276017 found 3273915
parent transid verify failed on 321265147904 wanted 3276017 found 3263707
parent transid verify failed on 321265147904 wanted 3276017 found 3273915
Ignoring transid failure
leaf parent key incorrect 321265147904
ERROR: unable to open /dev/vg00/var_

>>> * Unfortunatly I did a "btrfs rescue zero-log" at some point :( - As it
>>> turns out that might have been a bad idea
>>>
>>>
>>> * Also, a "btrfs  check --init-extent-tree" - https://pastebin.com/jATDCFZy
> 
> Then it is making things worse, fortunately it should terminate before
> it causes more damage.
> 
> I'm just curious why people doesn't try the safest "btrfs check" without
> any options, but goes the most dangerous option.
> 
> And "btrfs check" output please.
> If possible, "btrfs check --mode=lowmem" is also good for debug.
> 
Same thing here:

parent transid verify failed on 321265147904 wanted 3276017 found 3273915
parent transid verify failed on 321265147904 wanted 3276017 found 3273915
parent transid verify failed on 321265147904 wanted 3276017 found 3263707
parent transid verify failed on 321265147904 wanted 3276017 found 3273915
Ignoring transid failure
leaf parent key incorrect 321265147904
ERROR: cannot open file system


I did an image with dd pretty early in this process. Unfortunatly this
gives me the same error.


Thanks,
udo.

> Thanks,
> Qu
> 
>>>
>>> The volume contained qcow2 images for VMs. I need only one of those,
>>> since one piece of important software decided to not do backups :(
>>>
>>> Any help is highly appreciated.
>>>
>>> Many thanks,
>>> udo.
>>>
>>
> 



signature.asc
Description: OpenPGP digital signature


[PATCH] btrfs: extent-tree: Check if the newly reserved tree block is already in use

2018-07-17 Thread Qu Wenruo
[BUG]
For certain fuzzed btrfs image, if we create any csum data, it would
cause the following kernel warning and deadlock when trying to update
csum tree:
--
[  278.113360] WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 
btrfs_tree_lock+0x3e2/0x400
[  278.113737] CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
[  278.113745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  278.113753] Workqueue: btrfs-endio-write btrfs_endio_write_helper
[  278.113761] RIP: 0010:btrfs_tree_lock+0x3e2/0x400
[  278.113762] Code: 00 48 c7 40 08 00 00 00 00 48 8b 45 d0 65 48 33 04 25 28 
00 00 00 75 20 48 81 c4 a0 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b e9 
d4 fc ff ff 0f 0b e9 61 ff ff ff e8 ab f4 87 ff 90 66 2e
[  278.113818] RSP: 0018:8801f407f488 EFLAGS: 00010246
[  278.113865] Call Trace:
[  278.113936]  btrfs_alloc_tree_block+0x39f/0x770
[  278.113988]  __btrfs_cow_block+0x285/0x9e0
[  278.114029]  btrfs_cow_block+0x191/0x2e0
[  278.114035]  btrfs_search_slot+0x492/0x1160
[  278.114146]  btrfs_lookup_csum+0xec/0x280
[  278.114182]  btrfs_csum_file_blocks+0x2be/0xa60
[  278.114232]  add_pending_csums+0xaf/0xf0
[  278.114238]  btrfs_finish_ordered_io+0x74b/0xc90
[  278.114281]  finish_ordered_fn+0x15/0x20
[  278.114285]  normal_work_helper+0xf6/0x500
[  278.114305]  btrfs_endio_write_helper+0x12/0x20
[  278.114310]  process_one_work+0x302/0x770
[  278.114315]  worker_thread+0x81/0x6d0
[  278.114321]  kthread+0x180/0x1d0
[  278.114334]  ret_from_fork+0x35/0x40
[  278.114339] ---[ end trace 2e85051acb5f6dc1 ]---
--

[CAUSE]
The fuzzed image has corrupted EXTENT_ITEM for csum tree root:
--
extent tree key (EXTENT_TREE ROOT_ITEM 0)
item 4 key (29364224 METADATA_ITEM 0) itemoff 3857 itemsize 33
refs 1 gen 6 flags TREE_BLOCK
tree block skinny level 0
tree block backref root UUID_TREE
item 5 key (29376512 UNKNOWN.0 0) itemoff 3824 itemsize 33
 Corrupted METADATA_ITEM
item 6 key (29380608 METADATA_ITEM 0) itemoff 3791 itemsize 33
refs 1 gen 4 flags TREE_BLOCK
tree block skinny level 0
tree block backref root DATA_RELOC_TREE

checksum tree key (CSUM_TREE ROOT_ITEM 0)
leaf 29376512 items 0 free space 3995 generation 4 owner CSUM_TREE
  bytenr matches above item.
--

So when btrfs_alloc_tree_blocks() calls btrfs_reserve_extent(), since
there is not METADATA_ITEM/EXTENT_ITEM for bytenr 29376512, btrfs thinks
it's free space, and reserve it.

However in fact it's already been used by csum tree, and later
btrfs_init_new_buffer() will try to call btrfs_tree_lock(), whose
WARN_ON() detects lock nest on the same extent buffer.

Finally the wait_event() on the eb->read/write_lock_wq will never exit
since we're holding the lock by ourselves and deadlock.

[FIX]
The fix here is to ensure at least the reserved extent buffer is not
cached.
Any used extent buffer should be cached in the global radix tree
(fs_info->buffer_radix).

So before calling btrfs_init_new_buffer() in btrfs_alloc_tree_block(),
we call find_extent_buffer() explicitly to verify it's not used by
ourselves.

Please note this is just a basic check, it is not and will never be as
good as btrfs check on detecting extent tree corruption, but at least we
won't dead lock so easily.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
Reported-by: Xu Wen 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3578fa5b30ef..782dd96b7c5e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8435,6 +8435,20 @@ struct extent_buffer *btrfs_alloc_tree_block(struct 
btrfs_trans_handle *trans,
if (ret)
goto out_unuse;
 
+   /*
+* Newly allocated tree block should never be cached in radix tree,
+* Or we have a corrupted extent tree.
+*/
+   buf = find_extent_buffer(fs_info, ins.objectid);
+   if (buf) {
+   btrfs_err_rl(fs_info,
+   "tree block %llu is already in use, extent tree may be corrupted",
+ins.objectid);
+   ret = -EUCLEAN;
+   free_extent_buffer(buf);
+   goto out_unuse;
+   }
+
buf = btrfs_init_new_buffer(trans, root, ins.objectid, level);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
-- 
2.18.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Healthy amount of free space?

2018-07-17 Thread Nikolay Borisov



On 16.07.2018 23:58, Wolf wrote:
> Greetings,
> I would like to ask what what is healthy amount of free space to keep on
> each device for btrfs to be happy?
> 
> This is how my disk array currently looks like
> 
> [root@dennas ~]# btrfs fi usage /raid
> Overall:
> Device size:  29.11TiB
> Device allocated: 21.26TiB
> Device unallocated:7.85TiB
> Device missing:  0.00B
> Used: 21.18TiB
> Free (estimated):  3.96TiB  (min: 3.96TiB)
> Data ratio:   2.00
> Metadata ratio:   2.00
> Global reserve:  512.00MiB  (used: 0.00B)
> 
> Data,RAID1: Size:10.61TiB, Used:10.58TiB
>/dev/mapper/data1   1.75TiB
>/dev/mapper/data2   1.75TiB
>/dev/mapper/data3 856.00GiB
>/dev/mapper/data4 856.00GiB
>/dev/mapper/data5   1.75TiB
>/dev/mapper/data6   1.75TiB
>/dev/mapper/data7   6.29TiB
>/dev/mapper/data8   6.29TiB
> 
> Metadata,RAID1: Size:15.00GiB, Used:13.00GiB
>/dev/mapper/data1   2.00GiB
>/dev/mapper/data2   3.00GiB
>/dev/mapper/data3   1.00GiB
>/dev/mapper/data4   1.00GiB
>/dev/mapper/data5   3.00GiB
>/dev/mapper/data6   1.00GiB
>/dev/mapper/data7   9.00GiB
>/dev/mapper/data8  10.00GiB
> 
> System,RAID1: Size:64.00MiB, Used:1.50MiB
>/dev/mapper/data2  32.00MiB
>/dev/mapper/data6  32.00MiB
>/dev/mapper/data7  32.00MiB
>/dev/mapper/data8  32.00MiB
> 
> Unallocated:
>/dev/mapper/data11004.52GiB
>/dev/mapper/data21004.49GiB
>/dev/mapper/data31006.01GiB
>/dev/mapper/data41006.01GiB
>/dev/mapper/data51004.52GiB
>/dev/mapper/data61004.49GiB
>/dev/mapper/data71005.00GiB
>/dev/mapper/data81005.00GiB
> 
> Btrfs does quite good job of evenly using space on all devices. No, how
> low can I let that go? In other words, with how much space
> free/unallocated remaining space should I consider adding new disk?

Btrfs will start running into problems when you run out of unallocated
space. So the best advice will be monitor your device unallocated, once
it gets really low - like 2-3 gb I will suggest you run balance which
will try to free up unallocated space by rewriting data more compactly
into sparsely populated block groups. If after running balance you
haven't really freed any space then you should consider adding a new
drive and running balance to even out the spread of data/metadata.

> 
> Thanks for advice :)
> 
> W.
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs-progs: completion: Let dump-tree/dump-super/inode-resolve to accept any file

2018-07-17 Thread Qu Wenruo
For dump-tree/dump-super the completion uses default filedir -d, which
is far from convenient.
Use filedir for dump-tree/dump-super/inode-resolve just like rootid.

Signed-off-by: Qu Wenruo 
---
 btrfs-completion | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/btrfs-completion b/btrfs-completion
index 33830e827977..b22a951b3a65 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -133,7 +133,7 @@ _btrfs()
_btrfs_mnts
return 0
;;
-   rootid)
+   
dump-tree|dump-super|rootid|inode-resolve)
_filedir
return 0
;;
-- 
2.18.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs-progs: completion: Use _filedir() to replace _btrfs_devs()

2018-07-17 Thread Qu Wenruo
For developers it's pretty common to call "btrfs check" on a raw image
dump other than real block device.

So current _btrfs_devs() is really making things worse. Use _filedir()
to replace _btrfs_devs() so it can complete any filenames, no matter if
it's just a file or a real block device.

Signed-off-by: Qu Wenruo 
---
 btrfs-completion | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/btrfs-completion b/btrfs-completion
index ae683f4ecf61..33830e827977 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -6,9 +6,7 @@
 
 _btrfs_devs()
 {
-   local DEVS
-   DEVS=''; while read dev; do DEVS+="$dev "; done < <(lsblk -pnro name)
-   COMPREPLY+=( $( compgen -W "$DEVS" -- "$cur" ) )
+   _filedir
 }
 
 _btrfs_mnts()
-- 
2.18.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] btrfs-progs: completion: Small fixes to make debug simpler

2018-07-17 Thread Qu Wenruo
For developer, it's pretty common to use "btrfs check" or "btrfs ins
dump-tree" on raw dumps.

However "btrfs check" can only complete real block devices, and
"btrfs inspect dump-tree" can only complete dir.

Make them to use _filedir() so any filename can be completed and save us
developer a little time and nerve hitting that holy tab.

Qu Wenruo (2):
  btrfs-progs: completion: Use _filedir() to replace _btrfs_devs()
  btrfs-progs: completion: Let dump-tree/dump-super/inode-resolve to
accept any file

 btrfs-completion | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

-- 
2.18.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html