[PULL] uuid_mutex fixes and cleanups part2
These patches we sent independently before and are in the mailing list. They have been tested successfully using the xfstests. The cyclical lockdep warning aren't due to these set of patches and they (2) can be ignored as they are harmless because the threads involved are ioctl/commit and mount thread which lockdep thinks is warn-able can't co-exist in the same time space. So this set is ready to be pulled in. Thanks. g...@github.com:asj/btrfs-devel.git misc-next-jul18 1/7 has been discussed and reviewed at length, which is drop the unnecessary uuid_mutex in the btrfs_free_extra_devids(). 2/7 fixes the race which syzbot reported. 3/7 as when we sprout we hijack the seed fs_devices, we are bringing back the seed fs_devices using the clone_fs_devices() instead we could use our btrfs_scan_one_device() which makes the sprout-ing operation much cleaner. 4-7/7 are a simple cleanup patches. Anand Jain (7): 1. btrfs: drop uuid_mutex in btrfs_free_extra_devids() 2. btrfs: fix race between free_stale_devices and close_fs_devices 3. btrfs: do device clone using the btrfs_scan_one_device 4. btrfs: use the assigned fs_devices instead of the dereference 5. btrfs: warn for num_devices below 0 6. btrfs: add helper btrfs_num_devices() to deduce num_devices 7. btrfs: add helper function check device delete able fs/btrfs/volumes.c | 106 +++-- 1 file changed, 62 insertions(+), 44 deletions(-) -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH FIXED v7] Add cli and ioctl to forget scanned device(s)
[ Left out changes is now commited ]. v7: Availalbe for pull from btrfs-progs: g...@github.com:asj/btrfs-progs.git forget btrfs.ko: g...@github.com:asj/btrfs-devel.git misc-next-jul18 Use struct btrfs_ioctl_vol_args (instead of struct btrfs_ioctl_vol_args_v2) as its inline with other ioctl btrfs-control The CLI usage remains same. However internally the ioctl flag is not required to delete all the unmounted devices. Instead leave btrfs_ioctl_vol_args::name NULL. v6: Use the changed fn name btrfs_free_stale_devices(). Change in title: Old v5: Cover-letter: [PATCH v5] Add cli and ioctl to ignore a scanned device Kernel: [PATCH v5] btrfs: introduce feature to ignore a btrfs device Progs: [PATCH v5] btrfs-progs: add 'btrfs device ignore' cli v5: Adds feature to delete all stale devices Reuses btrfs_free_stale_devices() fn and so depends on the patch-set [1] in the ML. Uses struct btrfs_ioctl_vol_args_v2 instead of struct btrfs_ioctl_vol_args as arg Does the device path matching instead of btrfs_device matching (we won't delete the mounted device as btrfs_free_stale_devices() checks for it) v4: No change. But as the ML thread may be confusing, so resend. v3: No change. Send to correct ML. v2: Accepts review from Nikolay, details are in the specific patch. Patch 1/2 is renamed from [PATCH 1/2] btrfs: refactor btrfs_free_stale_device() to get device list delete to [PATCH 1/2] btrfs: add function to device list delete Adds cli and ioctl to forget a scanned device or forget all stale devices in the kernel. Anand Jain (1): btrfs: introduce feature to forget a btrfs device fs/btrfs/super.c | 3 +++ fs/btrfs/volumes.c | 9 + fs/btrfs/volumes.h | 1 + include/uapi/linux/btrfs.h | 2 ++ 4 files changed, 15 insertions(+) Anand Jain (1): btrfs-progs: add cli to forget one or all scanned devices cmds-device.c | 58 ++ ioctl.h | 2 ++ 2 files changed, 60 insertions(+) -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH FIXED v7] btrfs: introduce feature to forget a btrfs device
Support for a new command 'btrfs dev forget [dev]' is proposed here, to undo the effects of 'btrfs dev scan [dev]'. For this purpose, this patch proposes to use ioctl #5 as it was empty. IOW(BTRFS_IOCTL_MAGIC, 5, ..) This patch adds new ioctl BTRFS_IOC_FORGET_DEV which can be sent from the /dev/btrfs-control to forget one or all devices, (devices which are not mounted) from the btrfs kernel. The argument it takes is struct btrfs_ioctl_vol_args, and ::name can be set to specify the device path. And all unmounted devices can be removed from the kernel if no device path is provided. Again, the devices are removed only if the relevant fsid aren't mounted. Signed-off-by: Anand Jain --- fs/btrfs/super.c | 3 +++ fs/btrfs/volumes.c | 9 + fs/btrfs/volumes.h | 1 + include/uapi/linux/btrfs.h | 2 ++ 4 files changed, 15 insertions(+) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index d7a54c648c5f..196b0773eedb 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -2249,6 +2249,9 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd, ret = PTR_ERR_OR_ZERO(device); mutex_unlock(_mutex); break; + case BTRFS_IOC_FORGET_DEV: + ret = btrfs_forget_devices(vol->name); + break; case BTRFS_IOC_DEVICES_READY: mutex_lock(_mutex); device = btrfs_scan_one_device(vol->name, FMODE_READ, diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 51451c0dd8f5..73ae1c95e6f3 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1208,6 +1208,15 @@ static int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr, return 0; } +int btrfs_forget_devices(const char *path) +{ + mutex_lock(_mutex); + btrfs_free_stale_devices(strlen(path) ? path:NULL, NULL); + mutex_unlock(_mutex); + + return 0; +} + /* * Look for a btrfs signature on a device. This may be called out of the mount path * and we are not allowed to call set_blocksize during the scan. The superblock diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 06d8bb7dd557..82502a7ff6e4 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -405,6 +405,7 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, fmode_t flags, void *holder); struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags, void *holder); +int btrfs_forget_devices(const char *path); int btrfs_close_devices(struct btrfs_fs_devices *fs_devices); void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step); void btrfs_assign_next_active_device(struct btrfs_fs_info *fs_info, diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index 5ca1d21fc4a7..b1be7f828cb4 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -836,6 +836,8 @@ enum btrfs_err_code { struct btrfs_ioctl_vol_args) #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \ struct btrfs_ioctl_vol_args) +#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \ + struct btrfs_ioctl_vol_args) /* trans start and trans end are dangerous, and only for * use by applications that know how to avoid the * resulting deadlocks -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH FIXED v7] btrfs-progs: add cli to forget one or all scanned devices
This patch adds cli btrfs device forget [dev] to remove the given device structure in the kernel if the device is unmounted. If no argument is given it shall remove all stale (device which are not mounted) from the kernel. Signed-off-by: Anand Jain --- cmds-device.c | 58 ++ ioctl.h | 2 ++ 2 files changed, 60 insertions(+) diff --git a/cmds-device.c b/cmds-device.c index 86459d1b9564..49cfd4b41adb 100644 --- a/cmds-device.c +++ b/cmds-device.c @@ -326,6 +326,63 @@ out: return !!ret; } +static const char * const cmd_device_forget_usage[] = { + "btrfs device forget []", + "Forget a stale device or all stale devices in btrfs.ko", + NULL +}; + +static int btrfs_forget_devices(char *path) +{ + struct btrfs_ioctl_vol_args args; + int ret; + int fd; + + fd = open("/dev/btrfs-control", O_RDWR); + if (fd < 0) + return -errno; + + memset(, 0, sizeof(args)); + if (path) + strncpy_null(args.name, path); + ret = ioctl(fd, BTRFS_IOC_FORGET_DEV, ); + if (ret) + ret = -errno; + close(fd); + return ret; +} + +static int cmd_device_forget(int argc, char **argv) +{ + char *path; + int ret = 0; + + if (check_argc_max(argc - optind, 1)) + usage(cmd_device_forget_usage); + + if (argc == 1) { + ret = btrfs_forget_devices(NULL); + if (ret) + error("Can't forget: %s", strerror(-ret)); + return ret; + } + + path = canonicalize_path(argv[1]); + if (!path) { + error("Could not canonicalize path '%s': %s", + argv[1], strerror(errno)); + return -ENOENT; + } + + ret = btrfs_forget_devices(path); + if (ret) + error("Can't forget '%s': %s", path, strerror(-ret)); + + free(path); + + return ret; +} + static const char * const cmd_device_ready_usage[] = { "btrfs device ready ", "Check device to see if it has all of its devices in cache for mounting", @@ -601,6 +658,7 @@ const struct cmd_group device_cmd_group = { CMD_ALIAS }, { "remove", cmd_device_remove, cmd_device_remove_usage, NULL, 0 }, { "scan", cmd_device_scan, cmd_device_scan_usage, NULL, 0 }, + { "forget", cmd_device_forget, cmd_device_forget_usage, NULL, 0 }, { "ready", cmd_device_ready, cmd_device_ready_usage, NULL, 0 }, { "stats", cmd_device_stats, cmd_device_stats_usage, NULL, 0 }, { "usage", cmd_device_usage, diff --git a/ioctl.h b/ioctl.h index 709e996f401c..e27d80e09392 100644 --- a/ioctl.h +++ b/ioctl.h @@ -721,6 +721,8 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code) struct btrfs_ioctl_vol_args) #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \ struct btrfs_ioctl_vol_args) +#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \ + struct btrfs_ioctl_vol_args) /* trans start and trans end are dangerous, and only for * use by applications that know how to avoid the * resulting deadlocks -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7] Add cli and ioctl to forget scanned device(s)
Please ignore this. There is a line of code which is un-commit. I am sending this series again. Sorry for the noise. Thanks, Anand On 07/18/2018 11:07 AM, Anand Jain wrote: v7: Availalbe for pull from btrfs-progs: g...@github.com:asj/btrfs-progs.git forget btrfs.ko: g...@github.com:asj/btrfs-devel.git misc-next-jul18 Use struct btrfs_ioctl_vol_args (instead of struct btrfs_ioctl_vol_args_v2) as its inline with other ioctl btrfs-control The CLI usage remains same. However internally the ioctl flag is not required to delete all the unmounted devices. Instead leave btrfs_ioctl_vol_args::name NULL. v6: Use the changed fn name btrfs_free_stale_devices(). Change in title: Old v5: Cover-letter: [PATCH v5] Add cli and ioctl to ignore a scanned device Kernel: [PATCH v5] btrfs: introduce feature to ignore a btrfs device Progs: [PATCH v5] btrfs-progs: add 'btrfs device ignore' cli v5: Adds feature to delete all stale devices Reuses btrfs_free_stale_devices() fn and so depends on the patch-set [1] in the ML. Uses struct btrfs_ioctl_vol_args_v2 instead of struct btrfs_ioctl_vol_args as arg Does the device path matching instead of btrfs_device matching (we won't delete the mounted device as btrfs_free_stale_devices() checks for it) v4: No change. But as the ML thread may be confusing, so resend. v3: No change. Send to correct ML. v2: Accepts review from Nikolay, details are in the specific patch. Patch 1/2 is renamed from [PATCH 1/2] btrfs: refactor btrfs_free_stale_device() to get device list delete to [PATCH 1/2] btrfs: add function to device list delete Adds cli and ioctl to forget a scanned device or forget all stale devices in the kernel. Anand Jain (1): btrfs: introduce feature to forget a btrfs device fs/btrfs/super.c | 3 +++ fs/btrfs/volumes.c | 9 + fs/btrfs/volumes.h | 1 + include/uapi/linux/btrfs.h | 2 ++ 4 files changed, 15 insertions(+) Anand Jain (1): btrfs-progs: add cli to forget one or all scanned devices cmds-device.c | 58 ++ ioctl.h | 2 ++ 2 files changed, 60 insertions(+) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v7] Add cli and ioctl to forget scanned device(s)
v7: Availalbe for pull from btrfs-progs: g...@github.com:asj/btrfs-progs.git forget btrfs.ko: g...@github.com:asj/btrfs-devel.git misc-next-jul18 Use struct btrfs_ioctl_vol_args (instead of struct btrfs_ioctl_vol_args_v2) as its inline with other ioctl btrfs-control The CLI usage remains same. However internally the ioctl flag is not required to delete all the unmounted devices. Instead leave btrfs_ioctl_vol_args::name NULL. v6: Use the changed fn name btrfs_free_stale_devices(). Change in title: Old v5: Cover-letter: [PATCH v5] Add cli and ioctl to ignore a scanned device Kernel: [PATCH v5] btrfs: introduce feature to ignore a btrfs device Progs: [PATCH v5] btrfs-progs: add 'btrfs device ignore' cli v5: Adds feature to delete all stale devices Reuses btrfs_free_stale_devices() fn and so depends on the patch-set [1] in the ML. Uses struct btrfs_ioctl_vol_args_v2 instead of struct btrfs_ioctl_vol_args as arg Does the device path matching instead of btrfs_device matching (we won't delete the mounted device as btrfs_free_stale_devices() checks for it) v4: No change. But as the ML thread may be confusing, so resend. v3: No change. Send to correct ML. v2: Accepts review from Nikolay, details are in the specific patch. Patch 1/2 is renamed from [PATCH 1/2] btrfs: refactor btrfs_free_stale_device() to get device list delete to [PATCH 1/2] btrfs: add function to device list delete Adds cli and ioctl to forget a scanned device or forget all stale devices in the kernel. Anand Jain (1): btrfs: introduce feature to forget a btrfs device fs/btrfs/super.c | 3 +++ fs/btrfs/volumes.c | 9 + fs/btrfs/volumes.h | 1 + include/uapi/linux/btrfs.h | 2 ++ 4 files changed, 15 insertions(+) Anand Jain (1): btrfs-progs: add cli to forget one or all scanned devices cmds-device.c | 58 ++ ioctl.h | 2 ++ 2 files changed, 60 insertions(+) -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v7] btrfs-progs: add cli to forget one or all scanned devices
This patch adds cli btrfs device forget [dev] to remove the given device structure in the kernel if the device is unmounted. If no argument is given it shall remove all stale (device which are not mounted) from the kernel. Signed-off-by: Anand Jain --- cmds-device.c | 58 ++ ioctl.h | 2 ++ 2 files changed, 60 insertions(+) diff --git a/cmds-device.c b/cmds-device.c index 86459d1b9564..49cfd4b41adb 100644 --- a/cmds-device.c +++ b/cmds-device.c @@ -326,6 +326,63 @@ out: return !!ret; } +static const char * const cmd_device_forget_usage[] = { + "btrfs device forget []", + "Forget a stale device or all stale devices in btrfs.ko", + NULL +}; + +static int btrfs_forget_devices(char *path) +{ + struct btrfs_ioctl_vol_args args; + int ret; + int fd; + + fd = open("/dev/btrfs-control", O_RDWR); + if (fd < 0) + return -errno; + + memset(, 0, sizeof(args)); + if (path) + strncpy_null(args.name, path); + ret = ioctl(fd, BTRFS_IOC_FORGET_DEV, ); + if (ret) + ret = -errno; + close(fd); + return ret; +} + +static int cmd_device_forget(int argc, char **argv) +{ + char *path; + int ret = 0; + + if (check_argc_max(argc - optind, 1)) + usage(cmd_device_forget_usage); + + if (argc == 1) { + ret = btrfs_forget_devices(NULL); + if (ret) + error("Can't forget: %s", strerror(-ret)); + return ret; + } + + path = canonicalize_path(argv[1]); + if (!path) { + error("Could not canonicalize path '%s': %s", + argv[1], strerror(errno)); + return -ENOENT; + } + + ret = btrfs_forget_devices(path); + if (ret) + error("Can't forget '%s': %s", path, strerror(-ret)); + + free(path); + + return ret; +} + static const char * const cmd_device_ready_usage[] = { "btrfs device ready ", "Check device to see if it has all of its devices in cache for mounting", @@ -601,6 +658,7 @@ const struct cmd_group device_cmd_group = { CMD_ALIAS }, { "remove", cmd_device_remove, cmd_device_remove_usage, NULL, 0 }, { "scan", cmd_device_scan, cmd_device_scan_usage, NULL, 0 }, + { "forget", cmd_device_forget, cmd_device_forget_usage, NULL, 0 }, { "ready", cmd_device_ready, cmd_device_ready_usage, NULL, 0 }, { "stats", cmd_device_stats, cmd_device_stats_usage, NULL, 0 }, { "usage", cmd_device_usage, diff --git a/ioctl.h b/ioctl.h index 709e996f401c..e27d80e09392 100644 --- a/ioctl.h +++ b/ioctl.h @@ -721,6 +721,8 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code) struct btrfs_ioctl_vol_args) #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \ struct btrfs_ioctl_vol_args) +#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \ + struct btrfs_ioctl_vol_args) /* trans start and trans end are dangerous, and only for * use by applications that know how to avoid the * resulting deadlocks -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v7] btrfs: introduce feature to forget a btrfs device
Support for a new command 'btrfs dev forget [dev]' is proposed here, to undo the effects of 'btrfs dev scan [dev]'. For this purpose, this patch proposes to use ioctl #5 as it was empty. IOW(BTRFS_IOCTL_MAGIC, 5, ..) This patch adds new ioctl BTRFS_IOC_FORGET_DEV which can be sent from the /dev/btrfs-control to forget one or all devices, (devices which are not mounted) from the btrfs kernel. The argument it takes is struct btrfs_ioctl_vol_args, and ::name can be set to specify the device path. And all unmounted devices can be removed from the kernel if no device path is provided. Again, the devices are removed only if the relevant fsid aren't mounted. Signed-off-by: Anand Jain --- fs/btrfs/super.c | 3 +++ fs/btrfs/volumes.c | 9 + fs/btrfs/volumes.h | 1 + include/uapi/linux/btrfs.h | 2 ++ 4 files changed, 15 insertions(+) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index d7a54c648c5f..196b0773eedb 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -2249,6 +2249,9 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd, ret = PTR_ERR_OR_ZERO(device); mutex_unlock(_mutex); break; + case BTRFS_IOC_FORGET_DEV: + ret = btrfs_forget_devices(vol->name); + break; case BTRFS_IOC_DEVICES_READY: mutex_lock(_mutex); device = btrfs_scan_one_device(vol->name, FMODE_READ, diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 51451c0dd8f5..bc1e19663e81 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1208,6 +1208,15 @@ static int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr, return 0; } +int btrfs_forget_devices(const char *path) +{ + mutex_lock(_mutex); + btrfs_free_stale_devices(path, NULL); + mutex_unlock(_mutex); + + return 0; +} + /* * Look for a btrfs signature on a device. This may be called out of the mount path * and we are not allowed to call set_blocksize during the scan. The superblock diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 06d8bb7dd557..82502a7ff6e4 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -405,6 +405,7 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, fmode_t flags, void *holder); struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags, void *holder); +int btrfs_forget_devices(const char *path); int btrfs_close_devices(struct btrfs_fs_devices *fs_devices); void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step); void btrfs_assign_next_active_device(struct btrfs_fs_info *fs_info, diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index 5ca1d21fc4a7..b1be7f828cb4 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -836,6 +836,8 @@ enum btrfs_err_code { struct btrfs_ioctl_vol_args) #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \ struct btrfs_ioctl_vol_args) +#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \ + struct btrfs_ioctl_vol_args) /* trans start and trans end are dangerous, and only for * use by applications that know how to avoid the * resulting deadlocks -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs check (not lowmem) and OOM-like hangs (4.17.6)
On Wed, Jul 18, 2018 at 08:05:51AM +0800, Qu Wenruo wrote: > No OOM triggers? That's a little strange. > Maybe it's related to how kernel handles memory over-commit? Yes, I think you are correct. > And for the hang, I think it's related to some memory allocation failure > and error handler just didn't handle it well, so it's causing deadlock > for certain page. That indeed matches what I'm seeing. > ENOMEM handling is pretty common but hardly verified, so it's not that > strange, but we must locate the problem. I seem to be getting deadlocks in the kernel, so I'm hoping that at least it's checked there, but maybe not? > In my system, at least I'm not using btrfs as root fs, and for the > memory eating program I normally ensure it's eating all the memory + > swap, so OOM killer is always triggered, maybe that's the cause. > > So in your case, maybe it's btrfs not really taking up all memory, thus > OOM killer not triggered. Correct, the swap is not used. > Any kernel dmesg about OOM killer triggered? Nothing at all. It never gets triggered. > > Here is my system when it virtually died: > > ER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND > > root 31006 21.2 90.7 29639020 29623180 pts/19 D+ 13:49 1:35 ./btrfs > > check /dev/mapper/dshelf2 See how btrs was taking 29GB in that ps output (that's before it takes everything and I can't even type ps anymore) Note that VSZ is almost equal to RSS. Nothing gets swapped. Then see free output: > > total used free sharedbuffers cached > > Mem: 32643788 32180100 463688 0 44664 119508 > > -/+ buffers/cache: 32015928 627860 > > Swap: 15616764 443676 15173088 > > For swap, it looks like only some other program's memory is swapped out, > not btrfs'. That's exactly correct. btrfs check never goes to swap, I'm not sure why, and because there is virtual memory free, maybe that's why OOM does not trigger? So I guess I can probably "fix" my problem by removing swap, but ultimately it would be useful to know why memory taken by btrfs check does not end up in swap. > And unfortunately, I'm not so familiar with OOM/MM code outside of > filesystem. > Any help from other experienced developers would definitely help to > solve why memory of 'btrfs check' is not swapped out or why OOM killer > is not triggered. Do you have someone from linux-vm you might be able to ask, or should we Cc this thread there? Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs check (not lowmem) and OOM-like hangs (4.17.6)
On 2018年07月18日 04:59, Marc MERLIN wrote: > Ok, I did more testing. Qu is right that btrfs check does not crash the > kernel. > It just takes all the memory until linux hangs everywhere, and somehow (no > idea why) > the OOM killer never triggers. No OOM triggers? That's a little strange. Maybe it's related to how kernel handles memory over-commit? And for the hang, I think it's related to some memory allocation failure and error handler just didn't handle it well, so it's causing deadlock for certain page. ENOMEM handling is pretty common but hardly verified, so it's not that strange, but we must locate the problem. > Details below: > > On Tue, Jul 17, 2018 at 01:32:57PM -0700, Marc MERLIN wrote: >> Here is what I got when the system was not doing well (it took minutes to >> run): >> >> total used free sharedbuffers cached >> Mem: 32643788 32070952 572836 0 1021604378772 >> -/+ buffers/cache: 275900205053768 >> Swap: 15616764 973596 14643168 > > ok, the reason it was not that close to 0 was due to /dev/shm it seems. > I cleared that, and now I can get it to go to near 0 again. > I'm wrong about the system being fully crashed, it's not, it's just very > close to being hung. > I can type killall -9 btrfs in the serial console and wait a few minutes. > The system eventually recovers, but it's impossible to fix anything via ssh > apparently because networking does not get to run when I'm in this state. > > I'm not sure why my system reproduces this easy while Qu's system does not, > but Qu was right that the kernel is not dead and that it's merely a problem > of userspace > taking all the RAM and somehow not being killed by OOM In my system, at least I'm not using btrfs as root fs, and for the memory eating program I normally ensure it's eating all the memory + swap, so OOM killer is always triggered, maybe that's the cause. So in your case, maybe it's btrfs not really taking up all memory, thus OOM killer not triggered. > > I checked the PID and don't see why it's not being killed: > gargamel:/proc/31006# grep . oom* > oom_adj:0 > oom_score:221 << this increases a lot, but OOM never kills it > oom_score_adj:0 > > I have these variables: > /proc/sys/vm/oom_dump_tasks:1 > /proc/sys/vm/oom_kill_allocating_task:0 > /proc/sys/vm/overcommit_kbytes:0 > /proc/sys/vm/overcommit_memory:0 > /proc/sys/vm/overcommit_ratio:50 << is this bad (seems default) Any kernel dmesg about OOM killer triggered? > > Here is my system when it virtually died: > ER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND > root 31006 21.2 90.7 29639020 29623180 pts/19 D+ 13:49 1:35 ./btrfs > check /dev/mapper/dshelf2 > > total used free sharedbuffers cached > Mem: 32643788 32180100 463688 0 44664 119508 > -/+ buffers/cache: 32015928 627860 > Swap: 15616764 443676 15173088 For swap, it looks like only some other program's memory is swapped out, not btrfs'. And unfortunately, I'm not so familiar with OOM/MM code outside of filesystem. Any help from other experienced developers would definitely help to solve why memory of 'btrfs check' is not swapped out or why OOM killer is not triggered. Thanks, Qu > > MemTotal: 32643788 kB > MemFree: 463440 kB > MemAvailable: 44864 kB > Buffers: 44664 kB > Cached: 120360 kB > SwapCached:87064 kB > Active: 30381404 kB > Inactive: 585952 kB > Active(anon): 30334696 kB > Inactive(anon): 474624 kB > Active(file): 46708 kB > Inactive(file): 111328 kB > Unevictable:5616 kB > Mlocked:5616 kB > SwapTotal: 15616764 kB > SwapFree: 15173088 kB > Dirty: 1636 kB > Writeback: 4 kB > AnonPages: 30734240 kB > Mapped:67236 kB > Shmem: 3036 kB > Slab: 267884 kB > SReclaimable: 51528 kB > SUnreclaim: 216356 kB > KernelStack: 10144 kB > PageTables:69284 kB > NFS_Unstable: 0 kB > Bounce:0 kB > WritebackTmp: 0 kB > CommitLimit:31938656 kB > Committed_AS: 32865492 kB > VmallocTotal: 34359738367 kB > VmallocUsed: 0 kB > VmallocChunk: 0 kB > HardwareCorrupted: 0 kB > AnonHugePages: 0 kB > ShmemHugePages:0 kB > ShmemPmdMapped:0 kB > CmaTotal: 16384 kB > CmaFree: 0 kB > HugePages_Total: 0 > HugePages_Free:0 > HugePages_Rsvd:0 > HugePages_Surp:0 > Hugepagesize: 2048 kB > Hugetlb: 0 kB > DirectMap4k: 560404 kB > DirectMap2M:32692224 kB > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH resend 1/2] btrfs: allow defrag on a file opened ro that has rw permissions
Requiring a rw descriptor conflicts both ways with exec, returning ETXTBSY whenever you try to defrag a program that's currently being run, or causing intermittent exec failures on a live system being defragged. As defrag doesn't change the file's contents in any way, there's no reason to consider it a rw operation. Thus, let's check only whether the file could have been opened rw. Such access control is still needed as currently defrag can use extra disk space, and might trigger bugs. Signed-off-by: Adam Borowski --- fs/btrfs/ioctl.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 43ecbe620dea..01c150b6ab62 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2941,7 +2941,8 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp) ret = btrfs_defrag_root(root); break; case S_IFREG: - if (!(file->f_mode & FMODE_WRITE)) { + if (!capable(CAP_SYS_ADMIN) && + inode_permission(inode, MAY_WRITE)) { ret = -EINVAL; goto out; } -- 2.18.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH resend 2/2] btrfs: defrag: return EPERM not EINVAL when only permissions fail
We give EINVAL when the request is invalid; here it's ok but merely the user has insufficient privileges. Thus, this return value reflects the error better -- as discussed in the identical case for dedupe. According to codesearch.debian.net, no userspace program distinguishes these values beyond strerror(). Signed-off-by: Adam Borowski --- fs/btrfs/ioctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 01c150b6ab62..e96e3c3caca1 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2943,7 +2943,7 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp) case S_IFREG: if (!capable(CAP_SYS_ADMIN) && inode_permission(inode, MAY_WRITE)) { - ret = -EINVAL; + ret = -EPERM; goto out; } -- 2.18.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH resend 0/2] btrfs: fix races between exec and defrag
Hi! Here's a ping for a patch to fix ETXTBSY races between defrag and exec, just like the dedupe counterpart. Unlike that one which is shared to multiple filesystems and thus lives in Al Viro's land, it is btrfs only. Attached: a simple tool to fragment a file, by ten O_SYNC rewrites of length 1 at random positions; racey vs concurrent writes or execs but shouldn't damage the file otherwise. Also attached: a preliminary patch for -progs; it yet lacks a check for the kernel version, but to add such a check we'd need to know which kernels actually permit ro defrag for non-root. No man page patch -- there's no man page to be patched... Meow! -- // If you believe in so-called "intellectual property", please immediately // cease using counterfeit alphabets. Instead, contact the nearest temple // of Amon, whose priests will provide you with scribal services for all // your writing needs, for Reasonable And Non-Discriminatory prices. #include #include #include #include #include #include #include #include #include #include #include static void die(const char *txt, ...) __attribute__((format (printf, 1, 2))); static void die(const char *txt, ...) { fprintf(stderr, "fragme: "); va_list ap; va_start(ap, txt); vfprintf(stderr, txt, ap); va_end(ap); exit(1); } static uint64_t rnd(uint64_t max) { __uint128_t r; if (syscall(SYS_getrandom, , sizeof(r), 0)==-1) die("getrandom(): %m\n"); return r%max; } int main(int argc, char **argv) { if (argc!=2) die("Usage: fragme \n"); int fd = open(argv[1], O_RDWR|O_SYNC); if (fd == -1) die("open(\"%s\"): %m\n", argv[1]); off_t size = lseek(fd, 0, SEEK_END); if (size == -1) die("lseek(SEEK_END): %m\n"); for (int i=0; i<10; ++i) { off_t off = rnd(size); char b; if (lseek(fd, off, SEEK_SET) != off) die("lseek for read: %m\n"); if (read(fd, , 1) != 1) die("read(%lu): %m\n", off); if (lseek(fd, off, SEEK_SET) != off) die("lseek for write: %m\n"); if (write(fd, , 1) != 1) die("write: %m\n"); } return 0; } >From d040af09adb03daadbba4336700f40425a860320 Mon Sep 17 00:00:00 2001 From: Adam Borowski Date: Tue, 28 Nov 2017 01:00:21 +0100 Subject: [PATCH] defrag: open files RO NOT FOR MERGING -- requires kernel versioning Fixes EXTXBSY races. Signed-off-by: Adam Borowski --- cmds-filesystem.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/cmds-filesystem.c b/cmds-filesystem.c index 30a50bf5..7eb6b7bb 100644 --- a/cmds-filesystem.c +++ b/cmds-filesystem.c @@ -876,7 +876,7 @@ static int defrag_callback(const char *fpath, const struct stat *sb, if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) { if (defrag_global_verbose) printf("%s\n", fpath); - fd = open(fpath, O_RDWR); + fd = open(fpath, O_RDONLY); if (fd < 0) { goto error; } @@ -1012,7 +1012,7 @@ static int cmd_filesystem_defrag(int argc, char **argv) int defrag_err = 0; dirstream = NULL; - fd = open_file_or_dir(argv[i], ); + fd = open_file_or_dir3(argv[i], , O_RDONLY); if (fd < 0) { error("cannot open %s: %m", argv[i]); ret = -errno; -- 2.18.0
Re: [PATCH 0/4] 3- and 4- copy RAID1
Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as excerpted: > On 07/15/2018 04:37 PM, waxhead wrote: > Striping and mirroring/pairing are orthogonal properties; mirror and > parity are mutually exclusive. I can't agree. I don't know whether you meant that in the global sense, or purely in the btrfs context (which I suspect), but either way I can't agree. In the pure btrfs context, while striping and mirroring/pairing are orthogonal today, Hugo's whole point was that btrfs is theoretically flexible enough to allow both together and the feature may at some point be added, so it makes sense to have a layout notation format flexible enough to allow it as well. In the global context, just to complete things and mostly for others reading as I feel a bit like a simpleton explaining to the expert here, just as raid10 is shorthand for raid1+0, aka raid0 layered on top of raid1 (normally preferred to raid01 due to rebuild characteristics, and as opposed to raid01, aka raid0+1, aka raid1 on top of raid0, sometimes recommended as btrfs raid1 on top of whatever raid0 here due to btrfs' data integrity characteristics and less optimized performance), so there's also raid51 and raid15, raid61 and raid16, etc, with or without the + symbols, involving mirroring and parity conceptually at two different levels altho they can be combined in a single implementation just as raid10 and raid01 commonly are. These additional layered-raid levels can be used for higher reliability, with differing rebuild and performance characteristics between the two forms depending on which is the top layer. > Question #1: for "parity" profiles, does make sense to limit the maximum > disks number where the data may be spread ? If the answer is not, we > could omit the last S. IMHO it should. As someone else already replied, btrfs doesn't currently have the ability to specify spread limit, but the idea if we're going to change the notation is to allow for the flexibility in the new notation so the feature can be added later without further notation changes. Why might it make sense to specify spread? At least two possible reasons: a) (stealing an already posted example) Consider a multi-device layout with two or more device sizes. Someone may want to limit the spread in ordered to keep performance and risk consistent as the smaller devices fill up, limiting further usage to a lower number of devices. If that lower number is specified as the spread originally it'll make things more consistent between the room on all devices case and the room on only some devices case. b) Limiting spread can change the risk and rebuild performance profiles. Stripes of full width mean all stripes have a strip on each device, so knock a device out and (assuming parity or mirroring) replace it, and all stripes are degraded and must be rebuilt. With less than maximum spread, some stripes won't be stripped to the replaced device, and won't be degraded or need rebuilt, tho assuming the same overall fill, a larger percentage of stripes that /do/ need rebuilt will be on the replaced device. So the risk profile is more "objects" (stripes/chunks/files) affected but less of each object, or less of the total affected, but more of each affected object. > Question #2: historically RAID10 is requires 4 disks. However I am > guessing if the stripe could be done on a different number of disks: > What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is > that every 64k, the data are stored on a different disk As someone else pointed out, md/lvm-raid10 already work like this. What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty much works this way except with huge (gig size) chunks. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs check (not lowmem) and OOM-like hangs (4.17.6)
Ok, I did more testing. Qu is right that btrfs check does not crash the kernel. It just takes all the memory until linux hangs everywhere, and somehow (no idea why) the OOM killer never triggers. Details below: On Tue, Jul 17, 2018 at 01:32:57PM -0700, Marc MERLIN wrote: > Here is what I got when the system was not doing well (it took minutes to > run): > > total used free sharedbuffers cached > Mem: 32643788 32070952 572836 0 1021604378772 > -/+ buffers/cache: 275900205053768 > Swap: 15616764 973596 14643168 ok, the reason it was not that close to 0 was due to /dev/shm it seems. I cleared that, and now I can get it to go to near 0 again. I'm wrong about the system being fully crashed, it's not, it's just very close to being hung. I can type killall -9 btrfs in the serial console and wait a few minutes. The system eventually recovers, but it's impossible to fix anything via ssh apparently because networking does not get to run when I'm in this state. I'm not sure why my system reproduces this easy while Qu's system does not, but Qu was right that the kernel is not dead and that it's merely a problem of userspace taking all the RAM and somehow not being killed by OOM I checked the PID and don't see why it's not being killed: gargamel:/proc/31006# grep . oom* oom_adj:0 oom_score:221 << this increases a lot, but OOM never kills it oom_score_adj:0 I have these variables: /proc/sys/vm/oom_dump_tasks:1 /proc/sys/vm/oom_kill_allocating_task:0 /proc/sys/vm/overcommit_kbytes:0 /proc/sys/vm/overcommit_memory:0 /proc/sys/vm/overcommit_ratio:50 << is this bad (seems default) Here is my system when it virtually died: ER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND root 31006 21.2 90.7 29639020 29623180 pts/19 D+ 13:49 1:35 ./btrfs check /dev/mapper/dshelf2 total used free sharedbuffers cached Mem: 32643788 32180100 463688 0 44664 119508 -/+ buffers/cache: 32015928 627860 Swap: 15616764 443676 15173088 MemTotal: 32643788 kB MemFree: 463440 kB MemAvailable: 44864 kB Buffers: 44664 kB Cached: 120360 kB SwapCached:87064 kB Active: 30381404 kB Inactive: 585952 kB Active(anon): 30334696 kB Inactive(anon): 474624 kB Active(file): 46708 kB Inactive(file): 111328 kB Unevictable:5616 kB Mlocked:5616 kB SwapTotal: 15616764 kB SwapFree: 15173088 kB Dirty: 1636 kB Writeback: 4 kB AnonPages: 30734240 kB Mapped:67236 kB Shmem: 3036 kB Slab: 267884 kB SReclaimable: 51528 kB SUnreclaim: 216356 kB KernelStack: 10144 kB PageTables:69284 kB NFS_Unstable: 0 kB Bounce:0 kB WritebackTmp: 0 kB CommitLimit:31938656 kB Committed_AS: 32865492 kB VmallocTotal: 34359738367 kB VmallocUsed: 0 kB VmallocChunk: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 0 kB ShmemHugePages:0 kB ShmemPmdMapped:0 kB CmaTotal: 16384 kB CmaFree: 0 kB HugePages_Total: 0 HugePages_Free:0 HugePages_Rsvd:0 HugePages_Surp:0 Hugepagesize: 2048 kB Hugetlb: 0 kB DirectMap4k: 560404 kB DirectMap2M:32692224 kB -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
btrfs check (not lowmem) and OOM-like hangs (4.17.6)
On Tue, Jul 17, 2018 at 10:50:32AM -0700, Marc MERLIN wrote: > I got the following on 4.17.6 while running btrfs check --repair on an > unmounted filesystem (not the lowmem version) > > I understand that btrfs check is userland only, although it seems that > it caused these FS hangs on a different filesystem (the trace of course > does not provide info on which FS) > > Any idea what happened here? > I'm going to wait a few hours without running btrfs check to see if it > happens again and then if running btrfs check will re-create this issue, > but other suggestions (if any), are welcome: Hi Qu, I know we were talking about this last week and then, btrfs check just worked for me so I wasn't able to reproduce. Now I'm able to reproduce again. I tried again, it's definitely triggered by btrfs check --repair I tried to capture what happens, and memory didn't dip to 0, but the system got very slow and things started failing. btrfs was never killed though while ssh was. Is there a chance that maybe btrfs is in some kernel OOM exclude list? Here is what I got when the system was not doing well (it took minutes to run): total used free sharedbuffers cached Mem: 32643788 32070952 572836 0 1021604378772 -/+ buffers/cache: 275900205053768 Swap: 15616764 973596 14643168 gargamel:~# cat /proc/meminfo MemTotal: 32643788 kB MemFree: 2726276 kB MemAvailable:2502200 kB Buffers: 12360 kB Cached: 1676388 kB SwapCached: 11048580 kB Active: 16443004 kB Inactive: 12010456 kB Active(anon): 16287780 kB Inactive(anon): 11651692 kB Active(file): 155224 kB Inactive(file): 358764 kB Unevictable:5776 kB Mlocked:5776 kB SwapTotal: 15616764 kB SwapFree: 294592 kB Dirty: 3032 kB Writeback: 76064 kB AnonPages: 15723272 kB Mapped: 612124 kB Shmem: 1171032 kB Slab: 399824 kB SReclaimable: 84568 kB SUnreclaim: 315256 kB KernelStack: 20576 kB PageTables:94268 kB NFS_Unstable: 0 kB Bounce:0 kB WritebackTmp: 0 kB CommitLimit:31938656 kB Committed_AS: 37909452 kB VmallocTotal: 34359738367 kB VmallocUsed: 0 kB VmallocChunk: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 98304 kB ShmemHugePages:0 kB ShmemPmdMapped:0 kB CmaTotal: 16384 kB CmaFree: 0 kB HugePages_Total: 0 HugePages_Free:0 HugePages_Rsvd:0 HugePages_Surp:0 Hugepagesize: 2048 kB Hugetlb: 0 kB DirectMap4k: 355604 kB DirectMap2M:32897024 kB and console: [ 9184.345329] INFO: task zmtrigger.pl:9981 blocked for more than 120 seconds. [ 9184.366258] Not tainted 4.17.6-amd64-preempt-sysrq-20180818 #4 [ 9184.385323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9184.408803] zmtrigger.plD0 9981 9804 0x20020080 [ 9184.425249] Call Trace: [ 9184.432580] ? __schedule+0x53e/0x59b [ 9184.443551] schedule+0x7f/0x98 [ 9184.452960] io_schedule+0x16/0x38 [ 9184.463154] wait_on_page_bit_common+0x10c/0x199 [ 9184.476996] ? file_check_and_advance_wb_err+0xd7/0xd7 [ 9184.493339] shmem_getpage_gfp+0x2dd/0x975 [ 9184.506558] shmem_fault+0x188/0x1c3 [ 9184.518199] ? filemap_map_pages+0x6f/0x295 [ 9184.531680] __do_fault+0x1d/0x6e [ 9184.542505] __handle_mm_fault+0x675/0xa61 [ 9184.555653] ? list_move+0x21/0x3a [ 9184.566737] handle_mm_fault+0x11c/0x16b [ 9184.579355] __do_page_fault+0x324/0x41c [ 9184.591996] ? page_fault+0x8/0x30 [ 9184.603059] page_fault+0x1e/0x30 [ 9184.613846] RIP: 0023:0xf7d2d022 [ 9184.624366] RSP: 002b:ffeb9fe8 EFLAGS: 00010202 [ 9184.640868] RAX: f7eed000 RBX: 567e6000 RCX: 0004 [ 9184.663095] RDX: 587fecb0 RSI: 5876538c RDI: 0004 [ 9184.685308] RBP: 58185160 R08: R09: [ 9184.707524] R10: R11: 0286 R12: [ 9184.729757] R13: R14: R15: [ 9184.751988] INFO: task /usr/sbin/apach:11868 blocked for more than 120 seconds. [ 9184.775106] Not tainted 4.17.6-amd64-preempt-sysrq-20180818 #4 [ 9184.795072] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9184.819423] /usr/sbin/apach D0 11868 11311 0x20020080 [ 9184.836748] Call Trace: [ 9184.844926] ? __schedule+0x53e/0x59b [ 9184.856811] schedule+0x7f/0x98 [ 9184.867075] io_schedule+0x16/0x38 [ 9184.878114] wait_on_page_bit_common+0x10c/0x199 [ 9184.892807] ? file_check_and_advance_wb_err+0xd7/0xd7 [ 9184.909036] shmem_getpage_gfp+0x2dd/0x975 [ 9184.922157] shmem_fault+0x188/0x1c3 [ 9184.933667] ? filemap_map_pages+0x6f/0x295 [ 9184.947504] __do_fault+0x1d/0x6e [ 9184.958234] __handle_mm_fault+0x675/0xa61 [
Re: [RESEND][PATCH v4 0/2] vfs: better dedupe permission check
CCing Michael Kerrisk and linux-api The patch at the end of this e-mail updates our man page for ioctl_fideduperange to reflect the changes proposed in this patch series: https://marc.info/?l=linux-fsdevel=153185457324037=2 Any feedback would be appreciated. Thanks in advance. --Mark On Tue, Jul 17, 2018 at 12:48:18PM -0700, Darrick J. Wong wrote: > On Tue, Jul 17, 2018 at 12:09:04PM -0700, Mark Fasheh wrote: > > Hi Al, > > > > The following patches fix a couple of issues with the permission check > > we do in vfs_dedupe_file_range(). I sent them out for a few times now, > > a changelog is attached. If they look ok to you, I'd appreciate them > > being pushed upstream. > > > > You can get them from git if you like: > > > > git pull https://github.com/markfasheh/linux dedupe-perms > > > > I also have a set of patches against 4.17 if you prefer. The code and > > testing are identical: > > > > git pull https://github.com/markfasheh/linux dedupe-perms-v4.17 > > > > > > The first patch expands our check to allow dedupe of a file if the > > user owns it or otherwise would be allowed to write to it. > > > > Current behavior is that we'll allow dedupe only if: > > > > - the user is an admin (root) > > - the user has the file open for write > > > > This makes it impossible for a user to dedupe their own file set > > unless they do it as root, or ensure that all files have write > > permission. There's a couple of duperemove bugs open for this: > > > > https://github.com/markfasheh/duperemove/issues/129 > > https://github.com/markfasheh/duperemove/issues/86 > > > > The other problem we have is also related to forcing the user to open > > target files for write - A process trying to exec a file currently > > being deduped gets ETXTBUSY. The answer (as above) is to allow them to > > open the targets ro - root can already do this. There was a patch from > > Adam Borowski to fix this back in 2016: > > > > https://lkml.org/lkml/2016/7/17/130 > > > > which I have incorporated into my changes. > > > > > > The 2nd patch fixes our return code for permission denied to be > > EPERM. For some reason we're returning EINVAL - I think that's > > probably my fault. At any rate, we need to be returning something > > descriptive of the actual problem, otherwise callers see EINVAL and > > can't really make a valid determination of what's gone wrong. > > > > This has also popped up in duperemove, mostly in the form of cryptic > > error messages. Because this is a code returned to userspace, I did > > check the other users of extent-same that I could find. Both 'bees' > > and 'rust-btrfs' do the same as duperemove and simply report the error > > (as they should). > > > > Please apply. > > > > Thanks, > > --Mark > > > > Changes from V3 to V4: > > - Add a patch (below) to ioctl_fideduperange.2 explaining our > > changes. I will send this patch once the kernel update is > > accepted. Thanks to Darrick Wong for this suggestion. > > - V3 discussion: https://www.spinics.net/lists/linux-btrfs/msg79135.html > > > > Changes from V2 to V3: > > - Return bool from allow_file_dedupe > > - V2 discussion: https://www.spinics.net/lists/linux-btrfs/msg78421.html > > > > Changes from V1 to V2: > > - Add inode_permission check as suggested by Adam Borowski > > - V1 discussion: https://marc.info/?l=linux-xfs=152606684017965=2 > > > > > > From: Mark Fasheh > > > > [PATCH] ioctl_fideduperange.2: clarify permission requirements > > > > dedupe permission checks were recently relaxed - update our man page to > > reflect those changes. > > > > Signed-off-by: Mark Fasheh > > --- > > man2/ioctl_fideduperange.2 | 8 +--- > > Mmm, man page update, thank you for editing the documentation too! > > Please cc linux-api and Michael Kerrisk so this can go upstream. From: Mark Fasheh [PATCH] ioctl_fideduperange.2: clarify permission requirements dedupe permission checks were recently relaxed - update our man page to reflect those changes. Signed-off-by: Mark Fasheh --- man2/ioctl_fideduperange.2 | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/man2/ioctl_fideduperange.2 b/man2/ioctl_fideduperange.2 index 84d20a276..4040ee064 100644 --- a/man2/ioctl_fideduperange.2 +++ b/man2/ioctl_fideduperange.2 @@ -105,9 +105,12 @@ The field must be zero. During the call, .IR src_fd -must be open for reading and +must be open for reading. .IR dest_fd -must be open for writing. +can be open for writing, or reading. +If +.IR dest_fd +is open for reading, the user must have write access to the file. The combined size of the struct .IR file_dedupe_range and the struct @@ -185,8 +188,8 @@ This can appear if the filesystem does not support deduplicating either file descriptor, or if either file descriptor refers to special inodes. .TP .B EPERM -.IR dest_fd -is immutable. +This will be returned if the user lacks permission to dedupe the file referenced by +.IR dest_fd . .TP .B ETXTBSY
Re: [RESEND][PATCH v4 0/2] vfs: better dedupe permission check
On Tue, Jul 17, 2018 at 12:48:18PM -0700, Darrick J. Wong wrote: > On Tue, Jul 17, 2018 at 12:09:04PM -0700, Mark Fasheh wrote: > > From: Mark Fasheh > > > > [PATCH] ioctl_fideduperange.2: clarify permission requirements > > > > dedupe permission checks were recently relaxed - update our man page to > > reflect those changes. > > > > Signed-off-by: Mark Fasheh > > --- > > man2/ioctl_fideduperange.2 | 8 +--- > > Mmm, man page update, thank you for editing the documentation too! No problem, thanks for suggesting I do it. > Please cc linux-api and Michael Kerrisk so this can go upstream. Will do. The changes you point out below all look good to me so I'll incorporate them and send an updated version to the appropriate folks. Thanks --Mark -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RESEND][PATCH v4 0/2] vfs: better dedupe permission check
On Tue, Jul 17, 2018 at 12:09:04PM -0700, Mark Fasheh wrote: > Hi Al, > > The following patches fix a couple of issues with the permission check > we do in vfs_dedupe_file_range(). I sent them out for a few times now, > a changelog is attached. If they look ok to you, I'd appreciate them > being pushed upstream. > > You can get them from git if you like: > > git pull https://github.com/markfasheh/linux dedupe-perms > > I also have a set of patches against 4.17 if you prefer. The code and > testing are identical: > > git pull https://github.com/markfasheh/linux dedupe-perms-v4.17 > > > The first patch expands our check to allow dedupe of a file if the > user owns it or otherwise would be allowed to write to it. > > Current behavior is that we'll allow dedupe only if: > > - the user is an admin (root) > - the user has the file open for write > > This makes it impossible for a user to dedupe their own file set > unless they do it as root, or ensure that all files have write > permission. There's a couple of duperemove bugs open for this: > > https://github.com/markfasheh/duperemove/issues/129 > https://github.com/markfasheh/duperemove/issues/86 > > The other problem we have is also related to forcing the user to open > target files for write - A process trying to exec a file currently > being deduped gets ETXTBUSY. The answer (as above) is to allow them to > open the targets ro - root can already do this. There was a patch from > Adam Borowski to fix this back in 2016: > > https://lkml.org/lkml/2016/7/17/130 > > which I have incorporated into my changes. > > > The 2nd patch fixes our return code for permission denied to be > EPERM. For some reason we're returning EINVAL - I think that's > probably my fault. At any rate, we need to be returning something > descriptive of the actual problem, otherwise callers see EINVAL and > can't really make a valid determination of what's gone wrong. > > This has also popped up in duperemove, mostly in the form of cryptic > error messages. Because this is a code returned to userspace, I did > check the other users of extent-same that I could find. Both 'bees' > and 'rust-btrfs' do the same as duperemove and simply report the error > (as they should). > > Please apply. > > Thanks, > --Mark > > Changes from V3 to V4: > - Add a patch (below) to ioctl_fideduperange.2 explaining our > changes. I will send this patch once the kernel update is > accepted. Thanks to Darrick Wong for this suggestion. > - V3 discussion: https://www.spinics.net/lists/linux-btrfs/msg79135.html > > Changes from V2 to V3: > - Return bool from allow_file_dedupe > - V2 discussion: https://www.spinics.net/lists/linux-btrfs/msg78421.html > > Changes from V1 to V2: > - Add inode_permission check as suggested by Adam Borowski > - V1 discussion: https://marc.info/?l=linux-xfs=152606684017965=2 > > > From: Mark Fasheh > > [PATCH] ioctl_fideduperange.2: clarify permission requirements > > dedupe permission checks were recently relaxed - update our man page to > reflect those changes. > > Signed-off-by: Mark Fasheh > --- > man2/ioctl_fideduperange.2 | 8 +--- Mmm, man page update, thank you for editing the documentation too! Please cc linux-api and Michael Kerrisk so this can go upstream. > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/man2/ioctl_fideduperange.2 b/man2/ioctl_fideduperange.2 > index 84d20a276..7dea0323d 100644 > --- a/man2/ioctl_fideduperange.2 > +++ b/man2/ioctl_fideduperange.2 > @@ -105,9 +105,11 @@ The field > must be zero. > During the call, > .IR src_fd > -must be open for reading and > +must be open for reading. > .IR dest_fd > -must be open for writing. > +can be open for writing, or reading. If Manpages usually start each new sentence on its own line (though I defer to mkerrisk on that). > +.IR dest_fd > +is open for reading, the user should be have write access to the file. "...the user must have write access..." > The combined size of the struct > .IR file_dedupe_range > and the struct > @@ -185,8 +187,8 @@ This can appear if the filesystem does not support > deduplicating either file > descriptor, or if either file descriptor refers to special inodes. > .TP > .B EPERM > +This will be returned if the user lacks permission to dedupe the file > referenced by > .IR dest_fd > -is immutable. (Did the period fall off the end of the sentence here? I am bad at reading manpage markup...) --D > .TP > .B ETXTBSY > One of the files is a swap file. > -- > 2.15.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 1/2] vfs: allow dedupe of user owned read-only files
The permission check in vfs_dedupe_file_range() is too coarse - We only allow dedupe of the destination file if the user is root, or they have the file open for write. This effectively limits a non-root user from deduping their own read-only files. In addition, the write file descriptor that the user is forced to hold open can prevent execution of files. As file data during a dedupe does not change, the behavior is unexpected and this has caused a number of issue reports. For an example, see: https://github.com/markfasheh/duperemove/issues/129 So change the check so we allow dedupe on the target if: - the root or admin is asking for it - the process has write access - the owner of the file is asking for the dedupe - the process could get write access That way users can open read-only and still get dedupe. Signed-off-by: Mark Fasheh Acked-by: Darrick J. Wong --- fs/read_write.c | 17 +++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index e83bd9744b5d..71e9077f8bc1 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1964,6 +1964,20 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, } EXPORT_SYMBOL(vfs_dedupe_file_range_compare); +/* Check whether we are allowed to dedupe the destination file */ +static bool allow_file_dedupe(struct file *file) +{ + if (capable(CAP_SYS_ADMIN)) + return true; + if (file->f_mode & FMODE_WRITE) + return true; + if (uid_eq(current_fsuid(), file_inode(file)->i_uid)) + return true; + if (!inode_permission(file_inode(file), MAY_WRITE)) + return true; + return false; +} + int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) { struct file_dedupe_range_info *info; @@ -1972,7 +1986,6 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) u64 len; int i; int ret; - bool is_admin = capable(CAP_SYS_ADMIN); u16 count = same->dest_count; struct file *dst_file; loff_t dst_off; @@ -2036,7 +2049,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) if (info->reserved) { info->status = -EINVAL; - } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) { + } else if (!allow_file_dedupe(dst_file)) { info->status = -EINVAL; } else if (file->f_path.mnt != dst_file->f_path.mnt) { info->status = -EXDEV; -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 2/2] vfs: dedupe should return EPERM if permission is not granted
Right now we return EINVAL if a process does not have permission to dedupe a file. This was an oversight on my part. EPERM gives a true description of the nature of our error, and EINVAL is already used for the case that the filesystem does not support dedupe. Signed-off-by: Mark Fasheh Reviewed-by: Darrick J. Wong Acked-by: David Sterba --- fs/read_write.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/read_write.c b/fs/read_write.c index 71e9077f8bc1..7188982e2733 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -2050,7 +2050,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) if (info->reserved) { info->status = -EINVAL; } else if (!allow_file_dedupe(dst_file)) { - info->status = -EINVAL; + info->status = -EPERM; } else if (file->f_path.mnt != dst_file->f_path.mnt) { info->status = -EXDEV; } else if (S_ISDIR(dst->i_mode)) { -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RESEND][PATCH v4 0/2] vfs: better dedupe permission check
Hi Al, The following patches fix a couple of issues with the permission check we do in vfs_dedupe_file_range(). I sent them out for a few times now, a changelog is attached. If they look ok to you, I'd appreciate them being pushed upstream. You can get them from git if you like: git pull https://github.com/markfasheh/linux dedupe-perms I also have a set of patches against 4.17 if you prefer. The code and testing are identical: git pull https://github.com/markfasheh/linux dedupe-perms-v4.17 The first patch expands our check to allow dedupe of a file if the user owns it or otherwise would be allowed to write to it. Current behavior is that we'll allow dedupe only if: - the user is an admin (root) - the user has the file open for write This makes it impossible for a user to dedupe their own file set unless they do it as root, or ensure that all files have write permission. There's a couple of duperemove bugs open for this: https://github.com/markfasheh/duperemove/issues/129 https://github.com/markfasheh/duperemove/issues/86 The other problem we have is also related to forcing the user to open target files for write - A process trying to exec a file currently being deduped gets ETXTBUSY. The answer (as above) is to allow them to open the targets ro - root can already do this. There was a patch from Adam Borowski to fix this back in 2016: https://lkml.org/lkml/2016/7/17/130 which I have incorporated into my changes. The 2nd patch fixes our return code for permission denied to be EPERM. For some reason we're returning EINVAL - I think that's probably my fault. At any rate, we need to be returning something descriptive of the actual problem, otherwise callers see EINVAL and can't really make a valid determination of what's gone wrong. This has also popped up in duperemove, mostly in the form of cryptic error messages. Because this is a code returned to userspace, I did check the other users of extent-same that I could find. Both 'bees' and 'rust-btrfs' do the same as duperemove and simply report the error (as they should). Please apply. Thanks, --Mark Changes from V3 to V4: - Add a patch (below) to ioctl_fideduperange.2 explaining our changes. I will send this patch once the kernel update is accepted. Thanks to Darrick Wong for this suggestion. - V3 discussion: https://www.spinics.net/lists/linux-btrfs/msg79135.html Changes from V2 to V3: - Return bool from allow_file_dedupe - V2 discussion: https://www.spinics.net/lists/linux-btrfs/msg78421.html Changes from V1 to V2: - Add inode_permission check as suggested by Adam Borowski - V1 discussion: https://marc.info/?l=linux-xfs=152606684017965=2 From: Mark Fasheh [PATCH] ioctl_fideduperange.2: clarify permission requirements dedupe permission checks were recently relaxed - update our man page to reflect those changes. Signed-off-by: Mark Fasheh --- man2/ioctl_fideduperange.2 | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/man2/ioctl_fideduperange.2 b/man2/ioctl_fideduperange.2 index 84d20a276..7dea0323d 100644 --- a/man2/ioctl_fideduperange.2 +++ b/man2/ioctl_fideduperange.2 @@ -105,9 +105,11 @@ The field must be zero. During the call, .IR src_fd -must be open for reading and +must be open for reading. .IR dest_fd -must be open for writing. +can be open for writing, or reading. If +.IR dest_fd +is open for reading, the user should be have write access to the file. The combined size of the struct .IR file_dedupe_range and the struct @@ -185,8 +187,8 @@ This can appear if the filesystem does not support deduplicating either file descriptor, or if either file descriptor refers to special inodes. .TP .B EPERM +This will be returned if the user lacks permission to dedupe the file referenced by .IR dest_fd -is immutable. .TP .B ETXTBSY One of the files is a swap file. -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Healthy amount of free space?
Nikolay Borisov - 17.07.18, 10:16: > On 17.07.2018 11:02, Martin Steigerwald wrote: > > Nikolay Borisov - 17.07.18, 09:20: > >> On 16.07.2018 23:58, Wolf wrote: > >>> Greetings, > >>> I would like to ask what what is healthy amount of free space to > >>> keep on each device for btrfs to be happy? > >>> > >>> This is how my disk array currently looks like > >>> > >>> [root@dennas ~]# btrfs fi usage /raid > >>> > >>> Overall: > >>> Device size: 29.11TiB > >>> Device allocated: 21.26TiB > >>> Device unallocated:7.85TiB > >>> Device missing: 0.00B > >>> Used: 21.18TiB > >>> Free (estimated): 3.96TiB (min: 3.96TiB) > >>> Data ratio: 2.00 > >>> Metadata ratio: 2.00 > >>> Global reserve: 512.00MiB (used: 0.00B) > > > > […] > > > >>> Btrfs does quite good job of evenly using space on all devices. > >>> No, > >>> how low can I let that go? In other words, with how much space > >>> free/unallocated remaining space should I consider adding new > >>> disk? > >> > >> Btrfs will start running into problems when you run out of > >> unallocated space. So the best advice will be monitor your device > >> unallocated, once it gets really low - like 2-3 gb I will suggest > >> you run balance which will try to free up unallocated space by > >> rewriting data more compactly into sparsely populated block > >> groups. If after running balance you haven't really freed any > >> space then you should consider adding a new drive and running > >> balance to even out the spread of data/metadata. > > > > What are these issues exactly? > > For example if you have plenty of data space but your metadata is full > then you will be getting ENOSPC. Of that one I am aware. This just did not happen so far. I did not yet add it explicitly to the training slides, but I just make myself a note to do that. Anything else? > > I have > > > > % btrfs fi us -T /home > > > > Overall: > > Device size: 340.00GiB > > Device allocated:340.00GiB > > Device unallocated:2.00MiB > > Device missing: 0.00B > > Used:308.37GiB > > Free (estimated): 14.65GiB (min: 14.65GiB) > > Data ratio: 2.00 > > Metadata ratio: 2.00 > > Global reserve: 512.00MiB (used: 0.00B) > > > > Data Metadata System > > > > Id Path RAID1 RAID1RAID1Unallocated > > -- -- - --- > > > > 1 /dev/mapper/msata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB > > 2 /dev/mapper/sata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB > > > > -- -- - --- > > > >Total 165.89GiB 4.08GiB 32.00MiB 2.00MiB > >Used 151.24GiB 2.95GiB 48.00KiB > > You already have only 33% of your metadata full so if your workload > turned out to actually be making more metadata-heavy changed i.e > snapshots you could exhaust this and get ENOSPC, despite having around > 14gb of free data space. Furthermore this data space is spread around > multiple data chunks, depending on how populated they are a balance > could be able to free up unallocated space which later could be > re-purposed for metadata (again, depending on what you are doing). The filesystem above IMO is not fit for snapshots. It would fill up rather quickly, I think even when I balance metadata. Actually I tried this and as I remember it took at most a day until it was full. If I read above figures currently at maximum I could gain one additional GiB by balancing metadata. That would not make a huge difference. I bet I am already running this filesystem beyond recommendation, as I bet many would argue it is to full already for regular usage… I do not see the benefit of squeezing the last free space out of it just to fit in another GiB. So I still do not get the point why it would make sense to balance it at this point in time. Especially as this 1 GiB I could regain is not even needed. And I do not see the point of balancing it weekly. I would regain about 1 GiB of metadata space every now and then, but the cost would be a lot of additional I/O to the SSD. They still take it very nicely so far, but I think, right now, there is simply no point in balancing, at least not regularly, unless… there would be an performance gain. Whenever I balanced a complete filesystem with data and metadata I however saw a cross drop in performance, like doubling the boot time for example (no scientific measurement, just my personal observation). I admit I did not do this for a long time and the balancing
task btrfs-transacti:921 blocked for more than 120 seconds during check repair
I got the following on 4.17.6 while running btrfs check --repair on an unmounted filesystem (not the lowmem version) I understand that btrfs check is userland only, although it seems that it caused these FS hangs on a different filesystem (the trace of course does not provide info on which FS) Any idea what happened here? I'm going to wait a few hours without running btrfs check to see if it happens again and then if running btrfs check will re-create this issue, but other suggestions (if any), are welcome: [ 2538.566952] Workqueue: btrfs-endio-write btrfs_endio_write_helper [ 2538.616484] Call Trace: [ 2538.623828] ? __schedule+0x53e/0x59b [ 2538.634802] schedule+0x7f/0x98 [ 2538.644214] wait_current_trans+0x9b/0xd8 [ 2538.656229] ? add_wait_queue+0x3a/0x3a [ 2538.668239] start_transaction+0x1ce/0x325 [ 2538.680556] btrfs_finish_ordered_io+0x240/0x5d3 [ 2538.694414] normal_work_helper+0x118/0x277 [ 2538.706984] process_one_work+0x19c/0x281 [ 2538.719036] ? rescuer_thread+0x279/0x279 [ 2538.731064] worker_thread+0x197/0x246 [ 2538.742322] kthread+0xeb/0xf0 [ 2538.751492] ? kthread_create_worker_on_cpu+0x66/0x66 [ 2538.76] ret_from_fork+0x35/0x40 [ 2538.777403] INFO: task kworker/u16:11:369 blocked for more than 120 seconds. [ 2538.799025] Not tainted 4.17.6-amd64-preempt-sysrq-20180818 #4 [ 2538.818109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 2538.841640] kworker/u16:11 D0 369 2 0x8000 [ 2538.858112] Workqueue: btrfs-endio-write btrfs_endio_write_helper [ 2538.876401] Call Trace: [ 2538.883770] ? __schedule+0x53e/0x59b [ 2538.894760] schedule+0x7f/0x98 [ 2538.904192] wait_current_trans+0x9b/0xd8 [ 2538.916242] ? add_wait_queue+0x3a/0x3a [ 2538.927772] start_transaction+0x1ce/0x325 [ 2538.940081] btrfs_finish_ordered_io+0x240/0x5d3 [ 2538.953973] normal_work_helper+0x118/0x277 [ 2538.966523] process_one_work+0x19c/0x281 [ 2538.978546] ? rescuer_thread+0x279/0x279 [ 2538.990560] worker_thread+0x197/0x246 [ 2539.001797] kthread+0xeb/0xf0 [ 2539.010986] ? kthread_create_worker_on_cpu+0x66/0x66 [ 2539.026137] ret_from_fork+0x35/0x40 [ 2539.037666] INFO: task btrfs-transacti:921 blocked for more than 120 seconds. [ 2539.059851] Not tainted 4.17.6-amd64-preempt-sysrq-20180818 #4 [ 2539.079733] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 2539.104007] btrfs-transacti D0 921 2 0x8000 [ 2539.121257] Call Trace: [ 2539.129377] ? __schedule+0x53e/0x59b [ 2539.141171] schedule+0x7f/0x98 [ 2539.151370] btrfs_tree_lock+0xa6/0x19d [ 2539.163621] ? add_wait_queue+0x3a/0x3a [ 2539.175876] btrfs_search_slot+0x5aa/0x756 [ 2539.188899] lookup_inline_extent_backref+0x11a/0x485 [ 2539.204781] ? fixup_slab_list.isra.43+0x1b/0x72 [ 2539.219360] __btrfs_free_extent+0xf1/0xa72 [ 2539.232597] ? btrfs_merge_delayed_refs+0x18b/0x1a7 [ 2539.247922] ? __mutex_trylock_or_owner+0x43/0x54 [ 2539.262708] __btrfs_run_delayed_refs+0xad8/0xc40 [ 2539.277504] btrfs_run_delayed_refs+0x6e/0x16a [ 2539.291519] btrfs_commit_transaction+0x42/0x710 [ 2539.306043] ? start_transaction+0x295/0x325 [ 2539.319516] transaction_kthread+0xc9/0x135 [ 2539.332757] ? btrfs_cleanup_transaction+0x3ee/0x3ee [ 2539.348327] kthread+0xeb/0xf0 [ 2539.358155] ? kthread_create_worker_on_cpu+0x66/0x66 [ 2539.373977] ret_from_fork+0x35/0x40 [ 2539.385394] INFO: task vnstatd:6338 blocked for more than 120 seconds. [ 2539.405667] Not tainted 4.17.6-amd64-preempt-sysrq-20180818 #4 Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] btrfs: add support for 3-copy replication (raid1c3)
On Fri, Jul 13, 2018 at 11:02:03PM +0200, Goffredo Baroncelli wrote: > As general comment, good to hear that something is moving around raid5/6 + > write hole and multiple mirroring. > However I am guessing if this is time to simplify the RAID code. There are a > lot of "if" which could be avoided using > the values stored in the array "btrfs_raid_array[]". I absolutely agree and had the same impression during implementing the feature. For this patchset I did only a minimal prep work, the suggestions you give below make sense to me. Enhancing the table would make a lot of code go away and just use one formula to calculate the results that are now opencoded. I'll be going through the raid code so I'll get to the cleanups eventually. > Below some comments: > > @@ -5075,6 +5093,8 @@ static inline int btrfs_chunk_max_errors(struct > > map_lookup *map) > > BTRFS_BLOCK_GROUP_RAID5 | > > BTRFS_BLOCK_GROUP_DUP)) { > > max_errors = 1; > > + } else if (map->type & BTRFS_BLOCK_GROUP_RAID1C3) { > > + max_errors = 2; > > } else if (map->type & BTRFS_BLOCK_GROUP_RAID6) { > > max_errors = 2; > > } else { > > Even in this case the ifs above could be replaced with something like: > > index = btrfs_bg_flags_to_raid_index(map->type) > max_errors = btrfs_raid_array[index].ncopies-1; There's .tolerated_failures that should equal ncopies - 1 in general, but does not for DUP so the semantics of the function and caller needs to be verified. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: rename btrfs_parse_early_options
On Mon, Jul 16, 2018 at 11:01:37PM +0800, Anand Jain wrote: > On 07/16/2018 10:18 PM, Anand Jain wrote: > > Rename btrfs_parse_early_options() to btrfs_parse_device_options(). As > > btrfs_parse_early_options() parses the -o device options and scan the > > device provided. So this rename specifies its action. Also the function > > name is inline with btrfs_parse_subvol_options(). Yeah, much more fitting name. > > No functional changes. > > > > Signed-off-by: Anand Jain > > --- > > Forgot to mention: This should be applied on top of >Gu Jinxiang (2): >btrfs: make fs_devices to be a local variable >btrfs: get device pointer from btrfs_scan_one_device Both are in misc-next now so your patch applied too, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: qgroup: cleanup the unused srcroot from btrfs_qgroup_inherit
On Tue, Jul 17, 2018 at 04:58:22PM +0800, Lu Fengqi wrote: > Since commit 0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info > convenience variables"), the srcroot is no longer used to get fs_fino. > In fact, it can be dropped after commit 707e8a071528 ("btrfs: use nodesize > everywhere, kill leafsize"). > > Signed-off-by: Lu Fengqi Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 2/2] btrfs: get device pointer from btrfs_scan_one_device
On Tue, Jul 17, 2018 at 12:54:00PM +0800, Anand Jain wrote: > > :: > > >>> @@ -1561,12 +1564,15 @@ static struct dentry *btrfs_mount_root(struct > >>> file_system_type *fs_type, > >>> goto error_fs_info; > >>> } > >>> > >>> - error = btrfs_scan_one_device(device_name, mode, fs_type, _devices); > >>> - if (error) { > >>> + device = btrfs_scan_one_device(device_name, mode, fs_type); > >>> + if (IS_ERR(device)) { > >>> + error = PTR_ERR(device); > >>> mutex_unlock(_mutex); > >>> goto error_fs_info; > >>> } > >>> > >>> + fs_devices = device->fs_devices; > >>> + > > The version at misc-next is missing this assignment and it panics. Thanks, that was my mismerge. Now added back and pushed. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 2/2] btrfs: get device pointer from btrfs_scan_one_device
On Tue, Jul 17, 2018 at 01:08:34AM +, Gu, Jinxiang wrote: > > > - device = device_list_add(path, disk_super, _device_added); > > > - if (IS_ERR(device)) { > > > - ret = PTR_ERR(device); > > > - } else { > > > - *fs_devices_ret = device->fs_devices; > > > - if (new_device_added) > > > - btrfs_free_stale_devices(path, device); > > > - } > > > + ret = device_list_add(path, disk_super, _device_added); > > > + if (new_device_added) > > > + btrfs_free_stale_devices(path, ret); > > > > Why is this skipping the error handling? I've resolved the conflict in a > > different way in misc-next, have you had a look there? > > Since when new_device_added set to be true, the return value of > device_list_add > must not a error value. Yeah, but this is not obvious from the context and looks like " device_list_add fails but we'll do something anyway", so I'm for keeping the check. > So, I think no necessary to add the judge. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] btrfs: Verify every chunk has corresponding block group at mount time
On 2018年07月17日 20:33, David Sterba wrote: > On Mon, Jul 16, 2018 at 09:57:43PM +0800, Qu Wenruo wrote: > -EUCLEAN ? Either works for me. >>> >>> That's not just a cosmetic change, there's a semantic difference between >>> the error codes, I maybe make that more explicit and not expect that this >>> is obvious. >>> >>> ENOENT does not make much sense in this context, the caller (mount in >>> this case) cannot do anything about a code that says 'some internal >>> structure not found'. >> >> The point here is, if every self-checker should only return -EUCLEAN, it >> won't really indicate what's going wrong, except points to some >> self-checker (and such self-checkers are growing larger than our >> expectation already). >> >> My practice here is, put some human readable and meaningful error >> message. No matter what we choose to return, the error message should >> tell us what's going wrong. >> >> In this case, I don't really care the return value. If it's explicitly >> needed to return -EUCLEAN, I could make all existing checker (from >> tree-checker to chunk/bg/dev-extent checker) to return -EUCLEAN if >> anything is wrong (and save several "ret = -EUCLEAN" lines). >> The return value doesn't really have much meaning nowadays, it's the >> error message important now. > > Ok, I see what you mean. The message is important as it's otherwise > almost impossible to find where exactly the mount fails. > > The error messages perhaps fall into several categories: > > 1) transient errors, some failure that happens before the filesystem state >is fully examined > > this is namely ENOMEM, or EINTR eg. returned by kthread_run This standard is a little misleading, or did I misunderstand your category? From the example error number, I could only find ENOMEM so straightforward for end user/developer that we don't need any error message to explain them. Or this category is just for error no need of error message? (or can be handled by btrfs-progs without any need of user interruption/decision?) > > maybe also a failure on a multi-device filesystem when the devices > haven't been scanned yet > > 2) clearly some corruption/consistency condtion, with enough information >available to decide > > like a missing tree, most of the tree-checker would fall into this > category This is pretty clear. > > 3) same as the previous one, but there's some external condition preventing >a full check > > that's eg. a real EIO after reading a tree block That csum mismatch EIO with error message or really some error from underlying layer like some ATA command failure? > > > The error code are IMO important to see how severe the problems are and > what's the expected solution. 2 is for 'check', 3 may need degraded > mount, 1 needs maybe more time to mount again. Category 2 for check is sure. For other 2 cases it's a little hard to say. Normally if we really hit some error we don't expect, under most case the filesystem is already corrupted (e.g. a lot of errors of resuming balance/mount failure finally turns out to be fs corruption). If category is determined by the expected solution, most will just fall into category 2), including most of errors we have in btrfs module currently. > > With the error messages in place, 2 can be completely covered by > EUCLEAN. I briefly skimmed a few call paths and think that the 3 > categories should be enough, but I'm also expecting some exceptions that > can be decided case by case. For category 2), I think it's pretty clear and practically to use EUCLEAN. For other categories I'm not really sure. E.G what happens if we can't find certain backref when running delayed refs? It's either a kernel bug or a corrupted fs. Which category should it fit? Category 2? But we don't really know what's going wrong. For category 1/3? It won't really be fixed until we fix the bug or the fs. More details examples would definitely help me understand the category. > > The error codes are now not consistent, lots of EUCLEAN are historically > EIO, but before we start cleaning that up we should have at least some > guidelines. Please let me know what you think. > At least for self-verification code it's pretty clear that we should have error message for what's going wrong and what we expect, with explicit EUCLEAN error number. Thanks, Qu signature.asc Description: OpenPGP digital signature
Re: [PATCH 5/5] btrfs: Verify every chunk has corresponding block group at mount time
On Mon, Jul 16, 2018 at 09:57:43PM +0800, Qu Wenruo wrote: > >>> -EUCLEAN ? > >> > >> Either works for me. > > > > That's not just a cosmetic change, there's a semantic difference between > > the error codes, I maybe make that more explicit and not expect that this > > is obvious. > > > > ENOENT does not make much sense in this context, the caller (mount in > > this case) cannot do anything about a code that says 'some internal > > structure not found'. > > The point here is, if every self-checker should only return -EUCLEAN, it > won't really indicate what's going wrong, except points to some > self-checker (and such self-checkers are growing larger than our > expectation already). > > My practice here is, put some human readable and meaningful error > message. No matter what we choose to return, the error message should > tell us what's going wrong. > > In this case, I don't really care the return value. If it's explicitly > needed to return -EUCLEAN, I could make all existing checker (from > tree-checker to chunk/bg/dev-extent checker) to return -EUCLEAN if > anything is wrong (and save several "ret = -EUCLEAN" lines). > The return value doesn't really have much meaning nowadays, it's the > error message important now. Ok, I see what you mean. The message is important as it's otherwise almost impossible to find where exactly the mount fails. The error messages perhaps fall into several categories: 1) transient errors, some failure that happens before the filesystem state is fully examined this is namely ENOMEM, or EINTR eg. returned by kthread_run maybe also a failure on a multi-device filesystem when the devices haven't been scanned yet 2) clearly some corruption/consistency condtion, with enough information available to decide like a missing tree, most of the tree-checker would fall into this category 3) same as the previous one, but thre's some extenal condition preventing a full check that's eg. a real EIO after reading a tree block The error code are IMO important to see how severe the problems are and what's the expected solution. 2 is for 'check', 3 may need degraded mount, 1 needs maybe more time to mount again. With the error messages in place, 2 can be completely covered by EUCLEAN. I briefly skimmed a few call paths and think that the 3 categories should be enough, but I'm also expecting some exceptions that can be decided case by case. The error codes are now not consistent, lots of EUCLEAN are historically EIO, but before we start cleaning that up we should have at least some guidelines. Please let me know what you think. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs-progs: check: enhanced progress indicator
On Tue, Jul 17, 2018 at 10:23:47AM +0900, Misono Tomohiro wrote: > On 2018/07/05 4:20, Stéphane Lesimple wrote: > > @@ -9806,16 +9839,26 @@ int cmd_check(int argc, char **argv) > > error("errors found in csum tree"); > > err |= !!ret; > > > > - fprintf(stderr, "checking root refs\n"); > > /* For low memory mode, check_fs_roots_v2 handles root refs */ > > - if (check_mode != CHECK_MODE_LOWMEM) { > > +if (check_mode != CHECK_MODE_LOWMEM) { > > + if (!ctx.progress_enabled) > > + fprintf(stderr, "[6/7] checking root refs\n"); > > + else { > > + ctx.tp = TASK_ROOT_REFS; > > + task_start(ctx.info, _time, _count); > > + } > > + > > ret = check_root_refs(root, _cache); > > + task_stop(ctx.info); > > err |= !!ret; > > if (ret) { > > error("errors found in root refs"); > > goto out; > > } > > } > > + else { > > + fprintf(stderr, "[6/7] checking root refs done with fs roots in > > lowmem mode, skipping\n"); > > + } > > > > while (repair && !list_empty(>fs_info->recow_ebs)) { > > struct extent_buffer *eb; > > @@ -9844,8 +9887,15 @@ int cmd_check(int argc, char **argv) > > } > > > > if (info->quota_enabled) { > > - fprintf(stderr, "checking quota groups\n"); > > + if (!ctx.progress_enabled) > > + fprintf(stderr, "[7/7] checking quota groups\n"); > > > qgroup_set_item_count_ptr(_count) is needed here too. Otherwise > quota check causes segfault without -p option. Thanks, fixed as --- a/check/main.c +++ b/check/main.c @@ -9969,12 +9969,12 @@ int cmd_check(int argc, char **argv) } if (info->quota_enabled) { + qgroup_set_item_count_ptr(_count); if (!ctx.progress_enabled) { fprintf(stderr, "[7/7] checking quota groups\n"); } else { ctx.tp = TASK_QGROUPS; task_start(ctx.info, _time, _count); - qgroup_set_item_count_ptr(_count); } ret = qgroup_verify_all(info); task_stop(ctx.info); -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC v2 1/2] btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block()
On Wed, Jul 11, 2018 at 01:41:21PM +0800, Qu Wenruo wrote: > In commit ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device > replace") we removed the branch of copy_nocow_pages() to avoid > corruption for compressed nodatasum extents. > > However above commit only solves the problem in scrub_extent(), if > during scrub_pages() we failed to read some pages, > sctx->no_io_error_seen will be non-zero and we go to fixup function > scrub_handle_errored_block(). > > In scrub_handle_errored_block(), for sctx without csum (no matter if > we're doing replace or scrub) we go to scrub_fixup_nodatasum() routine, > which does the similar thing with copy_nocow_pages(), but does it > without the extra check in copy_nocow_pages() routine. > > So for test cases like btrfs/100, where we emulate read errors during > replace/scrub, we could corrupt compressed extent data again. > > This patch will fix it just by avoiding any "optimization" for > nodatasum, just falls back to the normal fixup routine by try read from > any good copy. > > This also solves WARN_ON() or dead lock caused by lame backref iteration > in scrub_fixup_nodatasum() routine. > > The deadlock or WARN_ON() won't be triggered before commit ac0b4145d662 > ("btrfs: scrub: Don't use inode pages for device replace") since > copy_nocow_pages() have better locking and extra check for data extent, > and it's already doing the fixup work by try to read data from any good > copy, so it won't go scrub_fixup_nodatasum() anyway. > > Fixes: ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device replace") > Signed-off-by: Qu Wenruo Thanks, I'll forward this to 4.18, there are a few more regression fixes queued. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Healthy amount of free space?
On 2018-07-16 16:58, Wolf wrote: Greetings, I would like to ask what what is healthy amount of free space to keep on each device for btrfs to be happy? This is how my disk array currently looks like [root@dennas ~]# btrfs fi usage /raid Overall: Device size: 29.11TiB Device allocated: 21.26TiB Device unallocated:7.85TiB Device missing: 0.00B Used: 21.18TiB Free (estimated): 3.96TiB (min: 3.96TiB) Data ratio: 2.00 Metadata ratio: 2.00 Global reserve: 512.00MiB (used: 0.00B) Data,RAID1: Size:10.61TiB, Used:10.58TiB /dev/mapper/data1 1.75TiB /dev/mapper/data2 1.75TiB /dev/mapper/data3 856.00GiB /dev/mapper/data4 856.00GiB /dev/mapper/data5 1.75TiB /dev/mapper/data6 1.75TiB /dev/mapper/data7 6.29TiB /dev/mapper/data8 6.29TiB Metadata,RAID1: Size:15.00GiB, Used:13.00GiB /dev/mapper/data1 2.00GiB /dev/mapper/data2 3.00GiB /dev/mapper/data3 1.00GiB /dev/mapper/data4 1.00GiB /dev/mapper/data5 3.00GiB /dev/mapper/data6 1.00GiB /dev/mapper/data7 9.00GiB /dev/mapper/data8 10.00GiB Slightly OT, but the distribution of metadata chunks across devices looks a bit sub-optimal here. If you can tolerate the volume being somewhat slower for a while, I'd suggest balancing these (it should get you better performance long-term). System,RAID1: Size:64.00MiB, Used:1.50MiB /dev/mapper/data2 32.00MiB /dev/mapper/data6 32.00MiB /dev/mapper/data7 32.00MiB /dev/mapper/data8 32.00MiB Unallocated: /dev/mapper/data11004.52GiB /dev/mapper/data21004.49GiB /dev/mapper/data31006.01GiB /dev/mapper/data41006.01GiB /dev/mapper/data51004.52GiB /dev/mapper/data61004.49GiB /dev/mapper/data71005.00GiB /dev/mapper/data81005.00GiB Btrfs does quite good job of evenly using space on all devices. No, how low can I let that go? In other words, with how much space free/unallocated remaining space should I consider adding new disk? Disclaimer: What I'm about to say is based on personal experience. YMMV. It depends on how you use the filesystem. Realistically, there are a couple of things I consider when trying to decide on this myself: * How quickly does the total usage increase on average, and how much can it be expected to increase in one day in the worst case scenario? This isn't really BTRFS specific, but it's worth mentioning. I usually don't let an array get close enough to full that it wouldn't be able to safely handle at least one day of the worst case increase and another 2 of average increases. In BTRFS terms, the 'safely handle' part means you should be adding about 5GB for a multi-TB array like you have, or about 1GB for a sub-TB array. * What are the typical write patterns? Do files get rewritten in-place, or are they only ever rewritten with a replace-by-rename? Are writes mostly random, or mostly sequential? Are writes mostly small or mostly large? The more towards the first possibility listed in each of those question (in-place rewrites, random access, and small writes), the more free space you should keep on the volume. * Does this volume see heavy usage of fallocate() either to preallocate space (note that this _DOES NOT WORK SANELY_ on BTRFS), or to punch holes or remove ranges from files. If whatever software you're using does this a lot on this volume, you want even more free space. * Do old files tend to get removed in large batches? That is, possibly hundreds or thousands of files at a time. If so, and you're running a reasonably recent (4.x series) kernel or regularly balance the volume to clean up empty chunks, you don't need quite as much free space. * How quickly can you get a new device added, and is it critical that this volume always be writable? Sounds stupid, but a lot of people don't consider this. If you can trivially get a new device added immediately, you can generally let things go a bit further than you would normally, same for if the volume being read-only can be tolerated for a while without significant issues. It's worth noting that I explicitly do not care about snapshot usage. It rarely has much impact on this other than changing how the total usage increases in a day. Evaluating all of this is of course something I can't really do for you. If I had to guess, with no other information that the allocations shown, I'd say that you're probably generically fine until you get down to about 5GB more than twice the average
Re: [PATCH RESEND 2/2] btrfs-progs: make all programs and libraries optional
On Mon, Jul 16, 2018 at 10:44:52AM -0700, Omar Sandoval wrote: > On Mon, Jul 16, 2018 at 04:56:57PM +0200, David Sterba wrote: > > On Thu, Jul 12, 2018 at 04:11:19PM -0700, Omar Sandoval wrote: > > > From: Omar Sandoval > > > > > > We have a build system internally which only needs to build the > > > libraries out of a repository, not any binaries. I looked at how this > > > works with other projects, and the best example was util-linux, which > > > makes it possible to enable or disable everything individually. This is > > > nice and really flexible, so let's do the same. This way, if you only > > > want to build and install libbtrfsutil, you can simply do > > > > > > ./configure --disable-documentation --disable-all-programs > > > --enable-libbtrfsutil > > > make > > > make install > > > > I think this is an overkill and abusing the --enable-XXX options. You > > want to avoid building the tools by default, so adding an option for > > that is fine. Selectively building only certain tools can utilize that > > option too and just follow with 'make btrfs-image' etc. > > Yeah, it's easy to build stuff selectively, but `make install` will > still try to build everything, that's the part I'm more concerned with. Oh right, installation. What if it installs just the binaries that are built? The default actions done by configure & make & make install would not change, but there could be configure --disable-all, then selectively make and final make install would be for p in $(progs_install); if test -f $p && $(INSTALL) ...; fi > > The number of --enable-* will stay minimal and we don't even have to > > discuss how to find a good naming scheme (that works for util-linux but > > looks a bit confusing for btrfs-progs). > > Ok, I can collapse these into just --disable-programs/--enable-programs, > and --disable-libraries/--enable-libraries? That would be enough for me. Sounds ok. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: qgroup: cleanup the unused srcroot from btrfs_qgroup_inherit
Since commit 0b246afa62b0 ("btrfs: root->fs_info cleanup, add fs_info convenience variables"), the srcroot is no longer used to get fs_fino. In fact, it can be dropped after commit 707e8a071528 ("btrfs: use nodesize everywhere, kill leafsize"). Signed-off-by: Lu Fengqi --- fs/btrfs/qgroup.c | 17 + 1 file changed, 1 insertion(+), 16 deletions(-) diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 52e05f031ee3..06e0e9b39340 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -2286,22 +2286,6 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, if (ret) goto out; - if (srcid) { - struct btrfs_root *srcroot; - struct btrfs_key srckey; - - srckey.objectid = srcid; - srckey.type = BTRFS_ROOT_ITEM_KEY; - srckey.offset = (u64)-1; - srcroot = btrfs_read_fs_root_no_name(fs_info, ); - if (IS_ERR(srcroot)) { - ret = PTR_ERR(srcroot); - goto out; - } - - level_size = fs_info->nodesize; - } - /* * add qgroup to all inherited groups */ @@ -2358,6 +2342,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, * our counts don't go crazy, so at this point the only * difference between the two roots should be the root node. */ + level_size = fs_info->nodesize; dstgroup->rfer = srcgroup->rfer; dstgroup->rfer_cmpr = srcgroup->rfer_cmpr; dstgroup->excl = level_size; -- 2.18.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: extent-tree: Check if the newly reserved tree block is already in use
On 2018年07月17日 16:28, Nikolay Borisov wrote: > > > On 17.07.2018 11:24, Qu Wenruo wrote: >> And it's causing problem for certain test cases. >> Please ignore this (at least for now). >> >> But on the other hand, we indeed have a lot of reports on corrupted >> extent tree, it's possible to hit some corrupted extent tree (Su is >> already exhausted by the corrupted tree reported by Marc) >> >> So I'm not completely fine with current extent tree error handling. >> I'll try to find some balance in next version. > > > I agree we need a better OVERALL error handling/detection. Your > tree-checker work IMO is a step in the right direction. What I want is > to prevent ad-hoc checks being sprinkled in the code. Yep, I also don't like what I'm doing. But the problem and the limit of tree-checker is, it's static check. For things doing tons of cross-check like extent tree, it's not really as useful. > Sorry, but that's > not fine. The thing with working on a lot of corruption reports is the > fact each one of them is looked at in isolation so it produces isolated > fixes. Whereas if a step back is taken and the overall error > handling/detection is considered it might turn out a whole class of > corruption could be detected by a single change, otherwise checks upon > checks will be added which just add technical debt. > > Considering this, I'm more in favor of extending the tree-checker to be > the central place where errors are detected (of course this is easier > said than done). For this report itself, tree checker can detect it indirectly by reject the leaf for the unknown key type. But one can easily create a valid image by just removing the valid METADATA_ITEM, and tree-checker can't do anything to detect the problem. So unfortunately, we will eventually need some runtime check anyway. Thanks, Qu > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: extent-tree: Check if the newly reserved tree block is already in use
On 2018年07月17日 16:01, Nikolay Borisov wrote: > > > On 17.07.2018 10:46, Qu Wenruo wrote: >> [BUG] >> For certain fuzzed btrfs image, if we create any csum data, it would >> cause the following kernel warning and deadlock when trying to update >> csum tree: >> -- >> [ 278.113360] WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 >> btrfs_tree_lock+0x3e2/0x400 >> [ 278.113737] CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8 >> [ 278.113745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> Ubuntu-1.8.2-1ubuntu1 04/01/2014 >> [ 278.113753] Workqueue: btrfs-endio-write btrfs_endio_write_helper >> [ 278.113761] RIP: 0010:btrfs_tree_lock+0x3e2/0x400 >> [ 278.113762] Code: 00 48 c7 40 08 00 00 00 00 48 8b 45 d0 65 48 33 04 25 >> 28 00 00 00 75 20 48 81 c4 a0 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> >> 0b e9 d4 fc ff ff 0f 0b e9 61 ff ff ff e8 ab f4 87 ff 90 66 2e >> [ 278.113818] RSP: 0018:8801f407f488 EFLAGS: 00010246 >> [ 278.113865] Call Trace: >> [ 278.113936] btrfs_alloc_tree_block+0x39f/0x770 >> [ 278.113988] __btrfs_cow_block+0x285/0x9e0 >> [ 278.114029] btrfs_cow_block+0x191/0x2e0 >> [ 278.114035] btrfs_search_slot+0x492/0x1160 >> [ 278.114146] btrfs_lookup_csum+0xec/0x280 >> [ 278.114182] btrfs_csum_file_blocks+0x2be/0xa60 >> [ 278.114232] add_pending_csums+0xaf/0xf0 >> [ 278.114238] btrfs_finish_ordered_io+0x74b/0xc90 >> [ 278.114281] finish_ordered_fn+0x15/0x20 >> [ 278.114285] normal_work_helper+0xf6/0x500 >> [ 278.114305] btrfs_endio_write_helper+0x12/0x20 >> [ 278.114310] process_one_work+0x302/0x770 >> [ 278.114315] worker_thread+0x81/0x6d0 >> [ 278.114321] kthread+0x180/0x1d0 >> [ 278.114334] ret_from_fork+0x35/0x40 >> [ 278.114339] ---[ end trace 2e85051acb5f6dc1 ]--- >> -- >> >> [CAUSE] >> The fuzzed image has corrupted EXTENT_ITEM for csum tree root: >> -- >> extent tree key (EXTENT_TREE ROOT_ITEM 0) >> item 4 key (29364224 METADATA_ITEM 0) itemoff 3857 itemsize 33 >> refs 1 gen 6 flags TREE_BLOCK >> tree block skinny level 0 >> tree block backref root UUID_TREE >> item 5 key (29376512 UNKNOWN.0 0) itemoff 3824 itemsize 33 >> Corrupted METADATA_ITEM >> item 6 key (29380608 METADATA_ITEM 0) itemoff 3791 itemsize 33 >> refs 1 gen 4 flags TREE_BLOCK >> tree block skinny level 0 >> tree block backref root DATA_RELOC_TREE >> >> checksum tree key (CSUM_TREE ROOT_ITEM 0) >> leaf 29376512 items 0 free space 3995 generation 4 owner CSUM_TREE >> bytenr matches above item. >> -- >> >> So when btrfs_alloc_tree_blocks() calls btrfs_reserve_extent(), since >> there is not METADATA_ITEM/EXTENT_ITEM for bytenr 29376512, btrfs thinks >> it's free space, and reserve it. >> >> However in fact it's already been used by csum tree, and later >> btrfs_init_new_buffer() will try to call btrfs_tree_lock(), whose >> WARN_ON() detects lock nest on the same extent buffer. >> >> Finally the wait_event() on the eb->read/write_lock_wq will never exit >> since we're holding the lock by ourselves and deadlock. >> >> [FIX] >> The fix here is to ensure at least the reserved extent buffer is not >> cached. >> Any used extent buffer should be cached in the global radix tree >> (fs_info->buffer_radix). >> >> So before calling btrfs_init_new_buffer() in btrfs_alloc_tree_block(), >> we call find_extent_buffer() explicitly to verify it's not used by >> ourselves. >> >> Please note this is just a basic check, it is not and will never be as >> good as btrfs check on detecting extent tree corruption, but at least we >> won't dead lock so easily. >> >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405 >> Reported-by: Xu Wen >> Signed-off-by: Qu Wenruo >> --- >> fs/btrfs/extent-tree.c | 14 ++ >> 1 file changed, 14 insertions(+) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index 3578fa5b30ef..782dd96b7c5e 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -8435,6 +8435,20 @@ struct extent_buffer *btrfs_alloc_tree_block(struct >> btrfs_trans_handle *trans, >> if (ret) >> goto out_unuse; >> >> +/* >> + * Newly allocated tree block should never be cached in radix tree, >> + * Or we have a corrupted extent tree. >> + */ >> +buf = find_extent_buffer(fs_info, ins.objectid); >> +if (buf) { >> +btrfs_err_rl(fs_info, >> +"tree block %llu is already in use, extent tree may be corrupted", >> + ins.objectid); >> +ret = -EUCLEAN; >> +free_extent_buffer(buf); >> +goto out_unuse; >> +} > > The code makes sense but I have the feeling it needs to have some sort > of assert guard because this check will likely trigger only on severly > corrupted filesystemd and yet we introduce it for everyone. And it's causing problem
Re: Healthy amount of free space?
On 17.07.2018 11:02, Martin Steigerwald wrote: > Hi Nikolay. > > Nikolay Borisov - 17.07.18, 09:20: >> On 16.07.2018 23:58, Wolf wrote: >>> Greetings, >>> I would like to ask what what is healthy amount of free space to >>> keep on each device for btrfs to be happy? >>> >>> This is how my disk array currently looks like >>> >>> [root@dennas ~]# btrfs fi usage /raid >>> >>> Overall: >>> Device size: 29.11TiB >>> Device allocated: 21.26TiB >>> Device unallocated:7.85TiB >>> Device missing: 0.00B >>> Used: 21.18TiB >>> Free (estimated): 3.96TiB (min: 3.96TiB) >>> Data ratio: 2.00 >>> Metadata ratio: 2.00 >>> Global reserve: 512.00MiB (used: 0.00B) > […] >>> Btrfs does quite good job of evenly using space on all devices. No, >>> how low can I let that go? In other words, with how much space >>> free/unallocated remaining space should I consider adding new disk? >> >> Btrfs will start running into problems when you run out of unallocated >> space. So the best advice will be monitor your device unallocated, >> once it gets really low - like 2-3 gb I will suggest you run balance >> which will try to free up unallocated space by rewriting data more >> compactly into sparsely populated block groups. If after running >> balance you haven't really freed any space then you should consider >> adding a new drive and running balance to even out the spread of >> data/metadata. > > What are these issues exactly? For example if you have plenty of data space but your metadata is full then you will be getting ENOSPC. > > I have > > % btrfs fi us -T /home > Overall: > Device size: 340.00GiB > Device allocated:340.00GiB > Device unallocated:2.00MiB > Device missing: 0.00B > Used:308.37GiB > Free (estimated): 14.65GiB (min: 14.65GiB) > Data ratio: 2.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 0.00B) > > Data Metadata System > Id Path RAID1 RAID1RAID1Unallocated > -- -- - --- > 1 /dev/mapper/msata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB > 2 /dev/mapper/sata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB > -- -- - --- >Total 165.89GiB 4.08GiB 32.00MiB 2.00MiB >Used 151.24GiB 2.95GiB 48.00KiB You already have only 33% of your metadata full so if your workload turned out to actually be making more metadata-heavy changed i.e snapshots you could exhaust this and get ENOSPC, despite having around 14gb of free data space. Furthermore this data space is spread around multiple data chunks, depending on how populated they are a balance could be able to free up unallocated space which later could be re-purposed for metadata (again, depending on what you are doing). > > on a RAID-1 filesystem one, part of the time two Plasma desktops + > KDEPIM and Akonadi + Baloo desktop search + you name it write to like > mad. > > > Thanks, > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: extent-tree: Check if the newly reserved tree block is already in use
On 07/17/2018 04:01 PM, Nikolay Borisov wrote: On 17.07.2018 10:46, Qu Wenruo wrote: [BUG] For certain fuzzed btrfs image, if we create any csum data, it would cause the following kernel warning and deadlock when trying to update csum tree: -- [ 278.113360] WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400 [ 278.113737] CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8 [ 278.113745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 278.113753] Workqueue: btrfs-endio-write btrfs_endio_write_helper [ 278.113761] RIP: 0010:btrfs_tree_lock+0x3e2/0x400 [ 278.113762] Code: 00 48 c7 40 08 00 00 00 00 48 8b 45 d0 65 48 33 04 25 28 00 00 00 75 20 48 81 c4 a0 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b e9 d4 fc ff ff 0f 0b e9 61 ff ff ff e8 ab f4 87 ff 90 66 2e [ 278.113818] RSP: 0018:8801f407f488 EFLAGS: 00010246 [ 278.113865] Call Trace: [ 278.113936] btrfs_alloc_tree_block+0x39f/0x770 [ 278.113988] __btrfs_cow_block+0x285/0x9e0 [ 278.114029] btrfs_cow_block+0x191/0x2e0 [ 278.114035] btrfs_search_slot+0x492/0x1160 [ 278.114146] btrfs_lookup_csum+0xec/0x280 [ 278.114182] btrfs_csum_file_blocks+0x2be/0xa60 [ 278.114232] add_pending_csums+0xaf/0xf0 [ 278.114238] btrfs_finish_ordered_io+0x74b/0xc90 [ 278.114281] finish_ordered_fn+0x15/0x20 [ 278.114285] normal_work_helper+0xf6/0x500 [ 278.114305] btrfs_endio_write_helper+0x12/0x20 [ 278.114310] process_one_work+0x302/0x770 [ 278.114315] worker_thread+0x81/0x6d0 [ 278.114321] kthread+0x180/0x1d0 [ 278.114334] ret_from_fork+0x35/0x40 [ 278.114339] ---[ end trace 2e85051acb5f6dc1 ]--- -- [CAUSE] The fuzzed image has corrupted EXTENT_ITEM for csum tree root: -- extent tree key (EXTENT_TREE ROOT_ITEM 0) item 4 key (29364224 METADATA_ITEM 0) itemoff 3857 itemsize 33 refs 1 gen 6 flags TREE_BLOCK tree block skinny level 0 tree block backref root UUID_TREE item 5 key (29376512 UNKNOWN.0 0) itemoff 3824 itemsize 33 Corrupted METADATA_ITEM item 6 key (29380608 METADATA_ITEM 0) itemoff 3791 itemsize 33 refs 1 gen 4 flags TREE_BLOCK tree block skinny level 0 tree block backref root DATA_RELOC_TREE checksum tree key (CSUM_TREE ROOT_ITEM 0) leaf 29376512 items 0 free space 3995 generation 4 owner CSUM_TREE bytenr matches above item. -- So when btrfs_alloc_tree_blocks() calls btrfs_reserve_extent(), since there is not METADATA_ITEM/EXTENT_ITEM for bytenr 29376512, btrfs thinks it's free space, and reserve it. However in fact it's already been used by csum tree, and later btrfs_init_new_buffer() will try to call btrfs_tree_lock(), whose WARN_ON() detects lock nest on the same extent buffer. Finally the wait_event() on the eb->read/write_lock_wq will never exit since we're holding the lock by ourselves and deadlock. [FIX] The fix here is to ensure at least the reserved extent buffer is not cached. Any used extent buffer should be cached in the global radix tree (fs_info->buffer_radix). So before calling btrfs_init_new_buffer() in btrfs_alloc_tree_block(), we call find_extent_buffer() explicitly to verify it's not used by ourselves. Please note this is just a basic check, it is not and will never be as good as btrfs check on detecting extent tree corruption, but at least we won't dead lock so easily. Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405 Reported-by: Xu Wen Signed-off-by: Qu Wenruo --- fs/btrfs/extent-tree.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3578fa5b30ef..782dd96b7c5e 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -8435,6 +8435,20 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans, if (ret) goto out_unuse; + /* +* Newly allocated tree block should never be cached in radix tree, +* Or we have a corrupted extent tree. +*/ + buf = find_extent_buffer(fs_info, ins.objectid); + if (buf) { + btrfs_err_rl(fs_info, + "tree block %llu is already in use, extent tree may be corrupted", +ins.objectid); + ret = -EUCLEAN; + free_extent_buffer(buf); + goto out_unuse; + } The code makes sense but I have the feeling it needs to have some sort of assert guard because this check will likely trigger only on severly corrupted filesystemd and yet we introduce it for everyone. Agreed, the function is called so frequently. + buf = btrfs_init_new_buffer(trans, root, ins.objectid, level); if (IS_ERR(buf)) { ret = PTR_ERR(buf); -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a
Re: Healthy amount of free space?
Hi Nikolay. Nikolay Borisov - 17.07.18, 09:20: > On 16.07.2018 23:58, Wolf wrote: > > Greetings, > > I would like to ask what what is healthy amount of free space to > > keep on each device for btrfs to be happy? > > > > This is how my disk array currently looks like > > > > [root@dennas ~]# btrfs fi usage /raid > > > > Overall: > > Device size: 29.11TiB > > Device allocated: 21.26TiB > > Device unallocated:7.85TiB > > Device missing: 0.00B > > Used: 21.18TiB > > Free (estimated): 3.96TiB (min: 3.96TiB) > > Data ratio: 2.00 > > Metadata ratio: 2.00 > > Global reserve: 512.00MiB (used: 0.00B) […] > > Btrfs does quite good job of evenly using space on all devices. No, > > how low can I let that go? In other words, with how much space > > free/unallocated remaining space should I consider adding new disk? > > Btrfs will start running into problems when you run out of unallocated > space. So the best advice will be monitor your device unallocated, > once it gets really low - like 2-3 gb I will suggest you run balance > which will try to free up unallocated space by rewriting data more > compactly into sparsely populated block groups. If after running > balance you haven't really freed any space then you should consider > adding a new drive and running balance to even out the spread of > data/metadata. What are these issues exactly? I have % btrfs fi us -T /home Overall: Device size: 340.00GiB Device allocated:340.00GiB Device unallocated:2.00MiB Device missing: 0.00B Used:308.37GiB Free (estimated): 14.65GiB (min: 14.65GiB) Data ratio: 2.00 Metadata ratio: 2.00 Global reserve: 512.00MiB (used: 0.00B) Data Metadata System Id Path RAID1 RAID1RAID1Unallocated -- -- - --- 1 /dev/mapper/msata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB 2 /dev/mapper/sata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB -- -- - --- Total 165.89GiB 4.08GiB 32.00MiB 2.00MiB Used 151.24GiB 2.95GiB 48.00KiB on a RAID-1 filesystem one, part of the time two Plasma desktops + KDEPIM and Akonadi + Baloo desktop search + you name it write to like mad. Since kernel 4.5 or 4.6 this simply works. Before that sometimes BTRFS crawled to an halt on searching for free blocks, and I had to switch off the laptop uncleanly. If that happened, a balance helped for a while. But since 4.5 or 4.6 this did not happen anymore. I found with SLES 12 SP 3 or so there is btrfsmaintenance running a balance weekly. Which created an issue on our Proxmox + Ceph on Intel NUC based opensource demo lab. This is for sure no recommended configuration for Ceph and Ceph is quite slow on these 2,5 inch harddisks and 1 GBit network link, despite albeit somewhat minimal, limited to 5 GiB m.2 SSD caching. What happened it that the VM crawled to a halt and the kernel gave task hung for more than 120 seconds messages. The VM was basically unusable during the balance. Sure that should not happen with a "proper" setup, also it also did not happen without the automatic balance. Also what would happen on a hypervisor setup with several thousands of VMs with BTRFS, when several 100 of them decide to start the balance at a similar time? It could probably bring the I/O system below to an halt, as many enterprise storage systems are designed to sustain burst I/O loads, but not maximum utilization during an extended period of time. I am really wondering what to recommend in my Linux performance tuning and analysis courses. On my own laptop I do not do regular balances so far. Due to my thinking: If it is not broken, do not fix it. My personal opinion here also is: If the filesystem degrades that much that it becomes unusable without regular maintenance from user space, the filesystem needs to be fixed. Ideally I would not have to worry on whether to regularly balance an BTRFS or not. In other words: I should not have to visit a performance analysis and tuning course in order to use a computer with BTRFS filesystem. Thanks, -- Martin -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: extent-tree: Check if the newly reserved tree block is already in use
On 17.07.2018 10:46, Qu Wenruo wrote: > [BUG] > For certain fuzzed btrfs image, if we create any csum data, it would > cause the following kernel warning and deadlock when trying to update > csum tree: > -- > [ 278.113360] WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 > btrfs_tree_lock+0x3e2/0x400 > [ 278.113737] CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8 > [ 278.113745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > Ubuntu-1.8.2-1ubuntu1 04/01/2014 > [ 278.113753] Workqueue: btrfs-endio-write btrfs_endio_write_helper > [ 278.113761] RIP: 0010:btrfs_tree_lock+0x3e2/0x400 > [ 278.113762] Code: 00 48 c7 40 08 00 00 00 00 48 8b 45 d0 65 48 33 04 25 28 > 00 00 00 75 20 48 81 c4 a0 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b > e9 d4 fc ff ff 0f 0b e9 61 ff ff ff e8 ab f4 87 ff 90 66 2e > [ 278.113818] RSP: 0018:8801f407f488 EFLAGS: 00010246 > [ 278.113865] Call Trace: > [ 278.113936] btrfs_alloc_tree_block+0x39f/0x770 > [ 278.113988] __btrfs_cow_block+0x285/0x9e0 > [ 278.114029] btrfs_cow_block+0x191/0x2e0 > [ 278.114035] btrfs_search_slot+0x492/0x1160 > [ 278.114146] btrfs_lookup_csum+0xec/0x280 > [ 278.114182] btrfs_csum_file_blocks+0x2be/0xa60 > [ 278.114232] add_pending_csums+0xaf/0xf0 > [ 278.114238] btrfs_finish_ordered_io+0x74b/0xc90 > [ 278.114281] finish_ordered_fn+0x15/0x20 > [ 278.114285] normal_work_helper+0xf6/0x500 > [ 278.114305] btrfs_endio_write_helper+0x12/0x20 > [ 278.114310] process_one_work+0x302/0x770 > [ 278.114315] worker_thread+0x81/0x6d0 > [ 278.114321] kthread+0x180/0x1d0 > [ 278.114334] ret_from_fork+0x35/0x40 > [ 278.114339] ---[ end trace 2e85051acb5f6dc1 ]--- > -- > > [CAUSE] > The fuzzed image has corrupted EXTENT_ITEM for csum tree root: > -- > extent tree key (EXTENT_TREE ROOT_ITEM 0) > item 4 key (29364224 METADATA_ITEM 0) itemoff 3857 itemsize 33 > refs 1 gen 6 flags TREE_BLOCK > tree block skinny level 0 > tree block backref root UUID_TREE > item 5 key (29376512 UNKNOWN.0 0) itemoff 3824 itemsize 33 > Corrupted METADATA_ITEM > item 6 key (29380608 METADATA_ITEM 0) itemoff 3791 itemsize 33 > refs 1 gen 4 flags TREE_BLOCK > tree block skinny level 0 > tree block backref root DATA_RELOC_TREE > > checksum tree key (CSUM_TREE ROOT_ITEM 0) > leaf 29376512 items 0 free space 3995 generation 4 owner CSUM_TREE > bytenr matches above item. > -- > > So when btrfs_alloc_tree_blocks() calls btrfs_reserve_extent(), since > there is not METADATA_ITEM/EXTENT_ITEM for bytenr 29376512, btrfs thinks > it's free space, and reserve it. > > However in fact it's already been used by csum tree, and later > btrfs_init_new_buffer() will try to call btrfs_tree_lock(), whose > WARN_ON() detects lock nest on the same extent buffer. > > Finally the wait_event() on the eb->read/write_lock_wq will never exit > since we're holding the lock by ourselves and deadlock. > > [FIX] > The fix here is to ensure at least the reserved extent buffer is not > cached. > Any used extent buffer should be cached in the global radix tree > (fs_info->buffer_radix). > > So before calling btrfs_init_new_buffer() in btrfs_alloc_tree_block(), > we call find_extent_buffer() explicitly to verify it's not used by > ourselves. > > Please note this is just a basic check, it is not and will never be as > good as btrfs check on detecting extent tree corruption, but at least we > won't dead lock so easily. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405 > Reported-by: Xu Wen > Signed-off-by: Qu Wenruo > --- > fs/btrfs/extent-tree.c | 14 ++ > 1 file changed, 14 insertions(+) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 3578fa5b30ef..782dd96b7c5e 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -8435,6 +8435,20 @@ struct extent_buffer *btrfs_alloc_tree_block(struct > btrfs_trans_handle *trans, > if (ret) > goto out_unuse; > > + /* > + * Newly allocated tree block should never be cached in radix tree, > + * Or we have a corrupted extent tree. > + */ > + buf = find_extent_buffer(fs_info, ins.objectid); > + if (buf) { > + btrfs_err_rl(fs_info, > + "tree block %llu is already in use, extent tree may be corrupted", > + ins.objectid); > + ret = -EUCLEAN; > + free_extent_buffer(buf); > + goto out_unuse; > + } The code makes sense but I have the feeling it needs to have some sort of assert guard because this check will likely trigger only on severly corrupted filesystemd and yet we introduce it for everyone. > + > buf = btrfs_init_new_buffer(trans, root, ins.objectid, level); > if (IS_ERR(buf)) { > ret = PTR_ERR(buf); > -- To unsubscribe
Re: Corrupted FS with "open_ctree failed" and "failed to recover balance: -5"
Thanks for the answer. On 16/07/18 10:32, Qu Wenruo wrote: > > > On 2018年07月16日 16:15, Udo Waechter wrote: >>> The weird thing is that I can't really find information about the >>> "failed to recover balance: -5" error. - There was no rebalancing >>> running when during the crash. > > Can only be determined by tree dump. > > # btrfs ins dump-tree -t root This gives me: btrfs-progs v4.13.3 parent transid verify failed on 321265147904 wanted 3276017 found 3273915 parent transid verify failed on 321265147904 wanted 3276017 found 3273915 parent transid verify failed on 321265147904 wanted 3276017 found 3263707 parent transid verify failed on 321265147904 wanted 3276017 found 3273915 Ignoring transid failure leaf parent key incorrect 321265147904 ERROR: unable to open /dev/vg00/var_ >>> * Unfortunatly I did a "btrfs rescue zero-log" at some point :( - As it >>> turns out that might have been a bad idea >>> >>> >>> * Also, a "btrfs check --init-extent-tree" - https://pastebin.com/jATDCFZy > > Then it is making things worse, fortunately it should terminate before > it causes more damage. > > I'm just curious why people doesn't try the safest "btrfs check" without > any options, but goes the most dangerous option. > > And "btrfs check" output please. > If possible, "btrfs check --mode=lowmem" is also good for debug. > Same thing here: parent transid verify failed on 321265147904 wanted 3276017 found 3273915 parent transid verify failed on 321265147904 wanted 3276017 found 3273915 parent transid verify failed on 321265147904 wanted 3276017 found 3263707 parent transid verify failed on 321265147904 wanted 3276017 found 3273915 Ignoring transid failure leaf parent key incorrect 321265147904 ERROR: cannot open file system I did an image with dd pretty early in this process. Unfortunatly this gives me the same error. Thanks, udo. > Thanks, > Qu > >>> >>> The volume contained qcow2 images for VMs. I need only one of those, >>> since one piece of important software decided to not do backups :( >>> >>> Any help is highly appreciated. >>> >>> Many thanks, >>> udo. >>> >> > signature.asc Description: OpenPGP digital signature
[PATCH] btrfs: extent-tree: Check if the newly reserved tree block is already in use
[BUG] For certain fuzzed btrfs image, if we create any csum data, it would cause the following kernel warning and deadlock when trying to update csum tree: -- [ 278.113360] WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400 [ 278.113737] CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8 [ 278.113745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 278.113753] Workqueue: btrfs-endio-write btrfs_endio_write_helper [ 278.113761] RIP: 0010:btrfs_tree_lock+0x3e2/0x400 [ 278.113762] Code: 00 48 c7 40 08 00 00 00 00 48 8b 45 d0 65 48 33 04 25 28 00 00 00 75 20 48 81 c4 a0 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b e9 d4 fc ff ff 0f 0b e9 61 ff ff ff e8 ab f4 87 ff 90 66 2e [ 278.113818] RSP: 0018:8801f407f488 EFLAGS: 00010246 [ 278.113865] Call Trace: [ 278.113936] btrfs_alloc_tree_block+0x39f/0x770 [ 278.113988] __btrfs_cow_block+0x285/0x9e0 [ 278.114029] btrfs_cow_block+0x191/0x2e0 [ 278.114035] btrfs_search_slot+0x492/0x1160 [ 278.114146] btrfs_lookup_csum+0xec/0x280 [ 278.114182] btrfs_csum_file_blocks+0x2be/0xa60 [ 278.114232] add_pending_csums+0xaf/0xf0 [ 278.114238] btrfs_finish_ordered_io+0x74b/0xc90 [ 278.114281] finish_ordered_fn+0x15/0x20 [ 278.114285] normal_work_helper+0xf6/0x500 [ 278.114305] btrfs_endio_write_helper+0x12/0x20 [ 278.114310] process_one_work+0x302/0x770 [ 278.114315] worker_thread+0x81/0x6d0 [ 278.114321] kthread+0x180/0x1d0 [ 278.114334] ret_from_fork+0x35/0x40 [ 278.114339] ---[ end trace 2e85051acb5f6dc1 ]--- -- [CAUSE] The fuzzed image has corrupted EXTENT_ITEM for csum tree root: -- extent tree key (EXTENT_TREE ROOT_ITEM 0) item 4 key (29364224 METADATA_ITEM 0) itemoff 3857 itemsize 33 refs 1 gen 6 flags TREE_BLOCK tree block skinny level 0 tree block backref root UUID_TREE item 5 key (29376512 UNKNOWN.0 0) itemoff 3824 itemsize 33 Corrupted METADATA_ITEM item 6 key (29380608 METADATA_ITEM 0) itemoff 3791 itemsize 33 refs 1 gen 4 flags TREE_BLOCK tree block skinny level 0 tree block backref root DATA_RELOC_TREE checksum tree key (CSUM_TREE ROOT_ITEM 0) leaf 29376512 items 0 free space 3995 generation 4 owner CSUM_TREE bytenr matches above item. -- So when btrfs_alloc_tree_blocks() calls btrfs_reserve_extent(), since there is not METADATA_ITEM/EXTENT_ITEM for bytenr 29376512, btrfs thinks it's free space, and reserve it. However in fact it's already been used by csum tree, and later btrfs_init_new_buffer() will try to call btrfs_tree_lock(), whose WARN_ON() detects lock nest on the same extent buffer. Finally the wait_event() on the eb->read/write_lock_wq will never exit since we're holding the lock by ourselves and deadlock. [FIX] The fix here is to ensure at least the reserved extent buffer is not cached. Any used extent buffer should be cached in the global radix tree (fs_info->buffer_radix). So before calling btrfs_init_new_buffer() in btrfs_alloc_tree_block(), we call find_extent_buffer() explicitly to verify it's not used by ourselves. Please note this is just a basic check, it is not and will never be as good as btrfs check on detecting extent tree corruption, but at least we won't dead lock so easily. Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405 Reported-by: Xu Wen Signed-off-by: Qu Wenruo --- fs/btrfs/extent-tree.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3578fa5b30ef..782dd96b7c5e 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -8435,6 +8435,20 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans, if (ret) goto out_unuse; + /* +* Newly allocated tree block should never be cached in radix tree, +* Or we have a corrupted extent tree. +*/ + buf = find_extent_buffer(fs_info, ins.objectid); + if (buf) { + btrfs_err_rl(fs_info, + "tree block %llu is already in use, extent tree may be corrupted", +ins.objectid); + ret = -EUCLEAN; + free_extent_buffer(buf); + goto out_unuse; + } + buf = btrfs_init_new_buffer(trans, root, ins.objectid, level); if (IS_ERR(buf)) { ret = PTR_ERR(buf); -- 2.18.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Healthy amount of free space?
On 16.07.2018 23:58, Wolf wrote: > Greetings, > I would like to ask what what is healthy amount of free space to keep on > each device for btrfs to be happy? > > This is how my disk array currently looks like > > [root@dennas ~]# btrfs fi usage /raid > Overall: > Device size: 29.11TiB > Device allocated: 21.26TiB > Device unallocated:7.85TiB > Device missing: 0.00B > Used: 21.18TiB > Free (estimated): 3.96TiB (min: 3.96TiB) > Data ratio: 2.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 0.00B) > > Data,RAID1: Size:10.61TiB, Used:10.58TiB >/dev/mapper/data1 1.75TiB >/dev/mapper/data2 1.75TiB >/dev/mapper/data3 856.00GiB >/dev/mapper/data4 856.00GiB >/dev/mapper/data5 1.75TiB >/dev/mapper/data6 1.75TiB >/dev/mapper/data7 6.29TiB >/dev/mapper/data8 6.29TiB > > Metadata,RAID1: Size:15.00GiB, Used:13.00GiB >/dev/mapper/data1 2.00GiB >/dev/mapper/data2 3.00GiB >/dev/mapper/data3 1.00GiB >/dev/mapper/data4 1.00GiB >/dev/mapper/data5 3.00GiB >/dev/mapper/data6 1.00GiB >/dev/mapper/data7 9.00GiB >/dev/mapper/data8 10.00GiB > > System,RAID1: Size:64.00MiB, Used:1.50MiB >/dev/mapper/data2 32.00MiB >/dev/mapper/data6 32.00MiB >/dev/mapper/data7 32.00MiB >/dev/mapper/data8 32.00MiB > > Unallocated: >/dev/mapper/data11004.52GiB >/dev/mapper/data21004.49GiB >/dev/mapper/data31006.01GiB >/dev/mapper/data41006.01GiB >/dev/mapper/data51004.52GiB >/dev/mapper/data61004.49GiB >/dev/mapper/data71005.00GiB >/dev/mapper/data81005.00GiB > > Btrfs does quite good job of evenly using space on all devices. No, how > low can I let that go? In other words, with how much space > free/unallocated remaining space should I consider adding new disk? Btrfs will start running into problems when you run out of unallocated space. So the best advice will be monitor your device unallocated, once it gets really low - like 2-3 gb I will suggest you run balance which will try to free up unallocated space by rewriting data more compactly into sparsely populated block groups. If after running balance you haven't really freed any space then you should consider adding a new drive and running balance to even out the spread of data/metadata. > > Thanks for advice :) > > W. > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] btrfs-progs: completion: Let dump-tree/dump-super/inode-resolve to accept any file
For dump-tree/dump-super the completion uses default filedir -d, which is far from convenient. Use filedir for dump-tree/dump-super/inode-resolve just like rootid. Signed-off-by: Qu Wenruo --- btrfs-completion | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/btrfs-completion b/btrfs-completion index 33830e827977..b22a951b3a65 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -133,7 +133,7 @@ _btrfs() _btrfs_mnts return 0 ;; - rootid) + dump-tree|dump-super|rootid|inode-resolve) _filedir return 0 ;; -- 2.18.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] btrfs-progs: completion: Use _filedir() to replace _btrfs_devs()
For developers it's pretty common to call "btrfs check" on a raw image dump other than real block device. So current _btrfs_devs() is really making things worse. Use _filedir() to replace _btrfs_devs() so it can complete any filenames, no matter if it's just a file or a real block device. Signed-off-by: Qu Wenruo --- btrfs-completion | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/btrfs-completion b/btrfs-completion index ae683f4ecf61..33830e827977 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -6,9 +6,7 @@ _btrfs_devs() { - local DEVS - DEVS=''; while read dev; do DEVS+="$dev "; done < <(lsblk -pnro name) - COMPREPLY+=( $( compgen -W "$DEVS" -- "$cur" ) ) + _filedir } _btrfs_mnts() -- 2.18.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] btrfs-progs: completion: Small fixes to make debug simpler
For developer, it's pretty common to use "btrfs check" or "btrfs ins dump-tree" on raw dumps. However "btrfs check" can only complete real block devices, and "btrfs inspect dump-tree" can only complete dir. Make them to use _filedir() so any filename can be completed and save us developer a little time and nerve hitting that holy tab. Qu Wenruo (2): btrfs-progs: completion: Use _filedir() to replace _btrfs_devs() btrfs-progs: completion: Let dump-tree/dump-super/inode-resolve to accept any file btrfs-completion | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) -- 2.18.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html