Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote: > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote: > > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote: > > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote: > > > > - is documenting rejection on request alignment grounds > > > > (i.e. EINVAL) in the man page sufficient for app > > > > developers to understand what is going on here? > > > > > > I think so. The manpage says: "The filesystem does not support > > > reflinking the ranges of the given files", which (to my mind) covers > > > this case of not supporting dedupe of EOF blocks. > > > > Older versions of btrfs dedupe (before v4.2 or so) used to do exactly > > this; however, on btrfs, not supporting dedupe of EOF blocks means small > > files (one extent) cannot be deduped at all, because the EOF block holds > > a reference to the entire dst extent. If a dedupe app doesn't go all the > > way to EOF on btrfs, then it should not attempt to dedupe any part of the > > last extent of the file as the benefit would be zero or slightly negative. > > That's a filesystem implementation issue, not an API or application > issue. The API and application issue remains even if btrfs is not considered. btrfs is just the worst case outcome. Other filesystems still have fragmentation issues, and applications have efficiency-vs-capability tradeoffs to make if they can't rely on dedupe-to-EOF being available. Tools like 'cp --reflink=auto' work by trying the best case, then falling back to a second choice if the first choice returns an error. If the second choice fails too, the surprising behavior can make inattentive users lose data. > > The app developer would need to be aware that such a restriction could > > exist on some filesystems, and be able to distinguish this from other > > cases that could lead to EINVAL. Portable code would have to try a dedupe > > up to EOF, then if that failed, round down and retry, and if that failed > > too, the app would have to figure out which filesystem it's running on > > to know what to do next. Performance demands the app know what the FS > > will do in advance, and avoid a whole class of behavior. > > Nobody writes "portable" applications like that. As an app developer, and having studied other applications' revision histories, and having followed IRC and mailing list conversations involving other developers writing these applications, I can assure you that is _exactly_ how portable applications get written around the dedupe function. Usually people start from experience with tools that use hardlinks to implement dedupe, so the developer's mental model starts with deduping entire files. Their first attempt does this: stat(fd, ); dedupe( ..., src_offset = 0, dst_offset = 0, length = st.st_size); then subsequent revisions of their code cope with limits on length, and then deal with EINVAL on odd lengths, because those are the problems that are encountered as the code runs for the first time on an expanding set of filesystems. After that, they deal with implementation-specific performance issues. Other app developers start by ignoring incomplete blocks, then compare their free-space-vs-time graphs with other dedupe apps on the same filesystem, then either adapt to handle EOF properly, or just accept being uncompetitive. > They read the man > page first, and work out what the common subset of functionality is > and then code from that. > Man page says: > > "Disk filesystems generally require the offset and length arguments > to be aligned to the fundamental block size." > IOWs, code compatible with starts with supporting the general case. > i.e. a range rounded to filesystem block boundaries (it's already > run fstat() on the files it wants to dedupe to find their size, > yes?), hence ignoring the partial EOF block. Will just work on > everything. Will cause a significant time/space performance hit too. EOFs are everywhere, and they have a higher-than-average duplication rate for their size. If an application assumes EOF can't be deduped on every filesystem, then it leaves a non-trivial amount of free space unrecovered on filesystems that can dedupe EOF. It also necessarily increases fragmentation unless the filesystem implements file tails (where it keeps fragmentation constant as the tail won't be stored contiguously in any case). > Code that then wants to optimise for btrfs/xfs/ocfs quirks runs > fstatvfs to determine what fs it's operating on and applies the > necessary quirks. For btrfs it can extend the range to include the > partial EOF block, and hence will handle the implementation quirks > btrfs has with single extent dedupe. > > Simple, reliable, and doesn't require any sort of flailing > about with offsets and lengths to avoid unexpected EINVAL errors. It would be better if the filesystem was given the EOF block by the application if it is the
Re: How to erase a RAID1 (+++)?
Chris Murphy posted on Thu, 30 Aug 2018 11:08:28 -0600 as excerpted: >> My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks >> in prder to start in degraded mode > > Good luck with this. The Btrfs archives are full of various limitations > of Btrfs raid1. There is no automatic degraded mount for Btrfs. And if > you persistently ask for degraded mount, you run the risk of other > problems if there's merely a delayed discovery of one of the devices. > Once a Btrfs volume is degraded, it does not automatically resume normal > operation just because the formerly missing device becomes available. > > So... this is flat out not suitable for use cases where you need > unattended raid1 degraded boot. Agreeing in general and adding some detail... 1) Are you intending to use an initr*? I'm not sure the current status (I actually need to test again for myself), but at least in the past, booting a btrfs raid1 rootfs required an initr*, and I have and use one here, for that purpose alone (until switching to btrfs raid1 root, I went initr*-less, and would prefer that again, due to the complications of maintaining an initr*). The base problem is that with raid1 (or other forms of multi-device btrfs, but it happens to be raid1 that's in question for both you and I) the filesystem needs multiple devices to complete the filesystem and the kernel's root= parameter takes only one. When mounting after userspace is up, a btrfs device scan is normally run (often automatically by udev) before the mount, that lets btrfs in the kernel track what devices belong to what filesystems, so pointing to just one of the devices is enough because the kernel knows from that what filesystem is intended and can match up the others that go with it from the earlier scan. Now there's a btrfs mount option, device=/dev/*, that can be provided more than once for additional devices, that can /normally/ be used to tell the kernel what specific devices to use, bypassing the need for btrfs device scan, and in /theory/, passing that like other mount options in the kernel commandline via rootflags= /should/ "just work". But for reasons I as a btrfs user (not dev, and definitely not kernel or btrfs dev) don't fully understand, passing device= via rootflags= is, or at least was, broken, so properly mounting a multi-device btrfs required (and may still require) userspace, thus for a multi-device btrfs rootfs, an initr*. So direct-booting to a multi-device btrfs rootfs didn't normally work. It would if you passed rootflags=degraded (at least with a two-device raid1 so the one device passed in root= contained one copy of everything), but then it was unclear if the additional device was successfully added to the raid1 later, or not. And with no automatic sync and bringing back to undegraded status, it was a risk I didn't want to take. So unfortunately, initr* it was! But I originally tested that when I setup my own btrfs raid1 rootfs very long ago in kernel and btrfs terms, kernel 3.6 or so IIRC, and while I've not /seen/ anything definitive on-list to suggest rootflags=device= is unbroken now (I asked recently and got an affirmative reply, but I asked for clarification and I've not seen it, tho perhaps it's there and I've not read it yet), perhaps I missed it. And I've not retested lately, tho I really should as while I asked I guess the only real way to know is to try it for myself, and it'd definitely be nice to be direct-booting without having to bother with an initr*, again. 2) As both Chris and I alluded to, unlike say mdraid, btrfs doesn't (yet) have an automatic mechanism to re-sync and "undegrade" after having been mounted degraded,rw. A btrfs scrub can be run to re-sync raid1 chunks, but single chunks may have been added while in the degraded state as well, and those need a balance convert to raid1 mode, before the filesystem and data on it can be be considered reliably able to withstand device loss once again. In fact, while the problem has been fixed now, for quite awhile if the filesystem was mounted degraded,rw, you often had exactly that one mount to fix the problem, as new chunks would be written in single mode and after that the filesystem would refuse to mount writable,degraded, and would only let you mount degraded,ro, which would let you get data off it but not let you fix the problem. Word to the wise if you're planning on running stable-debian (which tend to be older) kernels, or even just trying to use them for recovery if you need to! (The fix was to have a mount check if at least one copy of all chunks were available and allow rw mounting if so, instead of simply assuming that any single-mode chunks at all meant some wouldn't be available on a multi-device filesystem with a device missing, thus forcing read-only mode only mounting, as it used to do.) 3) If a btrfs raid1 is mounted degraded,rw with one device missing, then mounted again
[PATCH 3/3] btrfs: qgroup: Remove deprecated feature support in btrfs_qgorup_inhert()
Since btrfs_validate_inherit() will not allow features like copy rfer/excl and limit set, remove these dead code. Signed-off-by: Qu Wenruo --- fs/btrfs/qgroup.c | 57 +-- 1 file changed, 1 insertion(+), 56 deletions(-) diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 53daf73b0de9..b57577f45ff3 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -2301,8 +2301,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid, if (inherit) { i_qgroups = (u64 *)(inherit + 1); - nums = inherit->num_qgroups + 2 * inherit->num_ref_copies + - 2 * inherit->num_excl_copies; + nums = inherit->num_qgroups; for (i = 0; i < nums; ++i) { srcgroup = find_qgroup_rb(fs_info, *i_qgroups); @@ -2354,23 +2353,6 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid, goto unlock; } - if (inherit && inherit->flags & BTRFS_QGROUP_INHERIT_SET_LIMITS) { - dstgroup->lim_flags = inherit->lim.flags; - dstgroup->max_rfer = inherit->lim.max_rfer; - dstgroup->max_excl = inherit->lim.max_excl; - dstgroup->rsv_rfer = inherit->lim.rsv_rfer; - dstgroup->rsv_excl = inherit->lim.rsv_excl; - - ret = update_qgroup_limit_item(trans, dstgroup); - if (ret) { - fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT; - btrfs_info(fs_info, - "unable to update quota limit for %llu", - dstgroup->qgroupid); - goto unlock; - } - } - if (srcid) { srcgroup = find_qgroup_rb(fs_info, srcid); if (!srcgroup) @@ -2413,43 +2395,6 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid, ++i_qgroups; } - for (i = 0; i < inherit->num_ref_copies; ++i, i_qgroups += 2) { - struct btrfs_qgroup *src; - struct btrfs_qgroup *dst; - - if (!i_qgroups[0] || !i_qgroups[1]) - continue; - - src = find_qgroup_rb(fs_info, i_qgroups[0]); - dst = find_qgroup_rb(fs_info, i_qgroups[1]); - - if (!src || !dst) { - ret = -EINVAL; - goto unlock; - } - - dst->rfer = src->rfer - level_size; - dst->rfer_cmpr = src->rfer_cmpr - level_size; - } - for (i = 0; i < inherit->num_excl_copies; ++i, i_qgroups += 2) { - struct btrfs_qgroup *src; - struct btrfs_qgroup *dst; - - if (!i_qgroups[0] || !i_qgroups[1]) - continue; - - src = find_qgroup_rb(fs_info, i_qgroups[0]); - dst = find_qgroup_rb(fs_info, i_qgroups[1]); - - if (!src || !dst) { - ret = -EINVAL; - goto unlock; - } - - dst->excl = src->excl + level_size; - dst->excl_cmpr = src->excl_cmpr + level_size; - } - unlock: spin_unlock(_info->qgroup_lock); out: -- 2.18.0
[PATCH 2/3] btrfs: qgroup: Validate btrfs_qgroup_inherit structure before passing it to qgroup
btrfs_qgroup_inherit structure doesn't goes through much validation check. Now do a comprehensive check for it, including: 1) inherit size Should not exceeding SZ_4K and its num_qgroups should not exceed its size passed in btrfs_ioctl_vol_args_v2. 2) flags Should not include any unknown flags (In fact, no flag is supported at all now) Btrfs-progs never has such ability to set flags for btrfs_qgroup_inherit. 3) limit Should not contain anything. Btrfs-progs never has such ability to set limit for btrfs_qgroup_inherit. 4) rfer/excl copy Deprecated feature. Btrfs-progs has such interface but never documented and we're already going to remove such ability. It's the easiest way to screw up qgroup numbers. 3) Qgroupid Comprehensive check is already in btrfs_qgroup_inherit(), here we only check if there is any obviously invalid qgroupid (0). Coverity-id: 1021055 Reported-by: Nikolay Borisov Signed-off-by: Qu Wenruo --- fs/btrfs/ioctl.c | 3 +++ fs/btrfs/qgroup.c | 39 ++ fs/btrfs/qgroup.h | 2 ++ include/uapi/linux/btrfs.h | 17 - 4 files changed, 52 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 5db8680b40a9..4f5f453d5d07 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1820,6 +1820,9 @@ static noinline int btrfs_ioctl_snap_create_v2(struct file *file, ret = PTR_ERR(inherit); goto free_args; } + ret = btrfs_validate_inherit(inherit, vol_args->size); + if (ret < 0) + goto free_args; } ret = btrfs_ioctl_snap_create_transid(file, vol_args->name, diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 4353bb69bb86..53daf73b0de9 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -2232,6 +2232,45 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans) return ret; } +/* + * To make sure the inherit passed in is valid + * + * Here we only check flags and rule out some no-longer supported features. + * And we only do very basis qgroupid check to ensure there is no obviously + * invalid qgroupid (0). Detailed qgroupid check will be done in + * btrfs_qgroup_inherit(). + */ +int btrfs_validate_inherit(struct btrfs_qgroup_inherit *inherit, + u64 inherit_size) +{ + u64 i; + + if (inherit->flags & ~BTRFS_QGROUP_INHERIT_FLAGS_SUPP) + return -ENOTTY; + /* Qgroup rfer/excl copy is deprecated */ + if (inherit->num_excl_copies || inherit->num_ref_copies) + return -ENOTTY; + + /* Since SET_LIMITS is never used, @lim should all be zeroed */ + if (inherit->lim.max_excl || inherit->lim.max_rfer || + inherit->lim.rsv_excl || inherit->lim.rsv_rfer || + inherit->lim.flags) + return -ENOTTY; + + /* Size check */ + if (sizeof(u64) * inherit->num_qgroups + sizeof(*inherit) > + min_t(u64, BTRFS_QGROUP_INHERIT_MAX_SIZE, inherit_size)) + return -EINVAL; + + + /* Qgroup 0/0 is not allowed */ + for (i = 0; i < inherit->num_qgroups; i++) { + if (inherit->qgroups[i] == 0) + return -EINVAL; + } + return 0; +} + /* * Copy the accounting information between qgroups. This is necessary * when a snapshot or a subvolume is created. Throwing an error will diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h index 54b8bb282c0e..1bf9c584be70 100644 --- a/fs/btrfs/qgroup.h +++ b/fs/btrfs/qgroup.h @@ -241,6 +241,8 @@ int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr, struct ulist *new_roots); int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans); int btrfs_run_qgroups(struct btrfs_trans_handle *trans); +int btrfs_validate_inherit(struct btrfs_qgroup_inherit *inherit, + u64 inherit_size); int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid, u64 objectid, struct btrfs_qgroup_inherit *inherit); void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info, diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index 311edb65567c..5a5532a20019 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -74,21 +74,20 @@ struct btrfs_qgroup_limit { __u64 rsv_excl; }; -/* - * flags definition for qgroup inheritance - * - * Used by: - * struct btrfs_qgroup_inherit.flags - */ +/* flags definition for qgroup inheritance (DEPRECATED) */ #define BTRFS_QGROUP_INHERIT_SET_LIMITS(1ULL << 0) +/* No supported flags */ +#define BTRFS_QGROUP_INHERIT_FLAGS_SUPP (0) + #define BTRFS_QGROUP_INHERIT_MAX_SIZE (SZ_4K) + struct btrfs_qgroup_inherit { __u64 flags; __u64 num_qgroups; - __u64 num_ref_copies; - __u64
[PATCH 1/3] btrfs: Set qgroup inherit size limit to SZ_4K instead of page size
Change btrfs_qgroup_inherit maximum size from PAGE_SIZE to SZ_4K to make it consistent across different architectures. Although in theory this could lead to incompatibility, but considering how rare btrfs_qgroup_inherit is used, it's still not too late to change it without impacting a large user base. Signed-off-by: Qu Wenruo --- fs/btrfs/ioctl.c | 2 +- include/uapi/linux/btrfs.h | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 63600dc2ac4c..5db8680b40a9 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1811,7 +1811,7 @@ static noinline int btrfs_ioctl_snap_create_v2(struct file *file, if (vol_args->flags & BTRFS_SUBVOL_RDONLY) readonly = true; if (vol_args->flags & BTRFS_SUBVOL_QGROUP_INHERIT) { - if (vol_args->size > PAGE_SIZE) { + if (vol_args->size > BTRFS_QGROUP_INHERIT_MAX_SIZE) { ret = -EINVAL; goto free_args; } diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index 5ca1d21fc4a7..311edb65567c 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -82,6 +82,7 @@ struct btrfs_qgroup_limit { */ #define BTRFS_QGROUP_INHERIT_SET_LIMITS(1ULL << 0) +#define BTRFS_QGROUP_INHERIT_MAX_SIZE (SZ_4K) struct btrfs_qgroup_inherit { __u64 flags; __u64 num_qgroups; -- 2.18.0
[PATCH 0/3] btrfs: qgroup: Deprecate unused features for btrfs_qgroup_inherit()
This patchset can be fetched from github: https://github.com/adam900710/linux/tree/qgroup_inherit_check Which is based on v4.19-rc1 tag. This patchset will first set btrfs_qgroup_inherit structure size limit from PAGE_SIZE to fixed SZ_4K. I understand this normally will cause compatibility problem, but considering how minor this feature is and no sane guy should use it for over 100 qgroups, it should be fine in real world. The 2nd patch introduce check function for btrfs_qgroup_inherit structure and deprecates the following features: 1) limit set Never utilized by btrfs-progs from the beginning. 2) copy rfer/excl Although btrfs-progs provides support for it as a hidden, undocumented feature, it's the easiest way to screw up qgroup numbers. And we already have patches wondering around the ML to remove such support. The last one will just cleanup the code for unsupported features. Qu Wenruo (3): btrfs: Set qgroup inherit size limit to SZ_4K instead of page size btrfs: qgroup: Validate btrfs_qgroup_inherit structure before passing it to qgroup btrfs: qgroup: Remove deprecated feature support in btrfs_qgorup_inhert() fs/btrfs/ioctl.c | 5 +- fs/btrfs/qgroup.c | 96 -- fs/btrfs/qgroup.h | 2 + include/uapi/linux/btrfs.h | 18 +++ 4 files changed, 55 insertions(+), 66 deletions(-) -- 2.18.0
Re: fsck lowmem mode only: ERROR: errors found in fs roots
Thank for the report. On 08/31/2018 12:47 AM, Christoph Anton Mitterer wrote: Hey. I've the following on a btrfs that's basically the system fs for my notebook: When booting from a USB stick with: # uname -a Linux heisenberg 4.17.0-3-amd64 #1 SMP Debian 4.17.17-1 (2018-08-18) x86_64 GNU/Linux # btrfs --version btrfs-progs v4.17 ... a lowmem mode fsck gives no error: # btrfs check --mode=lowmem /dev/mapper/system ; echo $? Checking filesystem on /dev/mapper/system UUID: 6050ca10-e778-4d08-80e7-6d27b9c89b3c checking extents checking free space cache checking fs roots ERROR: errors found in fs roots found 495910952960 bytes used, error(s) found total csum bytes: 481840472 total tree bytes: 2388819968 total fs tree bytes: 1651097600 total extent tree bytes: 161841152 btree space waste bytes: 446707102 file data blocks allocated: 6651878428672 referenced 542320984064 1 ... while a normal mode fsck doesn't give one: # btrfs check /dev/mapper/system ; echo $? Checking filesystem on /dev/mapper/system UUID: 6050ca10-e778-4d08-80e7-6d27b9c89b3c checking extents checking free space cache checking fs roots checking only csum items (without verifying data) checking root refs found 495910952960 bytes used, no error found total csum bytes: 481840472 total tree bytes: 2388819968 total fs tree bytes: 1651097600 total extent tree bytes: 161841152 btree space waste bytes: 446707102 file data blocks allocated: 6651878428672 referenced 542320984064 0 There were no unusual kernel log messages. Humm, I think it's a bug of lowmem mode. After looking through releated codes, I can't tell the cause. Can you please fetch btrfs-progs from my repo and run lowmem check in readonly? Repo: https://github.com/Damenly/btrfs-progs/tree/lowmem_debug It's based on v4.17.1 plus additonal output for debug only. Back in the normal system (no longer USB)... I spottet this: Aug 30 18:31:29 heisenberg kernel: BTRFS info (device dm-0): the free space cache file (22570598400) is invalid, skip it but not sure whether it's related (probably not)... and I haven't seen such a free space cache file issue (or any other btrfs errors) in a long while (I usually watch my kernel log once after booting has finished). BTW, what's the mount option of USB? Any ideas? Perhaps it's just yet another lowmem false positive... anything I can help in debugging this? Apart from this I haven't noticed any corruptions recently... just about to make a full backup of the fs (or better said a rw snapshot of the most of the data) with tar, so most data will be read soon at least once... an I would probably notice any further errors that are otherwise silent. Don't worry, since normal check didn't report any error, it may be just a false alert. Su Cheers, Chris.
Re: How to erase a RAID1 (+++)?
And also, I'll argue this might have been a btrfs-progs bug as well, depending on what version was used and the command. Both mkfs and dev add should not be able to add type code 0x05. At least libblkid correctly shows that it's 1KiB in size, so really Btrfs should not succeed at adding this device, it can't put any of the supers in the correct location. [chris@f28h ~]$ sudo fdisk -l /dev/loop0 Disk /dev/loop0: 1 GiB, 1073741824 bytes, 2097152 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: dos Disk identifier: 0x7e255cce Device Boot StartEnd Sectors Size Id Type /dev/loop0p12048 206847 204800 100M 83 Linux /dev/loop0p2 206848 411647 204800 100M 83 Linux /dev/loop0p3 411648 616447 204800 100M 83 Linux /dev/loop0p4 616448 821247 204800 100M 5 Extended /dev/loop0p5 618496 821247 202752 99M 83 Linux [chris@f28h ~]$ sudo kpartx -a /dev/loop0 [chris@f28h ~]$ lsblk NAMEMAJ:MIN RM SIZE RO TYPE MOUNTPOINT loop0 7:00 1G 0 loop ├─loop0p1 253:10 100M 0 part ├─loop0p2 253:20 100M 0 part ├─loop0p3 253:30 100M 0 part ├─loop0p4 253:40 1K 0 part └─loop0p5 253:5099M 0 part [chris@f28h ~]$ sudo mkfs.btrfs /dev/loop0p4 btrfs-progs v4.17.1 See http://btrfs.wiki.kernel.org for more information. probe of /dev/loop0p4 failed, cannot detect existing filesystem. ERROR: use the -f option to force overwrite of /dev/loop0p4 [chris@f28h ~]$ sudo mkfs.btrfs /dev/loop0p4 -f btrfs-progs v4.17.1 See http://btrfs.wiki.kernel.org for more information. ERROR: mount check: cannot open /dev/loop0p4: No such file or directory ERROR: cannot check mount status of /dev/loop0p4: No such file or directory [chris@f28h ~]$ I guess that's a good sign in this case? Chris Murphy
[PATCH 34/35] btrfs: wait on ordered extents on abort cleanup
If we flip read-only before we initiate writeback on all dirty pages for ordered extents we've created then we'll have ordered extents left over on umount, which results in all sorts of bad things happening. Fix this by making sure we wait on ordered extents if we have to do the aborted transaction cleanup stuff. Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 8 1 file changed, 8 insertions(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 54fbdc944a3f..51b2a5bf25e5 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4188,6 +4188,14 @@ static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info) spin_lock(_info->ordered_root_lock); } spin_unlock(_info->ordered_root_lock); + + /* +* We need this here because if we've been flipped read-only we won't +* get sync() from the umount, so we need to make sure any ordered +* extents that haven't had their dirty pages IO start writeout yet +* actually get run and error out properly. +*/ + btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1); } static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, -- 2.14.3
[PATCH 11/35] btrfs: don't use global rsv for chunk allocation
We've done this forever because of the voodoo around knowing how much space we have. However we have better ways of doing this now, and on normal file systems we'll easily have a global reserve of 512MiB, and since metadata chunks are usually 1GiB that means we'll allocate metadata chunks more readily. Instead use the actual used amount when determining if we need to allocate a chunk or not. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 9 - 1 file changed, 9 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index df826f713034..783341e3653e 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4366,21 +4366,12 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global) static int should_alloc_chunk(struct btrfs_fs_info *fs_info, struct btrfs_space_info *sinfo, int force) { - struct btrfs_block_rsv *global_rsv = _info->global_block_rsv; u64 bytes_used = btrfs_space_info_used(sinfo, false); u64 thresh; if (force == CHUNK_ALLOC_FORCE) return 1; - /* -* We need to take into account the global rsv because for all intents -* and purposes it's used space. Don't worry about locking the -* global_rsv, it doesn't change except when the transaction commits. -*/ - if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA) - bytes_used += calc_global_rsv_need_space(global_rsv); - /* * in limited mode, we want to have some free space up to * about 1% of the FS size. -- 2.14.3
[PATCH 17/35] btrfs: move the dio_sem higher up the callchain
We're getting a lockdep splat because we take the dio_sem under the log_mutex. What we really need is to protect fsync() from logging an extent map for an extent we never waited on higher up, so just guard the whole thing with dio_sem. Signed-off-by: Josef Bacik --- fs/btrfs/file.c | 12 fs/btrfs/tree-log.c | 2 -- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 095f0bb86bb7..c07110edb9de 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2079,6 +2079,14 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) goto out; inode_lock(inode); + + /* +* We take the dio_sem here because the tree log stuff can race with +* lockless dio writes and get an extent map logged for an extent we +* never waited on. We need it this high up for lockdep reasons. +*/ + down_write(_I(inode)->dio_sem); + atomic_inc(>log_batch); /* @@ -2087,6 +2095,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) */ ret = btrfs_wait_ordered_range(inode, start, len); if (ret) { + up_write(_I(inode)->dio_sem); inode_unlock(inode); goto out; } @@ -2110,6 +2119,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) * checked called fsync. */ ret = filemap_check_wb_err(inode->i_mapping, file->f_wb_err); + up_write(_I(inode)->dio_sem); inode_unlock(inode); goto out; } @@ -2128,6 +2138,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) trans = btrfs_start_transaction(root, 0); if (IS_ERR(trans)) { ret = PTR_ERR(trans); + up_write(_I(inode)->dio_sem); inode_unlock(inode); goto out; } @@ -2149,6 +2160,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) * file again, but that will end up using the synchronization * inside btrfs_sync_log to keep things safe. */ + up_write(_I(inode)->dio_sem); inode_unlock(inode); /* diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 1650dc44a5e3..66b7e059b765 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -4374,7 +4374,6 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans, INIT_LIST_HEAD(); - down_write(>dio_sem); write_lock(>lock); test_gen = root->fs_info->last_trans_committed; logged_start = start; @@ -4440,7 +4439,6 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans, } WARN_ON(!list_empty()); write_unlock(>lock); - up_write(>dio_sem); btrfs_release_path(path); if (!ret) -- 2.14.3
[PATCH 31/35] btrfs: clear delayed_refs_rsv for dirty bg cleanup
We keep track of dirty bg's as a reservation in the delayed_refs_rsv, so when we abort and we cleanup those dirty bgs we need to drop their reservation so we don't have accounting issues and lots of scary messages on umount. Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index caaca8154a1a..54fbdc944a3f 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4412,6 +4412,7 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction *cur_trans, spin_unlock(_trans->dirty_bgs_lock); btrfs_put_block_group(cache); + btrfs_delayed_refs_rsv_release(fs_info, 1); spin_lock(_trans->dirty_bgs_lock); } spin_unlock(_trans->dirty_bgs_lock); -- 2.14.3
[PATCH 13/35] btrfs: reset max_extent_size properly
If we use up our block group before allocating a new one we'll easily get a max_extent_size that's set really really low, which will result in a lot of fragmentation. We need to make sure we're resetting the max_extent_size when we add a new chunk or add new space. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 22e1f9f55f4f..f4e7caf37d6c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4565,6 +4565,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags, goto out; } else { ret = 1; + space_info->max_extent_size = 0; } space_info->force_alloc = CHUNK_ALLOC_NO_FORCE; @@ -8064,11 +8065,17 @@ static int __btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info, if (pin) pin_down_extent(fs_info, cache, start, len, 1); else { + struct btrfs_space_info *space_info = cache->space_info; + if (btrfs_test_opt(fs_info, DISCARD)) ret = btrfs_discard_extent(fs_info, start, len, NULL, BTRFS_CLEAR_OP_DISCARD); btrfs_add_free_space(cache, start, len); btrfs_free_reserved_bytes(cache, len, delalloc); + + spin_lock(_info->lock); + space_info->max_extent_size = 0; + spin_unlock(_info->lock); trace_btrfs_reserved_extent_free(fs_info, start, len); } -- 2.14.3
[PATCH 18/35] btrfs: set max_extent_size properly
From: Josef Bacik We can't use entry->bytes if our entry is a bitmap entry, we need to use entry->max_extent_size in that case. Fix up all the logic to make this consistent. Signed-off-by: Josef Bacik --- fs/btrfs/free-space-cache.c | 29 +++-- 1 file changed, 19 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index db93a5f035a0..53521027dd78 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -1766,6 +1766,18 @@ static int search_bitmap(struct btrfs_free_space_ctl *ctl, return -1; } +static void set_max_extent_size(struct btrfs_free_space *entry, + u64 *max_extent_size) +{ + if (entry->bitmap) { + if (entry->max_extent_size > *max_extent_size) + *max_extent_size = entry->max_extent_size; + } else { + if (entry->bytes > *max_extent_size) + *max_extent_size = entry->bytes; + } +} + /* Cache the size of the max extent in bytes */ static struct btrfs_free_space * find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes, @@ -1787,8 +1799,7 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes, for (node = >offset_index; node; node = rb_next(node)) { entry = rb_entry(node, struct btrfs_free_space, offset_index); if (entry->bytes < *bytes) { - if (entry->bytes > *max_extent_size) - *max_extent_size = entry->bytes; + set_max_extent_size(entry, max_extent_size); continue; } @@ -1806,8 +1817,7 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes, } if (entry->bytes < *bytes + align_off) { - if (entry->bytes > *max_extent_size) - *max_extent_size = entry->bytes; + set_max_extent_size(entry, max_extent_size); continue; } @@ -1819,8 +1829,8 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes, *offset = tmp; *bytes = size; return entry; - } else if (size > *max_extent_size) { - *max_extent_size = size; + } else { + set_max_extent_size(entry, max_extent_size); } continue; } @@ -2680,8 +2690,7 @@ static u64 btrfs_alloc_from_bitmap(struct btrfs_block_group_cache *block_group, err = search_bitmap(ctl, entry, _start, _bytes, true); if (err) { - if (search_bytes > *max_extent_size) - *max_extent_size = search_bytes; + set_max_extent_size(entry, max_extent_size); return 0; } @@ -2718,8 +2727,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group, entry = rb_entry(node, struct btrfs_free_space, offset_index); while (1) { - if (entry->bytes < bytes && entry->bytes > *max_extent_size) - *max_extent_size = entry->bytes; + if (entry->bytes < bytes) + set_max_extent_size(entry, max_extent_size); if (entry->bytes < bytes || (!entry->bitmap && entry->offset < min_start)) { -- 2.14.3
[PATCH 29/35] btrfs: just delete pending bgs if we are aborted
We still need to do all of the accounting cleanup for pending block groups if we abort. So set the ret to trans->aborted so if we aborted the cleanup happens and everybody is happy. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 90f267f4dd0f..132a1157982c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -10333,7 +10333,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) struct btrfs_root *extent_root = fs_info->extent_root; struct btrfs_block_group_item item; struct btrfs_key key; - int ret = 0; + int ret = trans->aborted; bool can_flush_pending_bgs = trans->can_flush_pending_bgs; trans->can_flush_pending_bgs = false; -- 2.14.3
[PATCH 30/35] btrfs: cleanup pending bgs on transaction abort
We may abort the transaction during a commit and not have a chance to run the pending bgs stuff, which will leave block groups on our list and cause us accounting issues and leaked memory. Fix this by running the pending bgs when we cleanup a transaction. Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 89d14f135837..0f39a0d302d3 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -2273,6 +2273,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) btrfs_scrub_continue(fs_info); cleanup_transaction: btrfs_trans_release_metadata(trans); + btrfs_create_pending_block_groups(trans); btrfs_trans_release_chunk_metadata(trans); trans->block_rsv = NULL; btrfs_warn(fs_info, "Skipping commit of aborted transaction."); -- 2.14.3
[PATCH 12/35] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
With my change to no longer take into account the global reserve for metadata allocation chunks we have this side-effect for mixed block group fs'es where we are no longer allocating enough chunks for the data/metadata requirements. To deal with this add a ALLOC_CHUNK_FORCE step to the flushing state machine. This will only get used if we've already made a full loop through the flushing machinery and tried committing the transaction. If we have then we can try and force a chunk allocation since we likely need it to make progress. This resolves the issues I was seeing with the mixed bg tests in xfstests with my previous patch. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 3 ++- fs/btrfs/extent-tree.c | 7 ++- include/trace/events/btrfs.h | 1 + 3 files changed, 9 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 0a4e55703d48..791e287c2292 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2731,7 +2731,8 @@ enum btrfs_flush_state { FLUSH_DELALLOC = 5, FLUSH_DELALLOC_WAIT = 6, ALLOC_CHUNK = 7, - COMMIT_TRANS= 8, + ALLOC_CHUNK_FORCE = 8, + COMMIT_TRANS= 9, }; int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 783341e3653e..22e1f9f55f4f 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4907,6 +4907,7 @@ static void flush_space(struct btrfs_fs_info *fs_info, btrfs_end_transaction(trans); break; case ALLOC_CHUNK: + case ALLOC_CHUNK_FORCE: trans = btrfs_join_transaction(root); if (IS_ERR(trans)) { ret = PTR_ERR(trans); @@ -4914,7 +4915,9 @@ static void flush_space(struct btrfs_fs_info *fs_info, } ret = do_chunk_alloc(trans, btrfs_metadata_alloc_profile(fs_info), -CHUNK_ALLOC_NO_FORCE); +(state == ALLOC_CHUNK) ? +CHUNK_ALLOC_NO_FORCE : +CHUNK_ALLOC_FORCE); btrfs_end_transaction(trans); if (ret > 0 || ret == -ENOSPC) ret = 0; @@ -5060,6 +5063,8 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) } } spin_unlock(_info->lock); + if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles) + flush_state++; } while (flush_state <= COMMIT_TRANS); } diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index 7d205e50b09c..fdb23181b5b7 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush, { FLUSH_DELAYED_REFS_NR,"FLUSH_DELAYED_REFS_NR"}, \ { FLUSH_DELAYED_REFS, "FLUSH_ELAYED_REFS"}, \ { ALLOC_CHUNK, "ALLOC_CHUNK"}, \ + { ALLOC_CHUNK_FORCE,"ALLOC_CHUNK_FORCE"}, \ { COMMIT_TRANS, "COMMIT_TRANS"}) TRACE_EVENT(btrfs_flush_space, -- 2.14.3
[PATCH 27/35] btrfs: handle delayed ref head accounting cleanup in abort
We weren't doing any of the accounting cleanup when we aborted transactions. Fix this by making cleanup_ref_head_accounting global and calling it from the abort code, this fixes the issue where our accounting was all wrong after the fs aborts. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 5 + fs/btrfs/disk-io.c | 1 + fs/btrfs/extent-tree.c | 13 ++--- 3 files changed, 12 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 791e287c2292..67923b2030b8 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -35,6 +35,7 @@ struct btrfs_trans_handle; struct btrfs_transaction; struct btrfs_pending_snapshot; +struct btrfs_delayed_ref_root; extern struct kmem_cache *btrfs_trans_handle_cachep; extern struct kmem_cache *btrfs_bit_radix_cachep; extern struct kmem_cache *btrfs_path_cachep; @@ -2624,6 +2625,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, unsigned long count); int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info, unsigned long count, u64 transid, int wait); +void +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info, + struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head); int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 len); int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info, u64 bytenr, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 1d3f5731d616..caaca8154a1a 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4240,6 +4240,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, if (pin_bytes) btrfs_pin_extent(fs_info, head->bytenr, head->num_bytes, 1); + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head); btrfs_put_delayed_ref_head(head); cond_resched(); spin_lock(_refs->lock); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 32579221d900..031d2b11ddee 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2466,12 +2466,11 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans, return ret ? ret : 1; } -static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, - struct btrfs_delayed_ref_head *head) +void +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info, + struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head) { - struct btrfs_fs_info *fs_info = trans->fs_info; - struct btrfs_delayed_ref_root *delayed_refs = - >transaction->delayed_refs; int nr_items = 1; if (head->total_ref_mod < 0) { @@ -2549,7 +2548,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, } } - cleanup_ref_head_accounting(trans, head); + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head); trace_run_delayed_ref_head(fs_info, head, 0); btrfs_delayed_ref_unlock(head); @@ -7191,7 +7190,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (head->must_insert_reserved) ret = 1; - cleanup_ref_head_accounting(trans, head); + btrfs_cleanup_ref_head_accounting(trans->fs_info, delayed_refs, head); mutex_unlock(>mutex); btrfs_put_delayed_ref_head(head); return ret; -- 2.14.3
[PATCH 32/35] btrfs: only free reserved extent if we didn't insert it
When we insert the file extent once the ordered extent completes we free the reserved extent reservation as it'll have been migrated to the bytes_used counter. However if we error out after this step we'll still clear the reserved extent reservation, resulting in a negative accounting of the reserved bytes for the block group and space info. Fix this by only doing the free if we didn't successfully insert a file extent for this extent. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 10455d0aa71c..3391f6a9fc77 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2992,6 +2992,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) bool truncated = false; bool range_locked = false; bool clear_new_delalloc_bytes = false; + bool clear_reserved_extent = true; if (!test_bit(BTRFS_ORDERED_NOCOW, _extent->flags) && !test_bit(BTRFS_ORDERED_PREALLOC, _extent->flags) && @@ -3095,10 +3096,12 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) logical_len, logical_len, compress_type, 0, 0, BTRFS_FILE_EXTENT_REG); - if (!ret) + if (!ret) { + clear_reserved_extent = false; btrfs_release_delalloc_bytes(fs_info, ordered_extent->start, ordered_extent->disk_len); + } } unpin_extent_cache(_I(inode)->extent_tree, ordered_extent->file_offset, ordered_extent->len, @@ -3159,8 +3162,13 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) * wrong we need to return the space for this ordered extent * back to the allocator. We only free the extent in the * truncated case if we didn't write out the extent at all. +* +* If we made it past insert_reserved_file_extent before we +* errored out then we don't need to do this as the accounting +* has already been done. */ if ((ret || !logical_len) && + clear_reserved_extent && !test_bit(BTRFS_ORDERED_NOCOW, _extent->flags) && !test_bit(BTRFS_ORDERED_PREALLOC, _extent->flags)) btrfs_free_reserved_extent(fs_info, -- 2.14.3
[PATCH 28/35] btrfs: call btrfs_create_pending_block_groups unconditionally
The first thing we do is loop through the list, this if (!list_empty()) btrfs_create_pending_block_groups(); thing is just wasted space. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 3 +-- fs/btrfs/transaction.c | 6 ++ 2 files changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 031d2b11ddee..90f267f4dd0f 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2970,8 +2970,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, } if (run_all) { - if (!list_empty(>new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); spin_lock(_refs->lock); node = rb_first(_refs->href_root); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 2bb19e2ded5e..89d14f135837 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -839,8 +839,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans, btrfs_trans_release_metadata(trans); trans->block_rsv = NULL; - if (!list_empty(>new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); btrfs_trans_release_chunk_metadata(trans); @@ -1927,8 +1926,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) cur_trans->delayed_refs.flushing = 1; smp_wmb(); - if (!list_empty(>new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, _trans->flags)) { int run_it = 0; -- 2.14.3
[PATCH 35/35] MAINTAINERS: update my email address for btrfs
My work email is completely useless, switch it to my personal address so I get emails on a account I actually pay attention to. Signed-off-by: Josef Bacik --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 32fbc6f732d4..7723dc958e99 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3095,7 +3095,7 @@ F:drivers/gpio/gpio-bt8xx.c BTRFS FILE SYSTEM M: Chris Mason -M: Josef Bacik +M: Josef Bacik M: David Sterba L: linux-btrfs@vger.kernel.org W: http://btrfs.wiki.kernel.org/ -- 2.14.3
[PATCH 10/35] btrfs: fix truncate throttling
We have a bunch of magic to make sure we're throttling delayed refs when truncating a file. Now that we have a delayed refs rsv and a mechanism for refilling that reserve simply use that instead of all of this magic. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 78 1 file changed, 16 insertions(+), 62 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 212fa71317d6..10455d0aa71c 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4493,31 +4493,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry) return err; } -static int truncate_space_check(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - u64 bytes_deleted) -{ - struct btrfs_fs_info *fs_info = root->fs_info; - int ret; - - /* -* This is only used to apply pressure to the enospc system, we don't -* intend to use this reservation at all. -*/ - bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted); - bytes_deleted *= fs_info->nodesize; - ret = btrfs_block_rsv_add(root, _info->trans_block_rsv, - bytes_deleted, BTRFS_RESERVE_NO_FLUSH); - if (!ret) { - trace_btrfs_space_reservation(fs_info, "transaction", - trans->transid, - bytes_deleted, 1); - trans->bytes_reserved += bytes_deleted; - } - return ret; - -} - /* * Return this if we need to call truncate_block for the last bit of the * truncate. @@ -4562,7 +4537,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, u64 bytes_deleted = 0; bool be_nice = false; bool should_throttle = false; - bool should_end = false; BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY); @@ -4775,15 +4749,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, btrfs_abort_transaction(trans, ret); break; } - if (btrfs_should_throttle_delayed_refs(trans, fs_info)) - btrfs_async_run_delayed_refs(fs_info, - trans->delayed_ref_updates * 2, - trans->transid, 0); if (be_nice) { - if (truncate_space_check(trans, root, -extent_num_bytes)) { - should_end = true; - } if (btrfs_should_throttle_delayed_refs(trans, fs_info)) should_throttle = true; @@ -4795,7 +4761,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, if (path->slots[0] == 0 || path->slots[0] != pending_del_slot || - should_throttle || should_end) { + should_throttle) { if (pending_del_nr) { ret = btrfs_del_items(trans, root, path, pending_del_slot, @@ -4807,23 +4773,23 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, pending_del_nr = 0; } btrfs_release_path(path); - if (should_throttle) { - unsigned long updates = trans->delayed_ref_updates; - if (updates) { - trans->delayed_ref_updates = 0; - ret = btrfs_run_delayed_refs(trans, - updates * 2); - if (ret) - break; - } - } + /* -* if we failed to refill our space rsv, bail out -* and let the transaction restart +* We can generate a lot of delayed refs, so we need to +* throttle every once and a while and make sure we're +* adding enough space to keep up with the work we are +* generating. Since we hold a transaction here we can +* only FLUSH_LIMIT, if this fails we just return EAGAIN +* and let the normal space allocation stuff do it's +* work. */ - if (should_end) { -
[PATCH 08/35] btrfs: release metadata before running delayed refs
We want to release the unused reservation we have since it refills the delayed refs reserve, which will make everything go smoother when running the delayed refs if we're short on our reservation. Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 99741254e27e..ebb0c0405598 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1915,6 +1915,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) return ret; } + btrfs_trans_release_metadata(trans); + trans->block_rsv = NULL; + /* make a pass through all the delayed refs we have so far * any runnings procs may add more while we are here */ @@ -1924,9 +1927,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) return ret; } - btrfs_trans_release_metadata(trans); - trans->block_rsv = NULL; - cur_trans = trans->transaction; /* -- 2.14.3
[PATCH 33/35] btrfs: fix insert_reserved error handling
We were not handling the reserved byte accounting properly for data references. Metadata was fine, if it errored out the error paths would free the bytes_reserved count and pin the extent, but it even missed one of the error cases. So instead move this handling up into run_one_delayed_ref so we are sure that both cases are properly cleaned up in case of a transaction abort. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 132a1157982c..fd9169f80de0 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2405,6 +2405,9 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans, insert_reserved); else BUG(); + if (ret && insert_reserved) + btrfs_pin_extent(trans->fs_info, node->bytenr, +node->num_bytes, 1); return ret; } @@ -8227,21 +8230,14 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans, } path = btrfs_alloc_path(); - if (!path) { - btrfs_free_and_pin_reserved_extent(fs_info, - extent_key.objectid, - fs_info->nodesize); + if (!path) return -ENOMEM; - } path->leave_spinning = 1; ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path, _key, size); if (ret) { btrfs_free_path(path); - btrfs_free_and_pin_reserved_extent(fs_info, - extent_key.objectid, - fs_info->nodesize); return ret; } -- 2.14.3
[PATCH 26/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head
Instead of open coding this stuff use the helper instead. Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index c72ab2ca7627..1d3f5731d616 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4232,12 +4232,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, if (head->must_insert_reserved) pin_bytes = true; btrfs_free_delayed_extent_op(head->extent_op); - delayed_refs->num_heads--; - if (head->processing == 0) - delayed_refs->num_heads_ready--; - atomic_dec(_refs->num_entries); - rb_erase(>href_node, _refs->href_root); - RB_CLEAR_NODE(>href_node); + btrfs_delete_ref_head(delayed_refs, head); spin_unlock(>lock); spin_unlock(_refs->lock); mutex_unlock(>mutex); -- 2.14.3
[PATCH 24/35] btrfs: pass delayed_refs_root to btrfs_delayed_ref_lock
We don't need the trans except to get the delayed_refs_root, so just pass the delayed_refs_root into btrfs_delayed_ref_lock and call it a day. Signed-off-by: Josef Bacik --- fs/btrfs/delayed-ref.c | 5 + fs/btrfs/delayed-ref.h | 2 +- fs/btrfs/extent-tree.c | 2 +- 3 files changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 96ce087747b2..87778645bf4a 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -197,12 +197,9 @@ find_ref_head(struct rb_root *root, u64 bytenr, return NULL; } -int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans, +int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs, struct btrfs_delayed_ref_head *head) { - struct btrfs_delayed_ref_root *delayed_refs; - - delayed_refs = >transaction->delayed_refs; lockdep_assert_held(_refs->lock); if (mutex_trylock(>mutex)) return 0; diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index 7769177b489e..ee636d7a710a 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -255,7 +255,7 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head * btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs, u64 bytenr); -int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans, +int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs, struct btrfs_delayed_ref_head *head); static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head) { diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index fc30ff96f0d6..32579221d900 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2591,7 +2591,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, /* grab the lock that says we are going to process * all the refs for this head */ - ret = btrfs_delayed_ref_lock(trans, locked_ref); + ret = btrfs_delayed_ref_lock(delayed_refs, locked_ref); spin_unlock(_refs->lock); /* * we may have dropped the spin lock to get the head -- 2.14.3
[PATCH 25/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock
We have this open coded in btrfs_destroy_delayed_refs, use the helper instead. Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 11 ++- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 11ea2ea7439e..c72ab2ca7627 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4214,16 +4214,9 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, head = rb_entry(node, struct btrfs_delayed_ref_head, href_node); - if (!mutex_trylock(>mutex)) { - refcount_inc(>refs); - spin_unlock(_refs->lock); - - mutex_lock(>mutex); - mutex_unlock(>mutex); - btrfs_put_delayed_ref_head(head); - spin_lock(_refs->lock); + if (btrfs_delayed_ref_lock(delayed_refs, head)) continue; - } + spin_lock(>lock); while ((n = rb_first(>ref_tree)) != NULL) { ref = rb_entry(n, struct btrfs_delayed_ref_node, -- 2.14.3
[PATCH 21/35] btrfs: only run delayed refs if we're committing
I noticed in a giant dbench run that we spent a lot of time on lock contention while running transaction commit. This is because dbench results in a lot of fsync()'s that do a btrfs_transaction_commit(), and they all run the delayed refs first thing, so they all contend with each other. This leads to seconds of 0 throughput. Change this to only run the delayed refs if we're the ones committing the transaction. This makes the latency go away and we get no more lock contention. Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 24 +--- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index ebb0c0405598..2bb19e2ded5e 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) btrfs_trans_release_metadata(trans); trans->block_rsv = NULL; - /* make a pass through all the delayed refs we have so far -* any runnings procs may add more while we are here -*/ - ret = btrfs_run_delayed_refs(trans, 0); - if (ret) { - btrfs_end_transaction(trans); - return ret; - } - cur_trans = trans->transaction; /* @@ -1939,12 +1930,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) if (!list_empty(>new_bgs)) btrfs_create_pending_block_groups(trans); - ret = btrfs_run_delayed_refs(trans, 0); - if (ret) { - btrfs_end_transaction(trans); - return ret; - } - if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, _trans->flags)) { int run_it = 0; @@ -2015,6 +2000,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) spin_unlock(_info->trans_lock); } + /* +* We are now the only one in the commit area, we can run delayed refs +* without hitting a bunch of lock contention from a lot of people +* trying to commit the transaction at once. +*/ + ret = btrfs_run_delayed_refs(trans, 0); + if (ret) + goto cleanup_transaction; + extwriter_counter_dec(cur_trans, trans->type); ret = btrfs_start_delalloc_flush(fs_info); -- 2.14.3
[PATCH 23/35] btrfs: assert on non-empty delayed iputs
I ran into an issue where there was some reference being held on an inode that I couldn't track. This assert wasn't triggered, but it at least rules out we're doing something stupid. Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 0e42401756b8..11ea2ea7439e 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3979,6 +3979,7 @@ void close_ctree(struct btrfs_fs_info *fs_info) kthread_stop(fs_info->transaction_kthread); kthread_stop(fs_info->cleaner_kthread); + ASSERT(list_empty(_info->delayed_iputs)); set_bit(BTRFS_FS_CLOSING_DONE, _info->flags); btrfs_free_qgroup_config(fs_info); -- 2.14.3
[PATCH 19/35] btrfs: don't use ctl->free_space for max_extent_size
From: Josef Bacik max_extent_size is supposed to be the largest contiguous range for the space info, and ctl->free_space is the total free space in the block group. We need to keep track of these separately and _only_ use the max_free_space if we don't have a max_extent_size, as that means our original request was too large to search any of the block groups for and therefore wouldn't have a max_extent_size set. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 664b867ae499..ca98c39308f6 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7460,6 +7460,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, struct btrfs_block_group_cache *block_group = NULL; u64 search_start = 0; u64 max_extent_size = 0; + u64 max_free_space = 0; u64 empty_cluster = 0; struct btrfs_space_info *space_info; int loop = 0; @@ -7755,8 +7756,8 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, spin_lock(>tree_lock); if (ctl->free_space < num_bytes + empty_cluster + empty_size) { - if (ctl->free_space > max_extent_size) - max_extent_size = ctl->free_space; + max_free_space = max(max_free_space, +ctl->free_space); spin_unlock(>tree_lock); goto loop; } @@ -7923,6 +7924,8 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, } out: if (ret == -ENOSPC) { + if (!max_extent_size) + max_extent_size = max_free_space; spin_lock(_info->lock); space_info->max_extent_size = max_extent_size; spin_unlock(_info->lock); -- 2.14.3
[PATCH 22/35] btrfs: make sure we create all new bgs
We can actually allocate new chunks while we're creating our bg's, so instead of doing list_for_each_safe, just do while (!list_empty()) so we make sure to catch any new bg's that get added to the list. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index ca98c39308f6..fc30ff96f0d6 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -10331,7 +10331,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info) void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) { struct btrfs_fs_info *fs_info = trans->fs_info; - struct btrfs_block_group_cache *block_group, *tmp; + struct btrfs_block_group_cache *block_group; struct btrfs_root *extent_root = fs_info->extent_root; struct btrfs_block_group_item item; struct btrfs_key key; @@ -10339,7 +10339,10 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) bool can_flush_pending_bgs = trans->can_flush_pending_bgs; trans->can_flush_pending_bgs = false; - list_for_each_entry_safe(block_group, tmp, >new_bgs, bg_list) { + while (!list_empty(>new_bgs)) { + block_group = list_first_entry(>new_bgs, + struct btrfs_block_group_cache, + bg_list); if (ret) goto next; -- 2.14.3
[PATCH 20/35] btrfs: reset max_extent_size on clear in a bitmap
From: Josef Bacik We need to clear the max_extent_size when we clear bits from a bitmap since it could have been from the range that contains the max_extent_size. Signed-off-by: Josef Bacik --- fs/btrfs/free-space-cache.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 53521027dd78..7faca05e61ea 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -1683,6 +1683,8 @@ static inline void __bitmap_clear_bits(struct btrfs_free_space_ctl *ctl, bitmap_clear(info->bitmap, start, count); info->bytes -= bytes; + if (info->max_extent_size > ctl->unit) + info->max_extent_size = 0; } static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl, -- 2.14.3
[PATCH 14/35] btrfs: don't enospc all tickets on flush failure
With the introduction of the per-inode block_rsv it became possible to have really really large reservation requests made because of data fragmentation. Since the ticket stuff assumed that we'd always have relatively small reservation requests it just killed all tickets if we were unable to satisfy the current request. However this is generally not the case anymore. So fix this logic to instead see if we had a ticket that we were able to give some reservation to, and if we were continue the flushing loop again. Likewise we make the tickets use the space_info_add_old_bytes() method of returning what reservation they did receive in hopes that it could satisfy reservations down the line. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 45 + 1 file changed, 25 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index f4e7caf37d6c..7c0e99e1f56c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4771,6 +4771,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim, } struct reserve_ticket { + u64 orig_bytes; u64 bytes; int error; struct list_head list; @@ -4993,7 +4994,7 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info, !test_bit(BTRFS_FS_STATE_REMOUNTING, _info->fs_state)); } -static void wake_all_tickets(struct list_head *head) +static bool wake_all_tickets(struct list_head *head) { struct reserve_ticket *ticket; @@ -5002,7 +5003,10 @@ static void wake_all_tickets(struct list_head *head) list_del_init(>list); ticket->error = -ENOSPC; wake_up(>wait); + if (ticket->bytes != ticket->orig_bytes) + return true; } + return false; } /* @@ -5057,8 +5061,12 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) if (flush_state > COMMIT_TRANS) { commit_cycles++; if (commit_cycles > 2) { - wake_all_tickets(_info->tickets); - space_info->flush = 0; + if (wake_all_tickets(_info->tickets)) { + flush_state = FLUSH_DELAYED_ITEMS_NR; + commit_cycles--; + } else { + space_info->flush = 0; + } } else { flush_state = FLUSH_DELAYED_ITEMS_NR; } @@ -5112,10 +5120,11 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, - struct reserve_ticket *ticket, u64 orig_bytes) + struct reserve_ticket *ticket) { DEFINE_WAIT(wait); + u64 reclaim_bytes = 0; int ret = 0; spin_lock(_info->lock); @@ -5136,14 +5145,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, ret = ticket->error; if (!list_empty(>list)) list_del_init(>list); - if (ticket->bytes && ticket->bytes < orig_bytes) { - u64 num_bytes = orig_bytes - ticket->bytes; - space_info->bytes_may_use -= num_bytes; - trace_btrfs_space_reservation(fs_info, "space_info", - space_info->flags, num_bytes, 0); - } + if (ticket->bytes && ticket->bytes < ticket->orig_bytes) + reclaim_bytes = ticket->orig_bytes - ticket->bytes; spin_unlock(_info->lock); + if (reclaim_bytes) + space_info_add_old_bytes(fs_info, space_info, reclaim_bytes); return ret; } @@ -5169,6 +5176,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, { struct reserve_ticket ticket; u64 used; + u64 reclaim_bytes = 0; int ret = 0; ASSERT(orig_bytes); @@ -5204,6 +5212,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, * the list and we will do our own flushing further down. */ if (ret && flush != BTRFS_RESERVE_NO_FLUSH) { + ticket.orig_bytes = orig_bytes; ticket.bytes = orig_bytes; ticket.error = 0; init_waitqueue_head(); @@ -5244,25 +5253,21 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, return ret; if (flush == BTRFS_RESERVE_FLUSH_ALL) - return wait_reserve_ticket(fs_info, space_info, , - orig_bytes); + return wait_reserve_ticket(fs_info,
[PATCH 15/35] btrfs: run delayed iputs before committing
We want to have a complete picture of any delayed inode updates before we make the decision to commit or not, so make sure we run delayed iputs before making the decision to commit or not. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 4 1 file changed, 4 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 7c0e99e1f56c..064db7ebaf67 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4831,6 +4831,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, goto commit; } + mutex_lock(_info->cleaner_delayed_iput_mutex); + btrfs_run_delayed_iputs(fs_info); + mutex_unlock(_info->cleaner_delayed_iput_mutex); + spin_lock(_rsv->lock); reclaim_bytes += delayed_rsv->reserved; spin_unlock(_rsv->lock); -- 2.14.3
[PATCH 16/35] btrfs: loop in inode_rsv_refill
With severe fragmentation we can end up with our inode rsv size being huge during writeout, which would cause us to need to make very large metadata reservations. However we may not actually need that much once writeout is complete. So instead try to make our reservation, and if we couldn't make it re-calculate our new reservation size and try again. If our reservation size doesn't change between tries then we know we are actually out of space and can error out. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 19 +-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 064db7ebaf67..664b867ae499 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5769,10 +5769,11 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, { struct btrfs_root *root = inode->root; struct btrfs_block_rsv *block_rsv = >block_rsv; - u64 num_bytes = 0; + u64 num_bytes = 0, last = 0; u64 qgroup_num_bytes = 0; int ret = -ENOSPC; +again: spin_lock(_rsv->lock); if (block_rsv->reserved < block_rsv->size) num_bytes = block_rsv->size - block_rsv->reserved; @@ -5797,8 +5798,22 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, spin_lock(_rsv->lock); block_rsv->qgroup_rsv_reserved += qgroup_num_bytes; spin_unlock(_rsv->lock); - } else + } else { btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes); + + /* +* If we are fragmented we can end up with a lot of outstanding +* extents which will make our size be much larger than our +* reserved amount. If we happen to try to do a reservation +* here that may result in us trying to do a pretty hefty +* reservation, which we may not need once delalloc flushing +* happens. If this is the case try and do the reserve again. +*/ + if (flush == BTRFS_RESERVE_FLUSH_ALL && last != num_bytes) { + last = num_bytes; + goto again; + } + } return ret; } -- 2.14.3
[PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper
From: Josef Bacik We were missing some quota cleanups in check_ref_cleanup, so break the ref head accounting cleanup into a helper and call that from both check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that we don't screw up accounting in the future for other things that we add. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 67 +- 1 file changed, 39 insertions(+), 28 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 6799950fa057..4c9fd35bca07 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2461,6 +2461,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans, return ret ? ret : 1; } +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, + struct btrfs_delayed_ref_head *head) +{ + struct btrfs_fs_info *fs_info = trans->fs_info; + struct btrfs_delayed_ref_root *delayed_refs = + >transaction->delayed_refs; + + if (head->total_ref_mod < 0) { + struct btrfs_space_info *space_info; + u64 flags; + + if (head->is_data) + flags = BTRFS_BLOCK_GROUP_DATA; + else if (head->is_system) + flags = BTRFS_BLOCK_GROUP_SYSTEM; + else + flags = BTRFS_BLOCK_GROUP_METADATA; + space_info = __find_space_info(fs_info, flags); + ASSERT(space_info); + percpu_counter_add_batch(_info->total_bytes_pinned, + -head->num_bytes, + BTRFS_TOTAL_BYTES_PINNED_BATCH); + + if (head->is_data) { + spin_lock(_refs->lock); + delayed_refs->pending_csums -= head->num_bytes; + spin_unlock(_refs->lock); + } + } + + /* Also free its reserved qgroup space */ + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, + head->qgroup_reserved); +} + static int cleanup_ref_head(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *head) { @@ -2496,31 +2531,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, spin_unlock(_refs->lock); spin_unlock(>lock); - trace_run_delayed_ref_head(fs_info, head, 0); - - if (head->total_ref_mod < 0) { - struct btrfs_space_info *space_info; - u64 flags; - - if (head->is_data) - flags = BTRFS_BLOCK_GROUP_DATA; - else if (head->is_system) - flags = BTRFS_BLOCK_GROUP_SYSTEM; - else - flags = BTRFS_BLOCK_GROUP_METADATA; - space_info = __find_space_info(fs_info, flags); - ASSERT(space_info); - percpu_counter_add_batch(_info->total_bytes_pinned, - -head->num_bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH); - - if (head->is_data) { - spin_lock(_refs->lock); - delayed_refs->pending_csums -= head->num_bytes; - spin_unlock(_refs->lock); - } - } - if (head->must_insert_reserved) { btrfs_pin_extent(fs_info, head->bytenr, head->num_bytes, 1); @@ -2530,9 +2540,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, } } - /* Also free its reserved qgroup space */ - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, - head->qgroup_reserved); + cleanup_ref_head_accounting(trans, head); + + trace_run_delayed_ref_head(fs_info, head, 0); btrfs_delayed_ref_unlock(head); btrfs_put_delayed_ref_head(head); return 0; @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (head->must_insert_reserved) ret = 1; + cleanup_ref_head_accounting(trans, head); mutex_unlock(>mutex); btrfs_put_delayed_ref_head(head); return ret; -- 2.14.3
[PATCH 09/35] btrfs: protect space cache inode alloc with nofs
If we're allocating a new space cache inode it's likely going to be under a transaction handle, so we need to use memalloc_nofs_save() in order to avoid deadlocks, and more importantly lockdep messages that make xfstests fail. Signed-off-by: Josef Bacik --- fs/btrfs/free-space-cache.c | 4 1 file changed, 4 insertions(+) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index c3888c113d81..db93a5f035a0 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -10,6 +10,7 @@ #include #include #include +#include #include "ctree.h" #include "free-space-cache.h" #include "transaction.h" @@ -47,6 +48,7 @@ static struct inode *__lookup_free_space_inode(struct btrfs_root *root, struct btrfs_free_space_header *header; struct extent_buffer *leaf; struct inode *inode = NULL; + unsigned nofs_flag; int ret; key.objectid = BTRFS_FREE_SPACE_OBJECTID; @@ -68,7 +70,9 @@ static struct inode *__lookup_free_space_inode(struct btrfs_root *root, btrfs_disk_key_to_cpu(, _key); btrfs_release_path(path); + nofs_flag = memalloc_nofs_save(); inode = btrfs_iget(fs_info->sb, , root, NULL); + memalloc_nofs_restore(nofs_flag); if (IS_ERR(inode)) return inode; -- 2.14.3
[PATCH 05/35] btrfs: introduce delayed_refs_rsv
From: Josef Bacik Traditionally we've had voodoo in btrfs to account for the space that delayed refs may take up by having a global_block_rsv. This works most of the time, except when it doesn't. We've had issues reported and seen in production where sometimes the global reserve is exhausted during transaction commit before we can run all of our delayed refs, resulting in an aborted transaction. Because of this voodoo we have equally dubious flushing semantics around throttling delayed refs which we often get wrong. So instead give them their own block_rsv. This way we can always know exactly how much outstanding space we need for delayed refs. This allows us to make sure we are constantly filling that reservation up with space, and allows us to put more precise pressure on the enospc system. Instead of doing math to see if its a good time to throttle, the normal enospc code will be invoked if we have a lot of delayed refs pending, and they will be run via the normal flushing mechanism. For now the delayed_refs_rsv will hold the reservations for the delayed refs, the block group updates, and deleting csums. We could have a separate rsv for the block group updates, but the csum deletion stuff is still handled via the delayed_refs so that will stay there. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 24 +++- fs/btrfs/delayed-ref.c | 28 - fs/btrfs/disk-io.c | 3 + fs/btrfs/extent-tree.c | 268 +++ fs/btrfs/transaction.c | 68 +-- include/trace/events/btrfs.h | 2 + 6 files changed, 294 insertions(+), 99 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 66f1d3895bca..0a4e55703d48 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -452,8 +452,9 @@ struct btrfs_space_info { #defineBTRFS_BLOCK_RSV_TRANS 3 #defineBTRFS_BLOCK_RSV_CHUNK 4 #defineBTRFS_BLOCK_RSV_DELOPS 5 -#defineBTRFS_BLOCK_RSV_EMPTY 6 -#defineBTRFS_BLOCK_RSV_TEMP7 +#define BTRFS_BLOCK_RSV_DELREFS6 +#defineBTRFS_BLOCK_RSV_EMPTY 7 +#defineBTRFS_BLOCK_RSV_TEMP8 struct btrfs_block_rsv { u64 size; @@ -794,6 +795,8 @@ struct btrfs_fs_info { struct btrfs_block_rsv chunk_block_rsv; /* block reservation for delayed operations */ struct btrfs_block_rsv delayed_block_rsv; + /* block reservation for delayed refs */ + struct btrfs_block_rsv delayed_refs_rsv; struct btrfs_block_rsv empty_block_rsv; @@ -2723,10 +2726,12 @@ enum btrfs_reserve_flush_enum { enum btrfs_flush_state { FLUSH_DELAYED_ITEMS_NR = 1, FLUSH_DELAYED_ITEMS = 2, - FLUSH_DELALLOC = 3, - FLUSH_DELALLOC_WAIT = 4, - ALLOC_CHUNK = 5, - COMMIT_TRANS= 6, + FLUSH_DELAYED_REFS_NR = 3, + FLUSH_DELAYED_REFS = 4, + FLUSH_DELALLOC = 5, + FLUSH_DELALLOC_WAIT = 6, + ALLOC_CHUNK = 7, + COMMIT_TRANS= 8, }; int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); @@ -2777,6 +2782,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info, void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *block_rsv, u64 num_bytes); +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr); +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans); +int btrfs_refill_delayed_refs_rsv(struct btrfs_fs_info *fs_info, + enum btrfs_reserve_flush_enum flush); +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info, + struct btrfs_block_rsv *src, + u64 num_bytes); int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_put_block_group_cache(struct btrfs_fs_info *info); diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 27f7dd4e3d52..96ce087747b2 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -467,11 +467,14 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans, * existing and update must have the same bytenr */ static noinline void -update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs, +update_existing_head_ref(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *existing, struct btrfs_delayed_ref_head *update, int *old_ref_mod_ret) { + struct btrfs_delayed_ref_root *delayed_refs = + >transaction->delayed_refs; + struct
[PATCH 01/35] btrfs: add btrfs_delete_ref_head helper
From: Josef Bacik We do this dance in cleanup_ref_head and check_ref_cleanup, unify it into a helper and cleanup the calling functions. Signed-off-by: Josef Bacik --- fs/btrfs/delayed-ref.c | 14 ++ fs/btrfs/delayed-ref.h | 3 ++- fs/btrfs/extent-tree.c | 24 3 files changed, 20 insertions(+), 21 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 62ff545ba1f7..3a9e4ac21794 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -393,6 +393,20 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans) return head; } +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head) +{ + lockdep_assert_held(_refs->lock); + lockdep_assert_held(>lock); + + rb_erase(>href_node, _refs->href_root); + RB_CLEAR_NODE(>href_node); + atomic_dec(_refs->num_entries); + delayed_refs->num_heads--; + if (head->processing == 0) + delayed_refs->num_heads_ready--; +} + /* * Helper to insert the ref_node to the tail or merge with tail. * diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index d9f2a4ebd5db..7769177b489e 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head) { mutex_unlock(>mutex); } - +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head); struct btrfs_delayed_ref_head * btrfs_select_ref_head(struct btrfs_trans_handle *trans); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index f77226d8020a..6799950fa057 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2492,12 +2492,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, spin_unlock(_refs->lock); return 1; } - delayed_refs->num_heads--; - rb_erase(>href_node, _refs->href_root); - RB_CLEAR_NODE(>href_node); - spin_unlock(>lock); + btrfs_delete_ref_head(delayed_refs, head); spin_unlock(_refs->lock); - atomic_dec(_refs->num_entries); + spin_unlock(>lock); trace_run_delayed_ref_head(fs_info, head, 0); @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (!mutex_trylock(>mutex)) goto out; - /* -* at this point we have a head with no other entries. Go -* ahead and process it. -*/ - rb_erase(>href_node, _refs->href_root); - RB_CLEAR_NODE(>href_node); - atomic_dec(_refs->num_entries); - - /* -* we don't take a ref on the node because we're removing it from the -* tree, so we just steal the ref the tree was holding. -*/ - delayed_refs->num_heads--; - if (head->processing == 0) - delayed_refs->num_heads_ready--; + btrfs_delete_ref_head(delayed_refs, head); head->processing = 0; + spin_unlock(>lock); spin_unlock(_refs->lock); -- 2.14.3
[PATCH 00/35] My current patch queue
This is the current queue of things that I've been working on. The main thing these patches are doing is separating out the delayed refs reservations from the global reserve into their own block rsv. We have been consistently hitting issues in production where we abort a transaction because we run out of the global reserve either while running delayed refs or while updating dirty block groups. This is because the math around global reserves is made up bullshit magic that has been tweaked more and more throughout the years. The result is something that is inconsistent across the board and sometimes wrong. So instead we need a way to know exactly how much space we need to keep around in order to satisfy our outstanding delayed refs and our dirty block groups. Since we don't know how many delayed refs we need at the start of any modification we simply use the nr_items passed into btrfs_start_transaction() as a guess for what we may need. This has the side effect of putting more pressure on the ENOSPC system, but it's pressure we can deal with more intelligently because we always know how much space we have outstanding, instead of guessing with weird global reserve math. This works similar to every other reservation we have, we reserve the worst case up front, and then at transaction end time we free up any space we didn't actually use for delayed refs. My performance tests show that we are bit faster now since we can do more intelligent flushing and don't have to fall back on simply committing the transaction in hopes that we have enough space for everything we need to do. That leads me to the 2nd part of this pull, there's a bunch of fixes around ENOSPC. Because we are a bit faster now there were a bunch of things uncovered in testing, but they seem to be all resolved now. The final chunk of fixes are around transaction aborts. There were a lot of accounting bugs I was running into while running generic/435, so I fixed a bunch of those up so now it runs cleanly. I have been running these patches through xfstests on multiple machines for a while, they are pretty solid and ready for wider testing and review. Thanks, Josef
[PATCH 06/35] btrfs: check if free bgs for commit
may_commit_transaction will skip committing the transaction if we don't have enough pinned space or if we're trying to find space for a SYSTEM chunk. However if we have pending free block groups in this transaction we still want to commit as we may be able to allocate a chunk to make our reservation. So instead of just returning ENOSPC, check if we have free block groups pending, and if so commit the transaction to allow us to use that free space. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 18 -- 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 6e7f350754d2..80615a579b18 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4804,6 +4804,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, struct btrfs_trans_handle *trans; u64 bytes; u64 reclaim_bytes = 0; + bool do_commit = true; trans = (struct btrfs_trans_handle *)current->journal_info; if (trans) @@ -4832,8 +4833,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, * See if there is some space in the delayed insertion reservation for * this reservation. */ - if (space_info != delayed_rsv->space_info) - return -ENOSPC; + if (space_info != delayed_rsv->space_info) { + do_commit = false; + goto commit; + } spin_lock(_rsv->lock); reclaim_bytes += delayed_rsv->reserved; @@ -4848,15 +4851,18 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, if (__percpu_counter_compare(_info->total_bytes_pinned, bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) { - return -ENOSPC; - } - + BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) + do_commit = false; commit: trans = btrfs_join_transaction(fs_info->extent_root); if (IS_ERR(trans)) return -ENOSPC; + if (!do_commit && + !test_bit(BTRFS_TRANS_HAVE_FREE_BGS, >transaction->flags)) { + btrfs_end_transaction(trans); + return -ENOSPC; + } return btrfs_commit_transaction(trans); } -- 2.14.3
[PATCH 04/35] btrfs: only track ref_heads in delayed_ref_updates
From: Josef Bacik We use this number to figure out how many delayed refs to run, but __btrfs_run_delayed_refs really only checks every time we need a new delayed ref head, so we always run at least one ref head completely no matter what the number of items on it. So instead track only the ref heads added by this trans handle and adjust the counting appropriately in __btrfs_run_delayed_refs. Signed-off-by: Josef Bacik --- fs/btrfs/delayed-ref.c | 3 --- fs/btrfs/extent-tree.c | 5 + 2 files changed, 1 insertion(+), 7 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 3a9e4ac21794..27f7dd4e3d52 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -234,8 +234,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans, ref->in_tree = 0; btrfs_put_delayed_ref(ref); atomic_dec(_refs->num_entries); - if (trans->delayed_ref_updates) - trans->delayed_ref_updates--; } static bool merge_ref(struct btrfs_trans_handle *trans, @@ -460,7 +458,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans, if (ref->action == BTRFS_ADD_DELAYED_REF) list_add_tail(>add_list, >ref_add_list); atomic_inc(>num_entries); - trans->delayed_ref_updates++; spin_unlock(>lock); return ret; } diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 87c42a2c45b1..20531389a20a 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2583,6 +2583,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, spin_unlock(_refs->lock); break; } + count++; /* grab the lock that says we are going to process * all the refs for this head */ @@ -2596,7 +2597,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, */ if (ret == -EAGAIN) { locked_ref = NULL; - count++; continue; } } @@ -2624,7 +2624,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, unselect_delayed_ref_head(delayed_refs, locked_ref); locked_ref = NULL; cond_resched(); - count++; continue; } @@ -2642,7 +2641,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, return ret; } locked_ref = NULL; - count++; continue; } @@ -2693,7 +2691,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, } btrfs_put_delayed_ref(ref); - count++; cond_resched(); } -- 2.14.3
[PATCH 03/35] btrfs: use cleanup_extent_op in check_ref_cleanup
From: Josef Bacik Unify the extent_op handling as well, just add a flag so we don't actually run the extent op from check_ref_cleanup and instead return a value so that we can skip cleaning up the ref head. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 4c9fd35bca07..87c42a2c45b1 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2443,18 +2443,23 @@ static void unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_ref } static int cleanup_extent_op(struct btrfs_trans_handle *trans, -struct btrfs_delayed_ref_head *head) +struct btrfs_delayed_ref_head *head, +bool run_extent_op) { struct btrfs_delayed_extent_op *extent_op = head->extent_op; int ret; if (!extent_op) return 0; + head->extent_op = NULL; if (head->must_insert_reserved) { btrfs_free_delayed_extent_op(extent_op); return 0; + } else if (!run_extent_op) { + return 1; } + spin_unlock(>lock); ret = run_delayed_extent_op(trans, head, extent_op); btrfs_free_delayed_extent_op(extent_op); @@ -2506,7 +2511,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, delayed_refs = >transaction->delayed_refs; - ret = cleanup_extent_op(trans, head); + ret = cleanup_extent_op(trans, head, true); if (ret < 0) { unselect_delayed_ref_head(delayed_refs, head); btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret); @@ -6977,12 +6982,8 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (!RB_EMPTY_ROOT(>ref_tree)) goto out; - if (head->extent_op) { - if (!head->must_insert_reserved) - goto out; - btrfs_free_delayed_extent_op(head->extent_op); - head->extent_op = NULL; - } + if (cleanup_extent_op(trans, head, false)) + goto out; /* * waiting for the lock here would deadlock. If someone else has it -- 2.14.3
[PATCH 07/35] btrfs: dump block_rsv whe dumping space info
For enospc_debug having the block rsvs is super helpful to see if we've done something wrong. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 80615a579b18..df826f713034 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7910,6 +7910,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, return ret; } +static void dump_block_rsv(struct btrfs_block_rsv *rsv) +{ + spin_lock(>lock); + printk(KERN_ERR "%d: size %llu reserved %llu\n", + rsv->type, (unsigned long long)rsv->size, + (unsigned long long)rsv->reserved); + spin_unlock(>lock); +} + static void dump_space_info(struct btrfs_fs_info *fs_info, struct btrfs_space_info *info, u64 bytes, int dump_block_groups) @@ -7929,6 +7938,12 @@ static void dump_space_info(struct btrfs_fs_info *fs_info, info->bytes_readonly); spin_unlock(>lock); + dump_block_rsv(_info->global_block_rsv); + dump_block_rsv(_info->trans_block_rsv); + dump_block_rsv(_info->chunk_block_rsv); + dump_block_rsv(_info->delayed_block_rsv); + dump_block_rsv(_info->delayed_refs_rsv); + if (!dump_block_groups) return; -- 2.14.3
Re: How to erase a RAID1 (+++)?
On Thu, Aug 30, 2018 at 9:21 AM, Alberto Bursi wrote: > > On 8/30/2018 11:13 AM, Pierre Couderc wrote: >> Trying to install a RAID1 on a debian stretch, I made some mistake and >> got this, after installing on disk1 and trying to add second disk : >> >> >> root@server:~# fdisk -l >> Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors >> Units: sectors of 1 * 512 = 512 bytes >> Sector size (logical/physical): 512 bytes / 512 bytes >> I/O size (minimum/optimal): 512 bytes / 512 bytes >> Disklabel type: dos >> Disk identifier: 0x2a799300 >> >> Device Boot StartEndSectors Size Id Type >> /dev/sda1 * 2048 3907028991 3907026944 1.8T 83 Linux >> >> >> Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors >> Units: sectors of 1 * 512 = 512 bytes >> Sector size (logical/physical): 512 bytes / 512 bytes >> I/O size (minimum/optimal): 512 bytes / 512 bytes >> Disklabel type: dos >> Disk identifier: 0x9770f6fa >> >> Device Boot StartEndSectors Size Id Type >> /dev/sdb1 * 2048 3907029167 3907027120 1.8T 5 Extended >> >> >> And : >> >> root@server:~# btrfs fi show >> Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed >> Total devices 2 FS bytes used 1.10GiB >> devid1 size 1.82TiB used 4.02GiB path /dev/sda1 >> devid2 size 1.00KiB used 0.00B path /dev/sdb1 >> >> ... >> >> My purpose is a simple RAID1 main fs, with bootable flag on the 2 >> disks in prder to start in degraded mode >> How to get out ofr that...? >> >> Thnaks >> PC > > > sdb1 is an extended partition, you cannot format an extended partition. > > change sdb1 into primary partition or add a logical partition into it. Ahh you're correct. There is special treatment of 0x05, it's a logical container with the start address actually pointing to the address where the EBR is. And that EBR's first record contains the actual real extended partition information. So this represents two bugs in the installer: 1. If there's only one partition on a drive, it should be primary by default, not extended. 2. But if extended, it must point to an EBR, and the EBR must be created at that location. Obviously since there is no /dev/sdb2, this EBR is not present. -- Chris Murphy
Re: [RFC PATCH 0/6] btrfs-progs: build distinct binaries for specific btrfs subcommands
On 2018-08-30 13:13, Axel Burri wrote: On 29/08/2018 21.02, Austin S. Hemmelgarn wrote: On 2018-08-29 13:24, Axel Burri wrote: This patch allows to build distinct binaries for specific btrfs subcommands, e.g. "btrfs-subvolume-show" which would be identical to "btrfs subvolume show". Motivation: While btrfs-progs offer the all-inclusive "btrfs" command, it gets pretty cumbersome to restrict privileges to the subcommands [1]. Common approaches are to either setuid root for "/sbin/btrfs" (which is not recommended at all), or to write sudo rules for each subcommand. Separating the subcommands into distinct binaries makes it easy to set elevated privileges using capabilities(7) or setuid. A typical use case where this is needed is when it comes to automated scripts, e.g. btrbk [2] [3] creating snapshots and send/receive them via ssh. Let me start by saying I think this is a great idea to have as an option, and that the motivation is a particularly good one. I've posted my opinions on your two open questions below, but there's two other comments I'd like to make: * Is there some particular reason that this only includes the commands it does, and _hard codes_ which ones it works with? if we just do everything instead of only the stuff we think needs certain capabilities, then we can auto-generate the list of commands to be processed based on function names in the C files, and it will automatically pick up any newly added commands. At the very least, it could still parse through the C files and look for tags in the comments for the functions to indicate which ones need to be processed this way. Either case will make it significantly easier to add new commands, and would also better justify the overhead of shipping all the files pre-generated (because there would be much more involved in pre-generating them). It includes the commands that are required by btrbk. It was quite painful to figure out the required capabilities (reading kernel code and some trial and error involved), and I did not get around to include other commands yet. Yeah, I can imagine that it was not an easy task. I've actually been thinking of writing a script to scan the kernel sources and assemble a summary of the permissions checks performed by each system call and ioctl so that stuff like this is a bit easier, but that's unfortunately way beyond my abilities right now (parsing C and building call graphs is not easy no matter what language you're doing it with). I like your idea of adding some tags in the C files, I'll try to implement this, and we'll see what it gets to. Something embedded in the comments is likely to be the easiest option in terms of making sure it doesn't break the regular build. Just the tagging in general would be useful as documentation though. It would be kind of neat to have the list of capabilities needed for each one auto-generated from what it calls, but that's getting into some particularly complex territory that would likely require call graphs to properly implement. * While not essential, it would be really neat to have the `btrfs` command detect if an associated binary exists for whatever command was just invoked, and automatically exec that (possibly with some verification) instead of calling the command directly so that desired permissions are enforced. This would mitigate the need for users to remember different command names depending on execution context. Hmm this sounds a bit too magic for me, and would probably be more confusing than useful. It would mean than running "btrfs" as user would work when splitted commands are available, and would not work if not. It would also mean scripts would not have to add special handling for the case of running as a non-root user and seeing if the split commands actually exist or not (and, for that matter, would not have to directly depend on having the split commands at all), and that users would not need to worry about how to call BTRFS based on who they were running as. Realistically, I'd expect the same error to show if the binary isn't available as if it's not executable, so that it just becomes a case of 'if you see this error, re-run the same thing as root and it should work'. Description: Patch 1 adds a template as well as a generator shell script for the splitted subcommands. Patch 2 adds the generated subcommand source files. Patch 3-5 adds a "install-splitcmd-setcap" make target, with different approaches (either hardcoded in Makefile, or more generically by including "Makefile.install_setcap" generated by "splitcmd-gen.sh"). Open Questions: 1. "make install-splitcmd-setcap" installs the binaries with hardcoded group "btrfs". This needs to be configurable (how?). Another approach would be to not set the group at all, and leave this to the user or distro packaging script. Leave it to the user or distro. It's likely to end up standardized on the name 'btrfs', but it should be agnostic of that. 2.
Re: [RFC PATCH 0/6] btrfs-progs: build distinct binaries for specific btrfs subcommands
On 29/08/2018 21.02, Austin S. Hemmelgarn wrote: > On 2018-08-29 13:24, Axel Burri wrote: >> This patch allows to build distinct binaries for specific btrfs >> subcommands, e.g. "btrfs-subvolume-show" which would be identical to >> "btrfs subvolume show". >> >> >> Motivation: >> >> While btrfs-progs offer the all-inclusive "btrfs" command, it gets >> pretty cumbersome to restrict privileges to the subcommands [1]. >> Common approaches are to either setuid root for "/sbin/btrfs" (which >> is not recommended at all), or to write sudo rules for each >> subcommand. >> >> Separating the subcommands into distinct binaries makes it easy to set >> elevated privileges using capabilities(7) or setuid. A typical use >> case where this is needed is when it comes to automated scripts, >> e.g. btrbk [2] [3] creating snapshots and send/receive them via ssh. > Let me start by saying I think this is a great idea to have as an > option, and that the motivation is a particularly good one. > > I've posted my opinions on your two open questions below, but there's > two other comments I'd like to make: > > * Is there some particular reason that this only includes the commands > it does, and _hard codes_ which ones it works with? if we just do > everything instead of only the stuff we think needs certain > capabilities, then we can auto-generate the list of commands to be > processed based on function names in the C files, and it will > automatically pick up any newly added commands. At the very least, it > could still parse through the C files and look for tags in the comments > for the functions to indicate which ones need to be processed this way. > Either case will make it significantly easier to add new commands, and > would also better justify the overhead of shipping all the files > pre-generated (because there would be much more involved in > pre-generating them). It includes the commands that are required by btrbk. It was quite painful to figure out the required capabilities (reading kernel code and some trial and error involved), and I did not get around to include other commands yet. I like your idea of adding some tags in the C files, I'll try to implement this, and we'll see what it gets to. > * While not essential, it would be really neat to have the `btrfs` > command detect if an associated binary exists for whatever command was > just invoked, and automatically exec that (possibly with some > verification) instead of calling the command directly so that desired > permissions are enforced. This would mitigate the need for users to > remember different command names depending on execution context. Hmm this sounds a bit too magic for me, and would probably be more confusing than useful. It would mean than running "btrfs" as user would work when splitted commands are available, and would not work if not. >> >> >> Description: >> >> Patch 1 adds a template as well as a generator shell script for the >> splitted subcommands. >> >> Patch 2 adds the generated subcommand source files. >> >> Patch 3-5 adds a "install-splitcmd-setcap" make target, with different >> approaches (either hardcoded in Makefile, or more generically by >> including "Makefile.install_setcap" generated by "splitcmd-gen.sh"). >> >> >> Open Questions: >> >> 1. "make install-splitcmd-setcap" installs the binaries with hardcoded >> group "btrfs". This needs to be configurable (how?). Another approach >> would be to not set the group at all, and leave this to the user or >> distro packaging script. > Leave it to the user or distro. It's likely to end up standardized on > the name 'btrfs', but it should be agnostic of that. >> >> 2. Instead of the "install-splitcmd-setcap" make target, we could >> introduce a "configure --enable-splitted-subcommands" option, which >> would simply add all splitcmd binaries to the "all" and "install" >> targets without special treatment, and leave the setcap stuff to the >> user or distro packaging script (at least in gentoo, this needs to be >> specified using the "fcaps" eclass anyways [5]). > A bit of a nitpick, but 'split' is the proper past tense of the word > 'split', it's one of those exceptions that English has all over the > place. Even aside from that though, I think `separate` sounds more > natural for the configure option, or better yet, just make it > `--enable-fscaps` like most other packages do. > > That aside, I think having a configure option is the best way to do > this, it makes it very easy for distro build systems to handle it > because this is what they're used to doing anyway. It also makes it a > bit easier on the user, because it just becomes `make` to build > whichever version you want installed.
Re: How to erase a RAID1 (+++)?
On Thu, Aug 30, 2018 at 3:13 AM, Pierre Couderc wrote: > Trying to install a RAID1 on a debian stretch, I made some mistake and got > this, after installing on disk1 and trying to add second disk : > > > root@server:~# fdisk -l > Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors > Units: sectors of 1 * 512 = 512 bytes > Sector size (logical/physical): 512 bytes / 512 bytes > I/O size (minimum/optimal): 512 bytes / 512 bytes > Disklabel type: dos > Disk identifier: 0x2a799300 > > Device Boot StartEndSectors Size Id Type > /dev/sda1 * 2048 3907028991 3907026944 1.8T 83 Linux > > > Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors > Units: sectors of 1 * 512 = 512 bytes > Sector size (logical/physical): 512 bytes / 512 bytes > I/O size (minimum/optimal): 512 bytes / 512 bytes > Disklabel type: dos > Disk identifier: 0x9770f6fa > > Device Boot StartEndSectors Size Id Type > /dev/sdb1 * 2048 3907029167 3907027120 1.8T 5 Extended Extended partition type is not a problem if you're using GRUB as the bootloader; other bootloaders may not like this. Strictly speaking the type code 0x05 is incorrect, GRUB ignores type code, as does the kernel. GRUB also ignores the active bit (boot flag). > > > And : > > root@server:~# btrfs fi show > Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed > Total devices 2 FS bytes used 1.10GiB > devid1 size 1.82TiB used 4.02GiB path /dev/sda1 > devid2 size 1.00KiB used 0.00B path /dev/sdb1 That's odd; and I know you've moved on from this problem but I would have liked to see the super for /dev/sdb1 and also the installer log for what commands were used for partitioning, including mkfs and device add commands. For what it's worth, 'btrfs dev add' formats the device being added, it does not need to be formatted in advance, and also it resizes the file system properly. > My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks in > prder to start in degraded mode Good luck with this. The Btrfs archives are full of various limitations of Btrfs raid1. There is no automatic degraded mount for Btrfs. And if you persistently ask for degraded mount, you run the risk of other problems if there's merely a delayed discovery of one of the devices. Once a Btrfs volume is degraded, it does not automatically resume normal operation just because the formerly missing device becomes available. So... this is flat out not suitable for use cases where you need unattended raid1 degraded boot. -- Chris Murphy
fsck lowmem mode only: ERROR: errors found in fs roots
Hey. I've the following on a btrfs that's basically the system fs for my notebook: When booting from a USB stick with: # uname -a Linux heisenberg 4.17.0-3-amd64 #1 SMP Debian 4.17.17-1 (2018-08-18) x86_64 GNU/Linux # btrfs --version btrfs-progs v4.17 ... a lowmem mode fsck gives no error: # btrfs check --mode=lowmem /dev/mapper/system ; echo $? Checking filesystem on /dev/mapper/system UUID: 6050ca10-e778-4d08-80e7-6d27b9c89b3c checking extents checking free space cache checking fs roots ERROR: errors found in fs roots found 495910952960 bytes used, error(s) found total csum bytes: 481840472 total tree bytes: 2388819968 total fs tree bytes: 1651097600 total extent tree bytes: 161841152 btree space waste bytes: 446707102 file data blocks allocated: 6651878428672 referenced 542320984064 1 ... while a normal mode fsck doesn't give one: # btrfs check /dev/mapper/system ; echo $? Checking filesystem on /dev/mapper/system UUID: 6050ca10-e778-4d08-80e7-6d27b9c89b3c checking extents checking free space cache checking fs roots checking only csum items (without verifying data) checking root refs found 495910952960 bytes used, no error found total csum bytes: 481840472 total tree bytes: 2388819968 total fs tree bytes: 1651097600 total extent tree bytes: 161841152 btree space waste bytes: 446707102 file data blocks allocated: 6651878428672 referenced 542320984064 0 There were no unusual kernel log messages. Back in the normal system (no longer USB)... I spottet this: Aug 30 18:31:29 heisenberg kernel: BTRFS info (device dm-0): the free space cache file (22570598400) is invalid, skip it but not sure whether it's related (probably not)... and I haven't seen such a free space cache file issue (or any other btrfs errors) in a long while (I usually watch my kernel log once after booting has finished). Any ideas? Perhaps it's just yet another lowmem false positive... anything I can help in debugging this? Apart from this I haven't noticed any corruptions recently... just about to make a full backup of the fs (or better said a rw snapshot of the most of the data) with tar, so most data will be read soon at least once... an I would probably notice any further errors that are otherwise silent. Cheers, Chris.
Re: How to erase a RAID1 (+++)?
On 8/30/2018 11:13 AM, Pierre Couderc wrote: > Trying to install a RAID1 on a debian stretch, I made some mistake and > got this, after installing on disk1 and trying to add second disk : > > > root@server:~# fdisk -l > Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors > Units: sectors of 1 * 512 = 512 bytes > Sector size (logical/physical): 512 bytes / 512 bytes > I/O size (minimum/optimal): 512 bytes / 512 bytes > Disklabel type: dos > Disk identifier: 0x2a799300 > > Device Boot Start End Sectors Size Id Type > /dev/sda1 * 2048 3907028991 3907026944 1.8T 83 Linux > > > Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors > Units: sectors of 1 * 512 = 512 bytes > Sector size (logical/physical): 512 bytes / 512 bytes > I/O size (minimum/optimal): 512 bytes / 512 bytes > Disklabel type: dos > Disk identifier: 0x9770f6fa > > Device Boot Start End Sectors Size Id Type > /dev/sdb1 * 2048 3907029167 3907027120 1.8T 5 Extended > > > And : > > root@server:~# btrfs fi show > Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed > Total devices 2 FS bytes used 1.10GiB > devid 1 size 1.82TiB used 4.02GiB path /dev/sda1 > devid 2 size 1.00KiB used 0.00B path /dev/sdb1 > > ... > > My purpose is a simple RAID1 main fs, with bootable flag on the 2 > disks in prder to start in degraded mode > How to get out ofr that...? > > Thnaks > PC sdb1 is an extended partition, you cannot format an extended partition. change sdb1 into primary partition or add a logical partition into it. -Alberto
Re: How to erase a RAID1 (+++)?
On Thursday, 30 August 2018 12:01:55 CEST Pierre Couderc wrote: > > On 08/30/2018 11:35 AM, Qu Wenruo wrote: > > > > On 2018/8/30 下午5:13, Pierre Couderc wrote: > >> Trying to install a RAID1 on a debian stretch, I made some mistake and > >> got this, after installing on disk1 and trying to add second disk : > >> > >> > >> root@server:~# fdisk -l > >> Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors > >> Units: sectors of 1 * 512 = 512 bytes > >> Sector size (logical/physical): 512 bytes / 512 bytes > >> I/O size (minimum/optimal): 512 bytes / 512 bytes > >> Disklabel type: dos > >> Disk identifier: 0x2a799300 > >> > >> Device Boot StartEndSectors Size Id Type > >> /dev/sda1 * 2048 3907028991 3907026944 1.8T 83 Linux > >> > >> > >> Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors > >> Units: sectors of 1 * 512 = 512 bytes > >> Sector size (logical/physical): 512 bytes / 512 bytes > >> I/O size (minimum/optimal): 512 bytes / 512 bytes > >> Disklabel type: dos > >> Disk identifier: 0x9770f6fa > >> > >> Device Boot StartEndSectors Size Id Type > >> /dev/sdb1 * 2048 3907029167 3907027120 1.8T 5 Extended > >> > >> > >> And : > >> > >> root@server:~# btrfs fi show > >> Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed > >> Total devices 2 FS bytes used 1.10GiB > >> devid1 size 1.82TiB used 4.02GiB path /dev/sda1 > >> devid2 size 1.00KiB used 0.00B path /dev/sdb1 I think your problem is that sdb1 is a extended partition and not a primary one or you'll need to make a logical partition inside the extended partition and use that. -- Kai Stian Olstad
Re: How to erase a RAID1 (+++)?
On 2018/8/30 下午6:01, Pierre Couderc wrote: > > > On 08/30/2018 11:35 AM, Qu Wenruo wrote: >> >> On 2018/8/30 下午5:13, Pierre Couderc wrote: >>> Trying to install a RAID1 on a debian stretch, I made some mistake and >>> got this, after installing on disk1 and trying to add second disk : >>> >>> >>> root@server:~# fdisk -l >>> Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors >>> Units: sectors of 1 * 512 =12 bytes >>> Sector size (logical/physical): 512 bytes / 512 bytes >>> I/O size (minimum/optimal): 512 bytes / 512 bytes >>> Disklabel type: dos >>> Disk identifier: 0x2a799300 >>> >>> Device Boot Start End Sectors Size Id Type >>> /dev/sda1 * 2048 3907028991 3907026944 1.8T 83 Linux >>> >>> >>> Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors >>> Units: sectors of 1 * 512 =12 bytes >>> Sector size (logical/physical): 512 bytes / 512 bytes >>> I/O size (minimum/optimal): 512 bytes / 512 bytes >>> Disklabel type: dos >>> Disk identifier: 0x9770f6fa >>> >>> Device Boot Start End Sectors Size Id Type >>> /dev/sdb1 * 2048 3907029167 3907027120 1.8T 5 Extended >>> >>> >>> And : >>> >>> root@server:~# btrfs fi show >>> Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed >>> Total devices 2 FS bytes used 1.10GiB >>> devid 1 size 1.82TiB used 4.02GiB path /dev/sda1 >>> devid 2 size 1.00KiB used 0.00B path /dev/sdb1 >>> >>> ... >>> >>> My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks >>> in prder to start in degraded mode >>> How to get out ofr that...? >> The 2nd device is indeed strange. >> >> Considering how old packages Debian tends to deliver, it should be some >> old btrfs-progs. >> >> You could just boot into the system and execute the following commands: >> >> # btrfs device remove 2 > This works fine. >> >> Then add a new real device to the fs >> >> # btrfs device add >> >> > Thnk you, this : > > btrfs device add /dev/sdb1 / > > seems to work but gives me the same : > > root@server:~# btrfs fi show > Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed > Total devices 2 FS bytes used 1.10GiB > devid 1 size 1.82TiB used 4.02GiB path /dev/sda1 > devid 2 size 1.00KiB used 0.00B path /dev/sdb1 > > So I need to "prepare" or format /dev/sdb before adding /dev/sdb1 > I have tried to format /dev/sdb and it works : What's the output of lsblk command? If it's 2T (1.8T) show in lsblk, then in above case you could fix it by resize that device by: # btrfs resize 2:max Thanks, Qu > root@server:~# btrfs fi show > Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed > Total devices 2 FS bytes used 1.10GiB > devid 1 size 1.82TiB used 4.02GiB path /dev/sda1 > devid 2 size 1.82TiB used 0.00B path /dev/sdb > > but I have no partition table and no booot flag for degraded mode > > > signature.asc Description: OpenPGP digital signature
Re: How to erase a RAID1 (+++)?
On 08/30/2018 11:35 AM, Qu Wenruo wrote: On 2018/8/30 下午5:13, Pierre Couderc wrote: Trying to install a RAID1 on a debian stretch, I made some mistake and got this, after installing on disk1 and trying to add second disk : root@server:~# fdisk -l Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: dos Disk identifier: 0x2a799300 Device Boot Start End Sectors Size Id Type /dev/sda1 * 2048 3907028991 3907026944 1.8T 83 Linux Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: dos Disk identifier: 0x9770f6fa Device Boot Start End Sectors Size Id Type /dev/sdb1 * 2048 3907029167 3907027120 1.8T 5 Extended And : root@server:~# btrfs fi show Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed Total devices 2 FS bytes used 1.10GiB devid 1 size 1.82TiB used 4.02GiB path /dev/sda1 devid 2 size 1.00KiB used 0.00B path /dev/sdb1 ... My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks in prder to start in degraded mode How to get out ofr that...? The 2nd device is indeed strange. Considering how old packages Debian tends to deliver, it should be some old btrfs-progs. You could just boot into the system and execute the following commands: # btrfs device remove 2 This works fine. Then add a new real device to the fs # btrfs device add Thnk you, this : btrfs device add /dev/sdb1 / seems to work but gives me the same : root@server:~# btrfs fi show Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed Total devices 2 FS bytes used 1.10GiB devid 1 size 1.82TiB used 4.02GiB path /dev/sda1 devid 2 size 1.00KiB used 0.00B path /dev/sdb1 So I need to "prepare" or format /dev/sdb before adding /dev/sdb1 I have tried to format /dev/sdb and it works : root@server:~# btrfs fi show Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed Total devices 2 FS bytes used 1.10GiB devid1 size 1.82TiB used 4.02GiB path /dev/sda1 devid2 size 1.82TiB used 0.00B path /dev/sdb but I have no partition table and no booot flag for degraded mode
Re: How to erase a RAID1 (+++)?
On 2018/8/30 下午5:13, Pierre Couderc wrote: > Trying to install a RAID1 on a debian stretch, I made some mistake and > got this, after installing on disk1 and trying to add second disk : > > > root@server:~# fdisk -l > Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors > Units: sectors of 1 * 512 = 512 bytes > Sector size (logical/physical): 512 bytes / 512 bytes > I/O size (minimum/optimal): 512 bytes / 512 bytes > Disklabel type: dos > Disk identifier: 0x2a799300 > > Device Boot Start End Sectors Size Id Type > /dev/sda1 * 2048 3907028991 3907026944 1.8T 83 Linux > > > Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors > Units: sectors of 1 * 512 = 512 bytes > Sector size (logical/physical): 512 bytes / 512 bytes > I/O size (minimum/optimal): 512 bytes / 512 bytes > Disklabel type: dos > Disk identifier: 0x9770f6fa > > Device Boot Start End Sectors Size Id Type > /dev/sdb1 * 2048 3907029167 3907027120 1.8T 5 Extended > > > And : > > root@server:~# btrfs fi show > Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed > Total devices 2 FS bytes used 1.10GiB > devid 1 size 1.82TiB used 4.02GiB path /dev/sda1 > devid 2 size 1.00KiB used 0.00B path /dev/sdb1 > > ... > > My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks > in prder to start in degraded mode > How to get out ofr that...? The 2nd device is indeed strange. Considering how old packages Debian tends to deliver, it should be some old btrfs-progs. You could just boot into the system and execute the following commands: # btrfs device remove 2 Then add a new real device to the fs # btrfs device add Then convert the fs to RAID1 # btrfs balance start -dconvert=RAID1 -mconvert=RAID1 Thanks, Qu > > Thnaks > PC signature.asc Description: OpenPGP digital signature
Re: How to erase a RAID1 (+++)?
Trying to install a RAID1 on a debian stretch, I made some mistake and got this, after installing on disk1 and trying to add second disk : root@server:~# fdisk -l Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: dos Disk identifier: 0x2a799300 Device Boot Start End Sectors Size Id Type /dev/sda1 * 2048 3907028991 3907026944 1.8T 83 Linux Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: dos Disk identifier: 0x9770f6fa Device Boot Start End Sectors Size Id Type /dev/sdb1 * 2048 3907029167 3907027120 1.8T 5 Extended And : root@server:~# btrfs fi show Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed Total devices 2 FS bytes used 1.10GiB devid 1 size 1.82TiB used 4.02GiB path /dev/sda1 devid 2 size 1.00KiB used 0.00B path /dev/sdb1 ... My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks in prder to start in degraded mode How to get out ofr that...? Thnaks PC
[PATCH] btrfs-progs: dump-tree: print invalid argument and strerror
Before this patch: $ ls nothingness ls: cannot access 'nothingness': No such file or directory $ btrfs inspect-internal dump-tree nothingness ERROR: not a block device or regular file: nothingness The confusing error message makes users thinks that nonexistent file is existed but in wrong type. This patch let check_arg_type return -errno if realpath failed. And print strerror if check_arg_type failed and the returned code is negative. Like: $ btrfs inspect-internal dump-tree nothingness ERROR: invalid argument: nothingness: No such file or directory Signed-off-by: Su Yue --- cmds-inspect-dump-tree.c | 7 ++- utils.c | 2 ++ 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c index c8acd55a0c3a..f70a1c47d145 100644 --- a/cmds-inspect-dump-tree.c +++ b/cmds-inspect-dump-tree.c @@ -313,7 +313,12 @@ int cmd_inspect_dump_tree(int argc, char **argv) ret = check_arg_type(argv[optind]); if (ret != BTRFS_ARG_BLKDEV && ret != BTRFS_ARG_REG) { - error("not a block device or regular file: %s", argv[optind]); + if (ret < 0) + error("invalid argument: %s: %s", argv[optind], + strerror(-ret)); + else + error("not a block device or regular file: %s", + argv[optind]); goto out; } diff --git a/utils.c b/utils.c index 1e275c668cd5..80eca77b1b20 100644 --- a/utils.c +++ b/utils.c @@ -502,6 +502,8 @@ int check_arg_type(const char *input) return BTRFS_ARG_REG; return BTRFS_ARG_UNKNOWN; + } else { + return -errno; } if (strlen(input) == (BTRFS_UUID_UNPARSED_SIZE - 1) && -- 2.17.1
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote: > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote: > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote: > > > - is documenting rejection on request alignment grounds > > > (i.e. EINVAL) in the man page sufficient for app > > > developers to understand what is going on here? > > > > I think so. The manpage says: "The filesystem does not support > > reflinking the ranges of the given files", which (to my mind) covers > > this case of not supporting dedupe of EOF blocks. > > Older versions of btrfs dedupe (before v4.2 or so) used to do exactly > this; however, on btrfs, not supporting dedupe of EOF blocks means small > files (one extent) cannot be deduped at all, because the EOF block holds > a reference to the entire dst extent. If a dedupe app doesn't go all the > way to EOF on btrfs, then it should not attempt to dedupe any part of the > last extent of the file as the benefit would be zero or slightly negative. That's a filesystem implementation issue, not an API or application issue. > The app developer would need to be aware that such a restriction could > exist on some filesystems, and be able to distinguish this from other > cases that could lead to EINVAL. Portable code would have to try a dedupe > up to EOF, then if that failed, round down and retry, and if that failed > too, the app would have to figure out which filesystem it's running on > to know what to do next. Performance demands the app know what the FS > will do in advance, and avoid a whole class of behavior. Nobody writes "portable" applications like that. They read the man page first, and work out what the common subset of functionality is and then code from that. Man page says: "Disk filesystems generally require the offset and length arguments to be aligned to the fundamental block size." IOWs, code compatible with starts with supporting the general case. i.e. a range rounded to filesystem block boundaries (it's already run fstat() on the files it wants to dedupe to find their size, yes?), hence ignoring the partial EOF block. Will just work on everything. Code that then wants to optimise for btrfs/xfs/ocfs quirks runs fstatvfs to determine what fs it's operating on and applies the necessary quirks. For btrfs it can extend the range to include the partial EOF block, and hence will handle the implementation quirks btrfs has with single extent dedupe. Simple, reliable, and doesn't require any sort of flailing about with offsets and lengths to avoid unexpected EINVAL errors. > btrfs dedupe reports success if the src extent is inline and the same > size as the dst extent (i.e. file is smaller than one page). No dedupe > can occur in such cases--a clone results in a simple copy, so the best > a dedupe could do would be a no-op. Returning EINVAL there would break > a few popular tools like "cp --reflink". Returning OK but doing nothing > seems to be the best option in that case. Again, those are a filesystem implementation issues, not problems with the API itself. > > > - should we just round down the EOF dedupe request to the > > > block before EOF so dedupe still succeeds? > > > > I've often wondered if the interface should (have) be(en) that we start > > at src_off/dst_off and share as many common blocks as possible until we > > find a mismatch, then tell userspace where we stopped... instead of like > > now where we compare the entire extent and fail if any part of it > > doesn't match. > > The usefulness or harmfulness of that approach depends a lot on what > the application expects the filesystem to do. > > In btrfs, the dedupe operation acts on references to data, not the > underlying data blocks. If there are 1000 slightly overlapping references > to a single contiguous range of data blocks in dst on disk, each dedupe > operation acts on only one of those, leaving the other 999 untouched. > If the app then submits 999 other dedupe requests, no references to the > dst blocks remain and the underlying data blocks can be deleted. Assuming your strawman is valid, if you have a thousand separate references across the same set of data blocks on disk, then that data is already heavily deduplicated. Trying to optimise that further seems misguided, way down the curve of diminishing returns. > In a parallel universe (or a better filesystem, or a userspace emulation ^^^ > built out of dedupe and other ioctls), dedupe could work at the extent > data (physical) level. The app points at src and dst extent references > (inode/offset/length tuples), and the filesystem figures out which > physical blocks these point to, then adjusts all the references to the > dst blocks at once, That's XFS dedupe in a nutshell. And we aren't in a parallel universe here. :P > dealing with partial overlaps and snapshots and nodatacow > and whatever other exotic features might be