[PATCH v2] fstests: test regression of -EEXIST on creating new file after log replay
The regression was introduced to btrfs in Linux v4.4: after log replay, creating
a new file fails with -EEXIST. Although the problem exists only on btrfs, the
test contains nothing btrfs-specific, so it is made generic.

The kernel fix is
  Btrfs: fix unexpected -EEXIST when creating new inode

Reviewed-by: Filipe Manana
Signed-off-by: Liu Bo
---
v2:
- Remove failed message from 481.out
- Drop the unnecessary write in creating a file

 tests/generic/481     | 75 +++
 tests/generic/481.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 78 insertions(+)
 create mode 100755 tests/generic/481
 create mode 100644 tests/generic/481.out

diff --git a/tests/generic/481 b/tests/generic/481
new file mode 100755
index 000..6a7e9dd
--- /dev/null
+++ b/tests/generic/481
@@ -0,0 +1,75 @@
+#! /bin/bash
+# FSQA Test No. 481
+#
+# Reproduce a regression of btrfs that leads to -EEXIST on creating new files
+# after log replay.
+#
+# The kernel fix is
+#   Btrfs: fix unexpected -EEXIST when creating new inode
+#
+#---
+#
+# Copyright (C) 2018 Oracle. All Rights Reserved.
+# Author: Bo Liu
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	_cleanup_flakey
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+# create a file and keep it in write ahead log
+$XFS_IO_PROG -f -c "fsync" $SCRATCH_MNT/foo
+
+# fail this filesystem so that remount can replay the write ahead log
+_flakey_drop_and_remount
+
+# see if we can create a new file successfully
+touch $SCRATCH_MNT/bar
+
+_unmount_flakey
+
+echo "Silence is golden"
+
+status=0
+exit
diff --git a/tests/generic/481.out b/tests/generic/481.out
new file mode 100644
index 000..206e116
--- /dev/null
+++ b/tests/generic/481.out
@@ -0,0 +1,2 @@
+QA output created by 481
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index ea2056b..05f60f2 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -483,3 +483,4 @@
 478 auto quick
 479 auto quick metadata
 480 auto quick metadata
+481 auto quick
-- 
2.5.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How to replace a failed drive in btrfs RAID 1 filesystem
Andrei Borzenkov posted on Sat, 10 Mar 2018 13:27:03 +0300 as excerpted:

> And "missing" is not the answer because I obviously may have more than
> one missing device.

"missing" is indeed the answer when using btrfs device remove. See the
btrfs-device manpage, which explains that if there's more than one device
missing, either just the first one described by the metadata will be
removed (if missing is only specified once), or missing can be specified
multiple times.

raid6 with two devices missing is the only normal candidate for that
presently, tho on-list we've seen aborted-add cases where it still worked
as well, because while the metadata listed the new device, it didn't
actually have any data on it when it became apparent it was bad and thus
needed to be removed again.

Note that because btrfs raid1 and raid10 only do two-way mirroring
regardless of the number of devices, and because of the per-chunk (as
opposed to per-device) nature of btrfs raid10, those modes can only
expect successful recovery with a single missing device, altho as
mentioned above we've seen on-list at least one case where an aborted
device-add of a device found to be bad after the add didn't actually have
anything on it, so it could still be removed along with the device it was
originally intended to replace.

Of course the N-way-mirroring mode, whenever it eventually gets
implemented, will allow missing devices up to N-1, and N-way-parity mode,
if it's ever implemented, similar. But N-way-mirroring was scheduled for
after raid56 mode so it could make use of some of the same code, and that
has of course taken years on years to get merged and stabilize, and
there's no sign yet of N-way-mirroring patches, which based on the raid56
case could take years to stabilize and debug after the original merge. So
the still somewhat iffy raid6 mode is likely to remain the only normal
usage of multiple missing for years, yet.

For btrfs replace, the manpage says the ID is the only way to handle
missing, but getting that ID, as you've indicated, could be difficult.
For filesystems with only a few devices that haven't had any or many
device config changes, it should be pretty easy to guess (a two-device
filesystem with no changes should have IDs 1 and 2, so if only one is
listed, the other is obvious, and a 3-4 device fs with only one or two
previous device changes, likely well remembered by the admin, should
still be reasonably easy to guess), but as the number of devices and the
number of device adds/removes/replaces increases, finding/guessing the
missing one becomes far more difficult.

Of course the sysadmin's first rule of backups states in simple form that
not having one == defining the value of the data as trivial, not worth
the trouble of a backup. That in turn means that at some point before
there's /too/ many device change events, it's likely going to be less
trouble (particularly after factoring in reliability) to restore from
backups to a fresh filesystem than it is to do yet another device change.
Together with the current practical limits btrfs imposes on the number of
missing devices, that tends to impose /some/ limit on the possibilities
for missing device IDs, so the situation, while not ideal, isn't yet
/entirely/ out of hand, either, because a successful guess based on
available information should be possible without /too/ many attempts.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
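As a concrete sketch of the two paths discussed above — device names, mount
point, and the guessed devid are all hypothetical here; only the btrfs
subcommands themselves are real:

```shell
# Path 1: btrfs device remove. "missing" names the absent device directly
# (add a replacement first so the raid profile stays satisfiable).
btrfs device add /dev/sdc /mnt
btrfs device remove missing /mnt

# Path 2: btrfs replace, which addresses the dead drive by numeric devid.
# "btrfs filesystem show" lists the devids still present, so the missing
# one is whichever ID is absent from that list (the guessing Duncan
# describes above).
btrfs filesystem show /mnt
btrfs replace start 2 /dev/sdc /mnt   # 2 = guessed devid of the dead disk
btrfs replace status /mnt
```

These are administrative commands against a real degraded filesystem, so
they are shown as an illustrative fragment only.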
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Sat, Mar 10, 2018 at 6:51 PM, Linus Torvalds wrote:
>
> So in *historical* context - when a compiler didn't do variable length
> arrays at all - the original semantics of C "constant expressions"
> actually make a ton of sense.
>
> You can basically think of a constant expression as something that can
> be (historically) handled by the front-end without any real
> complexity, and no optimization phases - just evaluating a simple
> parse tree with purely compile-time constants.
>
> So there's a good and perfectly valid reason for why C limits certain
> expressions to just a very particular subset. It's not just array
> sizes, it's case statements etc too. And those are very much part of
> the C standard.
>
> So an error message like
>
>    warning: ISO C90 requires array sizes to be constant-expressions
>
> would be technically correct and useful from a portability angle. It
> tells you when you're doing something non-portable, and should be
> automatically enabled with "-ansi -pedantic", for example.
>
> So what's misleading is actually the name of the warning and the
> message, not that it happens. The warning isn't about "variable
> length", it's literally about the rules for what a
> "constant-expression" is.
>
> And in C, the expression (2,3) has a constant _value_ (namely 3), but
> it isn't a constant-expression as specified by the language.
>
> Now, the thing is that once you actually do variable length arrays,
> those old front-end rules make no sense any more (outside of the "give
> portability warnings" thing).
>
> Because once you do variable length arrays, you obviously _parse_
> everything just fine, and you're going to evaluate much more complex
> expressions than some limited constant-expression rule.
>
> And at that point it would make a whole lot more sense to add a *new*
> warning that basically says "I have to generate code for a
> variable-sized array", if you actually talk about VLA's.
>
> But that's clearly not what gcc actually did.
>
> So the problem really is that -Wvla doesn't actually warn about VLA's,
> but about something technically completely different.

I *think* I followed your reasoning. For gcc, -Wvla is the "I have to
generate code for a variable-sized array" one; but in this case, the
array size is the actual issue that you would have liked to be warned
about, since people writing:

  int a[(2,3)];

did not really mean to declare a VLA. Therefore, you say warning them
with "warning: ISO C90 requires array sizes to be constant-expressions"
(let's call it -Wpedantic-array-sizes) would be more helpful here,
instead of saying stuff about VLAs.

In my case, I was just expecting gcc to give us both warnings and that's
it, instead of trying to be smart and giving only the
-Wpedantic-array-sizes one (which is the one I was wondering about in my
previous email, i.e. why it was missing). I think it would be clear
enough if both warnings were shown at the same time. And it makes sense,
since if you write that line in ISO C90 mode it means there really are 2
things going wrong in the end (fishy syntax while in ISO C90 mode and,
due to that, VLA code generated as well), no?

Thanks for taking the time to write about the historical context, by the
way!

Miguel

> And that's why those stupid syntactic issues with min/max matter. It's
> not whether the end result is a compile-time constant or not, it's
> about completely different issues, like whether there is a
> comma-expression in it.
>
>                Linus
Re: zerofree btrfs support?
On Sat, 2018-03-10 at 23:31 +0500, Roman Mamedov wrote:
> QCOW2 would add a second layer of COW on top of
> Btrfs, which sounds like a nightmare.

I've just seen there is even a nocow option "specifically" for btrfs...
it seems however that it doesn't disable the CoW of qcow, but rather
that of btrfs... (thus silently also the checksumming).

Does plain qcow2 really CoW on every write? I've always assumed it would
only CoW when one makes snapshots or so...

Cheers,
Chris.
Re: zerofree btrfs support?
On Sat, 10 Mar 2018 16:50:22 +0100 Adam Borowski wrote:
> Since we're on a btrfs mailing list, if you use qemu, you really want
> sparse format:raw instead of qcow2 or preallocated raw. This also works
> great with TRIM.

Agreed, that's why I use RAW. QCOW2 would add a second layer of COW on
top of Btrfs, which sounds like a nightmare. Even if you would run those
files as NOCOW in Btrfs, somehow I feel FS-native COW is more efficient
than emulating it in userspace with special-format files.

> > It works, just not with some of the QEMU virtualized disk device drivers.
> > You don't need to use qemu-img to manually dig holes either, it's all
> > automatic.
>
> It works only with scsi and virtio-scsi drivers. Most qemu setups use
> either ide (ouch!) or virtio-blk.

It works with IDE as well.

-- 
With respect,
Roman
Re: How to change/fix 'Received UUID'
Thanks all for the help again. I just wrote a blog post to explain the
process to others should anyone need this later.
http://marc.merlins.org/perso/btrfs/post_2018-03-09_Btrfs-Tips_-Rescuing-A-Btrfs-Send-Receive-Relationship.html

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Sat, Mar 10, 2018 at 9:34 AM, Miguel Ojeda wrote:
>
> So the warning is probably implemented to just trigger whenever VLAs
> are used but the given standard does not allow them, for all
> languages. The problem is why the ISO C90 frontend is not giving an
> error for using invalid syntax for array sizes to begin with?

So in *historical* context - when a compiler didn't do variable length
arrays at all - the original semantics of C "constant expressions"
actually make a ton of sense.

You can basically think of a constant expression as something that can
be (historically) handled by the front-end without any real complexity,
and no optimization phases - just evaluating a simple parse tree with
purely compile-time constants.

So there's a good and perfectly valid reason for why C limits certain
expressions to just a very particular subset. It's not just array sizes,
it's case statements etc too. And those are very much part of the C
standard.

So an error message like

   warning: ISO C90 requires array sizes to be constant-expressions

would be technically correct and useful from a portability angle. It
tells you when you're doing something non-portable, and should be
automatically enabled with "-ansi -pedantic", for example.

So what's misleading is actually the name of the warning and the
message, not that it happens. The warning isn't about "variable length",
it's literally about the rules for what a "constant-expression" is.

And in C, the expression (2,3) has a constant _value_ (namely 3), but it
isn't a constant-expression as specified by the language.

Now, the thing is that once you actually do variable length arrays,
those old front-end rules make no sense any more (outside of the "give
portability warnings" thing).

Because once you do variable length arrays, you obviously _parse_
everything just fine, and you're going to evaluate much more complex
expressions than some limited constant-expression rule.

And at that point it would make a whole lot more sense to add a *new*
warning that basically says "I have to generate code for a
variable-sized array", if you actually talk about VLA's.

But that's clearly not what gcc actually did.

So the problem really is that -Wvla doesn't actually warn about VLA's,
but about something technically completely different.

And that's why those stupid syntactic issues with min/max matter. It's
not whether the end result is a compile-time constant or not, it's about
completely different issues, like whether there is a comma-expression in
it.

                Linus
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Sat, Mar 10, 2018 at 5:30 PM, Linus Torvalds wrote:
> On Sat, Mar 10, 2018 at 7:33 AM, Kees Cook wrote:
>>
>> Alright, I'm giving up on fixing max(). I'll go back to STACK_MAX() or
>> some other name for the simple macro. Bleh.
>
> Oh, and I'm starting to see the real problem.
>
> It's not that our current "min/max()" are broken. It's that "-Wvla" is
> garbage.
>
> Lookie here:
>
>         int array[(1,2)];
>
> results in gcc saying
>
>         warning: ISO C90 forbids variable length array ‘array’ [-Wvla]
>           int array[(1,2)];
>           ^~~
>
> and that error message - and the name of the flag - is obviously pure
> garbage.
>
> What is *actually* going on is that ISO C90 requires an array size to
> be not a constant value, but a constant *expression*. Those are two
> different things.
>
> A constant expression has little to do with "compile-time constant".
> It's a more restricted form of it, and has actual syntax requirements.
> A comma expression is not a constant expression, for example, which
> was why I tested this.
>
> So "-Wvla" is garbage, with a misleading name, and a misleading
> warning string. It has nothing to do with "variable length" and
> whether the compiler can figure it out at build time, and everything
> to do with a _syntax_ rule.

The warning string is basically the same as the one used for C++, i.e.:

  int size2 = 2;
  constexpr int size3 = 2;

  int array1[(2,2)];
  int array2[(size2, size2)];
  int array3[(size3, size3)];

only warns for array2 with:

  warning: ISO C++ forbids variable length array 'array2' [-Wvla]
    int array2[(size2, size2)];

So the warning is probably implemented to just trigger whenever VLAs are
used but the given standard does not allow them, for all languages. The
problem is why the ISO C90 frontend is not giving an error for using
invalid syntax for array sizes to begin with?

Miguel
Re: zerofree btrfs support?
On Sat, 2018-03-10 at 16:50 +0100, Adam Borowski wrote:
> Since we're on a btrfs mailing list

Well... my original question was whether someone could add zerofree
support for btrfs (which I think would best be done by someone who knows
how btrfs really works)... thus I directed the question to this list and
not to some qemu list :-)

> It works only with scsi and virtio-scsi drivers. Most qemu setups use
> either ide (ouch!) or virtio-blk.

Seems my libvirt-created VMs use "sata" per default... and it does seem
to work with that either in the meantime.

Thanks :-)
Chris.
Re: zerofree btrfs support?
On Sat, 2018-03-10 at 19:37 +0500, Roman Mamedov wrote:
> Note you can use it on HDDs too, even without QEMU and the like: via
> using LVM "thin" volumes. I use that on a number of machines, the
> benefit is that since TRIMed areas are "stored nowhere", those
> partitions allow for incredibly fast block-level backups, as it doesn't
> have to physically read in all the free space, let alone any stale data
> in there. LVM snapshots are also way more efficient with thin volumes,
> which helps during backup.

I was thinking about using those... but then I'd have to use loop device
files I guess... which I also want to avoid.

> > dm-crypt per default blocks discard.
>
> Out of misguided paranoia. If your crypto is any good (and last I
> checked AES was good enough), there's really not a lot to gain for the
> "attacker" knowing which areas of the disk are used and which are not.

I'm not an expert here... but a) I think it would be independent of AES
and rather the encryption mode (e.g. XTS) which protects here or not...
and b) we've seen too many attacks on crypto based on smart statistics,
and knowing which blocks on a medium are actually data or just "random
crypto noise" (which you know when using TRIM) can already tell a lot.
At least it could tell an attacker how much data there is on a fs.

> It works, just not with some of the QEMU virtualized disk device
> drivers. You don't need to use qemu-img to manually dig holes either,
> it's all automatic.

You're right... it seems that in older versions one needed to set
virtio-scsi as the device driver (which I possibly missed), but nowadays
it even seems to work with sata.

> QEMU deallocates parts of its raw images for those areas which have
> been TRIM'ed by the guest. In fact I never use qcow2, always raw images
> only. Yet, boot a guest, issue fstrim, and see the raw file, while
> still having the same size, show much lower actual disk usage in "du".

Works with qcow2 as well... heck, even Windows can do it (though it has
no fstrim; it seems one needs to run defrag instead, which apparently
performs, in addition to defragmentation, what fstrim does).

Fine for me... though non-qemu users may still be interested in having
zerofree.

Cheers,
Chris.
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Sat, Mar 10, 2018 at 7:33 AM, Kees Cook wrote:
>
> Alright, I'm giving up on fixing max(). I'll go back to STACK_MAX() or
> some other name for the simple macro. Bleh.

Oh, and I'm starting to see the real problem.

It's not that our current "min/max()" are broken. It's that "-Wvla" is
garbage.

Lookie here:

        int array[(1,2)];

results in gcc saying

        warning: ISO C90 forbids variable length array ‘array’ [-Wvla]
          int array[(1,2)];
          ^~~

and that error message - and the name of the flag - is obviously pure
garbage.

What is *actually* going on is that ISO C90 requires an array size to be
not a constant value, but a constant *expression*. Those are two
different things.

A constant expression has little to do with "compile-time constant".
It's a more restricted form of it, and has actual syntax requirements. A
comma expression is not a constant expression, for example, which was
why I tested this.

So "-Wvla" is garbage, with a misleading name, and a misleading warning
string. It has nothing to do with "variable length" and whether the
compiler can figure it out at build time, and everything to do with a
_syntax_ rule.

                Linus
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Fri, Mar 9, 2018 at 11:03 PM, Miguel Ojeda wrote:
>
> Just compiled 4.9.0 and it seems to work -- so that would be the
> minimum required.
>
> Sigh...
>
> Some enterprise distros are either already shipping gcc >= 5 or will
> probably be shipping it soon (e.g. RHEL 8), so how much does it hurt
> to ask for a newer gcc? Are there many users/companies out there using
> enterprise distributions' gcc to compile and run the very latest
> kernels?

I wouldn't mind upping the compiler requirements, and we have other
reasons to go to 4.6. But _this_ particular issue doesn't seem worth it
to then go even further.

Annoying.

                Linus
Re: zerofree btrfs support?
On Sat, Mar 10, 2018 at 07:37:22PM +0500, Roman Mamedov wrote:
> Note you can use it on HDDs too, even without QEMU and the like: via
> using LVM "thin" volumes. I use that on a number of machines, the
> benefit is that since TRIMed areas are "stored nowhere", those
> partitions allow for incredibly fast block-level backups, as it doesn't
> have to physically read in all the free space, let alone any stale data
> in there. LVM snapshots are also way more efficient with thin volumes,
> which helps during backup.

Since we're on a btrfs mailing list, if you use qemu, you really want
sparse format:raw instead of qcow2 or preallocated raw. This also works
great with TRIM.

> > Back then it didn't seem to work.
>
> It works, just not with some of the QEMU virtualized disk device
> drivers. You don't need to use qemu-img to manually dig holes either,
> it's all automatic.

It works only with scsi and virtio-scsi drivers. Most qemu setups use
either ide (ouch!) or virtio-blk. You'd obviously want virtio-scsi; note
that defconfig enables virtio-blk but not virtio-scsi; I assume most
distribution kernels have both. It's a bit tedious to switch between the
two as -blk is visible as /dev/vda while -scsi as /dev/sda.

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.
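The hole-punching that qemu does automatically against a sparse raw image can
be demonstrated with plain coreutils/util-linux tools — a minimal sketch on
any filesystem supporting `fallocate --punch-hole` (ext4, xfs, btrfs);
`disk.img` is just a scratch file standing in for a raw guest image:

```shell
# Create a 100 MiB sparse file, like a sparse raw disk image.
truncate -s 100M disk.img

# Apparent size is 100M, but almost no blocks are allocated yet:
ls -l disk.img
du -h disk.img

# Write 10 MiB of real data into it (fsync so blocks are truly allocated).
dd if=/dev/urandom of=disk.img bs=1M count=10 conv=notrunc,fsync

# Punch the data back out -- this is what the host does to a raw image
# when the guest TRIMs those sectors: the file's apparent size stays the
# same, but the allocated block count ("du") drops again.
fallocate --punch-hole --offset 0 --length 10M disk.img
du -h disk.img

rm -f disk.img
```

This is the mechanism behind "the raw file, while still having the same size,
shows much lower actual disk usage in du" after an in-guest fstrim.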
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Fri, Mar 9, 2018 at 10:10 PM, Miguel Ojeda wrote:
> On Sat, Mar 10, 2018 at 4:11 AM, Randy Dunlap wrote:
>> On 03/09/2018 04:07 PM, Andrew Morton wrote:
>>> On Fri, 9 Mar 2018 12:05:36 -0800 Kees Cook wrote:
>>>
>>>> When max() is used in stack array size calculations from literal
>>>> values (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the
>>>> compiler thinks this is a dynamic calculation due to the single-eval
>>>> logic, which is not needed in the literal case. This change removes
>>>> several accidental stack VLAs from an x86 allmodconfig build:
>>>>
>>>> $ diff -u before.txt after.txt | grep ^-
>>>> -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids variable length array ‘ids’ [-Wvla]
>>>> -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length array ‘namebuf’ [-Wvla]
>>>> -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ [-Wvla]
>>>> -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ [-Wvla]
>>>> -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ [-Wvla]
>>>> -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array ‘buff64’ [-Wvla]
>>>>
>>>> Based on an earlier patch from Josh Poimboeuf.
>>>
>>> v1, v2 and v3 of this patch all fail with gcc-4.4.4:
>>>
>>> ./include/linux/jiffies.h: In function 'jiffies_delta_to_clock_t':
>>> ./include/linux/jiffies.h:444: error: first argument to
>>> '__builtin_choose_expr' not a constant
>>
>> I'm seeing that problem with
>>> gcc --version
>> gcc (SUSE Linux) 4.8.5
>
> Same here, 4.8.5 fails. gcc 5.4.1 seems to work. I compiled a minimal
> 5.1.0 and it seems to work as well.

And sparse freaks out too:

   drivers/net/ethernet/via/via-velocity.c:97:26: sparse: incorrect
type in initializer (different address spaces) @@expected void
*addr @@got struct mac_regs [noderef] *mac_regs
   drivers/net/ethernet/via/via-velocity.c:100:49: sparse: incorrect
type in argument 2 (different base types) @@expected restricted
pci_power_t [usertype] state @@got _t [usertype] state @@

Alright, I'm giving up on fixing max(). I'll go back to STACK_MAX() or
some other name for the simple macro. Bleh.

-Kees

-- 
Kees Cook
Pixel Security
Re: Change of Ownership of the filesystem content when cloning a volume
I am 100% sure Netapp FlexClone can change the ownership of the clone
content. We are using that functionality right now.
https://docs.netapp.com/ontap-9/index.jsp?topic=%2Fcom.netapp.doc.dot-cm-cmpr-900%2Fvolume__clone__create.html

When you create a clone in FlexClone, you can specify a uid/gid. When
the clone is created, all the files/directories/content in that clone is
instantly owned by that uid/gid.

What I have been looking for is a similar functionality in Btrfs or ZFS.
I would like to put a disk image on NFS, and mount it as a disk on my
machine. And I want such snapshot and cloning (change ownership)
capabilities in that disk image. So I was considering BTRFS or ZFS, and
was wondering if they might have that feature.

Thanks
Sarvi
-
Occam's Razor Rules

On 3/9/18, 9:48 PM, "Andrei Borzenkov" wrote:

    On 10.03.2018 02:13, Saravanan Shanmugham (sarvi) wrote:
    >
    > Netapp's storage system has the concept of snapshots/clones.
    > And when I create a clone from a snapshot, I can give/change
    > ownership of the entire tree in the volume to a different userid.

    You are probably mistaken. NetApp FlexClone (which you probably mean)
    does not have any way to change volume content. Of course you can now
    mount this clone and do whatever you like from the host, but that is
    completely unrelated to NetApp itself and can just as well be done
    using a btrfs subvolume.

    >
    > Is something like that possible in BTRFS?
    >
    > We are looking to use CopyOnWrite to snapshot nightly build
    > workspaces and clone them as developer workspaces, to avoid
    > building from scratch for developers, and move directly to
    > incremental builds.
    > For this we would like the cloned workspace/volume to be instantly
    > owned by the developer cloning the workspace.
    >
    > Thanks,
    > Sarvi
    > -
    > Occam's Razor Rules
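The closest btrfs equivalent of the workflow described above would be a
snapshot followed by a recursive chown — a sketch only, with hypothetical
paths and user; note that btrfs offers no uid/gid option at snapshot time, so
the ownership change walks the whole tree rather than being an instant
metadata operation like FlexClone's:

```shell
# Hypothetical layout: btrfs mounted at /mnt/build, nightly build kept
# in its own subvolume. The snapshot itself is instant (CoW metadata).
btrfs subvolume snapshot /mnt/build/nightly /mnt/build/ws-alice

# Ownership change is a separate, non-instant step (requires root).
chown -R alice:alice /mnt/build/ws-alice
```

For large build trees the `chown -R` pass is the part that FlexClone's
uid/gid-at-clone-time feature avoids.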
Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs
On 10.3.2018 at 13:13, Nikolay Borisov wrote:
>>>> And then report back on the output of the extra debug statements.
>>>> Your global rsv is essentially unused, this means in the worst case
>>>> the code should fallback to using the global rsv for satisfying the
>>>> memory allocation for delayed refs. So we should figure out why this
>>>> isn't happening.
>>> Patch applied. Thank you very much, Nikolay. I'll let you know as
>>> soon as we hit ENOSPC again.
>> There is the output:
>>
>> [24672.573075] BTRFS info (device sdb): space_info 4 has
>> 18446744072971649024 free, is not full
>> [24672.573077] BTRFS info (device sdb): space_info total=308163903488,
>> used=304593289216, pinned=2321940480, reserved=174800896,
>> may_use=1811644416, readonly=131072
>> [24672.573079] use_block_rsv: Not using global blockrsv! Current
>> blockrsv->type = 1 blockrsv->space_info = 999a57db7000
>> global_rsv->space_info = 999a57db7000
>> [24672.573083] BTRFS: Transaction aborted (error -28)
> Bummer, so you are indeed running out of global space reservations in
> context which can't really use any other reservation type, thus the
> ENOSPC. Was the stacktrace again during processing of running delayed
> refs?

Yes, the stacktrace is below.

[24672.573132] WARNING: CPU: 3 PID: 808 at fs/btrfs/extent-tree.c:3089 btrfs_run_delayed_refs+0x259/0x270 [btrfs]
[24672.573132] Modules linked in: binfmt_misc xt_comment xt_tcpudp iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_mangle ip6table_raw ip6_tables iptable_mangle intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_pcm aes_x86_64 snd_timer crypto_simd glue_helper snd cryptd soundcore iTCO_wdt intel_cstate joydev iTCO_vendor_support pcspkr dcdbas intel_uncore sg serio_raw evdev lpc_ich mgag200 ttm drm_kms_helper drm i2c_algo_bit shpchp mfd_core i7core_edac ipmi_si ipmi_devintf acpi_power_meter ipmi_msghandler button acpi_cpufreq ip_tables x_tables autofs4 xfs libcrc32c crc32c_generic btrfs xor zstd_decompress zstd_compress
[24672.573161]  xxhash hid_generic usbhid hid raid6_pq sd_mod crc32c_intel psmouse uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod bnx2
[24672.573170] CPU: 3 PID: 808 Comm: btrfs-transacti Tainted: G        W I     4.14.23-znr8+ #73
[24672.573171] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.6.3 02/01/2011
[24672.573172] task: 999a23229140 task.stack: a85642094000
[24672.573186] RIP: 0010:btrfs_run_delayed_refs+0x259/0x270 [btrfs]
[24672.573187] RSP: 0018:a85642097de0 EFLAGS: 00010282
[24672.573188] RAX: 0026 RBX: 99975c75c3c0 RCX: 0006
[24672.573189] RDX: RSI: 0082 RDI: 999a6fcd66f0
[24672.573190] RBP: 95c24d68 R08: 0001 R09: 0479
[24672.573190] R10: 99974b1960e0 R11: 0479 R12: 999a5a65
[24672.573191] R13: 999a5a6511f0 R14: R15:
[24672.573192] FS: () GS:999a6fcc() knlGS:
[24672.573193] CS: 0010 DS: ES: CR0: 80050033
[24672.573194] CR2: 558bfd56dfd0 CR3: 00030a60a005 CR4: 000206e0
[24672.573195] Call Trace:
[24672.573215]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[24672.573231]  ? start_transaction+0x89/0x410 [btrfs]
[24672.573246]  transaction_kthread+0x195/0x1b0 [btrfs]
[24672.573249]  kthread+0xfc/0x130
[24672.573265]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[24672.573266]  ? kthread_create_on_node+0x70/0x70
[24672.573269]  ret_from_fork+0x35/0x40
[24672.573270] Code: c7 c6 20 e8 37 c0 48 89 df 44 89 04 24 e8 59 bc 09 00 44 8b 04 24 eb 86 44 89 c6 48 c7 c7 30 58 38 c0 44 89 04 24 e8 82 30 3f cf <0f> 0b 44 8b 04 24 eb c4 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00
[24672.573292] ---[ end trace b17d927a946cb02e ]---
Re: zerofree btrfs support?
On Sat, 10 Mar 2018 15:19:05 +0100 Christoph Anton Mitterer wrote:
> TRIM/discard... not sure how far this is really a solution.

It is the solution in a great many usage scenarios; I don't know enough about your particular one, though. Note you can use it on HDDs too, even without QEMU and the like: via LVM "thin" volumes. I use that on a number of machines. The benefit is that since TRIMed areas are "stored nowhere", those partitions allow for incredibly fast block-level backups, as the backup doesn't have to physically read in all the free space, let alone any stale data in there. LVM snapshots are also way more efficient with thin volumes, which helps during backup.

> dm-crypt per default blocks discard.

Out of misguided paranoia. If your crypto is any good (and last I checked, AES was good enough), there's really not a lot for an "attacker" to gain from knowing which areas of the disk are used and which are not.

> Some longer time ago I had a look at whether qemu would support that on
> its own, ... i.e. the guest and its btrfs would normally use discard,
> but the image file below would mark the block as discarded and later on
> one can use some qemu-img command to dig holes into exactly those
> locations.
> Back then it didn't seem to work.

It works, just not with some of the QEMU virtualized disk device drivers. You don't need to use qemu-img to manually dig holes either; it's all automatic.

> But even if it would in the meantime, a proper zerofree implementation
> would be beneficial for all non-qemu/qcow2 users (e.g. if one uses raw
> images in qemu, the whole thing couldn't work but with really zeroing
> the blocks inside the guest).

QEMU deallocates those parts of its raw images which have been TRIMmed by the guest. In fact I never use qcow2, always raw images only. Yet: boot a guest, issue fstrim, and watch the raw file, while keeping the same apparent size, show much lower actual disk usage in "du".
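The "incredibly fast block-level backups" point can be made concrete: a hole-aware copier asks the kernel where the data actually is (SEEK_DATA/SEEK_HOLE) and never reads the TRIMed/unallocated ranges. A minimal Python sketch, assuming Linux lseek semantics; the helper name and paths are mine, not from any existing tool:

```python
import os

def copy_sparse(src_path, dst_path):
    """Copy only the allocated ranges of src, recreating the holes in dst."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = os.fstat(src.fileno()).st_size
        offset = 0
        while offset < size:
            try:
                # Jump straight to the next allocated range.
                data_start = os.lseek(src.fileno(), offset, os.SEEK_DATA)
            except OSError:
                break  # ENXIO: no more data, the rest of the file is a hole
            hole_start = os.lseek(src.fileno(), data_start, os.SEEK_HOLE)
            src.seek(data_start)
            chunk = src.read(hole_start - data_start)
            dst.seek(data_start)  # seeking past EOF leaves a hole behind
            dst.write(chunk)
            offset = hole_start
        dst.truncate(size)  # preserve any trailing hole / the file size
```

On a thin volume or fstrim'ed raw image most of the file is holes, so the copy touches only the allocated extents; on a filesystem that doesn't track holes, SEEK_DATA simply reports everything as data and this degrades to a plain copy.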
--
With respect,
Roman
Re: Ongoing Btrfs stability issues
On Sat, 2018-03-10 at 14:04 +0200, Nikolay Borisov wrote:
> So for OLTP workloads you definitely want nodatacow enabled, bear in
> mind this also disables crc checksumming, but your db engine should
> already have such functionality implemented in it.

Unlike repeated claims made here on the list and in other places... I wouldn't know of *any* DB system which actually does this by default, or in a way that would be comparable to filesystem-level checksumming.

Look back in the archives... when I asked several times for checksumming support *with* nodatacow, I evaluated the existing status for the big ones (postgres, mysql, sqlite, bdb)... and all of them had this either not enabled by default, not at all, or requiring special support from the program using the DB.

Similarly, btw: not a single VM image type I evaluated back then had any form of checksumming integrated.

Still one of the major deficiencies of btrfs, unfortunately (not in comparison to other filesystems, but in comparison to how it should be) :-(

Cheers,
Chris.
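For comparison, the kind of per-page checksumming a filesystem does (and that nodatacow turns off) looks roughly like this when an application layers it in itself. A hedged sketch of the idea only, not the on-disk format of any real DB engine; the page size and record layout are invented for illustration:

```python
import zlib

PAGE_SIZE = 8192  # postgres-style page size, chosen only for illustration

def add_checksum(page: bytes) -> bytes:
    """Prefix a page with a CRC32 of its contents before it hits disk."""
    crc = zlib.crc32(page)
    return crc.to_bytes(4, "little") + page

def verify_page(stored: bytes) -> bytes:
    """Return the payload if the stored CRC matches, else raise."""
    crc = int.from_bytes(stored[:4], "little")
    page = stored[4:]
    if zlib.crc32(page) != crc:
        raise ValueError("page checksum mismatch: torn or corrupted write")
    return page
```

The catch Christoph is pointing at: unless the engine does this for every page by default, nodatacow leaves silent corruption undetected, whereas datacow+crc would have caught it at read time.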
Re: zerofree btrfs support?
On Sat, 2018-03-10 at 09:16 +0100, Adam Borowski wrote:
> Do you want zerofree for thin storage optimization, or for security?

I don't think one can really use it for security (neither on SSD nor HDD). On both, zeroed blocks may still be readable by forensic measures. So optimisation, i.e. digging holes in VM image files and making them sparse.

> For the former, you can use fstrim; this is enough on any modern SSD;
> on HDD you can rig the block device to simulate TRIM by writing zeroes.
> I'm sure one of dm-* can do this, if not -- should be easy to add; there's
> also qemu-nbd which allows control over discard, but incurs a performance
> penalty compared to playing with the block layer.

Writing zeros is of course possible... but rather ugly... one really needs to write *everything*, while a smart tool could just zero those block groups that have been used (while everything else is still zero from the original image file).

TRIM/discard... not sure how far this is really a solution. The first thing that comes to my mind is that *if* the discard were to propagate down below a dm-crypt layer (e.g. in my case there is: SSD->partitions->dmcrypt->LUKS->btrfs->image-files-I-want-to-zero), it has effects on security, which is why dm-crypt per default blocks discard.

Some longer time ago I had a look at whether qemu would support that on its own, ... i.e. the guest and its btrfs would normally use discard, but the image file below would mark the block as discarded, and later on one can use some qemu-img command to dig holes into exactly those locations. Back then it didn't seem to work.

But even if it works in the meantime, a proper zerofree implementation would be beneficial for all non-qemu/qcow2 users (e.g. if one uses raw images in qemu, the whole thing couldn't work except by really zeroing the blocks inside the guest).

Cheers,
Chris.
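The "smart tool" being asked for — deallocate only the blocks that are zero rather than rewriting everything — is essentially hole punching. A rough Python sketch using fallocate(2) via ctypes; it assumes Linux, and FALLOC_FL_PUNCH_HOLE needs a supporting filesystem (ext4, xfs, btrfs), so blocks the filesystem refuses to punch are simply left as the literal zeros they already are:

```python
import ctypes
import ctypes.util
import os

FALLOC_FL_KEEP_SIZE = 0x01   # values from <linux/falloc.h>
FALLOC_FL_PUNCH_HOLE = 0x02
BLOCK = 4096

_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]

def punch_zero_blocks(path):
    """Deallocate every all-zero block of the file; return count punched.

    Best effort: a trailing partial block and blocks the filesystem
    refuses to punch are left in place.
    """
    punched = 0
    zero = b"\x00" * BLOCK
    with open(path, "rb+") as f:
        fd = f.fileno()
        size = os.fstat(fd).st_size
        for offset in range(0, size - size % BLOCK, BLOCK):
            if os.pread(fd, BLOCK, offset) == zero:
                ret = _libc.fallocate(
                    fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                    offset, BLOCK)
                if ret == 0:
                    punched += 1
    return punched
```

A zerofree for btrfs would of course work per block group from the allocator's point of view rather than per file block; this only illustrates the "dig holes where the zeros are" half of the job.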
[PATCH v2] btrfs-progs: dump-tree: add degraded option
The btrfs inspect dump-tree cli picks the disk with the largest generation to read the root tree, even when not all the devices are provided on the cli. But in a 2-disk RAID1 you may need to know what's in each disk individually, so this option -x | --noscan indicates to use only the given disk for the dump.

Signed-off-by: Anand Jain
---
v1->v2: rename --degraded to --noscan

 cmds-inspect-dump-tree.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index df44bb635c9c..d2676ce55af7 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -198,6 +198,7 @@ const char * const cmd_inspect_dump_tree_usage[] = {
 	"-u|--uuid		print only the uuid tree",
 	"-b|--block		print info from the specified block only",
 	"-t|--tree		print only tree with the given id (string or number)",
+	"-x|--noscan		use the disk in the arg, do not scan for the disks (for raid1)",
 	NULL
 };
@@ -234,10 +235,11 @@ int cmd_inspect_dump_tree(int argc, char **argv)
 			{ "uuid", no_argument, NULL, 'u'},
 			{ "block", required_argument, NULL, 'b'},
 			{ "tree", required_argument, NULL, 't'},
+			{ "noscan", no_argument, NULL, 'x'},
 			{ NULL, 0, NULL, 0 }
 		};

-		c = getopt_long(argc, argv, "deb:rRut:", long_options, NULL);
+		c = getopt_long(argc, argv, "deb:rRut:x", long_options, NULL);
 		if (c < 0)
 			break;
 		switch (c) {
@@ -286,6 +288,9 @@ int cmd_inspect_dump_tree(int argc, char **argv)
 			}
 			break;
 		}
+		case 'x':
+			open_ctree_flags |= OPEN_CTREE_NO_DEVICES;
+			break;
 		default:
 			usage(cmd_inspect_dump_tree_usage);
 		}
--
2.15.0
Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs
>>> And then report back on the output of the extra debug statements.
>>>
>>> Your global rsv is essentially unused, this means
>>> in the worst case the code should fall back to using the global rsv
>>> for satisfying the memory allocation for delayed refs. So we should
>>> figure out why this isn't happening.
>> Patch applied. Thank you very much, Nikolay. I'll let you know as soon as we
>> hit ENOSPC again.
>
> There is the output:
>
> [24672.573075] BTRFS info (device sdb): space_info 4 has 18446744072971649024
> free, is not full
> [24672.573077] BTRFS info (device sdb): space_info total=308163903488,
> used=304593289216, pinned=2321940480, reserved=174800896, may_use=1811644416,
> readonly=131072
> [24672.573079] use_block_rsv: Not using global blockrsv! Current
> blockrsv->type = 1 blockrsv->space_info = 999a57db7000
> global_rsv->space_info = 999a57db7000
> [24672.573083] BTRFS: Transaction aborted (error -28)

Bummer, so you are indeed running out of global space reservations in a
context which can't really use any other reservation type, thus the
ENOSPC. Was the stacktrace again during processing of running delayed refs?
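As a side note, the absurd "free" value in the quoted dump is a u64 underflow: the reservation counters overcommit the space_info, and the negative remainder wraps around when printed as unsigned. The arithmetic checks out exactly from the quoted counters:

```python
# Counters copied verbatim from the quoted space_info dump (device sdb).
U64 = 2 ** 64
total    = 308163903488
used     = 304593289216
pinned   = 2321940480
reserved = 174800896
may_use  = 1811644416
readonly = 131072

free = total - (used + pinned + reserved + may_use + readonly)
print(free)        # -737902592: roughly 703 MiB overcommitted
print(free % U64)  # 18446744072971649024: what the unsigned printk shows
```

So "has 18446744072971649024 free" really means the space_info is about 703 MiB in the red, which lines up with the ENOSPC in a context that can only draw on the global reservation.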
Re: Ongoing Btrfs stability issues
On 9.03.2018 21:05, Alex Adriaanse wrote:
> Am I correct to understand that nodatacow doesn't really avoid CoW when
> you're using snapshots? In a filesystem that's snapshotted

Yes, so nodatacow won't interfere with how snapshots operate. For more information on that topic check the following mailing list thread: https://www.spinics.net/lists/linux-btrfs/msg62715.html

> every 15 minutes, is there a difference between normal CoW and nodatacow
> when (in the case of Postgres) you update a small portion of a 1GB file
> many times per minute? Do you anticipate us seeing a benefit in stability
> and performance if we set nodatacow for the

So regarding this, you can check: https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation

Essentially every bit of small, random postgres update in the db file will cause a CoW operation + checksum IO, which causes, and I quote, "thrashing on HDDs and excessive multi-second spikes of CPU load on systems with an SSD or large amount of RAM."

So for OLTP workloads you definitely want nodatacow enabled; bear in mind this also disables crc checksumming, but your db engine should already have such functionality implemented in it.

> entire FS while retaining snapshots? Does nodatacow increase the chance
> of corruption in a database like Postgres, i.e. are writes still properly
> ordered/sync'ed when flushed to disk?

Well, most modern DBs already implement some sort of a WAL, so the reliability responsibility is shifted to the db engine.
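For readers unfamiliar with the term: "some sort of a WAL" means the engine appends a checksummed record of each change and fsyncs the log before touching the data file, so a crash can at worst lose a torn tail of the log, never ordering. A toy Python sketch; the record format (length, CRC32, payload) is invented for illustration and is not any real engine's:

```python
import os
import struct
import zlib

def wal_append(log_path, payload: bytes):
    """Append one record (length, CRC32, payload); fsync before returning."""
    record = struct.pack("<II", len(payload), zlib.crc32(payload)) + payload
    with open(log_path, "ab") as log:
        log.write(record)
        log.flush()
        os.fsync(log.fileno())  # the durability point: data files may lag behind

def wal_replay(log_path):
    """Yield intact records in order; stop at the first torn/corrupt one."""
    with open(log_path, "rb") as log:
        data = log.read()
    offset = 0
    while offset + 8 <= len(data):
        length, crc = struct.unpack_from("<II", data, offset)
        payload = data[offset + 8: offset + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn write at crash time: discard the tail
        yield payload
        offset += 8 + length
```

Note this protects write *ordering and atomicity*, which is what the question was about; it does not detect bit rot in already-applied data pages, which is the separate checksumming gap discussed elsewhere in the thread.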
Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs
Dne 9.3.2018 v 20:03 Martin Svec napsal(a):
> Dne 9.3.2018 v 17:36 Nikolay Borisov napsal(a):
>> On 23.02.2018 16:28, Martin Svec wrote:
>>> Hello,
>>>
>>> we have a btrfs-based backup system using btrfs snapshots and rsync. Sometimes,
>>> we hit ENOSPC bug and the filesystem is remounted read-only. However, there's
>>> still plenty of unallocated space according to "btrfs fi usage". So I think this
>>> isn't another edge condition when btrfs runs out of space due to fragmented chunks,
>>> but a bug in disk space allocation code. It suffices to umount the filesystem and
>>> remount it back and it works fine again. The frequency of ENOSPC seems to be
>>> dependent on metadata chunks usage. When there's a lot of free space in existing
>>> metadata chunks, the bug doesn't happen for months. If most metadata chunks are
>>> above ~98%, we hit the bug every few days. Below are details regarding the backup
>>> server and btrfs.
>>>
>>> The backup works as follows:
>>>
>>> * Every night, we create a btrfs snapshot on the backup server and rsync data
>>>   from a production server into it. This snapshot is then marked read-only and
>>>   will be used as a base subvolume for the next backup snapshot.
>>> * Every day, expired snapshots are removed and their space is freed. Cleanup
>>>   is scheduled in such a way that it doesn't interfere with the backup window.
>>> * Multiple production servers are backed up in parallel to one backup server.
>>> * The backed up servers are mostly webhosting servers and mail servers, i.e.
>>>   hundreds of billions of small files. (Yes, we push btrfs to the limits :-))
>>> * Backup server contains ~1080 snapshots, Zlib compression is enabled.
>>> * Rsync is configured to use whole file copying.
>>>
>>> System configuration:
>>>
>>> Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch
>>> (see below) and Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug
>>> printing in metadata_reserve_bytes)
>>>
>>> btrfs mount options:
>>> noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15
>>>
>>> $ btrfs fi df /backup:
>>>
>>> Data, single: total=28.05TiB, used=26.37TiB
>>> System, single: total=32.00MiB, used=3.53MiB
>>> Metadata, single: total=255.00GiB, used=250.73GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>> $ btrfs fi show /backup:
>>>
>>> Label: none  uuid: a52501a9-651c-4712-a76b-7b4238cfff63
>>>         Total devices 2 FS bytes used 26.62TiB
>>>         devid    1 size 416.62GiB used 255.03GiB path /dev/sdb
>>>         devid    2 size 36.38TiB used 28.05TiB path /dev/sdc
>>>
>>> $ btrfs fi usage /backup:
>>>
>>> Overall:
>>>     Device size:          36.79TiB
>>>     Device allocated:     28.30TiB
>>>     Device unallocated:    8.49TiB
>>>     Device missing:          0.00B
>>>     Used:                 26.62TiB
>>>     Free (estimated):     10.17TiB  (min: 10.17TiB)
>>>     Data ratio:               1.00
>>>     Metadata ratio:           1.00
>>>     Global reserve:      512.00MiB  (used: 0.00B)
>>>
>>> Data,single: Size:28.05TiB, Used:26.37TiB
>>>    /dev/sdc  28.05TiB
>>>
>>> Metadata,single: Size:255.00GiB, Used:250.73GiB
>>>    /dev/sdb  255.00GiB
>>>
>>> System,single: Size:32.00MiB, Used:3.53MiB
>>>    /dev/sdb  32.00MiB
>>>
>>> Unallocated:
>>>    /dev/sdb  161.59GiB
>>>    /dev/sdc    8.33TiB
>>>
>>> Btrfs filesystem uses two logical drives in single mode, backed by
>>> hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting
>>> of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume.
>>>
>>> Please note that we have a simple custom patch in btrfs which ensures
>>> that metadata chunks are allocated preferably on SSD volume and data
>>> chunks are allocated only on SATA volume. The patch slightly modifies
>>> __btrfs_alloc_chunk() so that its loop over devices ignores rotating
>>> devices when a metadata chunk is requested and vice versa.
>>> However, I'm quite sure that this patch doesn't cause the reported bug
>>> because we log every call of the modified code and there're no
>>> __btrfs_alloc_chunk() calls when ENOSPC is triggered. Moreover, we
>>> observed the same bug before we developed the patch. (IIRC, Chris Mason
>>> mentioned that they work on a similar feature in facebook, but I've
>>> found no official patches yet.)
>>>
>>> Dmesg dump:
>>>
>>> [285167.750763] use_block_rsv: 62468 callbacks suppressed
>>> [285167.750764] BTRFS: block rsv returned -28
>>> [285167.750789] [ cut here ]
>>> [285167.750822] WARNING: CPU: 5 PID: 443 at fs/btrfs/extent-tree.c:8463
>>> btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
>>> [285167.750823] Modules linked in: binfmt_misc xt_comment xt_tcpudp
>>> iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw
>>> ip6table_filter iptable_nat
Re: How to replace a failed drive in btrfs RAID 1 filesystem
09.03.2018 19:43, Austin S. Hemmelgarn пишет:
> If the answer to either one or two is no but the answer to three is yes,
> pull out the failed disk, put in a new one, mount the volume degraded,
> and use `btrfs replace` as well (you will need to specify the device ID
> for the now missing failed disk, which you can find by calling `btrfs
> filesystem show` on the volume).

I do not see it, and I do not remember ever seeing the device ID of missing devices.

10:/home/bor # blkid
/dev/sda1: UUID="ce0caa57-7140-4374-8534-3443d21f3edc" TYPE="swap" PARTUUID="d2714b67-01"
/dev/sda2: UUID="cc072e56-f671-4388-a4a0-2ffee7c98fdb" UUID_SUB="eaeb4c78-da94-43b3-acc7-c3e963f1108d" TYPE="btrfs" PTTYPE="dos" PARTUUID="d2714b67-02"
/dev/sdb1: UUID="e4af8f3c-8307-4397-90e3-97b90989cf5d" UUID_SUB="f421f1e7-2bb0-4a67-a18e-cfcbd63560a8" TYPE="btrfs" PARTUUID="875525bf-01"
10:/home/bor # mount /dev/sdb1 /mnt
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sdb1, missing codepage or helper program, or other error.
10:/home/bor # mount -o degraded /dev/sdb1 /mnt
10:/home/bor # btrfs fi sh /mnt
Label: none  uuid: e4af8f3c-8307-4397-90e3-97b90989cf5d
        Total devices 2 FS bytes used 256.00KiB
        devid    2 size 1023.00MiB used 212.50MiB path /dev/sdb1
        *** Some devices missing
10:/home/bor # btrfs fi us /mnt
Overall:
    Device size:            2.00GiB
    Device allocated:     425.00MiB
    Device unallocated:     1.58GiB
    Device missing:      1023.00MiB
    Used:                 512.00KiB
    Free (estimated):     912.62MiB  (min: 912.62MiB)
    Data ratio:                2.00
    Metadata ratio:            2.00
    Global reserve:        16.00MiB  (used: 0.00B)

Data,RAID1: Size:102.25MiB, Used:128.00KiB
   /dev/sdb1  102.25MiB
   missing    102.25MiB

Metadata,RAID1: Size:102.25MiB, Used:112.00KiB
   /dev/sdb1  102.25MiB
   missing    102.25MiB

System,RAID1: Size:8.00MiB, Used:16.00KiB
   /dev/sdb1    8.00MiB
   missing      8.00MiB

Unallocated:
   /dev/sdb1  810.50MiB
   missing    810.50MiB

10:/home/bor # rpm -q btrfsprogs
btrfsprogs-4.15-2.1.x86_64
10:/home/bor # uname -a
Linux 10 4.15.7-1-default #1 SMP PREEMPT Wed Feb 28 12:40:23 UTC 2018 (a36e160) x86_64 x86_64 x86_64 GNU/Linux
10:/home/bor #

And "missing" is not the answer, because I obviously may have more than one missing device.
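Until the tools print it directly, the devid of a missing device can only be inferred from what `btrfs filesystem show` does print: the devids in 1..Total that are absent from the listing. A sketch of that inference; it assumes devids were allocated contiguously from 1, which is not guaranteed after earlier device removals, so treat it as a heuristic:

```python
import re

def missing_devids(fi_show_output: str):
    """Infer missing devids from `btrfs filesystem show` text output."""
    total = int(re.search(r"Total devices\s+(\d+)", fi_show_output).group(1))
    present = {int(m) for m in re.findall(r"devid\s+(\d+)", fi_show_output)}
    # Assumes devids run 1..total with no gaps left by past removals.
    return sorted(set(range(1, total + 1)) - present)
```

Run against the transcript above it would report devid 1 as the missing one; with several missing devices it returns all absent ids, which is exactly the ambiguity complained about here.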
Re: How to replace a failed drive in btrfs RAID 1 filesystem
Austin S. Hemmelgarn wrote:
> On 2018-03-09 11:02, Paul Richards wrote:
>> Hello there,
>> I have a 3 disk btrfs RAID 1 filesystem, with a single failed drive.
>> Before I attempt any recovery I'd like to ask what is the recommended
>> approach? (The wiki docs suggest consulting here before attempting
>> recovery[1].)
>>
>> The system is powered down currently and a replacement drive is being
>> delivered soon.
>>
>> Should I use "replace", or "add" and "delete"? Once replaced should I
>> rebalance and/or scrub?
>>
>> I believe that the recovery may involve mounting in degraded mode. If I
>> do this, how do I later get out of degraded mode, or if it's automatic
>> how do I determine when I'm out of degraded mode?
>
> It won't automatically mount degraded; you either have to explicitly ask
> it to, or you have to have an option to do so in your default mount
> options for the volume in /etc/fstab (which is dangerous for multiple
> reasons).
>
> Now, as to what the best way to go about this is, there are three things
> to consider:
>
> 1. Is the failed disk still usable enough that you can get good data off
>    of it in a reasonable amount of time? If you're replacing the disk
>    because of a lot of failed sectors, you can still probably get data
>    off of it, while something like a head crash isn't worth trying to
>    get data back.
> 2. Do you have enough room in the system itself to add another disk
>    without removing one?
> 3. Is the replacement disk at least as big as the failed disk?
>
> If the answer to all three is yes, then just put in the new disk, mount
> the volume normally (you don't need to mount it degraded if the failed
> disk is working this well), and use `btrfs replace` to move the data.
> This is the most efficient option in terms of time, and is also
> generally the safest (and I personally always over-spec drive-bays in
> systems we build where I work, specifically so that this approach can be
> used).
>
> If the answer to the third question is no, put in the new disk (removing
> the failed one first if the answer to the second question is no), mount
> the volume (mount it degraded if one of the first two questions is no,
> normally otherwise), then add the new disk to the volume with `btrfs
> device add` and remove the old one with `btrfs device delete` (using the
> 'missing' option if you had to remove the failed disk). This is needed
> because the replace operation requires the new device to be at least as
> big as the old one.
>
> If the answer to either one or two is no but the answer to three is yes,
> pull out the failed disk, put in a new one, mount the volume degraded,
> and use `btrfs replace` as well (you will need to specify the device ID
> for the now missing failed disk, which you can find by calling `btrfs
> filesystem show` on the volume). In the event that the replace operation
> refuses to run in this case, instead add the new disk to the volume with
> `btrfs device add` and then run `btrfs device delete missing` on the
> volume.
>
> If you follow any of the above procedures, you don't need to balance
> (the replace operation is equivalent to a block-level copy and will
> result in data being distributed exactly the same as it was before,
> while the delete operation is a special type of balance), and you
> generally don't need to scrub the volume either (though it may still be
> a good idea). As far as getting back from degraded mode, you can just
> remount the volume to do so, though I would generally suggest rebooting.
>
> Note that there are three other possible approaches to consider as well:
>
> 1. If you can't immediately get a new disk _and_ all the data will fit
>    on the other two disks, use `btrfs device delete` to remove the
>    failed disk anyway, and run with just the two until you can get a new
>    disk. This is exponentially safer than running the volume degraded
>    until you get a new disk, and is the only case in which you
>    realistically should delete a device before adding the new one. Make
>    sure to balance the volume after adding the new device.
> 2. Depending on the situation, it may be faster to just recreate the
>    whole volume from scratch using a backup than it is to try to repair
>    it. This is actually the absolute safest method of handling this
>    situation, as it makes sure that nothing from the old volume with the
>    failed disk causes problems in the future.
> 3. If you don't have a backup, but have some temporary storage space
>    that will fit all the data from the volume, you could also use `btrfs
>    restore` to extract files from the old volume to temporary storage,
>    recreate the volume, and copy the data back in from the temporary
>    storage.

I did a quick scan of the wiki just to see, but I did not find any good info about how to recover a "RAID"-like set if degraded. Information about how to recover, and what profiles can be recovered from, would be good to have.
Re: zerofree btrfs support?
On Sat, Mar 10, 2018 at 03:55:25AM +0100, Christoph Anton Mitterer wrote:
> Just wondered... was it ever planned (or is there some equivalent) to
> get support for btrfs in zerofree?

Do you want zerofree for thin storage optimization, or for security?

For the former, you can use fstrim; this is enough on any modern SSD; on HDD
you can rig the block device to simulate TRIM by writing zeroes. I'm sure
one of dm-* can do this, if not -- should be easy to add; there's also
qemu-nbd which allows control over discard, but incurs a performance penalty
compared to playing with the block layer.

For zerofree for security, you'd need defrag (to dislodge partial pinned
extents) first, and do a full balance to avoid data left in metadata nodes
and in blocks beyond file ends (note that zerofree doesn't do this on
traditional filesystems either).

Meow!
--
⢀⣴⠾⠻⢶⣦⠀ ⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.