Re: degraded raid scribbling upon wrong device
On Thu, Jul 13, 2017 at 08:40:12AM +0200, Adam Borowski wrote:
> Here's a set of test cases; two of them in some cases seem to scribble upon
> the wrong device:
>
> * deg-mid-missing
> * deg-last-replaced (not on the innocent "re")
> * but never deg-last-missing
>
> When all goes ok, there are no errors other than wrong generation on the
> re-added disk (expected). When it goes bad, there's a lot of corruption.
> In all cases, though, the "Device missing:" field is wrong.

I did not explore this adequately yet, in good part because of ENOSPC
triggering a lot of the time for an unrelated reason that Omar just fixed
(thanks!).

So, here's what I know so far:

* copying in, say, 2.2GB of /usr/share is a lot more likely to trigger it
  than dd-ing 2.2GB from /dev/zero

* no "real" degrading is needed: in the original scripts, the missing
  device is empty, so all blocks are doubled anyway. It's not about
  degraded chunks but about a bogus device.

* bogus output of "btrfs f u" is a sure predictor that, with enough tries,
  you'll get corruption -- if it shows something when it should say
  "missing", shit is likely to happen

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁  A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀  A smart species invents a can opener.
⠈⠳⣄        A master species delegates.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
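[Editor's sketch] Acting on the "bogus btrfs f u output" predictor above amounts to parsing the "Device missing:" line of `btrfs filesystem usage` output. A minimal, hedged sketch — the helper name and the sample text are mine, and the exact output shape should be checked against your btrfs-progs version:

```python
import re

def missing_bytes(usage_text: str):
    """Parse the "Device missing:" line from `btrfs filesystem usage`
    ("btrfs f u") output.  Returns the reported size string, or None
    if the line is absent."""
    m = re.search(r"^\s*Device missing:\s*(\S+)", usage_text, re.MULTILINE)
    return m.group(1) if m else None

# Synthetic sample in the shape of the btrfs-progs "Overall:" section:
sample = """\
Overall:
    Device size:                   4.00GiB
    Device allocated:              1.25GiB
    Device missing:                2.00GiB
"""
print(missing_bytes(sample))   # -> 2.00GiB
```

Per the mail, a non-empty device identifier where "missing" should appear (rather than a size of 0) is the actual red flag, so a real check would compare this field against the expected device list.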
Re: btrfs device ready purpose
On Sat, Jul 22, 2017 at 1:58 PM, Adam Borowski wrote:
> On Sat, Jul 22, 2017 at 06:15:58PM +, Hugo Mills wrote:
>> On Sat, Jul 22, 2017 at 12:06:17PM -0600, Chris Murphy wrote:
>>> I just did an additional test that's pretty icky behavior.
>>>
>>> 2x HDD device Btrfs volume. Add both devices and `btrfs devices ready`
>>> exits with 0 as expected. Physically remove both USB devices.
>>> Reconnect one device. `btrfs device ready` still exits 0. That's
>>> definitely not good. (If I leave that one device connected and reboot,
>>> `btrfs device ready` exits 1).
>>
>> In a slightly less-specific way, this has been a problem pretty
>> much since the inception of the FS. It's not possible to do the
>> reverse of the "scan" operation on a device -- that is, invalidate/
>> remove the device's record in the kernel. So, as you've discovered
>> here, if you have a device which is removed (overwritten, unplugged),
>> the kernel still thinks it's a part of the FS.
>
> Alas, this needs to be fixed. The reproducers I posted last week give data
> corruption in case a device that was once a part of the FS is reconnected.
> It doesn't matter what it contains now -- be it another part of the FS or
> something totally unrelated; as long as the device node (/dev/loop0,
> /dev/sda1, etc.) is reused, degraded mounts get confused.
>
> It wasn't urgent before, as degraded mounts were broken before Qu's chunk
> check patch (that's not even merged yet) -- but once running degraded is
> not an emergency, there'll be folks doing so for an extended time.
>
>> It's something I recall being talked about a bit, some years ago. I
>> don't recall now why it was going to be useful, though. I think you
>> have a good use-case for such a new ioctl (or extension to the
>> SCAN_DEV ioctl) now, though.
>
> Such an ioctl would be inherently racy. Even current udev code is --
> mounting right after losetup often fails; sometimes you even need to
> sleep longer than 1 second.
> With the above in mind, I see no way other than
> invalidating and re-checking all known devices at mount time.

If we go back even further in time, what I'm trying to avoid is the problem
with DEs where the user connects a two-device Btrfs, and then they want to
eject it. The DE is already confused because behind the scenes it has
actually mounted each device to two different mount points, which Btrfs
allows (it's one file system, on two mount points). That's confusing, but
not a big problem.

The big problem happens when the user wants to stop using that file system.
So they eject one of the two appearing devices (which should of course only
be one with Btrfs), and behind the scenes udisksd unmounts just one of the
mount points and then appears to delete that device node, which in effect
makes the still-mounted file system degraded, and results in corruption.
Btrfs fixes this up on the next mount of both devices. But it's just asking
for trouble. Output of this behavior here:
https://bugs.freedesktop.org/show_bug.cgi?id=87277#c3

So then I started to look at whether it's possible to easily determine in
advance if a Btrfs file system is single or multiple device, and let
udisksd have a policy where it will just ignore multiple-device Btrfs
entirely -- just don't support it until the guts of all this infrastructure
get better.

'strace btrfs filesystem show' curiously shows BTRFS_IOC_FS_INFO is only
called for single-device Btrfs. There is seemingly a much more esoteric,
btrfs-progs-only method for getting information for multiple-device Btrfs
volumes. And therefore I'm not certain whether BTRFS_IOC_FS_INFO supports
multiple-device Btrfs and would return num_devices, so that it's possible
to know whether to ignore devices for a multiple-device Btrfs volume.
*sigh*

Chris Murphy
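[Editor's sketch] On the BTRFS_IOC_FS_INFO question above: the ioctl fills a `struct btrfs_ioctl_fs_info_args` whose leading fields include `num_devices`, so in principle it does answer the single-vs-multiple question. The field layout below is my reading of the kernel's uapi `linux/btrfs.h` (verify against your headers), and the 1024-byte total size is an assumption; the parse is exercised on a synthetic buffer rather than a live ioctl:

```python
import struct
import uuid

# Leading fields of struct btrfs_ioctl_fs_info_args, as I read the
# kernel uapi header (double-check before relying on this):
#   __u64 max_id; __u64 num_devices; __u8 fsid[16];
#   __u64 nodesize; __u64 sectorsize; __u64 clone_alignment; ...
FS_INFO_FMT = "<QQ16sQQQ"
FS_INFO_SIZE = 1024          # assumed total size of the ioctl argument

def parse_fs_info(buf: bytes) -> dict:
    """Decode the leading fields of a btrfs_ioctl_fs_info_args buffer."""
    max_id, num_devices, fsid, nodesize, sectorsize, clone_align = \
        struct.unpack_from(FS_INFO_FMT, buf)
    return {
        "num_devices": num_devices,
        "fsid": str(uuid.UUID(bytes=fsid)),
        "nodesize": nodesize,
        "sectorsize": sectorsize,
    }

# Synthetic buffer standing in for what the ioctl would fill in
# for a two-device filesystem:
fake = struct.pack(FS_INFO_FMT, 2, 2, b"\x11" * 16, 16384, 4096, 4096)
fake += b"\x00" * (FS_INFO_SIZE - len(fake))
print(parse_fs_info(fake)["num_devices"])   # -> 2
```

Against a real mount you would issue the ioctl on an fd of the mounted filesystem with `fcntl.ioctl(fd, BTRFS_IOC_FS_INFO, buf)`; the request number is left out here because it should be taken from the headers rather than hard-coded.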
Re: btrfs device ready purpose
On Sat, Jul 22, 2017 at 06:15:58PM +, Hugo Mills wrote:
> On Sat, Jul 22, 2017 at 12:06:17PM -0600, Chris Murphy wrote:
>> I just did an additional test that's pretty icky behavior.
>>
>> 2x HDD device Btrfs volume. Add both devices and `btrfs devices ready`
>> exits with 0 as expected. Physically remove both USB devices.
>> Reconnect one device. `btrfs device ready` still exits 0. That's
>> definitely not good. (If I leave that one device connected and reboot,
>> `btrfs device ready` exits 1).
>
> In a slightly less-specific way, this has been a problem pretty
> much since the inception of the FS. It's not possible to do the
> reverse of the "scan" operation on a device -- that is, invalidate/
> remove the device's record in the kernel. So, as you've discovered
> here, if you have a device which is removed (overwritten, unplugged),
> the kernel still thinks it's a part of the FS.

Alas, this needs to be fixed. The reproducers I posted last week give data
corruption in case a device that was once a part of the FS is reconnected.
It doesn't matter what it contains now -- be it another part of the FS or
something totally unrelated; as long as the device node (/dev/loop0,
/dev/sda1, etc.) is reused, degraded mounts get confused.

It wasn't urgent before, as degraded mounts were broken before Qu's chunk
check patch (that's not even merged yet) -- but once running degraded is
not an emergency, there'll be folks doing so for an extended time.

> It's something I recall being talked about a bit, some years ago. I
> don't recall now why it was going to be useful, though. I think you
> have a good use-case for such a new ioctl (or extension to the
> SCAN_DEV ioctl) now, though.

Such an ioctl would be inherently racy. Even current udev code is --
mounting right after losetup often fails; sometimes you even need to sleep
longer than 1 second.

With the above in mind, I see no way other than invalidating and
re-checking all known devices at mount time.

Meow!
Re: btrfs device ready purpose
On Sat, Jul 22, 2017 at 12:06:17PM -0600, Chris Murphy wrote:
> I just did an additional test that's pretty icky behavior.
>
> 2x HDD device Btrfs volume. Add both devices and `btrfs devices ready`
> exits with 0 as expected. Physically remove both USB devices.
> Reconnect one device. `btrfs device ready` still exits 0. That's
> definitely not good. (If I leave that one device connected and reboot,
> `btrfs device ready` exits 1).

In a slightly less-specific way, this has been a problem pretty much
since the inception of the FS. It's not possible to do the reverse of the
"scan" operation on a device -- that is, invalidate/remove the device's
record in the kernel. So, as you've discovered here, if you have a device
which is removed (overwritten, unplugged), the kernel still thinks it's a
part of the FS.

It's something I recall being talked about a bit, some years ago. I don't
recall now why it was going to be useful, though. I think you have a good
use-case for such a new ioctl (or extension to the SCAN_DEV ioctl) now,
though.

Hugo.

-- 
Hugo Mills | UNIX: Italian pen maker
hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: E2AB1DE4
Re: btrfs device ready purpose
I just did an additional test that's pretty icky behavior.

2x HDD device Btrfs volume. Add both devices and `btrfs device ready`
exits with 0 as expected. Physically remove both USB devices. Reconnect
one device. `btrfs device ready` still exits 0. That's definitely not
good. (If I leave that one device connected and reboot, `btrfs device
ready` exits 1).

Chris Murphy
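[Editor's sketch] Since the thread establishes that `btrfs device ready` only reports whether the kernel has ever seen all member devices, the typical use is a boot-time polling loop around it. A minimal sketch of such a loop — the function is hypothetical and is exercised below with stand-in commands rather than a real device:

```python
import subprocess
import time

def wait_until_ready(cmd, timeout=10.0, interval=0.5):
    """Poll a readiness command (e.g. ["btrfs", "device", "ready", dev])
    until it exits 0 or the timeout elapses.  Returns True on success.
    Caveat from this thread: exit 0 only means the kernel has *seen*
    all member devices at some point -- not that they are still present."""
    deadline = time.monotonic() + timeout
    while True:
        if subprocess.run(cmd).returncode == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Stand-in commands instead of a real btrfs device:
print(wait_until_ready(["true"]))                                # -> True
print(wait_until_ready(["false"], timeout=0.1, interval=0.05))   # -> False
```

This is essentially what the udev/systemd integration does on behalf of mount units, which is why a device the kernel has already recorded never flips the result back to "not ready".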
Re: btrfs device ready purpose
On Fri, Jul 21, 2017 at 11:55 PM, Andrei Borzenkov wrote:
> 21.07.2017 17:36, Chris Murphy wrote:
>>> The command is just a simple wrapper around the DEVICES_READY ioctl, but
>>> now that systemd has its own wrapper tool, there are probably no users
>>> of that subcommand in the 'btrfs' tool itself. We can enhance the
>>> documentation to state the expected purpose and that normal users can
>>> ignore it.
>>
>> What is the expected purpose? It flat out does not seem to work at
>> all. It doesn't wait when devices are missing, as the man description
>> says.
>
> It's the man page that is misleading. The intent was to let the caller
> of "btrfs device ready" know when it has to wait.
>
>> And `echo $?` returns a 0 instead of 1. I'd expect exit code 0 to mean
>> "yes, all devices are ready", and exit code 1 "some devices not
>> ready". But right now, I get the same result no matter what.
>
> That's not what I observe.
>
> linux-gtrk:~ # btrfs device ready /dev/sdb
> linux-gtrk:~ # echo $?
> 1
> linux-gtrk:~ # btrfs-debug-tree /dev/sdb
> btrfs-progs v4.5.3+20160729
> warning, device 2 is missing
> ...
>
> But if you call "btrfs device ready" AFTER the kernel has already seen
> (or decided about) all devices, then it returns 0. Basically, this is
> not "filesystem ready" but "does the kernel know about all devices for
> this filesystem".

OK! Super! This is the critical bit of behavior. My test is flawed. The
multiple-device volume was visible to the kernel, and then I merely
deactivated the LV. The kernel had seen it, and isn't "missing" it, at
least in terms of 'btrfs device ready', whereas 'btrfs fi show' does
report it as missing but is also using different ioctls. Even if I use
'btrfs device scan', a subsequent 'btrfs device ready' exits 0.

But if I set skip activation ('lvchange -ky') and reboot, 'btrfs device
ready' on the non-missing device does result in an exit code of 1.

> Please do not confuse independent things.
"btrfs device ready" simply > tells caller whether all devices have been seen by kernel. This is poor > man's solution for "can I mount it". What caller does with this > information is outside of scope of btrfs. Got it. Thanks. > >> I think it'd be better to return a code. >> 0: is complete, degraded not required >> 1: is incomplete, degraded should mount it >> 2: is incomplete, degraded won't mount it >> > > There is no way systemd can make use of this information using current > static unit dependencies. Really, this topic came up more than once > (including by you as well). systemd does not have adequate ways to > represent multi-device objects (this goes beyond btrfs, Linux MD is good > example). Sometimes it is possible to workaround it (Linux MD again). > But at the end, systemd needs to offer framework where btrfs et al can > plug in by providing status. Until this happens, discussion on this list > is pointless. Understood. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 0/4] Add xxhash and zstd modules
On Fri, Jul 21, 2017 at 11:56:21AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-07-20 17:27, Nick Terrell wrote:
>> This patch set adds xxhash, zstd compression, and zstd decompression
>> modules. It also adds zstd support to BtrFS and SquashFS.
>>
>> Each patch has relevant summaries, benchmarks, and tests.
>
> For patches 2-3, I've compile tested and had runtime testing running for
> about 18 hours now with no issues, so you can add:
>
> Tested-by: Austin S. Hemmelgarn

I assume you haven't tried it on arm64, right? I had no time to get 'round
to it before, and just got the following build failure:

  CC      fs/btrfs/zstd.o
In file included from fs/btrfs/zstd.c:28:0:
fs/btrfs/compression.h:39:2: error: unknown type name ‘refcount_t’
  refcount_t pending_bios;
  ^~
scripts/Makefile.build:302: recipe for target 'fs/btrfs/zstd.o' failed

It's trivially fixable by:

--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -24,6 +24,7 @@
 #include <...>
 #include <...>
 #include <...>
+#include <linux/refcount.h>
 #include <...>
 #include "compression.h"

after which it works fine, although half an hour of testing isn't exactly
exhaustive.

Alas, the armhf machine I ran stress tests (Debian archive rebuilds) on
doesn't boot with 4.13-rc1 due to some unrelated regression; bisecting
that would be quite painful, so I did not try yet. I guess re-testing your
patch set on 4.12, even with btrfs-for-4.13 (which it had for a while),
wouldn't be of much help. So far, previous versions have been running for
weeks, with no issue since you fixed workspace flickering.

On amd64 all is fine. I haven't tested SquashFS at all.

Meow!
Re: How to detect "orphaned" subvolume attachment point in snapshot?
On 07/22/2017 09:58 AM, Andrei Borzenkov wrote:
> Here is the structure of snapshots in openSUSE; all snapshots of the
> root volume are created under the /.snapshots subvolume:
>
> linux-gtrk:/host/home/src/python-btrfs/examples # sudo mount -o ro,subvol=/ /dev/sda3 /mnt
> linux-gtrk:/host/home/src/python-btrfs/examples # ./show_directory_contents.py /mnt/
> directory /mnt/@ tree 257 inum 256
> inode generation 6 transid 315 size 72 nbytes 0 block_group 0 mode 40755 nlink 1 uid 0 gid 0 rdev 0 flags 0x0(none)
> inode ref index 0 name utf-8 ..
> ...
> dir item list hash 1921786525 size 1
> dir item location (258 ROOT_ITEM -1) type DIR name utf-8 .snapshots
> ...
> linux-gtrk:/host/home/src/python-btrfs/examples # ./show_directory_contents.py /mnt/@/.snapshots/251/snapshot/
> directory /mnt/@/.snapshots/251/snapshot/ tree 774 inum 256
> inode generation 6 transid 15867 size 164 nbytes 0 block_group 0 mode 40755 nlink 1 uid 0 gid 0 rdev 0 flags 0x0(none)
> inode ref index 0 name utf-8 ..
> ...
> dir item list hash 1921786525 size 1
> dir item location (258 ROOT_ITEM -1) type DIR name utf-8 .snapshots
> ...
> linux-gtrk:/host/home/src/python-btrfs/examples #
>
> Note that both directory items in /mnt/@ and
> /mnt/@/.snapshots/251/snapshot store the same tree ID for the .snapshots
> item -- 258. This causes a grub2 btrfs driver loop: when it comes to
> /mnt/@/.snapshots/251/snapshot and looks for .snapshots, it jumps back
> to the /mnt/@/.snapshots tree.
>
> I see that the Linux kernel somehow distinguishes between both of them;
> I am not sure how it actually does it, though.
>
> What on-disk information should we check to find out the "orphaned"
> snapshot directory?

The information is not stored in the subvolume that contains the
"attachment point", so you cannot get the info at that location.
If it were, that would mean that when creating a snapshot, some process
would need to walk the entire directory structure and change all the
locations in the tree that looked as if there was another nested
subvolume placed there before.

In tree 1, the tree of trees, there's information about root 258:

-# btrfs inspect-internal dump-tree -t 1 /dev/[...]/blaat
[...]
        item 19 key (258 ROOT_ITEM 0) itemoff 12635 itemsize 439
                root data bytenr 21397504 level 0 dirid 256 refs 1
                gen 11 lastsnap 0 flags 0x0(none)
                uuid d7fe436b-35b5-9b4e-805d-20b9294a55d0
                ctransid 11 otransid 9 stransid 0 rtransid 0
        item 20 key (258 ROOT_BACKREF 257) itemoff 12616 itemsize 19
                root backref key dirid 256 sequence 2 name b

I think the ROOT_BACKREF says that the only location where the contents
of the nested subvolume should really be shown is when it's looked at via
the "attachment point" in tree 257, directory 256, index 2, in the
directory with name b.

When looking at it via the VFS, you get a special inode number 2 when
looking at it in a place that does not match the BACKREF:

-# stat b
  File: b
  Size: 0           Blocks: 0          IO Block: 4096   directory
Device: 55h/85d     Inode: 2           Links: 1
Access: (0755/drwxr-xr-x)  Uid: (0/root)   Gid: (0/root)
Access: 2017-07-22 13:08:59.217456707 +0200
Modify: 2017-07-22 13:08:59.217456707 +0200
Change: 2017-07-22 13:08:59.217456707 +0200
 Birth: -

I don't have all the structures of the root tree in python-btrfs yet, it
seems. It would be nice to create an example script that does a
pretty-printed version of tree 1.

-- 
Hans van Kranenburg
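[Editor's sketch] Building on Hans' stat output, a userspace tool can spot such an orphaned attachment point simply by its inode number: btrfs presents these nested-subvolume placeholders as empty directories with inode 2 (BTRFS_EMPTY_SUBVOL_DIR_OBJECTID in the kernel sources, if memory serves). The helper name is mine, and the check is exercised on synthetic stat results rather than a live filesystem:

```python
import os
import stat

# Inode number btrfs uses for placeholder directories that don't match
# their ROOT_BACKREF (assumed from the stat output in the thread):
EMPTY_SUBVOL_DIR_INO = 2

def looks_like_orphaned_attachment(st: os.stat_result) -> bool:
    """True if a stat result matches the btrfs placeholder directory
    that stands in for a nested subvolume at a non-BACKREF location."""
    return stat.S_ISDIR(st.st_mode) and st.st_ino == EMPTY_SUBVOL_DIR_INO

# Synthetic stat results standing in for real btrfs directories
# (tuple order: mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime):
placeholder = os.stat_result((stat.S_IFDIR | 0o755, 2, 85, 1, 0, 0, 0, 0, 0, 0))
regular_dir = os.stat_result((stat.S_IFDIR | 0o755, 256, 85, 1, 0, 0, 0, 0, 0, 0))
print(looks_like_orphaned_attachment(placeholder))   # -> True
print(looks_like_orphaned_attachment(regular_dir))   # -> False
```

For an on-disk check without the VFS (the grub2 case), comparing each `(… ROOT_ITEM …)` dir item against the ROOT_BACKREF of the referenced root, as Hans describes, would be the equivalent test.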
How to detect "orphaned" subvolume attachment point in snapshot?
Here is the structure of snapshots in openSUSE; all snapshots of the root
volume are created under the /.snapshots subvolume:

linux-gtrk:/host/home/src/python-btrfs/examples # sudo mount -o ro,subvol=/ /dev/sda3 /mnt
linux-gtrk:/host/home/src/python-btrfs/examples # ./show_directory_contents.py /mnt/
directory /mnt/@ tree 257 inum 256
inode generation 6 transid 315 size 72 nbytes 0 block_group 0 mode 40755 nlink 1 uid 0 gid 0 rdev 0 flags 0x0(none)
inode ref index 0 name utf-8 ..
...
dir item list hash 1921786525 size 1
dir item location (258 ROOT_ITEM -1) type DIR name utf-8 .snapshots
...
linux-gtrk:/host/home/src/python-btrfs/examples # ./show_directory_contents.py /mnt/@/.snapshots/251/snapshot/
directory /mnt/@/.snapshots/251/snapshot/ tree 774 inum 256
inode generation 6 transid 15867 size 164 nbytes 0 block_group 0 mode 40755 nlink 1 uid 0 gid 0 rdev 0 flags 0x0(none)
inode ref index 0 name utf-8 ..
...
dir item list hash 1921786525 size 1
dir item location (258 ROOT_ITEM -1) type DIR name utf-8 .snapshots
...
linux-gtrk:/host/home/src/python-btrfs/examples #

Note that both directory items in /mnt/@ and /mnt/@/.snapshots/251/snapshot
store the same tree ID for the .snapshots item -- 258. This causes a grub2
btrfs driver loop: when it comes to /mnt/@/.snapshots/251/snapshot and
looks for .snapshots, it jumps back to the /mnt/@/.snapshots tree.

I see that the Linux kernel somehow distinguishes between both of them; I
am not sure how it actually does it, though.

What on-disk information should we check to find out the "orphaned"
snapshot directory?

TIA

-andrei
Re: kernel btrfs file system wedged -- is it toast?
> The btrfs developers should have known this, and announced this, a long
> time ago, in various prominent ways that it would be difficult for
> potential new users to miss.

I'm also a user like you, and I felt like this too when I came here (BTW,
there are several traps in BTRFS, and others cause partial or whole
filesystem loss, so you're lucky). There's truth in your words that some
warning is needed, but in this open-source business it is not clear who
should give it to whom. Developers on the list are actually spending their
time on adding such warnings to the kernel and command-line tools, but
e.g. people using a GUI and not reading dmesg over breakfast won't see
them anyway. The whole situation is unfortunate because hardware and OS
vendors keep hyping BTRFS and making it the default in their products when
it is clearly not ready, but you're now talking to, and blaming, the wrong
people.

Personally, for me, coming to this list was the most helpful thing in
understanding BTRFS's current state and limitations. I'm still using it,
although in a very careful and controlled manner. But browsing the list
every day sadly takes time. If you can't afford it, or are running
something absolutely critical, better look to other, more mature
filesystems. After all, as the adage says: "legacy is what we run in
production".

-- 
With Best Regards,
Marat Khalili