Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota
On Fri, Aug 10, 2018 at 07:39:30 -0400, Austin S. Hemmelgarn wrote:

>> I.e.: every shared segment should be accounted within quota (at least once).
>
> I think what you mean to say here is that every shared extent should be
> accounted to quotas for every location it is reflinked from. IOW, that
> if an extent is shared between two subvolumes each with its own quota,
> they should both have it accounted against their quota.

Yes.

>> Moreover - if there were per-subvolume RAID levels someday, the data
>> should be accounted in relation to the "default" (filesystem) RAID level,
>> i.e. having a RAID0 subvolume on a RAID1 fs should account half of the
>> data, and twice the data in the opposite scenario (like the "dup" profile
>> on a single-drive filesystem).
>
> This is irrelevant to your point here. In fact, it goes against it:
> you're arguing for quotas to report data like `du`, but all of the
> chunk-profile stuff is invisible to `du` (and everything else in
> userspace that doesn't look through BTRFS ioctls).

My point is the user's perspective, not that of some system tool like du.
Consider this:
1. the user wants higher (than default) protection for some data,
2. the user wants more storage space with less protection.

Ad 1: requesting better redundancy is similar to cp --reflink=never -
there are functional differences, but the cost is similar: trading space
for safety.

Ad 2: many would like to have .cache, .ccache, tmp or some build system
directory with faster writes and no redundancy at all. This requires
per-file/directory data profile attributes though.

Since we agreed that transparent data compression is the user's storage
bonus, gains from reduced redundancy should also profit the user.

Disclaimer: all the above statements relate to the concept and common
understanding of quotas, not to be confused with qgroups.

-- 
Tomasz Pala
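The accounting model agreed on above - every shared extent charged to every
subvolume reflinking it - can be sketched as follows. This is a toy model for
illustration only (the function and names are made up here, not qgroup code):

```python
def du_style_charges(refs, extent_size):
    # refs: mapping of subvolume -> set of extent ids it references.
    # Each subvolume is charged for every extent it references, even if
    # that extent is physically shared with other subvolumes.
    return {subvol: len(extents) * extent_size for subvol, extents in refs.items()}

# One 1 MiB extent reflinked into two subvolumes: 1 MiB on disk,
# but both subvolumes are charged the full 1 MiB.
charges = du_style_charges({"home": {1}, "snap": {1}}, 1 << 20)
assert charges == {"home": 1 << 20, "snap": 1 << 20}
```

This matches the "user bought the numbers" view: physical sharing is the
system owner's bonus, not the user's.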
Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota
On Fri, Aug 10, 2018 at 15:55:46 +0800, Qu Wenruo wrote:

>> The first thing about virtually every mechanism should be
>> discoverability and reliability. I expect my quota not to change without
>> my interaction. Never. How did you cope with this?
>> If not - how are you going to explain such weird behaviour to users?
>
> Read the manual first.
> Not every feature is suitable for every use case.

I, the sysadmin, must RTFM. My users won't comprehend this and, moreover,
they won't even care.

> IIRC lvm thin is pretty much the same for the same case.

LVM doesn't pretend to be user-oriented; it is system scope. LVM didn't
name its thin provisioning "quotas".

> For 4 disks with 1T of free space each, if you're using RAID5 for data,
> then you can write 3T of data.
> But if you're also using RAID10 for metadata, and you're using the
> default inlining, we can use small files to fill the free space,
> resulting in 2T of available space.
>
> So in this case how would you calculate the free space? 3T or 2T or
> anything between them?

The answer is pretty simple: 3T. Rationale:

- this is the space I can put into a single data stream,
- people are aware that there is metadata overhead for any object; after
  all, metadata are also data,
- while filling the fs with small files, the available free space would
  self-adjust after every single file put, so after uploading 1T of such
  files, df should report 1.5T free. There would be nothing weirder (than
  now) about 1T of data actually having eaten 1.5T of storage.

No crystal-ball calculations, just KISS; since one _can_ put a 3T file
(non-sparse, incompressible, bulk-written) on the filesystem, the free
space is 3T.

> Only you know what the heck you're going to use those 4 disks with 1T
> of free space each for.
> Btrfs can't look into your head and know what you're thinking.

It shouldn't. I expect raw data - there is 3TB of unallocated space for
the current data profile.
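The 3T-vs-2T arithmetic from the quoted example can be sketched as a toy
calculation (not btrfs code; the helper names are made up here):

```python
def raid5_usable(disks, per_disk):
    # RAID5 keeps one disk's worth of parity per stripe:
    # n-1 of n disks hold data.
    return (disks - 1) * per_disk

def raid10_usable(disks, per_disk):
    # RAID10 mirrors every chunk once: half of the raw space is usable.
    return disks * per_disk // 2

# The quoted example: 4 disks with 1T free each. Big files land in
# RAID5 data chunks (3T usable); small inlined files land in RAID10
# metadata chunks (2T usable) - hence "3T or 2T or anything between".
assert raid5_usable(4, 1) == 3
assert raid10_usable(4, 1) == 2
```

The argument in the reply is that df should simply report the large-file
figure and let it self-adjust as metadata-heavy writes consume space.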
> That's the design from the very beginning of btrfs, yelling at me makes
> no sense at all.

Sorry if you perceive me as "yelling" - I honestly must blame it on my
non-native English. I just want to clarify some terminology and
perspective expectations. They are irrelevant to the underlying technical
solutions, but the literal *description* of the solution you provide
should match user expectations of that terminology.

> I have tried to explain what btrfs quota does and what it doesn't; if it
> doesn't fit your use case, that's all.
> (Whether you have ever tried to understand is another problem)

I am (more than before) aware of what btrfs quotas are not. So, my only
expectation (except for world peace and other unrealistic ones) would be
to stop using "quotas", "subvolume quotas" and "qgroups" interchangeably
in the btrfs context, as IMvHO these are not the plain, well-known
"quotas".

-- 
Tomasz Pala
Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota
On Fri, Aug 10, 2018 at 07:03:18 +0300, Andrei Borzenkov wrote:

>> So - the limit set on any user
>
> Does btrfs support per-user quota at all? I am aware only of per-subvolume
> quotas.

Well, this is a kind of deceptive word usage in "post-truth" times. In
this case both "user" and "quota" are not valid...

- by "user" I meant the general word, not a unix user account; such a
  user might possess some container running a full-blown guest OS,
- by "quota" btrfs means - I guess, dataset quotas?

In fact: https://btrfs.wiki.kernel.org/index.php/Quota_support

"Quota support in BTRFS is implemented at a subvolume level by the use of
quota groups or qgroup"

- what the hell is a "quota group" and how does it differ from a qgroup?
According to btrfs-quota(8):

"The quota groups (qgroups) are managed by the subcommand btrfs qgroup(8)"

- they are the same... just completely different from the traditional
"quotas".

My suggestion would be to completely remove the standalone word "quota"
from the btrfs documentation - there is no "quota", just "subvolume
quota" or "qgroup" support.

-- 
Tomasz Pala
Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota
especially when his data becomes "exclusive" one day without any known
reason), misnamed

...and not reflecting anything valuable, unless the problems with extent
fragmentation have already been resolved somehow?

So IMHO current quotas are:
- not discoverable for the user (the shared->exclusive transition of my
  data caused by someone else's action),
- not reliable for the sysadmin (an offensive write pattern by any user
  can allocate virtually any amount of space despite the quotas).

-- 
Tomasz Pala
Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota
On Tue, Jul 31, 2018 at 22:32:07 +0800, Qu Wenruo wrote:

> 2) Different limitations on exclusive/shared bytes
>    Btrfs can set different limits on exclusive/shared bytes, further
>    complicating the problem.
>
> 3) Btrfs quota only accounts data/metadata used by the subvolume
>    It lacks all the shared trees (mentioned below), and in fact such
>    shared trees can be pretty large (especially the extent tree and
>    csum tree).

I'm not sure about the implications, but just to clarify some things:
when limiting somebody's data space we usually don't care about the
underlying "savings" coming from any deduplication technique - these are
purely bonuses for the system owner, so he can do larger resource
overbooking.

So - the limit set on any user should enforce the maximum, absolute
space he has allocated, including the shared stuff. I could even imagine
that creating a snapshot might immediately "eat" the available quota. In
a way, the quota returned would match (give or take) `du`-reported
usage, unless "do not account reflinks within a single qgroup" were easy
to implement.

I.e.: every shared segment should be accounted within quota (at least
once). And the numbers accounted should reflect the uncompressed sizes.

Moreover - if there were per-subvolume RAID levels someday, the data
should be accounted in relation to the "default" (filesystem) RAID
level, i.e. having a RAID0 subvolume on a RAID1 fs should account half
of the data, and twice the data in the opposite scenario (like the "dup"
profile on a single-drive filesystem).

In short: values representing quotas are user-oriented ("the numbers one
bought"), not storage-oriented ("the numbers they actually occupy").

-- 
Tomasz Pala
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
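The profile-relative accounting proposed above could be sketched like this.
It is a hypothetical model - PROFILE_COPIES and accounted_bytes are
illustrative names invented here, not btrfs APIs:

```python
# Number of physical copies each profile keeps of one logical byte.
PROFILE_COPIES = {"single": 1, "raid0": 1, "dup": 2, "raid1": 2}

def accounted_bytes(logical_bytes, subvol_profile, fs_default_profile):
    # Charge data relative to the filesystem's default profile, so
    # reduced redundancy profits the user and extra redundancy costs him.
    ratio = PROFILE_COPIES[subvol_profile] / PROFILE_COPIES[fs_default_profile]
    return logical_bytes * ratio

# A RAID0 subvolume on a RAID1 filesystem accounts half of the data...
assert accounted_bytes(1000, "raid0", "raid1") == 500
# ...and "dup" on a single-drive filesystem accounts twice the data.
assert accounted_bytes(1000, "dup", "single") == 2000
```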
Re: Any chance to get snapshot-aware defragmentation?
On Fri, May 18, 2018 at 13:10:02 -0400, Austin S. Hemmelgarn wrote:

> Personally though, I think the biggest issue with what was done was not
> the memory consumption, but the fact that there was no switch to turn it
> on or off. Making defrag unconditionally snapshot aware removes one of
> the easiest ways to forcibly unshare data without otherwise altering the

A "defrag only not-snapshotted data" mode would be enough for many use
cases and wouldn't require more RAM. One could run this before taking a
snapshot and merge _at least_ the new data.

And even with the current approach it should be possible to interlace
defragmentation with some kind of naive deduplication; "naive" in the
sense of comparing blocks only within the same in-subvolume paths.

-- 
Tomasz Pala <go...@pld-linux.org>
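The naive in-path deduplication idea could be sketched like this - a
hypothetical illustration of the block-comparison part only, with names
invented here:

```python
import hashlib

def dedupable_bytes(data, block=4096):
    # Hash fixed-size blocks and count bytes that duplicate an
    # already-seen block, i.e. what a naive pass could re-share
    # while the extents are being rewritten by defrag anyway.
    seen, saved = set(), 0
    for off in range(0, len(data), block):
        chunk = data[off:off + block]
        digest = hashlib.sha256(chunk).digest()
        if digest in seen:
            saved += len(chunk)
        else:
            seen.add(digest)
    return saved

# Two identical 4 KiB blocks followed by a unique one: 4 KiB dedupable.
assert dedupable_bytes(b"A" * 8192 + b"B" * 4096) == 4096
```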
Re: your mail
On Sun, Feb 18, 2018 at 10:28:02 +0100, Tomasz Pala wrote:

> I've already noticed this problem on February 10th:
> [btrfs-progs] coreutils-like -i parameter, splitting permissions for various tasks
>
> In short: not possible. A regular user can only create subvolumes.

Not possible "officially". Axel Burri has replied with a more helpful
approach:

https://github.com/digint/btrfs-progs-btrbk

Unfortunately this issue was not picked up by any developer, so for now
we can only wait for libbtrfsutil to be split out so this task could be
made easier.

-- 
Tomasz Pala <go...@pld-linux.org>
Re: your mail
On Sun, Feb 18, 2018 at 08:14:25 +0000, Tomasz Kłoczko wrote:

> For some reason each volume of a btrfs pool is not displayed in mount
> and df output, and I cannot find how to display volume/snapshot usage
> using the btrfs command.

In general: not possible without enabling quotas, which in turn impact
snapshot performance significantly.

btrfs quota enable /
btrfs quota rescan /
btrfs qgroup sh --sort=excl /

> So now I have many volumes and snapshots in my home directory, but to
> maintain all this I must use root permission. As non-root working in
> my own home, which is a separate btrfs volume, it would be nice to have
> the possibility to delegate permission to create, destroy,
> send/receive, mount/umount etc. snapshots and volumes, like it is
> possible on zfs.

I've already noticed this problem on February 10th:

[btrfs-progs] coreutils-like -i parameter, splitting permissions for various tasks

In short: not possible. A regular user can only create subvolumes.

> BTW: has someone maybe started working on something like the .zfs hidden
> directory functionality which is in each zfs volume mountpoint?

In the btrfs world this is done differently - don't put the main
(working) volume in the root, but mount some branch by default, keeping
all the subvolumes next to it. I.e. don't use:

@working_subvolume
@working_subvolume/snapshots

but:

@root_of_the_fs
@root_of_the_fs/working_subvolume
@root_of_the_fs/snapshots

In fact this is a manual workaround for the problem you've mentioned.

> Having a few or a few tens of snapshots is not so big a deal, but the
> same at the scale of a few hundred, thousand or more snapshots would be
> really hard without something like a hidden .btrfs/snapshots directory.

With a few hundred subvolumes btrfs would fail miserably.

> After a few years of not using btrfs (because previously it was quite
> unstable) it is really good to see that now I'm not able to crash it.

It's not crashing with the LTS 4.4 and 4.9 kernels; many reports of
various crashes in 4.12, 4.14 and 4.15 were posted here.
It is really hard to say which of the post-4.9 kernels have reliable
btrfs.

-- 
Tomasz Pala <go...@pld-linux.org>
[btrfs-progs] coreutils-like -i parameter, splitting permissions for various tasks
There is a serious flaw in btrfs subcommand handling. Since all of the
subcommands are handled by the single 'btrfs' binary, there is no way to
create any protection against accidental data loss for (the only one
I've found, but still DANGEROUS) 'btrfs subvolume delete'.

There are several protections commonly used for various commands. For
example, with zsh having hist_ignore_space enabled I've got:

alias kill=' kill'
alias halt=' halt'
alias init=' init'
alias poweroff=' poweroff'
alias reboot=' reboot'
alias shutdown=' shutdown'
alias telinit=' telinit'

so that these commands are never saved into my shell history. Another
system-wide protection enabled by default might be coreutils.sh creating
aliases:

alias cp=' cp --interactive --archive --backup=numbered --reflink=auto'
alias mv=' mv --interactive --backup=numbered'
alias rm=' rm --interactive --one-file-system --interactive=once'

All such countermeasures reduce the probability of fatal mistakes.

There is no 'prompt before doing ANYTHING irreversible' option for
btrfs, so everyone needs to take special care typing commands. Since
snapshotting and managing subvolumes is a daily routine, not anything
special (like creating storage pools or managing devices), it should be
more forgiving of user errors.

Since there is no other (obvious) solution, I propose making "subvolume
delete" ask for confirmation by default, unless used with a newly
introduced option, like -y (--yes).

Moreover, since there might be different admin roles on the system,
btrfs-progs should be split into separate tools, so one could have a
quota-admin without permissions for managing devices, a backup-admin
with access to all the subvolumes, or a maintenance-admin that could
issue scrubs or rebalance volumes. For backward compatibility, these
tools could be called through a 'btrfs' wrapper binary.
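In the same spirit as the aliases above, the proposed confirmation could be
prototyped today as a shell wrapper. This is a sketch only - the function
and the BTRFS_BIN variable are made up here so the idea can be tried without
a real btrfs binary; the proper fix belongs in btrfs-progs itself:

```shell
# Hypothetical wrapper: prompt before "subvolume delete", pass
# everything else straight through to the real binary.
# BTRFS_BIN is an assumption for testing without btrfs installed.
btrfs() {
    if [ "$1" = subvolume ] && [ "$2" = delete ] && [ "$3" != -y ]; then
        printf 'Really delete subvolume(s) %s? [y/N] ' "$3" >&2
        read -r answer
        if [ "$answer" != y ]; then
            echo "aborted" >&2
            return 1
        fi
    fi
    command "${BTRFS_BIN:-btrfs}" "$@"
}
```

A -y flag skips the prompt, matching the proposed --yes option.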
-- 
Tomasz Pala <go...@pld-linux.org>
Re: degraded permanent mount option
On Tue, Jan 30, 2018 at 08:46:32 -0500, Austin S. Hemmelgarn wrote:

>> I personally think the degraded mount option is a mistake, as this
>> assumes that a lightly degraded system is not able to work, which is
>> false. If the system can mount to some working state then it should
>> mount regardless of whether it is fully operative or not. If the array
>> is in a bad state you need to learn about it by issuing a command or
>> something. The same goes for an MD array (and yes, I am aware of the
>> block layer vs filesystem thing here).
>
> The problem with this is that right now, it is not safe to run a BTRFS
> volume degraded and writable, but for an even remotely usable system

Mounting read-only is still better than not mounting at all. For
example, my emergency.target has limited network access and starts an
ssh server so I can recover from this situation remotely.

> with pretty much any modern distro, you need your root filesystem to be
> writable (or you need to have jumped through the hoops to make sure /var
> and /tmp are writable even if / isn't).

Easy to handle with systemd. Not only this, but much more is planned:
http://0pointer.net/blog/projects/stateless.html

-- 
Tomasz Pala <go...@pld-linux.org>
Re: degraded permanent mount option
Just one final word, as all was already said:

On Tue, Jan 30, 2018 at 11:30:31 -0500, Austin S. Hemmelgarn wrote:

>> In other words, is it:
>> - systemd that treats btrfs WORSE than distributed filesystems, OR
>> - btrfs that requires from systemd to be treated BETTER than other fss?
>
> Or maybe it's both? I'm more than willing to admit that what BTRFS does
> expose currently is crap in terms of usability. The reason it hasn't
> changed is that we (that is, the BTRFS people and the systemd people)
> can't agree on what it should look like.

It's hard to agree with someone who refuses to do _anything_. You can
choose to follow whatever - MD, LVM, ZFS - invent something totally
different, write a custom daemon or put the timeout logic inside the
kernel itself. It doesn't matter. You know the ecosystem - it is udev
that must be signalled somehow, and systemd WILL follow.

-- 
Tomasz Pala <go...@pld-linux.org>
Re: degraded permanent mount option
> A sub-volume is a BTRFS-specific concept referring to a mostly
> independent filesystem tree within a BTRFS volume that still depends on
> the super-blocks, chunk-tree, and a couple of other internal structures
> from the main filesystem.

LVM volumes also depend on VG metadata. The main btrfs 'volume', which
contains the other subvolumes, is only a technical difference.

>> Great example - how is systemd mounting distributed/network filesystems?
>> Does it mount them blindly, in a loop, or does it fire some checks
>> against _plausible_ availability?
>
> Yes, but availability there is a boolean value.

No, systemd won't try to mount remote filesystems until the network is
up.

> In BTRFS it's tri-state
> (as of right now, possibly four to six states in the future depending on
> what gets merged), and the intermediate (not true or false) state can't
> be checked in a trivial manner.

All udev needs is: "am I ALLOWED to force-mount this, even if degraded".
And this 'permission' must change after a user-supplied timeout.

>> In other words, is it:
>> - systemd that treats btrfs WORSE than distributed filesystems, OR
>> - btrfs that requires from systemd to be treated BETTER than other fss?
>
> Or maybe it's both? I'm more than willing to admit that what BTRFS does
> expose currently is crap in terms of usability. The reason it hasn't
> changed is that we (that is, the BTRFS people and the systemd people)
> can't agree on what it should look like.

This might be ANY way that allows udev to work just like it works with
MD.

>> ...provided there are some measures taken for the premature operation to
>> be repeated. There are none in the btrfs ecosystem.
>
> Yes, because we expect the user to do so, just like LVM, and MD, and
> pretty much every other block layer you're claiming we should be
> behaving like.

MD and LVM export their state, so userspace CAN react. btrfs doesn't.
>> Other init systems either fail at mounting degraded btrfs just like
>> systemd does, or have buggy workarounds reimplemented in each other's
>> code just to handle this thing, which should be centrally organized.
>
> Really? So the fact that I can mount a 2-device volume with RAID1
> profiles degraded using OpenRC without needing anything more than adding
> rootflags=degraded to the kernel parameters must be a fluke then...

We are talking about automatic fallback after a timeout, not about
manually casting magic spells! Since OpenRC doesn't read rootflags at
all:

grep -iE 'rootflags|degraded|btrfs' openrc/**/*

it won't support this without some extra code.

> The thing is, it primarily breaks if there are hardware issues,
> regardless of the init system being used, but at least the other init
> systems _give you an error message_ (even if it's really the kernel
> spitting it out) instead of just hanging there forever with no
> indication of what's going on like systemd does.

If your systemd waits forever and you have no error messages, report a
bug to your distro maintainer, as he is probably the one to blame for
"fixing" what wasn't broken.

-- 
Tomasz Pala <go...@pld-linux.org>
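For context, the signalling path that already exists is the btrfs udev rule
shipped with systemd, which is roughly the following; the trailing comment
sketches a hypothetical extension (not an existing rule) matching the
"permission must change after a user-supplied timeout" idea above:

```
SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

# let the kernel know about this btrfs filesystem, and check if it is complete
IMPORT{builtin}="btrfs ready $devnode"

# mark the device as not ready to be used by the system
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

# Hypothetical extension: if the volume has been marked 'allow-degraded'
# (e.g. by a timeout unit), the "btrfs ready" check could report ready
# even with devices missing, and SYSTEMD_READY would stay "1".

LABEL="btrfs_end"
```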
Re: degraded permanent mount option
On Tue, Jan 30, 2018 at 16:09:50 +0100, Tomasz Pala wrote:

>> BCP for over a
>> decade has been to put multipathing at the bottom, then crypto, then
>> software RAID, then LVM, and then whatever filesystem you're using.
>
> Really? Let's enumerate some caveats of this:
>
> - crypto below software RAID means double encryption (wasted CPU),
>
> - RAID below LVM means you're stuck with the same RAID profile for all
>   the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for the
>   system and RAID0 for various system caches (like ccache on a software
>   builder machine) or transient LVM-level snapshots?
>
> - RAID below the filesystem means losing extra btrfs-RAID functionality,
>   like recovering data from a different mirror when a CRC mismatch
>   happens,
>
> - crypto below LVM means encrypting everything, including data that is
>   not sensitive - more CPU wasted,

And, what is much worse - encrypting everything using the same secret. A
BIG show-stopper. I would shred such a BCP as ineffective and insecure
for both data integrity and confidentiality.

> - RAID below LVM means no way to use SSD acceleration of part of the HDD
>   space using MD write-mostly functionality.

-- 
Tomasz Pala <go...@pld-linux.org>
Re: degraded permanent mount option
On Tue, Jan 30, 2018 at 10:05:34 -0500, Austin S. Hemmelgarn wrote:

>> Instead, they should move their legs continuously and if the train is
>> not on the station yet, just climb back and retry.
>
> No, that's really not a good analogy given the fact that that check for
> the presence of a train takes a normal person milliseconds while the
> event being raced against (the train departing) takes minutes.

OMG... preventing races by "this would always take longer"? Seriously?

> You're already looping forever _waiting_ for the volume to appear. How

udev is waiting for events, not systemd. Nobody will do some crazy
cross-layer shortcuts to overcome others' laziness.

> is that any different from looping forever trying to _mount_ the volume

Yes, because udev doesn't mount anything, ever. Not this binary, dude!

> instead given that failing to mount the volume is not going to damage
> things.

A failed premature attempt to mount prevents the system from booting
WHEN the devices become ready - this is fatal. The system boots randomly
under racy conditions. But hey, "the devices will always appear faster
than the init attempts to do the mount"! Have you ever had some hardware
RAID controller? Never heard of devices appearing after 5 minutes of
warming up?

> The issue here is that systemd refuses to implement any method
> of actually retrying things that fail during startup.

1. Such methods are trivial and I've already mentioned them a dozen
   times.
2. They should be implemented in btrfs upstream, not systemd upstream,
   but I personally would happily help with writing them there.
3. They require the full-circle path of 'allow-degraded' to be passed
   through the btrfs code.

>> mounting BEFORE the volume is complete is FATAL - since no userspace
>> daemon would ever retrigger the mount and the system won't come up.
>> Provide one btrfsd volume manager and systemd could probably switch to
>> using it.
>
> And here you've lost any respect I might have had for you.

Going personal?
So thank you for the discussion and goodbye. Please refrain from
answering me, I'm not going to discuss this any further with you.

> **YOU DO NOT NEED A DAEMON TO DO EVERY LAST TASK ON THE SYSTEM**

Sorry dude, but I won't repeat all the alternatives for the 5th time.
You *all* refuse to step into ANY possible solution mentioned. You *all*
expect systemd to do ALL the job, just like other init systems were
forced to, against good design principles. Good luck having btrfs
degraded mounts under systemd.

> This is one of the two biggest things I hate about systemd (the journal
> is the other one for those who care).

The journal currently has *many* drawbacks, but this is not 'by design',
only 'by appropriate code missing for now'. The same applies to btrfs,
doesn't it?

> You don't need some special daemon to set the time,

Ever heard about NTP?

> or to set the hostname,

FUD - no such daemon.

> or to fetch account data,

FUD.

> or even to track who's logged in

FUD.

> As much as it may surprise the systemd developers, people got on just
> fine handling setting the system time, setting the hostname, fetching
> account info, tracking active users, and any number of myriad other
> tasks before systemd decided they needed to have their own special
> daemon.

Sure, in a myriad of different, scattered, distro-specific files. The
only reason systemd stepped in for some of these is that nobody else
could introduce and enforce a Linux-wide consensus. And if anyone had
succeeded, there would be some Austins blaming them for 'overtaking the
good old trashyard into a coherent de facto standard'.

> In this particular case, you don't need a daemon because the kernel does
> the state tracking.

Sure, MD doesn't require a daemon and LVM doesn't require one either.
But they do provide some - I know, they are all wrong.
-- 
Tomasz Pala <go...@pld-linux.org>
Re: degraded permanent mount option
On Mon, Jan 29, 2018 at 21:44:23 -0700, Chris Murphy wrote:

> Btrfs is orthogonal to systemd's willingness to wait forever while
> making no progress. It doesn't matter what it is, it shouldn't wait
> forever.

It times out after 90 seconds (by default) and then it fails the mount
entirely.

> It occurs to me there are such systemd service units specifically for
> waiting, for example:
>
> systemd-networkd-wait-online.service, systemd-networkd-wait-online -
> Wait for network to come online
>
> chrony-wait.service - Wait for chrony to synchronize system clock
>
> NetworkManager has a version of this. I don't see why there can't be a
> wait for Btrfs to normally mount,

Because mounting a degraded btrfs without -o degraded won't WAIT for
anything; it just immediately returns failure.

> just simply try to mount, it fails, wait 10, try again, wait 10, try
> again.

For the last time: No Such Logic In Systemd CORE.

Every wait/repeat is done using UNITS - as you already noticed yourself.
And these are plain, regular UNITS. Is there anything that prevents YOU,
Chris, from writing these UNITS for btrfs?

I know what makes ME stop writing these units - it's the lack of
feedback from the btrfs.ko ioctl handler. Without this I am unable to
write UNITS handling fstab mount entries, because the logic would
PROBABLY have to be hardcoded inside systemd-fstab-generator. And such
logic MUST NOT be hardcoded - this MUST be user-configurable, i.e. made
at the UNITS level.

You might argue that some distros' SysV units or Gentoo's OpenRC have
support for this, and if you want to change anything it is only a few
lines of shell code to be altered. But systemd-fstab-generator is a
compiled binary and so WON'T allow the behaviour to be
user-configurable.

> And then fail the unit so we end up at a prompt.

This can also be easily done, just like the emergency shell spawns when
configured. If only btrfs could accept and keep information about a
volume being allowed for degraded mount.
OK, to be honest I _can_ write such rules now, keeping the
'allow-degraded' state somewhere else (in a file, for example). But
since this is some non-standardized side channel, such code won't be
accepted into systemd upstream, especially because it requires the
current udev rule to be slightly changed.

-- 
Tomasz Pala <go...@pld-linux.org>
Re: degraded permanent mount option
On Mon, Jan 29, 2018 at 14:00:53 -0500, Austin S. Hemmelgarn wrote:

> We already do so in the accepted standard manner. If the mount fails
> because of a missing device, you get a very specific message in the
> kernel log about it, as is the case for most other common errors (for
> uncommon ones you usually just get a generic open_ctree error). This is
> really the only option too, as the mount() syscall (which the mount
> command calls) returns only 0 on success or -1 and an appropriate errno
> value on failure, and we can't exactly go about creating a half dozen
> new error numbers just for this (well, technically we could, but I very
> much doubt that they would be accepted upstream, which defeats the
> purpose).

This is exactly why the separate communication channel, being the ioctl,
is currently used. And I really don't understand why you fight against
expanding this ioctl response.

> With what you're proposing for BTRFS however, _everything_ is a
> complicated decision, namely:
> 1. Do you retry at all? During boot, the answer should usually be yes,
> but during normal system operation it should normally be no (because we
> should be letting the user handle issues at that point).

This is exactly why I propose to introduce an ioctl in btrfs.ko that
accepts userspace-configured (per-volume policy) expectations.

> 2. How long should you wait before you retry? There is no right answer
> here that will work in all cases (I've seen systems which take multiple
> minutes for devices to become available on boot), especially considering
> those of us who would rather have things fail early.

btrfs-last-resort@.timer, by analogy to mdadm-last-resort@.timer.

> 3. If the retry fails, do you retry again? How many times before it
> just outright fails? This is going to be system specific policy.
> On systems where devices may take a while to come online, the answer is
> probably yes and some reasonably large number, while on systems where
> devices are known to reliably be online immediately, it makes no sense
> to retry more than once or twice.

All of this is a systemd timer/service job.

> 4. If you are going to retry, should you try a degraded mount? Again,
> this is going to be system specific policy (regular users would probably
> want this to be a yes, while people who care about data integrity over
> availability would likely want it to be a no).

Just like above - easily user-configured in systemd timers/services.

> 5. Assuming you do retry with the degraded mount, how many times should
> a normal mount fail before things go degraded? This ties in with 3 and
> has the same arguments about variability I gave there.

As above.

> 6. How many times do you try a degraded mount before just giving up?
> Again, similar variability to 3.
> 7. Should each attempt try first a regular mount and then a degraded
> one, or do you try just normal a couple times and then switch to
> degraded, or even start out trying normal and then start alternating?
> Any of those patterns has valid arguments both for and against it, so
> this again needs to be user configurable policy.
>
> Altogether, that's a total of 7 policy decisions that should be user
> configurable.

All of them are easy to implement if btrfs.ko could accept an
'allow-degraded' per-volume instruction and return 'try-degraded' in the
ioctl.

> Having a config file other than /etc/fstab for the mount
> command should probably be avoided for sanity reasons (again, BTRFS is a
> filesystem, not a volume manager), so they would all have to be handled
> through mount options.
The kernel will additionally have to understand > that those options need to be ignored (things do try to mount > filesystems without calling a mount helper, most notably the kernel when > it mounts the root filesystem on boot if you're not using an initramfs). > All in all, this type of thing gets out of hand _very_ fast. You need to think about the two separately: 1. tracking STATE - this is remembering the 'allow-degraded' option for now, 2. configured POLICY - this is to be handled by the init system. -- Tomasz Pala <go...@pld-linux.org> -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
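The btrfs-last-resort@ units mentioned above could mirror mdadm's existing last-resort units almost verbatim. A sketch, assuming a hypothetical `btrfs degraded-allow` helper that would flip the proposed 'allow-degraded' state via the extended ioctl - the unit names and that command are assumptions, not existing btrfs-progs features:

```ini
# btrfs-last-resort@.timer (hypothetical, modeled on mdadm-last-resort@.timer)
[Unit]
Description=Timeout before allowing degraded mount of btrfs volume %i
DefaultDependencies=no

[Timer]
OnActiveSec=30
```

```ini
# btrfs-last-resort@.service (hypothetical)
[Unit]
Description=Allow degraded assembly of btrfs volume %i
DefaultDependencies=no

[Service]
Type=oneshot
# assumed helper: tells btrfs.ko to report 'try-degraded' for this volume
ExecStart=/usr/bin/btrfs degraded-allow /dev/%i
```

With something like this in place, udev could re-trigger the device after the service runs, and systemd would mount it just as it mounts a timed-out, degraded md array today.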
Re: degraded permanent mount option
er we put this into words, it is btrfs that behaves differently. >> The 'needless complication', as you named it, usually should be the default >> to use. Avoiding LVM? Then take care of repartitioning. Avoiding mdadm? >> No easy way to RAID the drive (there are device-mapper tricks, they are >> just way more complicated). Even attaching SSD cache is not trivial >> without preparations (for bcache being the absolutely necessary, much >> easier with LVM in place). > For a bog-standard client system, all of those _ARE_ overkill (and > actually, so is BTRFS in many cases too, it's just that we're the only > option for main-line filesystem-level snapshots at the moment). Such standard systems don't have multidevice btrfs volumes either, so they are beyond the problem discussed here. >>>> If btrfs pretends to be device manager it should expose more states, >>> >>> But it doesn't pretend to. >> >> Why does mounting sda2 require sdb2 in my setup then? > First off, it shouldn't unless you're using a profile that doesn't > tolerate any missing devices and have provided the `degraded` mount > option. It doesn't in your case because you are using systemd. I have written this previously (19-22 Dec, "Unexpected raid1 behaviour"):
1. create a 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
2. reboot the system into a clean state (init=/bin/sh), (or remove the btrfs-scan tool),
3. try: mount /dev/sda /test - fails; mount /dev/sdb /test - works,
4. reboot again and try in reversed order: mount /dev/sdb /test - fails; mount /dev/sda /test - works.
Mounting btrfs without "btrfs device scan" doesn't work at all without udev rules (that mimic the behaviour of the command). > Second, BTRFS is not a volume manager, it's a filesystem with > multi-device support. What is the designatum difference between 'volume' and 'subvolume'? > The difference is that it's not a block layer, As a de facto design choice only. > despite the fact that systemd is treating it as such. 
Yes, BTRFS has > failure modes that result in regular operations being refused based on > what storage devices are present, but so does every single distributed > filesystem in existence, and none of those are volume managers either. Great example - how is systemd mounting distributed/network filesystems? Does it mount them blindly, in a loop, or does it fire some checks against _plausible_ availability? In other words, is it: - systemd that treats btrfs WORSE than distributed filesystems, OR - btrfs that requires systemd to treat it BETTER than other fss? >> There is a term for such situation: broken by design. > So in other words, it's broken by design to try to connect to a remote > host without pinging it first to see if it's online? Trying to connect to a remote host without checking if OUR network is already up and if the remote target MIGHT be reachable using OUR routes. systemd checks LOCAL conditions: being online in case of network, being online in case of hardware, being online in case of virtual devices. > In all of those cases, there is no advantage to trying to figure out if > what you're trying to do is going to work before doing it, because every ...provided there are some measures taken for the premature operation to be repeated. There is none in the btrfs ecosystem. > There's a name for the type of design you're saying we should have here, > it's called a time of check time of use (TOCTOU) race condition. It's > one of the easiest types of race conditions to find, and also one of the > easiest to fix. Ask any sane programmer, and he will say that _that_ is > broken by design. Explained before. >> And you still blame systemd for using BTRFS_IOC_DEVICES_READY? > Given that it's been proven that it doesn't work and the developers > responsible for it's usage don't want to accept that it doesn't work? Yes. Remove it then. >> Just change the BTRFS_IOC_DEVICES_READY handler to always return READY. 
>> > Or maybe we should just remove it completely, because checking it _IS > WRONG_, That's right. But before committing upstream, check for consequences. I've already described a few today, pointed to the source and gave some possible alternate solutions. > which is why no other init system does it, and in fact no Other init systems either fail at mounting degraded btrfs just like systemd does, or have buggy workarounds in their code, reimplemented in each other, just to handle things that should be centrally organized. -- Tomasz Pala <go...@pld-linux.org>
Re: degraded permanent mount option
On Mon, Jan 29, 2018 at 08:05:42 -0500, Austin S. Hemmelgarn wrote: > Seriously, _THERE IS A RACE CONDITION IN SYSTEMD'S CURRENT HANDLING OF > THIS_. It's functionally no different than prefacing an attempt to send > a signal to a process by checking if the process exists, or trying to > see if some other process is using a file that might be locked by Seriously, there is a race condition at train stations. People check if the train has stopped and opened the door before they move their legs to get in, but the train might be already gone - so this is pointless. Instead, they should move their legs continuously and if the train is not at the station yet, just climb back and retry. See the difference? I hope now you know what the race condition is. It is the condition where the CONSEQUENCES are fatal. Mounting BEFORE the volume is complete is FATAL - since no userspace daemon would ever retrigger the mount and the system won't come up. Provide one btrfsd volume manager and systemd could probably switch to using it. Mounting AFTER the volume is complete is FINE - and if the "pseudo-race" happens and the volume disappears, then this was either some operator action, so the umount SHOULD happen, or we are facing some MALFUNCTION, which is fatal itself, not by being a "race condition". -- Tomasz Pala <go...@pld-linux.org>
Re: degraded permanent mount option
On Sun, Jan 28, 2018 at 17:00:46 -0700, Chris Murphy wrote: > systemd can't possibly need to know more information than a person > does in the exact same situation in order to do the right thing. No > human would wait 10 minutes, let alone literally the heat death of the > planet for "all devices have appeared" but systemd will. And it does We're already repeating - systemd waits for THE btrfs-compound-device, not ALL the block-devices. Just like it 'waits' for someone to plug a USB pendrive in. It is a btrfs choice not to expose the compound device as a separate one (like every other device manager does), it is a btrfs drawback that it doesn't provide anything else except for this IOCTL with its logic, it is a btrfs drawback that there is nothing to push assembling into the "OK, going degraded" state, it is a btrfs drawback that there are no states... I've said it already - pretend the /dev/sda1 device doesn't exist until assembled. If this overlapping usage was designed with 'easier mounting' in mind, this is simply bad design. > that by its own choice, its own policy. That's the complaint. It's > choosing to do something a person wouldn't do, given identical > available information. You are expecting systemd to mix in the functions of the kernel and udev. There is NO concept of 'assembled stuff' in systemd AT ALL. There is NO concept of 'waiting' in udev AT ALL. If you want to do some crazy interlayer shortcuts, just implement btrfsd. > There's nothing the kernel is doing that's > telling systemd to wait for goddamn ever. There's nothing the kernel is doing that's telling udev there IS a degraded device assembled to be used. There's nothing a userspace thing is doing that's telling udev to mark the degraded device as mountable. There is NO DEVICE to be mounted, so systemd doesn't mount it. The difference is: YOU think that the sda1 device is ephemeral, as it's covered by the sda1 btrfs device that COULD BE mounted. 
I think that there is a real sda1 device, following Linux rules of system registration, which CAN be overtaken by an ephemeral btrfs-compound device. Can I mount that thing above the sda1 block device? ONLY when it's properly registered in the system. Does the btrfs-compound-device register in the system? - Yes, but only when fully populated. Just don't expect people will break their code with broken designs just to overcome your own limitations. If you want systemd to mount a degraded btrfs volume, just MAKE IT REGISTER in the system. How can btrfs register in the system being degraded? Either by some userspace daemon handling btrfs volume states (which are missing from the kernel), or by some IOCTLs altering in-kernel states. So for the last time: nobody will break his own code to patch missing code from another (actively maintained) subsystem. If you expect degraded mounts, there are 2 choices: 1. implement the degraded STATE _some_where_ - udev would handle falling back to degraded mount after a specified timeout, 2. change this IOCTL to _always_ return 1 - udev would register any btrfs device, but you will get random behaviour of mounting degraded/populated. But you should expect that, since there is no concept of any state below. Actually, this is ridiculous - you expect the degradation to be handled in some 3rd party software?! In an init system? With the only thing you've got being the 'degraded' mount option?! What next - moving MD and LVM logic into systemd? This is not systemd's job - there are btrfs-specific kernel cmdline options to be parsed (allowing degraded volumes), there is volume health tracking required. Yes, a device manager needs to track its components, a RAID controller needs to track minimum required redundancy. It's not only about mounting. But doing the degraded mounting is easy, only this one particular ioctl needs to be fixed: 1. counted devices not_ready 2. counted devices ok_degraded 3. 
counted devices==all => ok If btrfs DISTINGUISHES these two states, systemd would be able to use them. You might ask why it is important for the state to be kept inside some btrfs-related stuff, like the kernel or btrfsd, while a systemd timer could do the same and 'just mount degraded'. The answer is simple: a systemd.timer is just a sane default CONFIGURATION, that can be EASILY changed by the system administrator. But somewhere, sometime, someone would have a NEED for a totally different set of rules for handling degraded volumes, just like MD or LVM does. It would be totally irresponsible to hardcode any mount-degraded rule inside systemd itself. That is exactly why this must go through udev - udev is responsible for handling devices in the Linux world. How can I register a btrfs device in udev, since it's overlapping the block device? I can't - the ioctl is one-way, it doesn't accept any userspace feedback. -- Tomasz Pala <go...@pld-linux.org>
Re: degraded permanent mount option
On Sun, Jan 28, 2018 at 13:28:55 -0700, Chris Murphy wrote: >> Are you sure you really understand the problem? No mount happens because >> systemd waits for indication that it can mount and it never gets this >> indication. > > "not ready" is rather vague terminology but yes that's how systemd > ends up using the ioctl this rule depends on, even though the rule has > nothing to do with readiness per se. If all devices for a volume If you avoid using THIS ioctl, then you'd have nothing to fire the rule at all. One way or another, this is btrfs that must emit _some_ event or be polled _somehow_. > aren't found, we can correctly conclude a normal mount attempt *will* > fail. But that's all we can conclude. What I can't parse in all of > this is if the udev rule is a one shot, if the ioctl is a one shot, if > something is constantly waiting for "not all devices are found" to > transition to "all devices are found" or what. I can't actually parse It's not one shot. This works like this: sda1 appears -> udev catches event -> udev detects btrfs and IOCTLs => not ready sdb1 appears -> udev catches event -> udev detects btrfs and IOCTLs => ready The end. If there were some other device appearing after assembly, like /dev/md1, or if there were some event generated by btrfs code itself, udev could catch this and follow. Now, if you unplug sdb1, there's no such event at all. Since this IOCTL is the *only* thing that udev can rely on, it cannot be removed from the logic. So even if you create a timer to force assembly, you must do it by influencing the IOCTL response. Or creating some other IOCTL for this purpose, or creating some userspace daemon or whatever. > the two critical lines in this rule. I > > # let the kernel know about this btrfs filesystem, and check if it is complete > IMPORT{builtin}="btrfs ready $devnode" This sends IOCTL. 
> # mark the device as not ready to be used by the system > ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0" ^^ this is the IOCTL response being checked, and SYSTEMD_READY set to 0 prevents systemd from mounting. > I think the Btrfs ioctl is a one shot. Either they are all present or not. The rules are called once per (block) device. So when btrfs scans all the devices to return READY, this would finally be systemd-ready. It is trivial to re-trigger the udev rule (udevadm trigger), but there is no way to force btrfs to return READY after any timeout. > The waiting is a policy by systemd udev rule near as I can tell. There is no problem in waiting or re-triggering. This can be done in ~10 lines of rules. The problem is that the IOCTL won't EVER return READY until ALL the components are present. It's as simple as that: there MUST be some mechanism at the device-manager level that tells if a compound device is mountable, degraded or not; upper layers (systemd-mount) do not care about degradation, handling redundancy/mirrors/chunks/stripes/spares is not its job. It (systemd) can (easily!) handle an expiration timer to push a pending compound device to be force-assembled, but currently there is no way to push. If the IOCTL were extended to return TRYING_DEGRADED (when instructed to do so after an expired timeout), systemd could handle additional per-filesystem fstab options, like x-systemd.allow-degraded. Then it would be possible to have a best-effort policy for rootfs (to make the machine boot), and a stricter one for crucial data (do not mount it when there is no redundancy, wait for operator intervention). -- Tomasz Pala <go...@pld-linux.org>
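Put together, the stock rule logic being quoted above plus the hypothetical TRYING_DEGRADED state could look roughly like this. The rules are simplified from systemd's 64-btrfs.rules; the "2" value and the last match line are assumptions about the proposed extension, nothing like them exists today:

```
# simplified from systemd's 64-btrfs.rules
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"
# let the kernel know about this btrfs filesystem, and check if it is complete
IMPORT{builtin}="btrfs ready $devnode"
# mark the device as not ready to be used by the system
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

# hypothetical: if the extended ioctl reported TRYING_DEGRADED as "2",
# the device could be marked ready so systemd mounts it degraded
ENV{ID_BTRFS_READY}=="2", ENV{SYSTEMD_READY}="1"
LABEL="btrfs_end"
```

The point is that the degraded decision stays in the kernel/udev path; systemd keeps doing what it does for every other device.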
Re: degraded permanent mount option
On Sun, Jan 28, 2018 at 13:02:08 -0700, Chris Murphy wrote: >> Tell me please, if you mount -o degraded btrfs - what would >> BTRFS_IOC_DEVICES_READY return? > > case BTRFS_IOC_DEVICES_READY: > ret = btrfs_scan_one_device(vol->name, FMODE_READ, > &btrfs_fs_type, &fs_devices); > if (ret) > break; > ret = !(fs_devices->num_devices == fs_devices->total_devices); > break; > > > All it cares about is whether the number of devices found is the same > as the number of devices any of that volume's supers claim make up > that volume. That's it. > >> This is not "outsmarting" nor "knowing better", on the contrary, this is >> "FOLLOWING the >> kernel-returned data". The umounting case is simply a bug in btrfs.ko >> that should change to READY state *if* someone has tried and apparently >> succeeded mounting the not-ready volume. > > Nope. That is not what the ioctl does. So who is to blame for creating utterly useless code? Userspace shouldn't depend on some stats (as the number of devices is nothing more than that), but on overall _availability_. I do not care if there are 2, 5 or 100 devices. I do care if there are ENOUGH devices to run normally (including N-way mirroring and hot spares) and if not - if there are ENOUGH devices to run degraded. Having ALL the devices is just the edge case. -- Tomasz Pala <go...@pld-linux.org>
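The ENOUGH-devices check argued for here boils down to pure counting logic, unlike the binary num_devices == total_devices test quoted from the kernel. A sketch of the proposal, not existing kernel or btrfs-progs code; the function name and the min_degraded parameter are assumptions:

```shell
#!/bin/sh
# Three-state readiness check, as proposed in this thread.
# Arguments: present total min_degraded, where min_degraded is the
# fewest devices the profile needs to mount degraded (e.g. total-1
# for a 2-device raid1).
devices_state() {
    present=$1; total=$2; min_degraded=$3
    if [ "$present" -eq "$total" ]; then
        echo ok
    elif [ "$present" -ge "$min_degraded" ]; then
        echo ok_degraded
    else
        echo not_ready
    fi
}

# 2-device raid1 examples:
devices_state 2 2 1   # -> ok
devices_state 1 2 1   # -> ok_degraded
devices_state 0 2 1   # -> not_ready
```

If the ioctl distinguished these three answers instead of two, udev could map "ok" and (after a timeout) "ok_degraded" to SYSTEMD_READY=1 and leave "not_ready" alone.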
Re: degraded permanent mount option
On Sun, Jan 28, 2018 at 01:00:16 +0100, Tomasz Pala wrote: > It can't mount degraded, because the "missing" device might go online a > few seconds ago. s/ago/after/ >> The central problem is the lack of a timer and time out. > > You got mdadm-last-resort@.timer/service above, if btrfs doesn't lack > anything, as you all state here, this should be easy to make this work. > Go ahead please. And just to make it even easier - this is how you can react to events inside udev (this is to eliminate the btrfs-scan tool being required, as it sux): https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f One could even try to trick systemd by SETTING (note the single '=') ENV{ID_BTRFS_READY}="0" - which would probably break as soon as btrfs.ko emits the next 'changed' event. -- Tomasz Pala <go...@pld-linux.org>
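The single-'=' trick above hinges on udev's operator semantics: '==' is a comparison, '=' is an assignment. A minimal illustration; the forced assignment is the fragile hack described in the mail, not a recommendation:

```
# match: act only when the btrfs builtin reported not-ready
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

# assignment: overwrite whatever the ioctl reported - until the next
# 'changed' event re-runs the builtin and clobbers the value again
ENV{ID_BTRFS_READY}="0"
```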
Re: degraded permanent mount option
On Sun, Jan 28, 2018 at 11:06:06 +0300, Andrei Borzenkov wrote: >> All systemd has to do is leave the mount alone that the kernel has >> already done, > > Are you sure you really understand the problem? No mount happens because > systemd waits for indication that it can mount and it never gets this > indication. And even after a successful manual mount (with -o degraded) btrfs.ko insists that the device is not ready. That schizophrenia makes systemd umount it immediately, because this is the only proper way to handle missing devices (only the failed ones should go r/o). And there is really nothing systemd can do about this until the underlying code stops lying, unless we're going back to the 1990s when devices were never unplugged or detached during system uptime. But even floppies could be ejected without a system reboot. BTRFS is no exception here - when marked as 'not available', don't expect it to be kept in use. Just fix the code to match reality. -- Tomasz Pala <go...@pld-linux.org>
Re: degraded permanent mount option
On Sat, Jan 27, 2018 at 15:22:38 +, Duncan wrote: >> manages to mount degraded btrfs without problems. They just don't try >> to outsmart the kernel. > > No kidding. > > All systemd has to do is leave the mount alone that the kernel has > already done, instead of insisting it knows what's going on better than > the kernel does, and immediately umounting it. Tell me please, if you mount -o degraded btrfs - what would BTRFS_IOC_DEVICES_READY return? This is not "outsmarting" nor "knowing better", on the contrary, this is "FOLLOWING the kernel-returned data". The umounting case is simply a bug in btrfs.ko that should change to READY state *if* someone has tried and apparently succeeded mounting the not-ready volume. Otherwise - how should any system part behave when you detach some drive? Insist that "the kernel has already mounted it" and ignore kernel screaming "the device is (not yet there/gone)"? Just update the internal state after successful mount and this particular problem is gone. Unless there is some race condition and the state should be changed before the mount is announced to the userspace. -- Tomasz Pala <go...@pld-linux.org>
Re: degraded permanent mount option
On Sat, Jan 27, 2018 at 14:12:01 -0700, Chris Murphy wrote: > doesn't count devices itself. The Btrfs systemd udev rule defers to > Btrfs kernel code by using BTRFS_IOC_DEVICES_READY. And it's totally > binary. Either they are all ready, in which case it exits 0, and if > they aren't all ready it exits 1. > > But yes, mounting whether degraded or not is sufficiently complicated > that you just have to try it. I don't get the point of wanting to know > whether it's possible without trying. Why would this information be If you want to blind-try it, just tell btrfs.ko to flip the IOCTL bit. No shortcuts please, do it legit, where it belongs. >> Ie, the thing systemd can safely do, is to stop trying to rule everything, >> and refrain from telling the user whether he can mount something or not. > > Right. Open question is whether the timer and timeout can be > implemented in the systemd world and I don't see why not, I certainly It can. The reasons why it's not already there follow:
1. no one has created udev rules and systemd units for btrfs-progs yet (that is trivial),
2. btrfs is not degraded-safe yet (the rules would have to check if the filesystem won't get stuck in read-only mode, for example; this is NOT trivial),
3. there is no way to tell the kernel that we want degraded (probably some new IOCTL) - this is the path the timer would use to trigger the udev event releasing the systemd mount.
Let me repeat this, so it is clear: this is NOT going to work as some systemd shortcut being "mount -o degraded", this must go through the kernel IOCTL -> udev -> systemd path, i.e.: timer expires -> executes IOCTL with "OK, give me degraded /dev/blah" -> BTRFS_IOC_DEVICES_READY returns "READY" (or a new value "DEGRADED") -> udev catches the event and changes SYSTEMD_READY -> systemd mounts the volume. This is really simple. All you need to do is to pass "degraded" to btrfs.ko, so that BTRFS_IOC_DEVICES_READY would return "go ahead". 
-- Tomasz Pala <go...@pld-linux.org>
Re: degraded permanent mount option
ferent purposes [*], 2. lack the timing-out/degraded logic implemented somewhere. > issues a command to start the array anyway, and only then do you find > out if there are enough devices to start it. I don't understand the > value of knowing whether it is possible. Just try to mount it degraded > and then if it fails we fail, nothing can be done automatically it's > up to an admin. It can't mount degraded, because the "missing" device might go online a few seconds ago. > And even if you had this "degraded mount possible" state, you still > need a timer. So just build the timer. Exactly! This "timer" is a btrfs-specific daemon that should be shipped with btrfs-tools. Well, maybe not the actual daemon, as btrfs handles incremental assembly on its own, just the appropriate units and signalling. For mdadm there is --incremental, used for gradual assembly via a udev rule: https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/udev-md-raid-assembly.rules (this also fires a timer) and the systemd part for timing out and the degraded fallback: https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/systemd mdadm-last-resort@.timer mdadm-last-resort@.service There is appropriate code in LVM as well, using lvmetad, but this one is easier. So, let's step through your proposal: > If all devices ready ioctl is true, the timer doesn't start, it means > all devices are available, mount normally. sure > If all devices ready ioctl is false, the timer starts, if all devices > appear later the ioctl goes to true, the timer is belayed, mount > normally. sure > If all devices ready ioctl is false, the timer starts, when the timer > times out, mount normally which fails and gives us a shell to > troubleshoot at. > OR > If all devices ready ioctl is false, the timer starts, when the timer > times out, mount with -o degraded which either succeeds and we boot or > it fails and we have a troubleshooting shell. 
Don't mix layers - just imagine your /dev/sda1 is not there and you simply cannot even try to mount it; this should be done like this: If all devices ready ioctl is false, the timer starts, when the timer times out, TELL THE KERNEL THAT WE WANT DEGRADED MOUNT. This in turn should switch the IOCTL response to "OK, go degraded", which in turn would make the udev rule raise the flag[*] and then systemd could mount this. It is important that the kernel be instructed this way, not the last one, as this gives the chance to pass the degraded option using the cmdline. > The central problem is the lack of a timer and time out. You got mdadm-last-resort@.timer/service above, if btrfs doesn't lack anything, as you all state here, this should be easy to make this work. Go ahead please. >> Unless there is *some* signalling from btrfs, there is really not much >> systemd can *safely* do. > > That is not true. It's not how mdadm works anyway. Yes it does. You can't mount an md array until /dev/mdX appears, which happens when the array gets fully assembled *OR* times out and the kernel gets instructed to run the array as degraded, which results in /dev/mdX appearing. There is NO additional logic in systemd. It is NOT systemd that assembles a degraded array, it is mdadm that tells the kernel to assemble it, and systemd mounts the READY md. Moreover, systemd gives you a set of tools that you can use for timers. [*] the udev flag is required to distinguish the /dev/sda1 block device from the /dev/sda1 btrfs-volume device being ready. If there were a separate device created, there would be no need for this entire IOCTL. -- Tomasz Pala <go...@pld-linux.org>
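The mdadm-style flow described above (timer expires -> instruct the kernel -> the ioctl flips -> udev/systemd mount) can be sketched with the kernel side stubbed out. The devices_ready and allow_degraded helpers are stand-ins for the real ioctl, which does not support any of this today; the loop counts polling ticks instead of sleeping:

```shell
#!/bin/sh
# Last-resort flow sketch; kernel side stubbed, ticks instead of seconds.
TIMEOUT=3          # polling ticks before falling back (the timer)
DEGRADED_WANTED=0  # state the proposed ioctl would track

devices_ready() {
    # stub for BTRFS_IOC_DEVICES_READY: in this sketch the volume only
    # becomes "ready" once degraded mode has been allowed
    [ "$DEGRADED_WANTED" -eq 1 ]
}

allow_degraded() {
    # stub for the proposed "tell the kernel we want degraded" call
    DEGRADED_WANTED=1
}

waited=0
until devices_ready; do
    if [ "$waited" -ge "$TIMEOUT" ]; then
        allow_degraded   # timer expired: flip the in-kernel state
    else
        waited=$((waited + 1))
    fi
done
echo "volume ready (degraded=$DEGRADED_WANTED), systemd may mount now"
```

Note that the mount itself never appears in this logic; as with md, systemd would simply mount whatever device finally reports ready.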
Re: degraded permanent mount option
On Sat, Jan 27, 2018 at 14:26:41 +0100, Adam Borowski wrote: > It's quite obvious who's the culprit: every single remaining rc system > manages to mount degraded btrfs without problems. They just don't try to > outsmart the kernel. Yes. They are stupid enough to fail miserably with any more complicated setups, like stacking volume managers, crypto layers, network attached storage etc. Recently I've started mdadm on top of a bunch of LVM volumes, with others using btrfs and others prepared for crypto. And you know what? systemd assembled everything just fine. So, with an argument just like yours: It's quite obvious who's the culprit: every single remaining filesystem manages to mount under systemd without problems. They just expose information about their state. >> This is not a systemd issue, but apparently btrfs design choice to allow >> using any single component device name also as volume name itself. > > And what other user interface would you propose? The only alternative I see > is inventing a device manager (like you're implying below that btrfs does), > which would needlessly complicate the usual, single-device, case. The 'needless complication', as you named it, usually should be the default to use. Avoiding LVM? Then take care of repartitioning. Avoiding mdadm? No easy way to RAID the drive (there are device-mapper tricks, they are just way more complicated). Even attaching SSD cache is not trivial without preparations (for bcache being the absolutely necessary, much easier with LVM in place). >> If btrfs pretends to be device manager it should expose more states, > > But it doesn't pretend to. Why does mounting sda2 require sdb2 in my setup then? >> especially "ready to be mounted, but not fully populated" (i.e. >> "degraded mount possible"). Then systemd could _fallback_ after timing >> out to degraded mount automatically according to some systemd-level >> option. > > You're assuming that btrfs somehow knows this itself. 
"It's quite obvious who's the culprit: every single volume manager keeps track of its component devices". > Unlike the bogus > assumption systemd does that by counting devices you can know whether a > degraded or non-degraded mount is possible, it is in general not possible to > know whether a mount attempt will succeed without actually trying. There is a term for such a situation: broken by design. > Compare with the 4.14 chunk check patchset by Qu -- in the past, btrfs did > naive counting of this kind, it had to be replaced by actually checking > whether at least one copy of every block group is actually present. And you still blame systemd for using BTRFS_IOC_DEVICES_READY? [...] > just slow to initialize (USB...). So, systemd asks sda how many devices > there are, answer is "3" (sdb and sdc would answer the same, BTW). It can > even ask for UUIDs -- all devices are present. So, mount will succeed, > right? Systemd doesn't count anything, it asks BTRFS_IOC_DEVICES_READY as implemented in btrfs/super.c. > Ie, the thing systemd can safely do, is to stop trying to rule everything, > and refrain from telling the user whether he can mount something or not. Just change the BTRFS_IOC_DEVICES_READY handler to always return READY. -- Tomasz Pala <go...@pld-linux.org>
Re: degraded permanent mount option
On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote: >> I just tested to boot with a single drive (raid1 degraded), even with >> degraded option in fstab and grub, unable to boot ! The boot process stop on >> initramfs. >> >> Is there a solution to boot with systemd and degraded array ? > > No. It is finger pointing. Both btrfs and systemd developers say > everything is fine from their point of view. Treating a btrfs volume as ready by systemd would open a window of opportunity where the volume would be mounted degraded _despite_ all the components being (meaning: "soon to be") ready - just like Chris Murphy wrote; provided there is -o degraded somewhere. This is not a systemd issue, but apparently a btrfs design choice to allow using any single component device name also as the volume name itself. IF a volume has the degraded flag, then it is the btrfs job to mark it as ready: >>>>>> ... and it still does not work even if I change it to root=/dev/sda1 >>>>>> explicitly because sda1 will *not* be announced as "present" to >>>>>> systemd> until all devices have been seen once ... ...so this scenario would obviously and magically start working. As for the regular by-UUID mounts: these links are created by udev WHEN the underlying devices appear. Does a btrfs volume appear? No. If btrfs pretends to be a device manager it should expose more states, especially "ready to be mounted, but not fully populated" (i.e. "degraded mount possible"). Then systemd could _fall back_ to a degraded mount automatically after timing out, according to some systemd-level option. Unless there is *some* signalling from btrfs, there is really not much systemd can *safely* do. -- Tomasz Pala <go...@pld-linux.org>
Re: Unexpected raid1 behaviour
On Tue, Dec 19, 2017 at 17:08:28 -0700, Chris Murphy wrote:

>>>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I safely add "degraded" to the mount options? My primary concern is the
>> [...]
>
> Well it only does rw once, then the next degraded is ro - there are patches dealing with this better but I don't know the state. And there's no resync code that I'm aware of, absolutely it's not good enough to just kick off a full scrub - that has huge performance implications and I'd consider it a regression compared to functionality in LVM and mdadm RAID by default with the write intent bitmap. Without some equivalent short cut, automatic degraded means a

I read about 'scrub' all the time here, so let me ask this directly, as this is also not documented clearly:

1. is a full scrub required after ANY desync (like: a degraded mount followed by re-adding the old device)?
2. if the scrub is omitted - is it possible that btrfs returns invalid data (from the desynced and re-added drive)?
3. is the scrub required to be scheduled on a regular basis?

By 'required' I mean due to design/implementation issues/quirks, _not_ related to possible hardware malfunctions.

-- Tomasz Pala <go...@pld-linux.org>
Re: Unexpected raid1 behaviour
On Fri, Dec 22, 2017 at 14:04:43 -0700, Chris Murphy wrote:

> I'm pretty sure degraded boot timeout policy is handled by dracut. The

Well, last time I checked, dracut on a systemd system couldn't even generate a systemd-less image.

> kernel doesn't just automatically assemble an md array as soon as it's possible (degraded) and then switch to normal operation as other devices appear.

MD devices are explicitly listed in mdadm.conf (for mdadm --assemble --scan), on the kernel command line, or in the metadata of autodetected partitions (type fd).

> I have no idea how LVM manages the delay policy for multiple devices.

I *guess* it's not about waiting, but simply about being executed after the devices are ready. And there is a VERY long history of various init systems having problems booting systems that use multi-layer setups (LVM/MD under or above LUKS, not to mention remote ones that need networking to be set up). All of this works reasonably well under systemd - except for btrfs, which uses a single device node to match an entire group of devices. This is convenient for a living person (no need to switch between /dev/mdX and /dev/sdX), but impossible for userspace tools to guess automatically. There is only a probe IOCTL, which doesn't handle degraded mode.

> I don't think the delay policy belongs in the kernel.

That is exactly why systemd waits for the appropriate udev state.

> It's pie in the sky, and unicorns, but it sure would be nice to have standardization rather than everyone rolling their own solution. The

There was a de facto standard, I think - expose component devices or require them to be specified. Apparently there is no such thing in btrfs, so it must be handled in a btrfs-specific way. Also note that MD can be assembled by the kernel itself, while btrfs cannot (so an initrd is required for the rootfs).
-- Tomasz Pala <go...@pld-linux.org>
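[Editor's note] For contrast with btrfs's single-device-node model, a minimal mdadm.conf sketch shows how MD enumerates its component set up front, which is exactly what userspace needs in order to decide whether a degraded assembly is possible (the UUID below is a placeholder, not a real array):

```
# /etc/mdadm.conf (sketch; UUID is a placeholder)
# Every array declares its identity and component policy, so
# 'mdadm --assemble --scan' and the initramfs know the complete
# member set in advance - no guessing from a single device node.
DEVICE partitions
ARRAY /dev/md0 metadata=1.2 UUID=00000000:00000000:00000000:00000000
```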
Re: Unexpected raid1 behaviour
On Thu, Dec 21, 2017 at 07:27:23 -0500, Austin S. Hemmelgarn wrote:

> No, it isn't. You can just make the damn mount call with the supplied options. If it succeeds, the volume was ready, if it fails, it wasn't, it's that simple, and there's absolutely no reason that systemd can't just do that in a loop until it succeeds or a timeout is reached. That

There is no such loop, so if the mount happened before all the required devices showed up, it would either definitely fail, or, if there were 'degraded' in fstab, just start degraded.

> any of these issues with the volume being completely unusable in a degraded state.
>
> Also, it's not 'up to the filesystem', it's 'up to the underlying device'. LUKS, LVM, MD, and everything else that's an actual device layer is what systemd waits on. XFS, ext4, and any other filesystem except BTRFS (and possibly ZFS, but I'm not 100% sure about that) provides absolutely _NOTHING_ to wait on. Systemd just chose to handle

You wait for all the devices to settle. One might have a dozen drives, including some attached via network, and it might take time for them to become available. Since systemd knows nothing about the underlying components, it simply waits for btrfs itself to announce that it's ready.

> BTRFS like a device layer, and not a filesystem, so we have this crap to

As btrfs handles many devices in its "lower part", it effectively is a device layer. Mounting /dev/sda happens to mount various other /dev/sd* devices that are _not_ explicitly exposed, so there really is no alternative. Except for the 'mount loop', which is a no-go.

> deal with, as well as the fact that it makes it impossible to manually mount a BTRFS volume with missing or failed devices in degraded mode under systemd (because it unmounts it damn near instantly because it somehow thinks it knows better than the user what the user wants to do).

This seems to be some distro-specific misconfiguration; it didn't happen to me on plain systemd/udev.
What is the reproducing scenario?

>> This integration issue was so far silently ignored both by btrfs and systemd developers.
>
> It's been ignored by BTRFS devs because there is _nothing_ wrong on this side other than the naming choice for the ioctl. Systemd is _THE ONLY_ init system which has this issue, every other one works just fine.

Not true - mounting btrfs without "btrfs device scan" doesn't work at all without udev rules (which mimic the behaviour of that command). Let me repeat the example from Dec 19th:

1. create a 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
2. reboot the system into a clean state (init=/bin/sh), (or remove the btrfs-scan tool),
3. try:
   mount /dev/sda /test - fails
   mount /dev/sdb /test - works
4. reboot again and try in reversed order:
   mount /dev/sdb /test - fails
   mount /dev/sda /test - works

> As far as the systemd side, I have no idea why they are ignoring it, though I suspect it's the usual spoiled brat mentality that seems to be present about everything that people complain about regarding systemd.

Explanation above. This is the point where _you_ need to stop ignoring the fact that you simply cannot just try mounting devices in a loop, as this would render any NAS/FC/iSCSI-backed or more complicated systems unusable, or hide problems in case of temporary connection issues. systemd waits for the _underlying_ device - unless btrfs exposes them as a list of _actual_ devices to wait for, there is nothing except waiting for btrfs itself that systemd can do.

-- Tomasz Pala <go...@pld-linux.org>
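[Editor's note] The udev rules mentioned above boil down to a few lines; this is a simplified sketch of the 64-btrfs.rules shipped with systemd (the udev builtin issues BTRFS_IOC_DEVICES_READY against /dev/btrfs-control, which both registers the device with the kernel - like "btrfs device scan" - and reports whether the volume is complete):

```
# 64-btrfs.rules (simplified sketch of the stock systemd rule)
SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

# register the device and ask the kernel whether all members were seen
IMPORT{builtin}="btrfs ready $devnode"

# until the whole volume is complete, systemd must not consider it mountable
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

LABEL="btrfs_end"
```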
Re: Unexpected raid1 behaviour
Errata:

On Wed, Dec 20, 2017 at 09:34:48 +0100, Tomasz Pala wrote:

> /dev/sda -> 'not ready'
> /dev/sdb -> 'not ready'
> /dev/sdc -> 'ready', triggers /dev/sda -> 'not ready' and /dev/sdb - still 'not ready'
> /dev/sdc -> kernel says 'ready', triggers /dev/sda - 'ready' and /dev/sdb -> 'ready'

The last line should start with /dev/sdd.

> After such timeout, I'd like to tell the kernel: "no more devices, give me all the remaining btrfs volumes in degraded mode if possible". By

Actually "if possible" means both:
- if technically possible (i.e. the required data is available, like half of a RAID1),
- AND if allowed for the specific volume, as there might be different policies.

For example - one might allow the rootfs to be started in degraded-rw mode in order for the system to boot up, /home in degraded read-only for the users to have access to their files, and not mount /srv degraded at all. The failed mount can be made non-critical with the 'nofail' fstab flag.

> "give me btrfs volumes" I mean "mark them as 'ready'" so udev could fire its rules. And if there were anything for udev to distinguish 'ready' from 'ready-degraded', one could easily compose some notification scripting on top of it, including sending e-mail to the sysadmin.

-- Tomasz Pala <go...@pld-linux.org>
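[Editor's note] If btrfs ever exposed a "degraded mount possible" state, the per-volume policy described above could be expressed in fstab roughly like this (a hypothetical sketch: 'degraded' is a real btrfs mount option and 'nofail' a real fstab option, but the UUIDs are placeholders and the automatic fallback itself does not exist today):

```
# /etc/fstab sketch - per-volume degraded policy (UUIDs are placeholders)
UUID=AAAA-ROOT  /      btrfs  defaults,degraded            0 0  # boot even degraded, rw
UUID=BBBB-HOME  /home  btrfs  defaults,degraded,ro,nofail  0 0  # degraded, but read-only
UUID=CCCC-SRV   /srv   btrfs  defaults,nofail              0 0  # never mount degraded
```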
Re: Unexpected raid1 behaviour
On Tue, Dec 19, 2017 at 16:59:39 -0700, Chris Murphy wrote:

>> Sth like this? I got such a problem a few months ago, my solution was accepted upstream:
>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f
>
> I can't parse this commit. In particular I can't tell how long it waits, or what triggers the end to waiting.

The point is - it doesn't wait at all. Instead, every 'ready' btrfs device triggers an event on all the pending devices. Consider a 3-device filesystem consisting of /dev/sd[abd], with /dev/sdc being a different, standalone btrfs:

/dev/sda -> 'not ready'
/dev/sdb -> 'not ready'
/dev/sdc -> 'ready', triggers /dev/sda -> 'not ready' and /dev/sdb - still 'not ready'
/dev/sdc -> kernel says 'ready', triggers /dev/sda - 'ready' and /dev/sdb -> 'ready'

This way all the parts of a volume are marked as ready, so systemd won't refuse mounting using legacy device nodes like /dev/sda.

This particular solution depends on the kernel returning 'btrfs ready', which would obviously not work for degraded arrays unless btrfs.ko handled some 'missing' or 'mount_degraded' kernel cmdline option _before_ actually _trying_ to mount it with -o degraded. And there is a logical problem with this - _which_ array components should be ignored? Consider:

volume1: /dev/sda /dev/sdb
volume2: /dev/sdc /dev/sdd-broken

If /dev/sdd is missing from the system, it would never be scanned, so /dev/sdc would be pending. It cannot be assembled just at scan time, because the same would happen with /dev/sda, causing a desync with /dev/sdb, which IS available - a few moments later. This is the place for the timeout you've mentioned - there should be *some* decent timeout allowing all the devices to show up (udev waits 90 seconds by default, or x-systemd.device-timeout=N from fstab). After such a timeout, I'd like to tell the kernel: "no more devices, give me all the remaining btrfs volumes in degraded mode if possible".
By "give me btrfs volumes" I mean "mark them as 'ready'" so udev could fire its rules. And if there were anything for udev to distinguish 'ready' from 'ready-degraded', one could easily compose some notification scripting on top of it, including sending e-mail to the sysadmin.

Is there anything that would make the kernel do the above?

-- Tomasz Pala <go...@pld-linux.org>
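[Editor's note] The retrigger cascade from the referenced systemd commit can be illustrated with a toy model (a hypothetical simulation, not the actual udev code): every device the kernel reports as 'ready' causes all still-pending btrfs devices to be re-probed, so the whole volume flips to 'ready' as soon as its last member appears.

```python
# Toy model of the udev retrigger cascade (not actual udev/systemd code).
# The kernel reports a device 'ready' only once ALL members of its
# volume have been scanned (stand-in for BTRFS_IOC_DEVICES_READY).
volume_of = {"sda": "vol1", "sdb": "vol1", "sdd": "vol1", "sdc": "vol2"}
members = {"vol1": {"sda", "sdb", "sdd"}, "vol2": {"sdc"}}

scanned, ready, pending = set(), set(), set()

def kernel_ready(dev):
    # 'ready' iff every member of this device's volume has been scanned
    return members[volume_of[dev]] <= scanned

def device_appears(dev):
    scanned.add(dev)
    if kernel_ready(dev):
        ready.add(dev)
        # a 'ready' event retriggers every pending device (the commit's idea)
        for p in sorted(pending):
            if kernel_ready(p):
                pending.discard(p)
                ready.add(p)
    else:
        pending.add(dev)

# sda/sdb/sdd form one volume; sdc is a standalone single-device btrfs
for dev in ["sda", "sdb", "sdc", "sdd"]:
    device_appears(dev)

print(sorted(ready), sorted(pending))
```

When sdd finally appears, its 'ready' event drags sda and sdb out of the pending set, matching the sequence described in the message above.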
Re: Unexpected raid1 behaviour
On Tue, Dec 19, 2017 at 15:47:03 -0500, Austin S. Hemmelgarn wrote:

>> Sth like this? I got such a problem a few months ago, my solution was accepted upstream:
>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f
>>
>> Rationale is in referred ticket, udev would not support any more btrfs logic, so unless btrfs handles this itself on kernel level (daemon?), that is all that can be done.
>
> Or maybe systemd can quit trying to treat BTRFS like a volume manager (which it isn't) and just try to mount the requested filesystem with the requested options?

Tried that before ("just mount my filesystem, stupid"), it is a no-go. The problem source is not in systemd treating BTRFS differently, but in the btrfs kernel logic that it uses. Just to show it:

1. create a 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
2. reboot the system into a clean state (init=/bin/sh), (or remove the btrfs-scan tool),
3. try:
   mount /dev/sda /test - fails
   mount /dev/sdb /test - works
4. reboot again and try in reversed order:
   mount /dev/sdb /test - fails
   mount /dev/sda /test - works

THIS readiness is exposed via udev to systemd. And it must be used for multi-layer setups to work (consider stacked LUKS, LVM, MD, iSCSI, FC etc.). In short: until *something* scans all the btrfs components, so the kernel marks it ready, systemd won't even try to mount it.

> Then you would just be able to specify 'degraded' in your mount options, and you don't have to care that the kernel refuses to mount degraded filesystems without being explicitly asked to.

Exactly. But since LP refused to try mounting despite the kernel's "not-ready" state - it is the kernel that must emit 'ready'. So the question is: how can I make the kernel mark a degraded array as "ready"?
The obvious answer is: do it via the kernel command line, just like mdadm does:

rootflags=device=/dev/sda,device=/dev/sdb
rootflags=device=/dev/sda,device=missing
rootflags=device=/dev/sda,device=/dev/sdb,degraded

If only btrfs.ko recognized this, the kernel would be able to assemble a multivolume btrfs itself. Not only would this allow automated degraded mounts, it would also allow using initrd-less kernels on such volumes.

>> It doesn't have to be default, might be a kernel compile-time knob, module parameter or anything else to make the *R*aid work.
> There's a mount option for it per-filesystem. Just add that to all your mount calls, and you get exactly the same effect.

If only they were passed...

-- Tomasz Pala <go...@pld-linux.org>
Re: Unexpected raid1 behaviour
On Tue, Dec 19, 2017 at 15:11:22 -0500, Austin S. Hemmelgarn wrote:

> Except the systems running on those ancient kernel versions are not necessarily using a recent version of btrfs-progs.

It is still much easier to update userspace tools than the kernel (consider binary drivers for various hardware).

> So in other words, spend the time to write up code for btrfs-progs that will then be run by a significant minority of users because people using old kernels usually use old userspace, and people using new kernels won't have to care, instead of working on other bugs that are still affecting people?

I am aware of the dilemma and the answer is: that depends. It depends on the expected usefulness of such infrastructure regarding _future_ changes and possible bugs. In the case of stable/mature/frozen projects this doesn't make much sense, as possible incompatibilities would be very rare. Whether this makes sense for btrfs? I don't know - it's not mature, but if the quirk rate is too high to track appropriate kernel versions, it might really be better to officially state "DO USE 4.14+ kernel, REALLY". This could be accomplished very easily - when releasing new btrfs-progs, check the currently available LTS kernel and use it as the base reference for a warning. After all, "giving users a hurt me button is not ethical programming."

>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I safely add "degraded" to the mount options? My primary concern is the machine UPTIME. I care less about the data, as they are backed up to some remote location and losing a day or a week of changes is acceptable, split-brain as well, while every hour of downtime costs me real money.
> In which case you shouldn't be relying on _ANY_ kind of RAID by itself, let alone BTRFS. If you care that much about uptime, you should be investing in a HA setup and going from there.
> If downtime costs you

I got this handled and don't use btrfs there - the question remains: in the situation described above, is it safe now to add "degraded"? To rephrase the question: can a degraded RAID1 run permanently as rw without some *internal* damage?

>> Anyway, users shouldn't look through syslog, device status should be reported by some monitoring tool.
> This is a common complaint, and based on developer response, I think the consensus is that it's out of scope for the time being. There have been some people starting work on such things, but nobody really got anywhere because most of the users who care enough about monitoring to be interested are already using some external monitoring tool that it's easy to hook into.

I agree, the btrfs code should only emit events, so SomeUserspaceGUIWhatever could display a blinking exclamation mark.

>> Well, the question is: either it is not raid YET, or maybe it's time to consider renaming?
> Again, the naming is too ingrained. At a minimum, you will have to keep the old naming, and at that point you're just wasting time and making things _more_ confusing because some documentation will use the old

True, but once one realizes that the documentation is already flawed, it gets easier. But I still don't know: is it going to be RAID some day? Or won't it be, "by design"?

>> Ha! I got this disabled on every bus (although for different reasons) after boot completes. Lucky me :)
> Security I'm guessing (my laptop behaves like that for USB devices for that exact reason)? It's a viable option on systems that are tightly

Yes, machines are locked and only authorized devices are allowed during boot.

> IOW, if I lose a disk in a two device BTRFS volume set up for replication, I'll mount it degraded, and convert it from the raid1 profile to the single profile and then remove the missing disk from the volume.

I was about to do the same with my r/o-stuck btrfs system; unfortunately I unplugged the wrong cable...
>> Writing accurate documentation requires deep understanding of internals.
[...]
> Writing up something like that is near useless, it would only be valid for upstream kernels (And if you're using upstream kernels and following the advice of keeping up to date, what does it matter anyway? The
[...]
> kernel that fixes the issues it reports.), because distros do whatever the hell they want with version numbers (RHEL for example is notorious for using _ancient_ version numbers but having bunches of stuff back-ported, and most other big distros that aren't Arch, Gentoo, or Slackware derived do so too to a lesser degree), and it would require constant curation to keep up to date. Only for long-term known issues

OK, you've convinced me that a kernel-vs-feature list is overhead. So maybe another approach: just like sys
Re: Unexpected raid1 behaviour
On Tue, Dec 19, 2017 at 12:47:33 -0700, Chris Murphy wrote:

> The more verbose man pages are, the more likely it is that information gets stale. We already see this with the Btrfs Wiki. So are you

True. The same applies to git documentation (3rd paragraph): https://stevebennett.me/2012/02/24/10-things-i-hate-about-git/

Fortunately this CAN be done properly; one of the greatest pieces of documentation I've seen is systemd's.

What I don't like about documentation is the lack of objectivity:

$ zgrep -i bugs /usr/share/man/man8/*btrfs*.8.gz | grep -v bugs.debian.org

Nothing. The old-school manuals all had a BUGS section, even if it was empty. Seriously, nothing appropriate to put in there? Documentation must be symmetric - if it mentions feature X, it must mention at least the most common caveats.

> volunteering to do the btrfs-progs work to easily check kernel versions and print appropriate warnings? Or is this a case of complaining about what other people aren't doing with their time?

This is definitely the second case. You see, I got my issues with btrfs, I already know where to use it and where not. I learned the HARD way and still haven't fully recovered (some dangling r/o, some ENOSPC due to fragmentation etc.). What I /MIGHT/ offer the community is to share my opinions and suggestions. And it's all up to you what you do with this. Whether you blame me for complaining or ignore me - you should realize that _I_do_not_care_, because I already know the things that I write. At least some other guy, some other day, will read this thread, and my opinions might save HIS day. After all, using btrfs should be preceded by research.

No offence, just trying to be honest with you. Because the other thing that I learned the hard way in my life is to listen to the regular users of my products and appreciate any feedback, even if it doesn't suit me.

>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I safely add "degraded" to the mount options?
>> My primary concern is the
[...]
> Btrfs simply is not ready for this use case. If you need to depend on degraded raid1 booting, you need to use mdadm or LVM or hardware raid. Complaining about the lack of maturity in this area? Get in line. Or propose a design and scope of work that needs to be completed to enable it.

I thought the work was already done, if the current kernel handles degraded RAID1 without switching to r/o - doesn't it? Or is something else missing?

-- Tomasz Pala <go...@pld-linux.org>
Re: Unexpected raid1 behaviour
On Tue, Dec 19, 2017 at 12:35:20 -0700, Chris Murphy wrote:

> with a read only file system. Another reason is the kernel code and udev rule for device "readiness" means the volume is not "ready" until all member devices are present. And while the volume is not "ready" systemd will not even attempt to mount. Solving this requires kernel and udev work, or possibly a helper, to wait an appropriate amount of

Sth like this? I got such a problem a few months ago; my solution was accepted upstream: https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f

Rationale is in the referred ticket; udev will not support any more btrfs logic, so unless btrfs handles this itself at the kernel level (daemon?), that is all that can be done.

> time. I also think it's a bad idea to implement automatic degraded mounts unless there's an API for user space to receive either a push
[...]
> There is no amount of documentation that makes up for these deficiencies enough to enable automatic degraded mounts by default. I would consider it a high order betrayal of user trust to do it.

It doesn't have to be the default - it might be a kernel compile-time knob, a module parameter or anything else that makes the *R*aid work.

-- Tomasz Pala <go...@pld-linux.org>
Re: Unexpected raid1 behaviour
On Tue, Dec 19, 2017 at 10:31:40 -0800, George Mitchell wrote:

> I have significant experience as a user of raid1. I spent years using software raid1 and then more years using hardware (3ware) raid1 and now around 3 years using btrfs raid1. I have not found btrfs raid1 to be less reliable than any of the previous implementations of raid. I have

You are aware that in order to prove something one needs only one example? Degraded r/o is such an example, QED. It doesn't matter how long you rode on top of any RAID implementation unless you saw it in action, i.e. had an actual drive malfunction. Did you have a broken drive under btrfs raid?

> a failure, you don't just plug things back in and expect it to be fixed without seriously investigating what has gone wrong and potential unexpected consequences. I have found that even with hardware raid you can find ways to screw things up to the point that you lose your data.

Everything can be screwed up beyond comprehension, but we're talking about PRIMARY objectives. In case of RAID1+ it seems to be obvious: https://en.oxforddictionaries.com/definition/redundancy - unplugging ANY SINGLE drive MUST NOT render the system unusable. It is really as simple as that.

> I have had situations where I reconnected a drive on hardware raid1 only to find that the array would not sync and from there on I ended up having to directly attach one of the drives and recover the partition

I had a situation where replugging a drive started a sync of older data over the newer. So what? This doesn't change a thing - drive reappearance or resync is the RECOVERY part. RECOVERY scenarios are an entirely different thing than REDUNDANCY itself. The RECOVERY phase in some implementations could be an entirely off-line process, and it would still be RAID. Remove the REDUNDANCY part and it's not RAID anymore. If one names a thing an apple, one shouldn't be surprised when others compare it to apples, not oranges.

> table with test disk in order to regain access to my data.
> So NO FORM of raid is a replacement for backups and NO FORM of raid is a replacement for due diligence in recovery from failure mode. Raid gives

And who said it is?

> you a second chance when things go wrong, it does not make failures transparent which is seemingly what we sometimes expect from raid. And

I wouldn't want to worry you, but properly managed RAIDs do make I/J-of-K trivial failures transparent. Just like ECC protects N/M bits transparently. Investigating the reasons is the sysadmin's job, just like other maintenance, including restoring the protection level.

-- Tomasz Pala <go...@pld-linux.org>
Re: Unexpected raid1 behaviour
> This is a matter of opinion.

Sure! And the particular opinion depends on the system being affected. I'd rather not have any split-brain scenario under my database servers, but also wouldn't mind data loss on a BGP router as long as it keeps running and is fully operational.

> I still contend that running half a two device array for an extended period of time without reshaping it to be a single device is a bad idea for cases other than BTRFS. The fewer layers of code you're going through, the safer you are.

I create a single-device degraded MD RAID1 when I attach one disk for deployment (usually test machines) which is going to be converted to dual disks (production) in the future - attaching the second disk to the array is much easier and faster than messing with device nodes (or labels or anything). The same applies to LVM; it's better to have it even when not used at the moment. In case of btrfs there is no need for such preparations, as devices are added without renaming.

However, sometimes the systems end up without the second disk attached - due to their low importance, power usage, or the need to stay quiet. One might ask why I don't attach the second disk before initial system creation - the answer is simple: I usually use the same drive models in RAID1, but it happens that drives bought from the same production lot fail simultaneously, so this approach mitigates the problem and gives more time to react.

> Patches would be gratefully accepted. It's really not hard to update the documentation, it's just that nobody has had the time to do it.

Writing accurate documentation requires deep understanding of internals. Me - for example, I know some of the results: "don't do this", "if X happens, Y should be done", "Z doesn't work yet, but there were some patches", "V was fixed in some recent kernel, but no idea which commit it was exactly", "W was severely broken in kernel I.J.K" etc.
Not the hard data that could be posted without creating the impression that it's all about compiling a complaint list. Not to mention I'm absolutely not familiar with current patches, WIP and many, many other corner cases or usage scenarios. In fact, not only the internals, but the motivation and design principles must be well understood to write a piece of documentation. Otherwise some "fake news" propaganda gets created, just like https://suckless.org/sucks/systemd or other systemd-haters who haven't spent a day of their lives writing SysV init scripts or managing a bunch of mission-critical machines with handcrafted supervisors.

-- Tomasz Pala <go...@pld-linux.org>
Re: Unexpected raid1 behaviour
ted,

3. I was about to fix the volume; accidentally the machine rebooted. Which should do no harm if I had a RAID1.
4. As already said before, using r/w degraded RAID1 is FULLY ACCEPTABLE, as long as you accept "no more redundancy"...
4a. ...or you had an N-way mirror and there is still some redundancy if N>2.

Since we agree that btrfs RAID != common RAID, as there are/were different design principles and some features are in WIP state at best, the current behaviour should be better documented. That's it.

-- Tomasz Pala <go...@pld-linux.org>
Re: Unexpected raid1 behaviour
On Mon, Dec 18, 2017 at 08:06:57 -0500, Austin S. Hemmelgarn wrote:

> The fact is, the only cases where this is really an issue is if you've either got intermittently bad hardware, or are dealing with external

Well, RAID1+ is all about failing hardware.

> storage devices. For the majority of people who are using multi-device setups, the common case is internally connected fixed storage devices with properly working hardware, and for that use case, it works perfectly fine.

If you're talking about "RAID"-0 or storage pools (volume management), that is true. But if you imply that RAID1+ "works perfectly fine as long as the hardware works fine", this is fundamentally wrong. If the hardware needed to work properly for the RAID to work properly, no one would need the RAID in the first place.

> that BTRFS should not care. At the point at which a device is dropping off the bus and reappearing with enough regularity for this to be an issue, you have absolutely no idea how else it's corrupting your data, and support of such a situation is beyond any filesystem (including ZFS).

Support for such a situation is exactly what RAID provides. So don't blame people for expecting this to be handled as long as you call the filesystem feature 'RAID'. If this feature is not going to mitigate hardware hiccups by design (as opposed to "not implemented yet, needs some time", which is perfectly understandable), just don't call it 'RAID'. All the features currently working, like bit-rot mitigation for duplicated data (dup/raid*) using checksums, are something different from RAID itself. RAID means "survive failure of N devices/controllers" - I got one "RAID1" stuck in r/o after a degraded mount, not nice... Not something _expected_ to happen after a single disk failure (without any reappearing).
-- Tomasz Pala <go...@pld-linux.org>
Re: exclusive subvolume space missing
On Tue, Dec 12, 2017 at 08:50:15 +0800, Qu Wenruo wrote:

> Even without snapshot, things can easily go crazy.
>
> This will write a 128M file (max btrfs file extent size) and write it to disk.
> # xfs_io -f -c "pwrite 0 128M" -c "sync" /mnt/btrfs/file
>
> Then, overwrite the 1~128M range.
> # xfs_io -f -c "pwrite 1M 127M" -c "sync" /mnt/btrfs/file
>
> Guess your real disk usage, it's 127M + 128M = 255M.
>
> The point here: if there is any reference to a file extent, the whole
> extent won't be freed, even if it's only 1M of a 128M extent.

OK, /this/ is scary. I guess nocow prevents this behaviour? I have
chattr'ed +C the file eating my space and it ceased.

> Are you pre-allocating the file before write using tools like dd?

I have no idea, this could be checked in the source of
http://pam-abl.sourceforge.net/
But this is plain Berkeley DB (5.3 in my case)... which scares me even
more:

$ rpm -q --what-requires 'libdb-5.2.so()(64bit)' 'libdb-5.3.so()(64bit)' | wc -l
14
# ipoldek desc -B db5.3
Package: db5.3-5.3.28.0-4.x86_64
Required(by): apache1-base, apache1-mod_ssl, apr-util-dbm-db, bogofilter,
c-icap, c-icap-srv_url_check, courier-authlib, courier-authlib-authuserdb,
courier-imap, courier-imap-common, cyrus-imapd, cyrus-imapd-libs,
cyrus-sasl, cyrus-sasl-sasldb, db5.3-devel, db5.3-utils, dnshistory,
dsniff, evolution-data-server, evolution-data-server-libs, exim, gda-db,
ggz-server, heimdal-libs-common, hotkeys, inn, inn-libs, isync, jabberd,
jigdo, jigdo-gtk, jnettop, libetpan, libgda3, libgda3-devel, libhome,
libqxt, libsolv, lizardfs-master, maildrop, moc, mutt, netatalk,
nss_updatedb, ocaml-dbm, opensips, opensmtpd, pam-pam_abl, pam-pam_ccreds,
perl-BDB, perl-BerkeleyDB, perl-BerkeleyDB, perl-DB_File, perl-URPM,
perl-cyrus-imapd, php4-dba, php52-dba, php53-dba, php54-dba, php55-dba,
php56-dba, php70-dba, php70-dba, php71-dba, php71-dba, php72-dba,
php72-dba, postfix, python-bsddb, python-modules, python3-bsddb3, redland,
ruby-modules, sendmail, squid-session_acl, squid-time_quota_acl,
squidGuard, subversion-libs, swish-e, tomoe-svn, webalizer-base, wwwcount

OK, not many user applications here, as they mostly use sqlite. I wonder
how this one db-library behaves:

$ find . -name \*.sqlite | xargs ls -gGhS | head -n1
-rw-r--r-- 1 15M 2017-12-08 12:14 ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite
$ ~/fiemap ./.mozilla/firefox/*.default/extension-data/ublock0.sqlite | head -n1
File ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite has 128 extents:

At least every $HOME/{.{,c}cache,tmp} should be +C...

> And if possible, use nocow for this file.

Actually, it should be officially advised to use +C for the entire /var
tree and every other tree that might be exposed to hostile write
patterns, like /home or /tmp (if held on btrfs). I'd say that from a
security point of view nocow should be the default, unless specified for
a mount or a specific file... Currently, if I mount with nocow, there is
no way to whitelist trusted users or secure locations, and until
btrfs-specific options can be handled per subvolume, there is really no
alternative.
--
Tomasz Pala <go...@pld-linux.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
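The quoted xfs_io experiment is easy to condense into a sketch. The mount point below is hypothetical (a scratch btrfs filesystem), and the small helper only captures the arithmetic Qu describes - the whole old extent stays allocated as long as any byte of it is referenced, on top of the newly CoW-ed data:

```shell
# Reproduction of the quoted scenario (do NOT run on a filesystem you
# care about; /mnt/scratch is a hypothetical scratch btrfs mount):
#   xfs_io -f -c "pwrite 0 128M" -c "sync" /mnt/scratch/file   # one 128M extent
#   xfs_io -f -c "pwrite 1M 127M" -c "sync" /mnt/scratch/file  # CoW rewrite
#   btrfs fi du /mnt/scratch/file    # ~128M referenced, ~255M kept on disk

# Space pinned by one partially rewritten extent: the full old extent
# plus the rewritten range that was CoW-ed elsewhere.
pinned_mib() {  # usage: pinned_mib <extent_MiB> <rewritten_MiB>
    echo $(( $1 + $2 ))
}
pinned_mib 128 127   # 255
```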
Re: exclusive subvolume space missing
On Mon, Dec 11, 2017 at 07:44:46 +0800, Qu Wenruo wrote:

>> I could debug something before I'll clean this up, is there anything you
>> want me to check/know about the files?
>
> fiemap result along with btrfs dump-tree -t2 result.

fiemap attached, but dump-tree requires an unmounted fs, doesn't it?

>> - I've lost 3.6 GB during the night with a reasonably small
>> amount of writes, I guess it might be possible to trash the entire
>> filesystem within 10 minutes if doing this on purpose.
>
> That's a little complex.
> To get into such situation, snapshot must be used and one must know
> which file extent is shared and how it's shared.

A hostile user might assume that any of his own files old enough have
been snapshotted. Unless snapshots are not used at all...

The 'obvious' solution would be for quotas to limit the data size
including extents lost due to fragmentation, but this is not a real
solution, as users don't care about fragmentation. So we're back to
square one.

> But as I mentioned, XFS supports reflink, which means file extent can be
> shared between several inodes.
>
> From the message I got from XFS guys, they free any unused space of a
> file extent, so it should handle it quite well.

Forgive my ignorance, as I'm not familiar with the details, but isn't
the problem 'solvable' by reusing space freed from the same extent for
any single (i.e. the same) inode? This would certainly increase
fragmentation of a file, but reduce extent usage significantly.

Still, I don't comprehend the cause of my situation. If - after doing a
defrag (after snapshotting whatever was already trashed) - btrfs decides
to allocate new extents for the file, why doesn't it use them
efficiently as long as I'm not doing snapshots anymore? I'm attaching
the second fiemap, of the same file, from the last snapshot taken.
According to this one-liner:

for i in `awk '{print $3}' fiemap`; do grep $i fiemap_old; done

the current file doesn't share any physical locations with the old one.
But it still grows, so what does this situation have to do with
snapshots anyway?

Oh, and BTW - 900+ extents for ~5 GB taken means there is about 5.5 MB
occupied per extent. How is that possible?
--
Tomasz Pala <go...@pld-linux.org>

File log.14 has 933 extents:
#   Logical     Physical      Length  Flags
 0:             00297a001000  1000
 1: 1000        00297aa01000  1000
 2: 2000        002979ffe000  1000
 3: 3000        00297d1fc000  1000
 4: 4000        00297e5f7000  1000
 5: 5000        00297d1fe000  1000
 6: 6000        00297c7f4000  1000
 7: 7000        00297dbf9000  1000
 8: 8000        00297eff3000  1000
 9: 9000        0029821c7000  1000
10: a000        002982bbf000  1000
11: b000        0029803e      1000
12: c000        00297b40      1000
13: d000        002979601000  1000
14: e000        002980dd5000  1000
15: f000        0029821be000  1000
16: 0001        00298715f000  1000
17: 00011000    002985d71000  1000
18: 00012000    00298537f000  1000
19: 00013000    00298676      1000
20: 00014000    00298498d000  1000
21: 00015000    0029821b4000  1000
22: 00016000    0029817c7000  1000
23: 00017000    00298a2fa000  1000
24: 00018000    002988f1f000  1000
25: 00019000    00298d47f000  1000
26: 0001a000    00298c0af000  1000
27: 0001b000    00298a2ee000  1000
28: 0001c000    00298a2eb000  1000
29: 0001d000    0029905f2000  1000
30: 0001e000    00298f22a000  1000
31: 0001f000    00298de66000  1000
32: 0002        00298ace3000  1000
33: 00021000    00298a2e9000  1000
34: 00022000    00298a2e7000  1000
35: 00023000    00298b6c3000  1000
36: 00024000    002990fd5000  1000
37: 00025000    002992d6c000  1000
38: 00026000    0029954db000  1000
39: 00027000    002993747000  1000
40: 00028000    002992d
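The per-extent arithmetic at the end ("900+ extents for ~5 GB") can be checked with a throwaway helper; the numbers below are the ones from this thread:

```shell
# Average space held per extent given total occupied size and extent
# count (as reported by fiemap); purely illustrative arithmetic.
avg_extent_mib() {  # usage: avg_extent_mib <total_MiB> <extent_count>
    awk -v t="$1" -v n="$2" 'BEGIN { printf "%.1f\n", t / n }'
}
avg_extent_mib 5120 933   # ~5.5 MiB held per extent for ~5 GiB occupied
```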
Re: exclusive subvolume space missing
On Sun, Dec 10, 2017 at 12:27:38 +0100, Tomasz Pala wrote:

> I have found a directory - pam_abl databases, which occupy 10 MB (yes,
> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after

# df
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        64G   61G  2.8G  96% /
# btrfs fi du .
     Total   Exclusive  Set shared  Filename
     0.00B       0.00B           -  ./1/__db.register
  10.00MiB    10.00MiB           -  ./1/log.01
  16.00KiB       0.00B           -  ./1/hosts.db
  16.00KiB       0.00B           -  ./1/users.db
 168.00KiB       0.00B           -  ./1/__db.001
  40.00KiB       0.00B           -  ./1/__db.002
  44.00KiB       0.00B           -  ./1/__db.003
  10.28MiB    10.00MiB           -  ./1
     0.00B       0.00B           -  ./__db.register
  16.00KiB    16.00KiB           -  ./hosts.db
  16.00KiB    16.00KiB           -  ./users.db
  10.00MiB    10.00MiB           -  ./log.13
     0.00B       0.00B           -  ./__db.001
     0.00B       0.00B           -  ./__db.002
     0.00B       0.00B           -  ./__db.003
  20.31MiB    20.03MiB   284.00KiB  .
# btrfs fi defragment log.13
# df
/dev/sda2        64G   54G  9.4G  86% /

6.6 GB / 10 MB = 660:1 overhead within 1 day of uptime.
--
Tomasz Pala <go...@pld-linux.org>
Re: ERROR: failed to repair root items: Input/output error
On Sun, Dec 10, 2017 at 15:18:32 +, constantine wrote:

> I have a laptop root hard drive (Samsung SSD 850 EVO 1TB), which is
> within warranty.
> I can't mount it read-write ("no rw mounting after error").

There is a data-corruption issue with this controller! The same as the
840 EVO - just google it. In short: either use a recent kernel (AFAIR
4.0.5+ for the 840 EVO, and a somewhat newer one for the blacklisting of
the entire Samsung 8* SSD family) or disable NCQ.

Using queued TRIM on this drive leads to data loss! The firmware zeroes
the first 512 bytes of a block, sorry. If only you had a smaller drive,
as 850s up to 512 GB have a different controller...

> checksum verify failed on 103009173504 found 25334496 wanted 3500
> bytenr mismatch, want=103009173504, have=889192478
> ERROR: failed to repair root items: Input/output error
>
> What do these errors mean?
> What should I do to fix the filesystem and be able to mount it read-write?

You probably can't fix this - there is data missing on the bare metal,
so you should recover using backups. If you don't have any, you need to
perform manual data recovery procedures (like photorec) with little
chance of restoring complete files, due to the nature of the data loss
(beginnings of blocks).
--
Tomasz Pala <go...@pld-linux.org>
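For reference, NCQ (and with it queued TRIM) can be disabled without a kernel upgrade; this is a sketch of the usual workaround, with a hypothetical device name:

```shell
# Runtime: force the queue depth to 1, which effectively disables NCQ,
# so queued TRIM commands are never issued to the drive (needs root):
echo 1 > /sys/block/sda/device/queue_depth

# Persistent alternative on the kernel command line:
#   libata.force=noncq
# Newer kernels instead blacklist only queued TRIM for the affected
# Samsung models (ATA_HORKAGE_NO_NCQ_TRIM), which is why upgrading helps.
```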
Re: exclusive subvolume space missing
On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:

>> 1. is there any switch resulting in 'defrag only exclusive data'?
>
> IIRC, no.

I have found a directory - pam_abl databases, which occupy 10 MB (yes,
TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
defrag.

After defragging, the files were not snapshotted again and I've lost
3.6 GB again, so I've got this fully reproducible. There are 7 files,
one of which makes up 99% of the space (10 MB). None of them has nocow
set, so they're riding all-btrfs.

I could debug something before I clean this up - is there anything you
want me to check/know about the files?

The fragmentation impact is HUGE here; a 1000:1 ratio is almost a DoS
condition which could be triggered by a malicious user within a few
hours or faster - I've lost 3.6 GB during the night with a reasonably
small amount of writes, so I guess it might be possible to trash the
entire filesystem within 10 minutes if doing this on purpose.

>> 3. I guess there aren't, so how could I accomplish my target, i.e.
>>    reclaiming space that was lost due to fragmentation, without breaking
>>    snapshotted CoW where it would be not only pointless, but actually harmful?
>
> What about using old kernel, like v4.13?

Unfortunately (I guess you had 3.13 in mind), I need the new ones and
will be pushing towards 4.14.

>> 4. How can I prevent this from happening again? All the files, that are
>>    written constantly (stats collector here, PostgreSQL database and
>>    logs on other machines), are marked with nocow (+C); maybe some new
>>    attribute to mark file as autodefrag? +t?
>
> Unfortunately, nocow only works if there is no other subvolume/inode
> referring to it.

This shouldn't be my case anymore after defrag (== breaking links).
I guess there is no easy way to check refcounts of the blocks?

> But in my understanding, btrfs is not suitable for such conflicting
> situation, where you want to have snapshots of frequent partial updates.
>
> IIRC, btrfs is better for use case where either update is less frequent,
> or update is replacing the whole file, not just part of it.
>
> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which
> is pointing to /usr/bin and /usr/lib), but not for /var or /run.

That is coherent with my own conclusions after 2 years on btrfs;
however, I didn't expect a single file to eat 1000 times more space than
it should... I wonder how many other filesystems were trashed like this
- I'm short of ~10 GB on another system; many other users might be
affected by that (telling the Internet stories about btrfs running out
of space).

It is not a problem that I need to defrag a file; the problem is that
I don't know:
1. whether I need to defrag,
2. *what* I should defrag,
nor do I have a tool that would defrag smart - only the exclusive data
or, in general, the blocks that are worth defragging, i.e. where the
space released from extents is greater than the space lost on
inter-snapshot duplication. I can't just defrag the entire filesystem,
since that breaks links with snapshots. This change was a real
deal-breaker here...

Any way to feed the deduplication code with snapshots maybe? There are
directories and files in the same layout; this could be fast-tracked to
check and deduplicate.
--
Tomasz Pala <go...@pld-linux.org>
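Since no 'defrag only exclusive data' switch exists, a user-space approximation is to defragment only files that 'btrfs fi du' reports as having nothing shared. A hedged sketch: the column layout is assumed from btrfs-progs 4.x output, and the path in the commented driver loop is hypothetical:

```shell
# Extract the 'Set shared' column from 'btrfs fi du --raw <file>' output:
# line 1 is the header, line 2 the file itself; column 3 is Set shared.
set_shared_bytes() {
    awk 'NR == 2 { print $3 }'
}

# Driver sketch (uncomment on a real filesystem; path is hypothetical):
# find /var/lib/collectd -xdev -type f | while read -r f; do
#     s=$(btrfs fi du --raw "$f" | set_shared_bytes)
#     # only defragment files with no cross-snapshot sharing, so the
#     # CoW links to existing snapshots survive:
#     [ "$s" = "0" ] && btrfs fi defragment "$f"
# done
```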
Re: exclusive subvolume space missing
On Sun, Dec 03, 2017 at 01:45:45 +, Duncan wrote:

> OTOH, it's also quite possible that people chose btrfs at least partly
> for other reasons, say the "storage pool" qualities, and would rather

Well, to name some:

1. filesystem-level backups via snapshot/send/receive - much cleaner and
   faster than rsyncs or other old-fashioned methods. This obviously
   requires the CoW-once feature;
   - caveat: for btrfs-killing usage patterns all the snapshots but the
     last one need to be removed;

2. block-level checksums with RAID1-awareness - in contrast to mdadm
   RAIDx, which chooses a random data copy from the underlying devices,
   this is much less susceptible to bit rot;
   - caveats: requires CoW enabled, RAID1 reading is dumb (even/odd PID
     instead of real balancing), no N-way mirroring nor write-mostly flag;

3. compression - there is no real alternative, however:
   - caveat: requires CoW enabled, which makes it not suitable for
     ...systemd journals, which compress with a great ratio (c.a. 1:10),
     nor for various databases, as they will be nocowed sooner or later;

4. storage pools you've mentioned - they are actually not much superior
   to the LVM-based approach; until one can create a subvolume with a
   different profile (e.g. 'disable RAID1 for /var/log/journal'), it is
   still better to create separate filesystems, meaning one has to use
   LVM or (the hard way) partitioning.

Some of the drawbacks above are inherent to CoW and so shouldn't be
expected to be fixed internally, as the needs are conflicting, but their
impact might be nullified by some housekeeping.
--
Tomasz Pala <go...@pld-linux.org>
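Point 1 above (snapshot/send/receive backups) can be sketched as an incremental cycle. Everything here is hypothetical (paths and snapshot names); the helper prints the commands instead of running them:

```shell
# Print one incremental backup cycle: take a read-only snapshot, send
# only the delta against the previous snapshot, then retire the old one.
plan_backup() {  # usage: plan_backup <src-root> <prev-snap> <new-snap> <dest>
    cat <<EOF
btrfs subvolume snapshot -r $1 $1/$3
btrfs send -p $1/$2 $1/$3 | btrfs receive $4
btrfs subvolume delete $1/$2
EOF
}
plan_backup /mnt/root snapshot-171125 snapshot-171126 /backup
```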
Re: exclusive subvolume space missing
On Sat, Dec 02, 2017 at 17:28:12 +0100, Tomasz Pala wrote:

>> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
> [...]
>> Now make various small changes to the file, say under 16 KiB each. These
>> will each be COWed elsewhere as one might expect. by default 16 KiB at
>> a time I believe (might be 4 KiB, as it was back when the default leaf

I got ~500 small files (100-500 kB) updated partially in regular
intervals:

# du -Lc **/*.rrd | tail -n1
105M    total

>> But here's the kicker. Even without a snapshot locking that original 100
>> MiB extent in place, if even one of the original 16 KiB blocks isn't
>> rewritten, that entire 100 MiB extent will remain locked in place, as the
>> original 16 KiB blocks that have been changed and thus COWed elsewhere
>> aren't freed one at a time, the full 100 MiB extent only gets freed, all
>> at once, once no references to it remain, which means once that last
>> block of the extent gets rewritten.

OTOH - should this happen with nodatacow files? As I mentioned before,
these files are chattr'ed +C (however, this was not their initial state,
due to https://bugzilla.kernel.org/show_bug.cgi?id=189671 ).

Am I wrong in thinking that in such a case they should occupy twice
their size maximum? Or maybe there is some tool that could show me the
real space wasted by a file, including extent counts etc.?
--
Tomasz Pala <go...@pld-linux.org>
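One gotcha behind the bug report linked above: chattr +C only takes effect on files that are empty when the flag is set, so the usual recipe is to flag the directory and recreate existing files. A minimal sketch (the attribute is btrfs-specific, hence the fallback branch; the recreate step is shown only as a comment):

```shell
d=$(mktemp -d)   # stand-in for a real data directory, e.g. the .rrd dir
if chattr +C "$d" 2>/dev/null; then
    # New files created under "$d" now inherit nodatacow. Existing files
    # must be rewritten to drop their CoW extents, e.g.:
    #   cp --reflink=never "$f" "$f.new" && mv "$f.new" "$f"
    echo "nodatacow set"
else
    # chattr +C is only meaningful on btrfs (and chattr may be absent)
    echo "nodatacow not supported here"
fi
rmdir "$d"
```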
Re: exclusive subvolume space missing
On Fri, 01 Dec 2017 18:57:08 -0800, Duncan wrote:

> OK, is this supposed to be raid1 or single data, because the above shows
> metadata as all raid1, while some data is single tho most is raid1, and
> while old mkfs used to create unused single chunks on raid1 that had to
> be removed manually via balance, those single data chunks aren't unused.

It is supposed to be RAID1; the single data chunks were leftovers from
my previous attempts to gain some space by converting to the single
profile. That conversion miserably failed BTW (would it have been
smarter with the "soft" option?), but I've already managed to clear
this up.

> Assuming the intent is raid1, I'd recommend doing...
>
> btrfs balance start -dconvert=raid1,soft /

Yes, this was the way to go. It also reclaimed the 8 GB. I assume the
failing -dconvert=single somehow locked that 8 GB, so this issue should
be addressed in btrfs-tools to report such locked-out regions.

You've already noted that the single-profile data occupied much less by
itself. So this was the first issue; the second is the running overhead
that accumulates over time. Since yesterday, when I had 19 GB free, I've
lost 4 GB already. The scenario you've described is very probable:

> btrfs balance start -dusage=N /
[...]
> allocated value toward usage. I too run relatively small btrfs raid1s
> and would suggest trying N=5, 20, 40, 70, until the spread between

There were no effects above N=10 (both dusage and musage).

> consuming your space either, as I'd suspect they might if the problem were
> for instance atime updates, so while noatime is certainly recommended and

I have used noatime by default for years, so that's not the source of
the problem here.

> The other possibility that comes to mind here has to do with btrfs COW
> write patterns...
> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
[...]
> Now make various small changes to the file, say under 16 KiB each. These
> will each be COWed elsewhere as one might expect. by default 16 KiB at
> a time I believe (might be 4 KiB, as it was back when the default leaf

I got ~500 small files (100-500 kB) updated partially in regular
intervals:

# du -Lc **/*.rrd | tail -n1
105M    total

> But here's the kicker. Even without a snapshot locking that original 100
> MiB extent in place, if even one of the original 16 KiB blocks isn't
> rewritten, that entire 100 MiB extent will remain locked in place, as the
> original 16 KiB blocks that have been changed and thus COWed elsewhere
> aren't freed one at a time, the full 100 MiB extent only gets freed, all
> at once, once no references to it remain, which means once that last
> block of the extent gets rewritten.
>
> So perhaps you have a pattern where files of several MiB get mostly
> rewritten, taking more space for the rewrites due to COW, but one or
> more blocks remain as originally written, locking the original extent
> in place at its full size, thus taking twice the space of the original
> file.
>
> Of course worst-case is rewrite the file minus a block, then rewrite
> that minus a block, then rewrite... in which case the total space
> usage will end up being several times the size of the original file!
>
> Luckily few people have this sort of usage pattern, but if you do...
>
> It would certainly explain the space eating...

Has anyone investigated how this relates to RRD rewrites? I don't use
rrdcached; I never thought that 100 MB of data might trash an entire
filesystem...

best regards,
--
Tomasz Pala <go...@pld-linux.org>
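Duncan's worst case ("rewrite the file minus a block, then rewrite that minus a block...") has simple arithmetic behind it. A hedged sketch, assuming each pass leaves one original block referenced and therefore pins the whole previous extent:

```shell
# After k such rewrite passes over an f-MiB file, roughly (k+1) * f MiB
# stays allocated: the original extent plus one near-full copy per pass.
worst_case_mib() {  # usage: worst_case_mib <file_MiB> <passes>
    echo $(( $1 * ($2 + 1) ))
}
worst_case_mib 100 4   # 500: a 100 MiB file holding half a GiB
```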
Re: exclusive subvolume space missing
OK, I seriously need to address this, as during the night I lost 3 GB
again:

On Sat, Dec 02, 2017 at 10:35:12 +0800, Qu Wenruo wrote:

>> # btrfs fi sh /
>> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
>> Total devices 2 FS bytes used 44.10GiB
   Total devices 2 FS bytes used 47.28GiB              <- now
>> # btrfs fi usage /
>> Overall:
>> Used: 88.19GiB
   Used: 94.58GiB                                      <- now
>> Free (estimated): 18.75GiB (min: 18.75GiB)
   Free (estimated): 15.56GiB (min: 15.56GiB)          <- now
>>
>> # btrfs dev usage / - output not changed
>> # btrfs fi df /
>> Data, RAID1: total=51.97GiB, used=43.22GiB
   Data, RAID1: total=51.97GiB, used=46.42GiB          <- now
>> System, RAID1: total=32.00MiB, used=16.00KiB
>> Metadata, RAID1: total=2.00GiB, used=895.69MiB
>> GlobalReserve, single: total=131.14MiB, used=0.00B
   GlobalReserve, single: total=135.50MiB, used=0.00B  <- now
>>
>> # df
>> /dev/sda2  64G  45G  19G  71% /
   /dev/sda2  64G  48G  16G  76% /                     <- now

>> However the difference is on the active root fs:
>>
>> -0/291  24.29GiB   9.77GiB
>> +0/291  15.99GiB  76.00MiB
   0/291  19.19GiB  3.28GiB                            <- now

> Since you have already showed the size of the snapshots, which hardly
> goes beyond 1G, it may be possible that extent booking is the cause.
>
> And considering it's all exclusive, defrag may help in this case.

I'm going to try defrag here, but I have a bunch of questions first; as
defrag breaks CoW, I don't want to defrag files that span multiple
snapshots unless they have huge overhead:

1. is there any switch resulting in 'defrag only exclusive data'?

2. is there any switch resulting in 'defrag only extents fragmented more
   than X' or 'defrag only fragments that would possibly be freed'?

3. I guess there aren't, so how could I accomplish my target, i.e.
   reclaiming space that was lost due to fragmentation, without breaking
   snapshotted CoW where it would be not only pointless, but actually
   harmful?

4. How can I prevent this from happening again? All the files that are
   written constantly (a stats collector here, PostgreSQL databases and
   logs on other machines) are marked with nocow (+C); maybe some new
   attribute to mark a file as autodefrag? +t?

   For example, the largest file from the stats collector:

        Total   Exclusive  Set shared  Filename
   432.00KiB   176.00KiB   256.00KiB   load/load.rrd

   but most of them have 'Set shared' == 0.

5. The stats collector has been running from the beginning; according to
   the quota output it was not the issue until something happened. If
   the problem was triggered by (guessing) a low-space condition, and it
   results in even more space lost, there is a positive feedback loop
   that is dangerous, as it makes any filesystem unstable ("once you run
   out of space, you won't recover"). Does it mean btrfs is simply not
   suitable (yet?) for frequent-update usage patterns, like RRD files?

6. Or maybe some extra steps should be taken just before taking a
   snapshot? I guess 'defrag exclusive' would be perfect here -
   reclaiming space before it is locked inside a snapshot. The rationale
   behind this is obvious: since the snapshot-aware defrag was removed,
   allow defragging snapshot-exclusive data only. This would of course
   result in partial file defragmentation, but that should be enough for
   pathological cases like mine.
--
Tomasz Pala <go...@pld-linux.org>
Re: exclusive subvolume space missing
one --- ---
0/364  22.63GiB   72.03MiB  none  none  ---  ---
0/285  10.78GiB   75.95MiB  none  none  ---  ---
0/291  15.99GiB   76.24MiB  none  none  ---  ---  <- this one (default rootfs) got fixed
0/323  21.35GiB   95.85MiB  none  none  ---  ---
0/369  23.26GiB   96.12MiB  none  none  ---  ---
0/324  21.36GiB  104.46MiB  none  none  ---  ---
0/327  21.36GiB  115.42MiB  none  none  ---  ---
0/368  23.27GiB  118.25MiB  none  none  ---  ---
0/295  11.20GiB  148.59MiB  none  none  ---  ---
0/298  12.38GiB  283.41MiB  none  none  ---  ---
0/260  12.25GiB    3.22GiB  none  none  ---  ---  <- 170712, initial snapshot, OK
0/312  17.54GiB    4.56GiB  none  none  ---  ---  <- 170811, definitely less excl
0/388  21.69GiB    7.16GiB  none  none  ---  ---  <- this one has <100M exclusive

So one block of data was released, but there are probably two more stuck
here. If the 4.5G and 7G were freed, I would have 45-4.5-7=33G used,
which would agree with the 25G of data I've counted manually.

Any ideas how to look inside these two snapshots?

> Rescan and --sync are important to get the correct number.
> (while rescan can take a long long time to finish)

# time btrfs quota rescan -w /
quota rescan started
btrfs quota rescan -w /  0.00s user 0.00s system 0% cpu 30.798 total

> And further more, please ensure that all deleted files are really deleted.
> Btrfs delay file and subvolume deletion, so you may need to sync several
> times or use "btrfs subv sync" to ensure deleted files are deleted.

Yes, I was aware of that. However, I've never had to wait after a
rebalance...

regards,
--
Tomasz Pala <go...@pld-linux.org>
Re: exclusive subvolume space missing
On Sat, Dec 02, 2017 at 09:05:50 +0800, Qu Wenruo wrote:

>> qgroupid   rfer       excl
>>
>> 0/260   12.25GiB    3.22GiB   from 170712 - first snapshot
>> 0/312   17.54GiB    4.56GiB   from 170811
>> 0/366   25.59GiB    2.44GiB   from 171028
>> 0/370   23.27GiB   59.46MiB   from 18     - prev snapshot
>> 0/388   21.69GiB    7.16GiB   from 171125 - last snapshot
>> 0/291   24.29GiB    9.77GiB   default subvolume
>
> You may need to manually sync the filesystem (trigger a transaction
> commitment) to update qgroup accounting.

The data I've pasted had just been calculated.

>> # btrfs quota enable /
>> # btrfs qgroup show /
>> WARNING: quota disabled, qgroup data may be out of date
>> [...]
>> # btrfs quota enable / - for the second time!
>> # btrfs qgroup show /
>> WARNING: qgroup data inconsistent, rescan recommended
>
> Please wait the rescan, or any number is not correct.

Here I was pointing out that the first "quota enable" resulted in a
"quota disabled" warning until I enabled it a second time.

> It's highly recommended to read btrfs-quota(8) and btrfs-qgroup(8) to
> ensure you understand all the limitation.

I probably won't understand them all, but this is not an issue of my
concern, as I don't really use quotas. There is simply no other
straightforward way I am aware of that could show me per-subvolume
stats - and the hard way I'm using (btrfs send) confirms the problem.

You could simply remove all the quota results I've posted and there
would still be the underlying problem: the 25 GB of data I have occupies
52 GB. At least one recent snapshot, taken after some minor (<100 MB)
changes from a subvolume that has undergone only minor changes since
then, grew to occupy 8 GB during one night while the entire system was
idling. This was cross-checked against file metadata (mtimes compared)
and 'du' results.

As a last resort I've rebalanced the disk (once again), this time with
-dconvert=raid1 (to get rid of the single residue).
--
Tomasz Pala <go...@pld-linux.org>
Re: exclusive subvolume space missing
On Sat, Dec 02, 2017 at 08:27:56 +0800, Qu Wenruo wrote:

> I assume there is program eating up the space.
> Not btrfs itself.

Very doubtful. I've encountered an ext3 "eating" problem once, which
couldn't be found by lsof on a 3.4.75 kernel, but there the space was
returning after killing Xorg. The system I'm having the problem on now
is very recent, and the space doesn't return after reboot/emergency and
doesn't add up with the files.

>> Now, the weird part for me is exclusive data count:
>>
>> # btrfs sub sh ./snapshot-171125
>> [...]
>> Subvolume ID: 388
>> # btrfs fi du -s ./snapshot-171125
>>      Total   Exclusive  Set shared  Filename
>>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
>
> That's the difference between how sub show and quota works.
>
> For quota, it's per-root owner check.

Just to be clear: I've enabled quota _only_ to see subvolume usage on
the spot. And exclusive data - the more detailed approach I've described
in the e-mail I sent a minute ago.

> Means even a file extent is shared between different inodes, if all
> inodes are inside the same subvolume, it's counted as exclusive.
> And if any of the file extent belongs to other subvolume, then it's
> counted as shared.

Good to know, but this is an almost UID0-only system. There are system
users (vendor-provided) and 2 ssh accounts for su, but nobody uses this
machine for daily work. The quota values were the last tool I could find
to debug with.

> For fi du, it's per-inode owner check. (The exact behavior is a little
> more complex, I'll skip such corner case to make it a little easier to
> understand).
>
> That's to say, if one file extent is shared by different inodes, then
> it's counted as shared, no matter if these inodes belong to different or
> the same subvolume.
>
> That's to say, "fi du" has a looser condition for "shared" calculation,
> and that should explain why you have 20+G shared.

There shouldn't be many multi-inode extents inside a single subvolume,
as this is a mostly fresh system: no containers, no deduplication, and
snapshots are taken from the same running system before or after some
more important change is done. By 'change' I mean altering text config
files mostly (plus etckeeper's git metadata), so the volume of
difference is extremely low. Actually, most of the diffs between
subvolumes come from updating distro packages.

There were not many reflink copies made on this partition, only one
kernel source compiled (.ccache files removed today). So this partition
is as clean as it could be after almost 5 months in use.

Actually, I should rephrase the problem: "a snapshot has taken 8 GB of
space despite nothing having altered the source subvolume".
--
Tomasz Pala <go...@pld-linux.org>
Re: exclusive subvolume space missing
hot. Well, even:

btrfs send -p snapshot-170712 snapshot-171125 | pv > /dev/null
5.68GiB 0:03:23 [28.6MiB/s]

I've created a new snapshot right now to compare it with 171125:

75.5MiB 0:00:43 [1.73MiB/s]

OK, I could even compare all the snapshots in sequence:

# for i in snapshot-17*; btrfs prop set $i ro true
# p=''; for i in snapshot-17*; do
    [ -n "$p" ] && btrfs send -p "$p" "$i" | pv > /dev/null
    p="$i"
done
 1.7GiB 0:00:15 [ 114MiB/s]
1.03GiB 0:00:38 [27.2MiB/s]
 155MiB 0:00:08 [19.1MiB/s]
1.08GiB 0:00:47 [23.3MiB/s]
 294MiB 0:00:29 [ 9.9MiB/s]
 324MiB 0:00:42 [7.69MiB/s]
82.8MiB 0:00:06 [12.7MiB/s]
64.3MiB 0:00:05 [11.6MiB/s]
 137MiB 0:00:07 [19.3MiB/s]
85.3MiB 0:00:13 [6.18MiB/s]
62.8MiB 0:00:19 [3.21MiB/s]
 132MiB 0:00:42 [3.15MiB/s]
 102MiB 0:00:42 [2.42MiB/s]
 197MiB 0:00:50 [3.91MiB/s]
 321MiB 0:01:01 [5.21MiB/s]
 229MiB 0:00:18 [12.3MiB/s]
 109MiB 0:00:11 [ 9.7MiB/s]
 139MiB 0:00:14 [9.32MiB/s]
 573MiB 0:00:35 [15.9MiB/s]
64.1MiB 0:00:30 [2.11MiB/s]
 172MiB 0:00:11 [14.9MiB/s]
98.9MiB 0:00:07 [14.1MiB/s]
   54MiB 0:00:08 [6.17MiB/s]
78.6MiB 0:00:02 [32.1MiB/s]
15.1MiB 0:00:01 [12.5MiB/s]
20.6MiB 0:00:00 [  23MiB/s]
20.3MiB 0:00:00 [  23MiB/s]
 110MiB 0:00:14 [7.39MiB/s]
62.6MiB 0:00:11 [5.67MiB/s]
65.7MiB 0:00:08 [7.58MiB/s]
 731MiB 0:00:42 [  17MiB/s]
73.7MiB 0:00:29 [ 2.5MiB/s]
 322MiB 0:00:53 [6.04MiB/s]
 105MiB 0:00:35 [2.95MiB/s]
95.2MiB 0:00:36 [2.58MiB/s]
74.2MiB 0:00:30 [2.43MiB/s]
75.5MiB 0:00:46 [1.61MiB/s]

This is 9.3 GB of total diffs between all the snapshots I've got. Plus
the 15 GB initial snapshot, that means there is about 25 GB used, while
df reports twice that amount - way too much for overhead:

/dev/sda2  64G  52G  11G  84% /

# btrfs quota enable /
# btrfs qgroup show /
WARNING: quota disabled, qgroup data may be out of date
[...]
# btrfs quota enable /   - for the second time!
# btrfs qgroup show /
WARNING: qgroup data inconsistent, rescan recommended
[...]
0/428  15.96GiB  19.23MiB   newly created (now) snapshot

Assuming the qgroup output is bogus and the space isn't physically
occupied (which is coherent with the btrfs fi du output and my
expectation), the question remains: why is that bogus excl removed from
the available space as reported by df or btrfs fi df/usage? And how do
I reclaim it?

[~/test]# btrfs device usage /
/dev/sda2, ID: 1
   Device size:      64.00GiB
   Device slack:        0.00B
   Data,single:       1.07GiB
   Data,RAID1:       55.97GiB
   Metadata,RAID1:    2.00GiB
   System,RAID1:     32.00MiB
   Unallocated:       4.93GiB

/dev/sdb2, ID: 2
   Device size:      64.00GiB
   Device slack:        0.00B
   Data,single:     132.00MiB
   Data,RAID1:       55.97GiB
   Metadata,RAID1:    2.00GiB
   System,RAID1:     32.00MiB
   Unallocated:       5.87GiB
--
Tomasz Pala <go...@pld-linux.org>
exclusive subvolume space missing
Hello,

I got a problem with btrfs running out of space (not THE Internet-wide,
well-known issues with interpretation).

The problem is: something eats the space while I'm not running anything
that justifies this. There were 18 GB of free space available; suddenly
it dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
with rebalance -dusage=5 -musage=5 (or something about that), but it is
being eaten right now, just as I'm writing this e-mail:

/dev/sda2  64G  63G  452M 100% /
/dev/sda2  64G  63G  365M 100% /
/dev/sda2  64G  63G  316M 100% /
/dev/sda2  64G  63G  287M 100% /
/dev/sda2  64G  63G  268M 100% /
/dev/sda2  64G  63G  239M 100% /
/dev/sda2  64G  63G  230M 100% /
/dev/sda2  64G  63G  182M 100% /
/dev/sda2  64G  63G  163M 100% /
/dev/sda2  64G  64G  153M 100% /
/dev/sda2  64G  64G  143M 100% /
/dev/sda2  64G  64G   96M 100% /
/dev/sda2  64G  64G   88M 100% /
/dev/sda2  64G  64G   57M 100% /
/dev/sda2  64G  64G   25M 100% /

while my rough calculations show that there should be at least 10 GB of
free space. After enabling quotas this is somehow confirmed:

# btrfs qgroup sh --sort=excl /
qgroupid   rfer       excl
0/5       16.00KiB   16.00KiB
[30 snapshots with about 100 MiB excl]
0/333     24.53GiB  305.79MiB
0/298     13.44GiB  312.74MiB
0/327     23.79GiB  427.13MiB
0/331     23.93GiB  930.51MiB
0/260     12.25GiB    3.22GiB
0/312     19.70GiB    4.56GiB
0/388     28.75GiB    7.15GiB
0/291     30.60GiB    9.01GiB   <- this is the running one

This is about 30 GB total excl (didn't find a switch to sum this up).
I know I can't just add 'excl' up to get the usage, so I tried to
pinpoint the exact files that occupy space in 0/388 exclusively (this is
the last snapshot taken; all of the snapshots are created from the
running fs).

Now, the weird part for me is the exclusive data count:

# btrfs sub sh ./snapshot-171125
[...]
Subvolume ID: 388
# btrfs fi du -s ./snapshot-171125
     Total   Exclusive  Set shared  Filename
  21.50GiB    63.35MiB    20.77GiB  snapshot-171125

How is that possible? This doesn't even remotely relate to the 7.15 GiB
from qgroup. The same amount differs in the totals: 28.75-21.50=7.25 GiB.
And the same happens with other snapshots: much more exclusive data is
shown in qgroup than actually found in the files.

So if not files, where is that space wasted? Metadata?

btrfs-progs-4.12 running on Linux 4.9.46.

best regards,
--
Tomasz Pala <go...@pld-linux.org>
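There is indeed no switch to sum the excl column, but a throwaway helper can do it. A sketch assuming the human-readable units shown above (awk's numeric coercion strips the unit suffix):

```shell
# Sum the 3rd (excl) column of 'btrfs qgroup show' output, normalizing
# everything to GiB; skips the two header lines.
sum_excl() {
    awk 'NR > 2 {
        v = $3 + 0                     # numeric prefix of e.g. "3.22GiB"
        if ($3 ~ /MiB$/) v /= 1024
        if ($3 ~ /KiB$/) v /= 1024 * 1024
        s += v
    } END { printf "%.1f GiB\n", s }'
}
# Hypothetical invocation on a live system:
#   btrfs qgroup show / | sum_excl
```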