Re: Crashed filesystem, nothing helps
Now I am running btrfs-find-root, but it has been going for 5 days without a result. How long should I wait? Or is it already too late to hope?

mainframe:~ # btrfs-find-root.static /dev/sdb1
parent transid verify failed on 29376512 wanted 1327723 found 1489835
parent transid verify failed on 29376512 wanted 1327723 found 1489835
parent transid verify failed on 29376512 wanted 1327723 found 1489835
parent transid verify failed on 29376512 wanted 1327723 found 1489835
Ignoring transid failure
Superblock thinks the generation is 1490226
Superblock thinks the level is 1

The process is still running...

Regards,
Thomas

--
Thomas Wurfbaum
Starkertshofen 15
85084 Reichertshofen
Tel.: +49-160-3696336
Mail: tho...@wurfbaum.net
Google+: http://google.com/+ThomasWurfbaum
Facebook: https://www.facebook.com/profile.php?id=16061335414
Xing: https://www.xing.com/profile/Thomas_Wurfbaum
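For a transid mismatch like the one above, a common first step is to read the superblock's recorded backup roots and to try a read-only recovery mount before resorting to a full btrfs-find-root scan. The snippet below is only a sketch of those steps, not specific advice for this disk; the device name is taken from the message, and the mount point is a placeholder.

# List the superblock fields, including the backup_roots slots, to see
# which older tree root generations are still recorded on disk:
btrfs inspect-internal dump-super -f /dev/sdb1

# Attempt a read-only mount that falls back to an older tree root.
# The option is called 'usebackuproot' on kernel 4.6 and newer; older
# kernels use '-o recovery' instead:
mount -o ro,usebackuproot /dev/sdb1 /mnt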
Re: Power down tests...
On Sun, Aug 06, 2017 at 08:15:45PM -0600, Chris Murphy wrote:
> On Thu, Aug 3, 2017 at 11:51 PM, Shyam Prasad N wrote:
> > We're running a couple of experiments on our servers with btrfs
> > (kernel version 4.4).
> > And we're running some abrupt power-off tests for a couple of scenarios:
> >
> > 1. We have a filesystem on top of two different btrfs filesystems
> > (distributed across N disks). i.e. Our filesystem lays out data and
> > metadata on top of these two filesystems.
>
> This is astronomically more complicated than the already complicated
> scenario with one file system on a single normal partition of a well
> behaved (non-lying) single drive.
>
> You have multiple devices, so any one or all of them could drop data
> during the power failure and in different amounts. In the best case
> scenario, at next mount the supers are checked on all the devices, and
> the lowest common denominator generation is found, and therefore the
> lowest common denominator root tree. No matter what, it means some
> data is going to be lost.

That's exactly why we have CoW. Unless at least one of the disks lies, there's no way for data from a fully committed transaction to be lost. Any writes after that are _supposed_ to be lost.

Reordering writes between disks is no different from reordering writes on a single disk. Even more so with NVMe, where you have multiple parallel writes on the same device, with multiple command queues. You know the transaction has hit the, uhm, platters only once every device says so, and that's when you can start writing the new superblock.

> > The issue that we're facing is that a few files have been zero-sized.
>
> I can't tell you if that's a bug or not because I'm not sure how your
> software creates these 16M backing files, if they're fallocated or
> touched or what. It's plausible they're created as zero length files,
> and the file system successfully creates them, and then data is
> written to them, but before there is either committed metadata or an
> updated super pointing to the new root tree you get a power failure.
> And in that case, I expect a zero length file or maybe some partial
> amount of data is there.

It's the so-called O_PONIES issue. No filesystem can know whether you want files written immediately (abysmal performance) or held in cache until later (sacrificing durability). The only portable interface to choose is f{,data}sync: any write that hasn't been synced cannot be relied upon. Some traditional filesystems have implicitly synced things, but all such details are filesystem specific.

Btrfs in particular has -o flushoncommit, which, instead of an fsync after every single write, gathers the writes from the last 30 seconds and flushes them as one transaction (a short sketch of both approaches follows after this message). More generic interfaces have been proposed, but none has been implemented yet. Heck, I'm playing with one such idea myself, although I'm not sure if I know enough to ensure the semantics I have in mind.

> > As a result, there is either a data-loss, or inconsistency in the
> > stacked filesystem's metadata.
>
> Sounds expected for any file system, but chances are there's more
> missing with a CoW file system since by nature it rolls back to the
> most recent sane checkpoint for the fs metadata without any regard to
> what data is lost to make that happen. The goal is to not lose the
> file system in such a case, as some amount of data loss is always
> going to happen,

All it takes is to _somehow_ tell the filesystem you demand the same guarantees for data as it already provides for metadata.
And a CoW or log-based filesystem can actually deliver such a demand.

> which is why power losses need to be avoided (UPS's and such).

A UPS can't protect you from a kernel crash, a motherboard running out of smoke, a stick of memory going bad or getting unseated, a power supply deciding it wants a break from delivering the juice (with redundant power supplies, it's the thingy mediating power that will do so), etc, etc. There's no way around crash tolerance.

> The fact that you have a file system on top of a file system makes it
> more fragile because the 2nd file system's metadata *IS* data as far
> as the 1st file system is concerned. And that data is considered
> expendable.

Only because, by default, the underlying filesystem has been taught to consider it expendable.

Meow!
--
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄ • use glitches to walk on water
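A minimal sketch of the two approaches discussed above, assuming a scratch btrfs device, mount point, and file name (all placeholders, not from the thread): flushoncommit batches dirty data into each transaction commit, while an application that needs a particular file to survive a power cut must sync it explicitly.

# Filesystem-wide: flush dirty data with every transaction commit
# (the commit interval defaults to 30 seconds and is tunable):
mount -o flushoncommit,commit=30 /dev/sdX /mnt

# Per-file: write a 16M backing file and sync it before trusting it;
# conv=fsync makes dd call fsync() on the output file before exiting:
dd if=/dev/zero of=/mnt/backing-file bs=1M count=16 conv=fsync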
Re: Power down tests...
On Fri, Aug 4, 2017 at 6:09 AM, Shyam Prasad N wrote:
> Thanks guys. I've enabled that option now. Let's see how it goes.
> One general question regarding the stability of btrfs in kernel
> version 4.4. Is this okay for power off test cases? Or are there many
> important fixes in newer kernels?

$ git log --grep=power tags/v4.4...tags/v4.12 -- fs/btrfs

The answer is yes, there are power-failure related fixes since 4.4. I can't tell you off hand to what degree they're backported; you'd have to do a search with whatever specific sub-version of 4.4 you're using.

--
Chris Murphy
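As an illustration of that search: assuming a clone of the stable tree with its tags available, the same grep can be run against the specific 4.4.y release in use (the 4.4.80 tag below is only an example sub-version, not one mentioned in the thread).

# Compare a specific stable release against plain v4.4 to see which of
# those fixes were backported to the series actually in use:
git log --oneline --grep=power v4.4..v4.4.80 -- fs/btrfs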
Re: Power down tests...
On Thu, Aug 3, 2017 at 11:51 PM, Shyam Prasad N wrote:
> Hi all,
>
> We're running a couple of experiments on our servers with btrfs
> (kernel version 4.4).
> And we're running some abrupt power-off tests for a couple of scenarios:
>
> 1. We have a filesystem on top of two different btrfs filesystems
> (distributed across N disks).

What's the layout from the physical devices all the way to your 16M file? Is this hardware raid, lvm linear, Btrfs raid? All of that matters.

Do the drives have write caching disabled? You might be better off with the drive write cache disabled, and then adding bcache or dm-cache and an SSD to compensate. But that's just speculation on my part. The write cache in the drives is definitely volatile, and disabling it will definitely make writes slower. So you might have slightly better luck with another layout.

But the bottom line is, you need to figure out a way to avoid *any* data loss in your files, because otherwise that means the 2nd file system has data loss and even corruption. This is not something a file system choice can solve. You need reliable power and reliable shutdown. And you may also need a cluster file system like ceph or glusterfs instead of depending on a single box to stay upright.

--
Chris Murphy
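A hedged sketch of checking and disabling the volatile drive write cache mentioned above; the device name is a placeholder, and some drives re-enable the cache after a power cycle, so the setting may need to be re-applied at boot.

# Show the current write-cache setting of a drive:
hdparm -W /dev/sdX

# Turn the volatile write cache off (writes get slower, but less data
# sits in the drive's cache when the power goes away):
hdparm -W 0 /dev/sdX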
Re: Power down tests...
On Thu, Aug 3, 2017 at 11:51 PM, Shyam Prasad N wrote:
> Hi all,
>
> We're running a couple of experiments on our servers with btrfs
> (kernel version 4.4).
> And we're running some abrupt power-off tests for a couple of scenarios:
>
> 1. We have a filesystem on top of two different btrfs filesystems
> (distributed across N disks). i.e. Our filesystem lays out data and
> metadata on top of these two filesystems.

This is astronomically more complicated than the already complicated scenario with one file system on a single normal partition of a well behaved (non-lying) single drive.

You have multiple devices, so any one or all of them could drop data during the power failure, and in different amounts. In the best case scenario, at next mount the supers are checked on all the devices, and the lowest common denominator generation is found, and therefore the lowest common denominator root tree. No matter what, it means some data is going to be lost.

Next, there is a file system on top of a file system. I assume it's a file that's loopback mounted?

> With the test workload, it
> is going to generate a good amount of 16MB files on top of the system.
> On abrupt power-off and following reboot, what are the recommended
> steps to be run? We're attempting btrfs mount, which seems to fail
> sometimes. If it fails, we run a fsck and then mount the btrfs.

I'd want to know why it fails. And then I'd check all the supers on all the devices with 'btrfs inspect-internal dump-super -fa <device>'. Are all the copies on a given device the same and valid? Are all the copies among all devices the same and valid? (A small loop for this comparison is sketched after this message.) I'm expecting there will be discrepancies, and then you have to figure out if the mount logic is really finding the right root to try to mount. I'm not sure if kernel code by default reports back in detail what logic it's using and exactly where it fails, or if you just get the generic open_ctree mount failure message.

And then it's an open question whether the supers need fixing, or whether the 'usebackuproot' mount option is the way to go. It might depend on the status of the supers how that logic ends up working. Again, it might be useful if there were debug info that explicitly shows the mount logic actually being used, dumped to kernel messages. I'm not sure if that code exists when CONFIG_BTRFS_DEBUG is enabled (as in, I haven't looked, but I've thought it really could come in handy in some of the cases we see where mount fails and we can't tell where things are getting stuck with the existing reporting).

> The
> issue that we're facing is that a few files have been zero-sized.

I can't tell you if that's a bug or not because I'm not sure how your software creates these 16M backing files, if they're fallocated or touched or what. It's plausible they're created as zero length files, and the file system successfully creates them, and then data is written to them, but before there is either committed metadata or an updated super pointing to the new root tree you get a power failure. And in that case, I expect a zero length file or maybe some partial amount of data is there.

> As a
> result, there is either a data-loss, or inconsistency in the stacked
> filesystem's metadata.

Sounds expected for any file system, but chances are there's more missing with a CoW file system, since by nature it rolls back to the most recent sane checkpoint for the fs metadata without any regard to what data is lost to make that happen. The goal is to not lose the file system in such a case, as some amount of data loss is always going to happen, which is why power losses need to be avoided (UPS's and such). The fact that you have a file system on top of a file system makes it more fragile, because the 2nd file system's metadata *IS* data as far as the 1st file system is concerned. And that data is considered expendable.

> We're mounting the btrfs with commit period of 5s. However, I do
> expect btrfs to journal the I/Os that are still dirty. Why then are we
> seeing the above behaviour?

commit=5 might make the problem worse by requiring such constant flushing of dirty data that you're getting a bunch of disk contention; hard to say, since there are no details about the workload at the time of the power failure. Changing nothing else but the commit= mount option, what difference do you see (with a scientific sample), if any, between commit=5 and the default commit=30 when it comes to the amount of data loss?

Another thing we don't know is the behavior of the application or service writing out these 16M backing files when it comes to fsync or fdatasync or fadvise.

--
Chris Murphy
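A small sketch of the superblock comparison suggested above; the member device names are placeholders for whatever devices actually back the affected filesystem.

# Dump every superblock copy on each member device and compare the
# generation fields; a device whose supers lag behind likely dropped
# writes at power-off:
for dev in /dev/sdb1 /dev/sdc1; do
    echo "== $dev =="
    btrfs inspect-internal dump-super -fa "$dev" | grep -E 'superblock:|^generation'
done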
Re: Safe to use 'clear_cache' in mount -o remount?
On Sun, Aug 6, 2017 at 8:49 AM, Cloud Admin wrote:
> Hi,
> is it safe (has it an effect?) to use the 'clear_cache' option in a
> 'mount -o remount'? I see messages in my kernel log like
> 'BTRFS info (device dm-7): The free space cache file (31215079915520)
> is invalid. skip it'. I would like to fix it, ideally (in the best
> case) without rebooting.

It is safe, and the cache is immediately rebuilt. You will see 'space_cache,clear_cache' in the mount options, which is normal and which you can ignore. I just tried this with a remount and I get the expected messages:

[  400.465624] BTRFS info (device sda7): force clearing of disk cache
[  400.465639] BTRFS info (device sda7): disk space caching is enabled

Chris Murphy
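A minimal sketch of the remount itself, assuming the affected filesystem is mounted at a placeholder mount point:

# Clear and rebuild the free space cache on a mounted filesystem:
mount -o remount,clear_cache /mnt/point

# Confirm the rebuild in the kernel log:
dmesg | grep 'clearing of disk cache'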
Safe to use 'clear_cache' in mount -o remount?
Hi,
is it safe (has it an effect?) to use the 'clear_cache' option in a 'mount -o remount'? I see messages in my kernel log like 'BTRFS info (device dm-7): The free space cache file (31215079915520) is invalid. skip it'. I would like to fix it, ideally (in the best case) without rebooting.

Bye
Frank