Re: slow btrfs with a single kworker process using 100% CPU
I've backported the free space cache tree to my kernel, and hopefully any fixes related to it. The first mount with clear_cache,space_cache=v2 took around 5 hours. Currently I do not see any kworker at 100% CPU, but I don't see much load at all. btrfs-transaction takes around 2-4% CPU together with a kworker process, plus some mdadm processes at 2-3%. I/O wait is at 3%. That's it. It does not do much more. Writing a file does not work.

Greets,
Stefan

On 16.08.2017 at 14:29, Konstantin V. Gavrilenko wrote:
> Roman, initially I had a single process occupying 100% CPU; when sysrq'd it was
> showing as "btrfs_find_space_for_alloc".
> But that's when I used the autodefrag, compress, forcecompress and commit=10
> mount flags, and space_cache was v1 by default.
> When I switched to "relatime,compress-force=zlib,space_cache=v2" the 100% CPU
> disappeared, but the shite performance remained.
>
> As to the chunk size, there is no information in the article about the type
> of data that was used. While in our case we are pretty certain about the
> compressed block size (32-128k). I am currently inclining towards 32k, as it
> might be ideal in a situation where we have a 5-disk raid5 array.
>
> In theory:
> 1. The minimum compressed write (32k) would fill the chunk on a single disk,
> thus the IO cost of the operation would be 2 reads (original chunk + original
> parity) and 2 writes (new chunk + new parity).
>
> 2. The maximum compressed write (128k) would require the update of 1 chunk on
> each of the 4 data disks + 1 parity write.
>
> Stefan, what mount flags do you use?
>
> kos
>
> ----- Original Message -----
> From: "Roman Mamedov"
> To: "Konstantin V. Gavrilenko"
> Cc: "Stefan Priebe - Profihost AG", "Marat Khalili",
> linux-btrfs@vger.kernel.org, "Peter Grandi"
> Sent: Wednesday, 16 August, 2017 2:00:03 PM
> Subject: Re: slow btrfs with a single kworker process using 100% CPU
>
> On Wed, 16 Aug 2017 12:48:42 +0100 (BST)
> "Konstantin V. Gavrilenko" wrote:
>
>> I believe the chunk size of 512kb is even worse for performance than the
>> default settings on my HW RAID of 256kb.
>
> It might be, but that does not explain the original problem reported at all.
> If mdraid performance were the bottleneck, you would see high iowait,
> possibly some CPU load from the mdX_raidY threads. But not a single Btrfs
> thread pegging at 100% CPU.
>
>> So now I am moving the data from the array and will be rebuilding it with 64
>> or 32 chunk size and checking the performance.
>
> 64K is the sweet spot for RAID5/6:
> http://louwrentius.com/linux-raid-level-and-chunk-size-the-benchmarks.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
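The "In theory" I/O cost reasoning above can be sanity-checked with a small model. This is my own illustrative sketch, not something from the thread: it counts device reads and writes for an aligned write on a 5-disk RAID5 (4 data + 1 parity), following Konstantin's partial-stripe vs full-stripe argument.

```python
def raid5_write_cost(write_kib, chunk_kib, data_disks=4):
    """Return (reads, writes) for an aligned write of write_kib KiB.

    A write covering only some chunks of a stripe is a read-modify-write:
    read each old chunk plus old parity, write each new chunk plus new
    parity. A write covering the full stripe needs no reads, because
    parity can be computed from the new data alone.
    """
    chunks_touched = max(1, -(-write_kib // chunk_kib))   # ceiling division
    chunks_touched = min(chunks_touched, data_disks)
    if chunks_touched == data_disks:                      # full-stripe write
        return (0, data_disks + 1)
    # partial stripe: RMW of the touched chunks plus parity
    return (chunks_touched + 1, chunks_touched + 1)

# Minimum compressed write, 32k chunks: 2 reads + 2 writes, as in point 1.
print(raid5_write_cost(32, 32))    # (2, 2)
# Maximum compressed write: full stripe, 4 data chunks + 1 parity, point 2.
print(raid5_write_cost(128, 32))   # (0, 5)
```

This also shows why a chunk size matched to the typical write size matters: a 32k write against 256k chunks is always a partial-stripe RMW.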
Re: Raid0 rescue
OK, this time also -mraid1 -draid0, and filled it with some more metadata, but I then formatted NTFS, then ext4, then xfs. And then wiped those signatures. Brutal, especially ext4, which writes a lot more stuff and zeros a bunch of areas too.

# btrfs rescue super -v /dev/mapper/vg-2
All Devices:
	Device: id = 1, name = /dev/mapper/vg-1
	Device: id = 2, name = /dev/mapper/vg-2

Before Recovering:
	[All good supers]:
		device name = /dev/mapper/vg-1
		superblock bytenr = 65536

		device name = /dev/mapper/vg-1
		superblock bytenr = 67108864

		device name = /dev/mapper/vg-2
		superblock bytenr = 67108864

	[All bad supers]:

All supers are valid, no need to recover

Obviously vg-2 is missing its first superblock, and this tool is not complaining about it at all. A normal mount does not work (generic open_ctree error).

# btrfs check /dev/mapper/vg-1
warning, device 2 is missing

Umm, no. But yeah, because the first super is missing, the kernel isn't considering it a Btrfs volume at all. There are also other errors with the check, due to metadata being stepped on, I'm guessing. But we need a way to fix an obviously stepped-on first super, and I don't like the idea of using btrfs check for that anyway. All I need is the first copy fixed up, and then just do a normal mount. But let's see how messy this gets, pointing check at the damaged device and the known good 2nd super (-s0 is the first super):

# btrfs check -s 1 /dev/mapper/vg-2
using SB copy 1, bytenr 67108864
...skipping checksum errors etc.

OK, so I guess I have to try --repair.

# btrfs check --repair -s1 /dev/mapper/vg-2
enabling repair mode
using SB copy 1, bytenr 67108864
...skipping checksum errors etc.

# btrfs rescue super -v /dev/mapper/vg-1
All Devices:
	Device: id = 1, name = /dev/mapper/vg-1

Before Recovering:
	[All good supers]:
		device name = /dev/mapper/vg-1
		superblock bytenr = 67108864

	[All bad supers]:
		device name = /dev/mapper/vg-1
		superblock bytenr = 65536

That is fucked. It broke the previously good super on vg-1?

[root@f26wnuc ~]# btrfs rescue super -v /dev/mapper/vg-2
All Devices:
	Device: id = 1, name = /dev/mapper/vg-1
	Device: id = 2, name = /dev/mapper/vg-2

Before Recovering:
	[All good supers]:
		device name = /dev/mapper/vg-1
		superblock bytenr = 67108864

		device name = /dev/mapper/vg-2
		superblock bytenr = 67108864

	[All bad supers]:
		device name = /dev/mapper/vg-1
		superblock bytenr = 65536

Worse, it did not actually fix the bad/missing superblock on vg-2 either. Let's answer Y to its questions...

[root@f26wnuc ~]# btrfs rescue super -v /dev/mapper/vg-2
All Devices:
	Device: id = 1, name = /dev/mapper/vg-1
	Device: id = 2, name = /dev/mapper/vg-2

Before Recovering:
	[All good supers]:
		device name = /dev/mapper/vg-1
		superblock bytenr = 67108864

		device name = /dev/mapper/vg-2
		superblock bytenr = 67108864

	[All bad supers]:
		device name = /dev/mapper/vg-1
		superblock bytenr = 65536

Make sure this is a btrfs disk otherwise the tool will destroy other fs, Are you sure? [y/N]: y
checksum verify failed on 20971520 found 348F13AD wanted 8100
checksum verify failed on 20971520 found 348F13AD wanted 8100
Recovered bad superblocks successful

[root@f26wnuc ~]# btrfs rescue super -v /dev/mapper/vg-2
All Devices:
	Device: id = 1, name = /dev/mapper/vg-1
	Device: id = 2, name = /dev/mapper/vg-2

Before Recovering:
	[All good supers]:
		device name = /dev/mapper/vg-1
		superblock bytenr = 65536

		device name = /dev/mapper/vg-1
		superblock bytenr = 67108864

		device name = /dev/mapper/vg-2
		superblock bytenr = 65536

		device name = /dev/mapper/vg-2
		superblock bytenr = 67108864

	[All bad supers]:

All supers are valid, no need to recover

OK! That's better! Mount it. dmesg: https://pastebin.com/6kVzYLfZ

Pretty boring: a bad tree block, and then some read errors corrected. I get more similarly formatted errors, different numbers... but no failures. Scrub it...
# btrfs scrub status /mnt/yo
scrub status for b2ee5125-cf56-493a-b094-81fe8330115a
	scrub started at Wed Aug 16 23:08:54 2017, running for 00:00:30
	total bytes scrubbed: 1.19GiB with 5 errors
	error details: csum=5
	corrected errors: 5, uncorrectable errors: 0, unverified errors: 0
#

There's almost no data on this file system; it's mostly metadata, which is raid1, so that's why the data survives. But even in the previous example, where some data is clobbered, the data loss is limited. The file system itself survives and can continue to be used.

The 'btrfs rescue super' function could be better, and it looks like there's a bug in btrfs check's superblock repair.

Chris Murphy
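For anyone scripting this kind of recovery test, a throwaway parser for the scrub summary shown above can turn the counters into something a test harness can assert on. This is my own sketch; the regex simply assumes the output format matches the run pasted above.

```python
import re

# Sample taken verbatim from the scrub run above.
SAMPLE = """\
scrub status for b2ee5125-cf56-493a-b094-81fe8330115a
\tscrub started at Wed Aug 16 23:08:54 2017, running for 00:00:30
\ttotal bytes scrubbed: 1.19GiB with 5 errors
\terror details: csum=5
\tcorrected errors: 5, uncorrectable errors: 0, unverified errors: 0
"""

def parse_scrub_errors(text):
    """Extract the corrected/uncorrectable/unverified counters as ints."""
    m = re.search(r"corrected errors: (\d+), uncorrectable errors: (\d+), "
                  r"unverified errors: (\d+)", text)
    if not m:
        raise ValueError("no error summary found in scrub output")
    corrected, uncorrectable, unverified = map(int, m.groups())
    return {"corrected": corrected,
            "uncorrectable": uncorrectable,
            "unverified": unverified}

print(parse_scrub_errors(SAMPLE))
# {'corrected': 5, 'uncorrectable': 0, 'unverified': 0}
```

A harness could then fail the test whenever "uncorrectable" is nonzero.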
Re: Raid0 rescue
I'm testing explicitly for this case:

# lvs
  LV         VG Attr       LSize   Pool       Origin Data%  Meta%  Move Log Cpy%Sync Convert
  1          vg Vwi-a-tz--  10.00g thintastic        0.00
  2          vg Vwi-a-tz--  10.00g thintastic        0.00
  thintastic vg twi-aotz-- 100.00g                   0.00   0.38

# mkfs.btrfs -f -mraid1 -draid0 /dev/mapper/vg-1 /dev/mapper/vg-2
...

Mount and copy some variable data to the volume; most files are less than 64KiB, and some are even less than 2KiB. So there will be a mix of files that will definitely get nerfed by damaged strips, many that will live because they're on the drive that wasn't accidentally formatted, and some inline. But for sure the file system *ought* to survive.

umount, and then format NTFS:

# mkfs.ntfs -f /dev/mapper/vg-2

Now get this bit of curiousness:

# wipefs /dev/mapper/vg-2
offset               type
0x1fe                dos     [partition table]
0x10040              btrfs   [filesystem]
                     UUID: bebaedc5-96a1-4163-9527-8254ecae817e
0x3                  ntfs    [filesystem]
                     UUID: 67AD98CF36096C70

So the two supers can coexist. That is invariably going to cause kernel code confusion. blkid will consider it neither NTFS nor Btrfs. So it's sort of in a zombie situation.
Get this:

# btrfs rescue super -v /dev/mapper/vg-1
All Devices:
	Device: id = 1, name = /dev/mapper/vg-1

Before Recovering:
	[All good supers]:
		device name = /dev/mapper/vg-1
		superblock bytenr = 65536

		device name = /dev/mapper/vg-1
		superblock bytenr = 67108864

	[All bad supers]:

All supers are valid, no need to recover

# btrfs rescue super -v /dev/mapper/vg-2
All Devices:
	Device: id = 1, name = /dev/mapper/vg-1
	Device: id = 2, name = /dev/mapper/vg-2

Before Recovering:
	[All good supers]:
		device name = /dev/mapper/vg-1
		superblock bytenr = 65536

		device name = /dev/mapper/vg-1
		superblock bytenr = 67108864

		device name = /dev/mapper/vg-2
		superblock bytenr = 65536

		device name = /dev/mapper/vg-2
		superblock bytenr = 67108864

	[All bad supers]:

All supers are valid, no need to recover
#

So the first command sees the supers only on vg-1; it doesn't go looking at vg-2 at all, presumably because kernel code is ignoring that device due to the two different file system supers (?). But the second command forces it to look at vg-2, and it says the Btrfs supers are fine, and then it also auto-discovers the vg-1 device. OK, so I'm just going to cheat at this point and wipefs just the NTFS magic so this device is again seen as Btrfs.

# wipefs -n -o 0x3 /dev/mapper/vg-2
/dev/mapper/vg-2: 8 bytes were erased at offset 0x0003 (ntfs): 4e 54 46 53 20 20 20 20
# wipefs -o 0x3 /dev/mapper/vg-2
/dev/mapper/vg-2: 8 bytes were erased at offset 0x0003 (ntfs): 4e 54 46 53 20 20 20 20
# partprobe
# blkid
...
/dev/mapper/vg-1: UUID="bebaedc5-96a1-4163-9527-8254ecae817e" UUID_SUB="ef9dbcf0-bb0b-4faf-a7b4-02f1c92631e4" TYPE="btrfs"
/dev/mapper/vg-2: UUID="bebaedc5-96a1-4163-9527-8254ecae817e" UUID_SUB="490504ea-4ee4-47ad-91a7-58b6ccf4be8e" TYPE="btrfs" PTTYPE="dos"
...

OK, good. Except, what is PTTYPE? Ohh, that's the first entry in the wipefs output way at the top, I bet.

[root@f26wnuc ~]# wipefs -o 0x1fe /dev/mapper/vg-2
/dev/mapper/vg-2: 2 bytes were erased at offset 0x01fe (dos): 55 aa
# blkid
...
/dev/mapper/vg-1: UUID="bebaedc5-96a1-4163-9527-8254ecae817e" UUID_SUB="ef9dbcf0-bb0b-4faf-a7b4-02f1c92631e4" TYPE="btrfs"
/dev/mapper/vg-2: UUID="bebaedc5-96a1-4163-9527-8254ecae817e" UUID_SUB="490504ea-4ee4-47ad-91a7-58b6ccf4be8e" TYPE="btrfs"
...

Yep! OK, let's just try a normal mount. It mounts! No errors at all. List all the files on the file system (about 700): no errors. Cat a few to /dev/null manually: no errors. OK, I'm bored. Let's just scrub it.

[root@f26wnuc yo]# btrfs scrub status /mnt/yo/
scrub status for bebaedc5-96a1-4163-9527-8254ecae817e
	scrub started at Wed Aug 16 19:40:26 2017, running for 00:00:10
	total bytes scrubbed: 529.62MiB with 181 errors
	error details: csum=181
	corrected errors: 0, uncorrectable errors: 181, unverified errors: 0

One file is affected, the large ~1+GiB file.

[77898.116429] BTRFS warning (device dm-6): checksum error at logical 1621229568 on dev /dev/mapper/vg-2, sector 2621568, root 5, inode 257, offset 517341184, length 4096, links 1 (path: Fedora-Workstation-Live-x86_64-Rawhide-20170814.n.0.iso)
[77898.116463] BTRFS error (device dm-6): bdev /dev/mapper/vg-2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[77898.116478] BTRFS error (device dm-6): unable to fixup (regular) error at logical 1621229568 on dev /dev/mapper/vg-2

There are about 9 more of those kinds of messages. Anyway, that looks to me like the file itself was nerfed by the NTFS format, but the file system itself wasn't hit. There's no fixups
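The zombie double-signature state above comes down to two magics at non-overlapping offsets, which is exactly what the wipefs output showed: the NTFS OEM id "NTFS    " at byte 0x3 of the boot sector, and the btrfs magic ("_BHRfS_M", at offset 0x40 inside the first superblock at 64KiB, i.e. absolute 0x10040). A sketch of mine, using only those two facts, shows how a scanner sees both at once and how wiping one leaves the other:

```python
NTFS_MAGIC = b"NTFS    "    # 4e 54 46 53 20 20 20 20, at offset 0x3
BTRFS_MAGIC = b"_BHRfS_M"   # at absolute offset 0x10040 (64KiB super + 0x40)

def signatures(image: bytes):
    """Report which of the two filesystem signatures an image carries."""
    found = []
    if image[0x3:0x3 + 8] == NTFS_MAGIC:
        found.append("ntfs")
    if image[0x10040:0x10040 + 8] == BTRFS_MAGIC:
        found.append("btrfs")
    return found

# Build a fake image in the state vg-2 was in after mkfs.ntfs:
img = bytearray(0x11000)
img[0x3:0xb] = NTFS_MAGIC
img[0x10040:0x10048] = BTRFS_MAGIC
print(signatures(bytes(img)))   # ['ntfs', 'btrfs'] -- the zombie state

# Wiping just the NTFS magic (what `wipefs -o 0x3` did) leaves btrfs intact:
img[0x3:0xb] = b"\x00" * 8
print(signatures(bytes(img)))   # ['btrfs']
```

mkfs.ntfs never touches 0x10040 on a small-enough boot area, which is why the btrfs super survived in the first place.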
Re: btrfs fi du -s gives Inappropriate ioctl for device
On Wed, Aug 16, 2017 at 3:27 AM, Piotr Szymaniak wrote:
> On Mon, Aug 14, 2017 at 05:40:30PM -0600, Chris Murphy wrote:
>> On Mon, Aug 14, 2017 at 4:57 PM, Piotr Szymaniak wrote:
>>>
>>> and... some issues:
>>> ~ # btrfs fi du -s /mnt/red/\@backup/
>>>      Total   Exclusive  Set shared  Filename
>>> ERROR: cannot check space of '/mnt/red/@backup/': Inappropriate ioctl for device
>>
>> It's a bug, but I don't know if any devs are working on a fix yet.
>>
>> The problem is that the subvolume being snapshot contains subvolumes.
>> The resulting snapshot contains an empty directory in place of the
>> nested subvolume(s), and that is the cause of the error.
>
> Ok, but why, on the same btrfs, does it work on some subvols with subvols and
> not on other subvols with subvols? If it does not work - OK, if it works - OK,
> but that seems a bit... random?
>
> ~ # btrfs fi du -s /mnt/red/\@backup/ /mnt/red/\@backup/.snapshot/monthly_2017-08-01_05\:30\:01/ /mnt/red/\@svn/ /mnt/red/\@svn/.snapshot/weekly_2017-08-05_04\:20\:02/
>      Total   Exclusive  Set shared  Filename
> ERROR: cannot check space of '/mnt/red/@backup/': Inappropriate ioctl for device
> ERROR: cannot check space of '/mnt/red/@backup/.snapshot/monthly_2017-08-01_05:30:01/': Inappropriate ioctl for device
>   52.23GiB    10.57MiB     4.13GiB  /mnt/red/@svn/
>    4.35GiB     1.03MiB     4.12GiB  /mnt/red/@svn/.snapshot/weekly_2017-08-05_04:20:02/

I don't know. It might be that there's something inconsistent about the inode for the missing/ghost subvolume placeholder directory at snapshot creation time?

-- Chris Murphy
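One way userspace could sidestep the failing ioctl is worth sketching. The assumption here (worth verifying, it's mine rather than from this thread) is that on btrfs a real subvolume root reports inode number 256 (BTRFS_FIRST_FREE_OBJECTID) to stat(), while the empty placeholder directory a snapshot leaves behind does not, so a tool could skip placeholders instead of issuing a subvolume ioctl that returns "Inappropriate ioctl for device":

```python
import os

# Assumed convention: subvolume roots stat as inode 256; the ghost
# placeholder directories do not (they reportedly show a tiny inode
# number such as 2). Treat both claims as assumptions to verify.
BTRFS_FIRST_FREE_OBJECTID = 256

def is_subvol_ino(ino: int) -> bool:
    """True if this inode number is the one a subvolume root would have."""
    return ino == BTRFS_FIRST_FREE_OBJECTID

def looks_like_subvol_root(path: str) -> bool:
    """Heuristic: directory whose inode number marks a subvolume root."""
    st = os.lstat(path)
    return os.path.isdir(path) and is_subvol_ino(st.st_ino)

print(is_subvol_ino(256))   # True
print(is_subvol_ino(2))     # False
```

`btrfs fi du` could run such a check before the ioctl and report placeholders explicitly rather than erroring out.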
Re: qcow2 images make scrub believe the filesystem is corrupted.
On Tue, Aug 15, 2017 at 7:12 PM, Paulo Dias wrote:
> Device Model:     Samsung SSD 850 EVO M.2 500GB
> Serial Number:    S33DNX0H812686V
> LU WWN Device Id: 5 002538 d4130d027
> Firmware Version: EMT21B6Q

Unfortunately, no firmware updates are listed with Samsung for this model. It's worth filing a bug report with them, and then not using either fstrim or discard for a while to see whether the problem recurs. If it doesn't, that suggests a trim bug in the firmware. If it still occurs, it could just be defective hardware. Does smartctl -x reveal any issues?

-- Chris Murphy
Re: slow btrfs with a single kworker process using 100% CPU
>>> I've one system where a single kworker process is using 100%
>>> CPU; sometimes a second process comes up with 100% CPU
>>> [btrfs-transacti]. [ ... ]

>> [ ... ] 1413 snapshots. I'm deleting 50 of them every night. But the
>> btrfs-cleaner process isn't running / consuming CPU currently.

Reminder that:
https://btrfs.wiki.kernel.org/index.php/Gotchas#Having_many_subvolumes_can_be_very_slow

"The cost of several operations, including currently balance, device delete and fs resize, is proportional to the number of subvolumes, including snapshots, and (slightly super-linearly) the number of extents in the subvolumes."

>> [ ... ] btrfs is mounted with compress-force=zlib

> Could be a similar issue to what I had recently, with RAID5 and a 256kb
> chunk size. Please provide more information about your RAID setup.

It is similar, but updating in-place compressed files can create this situation even without RAID5 RMW:

https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation

"Files with a lot of random writes can become heavily fragmented (10000+ extents) causing thrashing on HDDs and excessive multi-second spikes of CPU load on systems with an SSD or large amount of RAM. ... Symptoms include btrfs-transacti and btrfs-endio-wri taking up a lot of CPU time (in spikes, possibly triggered by syncs)."
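The fragmentation gotcha quoted above can be illustrated with a toy model of mine (not from the wiki): under copy-on-write, each 4KiB random overwrite of a not-yet-rewritten region relocates that block, splitting a contiguous extent into more pieces. Real btrfs coalesces adjacent writes within a transaction, so treat the numbers as an upper-bound illustration of how fast a file shatters:

```python
import random

def extents_after_random_writes(file_blocks, n_writes, seed=42):
    """Model extent count after n random 4KiB overwrites of a file that
    starts as one contiguous extent of file_blocks blocks."""
    rng = random.Random(seed)
    rewritten = set()
    for _ in range(n_writes):
        rewritten.add(rng.randrange(file_blocks))
    # Each maximal run of rewritten or untouched blocks is at least one
    # extent, since rewritten blocks were relocated elsewhere on disk.
    extents = 1
    for b in range(1, file_blocks):
        if (b in rewritten) != (b - 1 in rewritten):
            extents += 1
    return extents

blocks = 256 * 1024   # a 1 GiB file in 4KiB blocks
for writes in (1000, 10000, 100000):
    print(writes, "writes ->", extents_after_random_writes(blocks, writes), "extents")
```

This is why database and VM-image files are the classic trigger: each small in-place update lands at a random offset, and with compress-force every such update also goes through the compression path.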
Re: BTRFS error (device sda4): failed to read chunk tree: -5
On Wed, Aug 16, 2017 at 4:25 PM, Zirconium Hacker wrote:
> Hi,
> This is my first time using a mailing list, and I hope I'm doing this right.
>
> $ uname -a
> Linux thinkpad 4.12.6-1-ARCH #1 SMP PREEMPT Sat Aug 12 09:16:22 CEST 2017 x86_64 GNU/Linux
> $ btrfs --version
> btrfs-progs v4.12
> $ sudo mount -o ro,recovery /dev/sda4 /mnt
> mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sda4,
> missing codepage or helper program, or other error.
> $ dmesg | tail
> [ 1289.087439] BTRFS warning (device sda4): 'recovery' is deprecated, use 'usebackuproot' instead
> [ 1289.087440] BTRFS info (device sda4): trying to use backup root at mount time
> [ 1289.087442] BTRFS info (device sda4): disk space caching is enabled
> [ 1289.097757] BTRFS error (device sda4): failed to read chunk tree: -5
> [ 1289.135222] BTRFS error (device sda4): open_ctree failed
>
> $ sudo btrfs check /dev/sda4
> bytenr mismatch, want=61809344512, have=0
> Couldn't read tree root
> ERROR: cannot open file system
> $ sudo btrfs restore -D /dev/sda4 .
> bytenr mismatch, want=61809344512, have=0
> Couldn't read tree root
> Could not open root, trying backup super
> bytenr mismatch, want=61809344512, have=0
> Couldn't read tree root
> Could not open root, trying backup super
> ERROR: superblock bytenr 274877906944 is larger than device size 58056507392
> Could not open root, trying backup super

What happened before this? What do you get for:

btrfs rescue super -v /dev/sda4

-- Chris Murphy
BTRFS error (device sda4): failed to read chunk tree: -5
Hi,
This is my first time using a mailing list, and I hope I'm doing this right.

$ uname -a
Linux thinkpad 4.12.6-1-ARCH #1 SMP PREEMPT Sat Aug 12 09:16:22 CEST 2017 x86_64 GNU/Linux
$ btrfs --version
btrfs-progs v4.12
$ sudo mount -o ro,recovery /dev/sda4 /mnt
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sda4,
missing codepage or helper program, or other error.
$ dmesg | tail

[ 1289.087439] BTRFS warning (device sda4): 'recovery' is deprecated, use 'usebackuproot' instead
[ 1289.087440] BTRFS info (device sda4): trying to use backup root at mount time
[ 1289.087442] BTRFS info (device sda4): disk space caching is enabled
[ 1289.097757] BTRFS error (device sda4): failed to read chunk tree: -5
[ 1289.135222] BTRFS error (device sda4): open_ctree failed

$ sudo btrfs check /dev/sda4
bytenr mismatch, want=61809344512, have=0
Couldn't read tree root
ERROR: cannot open file system
$ sudo btrfs restore -D /dev/sda4 .
bytenr mismatch, want=61809344512, have=0
Couldn't read tree root
Could not open root, trying backup super
bytenr mismatch, want=61809344512, have=0
Couldn't read tree root
Could not open root, trying backup super
ERROR: superblock bytenr 274877906944 is larger than device size 58056507392
Could not open root, trying backup super

A script called btrfs-undelete (https://gist.github.com/Changaco/45f8d171027ea2655d74) also fails with similar errors. I'd like to recover at least one folder, my desktop -- everything else was backed up. I'm using PhotoRec to try to recover some files, but I'd like a better solution that keeps filenames and at least some folder structure. Thanks in advance!
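A side note on the scary-looking "superblock bytenr 274877906944 is larger than device size" error above: btrfs keeps superblock mirrors at fixed offsets of 64KiB, 64MiB and 256GiB, and restore simply probed all three. The third offset (256GiB = 274877906944) lies beyond the end of this ~58GB partition, so that particular line is expected on any device under 256GiB, not a sign of extra damage. A quick sketch of the arithmetic:

```python
# Fixed btrfs superblock mirror offsets, with a nominal 4KiB super size.
SUPERBLOCK_OFFSETS = [64 * 1024,          # 65536      (primary)
                      64 * 1024 * 1024,   # 67108864   (first mirror)
                      256 * 1024**3]      # 274877906944 (second mirror)

def present_mirrors(device_size):
    """Offsets of the superblock copies that actually fit on the device."""
    return [off for off in SUPERBLOCK_OFFSETS if off + 4096 <= device_size]

dev_size = 58056507392                    # device size from the error above
print(present_mirrors(dev_size))          # [65536, 67108864] -- only two fit
```

So on this device only the 64KiB and 64MiB copies exist, and `btrfs rescue super -v /dev/sda4` (as Chris asked in the reply) will report on exactly those two.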
Re: [PATCH] btrfs-progs: fix cross-compile build
On 8/15/17 7:17 PM, Qu Wenruo wrote:
>
> On 2017-08-16 02:11, Eric Sandeen wrote:
>> The mktables binary needs to be built with the host
>> compiler at build time, not the target compiler, because
>> it runs at build time to generate the raid tables.
>>
>> Copy auto-fu from xfsprogs and modify the Makefile to
>> accommodate this.
>>
>> Reported-by: Hallo32
>> Signed-off-by: Eric Sandeen
>
> Looks better than my previous patch.
> With @BUILD_CFLAGS support and better BUILD_CC/CFLAGS detection for the
> native build environment.
>
> Reviewed-by: Qu Wenruo

Thanks - and sorry for missing your earlier patch; I didn't mean to ignore it. :) I just missed it.

-Eric
[PATCH 4/4 v4] btrfs: add compression trace points
From: Anand Jain

This patch adds compression and decompression trace points for the purpose of debugging.

Signed-off-by: Anand Jain
Reviewed-by: Nikolay Borisov
---
v4: Accepts David's review comments.
 . changes from unsigned long to u64.
 . format changes
v3:
 . Rename to simple names, without worrying about being compatible with
   the future naming.
 . The type was not working; fixed it.
v2:
 . Use better naming. (If transform is not good enough I have run out of
   ideas, pls suggest).
 . To be applied on top of
   git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
   (tested without namelen check patch set)

 fs/btrfs/compression.c       | 11 +++
 include/trace/events/btrfs.h | 36 
 2 files changed, 47 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index d2ef9ac2a630..4a652f67ee87 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -895,6 +895,10 @@ int btrfs_compress_pages(int type, struct address_space *mapping,
 					      start, pages, out_pages, total_in, total_out);
+
+	trace_btrfs_compress(1, 1, mapping->host, type, *total_in,
+			     *total_out, start, ret);
+
 	free_workspace(type, workspace);
 	return ret;
 }
@@ -921,6 +925,10 @@ static int btrfs_decompress_bio(struct compressed_bio *cb)
 	workspace = find_workspace(type);
 	ret = btrfs_compress_op[type - 1]->decompress_bio(workspace, cb);
+
+	trace_btrfs_compress(0, 0, cb->inode, type,
+			     cb->compressed_len, cb->len, cb->start, ret);
+
 	free_workspace(type, workspace);
 	return ret;
@@ -943,6 +951,9 @@ int btrfs_decompress(int type, unsigned char *data_in, struct page *dest_page,
 					 dest_page, start_byte, srclen, destlen);
+	trace_btrfs_compress(0, 1, dest_page->mapping->host,
+			     type, srclen, destlen, start_byte, ret);
+
 	free_workspace(type, workspace);
 	return ret;
 }
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index d412c49f5a6a..d0c0bd4fe3c2 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1629,6 +1629,42 @@ TRACE_EVENT(qgroup_meta_reserve,
 		  show_root_type(__entry->refroot), __entry->diff)
 );
 
+TRACE_EVENT(btrfs_compress,
+
+	TP_PROTO(int compress, int page, struct inode *inode, unsigned int type,
+		 u64 len_before, u64 len_after, u64 start, int ret),
+
+	TP_ARGS(compress, page, inode, type, len_before, len_after, start, ret),
+
+	TP_STRUCT__entry_btrfs(
+		__field(int,		compress)
+		__field(int,		page)
+		__field(u64,		i_ino)
+		__field(unsigned int,	type)
+		__field(u64,		len_before)
+		__field(u64,		len_after)
+		__field(u64,		start)
+		__field(int,		ret)
+	),
+
+	TP_fast_assign_btrfs(btrfs_sb(inode->i_sb),
+		__entry->compress	= compress;
+		__entry->page		= page;
+		__entry->i_ino		= inode->i_ino;
+		__entry->type		= type;
+		__entry->len_before	= len_before;
+		__entry->len_after	= len_after;
+		__entry->start		= start;
+		__entry->ret		= ret;
+	),
+
+	TP_printk_btrfs("%s %s ino=%llu type=%s len_before=%llu len_after=%llu "\
+			"start=%llu ret=%d",
+			__entry->compress ? "compress" : "decompress",
+			__entry->page ? "page" : "bio", __entry->i_ino,
+			show_compress_type(__entry->type), __entry->len_before,
+			__entry->len_after, __entry->start, __entry->ret)
+);
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.7.0
Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency
Amir,

That's a fair response. I certainly did not mean to add more work on your end :)

Using dm-log-writes for now is a reasonable approach. Like I mentioned before, I think there is further work involved in getting CrashMonkey to a useful point (where it finds at least known bugs). Once this is done, I'd be happy to rework the device_wrapper as a DM target (or perhaps as a modification of log-writes) for upstream. I'm not sure how feasible it would be to keep the in-kernel functionality simple, but we will try our best. We will keep this goal in mind as we continue development, so that we don't make any decisions that will prevent us from going the DM target route later.

Thanks,
Vijay

On Wed, Aug 16, 2017 at 3:27 PM, Amir Goldstein wrote:
> On Wed, Aug 16, 2017 at 10:06 PM, Vijay Chidambaram wrote:
>> Hi Josef,
>>
>> Thank you for the detailed reply -- I think it provides several
>> pointers for our future work. It sounds like we have a similar vision
>> for where we want this to go, though we may disagree about how to
>> implement this :) This is exciting!
>>
>> I agree that we should be building off existing work if it is a good
>> option. We might end up using log-writes, but for now we see several
>> problems:
>>
>> - The log-writes code is not documented well. As you have mentioned,
>> at this point, only you know how it works, and we are not seeing a lot
>> of adoption of log-writes by other developers either.
>>
>> - I don't think our requirements exactly match what log-writes
>> provides. For example, at some point we want to introduce checkpoints
>> so that we can correlate a crash state with file-system state at the
>> time of crash. We also want to add functionality to guide creation of
>> random crash states (see below). This might require changing
>> log-writes significantly. I don't know if that would be a good idea.
>>
>> Regarding random crashes, there is a lot of complexity there that
>> log-writes couldn't handle without significant changes. For example,
>> just randomly generating crash states and testing each state is
>> unlikely to catch bugs. We need a more nuanced way of doing this. We
>> plan to add a lot of functionality to CrashMonkey to (a) let the user
>> guide crash-state generation and (b) focus on "interesting" states (by
>> re-ordering or dropping metadata). All of this will likely require
>> adding more sophistication to the kernel module. I don't think we want
>> to take log-writes and add a lot of extra functionality.
>>
>> Regarding logging writes, I think there is a difference in approach
>> between log-writes and CrashMonkey. We don't really care about the
>> completion order, since the device may anyway re-order the writes after
>> that point. Thus, the set of crash states generated by CrashMonkey is
>> bound only by FUA and FLUSH flags. It sounds as if log-writes focuses
>> on a more restricted set of crash states.
>>
>> CrashMonkey works with the 4.4 kernel, and we will try to keep up
>> with changes to the kernel that break CrashMonkey. CrashMonkey is
>> useless without the user-space component, so users will need to
>> compile some code anyway. I do not believe it will matter much whether
>> it is in-tree or not, as long as it compiles with the latest kernel.
>>
>> Regarding discard, multi-device support, and application-level crash
>> consistency, this is on our road-map too! Our current priority is to
>> build enough scaffolding to reproduce a known crash-consistency bug
>> (such as the delayed allocation bug of ext4), and then go on and try
>> to find new bugs in newer file systems like btrfs.
>>
>> Adding CrashMonkey to the kernel is not a priority at this point (I
>> don't think CrashMonkey is useful enough yet to do so). When
>> CrashMonkey becomes useful enough, we will perhaps add the
>> device_wrapper as a DM target to enable adoption.
>>
>> Our hope currently is that developers like Ari will try out
>> CrashMonkey in its current form, which will guide us as to what
>> functionality to add to CrashMonkey to find bugs more effectively.
>
> Vijay,
>
> I can only speak for myself, but I think I represent other filesystem
> developers with this response:
> - Often with competing projects, the end result is always for the best
> when project members cooperate to combine the best of both projects.
> - Some of your project goals (e.g. user-guided crash states) sound very
> intriguing.
> - IMO you are severely underestimating the pros of mainlined
> kernel code for other developers. If you find the dm-log-writes target
> is lacking functionality, it would be MUCH better if you work to improve it.
> Even more - it would be far better if you make sure that your userspace
> tools can also work with the reduced functionality in mainline kernel.
> - If you choose to complete your academic research before crossing over
> to the existing code base, that is a reasonable choice for you to make, but
Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency
On Wed, Aug 16, 2017 at 10:06 PM, Vijay Chidambaramwrote: > Hi Josef, > > Thank you for the detailed reply -- I think it provides several > pointers for our future work. It sounds like we have a similar vision > for where we want this to go, though we may disagree about how to > implement this :) This is exciting! > > I agree that we should be building off existing work if it is a good > option. We might end up using log-writes, but for now we see several > problems: > > - The log-writes code is not documented well. As you have mentioned, > at this point, only you know how it works, and we are not seeing a lot > of adoption by other developers of log-writes as well. > > - I don't think our requirements exactly match what log-writes > provides. For example, at some point we want to introduce checkpoints > so that we can co-relate a crash state with file-system state at the > time of crash. We also want to add functionality to guide creation of > random crash states (see below). This might require changing > log-writes significantly. I don't know if that would be a good idea. > > Regarding random crashes, there is a lot of complexity there that > log-writes couldn't handle without significant changes. For example, > just randomly generating crash states and testing each state is > unlikely to catch bugs. We need a more nuanced way of doing this. We > plan to add a lot of functionality to CrashMonkey to (a) let the user > guide crash-state generation (b) focus on "interesting" states (by > re-ordering or dropping metadata). All of this will likely require > adding more sophistication to the kernel module. I don't think we want > to take log-writes and add a lot of extra functionality. > > Regarding logging writes, I think there is a difference in approach > between log-writes and CrashMonkey. We don't really care about the > completion order since the device may anyway re-order the writes after > that point. 
Thus, the set of crash states generated by CrashMonkey is > bound only by FUA and FLUSH flags. It sounds as if log-writes focuses > on a more restricted set of crash states. > > CrashMonkey works with the 4.4 kernel, and we will try and keep up > with changes to the kernel that breaks CrashMonkey. CrashMonkey is > useless without the user-space component, so users will be needing to > compile some code anyway. I do not believe it will matter much whether > it is in-tree or not, as long as it compiles with the latest kernel. > > Regarding discard, multi-device support, and application-level crash > consistency, this is on our road-map too! Our current priority is to > build enough scaffolding to reproduce a known crash-consistency bug > (such as the delayed allocation bug of ext4), and then go on and try > to find new bugs in newer file systems like btrfs. > > Adding CrashMonkey into the kernel is not a priority at this point (I > don't think CrashMonkey is useful enough at this point to do so). When > CrashMonkey becomes useful enough to do so, we will perhaps add the > device_wrapper as a DM target to enable adoption. > > Our hope currently is that developers like Ari will try out > CrashMonkey in its current form, which will guide us as to what > functionality to add to CrashMonkey to find bugs more effectively. > Vijay, I can only speak for myself, but I think I represent other filesystem developers with this response: - Often with competing projects the end results is always for the best when project members cooperate to combine the best of both projects. - Some of your project goals (e.g. user guided crash states) sound very intriguing - IMO you are severely underestimating the pros in mainlined kernel code for other developers. If you find the dm-log-writes target is lacking functionality it would be MUCH better if you work to improve it. 
Even more - it would be far better if you make sure that your userspace tools can also work with the reduced functionality in the mainline kernel. - If you choose to complete your academic research before crossing over to the existing code base, that is a reasonable choice for you to make, but the reasonable choice for me to make is to try Josef's tools from his repo (even if not documented) and *only* if they don't meet my needs will I make the extra effort to try out CrashMonkey. - AFAIK the state of filesystem crash consistency testing tools is not so bright (maybe except in Facebook ;) ), so my priority is to get *some* automated testing tools in motion. In any case, I'm glad this discussion started and I hope it will expedite the adoption of crash testing tools. I wish you all the best with your project. Amir. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] btrfs: convert enum btrfs_compression_type to define
On 08/16/2017 09:59 PM, David Sterba wrote:
> On Sun, Aug 13, 2017 at 12:02:42PM +0800, Anand Jain wrote:
>> There isn't a huge list to manage the types, which can be managed with defines. It helps to easily print the types in tracing as well.
> We use enums in a lot of places, I'd rather keep it as it is.
This patch converts all of them, and it was at only one place. I hope I didn't miss any. Further, the next patch 3/4 needs it to be a define instead of an enum; handling enums in tracing isn't as easy as defines. Thanks, Anand
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Wed, Aug 16, 2017 at 8:01 AM, Qu Wenruo wrote: > BTW, when Fujitsu tested the postgresql workload on btrfs, the result is > quite interesting. > > For HDD, when the number of clients is low, btrfs shows an obvious performance > drop. > And the problem seems to be mandatory metadata COW, which leads to > superblock FUA updates. > And when the number of clients grows, the difference between btrfs and other fses > gets much smaller; the bottleneck is the HDD itself. > > While for SSD, when the number of clients is low, btrfs has almost the same > performance as other fses; nodatacow/nodatasum only provides a marginal > difference. > But when the number of clients grows, btrfs falls far behind other fses. > The reason seems to be related to how postgresql commits its transactions: > it always fsyncs its journal sequentially, without concurrency. I wonder to what degree fsync is used as a hammer for a problem that needs more granular indicators to solve, like fadvise() and even extending it? But I'm also curious how the behaviors you report change when combining SSD and HDD via either dm-cache or bcache. Do the worst aspects of SSD and HDD get muted in that case? Or do the worst aspects become even worse across the board? -- Chris Murphy
Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency
Hi Josef, Thank you for the detailed reply -- I think it provides several pointers for our future work. It sounds like we have a similar vision for where we want this to go, though we may disagree about how to implement this :) This is exciting! I agree that we should be building off existing work if it is a good option. We might end up using log-writes, but for now we see several problems: - The log-writes code is not documented well. As you have mentioned, at this point, only you know how it works, and we are not seeing a lot of adoption by other developers of log-writes as well. - I don't think our requirements exactly match what log-writes provides. For example, at some point we want to introduce checkpoints so that we can co-relate a crash state with file-system state at the time of crash. We also want to add functionality to guide creation of random crash states (see below). This might require changing log-writes significantly. I don't know if that would be a good idea. Regarding random crashes, there is a lot of complexity there that log-writes couldn't handle without significant changes. For example, just randomly generating crash states and testing each state is unlikely to catch bugs. We need a more nuanced way of doing this. We plan to add a lot of functionality to CrashMonkey to (a) let the user guide crash-state generation (b) focus on "interesting" states (by re-ordering or dropping metadata). All of this will likely require adding more sophistication to the kernel module. I don't think we want to take log-writes and add a lot of extra functionality. Regarding logging writes, I think there is a difference in approach between log-writes and CrashMonkey. We don't really care about the completion order since the device may anyway re-order the writes after that point. Thus, the set of crash states generated by CrashMonkey is bound only by FUA and FLUSH flags. It sounds as if log-writes focuses on a more restricted set of crash states. 
CrashMonkey works with the 4.4 kernel, and we will try and keep up with changes to the kernel that breaks CrashMonkey. CrashMonkey is useless without the user-space component, so users will be needing to compile some code anyway. I do not believe it will matter much whether it is in-tree or not, as long as it compiles with the latest kernel. Regarding discard, multi-device support, and application-level crash consistency, this is on our road-map too! Our current priority is to build enough scaffolding to reproduce a known crash-consistency bug (such as the delayed allocation bug of ext4), and then go on and try to find new bugs in newer file systems like btrfs. Adding CrashMonkey into the kernel is not a priority at this point (I don't think CrashMonkey is useful enough at this point to do so). When CrashMonkey becomes useful enough to do so, we will perhaps add the device_wrapper as a DM target to enable adoption. Our hope currently is that developers like Ari will try out CrashMonkey in its current form, which will guide us as to what functionality to add to CrashMonkey to find bugs more effectively. Thanks, Vijay On Wed, Aug 16, 2017 at 8:06 AM, Josef Bacikwrote: > On Tue, Aug 15, 2017 at 08:44:16PM -0500, Vijay Chidambaram wrote: >> Hi Amir, >> >> I neglected to mention this earlier: CrashMonkey does not require >> recompiling the kernel (it is a stand-alone kernel module), and has >> been tested with the kernel 4.4. It should work with future kernel >> versions as long as there are no changes to the bio structure. >> >> As it is, I believe CrashMonkey is compatible with the current kernel. >> It certainly provides functionality beyond log-writes (the ability to >> replay a subset of writes between FLUSH/FUA), and we intend to add >> more functionality in the future. >> >> Right now, CrashMonkey does not do random sampling among possible >> crash states -- it will simply test a given number of unique states. 
>> Thus, right now I don't think it is very effective in finding >> crash-consistency bugs. But the entire infrastructure to profile a >> workload, construct crash states, and test them with fsck is present. >> >> I'd be grateful if you could try it and give us feedback on what make >> testing easier/more useful for you. As I mentioned before, this is a >> work-in-progress, so we are happy to incorporate feedback. >> > > Sorry I was travelling yesterday so I couldn't give this my full attention. > Everything you guys do is already accomplished with dm-log-writes. If you > look > at the example scripts I've provided > > https://github.com/josefbacik/log-writes/blob/master/replay-individual-faster.sh > https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh > > The first initiates the replay, and points at the second script to run after > each entry is replayed. The whole point of this stuff was to make it as > flexible as possible. The way we use it is to replay, create a snapshot of > the >
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Wed, Aug 16, 2017 at 09:53:57AM -0400, Austin S. Hemmelgarn wrote: > > So apart from some central DBs for the storage management system > > itself, CoW is mostly no issue for us. > > But I've talked to some friends at the local super computing centre and > > they have rather general issues with CoW at their virtualisation > > cluster. > > Like SUSE's snapper making many snapshots leading the storage images of > > VMs apparently to explode (in terms of space usage). > SUSE is a pathological case of brain-dead defaults. Snapper needs to > either die or have some serious sense beaten into it. When you turn off > the automatic snapshot generation for everything but updates and set the > retention policy to not keep almost everything, it's actually not bad at > all. The defaults for timeline are really bad; the partition is almost never big enough to hold 10 months worth of data updates, not to speak of 10 years. A rolling distro can fill the space even with the daily or weekly settings set to low numbers. But certain people had a different opinion and I was not successful in changing that. The least I did was to document some of the use cases and the hints that could allow one to have a bit more understanding of the effects. https://github.com/kdave/btrfsmaintenance#tuning-periodic-snapshotting > > For some of their storage backends there simply seems to be no de- > > duplication available (or other reasons that prevent its usage). > If the snapshots are being CoW'ed, then dedupe won't save them any > space. Also, nodatacow is inherently at odds with reflinks used for dedupe. > > > > From that I'd guess there would still be people who want the nice > > features of btrfs (snapshots, checksumming, etc.), while still being > > able to nodatacow in specific cases. > Snapshots work fine with nodatacow: each block gets CoW'ed once when > it's first written to, and then goes back to being NOCOW. 
The only > caveat is that you probably want to defrag either once everything has > been rewritten, or right after the snapshot.
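For reference, the timeline retention tuning discussed above lives in snapper's per-config file (e.g. under /etc/snapper/configs/). The values below are illustrative only; the variable names follow snapper's documented config format, but check your installed version before relying on them:

```ini
# /etc/snapper/configs/root -- illustrative retention settings, not defaults.
# Keep the timeline, but retain far fewer snapshots than snapper's defaults.
TIMELINE_CREATE="yes"
TIMELINE_CLEANUP="yes"
TIMELINE_LIMIT_HOURLY="5"
TIMELINE_LIMIT_DAILY="7"
TIMELINE_LIMIT_WEEKLY="0"
TIMELINE_LIMIT_MONTHLY="2"
TIMELINE_LIMIT_YEARLY="0"
# Cap the number of pre/post snapshots kept from package updates.
NUMBER_CLEANUP="yes"
NUMBER_LIMIT="10"
```

With limits like these a rolling distro is far less likely to fill the filesystem with retained snapshot data.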
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Thu, Aug 03, 2017 at 08:08:59PM +0200, waxhead wrote: > BTRFS's biggest problem is not that there are some bits and pieces that > are thoroughly screwed up (raid5/6 (which just got some fixes by the > way)), but the fact that the documentation is rather dated. > > There is a simple status page here > https://btrfs.wiki.kernel.org/index.php/Status > > As others have pointed out already, the explanations on the status page > are not exactly good. For example, compression (which was also mentioned) > is, as of writing this, marked as 'Mostly ok' '(needs verification and > source) - auto repair and compression may crash' > > Now, I am aware that many use compression without trouble. I am not sure > how many have compression with disk issues and don't have trouble, > but I would at least expect to see more people yelling on the mailing > list if that were the case. The problem here is that this message is > rather scary and certainly does NOT sound like 'mostly ok' to most people. > > What exactly needs verification and source? The 'mostly ok' statement or > something else?! A more detailed explanation would be required here to > avoid scaring people away. > > Same thing with the trim feature that is marked OK. It clearly says > that it has performance implications. It is marked OK, so one would > expect it not to cause the filesystem to fail, but if the performance > becomes so slow that the filesystem gets practically unusable it is of > course not "OK". The relevant information is missing for people to make > a decent choice and I certainly don't know how serious these performance > implications are, if they are at all relevant... I'll try to restructure the page so it reflects the status of the features from more aspects, like overall/performance/"known bad scenarios". The in-row notes are probably a bad idea as they are short on details; the section under the table will be better for that. 
> Most people interested in BTRFS are probably a bit more paranoid and > concerned about their data than the average computer user. What people > tend to forget is that other filesystems have NONE of the redundancy, > auto-repair and other fancy features that BTRFS has. So for the > compression example above... if you run compressed files on ext4 and > your disk gets some corruption, you are in no better a state than you > would be with btrfs (in fact probably worse). Also, nothing is > stopping you from putting btrfs DUP on an mdadm raid5 or 6, which means you > should be VERY safe. > > Simple documentation is the key so HERE ARE MY DEMANDS!!!. ehhh > so here is what I think should be done: > > 1. The documentation needs to either be improved (or old non-relevant > stuff simply removed / archived somewhere) Agreed, this happens from time to time. > 2. The status page MUST always be up to date for the latest kernel > release (It's ok so far, let's hope nobody sleeps here) I'm watching over the page. It's been locked from edits so there's a mandatory review of the new contents; the update process is documented on the page. > 3. Proper explanations must be given so the layman and reasonably > technical people understand the risks / issues for non-ok stuff. This can be hard, the audience are both technical and non-technical users. The page is supposed to give a quick overview; the more detailed information is either in the notes or on separate pages linked from there. I believe this structure should be able to cover what you need, but the actual content hasn't been written and there are not enough people willing/capable of writing it. > 4. There should be links to roadmaps for each feature on the status page > that clearly state what is being worked on for the NEXT kernel release We've tried something like that in the past; the page got out of sync with reality over time and was deleted. 
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
[ ... ] >>> Snapshots work fine with nodatacow, each block gets CoW'ed >>> once when it's first written to, and then goes back to being >>> NOCOW. >>> The only caveat is that you probably want to defrag either >>> once everything has been rewritten, or right after the >>> snapshot. >> I thought defrag would unshare the reflinks? > Which is exactly why you might want to do it. It will get rid > of the overhead of the single CoW operation, and it will make > sure there is minimal fragmentation. > IOW, when mixing NOCOW and snapshots, you either have to use > extra space, or you deal with performance issues. Aside from > that though, it works just fine and has no special issues as > compared to snapshots without NOCOW. The above illustrates my guess as to why RHEL 7.4 dropped Btrfs support, which is: * RHEL is sold to managers who want to minimize the cost of upgrades and sysadm skills. * Every time a customer creates a ticket, RH profits fall. * RH had adopted 'ext3' because it was an in-place upgrade from 'ext2' and "just worked", 'ext4' because it was an in-place upgrade from 'ext3' and was supposed to "just work", and then was looking at Btrfs as an in-place upgrade from 'ext4', and presumably also a replacement for MD RAID, that would "just work". * 'ext4' (and XFS before that) already created trouble a few years ago because of the 'O_PONIES' controversy. * Not only does Btrfs still have "challenges" as to multi-device functionality (and in-place upgrades from 'ext4' have "challenges" too), it has many "special cases" that need skill and discretion to handle, because it tries to cover so many different cases, and the first thing many a RH customer would do is to create a ticket to ask what to do, or how to fix a choice already made. 
Try to imagine the impact on the RH ticketing system of a switch from 'ext4' to Btrfs, with explanations like the above, about NOCOW, defrag, snapshots, balance, reflinks, and the exact order in which they have to be performed for best results.
Re: [PATCH v2 0/7] add sanity check for extent inline ref type
On Wed, Aug 16, 2017 at 04:53:15PM +0200, David Sterba wrote: > On Mon, Aug 07, 2017 at 03:55:24PM -0600, Liu Bo wrote: > > An invalid extent inline ref type could be read from a btrfs image and > > it ends up with a panic[1], this set is to deal with the insane value > > gracefully in patch 1-2 and clean up BUG() in the code in patch 3-6. > > > > Patch 7 adds one more check to see if the ref is a valid shared one. > > > > I'm not sure in the real world what may result in this corruption, but > > I've seen several reports on the ML about __btrfs_free_extent saying > > something was missing (or simply wrong), while testing this set with > > btrfs-corrupt-block, I found that switching ref type could end up in that > > situation as well, eg. a data extent's ref type > > (BTRFS_EXTENT_DATA_REF_KEY) is switched to (BTRFS_TREE_BLOCK_REF_KEY). > > Hopefully this can give people more insights next time when that > > happens. > > > > [1]:https://www.spinics.net/lists/linux-btrfs/msg65646.html > > The series looks good to me overall, there are some minor comments. The > use of WARN(1, ...) will lack the common message prefix identifying the > filesystem, so I suggest to use the btrfs_err helper and consider if the > WARN_ON(1) is really useful in that place. Most of them look like that. > > in patch btrfs_inline_ref_types, rename it to btrfs_inline_ref_type, so > it's in line with other similar definitions. Sounds good, I'll update them then. Thanks, -liubo
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
[ ... ] > But I've talked to some friend at the local super computing > centre and they have rather general issues with CoW at their > virtualisation cluster. Amazing news! :-) > Like SUSE's snapper making many snapshots leading the storage > images of VMs apparently to explode (in terms of space usage). Well, this could be an argument that some of your friends are being "challenged" by running the storage systems of a "super computing centre" and that they could become "more prepared" about system administration, for example as to the principle "know which tool to use for which workload". Or else it could be an argument that they expect Btrfs to do their job while they watch cat videos from the intertubes. :-)
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
> We use the crcs to catch storage gone wrong, [ ... ] And that's an opportunistically feasible idea given that current CPUs can do that in real-time. > [ ... ] It's possible to protect against all three without COW, > but all solutions have their own tradeoffs and this is the setup > we chose. It's easy to trust and easy to debug and at scale that > really helps. Indeed all filesystem designs have pathological workloads, and system administrators and applications developers who are "more prepared" know which one is best for which workload, or try to figure it out. > Some databases also crc, and all drives have correction bits of > some kind. There's nothing wrong with crcs happening at lots > of layers. Well, there is: in theory checksumming should be end-to-end, that is entirely application level, so applications that don't need it don't pay the price, but having it done at other layers can help the very many applications that don't do it and should do it, and it is cheap, and can help when troubleshooting exactly where the problem is. It is an opportunistic thing to do. > [ ... ] My real goal is to make COW fast enough that we can > leave it on for the database applications too. Obviously I > haven't quite finished that one yet ;) [ ... ] And this worries me because it portends the usual "marketing" goal of making Btrfs all things to all workloads, the "OpenStack of filesystems", with little consideration for complexity, maintainability, or even sometimes reality. The reality is that all known storage media have hugely anisotropic performance envelopes, both as to functionality, cost, speed, reliability, and there is no way to have an automagic filesystem that "just works" in all cases, despite the constant demands for one by "less prepared" storage administrators and application developers. 
The reality is also that if one such filesystem could automagically adapt to cover optimally the performance envelopes of every possible device and workload, it would be so complex as to be unmaintainable in practice. So Btrfs, in its base "Rodeh" functionality, with COW, checksums, subvolumes, snapshots, *on a single device*, works pretty well and reliably and it is already very useful, for most workloads. Some people also like some of its exotic complexities like in-place compression and defragmentation, but they come at a high cost. For workloads that inflict lots of small random in-place updates on storage, like tablespaces for DBMSes etc, perhaps simpler, less featureful storage abstraction layers are more appropriate, from OCFS2 to simple DM/LVM2 LVs, and Btrfs NOCOW approximates them well. BTW as to the specifics of DBMSes and filesystems, there is a classic paper making eminently reasonable, practical suggestions that have been ignored for only 35 years and some: %A M. R. Stonebraker %T Operating system support for database management %J CACM %V 24 %D JUL 1981 %P 412-418
Re: [PATCH] btrfs: copy fsid to super_block s_uuid
On Tue, Aug 01, 2017 at 06:35:08PM +0800, Anand Jain wrote: > We didn't copy the fsid to struct super_block.s_uuid, so overlayfs disables > the index feature with btrfs as the lower FS. > > kernel: overlayfs: fs on '/lower' does not support file handles, falling back > to index=off. > > Fix this by publishing the fsid through struct super_block.s_uuid. > > Signed-off-by: Anand Jain > --- > I tried to find out whether we deliberately missed this for some reason, > but there is no information on that. If we mount a non-default subvol in > the next mount/remount, it's still the same FS, so publishing the FSID > instead of the subvol uuid is correct; I can't think of any other reason for > not using s_uuid for btrfs. I think that setting s_uuid is the last missing bit. Overlay needs the file handle encoding support from the lower filesystem, which is supported. Filling in the whole filesystem id is correct; the subvolume id is encoded in the file handle buffer from inside btrfs_encode_fh. From that point I think the patch is ok, but I haven't tested it.
[PATCH] btrfs: Remove unused sectorsize variable from struct map_lookup
This variable was added in 1abe9b8a138c ("Btrfs: add initial tracepoint support for btrfs"), yet it never really got used, only assigned to. So let's remove it.

Signed-off-by: Nikolay Borisov
---
 fs/btrfs/volumes.c | 2 --
 fs/btrfs/volumes.h | 1 -
 2 files changed, 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f93ac3d7e997..47a0cb1dcc5e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4836,7 +4836,6 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 				j * stripe_size;
 		}
 	}
-	map->sector_size = info->sectorsize;
 	map->stripe_len = raid_stripe_len;
 	map->io_align = raid_stripe_len;
 	map->io_width = raid_stripe_len;
@@ -6491,7 +6490,6 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
 	map->num_stripes = num_stripes;
 	map->io_width = btrfs_chunk_io_width(leaf, chunk);
 	map->io_align = btrfs_chunk_io_align(leaf, chunk);
-	map->sector_size = btrfs_chunk_sector_size(leaf, chunk);
 	map->stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
 	map->type = btrfs_chunk_type(leaf, chunk);
 	map->sub_stripes = btrfs_chunk_sub_stripes(leaf, chunk);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6f45fd60d15a..d0193e795dc2 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -353,7 +353,6 @@ struct map_lookup {
 	int io_align;
 	int io_width;
 	u64 stripe_len;
-	int sector_size;
 	int num_stripes;
 	int sub_stripes;
 	struct btrfs_bio_stripe stripes[];
-- 
2.7.4
[PATCH] btrfs: expose internal free space tree routine only if sanity tests are enabled
The internal free space tree management routines are always exposed for testing purposes. Make them dependent on CONFIG_BTRFS_FS_RUN_SANITY_TESTS being on so that they are exposed only when they really have to be.

Signed-off-by: Nikolay Borisov
---
 fs/btrfs/free-space-tree.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/free-space-tree.h b/fs/btrfs/free-space-tree.h
index 54ffced3bce8..ba3787df43c3 100644
--- a/fs/btrfs/free-space-tree.h
+++ b/fs/btrfs/free-space-tree.h
@@ -44,7 +44,7 @@ int remove_from_free_space_tree(struct btrfs_trans_handle *trans,
 				struct btrfs_fs_info *fs_info,
 				u64 start, u64 size);
 
-/* Exposed for testing. */
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 struct btrfs_free_space_info *
 search_free_space_info(struct btrfs_trans_handle *trans,
 		       struct btrfs_fs_info *fs_info,
@@ -68,5 +68,6 @@ int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
 				  struct btrfs_path *path);
 int free_space_test_bit(struct btrfs_block_group_cache *block_group,
 			struct btrfs_path *path, u64 offset);
+#endif
 
 #endif
-- 
2.7.4
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 2017-08-16 10:11, Christoph Anton Mitterer wrote: On Wed, 2017-08-16 at 09:53 -0400, Austin S. Hemmelgarn wrote: Go try BTRFS on top of dm-integrity, or on a system with T10-DIF or T13-EPP support When dm-integrity is used... would that be enough for btrfs to do a proper repair in the RAID+nodatacow case? I assume it can't do repairs now there, because how should it know which copy is valid. dm-integrity is functionally a 1:1 mapping target (it uses a secondary device for storing the integrity info, but it requires one table per target). It takes one backing device, and gives one mapped device. The setup I'm suggesting would involve putting that on each device that you have BTRFS configured to use. When the checksum there fails, you get a read error (AFAIK at least), which will trigger the regular BTRFS recovery code just like a failed checksum. So in this case, it should recover just fine if one copy is bogus (assuming it's a media issue and not something between the block device and the filesystem). In all honesty, putting BTRFS on dm-integrity is going to be slow. If you can find some T10 DIF or T13 EPP hardware, that will almost certainly be faster. (which you should have access to given the amount of funding CERN gets) Hehe, CERN may get that funding (I don't know),... but the universities rather don't ;-) Point taken, I often forget that funding isn't exactly distributed in the most obvious ways. Except it isn't clear with nodatacow, because it might be a false positive. Sure, never claimed the opposite... just that I'd expect this to be less likely than the other way round, and less of a problem in practise. Any number of hardware failures or errors can cause the same net effect as an unclean shutdown, and even some much more complicated issues (a loose data cable to a storage device is probably one of the best examples, as it's trivial to explain and not as rare as most people think). SUSE is a pathological case of brain-dead defaults. 
Snapper needs to either die or have some serious sense beaten into it. When you turn off the automatic snapshot generation for everything but updates and set the retention policy to not keep almost everything, it's actually not bad at all. Well, still, with CoW (unless you have some form of deduplication, which in e.g. their use case would have to be on the layers below btrfs), your storage usage will probably grow more significantly than without. Yes, and for most VM use cases I would advocate not using BTRFS snapshots inside the VM and instead using snapshot functionality in the VM software itself. That still has performance issues in some cases, but at least it's easier to see where the data is actually being used. And as you've mentioned yourself in the other mail, there's still the issue with fragmentation. Snapshots work fine with nodatacow, each block gets CoW'ed once when it's first written to, and then goes back to being NOCOW. The only caveat is that you probably want to defrag either once everything has been rewritten, or right after the snapshot. I thought defrag would unshare the reflinks? Which is exactly why you might want to do it. It will get rid of the overhead of the single CoW operation, and it will make sure there is minimal fragmentation. IOW, when mixing NOCOW and snapshots, you either have to use extra space, or you deal with performance issues. Aside from that though, it works just fine and has no special issues as compared to snapshots without NOCOW.
Re: [PATCH v2 0/7] add sanity check for extent inline ref type
On Mon, Aug 07, 2017 at 03:55:24PM -0600, Liu Bo wrote: > An invalid extent inline ref type could be read from a btrfs image and > it ends up with a panic[1], this set is to deal with the insane value > gracefully in patch 1-2 and clean up BUG() in the code in patch 3-6. > > Patch 7 adds one more check to see if the ref is a valid shared one. > > I'm not sure in the real world what may result in this corruption, but > I've seen several reports on the ML about __btrfs_free_extent saying > something was missing (or simply wrong), while testing this set with > btrfs-corrupt-block, I found that switching ref type could end up in that > situation as well, eg. a data extent's ref type > (BTRFS_EXTENT_DATA_REF_KEY) is switched to (BTRFS_TREE_BLOCK_REF_KEY). > Hopefully this can give people more insights next time when that > happens. > > [1]:https://www.spinics.net/lists/linux-btrfs/msg65646.html The series looks good to me overall, there are some minor comments. The use of WARN(1, ...) will lack the common message prefix identifying the filesystem, so I suggest to use the btrfs_err helper and consider if the WARN_ON(1) is really useful in that place. Most of them look like that. in patch btrfs_inline_ref_types, rename it to btrfs_inline_ref_type, so it's in line with other similar definitions.
Re: [PATCH 4/4 v3] btrfs: add compression trace points
On Sun, Aug 13, 2017 at 12:02:44PM +0800, Anand Jain wrote: > From: Anand Jain> > This patch adds compression and decompression trace points for the > purpose of debugging. > > Signed-off-by: Anand Jain > Reviewed-by: Nikolay Borisov > --- > v3: > . Rename to a simple names, without worrying about being >compatible with the future naming. > . The type was not working fixed it. > v2: > . Use better naming. >(If transform is not good enough I have run out of ideas, pls suggest). > . To be applied on top of >git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next >(tested without namelen check patch set) > fs/btrfs/compression.c | 11 +++ > include/trace/events/btrfs.h | 39 +++ > 2 files changed, 50 insertions(+) > > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c > index d2ef9ac2a630..4a652f67ee87 100644 > --- a/fs/btrfs/compression.c > +++ b/fs/btrfs/compression.c > @@ -895,6 +895,10 @@ int btrfs_compress_pages(int type, struct address_space > *mapping, > start, pages, > out_pages, > total_in, total_out); > + > + trace_btrfs_compress(1, 1, mapping->host, type, *total_in, > + *total_out, start, ret); > + > free_workspace(type, workspace); > return ret; > } > @@ -921,6 +925,10 @@ static int btrfs_decompress_bio(struct compressed_bio > *cb) > > workspace = find_workspace(type); > ret = btrfs_compress_op[type - 1]->decompress_bio(workspace, cb); > + > + trace_btrfs_compress(0, 0, cb->inode, type, > + cb->compressed_len, cb->len, cb->start, ret); > + > free_workspace(type, workspace); > > return ret; > @@ -943,6 +951,9 @@ int btrfs_decompress(int type, unsigned char *data_in, > struct page *dest_page, > dest_page, start_byte, > srclen, destlen); > > + trace_btrfs_compress(0, 1, dest_page->mapping->host, > + type, srclen, destlen, start_byte, ret); > + > free_workspace(type, workspace); > return ret; > } > diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h > index d412c49f5a6a..db33d6649d12 100644 > --- 
a/include/trace/events/btrfs.h > +++ b/include/trace/events/btrfs.h > @@ -1629,6 +1629,45 @@ TRACE_EVENT(qgroup_meta_reserve, > show_root_type(__entry->refroot), __entry->diff) > ); > > +TRACE_EVENT(btrfs_compress, > + > + TP_PROTO(int compress, int page, struct inode *inode, > + unsigned int type, > + unsigned long len_before, unsigned long len_after, > + unsigned long start, int ret), > + > + TP_ARGS(compress, page, inode, type, len_before, > + len_after, start, ret), > + > + TP_STRUCT__entry_btrfs( > + __field(int,compress) > + __field(int,page) > + __field(ino_t, i_ino) u64 for the inode number > + __field(unsigned int, type) > + __field(unsigned long, len_before) > + __field(unsigned long, len_after) > + __field(unsigned long, start) and u64 here > + __field(int,ret) > + ), > + > + TP_fast_assign_btrfs(btrfs_sb(inode->i_sb), > + __entry->compress = compress; > + __entry->page = page; > + __entry->i_ino = inode->i_ino; > + __entry->type = type; > + __entry->len_before = len_before; > + __entry->len_after = len_after; > + __entry->start = start; > + __entry->ret= ret; > + ), > + > + TP_printk_btrfs("%s %s ino=%lu type=%s len_before=%lu len_after=%lu > start=%lu ret=%d", The format looks good, although I'm not sure we need to make the distinction between page and bio compression. This also needs the extra argument for the tracepoint. > + __entry->compress ? "compress":"uncompress", decompress > + __entry->page ? "page":"bio", __entry->i_ino, add spaces around : > + show_compress_type(__entry->type), > + __entry->len_before, __entry->len_after, __entry->start, > + __entry->ret) > +); > #endif /* _TRACE_BTRFS_H */ > > /* This part must be outside protection */ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Wed, 2017-08-16 at 09:53 -0400, Austin S. Hemmelgarn wrote: > Go try BTRFS on top of dm-integrity, or on a > system with T10-DIF or T13-EPP support When dm-integrity is used... would that be enough for btrfs to do a proper repair in the RAID+nodatacow case? I assume it can't do repairs there now, because how would it know which copy is valid. > (which you should have access to > given the amount of funding CERN gets) Hehe, CERN may get that funding (I don't know),... but the universities rather don't ;-) > Except it isn't clear with nodatacow, because it might be a false > positive. Sure, never claimed the opposite... just that I'd expect this to be less likely than the other way round, and less of a problem in practice. > SUSE is a pathological case of brain-dead defaults. Snapper needs to > either die or have some serious sense beat into it. When you turn > off > the automatic snapshot generation for everything but updates and set > the > retention policy to not keep almost everything, it's actually not bad > at > all. Well, still, with CoW (unless you have some form of deduplication, which in e.g. their use case would have to be on the layers below btrfs), your storage usage will probably grow more significantly than without. And as you've mentioned yourself in the other mail, there's still the issue with fragmentation. > Snapshots work fine with nodatacow, each block gets CoW'ed once when > it's first written to, and then goes back to being NOCOW. The only > caveat is that you probably want to defrag either once everything > has > been rewritten, or right after the snapshot. I thought defrag would unshare the reflinks? Cheers, Chris.
Re: [PATCH v2] btrfs: use appropriate define for the fsid
On Sat, Jul 29, 2017 at 05:50:09PM +0800, Anand Jain wrote: > Though BTRFS_FSID_SIZE and BTRFS_UUID_SIZE are of the same size, > for the purpose of doing it correctly use BTRFS_FSID_SIZE instead. > > Signed-off-by: Anand Jain Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] btrfs: decode compress type for tracing
On Sun, Aug 13, 2017 at 12:02:43PM +0800, Anand Jain wrote: > So with this we now see the compression type as a string. > > Signed-off-by: Anand Jain Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 2017年08月16日 21:12, Chris Mason wrote: On Mon, Aug 14, 2017 at 09:54:48PM +0200, Christoph Anton Mitterer wrote: On Mon, 2017-08-14 at 11:53 -0400, Austin S. Hemmelgarn wrote: Quite a few applications actually _do_ have some degree of secondary verification or protection from a crash. Go look at almost any database software. Then please give proper references for this! This is from 2015, where you claimed this already and I looked up all the bigger DBs and they either couldn't do it at all, didn't do it by default, or it required application support (i.e. from the programs using the DB) https://www.spinics.net/lists/linux-btrfs/msg50258.html It usually will not have checksumming, but it will almost always have support for a journal, which is enough to cover the particular data loss scenario we're talking about (unexpected unclean shutdown). I don't think we talk about this: We talk about people wanting checksumming to notice e.g. silent data corruption. The crash case is only the corner case about what happens then if data is written correctly but csums not. We use the crcs to catch storage gone wrong, both in terms of simple things like cabling, bus errors, drives gone crazy or exotic problems like every time I reboot the box a handful of sectors return EFI partition table headers instead of the data I wrote. You don't need data center scale for this to happen, but it does help... So, we do catch crc errors in prod and they do keep us from replicating bad data over good data. Some databases also crc, and all drives have correction bits of some kind. There's nothing wrong with crcs happening at lots of layers. Btrfs couples the crcs with COW because it's the least complicated way to protect against: * bits flipping * IO getting lost on the way to the drive, leaving stale but valid data in place * IO from sector A going to sector B instead, overwriting valid data with other valid data.
It's possible to protect against all three without COW, but all solutions have their own tradeoffs and this is the setup we chose. It's easy to trust and easy to debug and at scale that really helps. In general, production storage environments prefer clearly defined errors when the storage has the wrong data. EIOs happen often, and you want to be able to quickly pitch the bad data and replicate in good data. Btrfs csum is really good, especially for cases like RAID1/5/6 where the csum can provide extra info about which mirror/stripe/parity can be trusted, with minimal space wasted. The DM layer should really have the ability to verify its data at that point the way btrfs does. My real goal is to make COW fast enough that we can leave it on for the database applications too. Yes, most of the complexity of nodatasum/nodatacow comes from those special workloads. BTW, when Fujitsu tested the postgresql workload on btrfs, the result was quite interesting. For HDD, when the number of clients is low, btrfs shows an obvious performance drop. And the problem seems to be mandatory metadata COW, which leads to superblock FUA updates. And as the number of clients grows, the difference between btrfs and other fses gets much smaller; the bottleneck is the HDD itself. While for SSD, when the number of clients is low, btrfs has almost the same performance as other fses, and nodatacow/nodatasum only provides a marginal difference. But when the number of clients grows, btrfs falls far behind other fses. The reason seems to be related to how postgresql commits its transactions: it always fsyncs its journal sequentially, without concurrency. Because Btrfs needs to wait for its data writes before updating its log tree, most of its time is wasted waiting on data IO. In that case, nodatacow does improve the performance, by allowing btrfs to update its log tree without waiting on data IO. But in both cases, CoW itself, like allocating new extents or calculating csums, is not the main cause to slow down btrfs.
That is to say, nodatacow is not as important as we used to think. If we can get rid of nodatacow/nodatasum, there will be much less for us developers to consider, and fewer related bugs. Thanks, Qu Obviously I haven't quite finished that one yet ;) But I'd rather keep the building block of all the other btrfs features in place than try to do crcs differently. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] btrfs: convert enum btrfs_compression_type to define
On Sun, Aug 13, 2017 at 12:02:42PM +0800, Anand Jain wrote: > There isn't a huge list to manage the types, which can be managed > with defines. It helps to easily print the types in tracing as well. We use enums in a lot of places, I'd rather keep it as it is. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] btrfs: remove unused BTRFS_COMPRESS_LAST
On Sun, Aug 13, 2017 at 12:02:41PM +0800, Anand Jain wrote: > We aren't using this define, so removing it. > > Signed-off-by: Anand Jain Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 2017-08-16 09:12, Chris Mason wrote: My real goal is to make COW fast enough that we can leave it on for the database applications too. Obviously I haven't quite finished that one yet ;) But I'd rather keep the building block of all the other btrfs features in place than try to do crcs differently. In general, the performance issue isn't because of the time it takes to CoW the blocks, it's because of the fragmentation it introduces. That fragmentation could in theory be mitigated by making CoW happen at a larger chunk size, but that would push the issue more towards being one of CoW performance, not fragmentation. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
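Austin's tradeoff can be made concrete with a toy model (pure illustration, not btrfs code or its allocator): CoW at a larger granularity collapses many scattered small relocations into fewer, larger ones, so fragmentation drops, but each write copies more data.

```python
# Toy model: count relocated regions (fragments) and bytes copied when
# scattered 4k writes are CoW'ed at different granularities.

def cow_stats(write_offsets, cow_unit):
    """Return (distinct relocated regions, total bytes copied) for 4k writes."""
    regions = {off // cow_unit for off in write_offsets}
    return len(regions), len(regions) * cow_unit

offsets = [i * 4096 for i in range(0, 1024, 7)]   # scattered 4k writes
frag_4k, copied_4k = cow_stats(offsets, 4096)
frag_1m, copied_1m = cow_stats(offsets, 1 << 20)

assert frag_1m < frag_4k       # fewer fragments at coarse granularity...
assert copied_1m > copied_4k   # ...at the cost of copying more data per write
```

Which is exactly the shift Austin describes: the pain moves from seek-bound fragmentation to the raw cost of the CoW copies themselves.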
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 2017-08-16 09:31, Christoph Anton Mitterer wrote: Just out of curiosity: On Wed, 2017-08-16 at 09:12 -0400, Chris Mason wrote: Btrfs couples the crcs with COW because this (which sounds like you want it to stay coupled that way)... plus It's possible to protect against all three without COW, but all solutions have their own tradeoffs and this is the setup we chose. It's easy to trust and easy to debug and at scale that really helps. ... this (which sounds more like you think the checksumming is so helpful, that it would be nice in the nodatacow as well). What does that mean now? Things will stay as they are... or it may become a goal to get checksumming for nodatacow (while of course still retaining the possibility to disable both, datacow AND checksumming)? It means that you have other options if you want this so badly that you need to keep pestering the developers about it but can't be arsed to try to code it yourself. Go try BTRFS on top of dm-integrity, or on a system with T10-DIF or T13-EPP support (which you should have access to given the amount of funding CERN gets), or even on a ZFS zvol if you're crazy enough. It works wonderfully in the first two cases, and reliably (but not efficiently) in the third, and all of them provide exactly what you want, plus the bonus that they do a slightly better job of differentiating between media and memory errors. In general, production storage environments prefer clearly defined errors when the storage has the wrong data. EIOs happen often, and you want to be able to quickly pitch the bad data and replicate in good data. Which would also rather point towards getting clear EIOs (and thus checksumming) in the nodatacow case. Except it isn't clear with nodatacow, because it might be a false positive. My real goal is to make COW fast enough that we can leave it on for the database applications too.
Obviously I haven't quite finished that one yet ;) Well the question is, even if you manage that sooner or later, will everyone be fully satisfied by this?! I've mentioned earlier on the list that I manage one of the many big data/computing centres for LHC. Our use case is typically big plain storage servers connected via some higher level storage management system (http://dcache.org/)... with mostly write once/read many. So apart from some central DBs for the storage management system itself, CoW is mostly no issue for us. But I've talked to some friend at the local super computing centre and they have rather general issues with CoW at their virtualisation cluster. Like SUSE's snapper making many snapshots, apparently leading the storage images of VMs to explode (in terms of space usage). SUSE is a pathological case of brain-dead defaults. Snapper needs to either die or have some serious sense beat into it. When you turn off the automatic snapshot generation for everything but updates and set the retention policy to not keep almost everything, it's actually not bad at all. For some of their storage backends there simply seem to be no deduplication available (or other reasons that prevent its usage). If the snapshots are being CoW'ed, then dedupe won't save them any space. Also, nodatacow is inherently at odds with reflinks used for dedupe. From that I'd guess there would still be people who want the nice features of btrfs (snapshots, checksumming, etc.), while still being able to nodatacow in specific cases. Snapshots work fine with nodatacow, each block gets CoW'ed once when it's first written to, and then goes back to being NOCOW. The only caveat is that you probably want to defrag either once everything has been rewritten, or right after the snapshot. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
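The CoW-once behavior described above is easy to model (a toy sketch, not btrfs code): after a snapshot every block of a NOCOW file is shared, the first write to a shared block has to relocate it, and from then on writes to that block go in place again.

```python
# Toy model of NOCOW + snapshots: a shared block is CoW'ed exactly once,
# then reverts to in-place overwrites. cow_count is a proxy for the
# fragmentation that the post-snapshot defrag is meant to clean up.

class NocowFile:
    def __init__(self, nblocks):
        self.shared = [False] * nblocks   # shared with a snapshot?
        self.cow_count = 0                # blocks relocated so far

    def snapshot(self):
        self.shared = [True] * len(self.shared)

    def write(self, block):
        if self.shared[block]:
            self.cow_count += 1           # relocate: the one-time CoW
            self.shared[block] = False
        # else: plain NOCOW overwrite in place

f = NocowFile(4)
f.write(0); f.write(0)
assert f.cow_count == 0          # before a snapshot: pure NOCOW
f.snapshot()
f.write(0); f.write(0); f.write(1)
assert f.cow_count == 2          # each block CoW'ed at most once
```

The "extra space or performance issues" tradeoff falls out of the model: either you defrag (copying data, using space) or you live with the relocated blocks.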
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
Just out of curiosity: On Wed, 2017-08-16 at 09:12 -0400, Chris Mason wrote: > Btrfs couples the crcs with COW because this (which sounds like you want it to stay coupled that way)... plus > It's possible to protect against all three without COW, but all > solutions have their own tradeoffs and this is the setup we > chose. It's > easy to trust and easy to debug and at scale that really helps. ... this (which sounds more like you think the checksumming is so helpful, that it would be nice in the nodatacow as well). What does that mean now? Things will stay as they are... or it may become a goal to get checksumming for nodatacow (while of course still retaining the possibility to disable both, datacow AND checksumming)? > In general, production storage environments prefer clearly defined > errors when the storage has the wrong data. EIOs happen often, and > you > want to be able to quickly pitch the bad data and replicate in good > data. Which would also rather point towards getting clear EIOs (and thus checksumming) in the nodatacow case. > My real goal is to make COW fast enough that we can leave it on for > the > database applications too. Obviously I haven't quite finished that > one > yet ;) Well the question is, even if you manage that sooner or later, will everyone be fully satisfied by this?! I've mentioned earlier on the list that I manage one of the many big data/computing centres for LHC. Our use case is typically big plain storage servers connected via some higher level storage management system (http://dcache.org/)... with mostly write once/read many. So apart from some central DBs for the storage management system itself, CoW is mostly no issue for us. But I've talked to some friend at the local super computing centre and they have rather general issues with CoW at their virtualisation cluster. Like SUSE's snapper making many snapshots, apparently leading the storage images of VMs to explode (in terms of space usage).
For some of their storage backends there simply seem to be no deduplication available (or other reasons that prevent its usage). From that I'd guess there would still be people who want the nice features of btrfs (snapshots, checksumming, etc.), while still being able to nodatacow in specific cases. > But I'd rather keep the building block of all the other btrfs > features in place than try to do crcs differently. Mhh I see, what a pity. Cheers, Chris.
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Mon, Aug 14, 2017 at 09:54:48PM +0200, Christoph Anton Mitterer wrote: On Mon, 2017-08-14 at 11:53 -0400, Austin S. Hemmelgarn wrote: Quite a few applications actually _do_ have some degree of secondary verification or protection from a crash. Go look at almost any database software. Then please give proper references for this! This is from 2015, where you claimed this already and I looked up all the bigger DBs and they either couldn't do it at all, didn't do it by default, or it required application support (i.e. from the programs using the DB) https://www.spinics.net/lists/linux-btrfs/msg50258.html It usually will not have checksumming, but it will almost always have support for a journal, which is enough to cover the particular data loss scenario we're talking about (unexpected unclean shutdown). I don't think we talk about this: We talk about people wanting checksumming to notice e.g. silent data corruption. The crash case is only the corner case about what happens then if data is written correctly but csums not. We use the crcs to catch storage gone wrong, both in terms of simple things like cabling, bus errors, drives gone crazy or exotic problems like every time I reboot the box a handful of sectors return EFI partition table headers instead of the data I wrote. You don't need data center scale for this to happen, but it does help... So, we do catch crc errors in prod and they do keep us from replicating bad data over good data. Some databases also crc, and all drives have correction bits of some kind. There's nothing wrong with crcs happening at lots of layers. Btrfs couples the crcs with COW because it's the least complicated way to protect against: * bits flipping * IO getting lost on the way to the drive, leaving stale but valid data in place * IO from sector A going to sector B instead, overwriting valid data with other valid data.
It's possible to protect against all three without COW, but all solutions have their own tradeoffs and this is the setup we chose. It's easy to trust and easy to debug and at scale that really helps. In general, production storage environments prefer clearly defined errors when the storage has the wrong data. EIOs happen often, and you want to be able to quickly pitch the bad data and replicate in good data. My real goal is to make COW fast enough that we can leave it on for the database applications too. Obviously I haven't quite finished that one yet ;) But I'd rather keep the building block of all the other btrfs features in place than try to do crcs differently. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
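The misdirected-write case Chris lists is the subtle one: sector A's perfectly valid contents land at sector B, so B holds stale-but-valid data that any plain validity check would accept. A per-block checksum kept separately from the data catches it. A toy sketch (illustrative only, nothing like the btrfs implementation, which stores csums in a dedicated tree):

```python
# Per-block crc catching a misdirected write: the payload is valid data,
# just not the data this block is supposed to hold.
import zlib

class ChecksummedStore:
    def __init__(self, nblocks, bs=4096):
        self.data = [bytes(bs) for _ in range(nblocks)]
        self.csum = [zlib.crc32(b) for b in self.data]

    def write(self, n, buf):
        self.data[n] = buf
        self.csum[n] = zlib.crc32(buf)

    def read(self, n):
        buf = self.data[n]
        if zlib.crc32(buf) != self.csum[n]:
            raise IOError("csum mismatch on block %d" % n)  # a clean EIO
        return buf

s = ChecksummedStore(2)
s.write(0, b"A" * 4096)
s.write(1, b"B" * 4096)
# Misdirected write: block 0's payload lands on block 1 behind our back.
s.data[1] = b"A" * 4096
try:
    s.read(1)
    raise AssertionError("stale-but-valid data went undetected")
except IOError:
    pass  # the crc catches it and the caller gets a clearly defined error
```

This is the "clearly defined error" property: the reader gets an EIO it can act on (pitch the bad copy, replicate a good one) rather than silently propagating wrong data.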
Re: [PATCH] btrfs: Fix -EOVERFLOW handling in btrfs_ioctl_tree_search_v2
On Fri, Aug 04, 2017 at 02:41:18PM +0300, Nikolay Borisov wrote: > The buffer passed to the btrfs_ioctl_tree_search* functions has to be at least > sizeof(struct btrfs_ioctl_search_header). If this is not the case then the > ioctl should return -EOVERFLOW and set the uarg->buf_size to the minimum > required size. Currently btrfs_ioctl_tree_search_v2 would return an -EOVERFLOW > error with ->buf_size being set to the value passed by user space. Fix this by > removing the size check and relying on search_ioctl, which already includes it > and correctly sets buf_size. > > Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
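The fixed contract can be sketched in a few lines. This is an illustrative model, not the kernel code; the 32-byte figure matches sizeof(struct btrfs_ioctl_search_header) on common 64-bit builds (three __u64 plus two __u32 fields), but treat it as an assumption here:

```python
# Model of the corrected ioctl contract: on a too-small buffer, fail with
# EOVERFLOW and report the minimum usable size back through buf_size,
# instead of echoing the caller's own value.
import errno

SEARCH_HEADER_SIZE = 32  # assumed sizeof(struct btrfs_ioctl_search_header)

def tree_search_v2(buf_size):
    """Return (ret, buf_size_out), mimicking the in/out buf_size field."""
    if buf_size < SEARCH_HEADER_SIZE:
        return -errno.EOVERFLOW, SEARCH_HEADER_SIZE  # tell caller the minimum
    return 0, buf_size

ret, size = tree_search_v2(16)
assert ret == -errno.EOVERFLOW and size == SEARCH_HEADER_SIZE
ret, size = tree_search_v2(4096)
assert ret == 0 and size == 4096
```

With the pre-fix behavior, the failing call would have handed the caller's own too-small value back, giving user space no way to size a retry.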
Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency
On Tue, Aug 15, 2017 at 08:44:16PM -0500, Vijay Chidambaram wrote: > Hi Amir, > > I neglected to mention this earlier: CrashMonkey does not require > recompiling the kernel (it is a stand-alone kernel module), and has > been tested with the kernel 4.4. It should work with future kernel > versions as long as there are no changes to the bio structure. > > As it is, I believe CrashMonkey is compatible with the current kernel. > It certainly provides functionality beyond log-writes (the ability to > replay a subset of writes between FLUSH/FUA), and we intend to add > more functionality in the future. > > Right now, CrashMonkey does not do random sampling among possible > crash states -- it will simply test a given number of unique states. > Thus, right now I don't think it is very effective in finding > crash-consistency bugs. But the entire infrastructure to profile a > workload, construct crash states, and test them with fsck is present. > > I'd be grateful if you could try it and give us feedback on what make > testing easier/more useful for you. As I mentioned before, this is a > work-in-progress, so we are happy to incorporate feedback. > Sorry I was travelling yesterday so I couldn't give this my full attention. Everything you guys do is already accomplished with dm-log-writes. If you look at the example scripts I've provided https://github.com/josefbacik/log-writes/blob/master/replay-individual-faster.sh https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh The first initiates the replay, and points at the second script to run after each entry is replayed. The whole point of this stuff was to make it as flexible as possible. The way we use it is to replay, create a snapshot of the replay, mount, unmount, fsck, delete the snapshot and carry on to the next position in the log. There is nothing keeping us from generating random crash points, this has been something on my list of things to do forever. 
All that would be required would be to hold the entries between flush/fua events in memory, and then replay them in whatever order you deemed fit. That's the only functionality missing from my replay-log stuff that CrashMonkey has. The other part of this is getting user space applications to do more thorough checking of the consistency that they expect, which I implemented here https://github.com/josefbacik/fstests/commit/70d41e17164b2afc9a3f2ae532f084bf64cb4a07 fsx will randomly do operations to a file, and every time it fsync()'s it saves its state and marks the log. Then we can go back and replay the log to the mark and md5sum the file to make sure it matches the saved state. This infrastructure was meant to be as simple as possible so the possibilities for crash consistency testing were endless. One of the next areas we plan to use this at Facebook is just for application consistency, so we can replay the fs and verify the application works in whatever state the fs is at any given point. I looked at your code and you are logging entries at submit time, not completion time. The reason I do those crazy acrobatics is because we have had bugs in previous kernels where we were not waiting for io completion of important metadata before writing out the super block, so logging only at completion allows us to catch that class of problems. The other thing CrashMonkey is missing is DISCARD support. We fuck up discard support constantly, and being able to replay discards to make sure we're not discarding important data is very important. I'm not trying to shit on your project, obviously it's a good idea, that's why I did it years ago ;). The community is going to use what is easiest to use, and modprobe dm-log-writes is a lot easier than compiling and insmod'ing an out of tree driver. Also your driver won't work on upstream kernels because of the way the bio flags were changed recently, which is why we prefer using upstream solutions.
If you guys want to get this stuff used then it would be better at this point to build on top of what we already have. Just off the top of my head we need 1) Random replay support for replay-log. This is probably a day or two worth of work for a student. 2) Documentation, because right now I'm the only one who knows how this works. 3) My patches need to actually be pushed into upstream fstests. This would be the largest win because then all the fs developers would be running the tests by default. 4) Multi-device support. One thing that would be good to have and is a dream of mine is to connect multiple devices to one log, so we can do things like verify mdraid or btrfs raid consistency. We could do super evil things like only replay one device, or replay alternating writes on each device. This would be a larger project but would be super helpful. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
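The replay loop Josef describes (replay one log entry, run a checker, continue) can be sketched abstractly. This toy model is not dm-log-writes or replay-log, just the idea behind them: every prefix of the logged write stream is a possible crash state, and an invariant (the stand-in for fsck or an application checker) is tested against each one.

```python
# Toy log-writes replay: the disk state after each logged entry is a
# crash state a checker can inspect.

def replay_states(log):
    """Yield the disk state (sector -> data) after each logged entry."""
    disk = {}
    for op, *args in log:
        if op == "write":
            sector, data = args
            disk[sector] = data
        # "flush" entries carry no data; they only order the stream
        yield dict(disk)          # a crash could happen right here

log = [
    ("write", 0, "super v1"),
    ("flush",),
    ("write", 8, "journal"),
    ("write", 0, "super v2"),
]

states = list(replay_states(log))
# Checker: the new superblock must never be visible before its journal.
for state in states:
    if state.get(0) == "super v2":
        assert state.get(8) == "journal"
```

Random crash-state sampling, as CrashMonkey does, then amounts to replaying the entries between two flush/FUA points in arbitrary order before running the same checker.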
Re: slow btrfs with a single kworker process using 100% CPU
Am 16.08.2017 um 14:29 schrieb Konstantin V. Gavrilenko: > Roman, initially I had a single process occupying 100% CPU, when sysrq it was > indicating as "btrfs_find_space_for_alloc" > but that's when I used the autodefrag, compress, forcecompress and commit=10 > mount flags and space_cache was v1 by default. > when I switched to "relatime,compress-force=zlib,space_cache=v2" the 100% cpu > has disappeared, but the shite performance remained. space_cache=v2 is not supported by the openSUSE kernel - but I compile the kernel myself anyway. Is there a patchset to add support for space_cache=v2? Greets, Stefan > > As to the chunk size, there is no information in the article about the type > of data that was used. While in our case we are pretty certain about the > compressed block size (32-128). I am currently inclining towards 32k as it > might be ideal in a situation when we have a 5 disk raid5 array. > > In theory > 1. The minimum compressed write (32k) would fill the chunk on a single disk, > thus the IO cost of the operation would be 2 reads (original chunk + original > parity) and 2 writes (new chunk + new parity) > > 2. The maximum compressed write (128k) would require the update of 1 chunk on > each of the 4 data disks + 1 parity write > > > > Stefan what mount flags do you use? > > kos > > > > - Original Message - > From: "Roman Mamedov"> To: "Konstantin V. Gavrilenko" > Cc: "Stefan Priebe - Profihost AG" , "Marat Khalili" > , linux-btrfs@vger.kernel.org, "Peter Grandi" > > Sent: Wednesday, 16 August, 2017 2:00:03 PM > Subject: Re: slow btrfs with a single kworker process using 100% CPU > > On Wed, 16 Aug 2017 12:48:42 +0100 (BST) > "Konstantin V. Gavrilenko" wrote: > >> I believe the chunk size of 512kb is even worse for performance than the >> default settings on my HW RAID of 256kb. > > It might be, but that does not explain the original problem reported at all.
> If mdraid performance would be the bottleneck, you would see high iowait, > possibly some CPU load from the mdX_raidY threads. But not a single Btrfs > thread pegging into 100% CPU. > >> So now I am moving the data from the array and will be rebuilding it with 64 >> or 32 chunk size and checking the performance. > > 64K is the sweet spot for RAID5/6: > http://louwrentius.com/linux-raid-level-and-chunk-size-the-benchmarks.html > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: slow btrfs with a single kworker process using 100% CPU
Am 16.08.2017 um 14:29 schrieb Konstantin V. Gavrilenko:
> Roman, initially I had a single process occupying 100% CPU; when I ran sysrq
> it was indicating "btrfs_find_space_for_alloc", but that's when I used the
> autodefrag, compress, forcecompress and commit=10 mount flags and
> space_cache was v1 by default. When I switched to
> "relatime,compress-force=zlib,space_cache=v2" the 100% CPU disappeared, but
> the poor performance remained.
>
> As to the chunk size, there is no information in the article about the type
> of data that was used, while in our case we are pretty certain about the
> compressed block size (32-128k). I am currently inclining towards 32k, as it
> might be ideal in a situation where we have a 5-disk raid5 array.
>
> In theory:
> 1. The minimum compressed write (32k) would fill the chunk on a single disk;
>    thus the IO cost of the operation would be 2 reads (original chunk +
>    original parity) and 2 writes (new chunk + new parity).
> 2. The maximum compressed write (128k) would require the update of 1 chunk
>    on each of the 4 data disks + 1 parity write.
>
> Stefan, what mount flags do you use?

noatime,compress-force=zlib,noacl,space_cache,skip_balance,subvolid=5,subvol=/

Greets,
Stefan

> kos
>
> ----- Original Message -----
> From: "Roman Mamedov"
> To: "Konstantin V. Gavrilenko"
> Cc: "Stefan Priebe - Profihost AG", "Marat Khalili",
>     linux-btrfs@vger.kernel.org, "Peter Grandi"
> Sent: Wednesday, 16 August, 2017 2:00:03 PM
> Subject: Re: slow btrfs with a single kworker process using 100% CPU
>
> On Wed, 16 Aug 2017 12:48:42 +0100 (BST)
> "Konstantin V. Gavrilenko" wrote:
>
>> I believe the chunk size of 512kb is even worse for performance than the
>> default settings on my HW RAID of 256kb.
>
> It might be, but that does not explain the original problem reported at all.
> If mdraid performance were the bottleneck, you would see high iowait and
> possibly some CPU load from the mdX_raidY threads, but not a single Btrfs
> thread pegging at 100% CPU.
>
>> So now I am moving the data from the array and will be rebuilding it with
>> 64 or 32 chunk size and checking the performance.
>
> 64K is the sweet spot for RAID5/6:
> http://louwrentius.com/linux-raid-level-and-chunk-size-the-benchmarks.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: slow btrfs with a single kworker process using 100% CPU
Roman, initially I had a single process occupying 100% CPU; when I ran sysrq
it was indicating "btrfs_find_space_for_alloc", but that's when I used the
autodefrag, compress, forcecompress and commit=10 mount flags and space_cache
was v1 by default. When I switched to
"relatime,compress-force=zlib,space_cache=v2" the 100% CPU disappeared, but
the poor performance remained.

As to the chunk size, there is no information in the article about the type
of data that was used, while in our case we are pretty certain about the
compressed block size (32-128k). I am currently inclining towards 32k, as it
might be ideal in a situation where we have a 5-disk raid5 array.

In theory:
1. The minimum compressed write (32k) would fill the chunk on a single disk;
   thus the IO cost of the operation would be 2 reads (original chunk +
   original parity) and 2 writes (new chunk + new parity).
2. The maximum compressed write (128k) would require the update of 1 chunk on
   each of the 4 data disks + 1 parity write.

Stefan, what mount flags do you use?

kos

----- Original Message -----
From: "Roman Mamedov"
To: "Konstantin V. Gavrilenko"
Cc: "Stefan Priebe - Profihost AG", "Marat Khalili",
    linux-btrfs@vger.kernel.org, "Peter Grandi"
Sent: Wednesday, 16 August, 2017 2:00:03 PM
Subject: Re: slow btrfs with a single kworker process using 100% CPU

On Wed, 16 Aug 2017 12:48:42 +0100 (BST)
"Konstantin V. Gavrilenko" wrote:

> I believe the chunk size of 512kb is even worse for performance than the
> default settings on my HW RAID of 256kb.

It might be, but that does not explain the original problem reported at all.
If mdraid performance were the bottleneck, you would see high iowait and
possibly some CPU load from the mdX_raidY threads, but not a single Btrfs
thread pegging at 100% CPU.

> So now I am moving the data from the array and will be rebuilding it with
> 64 or 32 chunk size and checking the performance.

64K is the sweet spot for RAID5/6:
http://louwrentius.com/linux-raid-level-and-chunk-size-the-benchmarks.html

--
With respect,
Roman
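[Editorial note: the two "in theory" cases above can be checked with a small IO-cost model. This is a sketch under the usual textbook assumptions — read-modify-write (RMW) for partial-stripe updates, and a full-stripe write that computes parity fresh with no reads; real md behavior also depends on the stripe cache. The function name and defaults are illustrative, not from any real API.]

```python
import math

def raid5_io_ops(write_kb, chunk_kb=32, data_disks=4):
    """Rough IO-operation count for one aligned write on a RAID5 set with
    `data_disks` data chunks per stripe, assuming RMW for partial stripes
    and no reads for a full-stripe write (parity computed from new data)."""
    chunks_touched = math.ceil(write_kb / chunk_kb)
    if chunks_touched >= data_disks:
        # Full stripe: write every data chunk plus the parity chunk.
        return {"reads": 0, "writes": data_disks + 1}
    # Partial stripe: read old data chunk(s) + old parity,
    # then write new data chunk(s) + new parity.
    return {"reads": chunks_touched + 1, "writes": chunks_touched + 1}

# Konstantin's two cases on a 5-disk (4 data + parity) array with 32k chunks:
print(raid5_io_ops(32))    # minimum compressed write -> 2 reads, 2 writes
print(raid5_io_ops(128))   # maximum compressed write -> 0 reads, 5 writes
```

Both of the claims in the mail fall out of this model: the 32k write costs 2 reads + 2 writes, and the 128k write touches all 4 data chunks plus parity with no reads.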
Re: slow btrfs with a single kworker process using 100% CPU
On Wed, 16 Aug 2017 12:48:42 +0100 (BST)
"Konstantin V. Gavrilenko" wrote:

> I believe the chunk size of 512kb is even worse for performance than the
> default settings on my HW RAID of 256kb.

It might be, but that does not explain the original problem reported at all.
If mdraid performance were the bottleneck, you would see high iowait and
possibly some CPU load from the mdX_raidY threads, but not a single Btrfs
thread pegging at 100% CPU.

> So now I am moving the data from the array and will be rebuilding it with
> 64 or 32 chunk size and checking the performance.

64K is the sweet spot for RAID5/6:
http://louwrentius.com/linux-raid-level-and-chunk-size-the-benchmarks.html

--
With respect,
Roman
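[Editorial note: Roman's diagnostic — high iowait points at the disks, a CPU-pegged kernel thread points at the filesystem — can be quantified from `/proc/stat`. A minimal sketch; the field order is taken from proc(5), and the helper names are made up for illustration.]

```python
import time

def cpu_times(statline):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict.
    Field order per proc(5): user nice system idle iowait irq softirq."""
    fields = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")
    parts = statline.split()
    return dict(zip(fields, map(int, parts[1:1 + len(fields)])))

def iowait_fraction(t0, t1):
    """Fraction of elapsed CPU time spent in iowait between two samples."""
    total = sum(t1[k] - t0[k] for k in t0)
    return (t1["iowait"] - t0["iowait"]) / total if total else 0.0

def sample_iowait(interval=1.0):
    """Sample /proc/stat twice and return the iowait fraction (Linux only)."""
    with open("/proc/stat") as f:
        t0 = cpu_times(f.readline())
    time.sleep(interval)
    with open("/proc/stat") as f:
        t1 = cpu_times(f.readline())
    return iowait_fraction(t0, t1)
```

A high value here with idle btrfs threads would implicate mdraid; a near-zero value while one kworker burns 100% CPU matches the behavior reported in this thread.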
Re: slow btrfs with a single kworker process using 100% CPU
I believe the chunk size of 512kb is even worse for performance than the
default settings on my HW RAID of 256kb. Peter Grandi explained it earlier on
in one of his posts:

QTE
++
That runs counter to this simple story: suppose a program is doing 64KiB IO:

* For *reads*, there are 4 data drives and the strip size is 16KiB: the 64KiB
  will be read in parallel from 4 drives. If the strip size is 256KiB then
  the 64KiB will be read sequentially from just one disk, and 4 successive
  reads will be read sequentially from the same drive.

* For *writes* on a parity RAID like RAID5 things are much, much more
  extreme: the 64KiB will be written with 16KiB strips on a 5-wide RAID5 set
  in parallel to 5 drives, with 4 stripes being updated with RMW. But with
  256KiB strips it will partially update 5 drives, because the stripe is
  1024+256KiB, and it needs to do RMW, and four successive 64KiB writes will
  need to do that too, even if only one drive is updated. Usually for RAID5
  there is an optimization that means that only the specific target drive and
  the parity drive(s) need RMW, but it is still very expensive.

This is the "storage for beginners" version; what happens in practice however
depends a lot on the specific workload profile (typical read/write sizes,
latencies and rates) and on the caching and queueing algorithms in both Linux
and the HA firmware.
++
UNQTE

I've also found another explanation of the same problem with the right chunk
size and how it works here:
http://holyhandgrenade.org/blog/2011/08/disk-performance-part-2-raid-layouts-and-stripe-sizing/#more-1212

So in my understanding, when working with compressed data, your compressed
data will vary between 128kb (urandom) and 32kb (zeroes), and that is what
will be passed to the FS to take care of. In our setup with large chunk
sizes, if we need to write 32kb-128kb of compressed data, the RAID5 would
need to perform 3 read operations and 2 write operations, as updating a
parity chunk requires either:

- the original chunk, the new chunk, and the old parity block, or
- all chunks (except for the parity chunk) in the stripe.

disk        disk1   disk2   disk3   disk4   disk5
chunk size  512kb   512kb   512kb   512kb   P

So in the worst-case scenario, in order to write 32kb, RAID5 would need to
read (480 + 512 + P512) and then write (32 + P512).

That's my current understanding of the situation. I was planning to write an
update to my story later on, once I hopefully solve the problem. But an
intermediate update is that I have performed a full defrag with full
compression (2 days), then a balance of all the data (10 days), and it didn't
help the performance. So now I am moving the data from the array and will be
rebuilding it with 64 or 32 chunk size and checking the performance.

VG,
kos

----- Original Message -----
From: "Stefan Priebe - Profihost AG"
To: "Konstantin V. Gavrilenko"
Cc: "Marat Khalili", linux-btrfs@vger.kernel.org
Sent: Wednesday, 16 August, 2017 11:26:38 AM
Subject: Re: slow btrfs with a single kworker process using 100% CPU

Am 16.08.2017 um 11:02 schrieb Konstantin V. Gavrilenko:
> Could be a similar issue as what I had recently, with the RAID5 and 256kb
> chunk size.
> please provide more information about your RAID setup.

Hope this helps:

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md0 : active raid5 sdd1[1] sdf1[4] sdc1[0] sde1[2]
      11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] []
      bitmap: 6/30 pages [24KB], 65536KB chunk

md2 : active raid5 sdm1[2] sdl1[1] sdk1[0] sdn1[4]
      11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] []
      bitmap: 7/30 pages [28KB], 65536KB chunk

md1 : active raid5 sdi1[2] sdg1[0] sdj1[4] sdh1[1]
      11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] []
      bitmap: 7/30 pages [28KB], 65536KB chunk

md3 : active raid5 sdp1[1] sdo1[0] sdq1[2] sdr1[4]
      11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] []
      bitmap: 6/30 pages [24KB], 65536KB chunk

# btrfs fi usage /vmbackup/
Overall:
    Device size:                  43.65TiB
    Device allocated:             31.98TiB
    Device unallocated:           11.67TiB
    Device missing:                  0.00B
    Used:                         30.80TiB
    Free (estimated):             12.84TiB  (min: 12.84TiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB  (used: 0.00B)

Data,RAID0: Size:31.83TiB, Used:30.66TiB
   /dev/md0        7.96TiB
   /dev/md1        7.96TiB
   /dev/md2        7.96TiB
   /dev/md3        7.96TiB

Metadata,RAID0: Size:153.00GiB, Used:141.34GiB
   /dev/md0       38.25GiB
   /dev/md1       38.25GiB
   /dev/md2       38.25GiB
   /dev/md3       38.25GiB

System,RAID0: Size:128.00MiB, Used:2.28MiB
   /dev/md0
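[Editorial note: the two parity-update strategies Konstantin lists are the classic RAID5 "read-modify-write" and "reconstruct-write" paths. The sketch below compares their byte costs for a sub-chunk write; it is a simplified model (parity updated at chunk granularity, no caching), and the function name and defaults are illustrative rather than Konstantin's exact worst-case arithmetic.]

```python
def parity_update_bytes(write_kb, chunk_kb=512, data_disks=4):
    """Bytes moved by the two RAID5 parity-update strategies when writing
    write_kb into a single chunk of one stripe (data_disks data chunks + 1
    parity chunk, each chunk_kb in size)."""
    # Read-modify-write: read old data + old parity,
    # write new data + new parity.
    rmw = {"read": write_kb + chunk_kb,
           "write": write_kb + chunk_kb}
    # Reconstruct-write: read every *other* data chunk in the stripe,
    # recompute parity from scratch, write new data + new parity.
    rcw = {"read": (data_disks - 1) * chunk_kb,
           "write": write_kb + chunk_kb}
    return {"rmw": rmw, "rcw": rcw}

# A 32kb compressed write into a 512kb-chunk, 4+1 RAID5 stripe:
print(parity_update_bytes(32))
# rmw: read 544kb / write 544kb; rcw: read 1536kb / write 544kb
```

Either way, a 32kb logical write turns into hundreds of kilobytes of physical IO at 512kb chunks, which is the write-amplification argument for rebuilding the array with 32k or 64k chunks.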
Re: btrfs fi du -s gives Inappropriate ioctl for device
On Mon, Aug 14, 2017 at 05:40:30PM -0600, Chris Murphy wrote:
> On Mon, Aug 14, 2017 at 4:57 PM, Piotr Szymaniak wrote:
>>
>> and... some issues:
>> ~ # btrfs fi du -s /mnt/red/\@backup/
>>      Total   Exclusive  Set shared  Filename
>> ERROR: cannot check space of '/mnt/red/@backup/': Inappropriate ioctl for
>> device
>
> It's a bug, but I don't know if any devs are working on a fix yet.
> The problem is that the subvolume being snapshot contains subvolumes.
> The resulting snapshot contains an empty directory in place of the
> nested subvolume(s), and that is the cause of the error.

Ok, but why, on the same btrfs, does it work on some subvols containing
subvols and not on others? If it never worked - OK, if it always worked -
OK, but this seems a bit... random?

~ # btrfs fi du -s /mnt/red/\@backup/ /mnt/red/\@backup/.snapshot/monthly_2017-08-01_05\:30\:01/ /mnt/red/\@svn/ /mnt/red/\@svn/.snapshot/weekly_2017-08-05_04\:20\:02/
     Total   Exclusive  Set shared  Filename
ERROR: cannot check space of '/mnt/red/@backup/': Inappropriate ioctl for device
ERROR: cannot check space of '/mnt/red/@backup/.snapshot/monthly_2017-08-01_05:30:01/': Inappropriate ioctl for device
  52.23GiB    10.57MiB     4.13GiB  /mnt/red/@svn/
   4.35GiB     1.03MiB     4.12GiB  /mnt/red/@svn/.snapshot/weekly_2017-08-05_04:20:02/

Best regards,
Piotr Szymaniak.
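[Editorial note: the empty placeholder directories Chris describes can be spotted from userspace. On btrfs they carry the special inode number 2 (BTRFS_EMPTY_SUBVOL_DIR_OBJECTID) — this detection trick is an editorial assumption about the on-disk convention, not something stated in the thread, and the helper name is made up.]

```python
import os

def empty_subvol_placeholders(root):
    """Walk `root` and report directories whose inode number is 2.
    Assumption: on btrfs these are the empty placeholder directories that
    a snapshot leaves in place of nested subvolumes; on other filesystems
    this check is meaningless."""
    hits = []
    for dirpath, dirnames, _files in os.walk(root):
        for d in dirnames:
            p = os.path.join(dirpath, d)
            try:
                if os.stat(p).st_ino == 2:
                    hits.append(p)
            except OSError:
                continue  # skip entries that vanish or deny access
    return hits

# e.g. empty_subvol_placeholders("/mnt/red/@backup") would list the
# placeholders that make `btrfs fi du` fail with "Inappropriate ioctl".
```

Running this over the snapshots that error out versus the ones that don't would test whether the placeholder directories really are the deciding factor.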
Re: slow btrfs with a single kworker process using 100% CPU
Am 16.08.2017 um 11:02 schrieb Konstantin V. Gavrilenko:
> Could be a similar issue as what I had recently, with the RAID5 and 256kb
> chunk size.
> please provide more information about your RAID setup.

Hope this helps:

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md0 : active raid5 sdd1[1] sdf1[4] sdc1[0] sde1[2]
      11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] []
      bitmap: 6/30 pages [24KB], 65536KB chunk

md2 : active raid5 sdm1[2] sdl1[1] sdk1[0] sdn1[4]
      11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] []
      bitmap: 7/30 pages [28KB], 65536KB chunk

md1 : active raid5 sdi1[2] sdg1[0] sdj1[4] sdh1[1]
      11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] []
      bitmap: 7/30 pages [28KB], 65536KB chunk

md3 : active raid5 sdp1[1] sdo1[0] sdq1[2] sdr1[4]
      11717406720 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] []
      bitmap: 6/30 pages [24KB], 65536KB chunk

# btrfs fi usage /vmbackup/
Overall:
    Device size:                  43.65TiB
    Device allocated:             31.98TiB
    Device unallocated:           11.67TiB
    Device missing:                  0.00B
    Used:                         30.80TiB
    Free (estimated):             12.84TiB  (min: 12.84TiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB  (used: 0.00B)

Data,RAID0: Size:31.83TiB, Used:30.66TiB
   /dev/md0        7.96TiB
   /dev/md1        7.96TiB
   /dev/md2        7.96TiB
   /dev/md3        7.96TiB

Metadata,RAID0: Size:153.00GiB, Used:141.34GiB
   /dev/md0       38.25GiB
   /dev/md1       38.25GiB
   /dev/md2       38.25GiB
   /dev/md3       38.25GiB

System,RAID0: Size:128.00MiB, Used:2.28MiB
   /dev/md0       32.00MiB
   /dev/md1       32.00MiB
   /dev/md2       32.00MiB
   /dev/md3       32.00MiB

Unallocated:
   /dev/md0        2.92TiB
   /dev/md1        2.92TiB
   /dev/md2        2.92TiB
   /dev/md3        2.92TiB

Stefan

> p.s.
> you can also check the thread "Btrfs + compression = slow performance and
> high cpu usage"
>
> ----- Original Message -----
> From: "Stefan Priebe - Profihost AG"
> To: "Marat Khalili", linux-btrfs@vger.kernel.org
> Sent: Wednesday, 16 August, 2017 10:37:43 AM
> Subject: Re: slow btrfs with a single kworker process using 100% CPU
>
> Am 16.08.2017 um 08:53 schrieb Marat Khalili:
>>> I've one system where a single kworker process is using 100% CPU;
>>> sometimes a second process comes up with 100% CPU [btrfs-transacti]. Is
>>> there anything i can do to get the old speed again or find the culprit?
>>
>> 1. Do you use quotas (qgroups)?
>
> No qgroups and no quota.
>
>> 2. Do you have a lot of snapshots? Have you deleted some recently?
>
> 1413 snapshots. I'm deleting 50 of them every night, but the btrfs-cleaner
> process isn't running / consuming CPU currently.
>
>> More info about your system would help too.
>
> Kernel is OpenSuSE Leap 42.3.
>
> btrfs is mounted with compress-force=zlib.
>
> btrfs is running as a raid0 on top of 4 md raid5 devices.
>
> Greets,
> Stefan
Re: slow btrfs with a single kworker process using 100% CPU
Could be a similar issue as what I had recently, with the RAID5 and 256kb
chunk size. Please provide more information about your RAID setup.

p.s.
you can also check the thread "Btrfs + compression = slow performance and
high cpu usage"

----- Original Message -----
From: "Stefan Priebe - Profihost AG"
To: "Marat Khalili", linux-btrfs@vger.kernel.org
Sent: Wednesday, 16 August, 2017 10:37:43 AM
Subject: Re: slow btrfs with a single kworker process using 100% CPU

Am 16.08.2017 um 08:53 schrieb Marat Khalili:
>> I've one system where a single kworker process is using 100% CPU;
>> sometimes a second process comes up with 100% CPU [btrfs-transacti]. Is
>> there anything i can do to get the old speed again or find the culprit?
>
> 1. Do you use quotas (qgroups)?

No qgroups and no quota.

> 2. Do you have a lot of snapshots? Have you deleted some recently?

1413 snapshots. I'm deleting 50 of them every night, but the btrfs-cleaner
process isn't running / consuming CPU currently.

> More info about your system would help too.

Kernel is OpenSuSE Leap 42.3.

btrfs is mounted with compress-force=zlib.

btrfs is running as a raid0 on top of 4 md raid5 devices.

Greets,
Stefan
Re: slow btrfs with a single kworker process using 100% CPU
Am 16.08.2017 um 08:53 schrieb Marat Khalili:
>> I've one system where a single kworker process is using 100% CPU;
>> sometimes a second process comes up with 100% CPU [btrfs-transacti]. Is
>> there anything i can do to get the old speed again or find the culprit?
>
> 1. Do you use quotas (qgroups)?

No qgroups and no quota.

> 2. Do you have a lot of snapshots? Have you deleted some recently?

1413 snapshots. I'm deleting 50 of them every night, but the btrfs-cleaner
process isn't running / consuming CPU currently.

> More info about your system would help too.

Kernel is OpenSuSE Leap 42.3.

btrfs is mounted with compress-force=zlib.

btrfs is running as a raid0 on top of 4 md raid5 devices.

Greets,
Stefan
Re: qcow2 images make scrub believe the filesystem is corrupted.
BTW, to determine whether it's really data corruption, you could check the
data checksum by executing "btrfs check --check-data-csum".
--check-data-csum has the limitation of skipping the remaining mirrors if
the first mirror is correct, but since your data is single, that limitation
is not a problem at all.

Or, you could also try the out-of-tree btrfs-progs with offline scrub
support:
https://github.com/adam900710/btrfs-progs/tree/offline_scrub

It should be much like a kernel-scrub equivalent in btrfs-progs. Using
"btrfs scrub start --offline" should be able to verify all checksums for
data and metadata.

If btrfs-progs reports a csum error (for data), then it's really corrupted,
and quite possibly caused by the discard mount option.

Thanks,
Qu

On 2017-08-16 10:28, Qu Wenruo wrote:
> On 2017-08-16 09:51, Paulo Dias wrote:
>> Hi, thanks for the quick answer.
>>
>> So, since i wrote this i tested this even further. First, and as you
>> predicted, if i try to cp the file to another location i get read errors:
>>
>> root@kerberos:/home/groo# cp Fedora/Fedora.qcow2 /
>> cp: error reading 'Fedora/Fedora.qcow2': Input/output error
>
> Less possible to blame scrub now. As the normal read routine also reports
> such an error, it may be a real corruption of the file.
>
>> so i used this trick:
>>
>> # modprobe nbd
>> # qemu-nbd --connect=/dev/nbd0 Fedora2.qcow2
>> # ddrescue /dev/nbd0 new_file.raw
>> # qemu-nbd --disconnect /dev/nbd0
>> # qemu-img convert -O qcow2 new_file.raw new_file.qcow2
>>
>> and sure enough i was able to recreate the qcow2, but with these errors:
>>
>> ago 15 22:19:49 kerberos kernel: block nbd0: Other side returned error (5)
>> ago 15 22:19:49 kerberos kernel: print_req_error: I/O error, dev nbd0, sector 22159872
>> ago 15 22:19:49 kerberos kernel: BTRFS warning (device sda3): csum failed root 258 ino 968837 off 17455849472 csum 0xcc028588 expected csum 0xe3338de1 mirror 1
>
> Still a csum error. And furthermore, neither the expected nor the on-disk
> csum is a special value like the crc32 of an all-zero page. So it may mean
> that it's a real corruption.
>
>> ago 15 22:19:49 kerberos kernel: block nbd0: Other side returned error (5)
>> ago 15 22:19:49 kerberos kernel: print_req_error: I/O error, dev nbd0, sector 22160016
>> ago 15 22:19:49 kerberos kernel: Buffer I/O error on dev nbd0, logical block 2770002, async page read
>> ago 15 22:19:49 kerberos kernel: BTRFS warning (device sda3): csum failed root 258 ino 968837 off 17455849472 csum 0xcc028588 expected csum 0xe3338de1 mirror 1
>
> At least we now know which inode (968837 of root 258) and file offset
> (17455849472, length 4K) is corrupted.
>
>> ago 15 22:19:49 kerberos kernel: block nbd0: Other side returned error (5)
>> ago 15 22:19:49 kerberos kernel: print_req_error: I/O error, dev nbd0, sector 22160016
>> ago 15 22:19:49 kerberos kernel: Buffer I/O error on dev nbd0, logical block 2770002, async page read
>> ago 15 22:20:47 kerberos kernel: BTRFS warning (device sda3): csum failed root 258 ino 968837 off 17455849472 csum 0xcc028588 expected csum 0xe3338de1 mirror 1
>> ago 15 22:20:47 kerberos kernel: BTRFS warning (device sda3): csum failed root 258 ino 968837 off 17455849472 csum 0xcc028588 expected csum 0xe3338de1 mirror 1
>> ago 15 22:21:32 kerberos kernel: block nbd0: NBD_DISCONNECT
>> ago 15 22:21:32 kerberos kernel: block nbd0: shutting down sockets
>>
>> i deleted the original Fedora.qcow2 and again scrub said i didn't have any
>> errors, so i wondered, could it be the raid1 code (long shot), so i moved
>> the metadata back to DUP:
>>
>> btrfs fi balance start -dconvert=single -mconvert=dup /home/
>
> OK, data is not touched. Single to single, so data chunks are not touched.
> And your metadata is always good, so no problem should happen during
> balance.
>
> BTW, if you balance data (no need to convert, just balancing all data), it
> should also report an error if my assumption is correct: some data is
> *really* corrupted.
>
>> root@kerberos:/home/groo# btrfs filesystem usage -T /home/
>> Overall:
>>     Device size:         333.50GiB
>>     Device allocated:     18.06GiB
>>     Device unallocated:  315.44GiB
>>     Device missing:          0.00B
>>     Used:                 16.25GiB
>>     Free (estimated):    315.83GiB  (min: 158.11GiB)
>>     Data ratio:               1.00
>>     Metadata ratio:           2.00
>>     Global reserve:       39.45MiB  (used: 0.00B)
>>
>>              Data      Metadata   System
>> Id Path      single    DUP        DUP       Unallocated
>> -- --------- --------- ---------- --------- -----------
>>  1 /dev/sda3  16.00GiB    2.00GiB  64.00MiB   181.94GiB
>>  2 /dev/sdb7         -          -         -   133.03GiB
>>  3 /dev/sdb8         -          -         -   488.13MiB
>> -- --------- --------- ---------- --------- -----------
>>    Total      16.00GiB    1.00GiB  32.00MiB   315.44GiB
>>    Used       15.61GiB  329.27MiB  16.00KiB
>>
>> and once again copied the NEW fedora.qcow2 back to home and reran scrub,
>> and once again i got errors:
>>
>> root@kerberos:/home/groo# btrfs scrub start -B
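[Editorial note: btrfs data checksums are CRC-32C (the Castagnoli polynomial), which is what the `csum 0xcc028588 expected csum 0xe3338de1` values in the log are. A minimal sketch for recomputing the checksum of the reported 4 KiB block yourself; whether the result matches what btrfs stores bit-for-bit depends on seed/finalization details, so treat it as a consistency check between two reads of the same block rather than an oracle. `block_csum` is an illustrative helper, not a btrfs-progs tool.]

```python
def crc32c(data, crc=0):
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    Slow but dependency-free; crc check value for b"123456789" is 0xE3069283."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def block_csum(path, offset, blocksize=4096):
    """Checksum one filesystem block of a file, e.g. the block at the
    offset the kernel log reported (ino 968837, off 17455849472)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return crc32c(f.read(blocksize))
```

Comparing `block_csum` of the rescued copy against the same offset in the original (where readable) shows whether the two reads really returned different bytes, independent of btrfs's own csum verification.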
Re: slow btrfs with a single kworker process using 100% CPU
> I've one system where a single kworker process is using 100% CPU;
> sometimes a second process comes up with 100% CPU [btrfs-transacti]. Is
> there anything i can do to get the old speed again or find the culprit?

1. Do you use quotas (qgroups)?
2. Do you have a lot of snapshots? Have you deleted some recently?

More info about your system would help too.

--
With Best Regards,
Marat Khalili
slow btrfs with a single kworker process using 100% CPU
Hello,

I've one system where a single kworker process is using 100% CPU;
sometimes a second process comes up with 100% CPU [btrfs-transacti]. Is
there anything i can do to get the old speed again or find the culprit?

Greets,
Stefan