Re: A Big Thank You, and some Notes on Current Recovery Tools.
Am Mon, 01 Jan 2018 18:13:10 +0800 schrieb Qu Wenruo: > On 2018年01月01日 08:48, Stirling Westrup wrote: >> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK >> YOU to Nikolay Borisov and most especially to Qu Wenruo! >> >> Thanks to their tireless help in answering all my dumb questions I have >> managed to get my BTRFS working again! As I speak I have the full, >> non-degraded, quad of drives mounted and am updating my latest backup >> of their contents. >> >> I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T >> drives failed, and with help I was able to make a 100% recovery of the >> lost data. I do have some observations on what I went through though. >> Take this as constructive criticism, or as a point for discussing >> additions to the recovery tools: >> >> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3 >> errors exactly coincided with the 3 super-blocks on the drive. > > WTF, why all these corruption all happens at btrfs super blocks?! > > What a coincident. Maybe it's a hybrid drive with flash? Or something that went wrong in the drive-internal cache memory the very time when superblocks where updated? I bet that the sectors aren't really broken, just the on-disk checksum didn't match the sector. I remember such things happening to me more than once back in the days when drives where still connected by molex power connectors. Those connectors started to get loose over time, due to thermals or repeated disconnect and connect. That is, drives sometimes started to no longer have a reliable power source which let to all sorts of very strange problems, mostly resulting in pseudo-defective sectors. That said, the OP would like to check the power supply after this coincidence... Maybe it's aging and no longer able to support all four drives, CPU, GPU and stuff with stable power. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs balance problems
Am Thu, 28 Dec 2017 00:39:37 + schrieb Duncan: >> I can I get btrfs balance to work in the background, without adversely >> affecting other applications? > > I'd actually suggest a different strategy. > > What I did here way back when I was still on reiserfs on spinning rust, > where it made more difference than on ssd, but I kept the settings when > I switched to ssd and btrfs, and at least some others have mentioned > that similar settings helped them on btrfs as well, is... > > Problem: The kernel virtual-memory subsystem's writeback cache was > originally configured for systems with well under a Gigabyte of RAM, and > the defaults no longer work so well on multi-GiB-RAM systems, > particularly above 8 GiB RAM, because they are based on a percentage of > available RAM, and will typically let several GiB of dirty writeback > cache accumulate before kicking off any attempt to actually write it to > storage. On spinning rust, when writeback /does/ finally kickoff, this > can result in hogging the IO for well over half a minute at a time, > where 30 seconds also happens to be the default "flush it anyway" time. This is somehow like the buffer bloat discussion for networking... Big buffers increase latency. There is more than one type of buffer. Additionally to what Duncan wrote (first type of buffer), the kernel lately got a new option to fight this "buffer bloat": writeback- throttling. It may help to enable that option. The second type of buffer is the io queue. So, you may also want to lower the io queue depth (nr_requests) of your devices. I think it defaults to 128 while most consumer drives only have a queue depth of 31 or 32 commands. Thus, reducing nr_requests for some of your devices may help you achieve better latency (but reduces throughput). Especially if working with io schedulers that do not implement io priorities, you could simply lower nr_requests to around or below the native command queue depth of your devices. The device itself can handle it better in that case, especially on spinning rust, as the firmware knows when to pull certain selected commands from the queue during a rotation of the media. The kernel knows nothing about rotary positions, it can only use the queue to prioritize and reorder requests but cannot take advantage of rotary positions of the heads. See $ grep ^ /sys/block/*/queue/nr_requests You may also get better results with increasing the nr_requests instead but at the cost of also adjusting the write buffer sizes, because with large nr_requests, you don't want blocking on writes so early, at least not when you need good latency. This probably works best for you with schedulers that care about latency, like deadline or kyber. For testing, keep in mind that everything works in dependence to each other setting. So change one at a time, take your tests, then change another and see how that relates to the first change, even when the first change made your experience worse. Another tip that's missing: Put different access classes onto different devices. That is, if you have a directory structure that's mostly written to, put it on it's own physical devices, with separate tuning and appropriate filesystem (log structured and cow filesystems are good at streaming writes). Put read mostly workloads also on their own device and filesystems. Put realtime workloads on their own device and filesystems. This gives you a much easier chance to succeed. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs allow compression on NoDataCow files? (AFAIK Not, but it does)
Am Thu, 21 Dec 2017 13:51:40 -0500 schrieb Chris Mason: > On 12/20/2017 03:59 PM, Timofey Titovets wrote: >> How reproduce: >> touch test_file >> chattr +C test_file >> dd if=/dev/zero of=test_file bs=1M count=1 >> btrfs fi def -vrczlib test_file >> filefrag -v test_file >> >> test_file >> Filesystem type is: 9123683e >> File size of test_file is 1048576 (256 blocks of 4096 bytes) >> ext: logical_offset:physical_offset: length: expected: flags: >>0:0.. 31: 72917050.. 72917081: 32: encoded >>1: 32.. 63: 72917118.. 72917149: 32: 72917082: encoded >>2: 64.. 95: 72919494.. 72919525: 32: 72917150: encoded >>3: 96.. 127: 72927576.. 72927607: 32: 72919526: encoded >>4: 128.. 159: 72943261.. 72943292: 32: 72927608: encoded >>5: 160.. 191: 72944929.. 72944960: 32: 72943293: encoded >>6: 192.. 223: 72944952.. 72944983: 32: 72944961: encoded >>7: 224.. 255: 72967084.. 72967115: 32: 72944984: >> last,encoded,eof >> test_file: 8 extents found >> >> I can't found at now, where that error happen in code, >> but it's reproducible on Linux 4.14.8 > > We'll silently cow in a few cases, this is one. I think the question was about compression, not cow. I can reproduce this behavior: $ touch nocow.dat $ touch cow.dat $ chattr +c cow.dat $ chattr +C nocow.dat $ dd if=/dev/zero of=cow.dat count=1 bs=1M $ dd if=/dev/zero of=nocow.dat count=1 bs=1M $ filefrag -v cow.dat Filesystem type is: 9123683e File size of cow.dat is 1048576 (256 blocks of 4096 bytes) ext: logical_offset:physical_offset: length: expected: flags: 0:0.. 31: 1044845154..1044845185: 32: encoded,shared 1: 32.. 63: 1044845166..1044845197: 32: 1044845186: encoded,shared 2: 64.. 95: 1044845167..1044845198: 32: 1044845198: encoded,shared 3: 96.. 127: 1044851064..1044851095: 32: 1044845199: encoded,shared 4: 128.. 159: 1044851065..1044851096: 32: 1044851096: encoded,shared 5: 160.. 191: 1044852160..1044852191: 32: 1044851097: encoded,shared 6: 192.. 223: 1044943106..1044943137: 32: 1044852192: encoded,shared 7: 224.. 255: 1045054792..1045054823: 32: 1044943138: last,encoded,shared,eof cow.dat: 8 extents found $ filefrag -v nocow.dat Filesystem type is: 9123683e File size of nocow.dat is 1048576 (256 blocks of 4096 bytes) ext: logical_offset:physical_offset: length: expected: flags: 0:0.. 255: 1196077983..1196078238:256: last,shared,eof nocow.dat: 1 extent found Now it seems to be compressed (8x 128k extents): $ filefrag -v nocow.dat Filesystem type is: 9123683e File size of nocow.dat is 1048576 (256 blocks of 4096 bytes) ext: logical_offset:physical_offset: length: expected: flags: 0:0.. 31: 1121866367..1121866398: 32: encoded,shared 1: 32.. 63: 1121866369..1121866400: 32: 1121866399: encoded,shared 2: 64.. 95: 1121866370..1121866401: 32: 1121866401: encoded,shared 3: 96.. 127: 1121866371..1121866402: 32: 1121866402: encoded,shared 4: 128.. 159: 1121866372..1121866403: 32: 1121866403: encoded,shared 5: 160.. 191: 1121866373..1121866404: 32: 1121866404: encoded,shared 6: 192.. 223: 1121866374..1121866405: 32: 1121866405: encoded,shared 7: 224.. 255: 1121866375..1121866406: 32: 1121866406: last,encoded,shared,eof nocow.dat: 8 extents found -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[4.14] WARNING: CPU: 2 PID: 779 at fs/btrfs/backref.c:1255 find_parent_nodes+0x892/0x1340
Hello! During balance I'm seeing the following in dmesg: [194123.693226] [ cut here ] [194123.693231] WARNING: CPU: 2 PID: 779 at fs/btrfs/backref.c:1255 find_parent_nodes+0x892/0x1340 [194123.693232] Modules linked in: f2fs bridge stp llc cifs ccm xfs rfcomm bnep fuse cachefiles snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic af_packet iTCO_wdt iTCO_vendor_support btusb btintel bluetooth kvm_intel snd_hda_intel rfkill snd_hda_codec kvm ecdh_generic snd_hda_core rtc_cmos lpc_ich irqbypass snd_pcm snd_timer snd soundcore uas usb_storage r8168(O) nvidia_drm(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nvidia_uvm(PO) nvidia_modeset(PO) nvidia(PO) nct6775 hwmon_vid coretemp hwmon efivarfs [194123.693252] CPU: 2 PID: 779 Comm: crawl Tainted: P O 4.14.0-pf5 #3 [194123.693252] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z68 Pro3, BIOS L2.16A 02/22/2013 [194123.693253] task: 8803e0a53840 task.stack: c90003f24000 [194123.693254] RIP: 0010:find_parent_nodes+0x892/0x1340 [194123.693255] RSP: 0018:c90003f27b30 EFLAGS: 00010286 [194123.693256] RAX: RBX: 88000bcdb3a0 RCX: 88000bef7d68 [194123.693256] RDX: 88000bef7d68 RSI: 88042f29b620 RDI: 88000bef7c60 [194123.693256] RBP: c90003f27c50 R08: 0001b620 R09: 812949c0 [194123.693257] R10: ea2f36c0 R11: 88041d003980 R12: [194123.693257] R13: 88000bef7c60 R14: 8801c1487a40 R15: 88000bef7d68 [194123.693258] FS: 7f4cae8cb700() GS:88042f28() knlGS: [194123.693258] CS: 0010 DS: ES: CR0: 80050033 [194123.693259] CR2: 7f63e7953038 CR3: 0003e190a002 CR4: 001606e0 [194123.693259] Call Trace: [194123.693262] ? kmem_cache_free+0x13d/0x170 [194123.693265] ? btrfs_find_all_roots_safe+0x89/0xf0 [194123.693266] btrfs_find_all_roots_safe+0x89/0xf0 [194123.693267] ? extent_same_check_offsets+0x60/0x60 [194123.693269] iterate_extent_inodes+0x154/0x270 [194123.693270] ? extent_same_check_offsets+0x60/0x60 [194123.693271] ? iterate_inodes_from_logical+0x81/0x90 [194123.693272] iterate_inodes_from_logical+0x81/0x90 [194123.693273] btrfs_ioctl+0x8a3/0x2420 [194123.693275] ? generic_file_read_iter+0x2c1/0x7c0 [194123.693276] ? do_vfs_ioctl+0x8a/0x5c0 [194123.693277] do_vfs_ioctl+0x8a/0x5c0 [194123.693278] ? __fget+0x62/0xa0 [194123.693279] SyS_ioctl+0x36/0x70 [194123.693281] entry_SYSCALL_64_fastpath+0x13/0x94 [194123.693283] RIP: 0033:0x3d9acf0037 [194123.693283] RSP: 002b:7f4cae8c8448 EFLAGS: 0246 ORIG_RAX: 0010 [194123.693284] RAX: ffda RBX: 000a RCX: 003d9acf0037 [194123.693284] RDX: 7f4cae8c85a8 RSI: c0389424 RDI: 0003 [194123.693285] RBP: 000a R08: R09: 0400 [194123.693285] R10: 0020 R11: 0246 R12: 000a [194123.693285] R13: R14: 000a R15: 7f4ca8080f70 [194123.693286] Code: 3d e4 f6 e2 00 4c 89 ee e8 2c 5e ed ff 44 89 f2 45 85 f6 0f 85 4a fe ff ff e9 4f fb ff ff 85 c0 0f 84 22 fc ff ff e9 5d fc ff ff <0f> ff 48 85 db 0f 84 49 fc ff ff e9 37 fc ff ff 41 f6 47 42 01 [194123.693300] ---[ end trace 8c5a2104d4238bb7 ]--- -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ref doesn't match the record start and is compressed
Hello! Since my system was temporarily overloaded (load 500+ for probably several hours but it recovered, no reboot needed), I'm seeing btrfs-related backtraces in dmesg with the fs going RO. I tried to fix it but it fails: > Fixed 0 roots. > checking extents > ref mismatch on [3043919785984 57344] extent item 1, found 3 > Data backref 3043919785984 root 265 owner 83535 offset 131072 num_refs 0 not > found in extent tree > Incorrect local backref count on 3043919785984 root 265 owner 83535 offset > 131072 found 1 wanted 0 back 0x61854760 > Backref disk bytenr does not match extent record, bytenr=3043919785984, ref > bytenr=3043919790080 > Backref bytes do not match extent backref, bytenr=3043919785984, ref > bytes=57344, backref bytes=40960 > Data backref 3043919785984 root 259 owner 11522804 offset 0 num_refs 0 not > found in extent tree > Incorrect local backref count on 3043919785984 root 259 owner 11522804 offset > 0 found 1 wanted 0 back 0x57892430 > Backref disk bytenr does not match extent record, bytenr=3043919785984, ref > bytenr=3043919831040 > Backref bytes do not match extent backref, bytenr=3043919785984, ref > bytes=57344, backref bytes=8192 > backpointer mismatch on [3043919785984 57344] > attempting to repair backref discrepency for bytenr 3043919785984 > Ref doesn't match the record start and is compressed, please take a > btrfs-image of this file system and send it to a btrfs developer so they can > complete this functionality for bytenr 3043919790080 > failed to repair damaged filesystem, aborting > enabling repair mode > Checking filesystem on /dev/bcacache48 8 UUID: > bc201ce5-8f2b-4263-995a-6641e89d4c88 When I identify the files with "btrfs inspect-internal" and try to delete them, btrfs fails and goes into RO mode: [ 735.877040] [ cut here ] [ 735.877048] WARNING: CPU: 2 PID: 809 at fs/btrfs/extent-tree.c:7053 __btrfs_free_extent.isra.71+0x740/0xbc0 [ 735.877049] Modules linked in: uas usb_storage r8168(O) nvidia_drm(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nvidia_uvm(PO) nvidia_modeset(PO) nvidia(PO) nct6775 hwmon_vid coretemp hwmon efivarfs [ 735.877065] CPU: 2 PID: 809 Comm: umount Tainted: P O 4.14.0-pf4 #1 [ 735.877066] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z68 Pro3, BIOS L2.16A 02/22/2013 [ 735.877067] task: 88040b6cf080 task.stack: c900035b [ 735.877070] RIP: 0010:__btrfs_free_extent.isra.71+0x740/0xbc0 [ 735.877071] RSP: 0018:c900035b3c00 EFLAGS: 00010246 [ 735.877073] RAX: fffe RBX: 02c4b7c2 RCX: c900035b3bb8 [ 735.877074] RDX: 02c4b7c2d000 RSI: 88021c22f607 RDI: c900035b3bb8 [ 735.877075] RBP: fffe R08: R09: 0001464f [ 735.877076] R10: 0002 R11: 076d R12: 880414618000 [ 735.877077] R13: 880406b3d850 R14: R15: 0109
Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)
Am Fri, 17 Nov 2017 06:51:52 +0300 schrieb Andrei Borzenkov <arvidj...@gmail.com>: > 16.11.2017 19:13, Kai Krakow пишет: > ... > > > BTW: From user API perspective, btrfs snapshots do not guarantee > > perfect granular consistent backups. > > Is it documented somewhere? I was relying on crash-consistent > write-order-preserving snapshots in NetApp for as long as I remember. > And I was sure btrfs offers is as it is something obvious for > redirect-on-write idea. I think it has ordering guarantees, but it is not as atomic in time as one might think. That's the point. But devs may tell better. > > A user-level file transaction may > > still end up only partially in the snapshot. If you are running > > transaction sensitive applications, those usually do provide some > > means of preparing a freeze and a thaw of transactions. > > > > Is snapshot creation synchronous to know when thaw? I think you could do "btrfs snap create", then "btrfs fs sync", and everything should be fine. > > I think the user transactions API which could've been used for this > > will even be removed during the next kernel cycles. I remember > > reiserfs4 tried to deploy something similar. But there's no > > consistent layer in the VFS for subscribing applications to > > filesystem snapshots so they could prepare and notify the kernel > > when they are ready. > > I do not see what VFS has to do with it. NetApp works by simply > preserving previous consistency point instead of throwing it away. > I.e. snapshot is always last committed image on stable storage. Would > something like this be possible on btrfs level by duplicating current > on-disk root (sorry if I use wrong term)? I think btrfs gives the same consistency. But the moment you issue "btrfs snap create" may delay snapshot creation a little bit. So if your application relies on exact point in time snapshots, you need to ensure synchronizing your application to the filesystem. I think the same is true for NetApp. I just wanted to point that out because it may not be obvious, given that btrfs snapshot creation is built right into the tool chain of filesystem itself, unlike e.g. NetApp or LVM, or other storage layers. Background: A good while back I was told that btrfs snapshots during ongoing IO may result in some of the later IO carried over to before the snapshot. Transactional ordering of IO operations is still guaranteed but it may overlap with snapshot creation. So you can still loose a transaction you didn't expect to loose at that point in time. So I understood this as: If you just want to ensure transactional integrity of your database, you are all fine with btrfs snapshots. But if you want to ensure that a just finished transaction makes it into the snapshot completely, you have to sync the processes. However, things may have changed since then. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Tiered storage?
Am Wed, 15 Nov 2017 08:11:04 +0100 schrieb waxhead: > As for dedupe there is (to my knowledge) nothing fully automatic yet. > You have to run a program to scan your filesystem but all the > deduplication is done in the kernel. > duperemove works apparently quite well when I tested it, but there > may be some performance implications. There's bees as near-line deduplication tool, that is it watches for generation changes in the filesystem and walks the inodes. It only looks at extents, not at files. Deduplication itself is then delegated to the kernel which ensures all changes are data-safe. The process is running as a daemon and processes your changes in realtime (delayed by a few seconds to minutes of course, due to transaction commit and hashing phase). You need to dedicate it part of your RAM to work, around 1 GB is usually sufficient to work well enough. The RAM will be locked and cannot be swapped out, so you should have a sufficiently equipped system. Works very well here (2TB of data, 1GB hash table, 16GB RAM). New dDuplicated files are picked up within seconds, scanned (hitting the cache most of the time thus not requiring physical IO), and then submitted to the kernel for deduplication. I'd call that fully automatic: Once set up, it just works, and works well. Performance impact is very low once the initial scan is done. https://github.com/Zygo/bees -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)
Link 2 slipped away, adding it below... Am Tue, 14 Nov 2017 15:51:57 -0500 schrieb Dave: > On Tue, Nov 14, 2017 at 3:50 AM, Roman Mamedov wrote: > > > > On Mon, 13 Nov 2017 22:39:44 -0500 > > Dave wrote: > > > > > I have my live system on one block device and a backup snapshot > > > of it on another block device. I am keeping them in sync with > > > hourly rsync transfers. > > > > > > Here's how this system works in a little more detail: > > > > > > 1. I establish the baseline by sending a full snapshot to the > > > backup block device using btrfs send-receive. > > > 2. Next, on the backup device I immediately create a rw copy of > > > that baseline snapshot. > > > 3. I delete the source snapshot to keep the live filesystem free > > > of all snapshots (so it can be optimally defragmented, etc.) > > > 4. hourly, I take a snapshot of the live system, rsync all > > > changes to the backup block device, and then delete the source > > > snapshot. This hourly process takes less than a minute currently. > > > (My test system has only moderate usage.) > > > 5. hourly, following the above step, I use snapper to take a > > > snapshot of the backup subvolume to create/preserve a history of > > > changes. For example, I can find the version of a file 30 hours > > > prior. > > > > Sounds a bit complex, I still don't get why you need all these > > snapshot creations and deletions, and even still using btrfs > > send-receive. > > > Hopefully, my comments below will explain my reasons. > > > > > Here is my scheme: > > > > /mnt/dst <- mounted backup storage volume > > /mnt/dst/backup <- a subvolume > > /mnt/dst/backup/host1/ <- rsync destination for host1, regular > > directory /mnt/dst/backup/host2/ <- rsync destination for host2, > > regular directory /mnt/dst/backup/host3/ <- rsync destination for > > host3, regular directory etc. > > > > /mnt/dst/backup/host1/bin/ > > /mnt/dst/backup/host1/etc/ > > /mnt/dst/backup/host1/home/ > > ... > > Self explanatory. All regular directories, not subvolumes. > > > > Snapshots: > > /mnt/dst/snaps/backup <- a regular directory > > /mnt/dst/snaps/backup/2017-11-14T12:00/ <- snapshot 1 > > of /mnt/dst/backup /mnt/dst/snaps/backup/2017-11-14T13:00/ <- > > snapshot 2 > > of /mnt/dst/backup /mnt/dst/snaps/backup/2017-11-14T14:00/ <- > > snapshot 3 of /mnt/dst/backup > > > > Accessing historic data: > > /mnt/dst/snaps/backup/2017-11-14T12:00/host1/bin/bash > > ... > > /bin/bash for host1 as of 2017-11-14 12:00 (time on the backup > > system). > > > > > > No need for btrfs send-receive, only plain rsync is used, directly > > from hostX:/ to /mnt/dst/backup/host1/; > > > I prefer to start with a BTRFS snapshot at the backup destination. I > think that's the most "accurate" starting point. No, you should finish with a snapshot. Use the rsync destination as a "dirty" scratch area, let rsync also delete files which are no longer in the source. After successfully running rsync, make a snapshot of that directory and make it RO, leave the scratch in place (even when rsync dies or becomes killed). I once made some scripts[2] following those rules, you may want to adapt them. > > No need to create or delete snapshots during the actual backup > > process; > > Then you can't guarantee consistency of the backed up information. Take a temporary snapshot of the source, rsync to to the scratch destination, take a RO snapshot of that destination, remove the temporary snapshot. BTW: From user API perspective, btrfs snapshots do not guarantee perfect granular consistent backups. A user-level file transaction may still end up only partially in the snapshot. If you are running transaction sensitive applications, those usually do provide some means of preparing a freeze and a thaw of transactions. I think the user transactions API which could've been used for this will even be removed during the next kernel cycles. I remember reiserfs4 tried to deploy something similar. But there's no consistent layer in the VFS for subscribing applications to filesystem snapshots so they could prepare and notify the kernel when they are ready. > > A single common timeline is kept for all hosts to be backed up, > > snapshot count not multiplied by the number of hosts (in my case > > the backup location is multi-purpose, so I somewhat care about > > total number of snapshots there as well); > > > > Also, all of this works even with source hosts which do not use > > Btrfs. > > That's not a concern for me because I prefer to use BTRFS everywhere. At least I suggest looking into bees[1] to deduplicate the backup destination. Rsync is not very efficient to work with btrfs snapshots. It will break reflinks often and write inefficiently sized blocks, even with inplace option. Also,
Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)
Am Tue, 14 Nov 2017 15:51:57 -0500 schrieb Dave: > On Tue, Nov 14, 2017 at 3:50 AM, Roman Mamedov wrote: > > > > On Mon, 13 Nov 2017 22:39:44 -0500 > > Dave wrote: > > > > > I have my live system on one block device and a backup snapshot > > > of it on another block device. I am keeping them in sync with > > > hourly rsync transfers. > > > > > > Here's how this system works in a little more detail: > > > > > > 1. I establish the baseline by sending a full snapshot to the > > > backup block device using btrfs send-receive. > > > 2. Next, on the backup device I immediately create a rw copy of > > > that baseline snapshot. > > > 3. I delete the source snapshot to keep the live filesystem free > > > of all snapshots (so it can be optimally defragmented, etc.) > > > 4. hourly, I take a snapshot of the live system, rsync all > > > changes to the backup block device, and then delete the source > > > snapshot. This hourly process takes less than a minute currently. > > > (My test system has only moderate usage.) > > > 5. hourly, following the above step, I use snapper to take a > > > snapshot of the backup subvolume to create/preserve a history of > > > changes. For example, I can find the version of a file 30 hours > > > prior. > > > > Sounds a bit complex, I still don't get why you need all these > > snapshot creations and deletions, and even still using btrfs > > send-receive. > > > Hopefully, my comments below will explain my reasons. > > > > > Here is my scheme: > > > > /mnt/dst <- mounted backup storage volume > > /mnt/dst/backup <- a subvolume > > /mnt/dst/backup/host1/ <- rsync destination for host1, regular > > directory /mnt/dst/backup/host2/ <- rsync destination for host2, > > regular directory /mnt/dst/backup/host3/ <- rsync destination for > > host3, regular directory etc. > > > > /mnt/dst/backup/host1/bin/ > > /mnt/dst/backup/host1/etc/ > > /mnt/dst/backup/host1/home/ > > ... > > Self explanatory. All regular directories, not subvolumes. > > > > Snapshots: > > /mnt/dst/snaps/backup <- a regular directory > > /mnt/dst/snaps/backup/2017-11-14T12:00/ <- snapshot 1 > > of /mnt/dst/backup /mnt/dst/snaps/backup/2017-11-14T13:00/ <- > > snapshot 2 > > of /mnt/dst/backup /mnt/dst/snaps/backup/2017-11-14T14:00/ <- > > snapshot 3 of /mnt/dst/backup > > > > Accessing historic data: > > /mnt/dst/snaps/backup/2017-11-14T12:00/host1/bin/bash > > ... > > /bin/bash for host1 as of 2017-11-14 12:00 (time on the backup > > system). > > > > > > No need for btrfs send-receive, only plain rsync is used, directly > > from hostX:/ to /mnt/dst/backup/host1/; > > > I prefer to start with a BTRFS snapshot at the backup destination. I > think that's the most "accurate" starting point. No, you should finish with a snapshot. Use the rsync destination as a "dirty" scratch area, let rsync also delete files which are no longer in the source. After successfully running rsync, make a snapshot of that directory and make it RO, leave the scratch in place (even when rsync dies or becomes killed). I once made some scripts[2] following those rules, you may want to adapt them. > > No need to create or delete snapshots during the actual backup > > process; > > Then you can't guarantee consistency of the backed up information. Take a temporary snapshot of the source, rsync to to the scratch destination, take a RO snapshot of that destination, remove the temporary snapshot. BTW: From user API perspective, btrfs snapshots do not guarantee perfect granular consistent backups. A user-level file transaction may still end up only partially in the snapshot. If you are running transaction sensitive applications, those usually do provide some means of preparing a freeze and a thaw of transactions. I think the user transactions API which could've been used for this will even be removed during the next kernel cycles. I remember reiserfs4 tried to deploy something similar. But there's no consistent layer in the VFS for subscribing applications to filesystem snapshots so they could prepare and notify the kernel when they are ready. > > A single common timeline is kept for all hosts to be backed up, > > snapshot count not multiplied by the number of hosts (in my case > > the backup location is multi-purpose, so I somewhat care about > > total number of snapshots there as well); > > > > Also, all of this works even with source hosts which do not use > > Btrfs. > > That's not a concern for me because I prefer to use BTRFS everywhere. At least I suggest looking into bees[1] to deduplicate the backup destination. Rsync is not very efficient to work with btrfs snapshots. It will break reflinks often and write inefficiently sized blocks, even with inplace option. Also, rsync won't efficiently catch files
Re: A partially failing disk in raid0 needs replacement
Am Tue, 14 Nov 2017 17:48:56 +0500 schrieb Roman Mamedov: > [1] Note that "ddrescue" and "dd_rescue" are two different programs > for the same purpose, one may work better than the other. I don't > remember which. :) One is a perl implementation and is the one working worse. ;-) -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs seed question
Am Thu, 12 Oct 2017 09:20:28 -0400 schrieb Joseph Dunn: > On Thu, 12 Oct 2017 12:18:01 +0800 > Anand Jain wrote: > > > On 10/12/2017 08:47 AM, Joseph Dunn wrote: > > > After seeing how btrfs seeds work I wondered if it was possible > > > to push specific files from the seed to the rw device. I know > > > that removing the seed device will flush all the contents over to > > > the rw device, but what about flushing individual files on demand? > > > > > > I found that opening a file, reading the contents, seeking back > > > to 0, and writing out the contents does what I want, but I was > > > hoping for a bit less of a hack. > > > > > > Is there maybe an ioctl or something else that might trigger a > > > similar action? > > > >You mean to say - seed-device delete to trigger copy of only the > > specified or the modified files only, instead of whole of > > seed-device ? What's the use case around this ? > > > > Not quite. While the seed device is still connected I would like to > force some files over to the rw device. The use case is basically a > much slower link to a seed device holding significantly more data than > we currently need. An example would be a slower iscsi link to the > seed device and a local rw ssd. I would like fast access to a > certain subset of files, likely larger than the memory cache will > accommodate. If at a later time I want to discard the image as a > whole I could unmount the file system or if I want a full local copy > I could delete the seed-device to sync the fs. In the mean time I > would have access to all the files, with some slower (iscsi) and some > faster (ssd) and the ability to pick which ones are in the faster > group at the cost of one content transfer. > > I'm not necessarily looking for a new feature addition, just if there > is some existing call that I can make to push specific files from the > slow mirror to the fast one. If I had to push a significant amount of > metadata that would be fine, but the file contents feeding some > computations might be large and useful only to certain clients. > > So far I found that I can re-write the file with the same contents and > thanks to the lack of online dedupe these writes land on the rw mirror > so later reads to that file should not hit the slower mirror. By the > way, if I'm misunderstanding how the read process would work after the > file push please correct me. > > I hope this makes sense but I'll try to clarify further if you have > more questions. You could try to wrap something like bcache ontop of the iscsi device, then make it a read-mostly cache (like bcache write-around mode). This probably involves rewriting the iscsi contents to add a bcache header. You could try mdcache instead. Then you sacrifice a few gigabytes of local SSD storage of the caching layer. I guess that you're sharing the seed device with different machines. As bcache will add a protective superblock, you may need to thin-clone the seed image on the source to have independent superblocks per each bcache instance. Not sure how this applies to mdcache as I never used it. But the caching approach is probably the easiest way to go for you. And it's mostly automatic once deployed: you don't have to manually choose which files to move to the sprout... -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problem with file system
Am Tue, 31 Oct 2017 07:28:58 -0400 schrieb "Austin S. Hemmelgarn": > On 2017-10-31 01:57, Marat Khalili wrote: > > On 31/10/17 00:37, Chris Murphy wrote: > >> But off hand it sounds like hardware was sabotaging the expected > >> write ordering. How to test a given hardware setup for that, I > >> think, is really overdue. It affects literally every file system, > >> and Linux storage technology. > >> > >> It kinda sounds like to me something other than supers is being > >> overwritten too soon, and that's why it's possible for none of the > >> backup roots to find a valid root tree, because all four possible > >> root trees either haven't actually been written yet (still) or > >> they've been overwritten, even though the super is updated. But > >> again, it's speculation, we don't actually know why your system > >> was no longer mountable. > > Just a detached view: I know hardware should respect > > ordering/barriers and such, but how hard is it really to avoid > > overwriting at least one complete metadata tree for half an hour > > (even better, yet another one for a day)? Just metadata, not data > > extents. > If you're running on an SSD (or thinly provisioned storage, or > something else which supports discards) and have the 'discard' mount > option enabled, then there is no backup metadata tree (this issue was > mentioned on the list a while ago, but nobody ever replied), because > it's already been discarded. This is ideally something which should > be addressed (we need some sort of discard queue for handling in-line > discards), but it's not easy to address. > > Otherwise, it becomes a question of space usage on the filesystem, > and this is just another reason to keep some extra slack space on the > FS (though that doesn't help _much_, it does help). This, in theory, > could be addressed, but it probably can't be applied across mounts of > a filesystem without an on-disk format change. Well, maybe inline discard is working at the wrong level. It should kick in when the reference through any of the backup roots is dropped, not when the current instance is dropped. Without knowledge of the internals, I guess discards could be added to a queue within a new tree in btrfs, and only added to that queue when dropped from the last backup root referencing it. But this will probably add some bad performance spikes. I wonder how a regular fstrim run through cron applies to this problem? -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: defragmenting best practice?
Am Thu, 2 Nov 2017 22:47:31 -0400 schrieb Dave <davestechs...@gmail.com>: > On Thu, Nov 2, 2017 at 5:16 PM, Kai Krakow <hurikha...@gmail.com> > wrote: > > > > > You may want to try btrfs autodefrag mount option and see if it > > improves things (tho, the effect may take days or weeks to apply if > > you didn't enable it right from the creation of the filesystem). > > > > Also, autodefrag will probably unshare reflinks on your snapshots. > > You may be able to use bees[1] to work against this effect. Its > > interaction with autodefrag is not well tested but it works fine > > for me. Also, bees is able to reduce some of the fragmentation > > during deduplication because it will rewrite extents back into > > bigger chunks (but only for duplicated data). > > > > [1]: https://github.com/Zygo/bees > > I will look into bees. And yes, I plan to try autodefrag. (I already > have it enabled now.) However, I need to understand something about > how btrfs send-receive works in regard to reflinks and fragmentation. > > Say I have 2 snapshots on my live volume. The earlier one of them has > already been sent to another block device by btrfs send-receive (full > backup). Now defrag runs on the live volume and breaks some percentage > of the reflinks. At this point I do an incremental btrfs send-receive > using "-p" (or "-c") with the diff going to the same other block > device where the prior snapshot was already sent. > > Will reflinks be "made whole" (restored) on the receiving block > device? Or is the state of the source volume replicated so closely > that reflink status is the same on the target? > > Also, is fragmentation reduced on the receiving block device? > > My expectation is that fragmentation would be reduced and duplication > would be reduced too. In other words, does send-receive result in > defragmentation and deduplication too? As far as I understand, btrfs send/receive doesn't create an exact mirror. It just replays the block operations between generation numbers. That is: If it finds new blocks referenced between generations, it will write a _new_ block to the destination. So, no, it won't reduce fragmentation or duplication. It just keeps reflinks intact as long as such extents weren't touched within the generation range. Otherwise they are rewritten as new extents. Autodefrag and deduplication processes will as such probably increase duplication at the destination. A developer may have a better clue, tho. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: defragmenting best practice?
Am Fri, 3 Nov 2017 08:58:22 +0300 schrieb Marat Khalili: > On 02/11/17 04:39, Dave wrote: > > I'm going to make this change now. What would be a good way to > > implement this so that the change applies to the $HOME/.cache of > > each user? > I'd make each user's .cache a symlink (should work but if it won't > then bind mount) to a per-user directory in some separately mounted > volume with necessary options. On a systemd system, each user already has a private tmpfs location at /run/user/$(id -u). You could add to the central login script: # CACHE_DIR="/run/user/$(id -u)/cache" # mkdir -p $CACHE_DIR && ln -snf $CACHE_DIR $HOME/.cache You should not run this as root (because of mkdir -p). You could wrap it into an if statement: # if [ "$(whoami)" -ne "root" ]; then # ... # fi -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: defragmenting best practice?
Am Thu, 2 Nov 2017 22:59:36 -0400 schrieb Dave: > On Thu, Nov 2, 2017 at 7:07 AM, Austin S. Hemmelgarn > wrote: > > On 2017-11-01 21:39, Dave wrote: > >> I'm going to make this change now. What would be a good way to > >> implement this so that the change applies to the $HOME/.cache of > >> each user? > >> > >> The simple way would be to create a new subvolume for each existing > >> user and mount it at $HOME/.cache in /etc/fstab, hard coding that > >> mount location for each user. I don't mind doing that as there are > >> only 4 users to consider. One minor concern is that it adds an > >> unexpected step to the process of creating a new user. Is there a > >> better way? > >> > > The easiest option is to just make sure nobody is logged in and run > > the following shell script fragment: > > > > for dir in /home/* ; do > > rm -rf $dir/.cache > > btrfs subvolume create $dir/.cache > > done > > > > And then add something to the user creation scripts to create that > > subvolume. This approach won't pollute /etc/fstab, will still > > exclude the directory from snapshots, and doesn't require any > > hugely creative work to integrate with user creation and deletion. > > > > In general, the contents of the .cache directory are just that, > > cached data. Provided nobody is actively accessing it, it's > > perfectly safe to just nuke the entire directory... > > I like this suggestion. Thank you. I had intended to mount the .cache > subvolumes with the NODATACOW option. However, with this approach, I > won't be explicitly mounting the .cache subvolumes. Is it possible to > use "chattr +C $dir/.cache" in that loop even though it is a > subvolume? And, is setting the .cache directory to NODATACOW the right > choice given this scenario? From earlier comments, I believe it is, > but I want to be sure I understood correctly. It is important to apply "chattr +C" to the _empty_ directory, because even if used recursively, it won't apply to already existing, non-empty files. But the +C attribute is inherited by newly created files and directory: So simply follow the "chattr +C on empty directory" and you're all set. BTW: You cannot mount subvolumes from an already mounted btrfs device with different mount options. That is currently not implemented (except for maybe a very few options). So the fstab approach probably wouldn't have helped you (depending on your partition layout). You can simply just create subvolumes within the location needed and they are implicitly mounted. Then change the particular subvolume cow behavior with chattr. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)
Am Thu, 2 Nov 2017 23:24:29 -0400 schrieb Dave <davestechs...@gmail.com>: > On Thu, Nov 2, 2017 at 4:46 PM, Kai Krakow <hurikha...@gmail.com> > wrote: > > Am Wed, 1 Nov 2017 02:51:58 -0400 > > schrieb Dave <davestechs...@gmail.com>: > > > [...] > [...] > [...] > >> > >> Thanks for confirming. I must have missed those reports. I had > >> never considered this idea until now -- but I like it. > >> > >> Are there any blogs or wikis where people have done something > >> similar to what we are discussing here? > > > > I used rsync before, backup source and destination both were btrfs. > > I was experiencing the same btrfs bug from time to time on both > > devices, luckily not at the same time. > > > > I instead switched to using borgbackup, and xfs as the destination > > (to not fall the same-bug-in-two-devices pitfall). > > I'm going to stick with btrfs everywhere. My reasoning is that my > biggest pitfalls will be related to lack of knowledge. So focusing on > learning one filesystem better (vs poorly learning two) is the better > strategy for me, given my limited time. (I'm not an IT professional of > any sort.) > > Is there any problem with the Borgbackup repository being on btrfs? No. I just wanted to point out that keeping backup and source on different media (which includes different technology, too) is common best practice and adheres to the 3-2-1 backup strategy. > > Borgbackup achieves a > > much higher deduplication density and compression, and as such also > > is able to store much more backup history in the same storage > > space. The first run is much slower than rsync (due to enabled > > compression) but successive runs are much faster (like 20 minutes > > per backup run instead of 4-5 hours). > > > > I'm currently storing 107 TB of backup history in just 2.2 TB backup > > space, which counts a little more than one year of history now, > > containing 56 snapshots. This is my retention policy: > > > > * 5 yearly snapshots > > * 12 monthly snapshots > > * 14 weekly snapshots (worth around 3 months) > > * 30 daily snapshots > > > > Restore is fast enough, and a snapshot can even be fuse-mounted > > (tho, in that case mounted access can be very slow navigating > > directories). > > > > With latest borgbackup version, the backup time increased to around > > 1 hour from 15-20 minutes in the previous version. That is due to > > switching the file cache strategy from mtime to ctime. This can be > > tuned to get back to old performance, but it may miss some files > > during backup if you're doing awkward things to file timestamps. > > > > I'm also backing up some servers with it now, then use rsync to sync > > the borg repository to an offsite location. > > > > Combined with same-fs local btrfs snapshots with short retention > > times, this could be a viable solution for you. > > Yes, I appreciate the idea. I'm going to evaluate both rsync and > Borgbackup. > > The advantage of rsync, I think, is that it will likely run in just a > couple minutes. That will allow me to run it hourly and to keep my > live volume almost entire free of snapshots and fully defragmented. > It's also very simple as I already have rsync. And since I'm going to > run btrfs on the backup volume, I can perform hourly snapshots there > and use Snapper to manage retention. It's all very simple and relies > on tools I already have and know. > > However, the advantages of Borgbackup you mentioned (much higher > deduplication density and compression) make it worth considering. > Maybe Borgbackup won't take long to complete successive (incremental) > backups on my system. Once a full backup was taken, incremental backups are extremely fast. At least for me, it works much faster than rsync. And as with btrfs snapshots, each incremental backup is also a full backup. It's not like traditional backup software that needs the backup parent and grand parent to make use of the differential and/or incremental backups. There's one caveat, tho: Only one process can access a repository at a time, that is you need to serialize different backup jobs if you want them to go into the same repository. Deduplication is done only within the same repository. Tho, you might be able to leverage btrfs deduplication (e.g. using bees) across multiple repositories if you're not using encrypted repositories. But since you're currently using send/receive and/or rsync, encrypted storage of the backup doesn't seem to be an important point to you. Burp with its client/server approach may have an advantage here, so its setup see
Re: defragmenting best practice?
Am Tue, 31 Oct 2017 20:37:27 -0400 schrieb Dave: > > Also, you can declare the '.firefox/default/' directory to be > > NOCOW, and that "just works". > > The cache is in a separate location from the profiles, as I'm sure you > know. The reason I suggested a separate BTRFS subvolume for > $HOME/.cache is that this will prevent the cache files for all > applications (for that user) from being included in the snapshots. We > take frequent snapshots and (afaik) it makes no sense to include cache > in backups or snapshots. The easiest way I know to exclude cache from > BTRFS snapshots is to put it on a separate subvolume. I assumed this > would make several things related to snapshots more efficient too. > > As far as the Firefox profile being declared NOCOW, as soon as we take > the first snapshot, I understand that it will become COW again. So I > don't see any point in making it NOCOW. Ah well, not really. The files and directories will still be nocow - however, the next write to any such file after a snapshot will make a cow operation. So you still see the fragmentation effect but to a much lesser extent. But the files itself will remain in nocow format. You may want to try btrfs autodefrag mount option and see if it improves things (tho, the effect may take days or weeks to apply if you didn't enable it right from the creation of the filesystem). Also, autodefrag will probably unshare reflinks on your snapshots. You may be able to use bees[1] to work against this effect. Its interaction with autodefrag is not well tested but it works fine for me. Also, bees is able to reduce some of the fragmentation during deduplication because it will rewrite extents back into bigger chunks (but only for duplicated data). [1]: https://github.com/Zygo/bees -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)
Am Wed, 1 Nov 2017 02:51:58 -0400 schrieb Dave: > > > >> To reconcile those conflicting goals, the only idea I have come up > >> with so far is to use btrfs send-receive to perform incremental > >> backups > > > > As already said by Romain Mamedov, rsync is viable alternative to > > send-receive with much less hassle. According to some reports it > > can even be faster. > > Thanks for confirming. I must have missed those reports. I had never > considered this idea until now -- but I like it. > > Are there any blogs or wikis where people have done something similar > to what we are discussing here? I used rsync before, backup source and destination both were btrfs. I was experiencing the same btrfs bug from time to time on both devices, luckily not at the same time. I instead switched to using borgbackup, and xfs as the destination (to not fall the same-bug-in-two-devices pitfall). Borgbackup achieves a much higher deduplication density and compression, and as such also is able to store much more backup history in the same storage space. The first run is much slower than rsync (due to enabled compression) but successive runs are much faster (like 20 minutes per backup run instead of 4-5 hours). I'm currently storing 107 TB of backup history in just 2.2 TB backup space, which counts a little more than one year of history now, containing 56 snapshots. This is my retention policy: * 5 yearly snapshots * 12 monthly snapshots * 14 weekly snapshots (worth around 3 months) * 30 daily snapshots Restore is fast enough, and a snapshot can even be fuse-mounted (tho, in that case mounted access can be very slow navigating directories). With latest borgbackup version, the backup time increased to around 1 hour from 15-20 minutes in the previous version. That is due to switching the file cache strategy from mtime to ctime. This can be tuned to get back to old performance, but it may miss some files during backup if you're doing awkward things to file timestamps. I'm also backing up some servers with it now, then use rsync to sync the borg repository to an offsite location. Combined with same-fs local btrfs snapshots with short retention times, this could be a viable solution for you. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Why do full balance and deduplication reduce available free space?
Am Mon, 02 Oct 2017 22:19:32 +0200 schrieb Niccolò Belli <darkba...@linuxsystems.it>: > Il 2017-10-02 21:35 Kai Krakow ha scritto: > > Besides defragging removing the reflinks, duperemove will unshare > > your snapshots when used in this way: If it sees duplicate blocks > > within the subvolumes you give it, it will potentially unshare > > blocks from the snapshots while rewriting extents. > > > > BTW, you should be able to use duperemove with read-only snapshots > > if used in read-only-open mode. But I'd rather suggest to use bees > > instead: It works at whole-volume level, walking extents instead of > > files. That way it is much faster, doesn't reprocess already > > deduplicated extents, and it works with read-only snapshots. > > > > Until my patch it didn't like mixed nodatasum/datasum workloads. > > Currently this is fixed by just leaving nocow data alone as users > > probably set nocow for exactly the reason to not fragment extents > > and relocate blocks. > > Bad Btrfs Feature Interactions: btrfs read-only snapshots (never > tested, probably wouldn't work well) > > Unfortunately it seems that bees doesn't support read-only snapshots, > so it's a no way. > > P.S. > I tried duperemove with -A, but besides taking much longer it didn't > improve the situation. > Are you sure that the culprit is duperemove? AFAIK it shouldn't > unshare extents... Unsharing of extents depends... If an extent is shared between a r/o and r/w snapshot, rewriting the extent for deduplication ends up in a shared extent again but it is no longer reflinked with the original r/o snapshot. At least if btrfs doesn't allow to change extents part of a r/o snapshot... Which you all tell is the case... And then, there's unsharing of metadata by the deduplication process itself. Both effects should be minimal, tho. But since chunks are allocated in 1GB sizes, it may jump 1GB worth of allocation just for a few extra MB needed. A metadata rebalance may fix this. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Why do full balance and deduplication reduce available free space?
Am Mon, 02 Oct 2017 12:02:16 +0200 schrieb Niccolò Belli: > This is how I performe balance: btrfs balance start --full-balance > rootfs This is how I perform deduplication (duperemove is from git > master): duperemove -drh --dedupe-options=noblock > --hashfile=../rootfs.hash Besides defragging removing the reflinks, duperemove will unshare your snapshots when used in this way: If it sees duplicate blocks within the subvolumes you give it, it will potentially unshare blocks from the snapshots while rewriting extents. BTW, you should be able to use duperemove with read-only snapshots if used in read-only-open mode. But I'd rather suggest to use bees instead: It works at whole-volume level, walking extents instead of files. That way it is much faster, doesn't reprocess already deduplicated extents, and it works with read-only snapshots. Until my patch it didn't like mixed nodatasum/datasum workloads. Currently this is fixed by just leaving nocow data alone as users probably set nocow for exactly the reason to not fragment extents and relocate blocks. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4.13.3 still has the out of space kernel oops
Am Wed, 27 Sep 2017 00:45:04 +0200 schrieb Ian Kumlien: > I just had my laptop hit the out of space kernel oops which it kinda > hard to recover from > > Everything states "out of disk" even with 20 gigs free (both according > to df and btrfs fi df) You should run balance from time to time. I can suggest the auto balance script from here: https://www.spinics.net/lists/linux-btrfs/msg52076.html It can be run unattended on a regular basis. > So I'm suspecting that i need to run btrfs check on it to recover the > lost space (i have mounted it with clear_cache and nothing) I don't think that "btrfs check" would recover free space, that's not a file system corruption, it's an allocation issue due to unbalanced chunks. > The problem is, finally getting a shell with rd.shell rd.break=cmdline > - systemd is still a pain and since it's "udev" it's not allowing me > to do cryptsetup luksopen due to "dependencies" Does "emergency" as a cmdline work? It should boot to emergency mode of systemd. Also, "recovery" as a cmdline could work, boots to a different mode. Both work for me using dracut on Gentoo with systemd. > Basically, btrfs check should be able to run on a ro mounted > fileystem, this is too hard to get working without having to bring a > live usb stick atm I think this is possible in the latest version but only running in non-repair mode. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
dmesg: csum "varying number" expected csum 0x0 mirror 1 (with trace/oops)
Hello! I came across noting some kernel messages which seem to be related to btrfs, should I worry? I'm currently running scrub on the device now. inode-resolve points to an unimportant, easily recoverable file: $ sudo btrfs inspect-internal inode-resolve -v 1229624 /mnt/btrfs-pool/gentoo/usr/portage/ ioctl ret=0, bytes_left=4033, bytes_missing=0, cnt=1, missed=0 /mnt/btrfs-pool/gentoo/usr/portage//packages/dev-lang/tcl/tcl-8.6.6-2.xpak The file wasn't modified in month and it has been in order previously because the backup didn't choke on it: $ ls -alt -rw-r--r-- 1 root root 11241835 19. Apr 22:11 /mnt/btrfs-pool/gentoo/usr/portage//packages/dev-lang/tcl/tcl-8.6.6-2.xpak I can read the file without problems, no new messages in dmesg: $ cat /mnt/btrfs-pool/gentoo/usr/portage//packages/dev-lang/tcl/tcl-8.6.6-2.xpak >/dev/null What's strange is the long time gaps between the btrfs reports and the kernel backtraces... Also, expected csum=0 looks strange. That mount is subvol_id=0. Not sure if inode-resolve works that way. I retried for the important subvolumes and it resolved for none. Just in case, I have a backlog of multiple daily backups. $ uname -a Linux jupiter 4.12.14-ck #1 SMP PREEMPT Fri Sep 22 02:47:44 CEST 2017 x86_64 Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz GenuineIntel GNU/Linux [88597.792462] BTRFS info (device bcache48): no csum found for inode 1229624 start 1650688 [88597.793304] BTRFS warning (device bcache48): csum failed root 280 ino 1229624 off 1650688 csum 0x953c1e92 expected csum 0x mirror 1 [128569.451376] [ cut here ] [128569.451382] WARNING: CPU: 7 PID: 146 at kernel/workqueue.c:2041 process_one_work+0x44/0x310 [128569.451383] Modules linked in: cifs ccm fuse bridge stp llc veth rfcomm bnep cachefiles btusb af_packet btintel iTCO_wdt bluetooth tun iTCO_vendor_support rfkill ecdh_generic kvm_intel snd_hda_codec_hdmi snd_hda_codec_realtek kvm snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core rtc_cmos snd_pcm snd_timer snd soundcore lpc_ich irqbypass uas usb_storage r8168(O) nvidia_drm(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nvidia_modeset(PO) nvidia(PO) nct6775 hwmon_vid coretemp hwmon efivarfs [128569.451406] CPU: 7 PID: 146 Comm: bcache Tainted: P O 4.12.14-ck #1 [128569.451407] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z68 Pro3, BIOS L2.16A 02/22/2013 [128569.451410] task: 880419bf8bc0 task.stack: c95c4000 [128569.451412] RIP: 0010:process_one_work+0x44/0x310 [128569.451412] RSP: :c95c7e78 EFLAGS: 00013002 [128569.451413] RAX: 0007 RBX: 880419bdcf00 RCX: 88042f2d8020 [128569.451414] RDX: 88042f2d8018 RSI: 880120454f08 RDI: 880419bdcf00 [128569.451415] RBP: 88042f2d8000 R08: R09: [128569.451415] R10: 8a209a609e79 R11: R12: [128569.451416] R13: 88042f2e1b00 R14: 88042f2e1b80 R15: 880419bdcf30 [128569.451417] FS: () GS:88042f3c() knlGS: [128569.451418] CS: 0010 DS: ES: CR0: 80050033 [128569.451418] CR2: 00b0 CR3: 0003f23ef000 CR4: 001406e0 [128569.451419] Call Trace: [128569.451422] ? rescuer_thread+0x20b/0x370 [128569.451424] ? kthread+0xf2/0x130 [128569.451425] ? process_one_work+0x310/0x310 [128569.451426] ? kthread_create_on_node+0x40/0x40 [128569.451428] ? ret_from_fork+0x22/0x30 [128569.451429] Code: 04 b8 00 00 00 00 4c 0f 44 e8 49 8b 45 08 44 8b a0 00 01 00 00 41 83 e4 20 f6 45 10 04 75 0e 65 8b 05 79 38 f7 7e 3b 45 04 74 02 <0f> ff 48 ba eb 83 b5 80 46 86 c8 61 48 0f af d6 48 c1 ea 3a 48 [128569.451449] ---[ end trace d00a1585e5166d18 ]--- [148997.934146] BUG: unable to handle kernel paging request at c96af000 [148997.934154] IP: memcpy_erms+0x6/0x10 [148997.934155] PGD 41d021067 [148997.934156] P4D 41d021067 [148997.934157] PUD 41d022067 [148997.934158] PMD 41961c067 [148997.934158] PTE 0 [148997.934162] Oops: 0002 [#1] PREEMPT SMP [148997.934163] Modules linked in: cifs ccm fuse bridge stp llc veth rfcomm bnep cachefiles btusb af_packet btintel iTCO_wdt bluetooth tun iTCO_vendor_support rfkill ecdh_generic kvm_intel snd_hda_codec_hdmi snd_hda_codec_realtek kvm snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core rtc_cmos snd_pcm snd_timer snd soundcore lpc_ich irqbypass uas usb_storage r8168(O) nvidia_drm(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nvidia_modeset(PO) nvidia(PO) nct6775 hwmon_vid coretemp hwmon efivarfs [148997.934188] CPU: 6 PID: 966 Comm: kworker/u16:16 Tainted: PW O 4.12.14-ck #1 [148997.934190] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z68 Pro3, BIOS L2.16A 02/22/2013 [148997.934193] Workqueue: btrfs-endio btrfs_endio_helper [148997.934194] task: 88001496a340 task.stack: c90011b68000 [148997.934196] RIP:
Re: Give up on bcache?
Am Tue, 26 Sep 2017 23:33:19 +0500 schrieb Roman Mamedov: > On Tue, 26 Sep 2017 16:50:00 + (UTC) > Ferry Toth wrote: > > > https://www.phoronix.com/scan.php?page=article=linux414-bcache- > > raid=2 > > > > I think it might be idle hopes to think bcache can be used as a ssd > > cache for btrfs to significantly improve performance.. > > My personal real-world experience shows that SSD caching -- with > lvmcache -- does indeed significantly improve performance of a large > Btrfs filesystem with slowish base storage. > > And that article, sadly, only demonstrates once again the general > mediocre quality of Phoronix content: it is an astonishing oversight > to not check out lvmcache in the same setup, to at least try to draw > some useful conclusion, is it Bcache that is strangely deficient, or > SSD caching as a general concept does not work well in the hardware > setup utilized. Bcache is actually not meant to increase benchmark performance except for very few corner cases. It is designed to improve interactivity and perceived performance, reducing head movements. On the bcache homepage there's actually tips on how to benchmark bcache correctly, including warm-up phase and turning on sequential caching. Phoronix doesn't do that, they test default settings, which is imho a good thing but you should know the consequences and research how to turn the knobs. Depending on the caching mode and cache size, the SQlite test may not show real-world numbers. Also, you should optimize some btrfs options to work correctly with bcache, e.g. force it to mount "nossd" as it detects the bcache device as SSD - which is wrong for some workloads, I think especially desktop workloads and most server workloads. Also, you may want to tune udev to correct some attributes so other applications can do their detection and behavior correctly, too: $ cat /etc/udev/rules.d/00-ssd-scheduler.rules ACTION=="add|change", KERNEL=="bcache*", ATTR{queue/rotational}="1" ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/iosched/slice_idle}="0" ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="kyber" ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq" Take note: on a non-mq system you may want to use noop/deadline/cfq instead of kyber/bfq. I'm running bcache since over two years now and the performance improvement is very very high with boot times going down to 30-40s from 3+ minutes previously, faster app startup times (almost instantly like on SSD), reduced noise by reduced head movements, etc. Also, it has easy setup (no split metadata/data cache, you can attach more than one device to a single cache), and it is rocksolid even when crashing the system. Bcache learns by using LRU for caching: What you don't need will be pushed out of cache over time, what you use, stays. This is actually a lot like "hot data caching". Given a big enough cache, everything of your daily needs would stay in cache, easily achieving hit ratios around 90%. Since sequential access is bypassed, you don't have to worry to flush the cache with large copy operations. My system uses a 512G SSD with 400G dedicated to bcache, attached to 3x 1TB HDD draid0 mraid1 btrfs, filled with 2TB of net data and daily backups using borgbackup. Bcache runs in writeback mode, the backup takes around 15 minutes each night to dig through all data and stores it to an internal intermediate backup also on bcache (xfs, write-around mode). Currently not implemented, this intermediate backup will later be mirrored to external, off-site location. Some of the rest of the SSD is EFI-ESP, some swap space, and over-provisioned area to keep bcache performance high. $ uptime && bcache-status 21:28:44 up 3 days, 20:38, 3 users, load average: 1,18, 1,44, 2,14 --- bcache --- UUIDaacfbcd9-dae5-4377-92d1-6808831a4885 Block Size 4.00 KiB Bucket Size 512.00 KiB Congested? False Read Congestion 2.0ms Write Congestion20.0ms Total Cache Size400 GiB Total Cache Used400 GiB (100%) Total Cache Unused 0 B (0%) Evictable Cache 396 GiB (99%) Replacement Policy [lru] fifo random Cache Mode (Various) Total Hits 2364518 (89%) Total Misses290764 Total Bypass Hits 4284468 (100%) Total Bypass Misses 0 Total Bypassed 215 GiB The bucket size and block size was chosen to best fit with Samsung TLC arrangement. But this is pure theory, I never benchmarked the benefits. I just feel more comfortable that way. ;-) One should also keep in mind: The way how btrfs works cannot optimally use bcache, as cow will obviously invalidate data in bcache - but bcache doesn't have knowledge of this. Of course, such
Re: Btrfs performance with small blocksize on SSD
Am Mon, 25 Sep 2017 07:04:14 + schrieb "Fuhrmann, Carsten": > Well the correct translation for "Laufzeit" is runtime and not > latency. But thank you for that hint, I'll change it to "gesamt > Laufzeit" to make it more clear. How about better translating it to English in the first place as you're trying to reach an international community? Also, it would be nice to put the exact test you did as a command line or configuration file, so it can be replayed on other systems, and - most important - by the developers, to easily uncover what is causing the behavior... > Best regards > > Carsten > > -Ursprüngliche Nachricht- > Von: Andrei Borzenkov [mailto:arvidj...@gmail.com] > Gesendet: Sonntag, 24. September 2017 18:43 > An: Fuhrmann, Carsten ; Qu Wenruo > ; linux-btrfs@vger.kernel.org Betreff: Re: > AW: Btrfs performance with small blocksize on SSD > > 24.09.2017 16:53, Fuhrmann, Carsten пишет: > > Hello, > > > > 1) > > I used direct write (no page cache) but I didn't disable the Disk > > cache of the HDD/SSD itself. In all tests I wrote 1GB and looked > > for the runtime of that write process. > > So "latency" on your diagram means total time to write 1GiB file? > That is highly unusual meaning for "latency" which normally means > time to perform single IO. If so, you should better rename Y-axis to > something like "total run time". > N_r__yb_Xv_^_)__{.n_+{_n___)w*jg______j/___z_2___&_)___a_____G___h__j:+v___w -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs performance with small blocksize on SSD
Am Sun, 24 Sep 2017 19:43:05 +0300 schrieb Andrei Borzenkov: > 24.09.2017 16:53, Fuhrmann, Carsten пишет: > > Hello, > > > > 1) > > I used direct write (no page cache) but I didn't disable the Disk > > cache of the HDD/SSD itself. In all tests I wrote 1GB and looked > > for the runtime of that write process. > > So "latency" on your diagram means total time to write 1GiB file? That > is highly unusual meaning for "latency" which normally means time to > perform single IO. If so, you should better rename Y-axis to something > like "total run time". If you look closely it says "Laufzeit" which visually looks similar to "latency" but really means "run time". ;-) -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: defragmenting best practice?
Am Thu, 21 Sep 2017 22:10:13 +0200 schrieb Kai Krakow <hurikha...@gmail.com>: > Am Wed, 20 Sep 2017 07:46:52 -0400 > schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>: > > > > Fragmentation: Files with a lot of random writes can become > > > heavily fragmented (1+ extents) causing excessive multi-second > > > spikes of CPU load on systems with an SSD or large amount a RAM. > > > On desktops this primarily affects application databases > > > (including Firefox). Workarounds include manually defragmenting > > > your home directory using btrfs fi defragment. Auto-defragment > > > (mount option autodefrag) should solve this problem. > > > > > > Upon reading that I am wondering if fragmentation in the Firefox > > > profile is part of my issue. That's one thing I never tested > > > previously. (BTW, this system has 256 GB of RAM and 20 cores.) > > Almost certainly. Most modern web browsers are brain-dead and > > insist on using SQLite databases (or traditional DB files) for > > everything, including the cache, and the usage for the cache in > > particular kills performance when fragmentation is an issue. > > At least in Chrome, you can turn on simple cache backend, which, I > think, is using many small instead of one huge file. This suit btrfs > much better: > > chrome://flags/#enable-simple-cache-backend > > > And then I suggest also doing this (as your login user): > > $ cd $HOME > $ mv .cache .cache.old > $ mkdir .cache > $ lsattr +C .cache Oops, of course that's chattr, not lsattr > $ rsync -av .cache.old/ .cache/ > $ rm -Rf .cache.old > > This makes caches for most applications nocow. Chrome performance was > completely fixed for me by doing this. > > I'm not sure where Firefox puts its cache, I only use it on very rare > occasions. But I think it's going to .cache/mozilla last time looked > at it. > > You may want to close all apps before converting the cache directory. > > Also, I don't see any downsides in making this nocow. That directory > could easily be also completely volatile. If something breaks due to > no longer protected by data csum, just clean it out. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: defragmenting best practice?
Am Wed, 20 Sep 2017 07:46:52 -0400 schrieb "Austin S. Hemmelgarn": > > Fragmentation: Files with a lot of random writes can become > > heavily fragmented (1+ extents) causing excessive multi-second > > spikes of CPU load on systems with an SSD or large amount a RAM. On > > desktops this primarily affects application databases (including > > Firefox). Workarounds include manually defragmenting your home > > directory using btrfs fi defragment. Auto-defragment (mount option > > autodefrag) should solve this problem. > > > > Upon reading that I am wondering if fragmentation in the Firefox > > profile is part of my issue. That's one thing I never tested > > previously. (BTW, this system has 256 GB of RAM and 20 cores.) > Almost certainly. Most modern web browsers are brain-dead and insist > on using SQLite databases (or traditional DB files) for everything, > including the cache, and the usage for the cache in particular kills > performance when fragmentation is an issue. At least in Chrome, you can turn on simple cache backend, which, I think, is using many small instead of one huge file. This suit btrfs much better: chrome://flags/#enable-simple-cache-backend And then I suggest also doing this (as your login user): $ cd $HOME $ mv .cache .cache.old $ mkdir .cache $ lsattr +C .cache $ rsync -av .cache.old/ .cache/ $ rm -Rf .cache.old This makes caches for most applications nocow. Chrome performance was completely fixed for me by doing this. I'm not sure where Firefox puts its cache, I only use it on very rare occasions. But I think it's going to .cache/mozilla last time looked at it. You may want to close all apps before converting the cache directory. Also, I don't see any downsides in making this nocow. That directory could easily be also completely volatile. If something breaks due to no longer protected by data csum, just clean it out. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SSD caching an existing btrfs raid1
Am Wed, 20 Sep 2017 17:51:15 +0200 schrieb Psalle: > On 19/09/17 17:47, Austin S. Hemmelgarn wrote: > (...) > > > > A better option if you can afford to remove a single device from > > that array temporarily is to use bcache. Bcache has one specific > > advantage in this case, multiple backend devices can share the same > > cache device. This means you don't have to carve out dedicated > > cache space for each disk on the SSD and leave some unused space so > > that you can add new devices if needed. The downside is that you > > can't convert each device in-place, but because you're using BTRFS, > > you can still convert the volume as a whole in-place. The > > procedure for doing so looks like this: > > > > 1. Format the SSD as a bcache cache. > > 2. Use `btrfs device delete` to remove a single hard drive from the > > array. > > 3. Set up the drive you just removed as a bcache backing device > > bound to the cache you created in step 1. > > 4. Add the new bcache device to the array. > > 5. Repeat from step 2 until the whole array is converted. > > > > A similar procedure can actually be used to do almost any > > underlying storage conversion (for example, switching to whole disk > > encryption, or adding LVM underneath BTRFS) provided all your data > > can fit on one less disk than you have. > > Thanks Austin, that's just great. For some reason I had discarded > bcache thinking that it would force me to rebuild from scratch, but > this kind of incremental migration is exactly why I hoped was > possible. I have plenty of space to replace the devices one by one. > > I will report back my experience in a few days, I hope. I've done it exactly that way in the past and it worked flawlessly (but it took 24+ hours). But it was easy for me because I was also adding a third disk to the pool, so existing stuff could easily move. I suggest to initialize bcache to writearound mode while converting, so your maybe terabytes of disk don't go through the SSD. If you later decide to remove bcache or not sure about future bcache usage, you can wrap any partition into a bcache container - just don't connect it to a cache and it will work like a normal partition. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at fs/btrfs/extent_io.c:1989
Am Mon, 18 Sep 2017 20:30:41 +0200 schrieb Holger Hoffstätte: > On 09/18/17 19:09, Liu Bo wrote: > > This 'mirror 0' looks fishy, (as mirror comes from > > btrfs_io_bio->mirror_num, which should be at least 1 if raid1 setup > > is in use.) > > > > Not sure if 4.13.2-gentoo made any changes on btrfs, but can you > > No, it did not; Gentoo always strives to be as close to mainline as > possible except for urgent security & low-risk convenience fixes. According to https://dev.gentoo.org/~mpagano/genpatches/patches-4.13-2.htm it's not only security patches. But as the list shows, there are indeed no btrfs patches. But there's one that may change btrfs behavior (tho unlikely), that is enabling native gcc optimizations if you choose so. I don't think that's a default option in Gentoo. I'm using native optimizations myself and see no strange mirror issues in btrfs. OTOH, I've lately switched to gentoo ck patchset to get better optimizations for gaming and realtime apps. But it's still at the 4.12 series. Are you sure the system crashed and wasn't just stuck at reading from the disks? If the disks have error correction and recovery enabled, the Linux block layer times out on the requests that the drives eventually won't fix anyways and resets the link after 30s. The drive timeout is 120s by default. You can change that on enterprise grade and NAS-ready drives, also a handful of desktop drives support it. Smartctl is used to set the values, just google "smartctl scterc". You could also adjust the timeout of the scsi layer to above the drive timeout, that means more than 120s if you cannot change scterc. I think it makes most sense to not reset the link before the drive had its chance to answer the request. I think there are pros and cons of changing these values. I always recommend to increase the scsi timeout above the scterc timeout. Personally, I lower the scterc timeout to 70 centisecs, and let the scsi timeout just at its default. RAID setups should use this to get control of their own error correction methods: The drive returns from request early and the RAID can do its job of reading from another copy, i.e. btrfs or mdraid, then repair it by writing back a correct copy which the drive converts into a sector relocation aka self-repair. Other people may jump in and recommend their own perspective of why or why not change which knob to which value. But well, as long as you saw no scsi errors reported when the "crash" occurred, these values are not involved in your problem anyways. What about "btrfs device stats"? -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: A user cannot remove his readonly snapshots?!
Am Sat, 16 Sep 2017 09:36:33 +0200 schrieb Ulli Horlacher <frams...@rus.uni-stuttgart.de>: > On Sat 2017-09-16 (01:22), Kai Krakow wrote: > > > > tux@xerus:/test/tux/zz/.snapshot: btrfs subvolume delete > > > 2017-09-15_1859.test Delete subvolume (no-commit): > > > '/test/tux/zz/.snapshot/2017-09-15_1859.test' ERROR: cannot delete > > > '/test/tux/zz/.snapshot/2017-09-15_1859.test': Read-only file > > > system > > > > See "man mount" in section btrfs mount options: There is a mount > > option to allow normal user to delete snapshots. > > As I wrote first: "I have mounted with option user_subvol_rm_allowed" > > tux@xerus:/test/tux/zz/.snapshot: mount | grep /test > /dev/sdd4 on /test type btrfs > (rw,relatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/) > > This does not help. A user cannot remove a readonly snapshot he just > has created. Yes, sorry, I only later discovered the other posts. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: A user cannot remove his readonly snapshots?!
Am Sat, 16 Sep 2017 00:02:01 +0200 schrieb Ulli Horlacher: > On Fri 2017-09-15 (23:44), Ulli Horlacher wrote: > > On Fri 2017-09-15 (22:07), Peter Grandi wrote: > > > [...] > > > > > > Ordinary permissions still apply both to 'create' and 'delete': > > > > My user tux is the owner of the snapshot directory, because he has > > created it! > > I can delete normal subvolumes but not the readonly snapshots: > > tux@xerus:/test/tux/zz/.snapshot: btrfs subvolume create test > Create subvolume './test' > > tux@xerus:/test/tux/zz/.snapshot: ll > drwxr-xr-x tux users- 2017-09-15 18:22:26 > 2017-09-15_1822.test drwxr-xr-x tux users- 2017-09-15 > 18:22:26 2017-09-15_1824.test drwxr-xr-x tux users- > 2017-09-15 18:57:39 2017-09-15_1859.test drwxr-xr-x tux > users- 2017-09-15 23:58:51 test > > tux@xerus:/test/tux/zz/.snapshot: btrfs subvolume delete test > Delete subvolume (no-commit): '/test/tux/zz/.snapshot/test' > > tux@xerus:/test/tux/zz/.snapshot: btrfs subvolume delete > 2017-09-15_1859.test Delete subvolume (no-commit): > '/test/tux/zz/.snapshot/2017-09-15_1859.test' ERROR: cannot delete > '/test/tux/zz/.snapshot/2017-09-15_1859.test': Read-only file system See "man mount" in section btrfs mount options: There is a mount option to allow normal user to delete snapshots. But this is said to has security implication I cannot currently tell. Maybe someone else knows. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: defragmenting best practice?
Am Fri, 15 Sep 2017 16:11:50 +0200 schrieb Michał Sokołowski: > On 09/15/2017 03:07 PM, Tomasz Kłoczko wrote: > > [...] > > Case #1 > > 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu cow2 > > storage -> guest BTRFS filesystem > > SQL table row insertions per second: 1-2 > > > > Case #2 > > 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw > > storage -> guest EXT4 filesystem > > SQL table row insertions per second: 10-15 > > Q -1) why you are comparing btrfs against ext4 on top of the btrfs > > which is doing own COW operations on bottom of such sandwiches .. > > if we SUPPOSE to be talking about impact of the fragmentation on > > top of btrfs? > > Tomasz, > you seem to be convinced that fragmentation does not matter. I found > this (extremely bad, true) example says otherwise. Sorry to jump this, but did you at least set the qemu image to nocow? Otherwise this example is totally flawed because you're testing qemu storage layer mostly and not btrfs. A better test would've been to test qemu raw on btrfs cow vs on btrfs nocow, with both the same file system inside the qemu image. But you are modifying multiple parameters at once during the test, and I expect then everyone has a huge impact on performance but only one is specific to btrfs which you apparently did not test this way. Personally, running qemu cow2 on btrfs cow really helps nothing except really bad performance. Make one of both layers nocow and it should become better. If you want to give some better numbers, please reduce this test to just one cow layer, the one at the top layer: btrfs host fs. Copy the image somewhere else to restore from, and ensure (using filefrag) that the starting situation matches each test run. Don't change any parameters of the qemu layer at each test. And run a file system inside which doesn't do any fancy stuff, like ext2 or ext3 without journal. Use qemu raw storage. Then test again with cow vs nocow on the host side. Create a nocow copy of your image (use size of the source image for truncate): # rm -f qemu-image-nocow.raw # touch qemu-image-nocow.raw # chattr +C -c qemu-image-nocow.raw # dd if=source-image.raw of=qemu-image-nocow.raw bs=1M # btrfs fi defrag -f qemu-image-nocow.raw # filefrag -v qemu-image-nocow.raw Create a cow copy of your image: # rm -f qemu-image-cow.raw # touch qemu-image-cow.raw # chattr -C -c qemu-image-cow.raw # dd if=source-image.raw of=qemu-image-cow.raw bs=1M # btrfs fi defrag -f qemu-image-cow.raw # filefrag -v qemu-image-cow.raw Given that host btrfs is mounted datacow,compress=none and without autodefrag, and you don't touch the source image contents during tests. Now run your test script inside both qemu machines, take your measurements and check fragmentation again after the run. filefrag should report no more fragments than before the test for the first test, but should report a magnitude more for the second test. Now copy (cp) both one at a time to a new file and measure the time. It should be slower for the highly fragmented version. Don't forget to run tests with and without flushed caches so we get cold and warm numbers. In this scenario, qemu would only be the application to modify the raw image files and you're actually testing the impact of fragmentation of btrfs. You could also make a reflink copy of the nocow test image and do a third test to see that it introduces fragmentation now, tho probably much lower than for the cow test image. You can verify the numbers with filefrag. According to Tomasz, your tests should not run at vastly different speeds because fragmentation has no impact on performance, quod est demonstrandum... I think we will not get to the "erat" part. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: defragmenting best practice?
Am Thu, 14 Sep 2017 18:48:54 +0100 schrieb Tomasz Kłoczko <kloczko.tom...@gmail.com>: > On 14 September 2017 at 16:24, Kai Krakow <hurikha...@gmail.com> > wrote: [..] > > Getting e.g. boot files into read order or at least nearby improves > > boot time a lot. Similar for loading applications. > > By how much it is possible to improve boot time? > Just please some example which I can try to replay which ill be > showing that we have similar results. > I still have one one of my laptops with spindle on btrfs root fs ( and > no other FSess in use) so I could be able to confirm that my numbers > are enough close to your numbers. I need to create a test setup because this system uses bcache. The difference (according to systemd-analyze) between warm bcache and no bcache at all ranges from 16-30s boot time vs. 3+ minutes boot time. I could turn off bcache, do a boot trace, try to rearrange boot files, boot again. However, that is not very reproducible as the current file layout is not defined. It'd be better to setup a separate machine where I could start over from a "well defined" state before applying optimization steps to see the differences between different strategies. At least readahead is not very helpful, I tested that in the past. It reduces boot time just by a few seconds, maybe 20-30, thus going from 3+ minutes to 2+ minutes. I still have an old laptop lying around: Single spindle, should make a good test scenario. I'll have to see if I can get it back into shape. It will take me some time. > > Shake tries to > > improve this by rewriting the files - and this works because file > > systems (given enough free space) already do a very good job at > > doing this. But constant system updates degrade this order over > > time. > > OK. Please prepare some database, import some data which size will be > few times of not used RAM (best if this multiplication factor will be > at least 10). Then do some batch of selects measuring distribution > latencies of those queries. Well, this is pretty easy. Systemd-journald is a real beast when it comes to cow fragmentation. Results can be easily generated and reproduced. There are long traces of discussions in the systemd mailing list and I simply decided to make the files nocow right from the start and that fixed it for me. I can simply revert it and create benchmarks. > This will give you some data about. not fragmented data. Well, I would probably do it the other way around: Generate a fragmented journal file (as that is how journald creates the file over time), then rewrite it by some manner to reduce extents, then run journal operations again on this file. Does it bother you to turn this around? > Then on next stage try to apply some number of update queries and > after reboot the system or drop all caches. and repeat the same set of > selects. > After this all what you need to do is compare distribution of the > latencies. Which tool to use to measure which latencies? Speaking of latencies: What's of interest here is perceived performance resulting mostly from seek overhead (except probably in the journal file case which just overwhelmes by the pure amount of extents). I'm not sure if measuring VFS latencies would provide any useful insights here. VFS probably works fast enough still in this case. > > It really doesn't matter if some big file is laid out in 1 > > allocation of 1 GB or in 250 allocations of 4MB: It really doesn't > > make a big difference. > > > > Recombining extents into bigger once, tho, can make a big > > difference in an aging btrfs, even on SSDs. > > That it may be an issue with using extents. I can't follow why you argue that a file with thousands of extents vs a file of same size but only a few extents would makes no difference to operate on. And of course this has to do with extents. But btrfs uses extents. Do you suggest to use ZFS instead? Due to how cow works, the effect would probably be less or barely noticable for writes, but read scanning through the file becomes slow with clearly more "noise" from the moving heads. > Again: please show some results of some test unit which anyone will be > able to reply and confirm or not that this effect really exist. > > If problem really exist and is related ot extents you should have real > scenario explanation why ZFS is not using extents. That was never the discussion. You brought in the ZFS point. I read about the design reasoning behind ZFS when it appeared and started gain public interest years back. > btrfs is not to far from classic approach do FS because it srill uses > allocation structures. > This is not the case in context of ZFS because this technology has no > information about what is already allocates. What about btrfs free space tree? Isn'
Re: defragmenting best practice?
Am Thu, 14 Sep 2017 17:24:34 +0200 schrieb Kai Krakow <hurikha...@gmail.com>: Errors corrected, see below... > Am Thu, 14 Sep 2017 14:31:48 +0100 > schrieb Tomasz Kłoczko <kloczko.tom...@gmail.com>: > > > On 14 September 2017 at 12:38, Kai Krakow <hurikha...@gmail.com> > > wrote: [..] > > > > > > I suggest you only ever defragment parts of your main subvolume or > > > rely on autodefrag, and let bees do optimizing the snapshots. > > Please read that again including the parts you omitted. > > > > > Also, I experimented with adding btrfs support to shake, still > > > working on better integration but currently lacking time... :-( > > > > > > Shake is an adaptive defragger which rewrites files. With my > > > current patches it clones each file, and then rewrites it to its > > > original location. This approach is currently not optimal as it > > > simply bails out if some other process is accessing the file and > > > leaves you with an (intact) temporary copy you need to move back > > > in place manually. > > > > If you really want to have real and *ideal* distribution of the data > > across physical disk first you need to build time travel device. > > This device will allow you to put all blocks which needs to be read > > in perfect order (to read all data only sequentially without seek). > > However it will be working only in case of spindles because in case > > of SSDs there is no seek time. > > Please let us know when you will write drivers/timetravel/ Linux > > kernel driver. When such driver will be available I promise I'll > > write all necessary btrfs code by myself in matter of few days (it > > will be piece of cake compare to build such device). > > > > But seriously .. > > Seriously: Defragmentation on spindles is IMHO not about getting the > perfect continuous allocation but providing better spatial layout of > the files you work with. > > Getting e.g. boot files into read order or at least nearby improves > boot time a lot. Similar for loading applications. Shake tries to > improve this by rewriting the files - and this works because file > systems (given enough free space) already do a very good job at doing > this. But constant system updates degrade this order over time. > > It really doesn't matter if some big file is laid out in 1 allocation > of 1 GB or in 250 allocations of 4MB: It really doesn't make a big > difference. > > Recombining extents into bigger once, tho, can make a big difference > in an aging btrfs, even on SSDs. > > Bees is, btw, not about defragmentation: I have some OS containers > running and I want to deduplicate data after updates. It seems to do a > good job here, better than other deduplicators I found. And if some > defrag tools destroyed your snapshot reflinks, bees can also help > here. On its way it may recombine extents so it may improve > fragmentation. But usually it probably defragments because it needs ^^^ It fragments! > to split extents that a defragger combined. > > But well, I think getting 100% continuous allocation is really not the > achievement you want to get, especially when reflinks are a primary > concern. > > > > Only context/scenario when you may want to lower defragmentation is > > when you are something needs to allocate continuous area lower than > > free space and larger than largest free chunk. Something like this > > happens only when volume is working on almost 100% allocated space. > > In such scenario even you bees cannot do to much as it may be not > > enough free space to move some other data in larger chunks to > > defragment FS physical space. > > Bees does not do that. > > > > If your workload will be still writing > > new data to FS such defragmentation may give you (maybe) few more > > seconds and just after this FS will be 100% full, > > > > In other words if someone is thinking that such defragmentation > > daemon is solving any problems he/she may be 100% right .. such > > person is only *thinking* that this is truth. > > Bees is not about that. > > > > kloczek > > PS. Do you know first McGyver rule? -> "If it ain't broke, don't fix > > it". > > Do you know the saying "think first, then act"? > > > > So first show that fragmentation is hurting latency of the > > access to btrfs data and it will be possible to measurable such > > impact. Before you will start measuring this you need to learn how o > > sample for example VFS la
Re: defragmenting best practice?
Am Thu, 14 Sep 2017 14:31:48 +0100 schrieb Tomasz Kłoczko <kloczko.tom...@gmail.com>: > On 14 September 2017 at 12:38, Kai Krakow <hurikha...@gmail.com> > wrote: [..] > > > > I suggest you only ever defragment parts of your main subvolume or > > rely on autodefrag, and let bees do optimizing the snapshots. Please read that again including the parts you omitted. > > Also, I experimented with adding btrfs support to shake, still > > working on better integration but currently lacking time... :-( > > > > Shake is an adaptive defragger which rewrites files. With my current > > patches it clones each file, and then rewrites it to its original > > location. This approach is currently not optimal as it simply bails > > out if some other process is accessing the file and leaves you with > > an (intact) temporary copy you need to move back in place > > manually. > > If you really want to have real and *ideal* distribution of the data > across physical disk first you need to build time travel device. This > device will allow you to put all blocks which needs to be read in > perfect order (to read all data only sequentially without seek). > However it will be working only in case of spindles because in case of > SSDs there is no seek time. > Please let us know when you will write drivers/timetravel/ Linux > kernel driver. When such driver will be available I promise I'll > write all necessary btrfs code by myself in matter of few days (it > will be piece of cake compare to build such device). > > But seriously .. Seriously: Defragmentation on spindles is IMHO not about getting the perfect continuous allocation but providing better spatial layout of the files you work with. Getting e.g. boot files into read order or at least nearby improves boot time a lot. Similar for loading applications. Shake tries to improve this by rewriting the files - and this works because file systems (given enough free space) already do a very good job at doing this. But constant system updates degrade this order over time. It really doesn't matter if some big file is laid out in 1 allocation of 1 GB or in 250 allocations of 4MB: It really doesn't make a big difference. Recombining extents into bigger once, tho, can make a big difference in an aging btrfs, even on SSDs. Bees is, btw, not about defragmentation: I have some OS containers running and I want to deduplicate data after updates. It seems to do a good job here, better than other deduplicators I found. And if some defrag tools destroyed your snapshot reflinks, bees can also help here. On its way it may recombine extents so it may improve fragmentation. But usually it probably defragments because it needs to split extents that a defragger combined. But well, I think getting 100% continuous allocation is really not the achievement you want to get, especially when reflinks are a primary concern. > Only context/scenario when you may want to lower defragmentation is > when you are something needs to allocate continuous area lower than > free space and larger than largest free chunk. Something like this > happens only when volume is working on almost 100% allocated space. > In such scenario even you bees cannot do to much as it may be not > enough free space to move some other data in larger chunks to > defragment FS physical space. Bees does not do that. > If your workload will be still writing > new data to FS such defragmentation may give you (maybe) few more > seconds and just after this FS will be 100% full, > > In other words if someone is thinking that such defragmentation daemon > is solving any problems he/she may be 100% right .. such person is > only *thinking* that this is truth. Bees is not about that. > kloczek > PS. Do you know first McGyver rule? -> "If it ain't broke, don't fix > it". Do you know the saying "think first, then act"? > So first show that fragmentation is hurting latency of the > access to btrfs data and it will be possible to measurable such > impact. Before you will start measuring this you need to learn how o > sample for example VFS layer latency. Do you know how to do this to > deliver such proof? You didn't get the point. You only read "defragmentation" and your alarm lights lid up. You even think bees would be a defragmenter. It probably is more the opposite because it introduces more fragments in exchange for more reflinks. > PS2. The same "discussions" about fragmentation where in the past > about +10 years ago after ZFS has been introduced. Just to let you > know that after initial ZFS introduction up to now was not written > even single line of ZFS code to handle active fragmentation and no one > been able to prove that something about active defragmentation needs >
Re: defragmenting best practice?
Am Tue, 12 Sep 2017 18:28:43 +0200 schrieb Ulli Horlacher: > On Thu 2017-08-31 (09:05), Ulli Horlacher wrote: > > When I do a > > btrfs filesystem defragment -r /directory > > does it defragment really all files in this directory tree, even if > > it contains subvolumes? > > The man page does not mention subvolumes on this topic. > > No answer so far :-( > > But I found another problem in the man-page: > > Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as > well as with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or > >= 3.13.4 will break up the ref-links of COW data (for example files > >copied with > cp --reflink, snapshots or de-duplicated data). This may cause > considerable increase of space usage depending on the broken up > ref-links. > > I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several > snapshots. > Therefore, I better should avoid calling "btrfs filesystem defragment > -r"? > > What is the defragmenting best practice? > Avoid it completly? You may want to try https://github.com/Zygo/bees. It is a daemon watching the file system generation changes, scanning the blocks and then recombines them. Of course, this process somewhat defeats the purpose of defragging in the first place as it will undo some of the defragmenting. I suggest you only ever defragment parts of your main subvolume or rely on autodefrag, and let bees do optimizing the snapshots. Also, I experimented with adding btrfs support to shake, still working on better integration but currently lacking time... :-( Shake is an adaptive defragger which rewrites files. With my current patches it clones each file, and then rewrites it to its original location. This approach is currently not optimal as it simply bails out if some other process is accessing the file and leaves you with an (intact) temporary copy you need to move back in place manually. Shake works very well with the idea of detecting how defragmented, how old, and how far away from an "ideal" position a file is and exploits standard Linux file systems behavior to optimally placing files by rewriting them. It then records its status per file in extended attributes. It also works with non-btrfs file systems. My patches try to avoid defragging files with shared extents, so this may help your situation. However, it will still shuffle files around if they are too far from their ideal position, thus destroying shared extents. A future patch could use extent recombining and skip shared extents in that process. But first I'd like to clean out some of the rough edges together with the original author of shake. Look here: https://github.com/unbrice/shake and also check out the pull requests and comments there. You shouldn't currently run shake unattended and only on specific parts of your FS you feel need defragmenting. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Help me understand what is going on with my RAID1 FS
Am Sun, 10 Sep 2017 20:15:52 +0200 schrieb Ferenc-Levente Juhos: > >Problem is that each raid1 block group contains two chunks on two > >separate devices, it can't utilize fully three devices no matter > >what. If that doesn't suit you then you need to add 4th disk. After > >that FS will be able to use all unallocated space on all disks in > >raid1 profile. But even then you'll be able to safely lose only one > >disk since BTRFS still will be storing only 2 copies of data. > > I hope I didn't say that I want to utilize all three devices fully. It > was clear to me that there will be 2 TB of wasted space. > Also I'm not questioning the chunk allocator for RAID1 at all. It's > clear and it always has been clear that for RAID1 the chunks need to > be allocated on different physical devices. > If I understood Kai's point of view, he even suggested that I might > need to do balancing to make sure that the free space on the three > devices is being used smartly. Hence the questions about balancing. It will allocate chunks from the device with the most space available. So while you fill your disks space usage will evenly distribute. The problem comes when you start deleting stuff, some chunks may even be freed, and everything becomes messed up. In an aging file system you may notice that the chunks are no longer evenly distributed. A balance is a way to fix that because it will reallocate chunks and coalesce data back into single chunks, making free space for new allocations. In this process it will actually evenly distribute your data again. You may want to use this rebalance script: https://www.spinics.net/lists/linux-btrfs/msg52076.html > I mean in worst case it could happen like this: > > Again I have disks of sizes 3, 3, 8: > Fig.1 > Drive1(8) Drive2(3) Drive3(3) > - X1X1 > - X2X2 > - X3X3 > Here the new drive is completely unused. Even if one X1 chunk would be > on Drive1 it would be still a sub-optimal allocation. This won't happen while filling a fresh btrfs. Chunks are always allocated from a device with most free space (within the raid1 constraints). This it will allocate space alternating between disk1+2 and disk1+3. > This is the optimal allocation. Will btrfs allocate like this? > Considering that Drive1 has the most free space. > Fig. 2 > Drive1(8) Drive2(3) Drive3(3) > X1X1- > X2- X2 > X3X3- > X4- X4 Yes. > From my point of view Fig.2 shows the optimal allocation, by the time > the disks Drive2 and Drive3 are full (3TB) Drive1 must have 6TB > (because it is exclusively holding the mirrors for both Drive2 and 3). > For sure now btrfs can say, since two of the drives are completely > full he can't allocate any more chunks and the remaining 2 TB of space > from Drive1 is wasted. This is clear it's even pointed out by the > btrfs size calculator. Yes. > But again if the above statements are true, then df might as well tell > the "truth" and report that I have 3.5 TB space free and not 1.5TB (as > it is reported now). Again here I fully understand Kai's explanation. > Because coming back to my first e-mail, my "problem" was that df is > reporting 1.5 TB free, whereas the whole FS holds 2.5 TB of data. The size calculator has undergone some revisions. I think it currently estimates the free space from net data to raw data ratio across all devices, taking the current raid constraints into account. Calculating free space in btrfs is difficult because in the future btrfs may even support different raid levels for different sub volumes. It's probably best to calculate for the worst case scenario then. Even today it's already difficult if you use different raid levels for meta data and content data: The filesystem cannot predict the future of allocations. It can only give an educated guess. And the calculation was revised a few times to not "overshoot". > So the question still remains, is it just that df is intentionally not > smart enough to give a more accurate estimation, The df utility doesn't now anything about btrfs allocations. The value is estimated by btrfs itself. To get more detailed info for capacity planning, you should use "btrfs fi df" and its various siblings. > or is the assumption > that the allocator picks the drive with most free space mistaken? > If I continue along the lines of what Kai said, and I need to do > re-balance, because the allocation is not like shown above (Fig.2), > then my question is still legitimate. Are there any filters that one > might use to speed up or to selectively balance in my case? or will I > need to do full balance? Your assumption is misguided. The total free space estimation is a totally different thing than what the allocator bases its decision on. See "btrfs dev usage". The allocator uses space from the biggest unallocated space
Re: Help me understand what is going on with my RAID1 FS
Am Sun, 10 Sep 2017 15:45:42 +0200 schrieb FLJ: > Hello all, > > I have a BTRFS RAID1 volume running for the past year. I avoided all > pitfalls known to me that would mess up this volume. I never > experimented with quotas, no-COW, snapshots, defrag, nothing really. > The volume is a RAID1 from day 1 and is working reliably until now. > > Until yesterday it consisted of two 3 TB drives, something along the > lines: > > Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db > Total devices 2 FS bytes used 2.47TiB > devid1 size 2.73TiB used 2.47TiB path /dev/sdb > devid2 size 2.73TiB used 2.47TiB path /dev/sdc > > Yesterday I've added a new drive to the FS and did a full rebalance > (without filters) over night, which went through without any issues. > > Now I have: > Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db > Total devices 3 FS bytes used 2.47TiB > devid1 size 2.73TiB used 1.24TiB path /dev/sdb > devid2 size 2.73TiB used 1.24TiB path /dev/sdc > devid3 size 7.28TiB used 2.48TiB path /dev/sda > > # btrfs fi df /mnt/BigVault/ > Data, RAID1: total=2.47TiB, used=2.47TiB > System, RAID1: total=32.00MiB, used=384.00KiB > Metadata, RAID1: total=4.00GiB, used=2.74GiB > GlobalReserve, single: total=512.00MiB, used=0.00B > > But still df -h is giving me: > Filesystem Size Used Avail Use% Mounted on > /dev/sdb 6.4T 2.5T 1.5T 63% /mnt/BigVault > > Although I've heard and read about the difficulty in reporting free > space due to the flexibility of BTRFS, snapshots and subvolumes, etc., > but I only have a single volume, no subvolumes, no snapshots, no > quotas and both data and metadata are RAID1. > > My expectation would've been that in case of BigVault Size == Used + > Avail. > > Actually based on http://carfax.org.uk/btrfs-usage/index.html I > would've expected 6 TB of usable space. Here I get 6.4 which is odd, > but that only 1.5 TB is available is even stranger. > > Could anyone explain what I did wrong or why my expectations are > wrong? > > Thank you in advance Btrfs reports estimated free space from the free space of the smallest member as it can only guarantee that. In your case this is 2.73 minus 1.24 free which is roughly around 1.5T. But since this free space distributes across three disks with one having much more free space, it probably will use up that space at half the rate of actual allocation. But due to how btrfs allocates from free space in chunks, that may not be possible - thus the low unexpected value. You will probably need to run balance once in a while to evenly redistribute allocated chunks across all disks. It may give you better estimates if you combine sdb and sdc into one logical device, e.g. using raid0 or jbod via md or lvm. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: csum failed root -9
Am Wed, 14 Jun 2017 15:39:50 +0200 schrieb Henk Slager <eye...@gmail.com>: > On Tue, Jun 13, 2017 at 12:47 PM, Henk Slager <eye...@gmail.com> > wrote: > > On Tue, Jun 13, 2017 at 7:24 AM, Kai Krakow <hurikha...@gmail.com> > > wrote: > >> Am Mon, 12 Jun 2017 11:00:31 +0200 > >> schrieb Henk Slager <eye...@gmail.com>: > >> > [...] > >> > >> There's btrfs-progs v4.11 available... > > > > I started: > > # btrfs check -p --readonly /dev/mapper/smr > > but it stopped with printing 'Killed' while checking extents. The > > board has 8G RAM, no swap (yet), so I just started lowmem mode: > > # btrfs check -p --mode lowmem --readonly /dev/mapper/smr > > > > Now after a 1 day 77 lines like this are printed: > > ERROR: extent[5365470154752, 81920] referencer count mismatch (root: > > 6310, owner: 1771130, offset: 33243062272) wanted: 1, have: 2 > > > > It is still running, hopefully it will finish within 2 days. But > > lateron I can compile/use latest progs from git. Same for kernel, > > maybe with some tweaks/patches, but I think I will also plug the > > disk into a faster machine then ( i7-4770 instead of the J1900 ). > > > [...] > >> > >> What looks strange to me is that the parameters of the error > >> reports seem to be rotated by one... See below: > >> > [...] > >> > >> Why does it say "ino 1"? Does it mean devid 1? > > > > On a 3-disk btrfs raid1 fs I see in the journal also "read error > > corrected: ino 1" lines for all 3 disks. This was with a 4.10.x > > kernel, ATM I don't know if this is right or wrong. > > > [...] > >> > >> And why does it say "root -9"? Shouldn't it be "failed -9 root 257 > >> ino 515567616"? In that case the "off" value would be completely > >> missing... > >> > >> Those "rotations" may mess up with where you try to locate the > >> error on disk... > > > > I hadn't looked at the numbers like that, but as you indicate, I > > also think that the 1-block csum fail location is bogus because the > > kernel calculates that based on some random corruption in critical > > btrfs structures, also looking at the 77 referencer count > > mismatches. A negative root ID is already a sort of red flag. When > > I can mount the fs again after the check is finished, I can > > hopefully use the output of the check to get clearer how big the > > 'damage' is. > > The btrfs lowmem mode check ends with: > > ERROR: root 7331 EXTENT_DATA[928390 3506176] shouldn't be hole > ERROR: errors found in fs roots > found 6968612982784 bytes used, error(s) found > total csum bytes: 6786376404 > total tree bytes: 25656016896 > total fs tree bytes: 14857535488 > total extent tree bytes: 3237216256 > btree space waste bytes: 3072362630 > file data blocks allocated: 38874881994752 > referenced 36477629964288 > > In total 2000+ of those "shouldn't be hole" lines. > > A non-lowmem check, now done with kernel 4.11.4 and progs v4.11 and > 16G swap added ends with 'noerrors found' Don't trust lowmem mode too much. The developer of lowmem mode may tell you more about specific edge cases. > W.r.t. holes, maybe it is woth to mention the super-flags: > incompat_flags 0x369 > ( MIXED_BACKREF | > COMPRESS_LZO | > BIG_METADATA | > EXTENDED_IREF | > SKINNY_METADATA | > NO_HOLES ) I think it's not worth to follow up on this holes topic: I guess it was a false report of lowmem mode, and it was fixed with 4.11 btrfs progs. > The fs has received snapshots from source fs that had NO_HOLES enabled > for some time, but after registed this bug: > https://bugzilla.kernel.org/show_bug.cgi?id=121321 > I put back that NO_HOLES flag to zero on the source fs. It seems I > forgot to do that on the 8TB target/backup fs. But I don't know if > there is a relation between this flag flipping and the btrfs check > error messages. > > I think I leave it as is for the time being, unless there is some news > how to fix things with low risk (or maybe via a temp overlay snapshot > with DM). But the lowmem check took 2 days, that's not really fun. > The goal for the 8TB fs is to have an up to 7 year snapshot history at > sometime, now the oldest snapshot is from early 2014, so almost > halfway :) Btrfs is still much too unstable to trust 7 years worth of backup to it. You will probably loose it at some point, especially while many snapshots are still such a huge performance breaker in btrfs. I suggest trying out also other alternatives like borg backup for such a project. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: csum failed root -9
Am Mon, 12 Jun 2017 11:00:31 +0200 schrieb Henk Slager: > Hi all, > > there is 1-block corruption a 8TB filesystem that showed up several > months ago. The fs is almost exclusively a btrfs receive target and > receives monthly sequential snapshots from two hosts but 1 received > uuid. I do not know exactly when the corruption has happened but it > must have been roughly 3 to 6 months ago. with monthly updated > kernel+progs on that host. > > Some more history: > - fs was created in november 2015 on top of luks > - initially bcache between the 2048-sector aligned partition and luks. > Some months ago I removed 'the bcache layer' by making sure that cache > was clean and then zeroing 8K bytes at start of partition in an > isolated situation. Then setting partion offset to 2064 by > delete-recreate in gdisk. > - in december 2016 there were more scrub errors, but related to the > monthly snapshot of december2016. I have removed that snapshot this > year and now only this 1-block csum error is the only issue. > - brand/type is seagate 8TB SMR. At least since kernel 4.4+ that > includes some SMR related changes in the blocklayer this disk works > fine with btrfs. > - the smartctl values show no error so far but I will run an extended > test this week after another btrfs check which did not show any error > earlier with the csum fail being there > - I have noticed that the board that has the disk attached has been > rebooted due to power-failures many times (unreliable power switch and > power dips from energy company) and the 150W powersupply is broken and > replaced since then. Also due to this, I decided to remove bcache > (which has been in write-through and write-around only). > > Some btrfs inpect-internal exercise shows that the problem is in a > directory in the root that contains most of the data and snapshots. > But an rsync -c with an identical other clone snapshot shows no > difference (no writes to an rw snapshot of that clone). So the fs is > still OK as file-level backup, but btrfs replace/balance will fatal > error on just this 1 csum error. It looks like that this is not a > media/disk error but some HW induced error or SW/kernel issue. > Relevant btrfs commands + dmesg info, see below. > > Any comments on how to fix or handle this without incrementally > sending all snapshots to a new fs (6+ TiB of data, assuming this won't > fail)? > > > # uname -r > 4.11.3-1-default > # btrfs --version > btrfs-progs v4.10.2+20170406 There's btrfs-progs v4.11 available... > fs profile is dup for system+meta, single for data > > # btrfs scrub start /local/smr What looks strange to me is that the parameters of the error reports seem to be rotated by one... See below: > [27609.626555] BTRFS error (device dm-0): parent transid verify failed > on 6350718500864 wanted 23170 found 23076 > [27609.685416] BTRFS info (device dm-0): read error corrected: ino 1 > off 6350718500864 (dev /dev/mapper/smr sector 11681212672) > [27609.685928] BTRFS info (device dm-0): read error corrected: ino 1 > off 6350718504960 (dev /dev/mapper/smr sector 11681212680) > [27609.686160] BTRFS info (device dm-0): read error corrected: ino 1 > off 6350718509056 (dev /dev/mapper/smr sector 11681212688) > [27609.687136] BTRFS info (device dm-0): read error corrected: ino 1 > off 6350718513152 (dev /dev/mapper/smr sector 11681212696) > [37663.606455] BTRFS error (device dm-0): parent transid verify failed > on 6350453751808 wanted 23170 found 23075 > [37663.685158] BTRFS info (device dm-0): read error corrected: ino 1 > off 6350453751808 (dev /dev/mapper/smr sector 11679647008) > [37663.685386] BTRFS info (device dm-0): read error corrected: ino 1 > off 6350453755904 (dev /dev/mapper/smr sector 11679647016) > [37663.685587] BTRFS info (device dm-0): read error corrected: ino 1 > off 635045376 (dev /dev/mapper/smr sector 11679647024) > [37663.685798] BTRFS info (device dm-0): read error corrected: ino 1 > off 6350453764096 (dev /dev/mapper/smr sector 11679647032) Why does it say "ino 1"? Does it mean devid 1? > [43497.234598] BTRFS error (device dm-0): bdev /dev/mapper/smr errs: > wr 0, rd 0, flush 0, corrupt 1, gen 0 > [43497.234605] BTRFS error (device dm-0): unable to fixup (regular) > error at logical 7175413624832 on dev /dev/mapper/smr > > # < figure out which chunk with help of btrfs py lib > > > chunk vaddr 7174898057216 type 1 stripe 0 devid 1 offset 6696948727808 > length 1073741824 used 1073741824 used_pct 100 > chunk vaddr 7175971799040 type 1 stripe 0 devid 1 offset 6698022469632 > length 1073741824 used 1073741824 used_pct 100 > > # btrfs balance start -v > -dvrange=7174898057216..7174898057217 /local/smr > > [74250.913273] BTRFS info (device dm-0): relocating block group > 7174898057216 flags data > [74255.941105] BTRFS warning (device dm-0): csum failed root -9 ino > 257 off 515567616 csum 0x589cb236 expected csum 0xee19bf74 mirror 1 > [74255.965804] BTRFS warning (device dm-0): csum
Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
Am Tue, 23 May 2017 07:21:33 -0400 schrieb "Austin S. Hemmelgarn": > On 2017-05-22 22:07, Chris Murphy wrote: > > On Mon, May 22, 2017 at 5:57 PM, Marc MERLIN > > wrote: > >> On Mon, May 22, 2017 at 05:26:25PM -0600, Chris Murphy wrote: > [...] > [...] > [...] > >> > >> Oh, swap will work, you're sure? > >> I already have an SSD, if that's good enough, I can give it a > >> shot. > > > > Yeah although I have no idea how much swap is needed for it to > > succeed. I'm not sure what the relationship is to fs metadata chunk > > size to btrfs check RAM requirement is; but if it wants all of the > > metadata in RAM, then whatever btrfs fi us shows you for metadata > > may be a guide (?) for how much memory it's going to want. > I think the in-memory storage is a bit more space efficient than the > on-disk storage, but I'm not certain, and I'm pretty sure it takes up > more space when it's actually repairing things. If I'm doing the > math correctly, you _may_ need up to 50% _more_ than the total > metadata size for the FS in virtual memory space. > > > > Another possibility is zswap, which still requires a backing device, > > but it might be able to limit how much swap to disk is needed if the > > data to swap out is highly compressible. *shrug* > > > zswap won't help in that respect, but it might make swapping stuff > back in faster. It just keeps a compressed copy in memory in > parallel to writing the full copy out to disk, then uses that > compressed copy to swap in instead of going to disk if the copy is > still in memory (but it will discard the compressed copies if memory > gets really low). In essence, it reduces the impact of swapping when > memory pressure is moderate (the situation for most desktops for > example), but becomes almost useless when you have very high memory > pressure (which is what describes this usage). Is this really how zswap works? I always thought it acts as a compressed write-back cache in front of the swap devices. Pages first go to zswap compressed, and later write-back kicks in and migrates those compressed pages to real swap, but still compressed. This is done by zswap putting two (or up to three in modern kernels) compressed pages into one page. It has the downside of uncompressing all "buddy pages" when only one is needed back in. But it stays compressed. This also tells me zswap will either achieve around 1:2 or 1:3 effective compression ratio or none. So it cannot be compared to how streaming compression works. OTOH, if the page is reloaded from cache before write-back kicks in, it will never be written to swap but just uncompressed and discarded from the cache. Under high memory pressure it doesn't really work that well due to high CPU overhead if pages constantly swap out, compress, write, read, uncompress, swap in... This usually results in very low CPU usage for processes but high IO and disk wait and high kernel CPU usage. But it defers memory pressure conditions to a little later in exchange for more a little more IO usage and more CPU usage. If you have a lot of inactive memory around, it can make a difference. But it is counter productive if almost all your memory is active and pressure is high. So, in this scenario, it probably still doesn't help. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] Btrfs: compression must free at least PAGE_SIZE
Am Sat, 20 May 2017 19:49:53 +0300 schrieb Timofey Titovets: > Btrfs already skip store of data where compression didn't free at > least one byte. So make logic better and make check that compression > free at least one PAGE_SIZE, because in another case it useless to > store this data compressed > > Signed-off-by: Timofey Titovets > --- > fs/btrfs/lzo.c | 5 - > fs/btrfs/zlib.c | 3 ++- > 2 files changed, 6 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c > index bd0b0938..7f38bc3c 100644 > --- a/fs/btrfs/lzo.c > +++ b/fs/btrfs/lzo.c > @@ -229,8 +229,11 @@ static int lzo_compress_pages(struct list_head > *ws, in_len = min(bytes_left, PAGE_SIZE); > } > > - if (tot_out > tot_in) > + /* Compression must save at least one PAGE_SIZE */ > + if (tot_out + PAGE_SIZE => tot_in) { Shouldn't this be ">" instead of ">=" (btw, I don't think => works)... Given the case that tot_in is 8192, and tot_out is 4096, we saved a complete page but 4096+4096 would still be equal to 8192. The former logic only pretended that there is no point in compression if we saved 0 bytes. BTW: What's the smallest block size that btrfs stores? Is it always PAGE_SIZE? I'm not familiar with btrfs internals... > + ret = -E2BIG; > goto out; > + } > > /* store the size of all chunks of compressed data */ > cpage_out = kmap(pages[0]); > diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c > index 135b1082..2b04259b 100644 > --- a/fs/btrfs/zlib.c > +++ b/fs/btrfs/zlib.c > @@ -191,7 +191,8 @@ static int zlib_compress_pages(struct list_head > *ws, goto out; > } > > - if (workspace->strm.total_out >= workspace->strm.total_in) { > + /* Compression must save at least one PAGE_SIZE */ > + if (workspace->strm.total_out + PAGE_SIZE >= > workspace->strm.total_in) { ret = -E2BIG; Same as above... > goto out; > } > -- > 2.13.0 -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
Am Fri, 5 May 2017 08:43:23 -0700 schrieb Marc MERLIN <m...@merlins.org>: [missing quote of the command] > > Corrupted blocks are corrupted, that command is just trying to > > corrupt it again. > > It won't do the black magic to adjust tree blocks to avoid them. > > I see. you may hve seen the earlier message from Kai Krakow who was > able to to recover his FS by trying this trick, but I understand it > can't work in all cases. Huh, what trick? I don't take credit for it... ;-) The corrupt-block trick must've been someone else... -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs/SSD
Am Tue, 16 May 2017 14:21:20 +0200 schrieb Tomasz Torcz <to...@pipebreaker.pl>: > On Tue, May 16, 2017 at 03:58:41AM +0200, Kai Krakow wrote: > > Am Mon, 15 May 2017 22:05:05 +0200 > > schrieb Tomasz Torcz <to...@pipebreaker.pl>: > > > [...] > > > > > > Let me add my 2 cents. bcache-writearound does not cache writes > > > on SSD, so there are less writes overall to flash. It is said > > > to prolong the life of the flash drive. > > > I've recently switched from bcache-writeback to > > > bcache-writearound, because my SSD caching drive is at the edge > > > of it's lifetime. I'm using bcache in following configuration: > > > http://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg My > > > SSD is Samsung SSD 850 EVO 120GB, which I bought exactly 2 years > > > ago. > > > > > > Now, according to > > > http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html > > > 120GB and 250GB warranty only covers 75 TBW (terabytes written). > > > > According to your chart, all your data is written twice to bcache. > > It may have been better to buy two drives, one per mirror. I don't > > think that SSD firmwares do deduplication - so data is really > > written twice. > > I'm aware of that, but 50 GB (I've got 100GB caching partition) > is still plenty to cache my ~, some media files, two small VMs. > On the other hand I don't want to overspend. This is just a home > server. > Nb. I'm still waiting for btrfs native SSD caching, which was > planned for 3.6 kernel 5 years ago :) > ( > https://oss.oracle.com/~mason/presentation/btrfs-jls-12/btrfs.html#/planned-3.6 > ) > > > > > > > > My > > > drive has # smartctl -a /dev/sda | grep LBA 241 > > > Total_LBAs_Written 0x0032 099 099 000Old_age > > > Always - 136025596053 > > > > Doesn't say this "99%" remaining? The threshold is far from being > > reached... > > > > I'm curious, what is Wear_Leveling_Count reporting? > > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE > UPDATED WHEN_FAILED RAW_VALUE 9 Power_On_Hours 0x0032 > 096 096 000Old_age Always - 18227 12 > Power_Cycle_Count 0x0032 099 099 000Old_age > Always - 29 177 Wear_Leveling_Count 0x0013 001 > 001 000Pre-fail Always - 4916 > > Is this 001 mean 1%? If so, SMART contradicts datasheets. And I > don't think I shoud see read errors for 1% wear. It more means 1% left, that is 99% wear... Most of these are counters from 100 down to zero, with THRESH being the threshold point below or at which it is considered failed or failing. Only a few values work the other way around (like temperature). Be careful with interpreting raw values: they may be very manufacturer specific and not normalized. According to Total_LBAs_Written, the manufacturer thinks the drive could still take 100x more (only 1% used). But your wear level is almost 100% (value = 001). I think that value isn't really designed around the flash cell lifetime, but intermediate components like caches. So you need to read most values "backwards": It's not a used counter, but a "what's left" counter. What does it tell you about reserved blocks usage? Note that it's sort of double negation here: value 100 used means 100% unused or 0% used... ;-) Or just insert a "minus" in front of those values and think of them counting up to zero. So on a time axis it's at -100% of the total lifetime scale and 0 is the fail point (or whatever "thresh" says). -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs/SSD
Am Mon, 15 May 2017 22:05:05 +0200 schrieb Tomasz Torcz <to...@pipebreaker.pl>: > On Mon, May 15, 2017 at 09:49:38PM +0200, Kai Krakow wrote: > > > > > It's worth noting also that on average, COW filesystems like BTRFS > > > (or log-structured-filesystems will not benefit as much as > > > traditional filesystems from SSD caching unless the caching is > > > built into the filesystem itself, since they don't do in-place > > > rewrites (so any new write by definition has to drop other data > > > from the cache). > > > > Yes, I considered that, too. And when I tried, there was almost no > > perceivable performance difference between bcache-writearound and > > bcache-writeback. But the latency of performance improvement was > > much longer in writearound mode, so I sticked to writeback mode. > > Also, writing random data is faster because bcache will defer it to > > background and do writeback in sector order. Sequential access is > > passed around bcache anyway, harddisks are already good at that. > > Let me add my 2 cents. bcache-writearound does not cache writes > on SSD, so there are less writes overall to flash. It is said > to prolong the life of the flash drive. > I've recently switched from bcache-writeback to bcache-writearound, > because my SSD caching drive is at the edge of it's lifetime. I'm > using bcache in following configuration: > http://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg My SSD > is Samsung SSD 850 EVO 120GB, which I bought exactly 2 years ago. > > Now, according to > http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html > 120GB and 250GB warranty only covers 75 TBW (terabytes written). According to your chart, all your data is written twice to bcache. It may have been better to buy two drives, one per mirror. I don't think that SSD firmwares do deduplication - so data is really written twice. They may do compression but that won't be streaming compression but per-block compression, so it won't help here as a deduplicator. Also, due to internal structure, compression would probably work similar to how zswap works: By combining compressed blocks into "buddy blocks", so only compression above 2:1 will merge compressed blocks into single blocks. For most of your data, this won't be true. So effectively, this has no overall effect. For this reason, I doubt that any firmware takes the chance for compression, effects are just too low vs. the management overhead and complexity that adds to the already complicated FTL layer. > My > drive has # smartctl -a /dev/sda | grep LBA 241 > Total_LBAs_Written 0x0032 099 099 000Old_age > Always - 136025596053 Doesn't say this "99%" remaining? The threshold is far from being reached... I'm curious, what is Wear_Leveling_Count reporting? > which multiplied by 512 bytes gives 69.6 TB. Close to 75TB? Well… > > [35354.697513] sd 0:0:0:0: [sda] tag#19 FAILED Result: > hostbyte=DID_OK driverbyte=DRIVER_SENSE [35354.697516] sd 0:0:0:0: > [sda] tag#19 Sense Key : Medium Error [current] [35354.697518] sd > 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto > reallocate failed [35354.697522] sd 0:0:0:0: [sda] tag#19 CDB: > Read(10) 28 00 0c 30 82 9f 00 00 48 00 [35354.697524] > blk_update_request: I/O error, dev sda, sector 204505785 > > Above started appearing recently. So, I was really suprised that: > - this drive is only rated for 120 TBW > - I went through this limit in only 2 years > > The workload is lightly utilised home server / media center. I think, bcache is a real SSD killer for drives around 120GB size or below... I had similar life usage with my previous small SSD just after one year. But I never had a sense error because I took it out of service early. And I switched to writearound, too. I think the write-pattern of bcache cannot be handled well by the FTL. It behaves like a log-structured file system, with new writes only appended, and sometimes a garbage collection is done by freeing complete erase blocks. Maybe it could work better, if btrfs could pass information about freed blocks down to bcache. Btrfs has a lot of these due to COW nature. I wonder if this would already be supported if turning on discard in btrfs? Does anyone know? -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs/SSD
Am Mon, 15 May 2017 08:03:48 -0400 schrieb "Austin S. Hemmelgarn": > > That's why I don't trust any of my data to them. But I still want > > the benefit of their speed. So I use SSDs mostly as frontend caches > > to HDDs. This gives me big storage with fast access. Indeed, I'm > > using bcache successfully for this. A warm cache is almost as fast > > as native SSD (at least it feels almost that fast, it will be > > slower if you threw benchmarks at it). > That's to be expected though, most benchmarks don't replicate actual > usage patterns for client systems, and using SSD's for caching with > bcache or dm-cache for most server workloads except a file server > will usually get you a performance hit. You mean "performance boost"? Almost every read-most server workload should benefit... I file server may be the exact opposite... Also, I think dm-cache and bcache work very differently and are not directly comparable. Their benefit depends much on the applied workload. If I remember right, dm-cache is more about keeping "hot data" in the flash storage while bcache is more about reducing seeking. So dm-cache optimizes for bigger throughput of SSDs while bcache optimizes for almost-zero seek overhead of SSDs. Depending on your underlying storage, one or the other may even give zero benefit or worsen performance. Which is what I'd call a "performance hit"... I didn't ever try dm-cache, tho. For reasons I don't remember exactly, I didn't like something about how it's implemented, I think it was related to crash recovery. I don't know if that still holds true with modern kernels. It may have changed but I never looked back to revise that decision. > It's worth noting also that on average, COW filesystems like BTRFS > (or log-structured-filesystems will not benefit as much as > traditional filesystems from SSD caching unless the caching is built > into the filesystem itself, since they don't do in-place rewrites (so > any new write by definition has to drop other data from the cache). Yes, I considered that, too. And when I tried, there was almost no perceivable performance difference between bcache-writearound and bcache-writeback. But the latency of performance improvement was much longer in writearound mode, so I sticked to writeback mode. Also, writing random data is faster because bcache will defer it to background and do writeback in sector order. Sequential access is passed around bcache anyway, harddisks are already good at that. But of course, the COW nature of btrfs will lower the hit rate I can on writes. That's why I see no benefit in using bcache-writethrough with btrfs. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs/SSD
Am Mon, 15 May 2017 07:46:01 -0400 schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>: > On 2017-05-12 14:27, Kai Krakow wrote: > > Am Tue, 18 Apr 2017 15:02:42 +0200 > > schrieb Imran Geriskovan <imran.gerisko...@gmail.com>: > > > >> On 4/17/17, Austin S. Hemmelgarn <ahferro...@gmail.com> wrote: > [...] > >> > >> I'm trying to have a proper understanding of what "fragmentation" > >> really means for an ssd and interrelation with wear-leveling. > >> > >> Before continuing lets remember: > >> Pages cannot be erased individually, only whole blocks can be > >> erased. The size of a NAND-flash page size can vary, and most > >> drive have pages of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have > >> blocks of 128 or 256 pages, which means that the size of a block > >> can vary between 256 KB and 4 MB. > >> codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/ > >> > >> Lets continue: > >> Since block sizes are between 256k-4MB, data smaller than this will > >> "probably" will not be fragmented in a reasonably empty and trimmed > >> drive. And for a brand new ssd we may speak of contiguous series > >> of blocks. > >> > >> However, as drive is used more and more and as wear leveling > >> kicking in (ie. blocks are remapped) the meaning of "contiguous > >> blocks" will erode. So any file bigger than a block size will be > >> written to blocks physically apart no matter what their block > >> addresses says. But my guess is that accessing device blocks > >> -contiguous or not- are constant time operations. So it would not > >> contribute performance issues. Right? Comments? > >> > >> So your the feeling about fragmentation/performance is probably > >> related with if the file is spread into less or more blocks. If # > >> of blocks used is higher than necessary (ie. no empty blocks can be > >> found. Instead lots of partially empty blocks have to be used > >> increasing the total # of blocks involved) then we will notice > >> performance loss. > >> > >> Additionally if the filesystem will gonna try something to reduce > >> the fragmentation for the blocks, it should precisely know where > >> those blocks are located. Then how about ssd block informations? > >> Are they available and do filesystems use it? > >> > >> Anyway if you can provide some more details about your experiences > >> on this we can probably have better view on the issue. > > > > What you really want for SSD is not defragmented files but > > defragmented free space. That increases life time. > > > > So, defragmentation on SSD makes sense if it cares more about free > > space but not file data itself. > > > > But of course, over time, fragmentation of file data (be it meta > > data or content data) may introduce overhead - and in btrfs it > > probably really makes a difference if I scan through some of the > > past posts. > > > > I don't think it is important for the file system to know where the > > SSD FTL located a data block. It's just important to keep > > everything nicely aligned with erase block sizes, reduce rewrite > > patterns, and free up complete erase blocks as good as possible. > > > > Maybe such a process should be called "compaction" and not > > "defragmentation". In the end, the more continuous blocks of free > > space there are, the better the chance for proper wear leveling. > > There is one other thing to consider though. From a practical > perspective, performance on an SSD is a function of the number of > requests and what else is happening in the background. The second > aspect isn't easy to eliminate on most systems, but the first is > pretty easy to mitigate by defragmenting data. > > Reiterating the example I made elsewhere in the thread: > Assume you have an SSD and storage controller that can use DMA to > transfer up to 16MB of data off of the disk in a single operation. > If you need to load a 16MB file off of this disk and it's properly > aligned (it usually will be with most modern filesystems if the > partition is properly aligned) and defragmented, it will take exactly > one operation (assuming that doesn't get interrupted). By contrast, > if you have 16 fragments of 1MB each, that will take at minimum 2 > operations, and more likely 15-16 (depends on where everything is > on-disk, and how smart the driver is about minimizing the
Re: Btrfs/SSD
Am Mon, 15 May 2017 14:09:20 +0100 schrieb Tomasz Kusmierz: > > Traditional hard drives usually do this too these days (they've > > been under-provisioned since before SSD's existed), which is part > > of why older disks tend to be noisier and slower (the reserved > > space is usually at the far inside or outside of the platter, so > > using sectors from there to replace stuff leads to long seeks). > > Not true. When HDD uses 10% (10% is just for easy example) of space > as spare than aligment on disk is (US - used sector, SS - spare > sector, BS - bad sector) > > US US US US US US US US US SS > US US US US US US US US US SS > US US US US US US US US US SS > US US US US US US US US US SS > US US US US US US US US US SS > US US US US US US US US US SS > US US US US US US US US US SS > > if failure occurs - drive actually shifts sectors up: > > US US US US US US US US US SS > US US US BS BS BS US US US US > US US US US US US US US US US > US US US US US US US US US US > US US US US US US US US US SS > US US US BS US US US US US US > US US US US US US US US US SS > US US US US US US US US US SS This makes sense... Reserve area somehow implies it is continuous and as such located at one far end of the platter. But your image totally makes sense. > that strategy is in place to actually mitigate the problem that > you’ve described, actually it was in place since drives were using > PATA :) so if your drive get’s nosier over time it’s either a broken > bearing or demagnetised arm magnet causing it to not aim propperly - > so drive have to readjust position multiple times before hitting a > right track -- To unsubscribe from this list: send the line > "unsubscribe linux-btrfs" in the body of a message to > majord...@vger.kernel.org More majordomo info at > http://vger.kernel.org/majordomo-info.html I can confirm that such drives usually do not get noisier except there's something broken other than just a few sectors. And faulty bearing in notebook drives is the most often scenario I see. I always recommend to replace such drives early because they will usually fail completely. Such notebooks are good candidates for SSD replacements btw. ;-) The demagnetised arm magnet is an interesting error scenario - didn't think of it. Thanks for the pointer. But still, there's one noise you can easily identify as bad sectors: When the drive starts clicking for 30 or more seconds while trying to read data, and usually also freezes the OS during that time. Such drives can be "repaired" by rewriting the offending sectors (because it will be moved to reserve area then). But I guess it's best to already replace such a drive by that time. Early, back in PATA times, I often had harddisks exposing seemingly bad sectors when power was cut while the drive was writing data. I usually used dd to rewrite such sectors and the drive was good as new again - except I lost some file data maybe. Luckily, modern drives don't show such behavior. And also SSDs learned to handle this... -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: balancing every night broke balancing so now I can't balance anymore?
Am Sun, 14 May 2017 22:57:26 +0200 schrieb Lionel Bouton: > I've coded one Ruby script which tries to balance between the cost of > reallocating group and the need for it. The basic idea is that it > tries to keep the proportion of free space "wasted" by being allocated > although it isn't used below a threshold. It will bring this > proportion down enough through balance that minor reallocation won't > trigger a new balance right away. It should handle pathological > conditions as well as possible and it won't spend more than 2 hours > working on a single filesystem by default. We deploy this as a daily > cron script through Puppet on all our systems and it works very well > (I didn't have to use balance manually to manage free space since we > did that). Note that by default it sleeps a random amount of time to > avoid IO spikes on VMs running on the same host. You can either edit > it or pass it "0" which will be used for the max amount of time to > sleep bypassing this precaution. > > Here is the latest version : https://pastebin.com/Rrw1GLtx > Given its current size, I should probably push it on github... Yes, please... ;-) > I've seen other maintenance scripts mentioned on this list so you > might something simpler or more targeted to your needs by browsing > through the list's history. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: balancing every night broke balancing so now I can't balance anymore?
Am Sun, 14 May 2017 13:15:09 -0700 schrieb Marc MERLIN: > On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote: > > On 05/13/2017 10:54 PM, Marc MERLIN wrote: > > > Kernel 4.11, btrfs-progs v4.7.3 > > > > > > I run scrub and balance every night, been doing this for 1.5 > > > years on this filesystem. > > > > What are the exact commands you run every day? > > http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html > (at the bottom) > every night: > 1) scrub > 2) balance -musage=0 > 3) balance -musage=20 > 4) balance -dusage=0 > 5) balance -dusage=20 > > > > How did I get into such a misbalanced state when I balance every > > > night? > > > > I don't know, since I don't know what you do exactly. :) > > Now you do :) > > > > My filesystem is not full, I can write just fine, but I sure > > > cannot rebalance now. > > > > Yes, because you have quite some allocated but unused space. If > > btrfs cannot just allocate more chunks, it starts trying a bit > > harder to reuse all the empty spots in the already existing > > chunks. > > Ok. shouldn't balance fix problems just like this? > I have 60GB-ish free, or in this case that's also >25%, that's a lot > > Speaking of unallocated, I have more now: > Device unallocated:993.00MiB > > This kind of just magically fixed itself during snapshot rotation and > deletion I think. > Sure enough, balance works again, but this feels pretty fragile. > Looking again: > Device size: 228.67GiB > Device allocated: 227.70GiB > Device unallocated:993.00MiB > Free (estimated): 58.53GiB (min: 58.53GiB) > > You're saying that I need unallocated space for new chunks to be > created, which is required by balance. > Should btrfs not take care of keeping some space for me? > Shoudln't a nigthly balance, which I'm already doing, help even more > with this? > > > > Besides adding another device to add space, is there a way around > > > this and more generally not getting into that state anymore > > > considering that I already rebalance every night? > > > > Add monitoring and alerting on the amount of unallocated space. > > > > FWIW, this is what I use for that purpose: > > > > https://packages.debian.org/sid/munin-plugins-btrfs > > https://packages.debian.org/sid/monitoring-plugins-btrfs > > > > And, of course the btrfs-heatmap program keeps being a fun tool to > > create visual timelapses of your filesystem, so you can learn how > > your usage pattern is resulting in allocation of space by btrfs, > > and so that you can visually see what the effect of your btrfs > > balance attempts is: > > That's interesting, but ultimately, users shoudln't have to > micromanage their filesystem to that level, even btrfs. > > a) What is wrong in my nightly script that I should fix/improve? You may want to try https://www.spinics.net/lists/linux-btrfs/msg52076.html > b) How do I recover from my current state? That script may work it's way through. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[OT] SSD performance patterns (was: Btrfs/SSD)
Am Sat, 13 May 2017 09:39:39 + (UTC) schrieb Duncan <1i5t5.dun...@cox.net>: > Kai Krakow posted on Fri, 12 May 2017 20:27:56 +0200 as excerpted: > > > In the end, the more continuous blocks of free space there are, the > > better the chance for proper wear leveling. > > Talking about which... > > When I was doing my ssd research the first time around, the going > recommendation was to keep 20-33% of the total space on the ssd > entirely unallocated, allowing it to use that space as an FTL > erase-block management pool. > > At the time, I added up all my "performance matters" data dirs and > allowing for reasonable in-filesystem free-space, decided I could fit > it in 64 GB if I had to, tho 80 GB would be a more comfortable fit, > so allowing for the above entirely unpartitioned/unused slackspace > recommendations, had a target of 120-128 GB, with a reasonable range > depending on actual availability of 100-160 GB. > > It turned out, due to pricing and availability, I ended up spending > somewhat more and getting 256 GB (238.5 GiB). Of course that allowed > me much more flexibility than I had expected and I ended up with > basically everything but the media partition on the ssds, PLUS I > still left them at only just over 50% partitioned, (using the gdisk > figures, 51%- partitioned, 49%+ free). I put by ESP (for UEFI) onto the SSD and also played with putting swap onto it dedicated to hibernation. But I discarded the hibernation idea and removed the swap because it didn't work well: It wasn't much faster then waking from HDD, and hibernation is not that reliable anyways. Also, hybrid hibernation is not yet integrated into KDE so I stick to sleep mode currently. The rest of my SSD (also 500GB) is dedicated to bcache. This fits my complete work set of daily work with hit ratios going up to 90% and beyond. My filesystem boots and feels like SSD, the HDDs are almost silent and still my file system is 3TB on 3x 1TB HDD. > Given that, I've not enabled btrfs trim/discard (which saved me from > the bugs with it a few kernel cycles ago), and while I do have a > weekly fstrim systemd timer setup, I've not had to be too concerned > about btrfs bugs (also now fixed, I believe) when fstrim on btrfs was > known not to be trimming everything it really should have been. This is a good recommendation as TRIM is still a slow operation because Queued TRIM is not used for most drives due to buggy firmware. So you not only circumvent kernel and firmware bugs, but also get better performance that way. > Anyway, that 20-33% left entirely unallocated/unpartitioned > recommendation still holds, right? Am I correct in asserting that if > one is following that, the FTL already has plenty of erase-blocks > available for management and the discussion about filesystem level > trim and free space management becomes much less urgent, tho of > course it's still worth considering if it's convenient to do so? > > And am I also correct in believing that while it's not really worth > spending more to over-provision to the near 50% as I ended up doing, > if things work out that way as they did with me because the > difference in price between 30% overprovisioning and 50% > overprovisioning ends up being trivial, there's really not much need > to worry about active filesystem trim at all, because the FTL has > effectively half the device left to play erase-block musical chairs > with as it decides it needs to? I think, things may have changed since long ago. See below. But it certainly depends on which drive manufacturer you chose, I guess. I can at least confirm that bigger drives wear their write cycles much slower, even when filled up. My old 128MB Crucial drive was worn after only 1 year (I swapped it early, I kept an eye on SMART numbers). My 500GB Samsung drive is around 1 year old now, I do write a lot more data to it, but according to SMART it should work for at least 5 to 7 more years. By that time, I probably already swapped it for a bigger drive. So I guess you should maybe look at your SMART numbers and calculate the expected life time: Power_on_Hours(RAW) * WLC(VALUE) / (100-WLC(VALUE)) with WLC = Wear_Leveling_Count should get you the expected remaining power on hours. My drive is powered on 24/7 most of the time but if you power your drive only 8 hours per day, you can easily ramp up the life time by three times of days vs. me. ;-) There is also Total_LBAs_Written but that, at least for me, usually gives much higher lifetime values so I'd stick with the pessimistic ones. Even when WLC goes to zero, the drive should still have reserved blocks available. My drive sets the threshold to 0 for WLC which makes me think that it is not fatal when it hits 0 because the drive still has reserved blocks. And for reserved blocks, the threshold is 10%.
Re: Btrfs/SSD
Am Sat, 13 May 2017 14:52:47 +0500 schrieb Roman Mamedov <r...@romanrm.net>: > On Fri, 12 May 2017 20:36:44 +0200 > Kai Krakow <hurikha...@gmail.com> wrote: > > > My concern is with fail scenarios of some SSDs which die unexpected > > and horribly. I found some reports of older Samsung SSDs which > > failed suddenly and unexpected, and in a way that the drive > > completely died: No more data access, everything gone. HDDs start > > with bad sectors and there's a good chance I can recover most of > > the data except a few sectors. > > Just have your backups up-to-date, doesn't matter if it's SSD, HDD or > any sort of RAID. > > In a way it's even better, that SSDs [are said to] fail abruptly and > entirely. You can then just restore from backups and go on. Whereas a > failing HDD can leave you puzzled on e.g. whether it's a cable or > controller problem instead, and possibly can even cause some data > corruption which you won't notice until too late. My current backup strategy can handle this. I never backup files from the source again if it didn't change by timestamp. That way, silent data corruption won't creep into the backup. Additionally, I keep a backlog of 5 years of file history. Even if a corrupted file creeps into the backup, there is enough time to get a good copy back. If it's older, it probably doesn't hurt so much anyway. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs/SSD
Am Fri, 12 May 2017 15:02:20 +0200 schrieb Imran Geriskovan: > On 5/12/17, Duncan <1i5t5.dun...@cox.net> wrote: > > FWIW, I'm in the market for SSDs ATM, and remembered this from a > > couple weeks ago so went back to find it. Thanks. =:^) > > > > (I'm currently still on quarter-TB generation ssds, plus spinning > > rust for the larger media partition and backups, and want to be rid > > of the spinning rust, so am looking at half-TB to TB, which seems > > to be the pricing sweet spot these days anyway.) > > Since you are taking ssds to mainstream based on your experience, > I guess your perception of data retension/reliability is better than > that of spinning rust. Right? Can you eloborate? > > Or an other criteria might be physical constraints of spinning rust > on notebooks which dictates that you should handle the device > with care when running. > > What was your primary motivation other than performance? Personally, I don't really trust SSDs so much. They are much more robust when it comes to physical damage because there are no physical parts. That's absolutely not my concern. Regarding this, I trust SSDs better than HDDs. My concern is with fail scenarios of some SSDs which die unexpected and horribly. I found some reports of older Samsung SSDs which failed suddenly and unexpected, and in a way that the drive completely died: No more data access, everything gone. HDDs start with bad sectors and there's a good chance I can recover most of the data except a few sectors. When SSD blocks die, they are probably huge compared to a sector (256kB to 4MB usually because that's erase block sizes). If this happens, the firmware may decide to either allow read-only access or completely deny access. There's another situation where dying storage chips may completely mess up the firmware and there's no longer any access to data. That's why I don't trust any of my data to them. But I still want the benefit of their speed. So I use SSDs mostly as frontend caches to HDDs. This gives me big storage with fast access. Indeed, I'm using bcache successfully for this. A warm cache is almost as fast as native SSD (at least it feels almost that fast, it will be slower if you threw benchmarks at it). -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs/SSD
Am Tue, 18 Apr 2017 15:02:42 +0200 schrieb Imran Geriskovan: > On 4/17/17, Austin S. Hemmelgarn wrote: > > Regarding BTRFS specifically: > > * Given my recently newfound understanding of what the 'ssd' mount > > option actually does, I'm inclined to recommend that people who are > > using high-end SSD's _NOT_ use it as it will heavily increase > > fragmentation and will likely have near zero impact on actual device > > lifetime (but may _hurt_ performance). It will still probably help > > with mid and low-end SSD's. > > I'm trying to have a proper understanding of what "fragmentation" > really means for an ssd and interrelation with wear-leveling. > > Before continuing lets remember: > Pages cannot be erased individually, only whole blocks can be erased. > The size of a NAND-flash page size can vary, and most drive have pages > of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256 > pages, which means that the size of a block can vary between 256 KB > and 4 MB. > codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/ > > Lets continue: > Since block sizes are between 256k-4MB, data smaller than this will > "probably" will not be fragmented in a reasonably empty and trimmed > drive. And for a brand new ssd we may speak of contiguous series > of blocks. > > However, as drive is used more and more and as wear leveling kicking > in (ie. blocks are remapped) the meaning of "contiguous blocks" will > erode. So any file bigger than a block size will be written to blocks > physically apart no matter what their block addresses says. But my > guess is that accessing device blocks -contiguous or not- are > constant time operations. So it would not contribute performance > issues. Right? Comments? > > So your the feeling about fragmentation/performance is probably > related with if the file is spread into less or more blocks. If # of > blocks used is higher than necessary (ie. no empty blocks can be > found. Instead lots of partially empty blocks have to be used > increasing the total # of blocks involved) then we will notice > performance loss. > > Additionally if the filesystem will gonna try something to reduce > the fragmentation for the blocks, it should precisely know where > those blocks are located. Then how about ssd block informations? > Are they available and do filesystems use it? > > Anyway if you can provide some more details about your experiences > on this we can probably have better view on the issue. What you really want for SSD is not defragmented files but defragmented free space. That increases life time. So, defragmentation on SSD makes sense if it cares more about free space but not file data itself. But of course, over time, fragmentation of file data (be it meta data or content data) may introduce overhead - and in btrfs it probably really makes a difference if I scan through some of the past posts. I don't think it is important for the file system to know where the SSD FTL located a data block. It's just important to keep everything nicely aligned with erase block sizes, reduce rewrite patterns, and free up complete erase blocks as good as possible. Maybe such a process should be called "compaction" and not "defragmentation". In the end, the more continuous blocks of free space there are, the better the chance for proper wear leveling. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfsck lowmem mode shows corruptions
Am Fri, 5 May 2017 08:55:10 +0800 schrieb Qu Wenruo <quwen...@cn.fujitsu.com>: > At 05/05/2017 01:29 AM, Kai Krakow wrote: > > Hello! > > > > Since I saw a few kernel freezes lately (due to experimenting with > > ck-sources) including some filesystem-related backtraces, I booted > > my rescue system to check my btrfs filesystem. > > > > Luckily, it showed no problems. It said, everything's fine. But I > > also thought: Okay, let's try lowmem mode. And that showed a > > frightening long list of extent corruptions und unreferenced > > chunks. Should I worry? > > Thanks for trying lowmem mode. > > Would you please provide the version of btrfs-progs? Sorry... I realized it myself the moment I hit the "send" button. Here it is: # btrfs version btrfs-progs v4.10.2 # uname -a Linux jupiter 4.10.13-ck #2 SMP PREEMPT Thu May 4 23:44:09 CEST 2017 x86_64 Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz GenuineIntel GNU/Linux > IIRC "ERROR: data extent[96316809216 2097152] backref lost" bug has > been fixed in recent release. Is there a patch I could apply? > And for reference, would you please provide the tree dump of your > chunk and device tree? > > This can be done by running: > # btrfs-debug-tree -t device > # btrfs-debug-tree -t chunk I'll attach those... I'd like to note that between the OP and these dumps, I scrubbed and rebalanced the hole device. I think that would scramble up some numbers. Also I took the dumps while the fs was online. If you want me to do clean dumps of the offline device without intermediate fs processing, let me know. Thanks, Kai > And this 2 dump only contains the btrfs chunk mapping info, so > nothing sensitive is contained. > > Thanks, > Qu > > > > PS: The freezes seem to be related to bfq, switching to deadline > > solved these. > > > > Full log attached, here's an excerpt: > > > > ---8<--- > > > > checking extents > > ERROR: chunk[256 4324327424) stripe 0 did not find the related dev > > extent ERROR: chunk[256 4324327424) stripe 1 did not find the > > related dev extent ERROR: chunk[256 4324327424) stripe 2 did not > > find the related dev extent ERROR: chunk[256 7545552896) stripe 0 > > did not find the related dev extent ERROR: chunk[256 7545552896) > > stripe 1 did not find the related dev extent ERROR: chunk[256 > > 7545552896) stripe 2 did not find the related dev extent [...] > > ERROR: device extent[1, 1094713344, 1073741824] did not find the > > related chunk ERROR: device extent[1, 2168455168, 1073741824] did > > not find the related chunk ERROR: device extent[1, 3242196992, > > 1073741824] did not find the related chunk [...] > > ERROR: device extent[2, 608854605824, 1073741824] did not find the > > related chunk ERROR: device extent[2, 609928347648, 1073741824] did > > not find the related chunk ERROR: device extent[2, 611002089472, > > 1073741824] did not find the related chunk [...] > > ERROR: device extent[3, 64433946624, 1073741824] did not find the > > related chunk ERROR: device extent[3, 65507688448, 1073741824] did > > not find the related chunk ERROR: device extent[3, 66581430272, > > 1073741824] did not find the related chunk [...] > > ERROR: data extent[96316809216 2097152] backref lost > > ERROR: data extent[96316809216 2097152] backref lost > > ERROR: data extent[96316809216 2097152] backref lost > > ERROR: data extent[686074396672 13737984] backref lost > > ERROR: data extent[686074396672 13737984] backref lost > > ERROR: data extent[686074396672 13737984] backref lost > > [...] > > ERROR: errors found in extent allocation tree or chunk allocation > > checking free space cache > > checking fs roots > > ERROR: errors found in fs roots > > Checking filesystem on /dev/disk/by-label/system > > UUID: bc201ce5-8f2b-4263-995a-6641e89d4c88 > > found 1960075935744 bytes used, error(s) found > > total csum bytes: 1673537040 > > total tree bytes: 4899094528 > > total fs tree bytes: 2793914368 > > total extent tree bytes: 190398464 > > btree space waste bytes: 871743708 > > file data blocks allocated: 6907169177600 > > referenced 1979268648960 > > > > > -- > To unsubscribe from this list: send the line "unsubscribe > linux-btrfs" in the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Regards, Kai Replies to list-only preferred. chunk-tree.txt.gz Description: application/gzip device-tree.txt.gz Description: application/gzip
btrfsck lowmem mode shows corruptions
Hello! Since I saw a few kernel freezes lately (due to experimenting with ck-sources) including some filesystem-related backtraces, I booted my rescue system to check my btrfs filesystem. Luckily, it showed no problems. It said, everything's fine. But I also thought: Okay, let's try lowmem mode. And that showed a frightening long list of extent corruptions und unreferenced chunks. Should I worry? PS: The freezes seem to be related to bfq, switching to deadline solved these. Full log attached, here's an excerpt: ---8<--- checking extents ERROR: chunk[256 4324327424) stripe 0 did not find the related dev extent ERROR: chunk[256 4324327424) stripe 1 did not find the related dev extent ERROR: chunk[256 4324327424) stripe 2 did not find the related dev extent ERROR: chunk[256 7545552896) stripe 0 did not find the related dev extent ERROR: chunk[256 7545552896) stripe 1 did not find the related dev extent ERROR: chunk[256 7545552896) stripe 2 did not find the related dev extent [...] ERROR: device extent[1, 1094713344, 1073741824] did not find the related chunk ERROR: device extent[1, 2168455168, 1073741824] did not find the related chunk ERROR: device extent[1, 3242196992, 1073741824] did not find the related chunk [...] ERROR: device extent[2, 608854605824, 1073741824] did not find the related chunk ERROR: device extent[2, 609928347648, 1073741824] did not find the related chunk ERROR: device extent[2, 611002089472, 1073741824] did not find the related chunk [...] ERROR: device extent[3, 64433946624, 1073741824] did not find the related chunk ERROR: device extent[3, 65507688448, 1073741824] did not find the related chunk ERROR: device extent[3, 66581430272, 1073741824] did not find the related chunk [...] ERROR: data extent[96316809216 2097152] backref lost ERROR: data extent[96316809216 2097152] backref lost ERROR: data extent[96316809216 2097152] backref lost ERROR: data extent[686074396672 13737984] backref lost ERROR: data extent[686074396672 13737984] backref lost ERROR: data extent[686074396672 13737984] backref lost [...] ERROR: errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots ERROR: errors found in fs roots Checking filesystem on /dev/disk/by-label/system UUID: bc201ce5-8f2b-4263-995a-6641e89d4c88 found 1960075935744 bytes used, error(s) found total csum bytes: 1673537040 total tree bytes: 4899094528 total fs tree bytes: 2793914368 total extent tree bytes: 190398464 btree space waste bytes: 871743708 file data blocks allocated: 6907169177600 referenced 1979268648960 -- Regards, Kai Replies to list-only preferred. lowmem.txt.gz Description: application/gzip
Re: Can I see what device was used to mount btrfs?
Am Tue, 2 May 2017 21:50:19 +0200 schrieb Goffredo Baroncelli: > On 2017-05-02 20:49, Adam Borowski wrote: > >> It could be some daemon that waits for btrfs to become complete. > >> Do we have something? > > Such a daemon would also have to read the chunk tree. > > I don't think that a daemon is necessary. As proof of concept, in the > past I developed a mount helper [1] which handled the mount of a > btrfs filesystem: this handler first checks if the filesystem is a > multivolume devices, if so it waits that all the devices are > appeared. Finally mount the filesystem. > > > It's not so simple -- such a btrfs device would have THREE states: > > > > 1. not mountable yet (multi-device with not enough disks present) > > 2. mountable ro / rw-degraded > > 3. healthy > > My mount.btrfs could be "programmed" to wait a timeout, then it > mounts the filesystem as degraded if not all devices are present. > This is a very simple strategy, but this could be expanded. > > I am inclined to think that the current approach doesn't fit well the > btrfs requirements. The roles and responsibilities are spread to too > much layer (udev, systemd, mount)... I hoped that my helper could be > adopted in order to concentrate all the responsibility to only one > binary; this would reduce the interface number with the other > subsystem (eg systemd, udev). > > For example, it would be possible to implement a sane check that > prevent to mount a btrfs filesystem if two devices exposes the same > UUID... Ideally, the btrfs wouldn't even appear in /dev until it was assembled by udev. But apparently that's not the case, and I think this is where the problems come from. I wish, btrfs would not show up as device nodes in /dev that the mount command identified as btrfs. Instead, btrfs would expose (probably through udev) a device node in /dev/btrfs/fs_identifier when it is ready. Apparently, the core problem of how to handle degraded btrfs still remains. Maybe it could be solved by adding more stages of btrfs nodes, like /dev/btrfs-incomplete (for unusable btrfs), /dev/btrfs-degraded (for btrfs still missing devices but at least one stripe of btrfs raid available) and /dev/btrfs as the final stage. That way, a mount process could wait for a while, and if the device doesn't appear, it tries the degraded stage instead. If the fs is opened from the degraded dev node stage, udev (or other processes) that scan for devices should stop assembling the fs if they still do so. bcache has a similar approach by hiding an fs within a protective superblock. Unless bcache is setup, the fs won't show up in /dev, and that fs won't be visible by other means. Btrfs should do something similar and only show a single device node if assembled completely. The component devices would have superblocks ignored by mount, and only the final node would expose a virtual superblock and the compound device after it. Of course, this makes things like compound device resizing more complicated maybe even impossible. If I'm not totally wrong, I think this is also how zfs exposes its pools. You need user space tools to make the fs pools visible in the tree. If zfs is incomplete, there's nothing to mount, and thus no race condition. But I never tried zfs seriously, so I do not know. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
Am Mon, 1 May 2017 22:56:06 -0600 schrieb Chris Murphy: > On Mon, May 1, 2017 at 9:23 PM, Marc MERLIN wrote: > > Hi Chris, > > > > Thanks for the reply, much appreciated. > > > > On Mon, May 01, 2017 at 07:50:22PM -0600, Chris Murphy wrote: > >> What about btfs check (no repair), without and then also with > >> --mode=lowmem? > >> > >> In theory I like the idea of a 24 hour rollback; but in normal > >> usage Btrfs will eventually free up space containing stale and no > >> longer necessary metadata. Like the chunk tree, it's always > >> changing, so you get to a point, even with snapshots, that the old > >> state of that tree is just - gone. A snapshot of an fs tree does > >> not make the chunk tree frozen in time. > > > > Right, of course, I was being way over optimistic here. I kind of > > forgot that metadata wasn't COW, my bad. > > Well it is COW. But there's more to the file system than fs trees, and > just because an fs tree gets snapshot doesn't mean all data is > snapshot. So whether snapshot or not, there's metadata that becomes > obsolete as the file system is updated and those areas get freed up > and eventually overwritten. > > > > > >> In any case, it's a big problem in my mind if no existing tools can > >> fix a file system of this size. So before making anymore changes, > >> make sure you have a btrfs-image somewhere, even if it's huge. The > >> offline checker needs to be able to repair it, right now it's all > >> we have for such a case. > > > > The image will be huge, and take maybe 24H to make (last time it > > took some silly amount of time like that), and honestly I'm not > > sure how useful it'll be. > > Outside of the kernel crashing if I do a btrfs balance, and > > hopefully the crash report I gave is good enough, the state I'm in > > is not btrfs' fault. > > > > If I can't roll back to a reasonably working state, with data loss > > of a known quantity that I can recover from backup, I'll have to > > destroy and filesystem and recover from scratch, which will take > > multiple days. Since I can't wait too long before getting back to a > > working state, I think I'm going to try btrfs check --repair after > > a scrub to get a list of all the pathanmes/inodes that are known to > > be damaged, and work from there. > > Sounds reasonable? > > Yes. > > > > > > Also, how is --mode=lowmem being useful? > > Testing. lowmem is a different implementation, so it might find > different things from the regular check. > > > > > > And for re-parenting a sub-subvolume, is that possible? > > (I want to delete /sub1/ but I can't because I have /sub1/sub2 > > that's also a subvolume and I'm not sure how to re-parent sub2 to > > somewhere else so that I can subvolume delete sub1) > > Well you can move sub2 out of sub1 just like a directory and then > delete sub1. If it's read-only it can't be moved, but you can use > btrfs property get/set ro true/false to temporarily make it not > read-only, move it, then make it read-only again, and it's still fine > to use with btrfs send receive. > > > > > > > > > In the meantime, a simple check without repair looks like this. It > > will likely take many hours to complete: > > gargamel:/var/local/space# btrfs check /dev/mapper/dshelf2 > > Checking filesystem on /dev/mapper/dshelf2 > > UUID: 03e9a50c-1ae6-4782-ab9c-5f310a98e653 > > checking extents > > checksum verify failed on 3096461459456 found 0E6B7980 wanted > > FBE5477A checksum verify failed on 3096461459456 found 0E6B7980 > > wanted FBE5477A checksum verify failed on 2899180224512 found > > 7A6D427F wanted 7E899EE5 checksum verify failed on 2899180224512 > > found 7A6D427F wanted 7E899EE5 checksum verify failed on > > 2899180224512 found ABBE39B0 wanted E0735D0E checksum verify failed > > on 2899180224512 found 7A6D427F wanted 7E899EE5 bytenr mismatch, > > want=2899180224512, have=3981076597540270796 checksum verify failed > > on 1449488023552 found CECC36AF wanted 199FE6C5 checksum verify > > failed on 1449488023552 found CECC36AF wanted 199FE6C5 checksum > > verify failed on 1449544613888 found 895D691B wanted A0C64D2B > > checksum verify failed on 1449544613888 found 895D691B wanted > > A0C64D2B parent transid verify failed on 1671538819072 wanted > > 293964 found 293902 parent transid verify failed on 1671538819072 > > wanted 293964 found 293902 checksum verify failed on 1671603781632 > > found 18BC28D6 wanted 372655A0 checksum verify failed on > > 1671603781632 found 18BC28D6 wanted 372655A0 checksum verify failed > > on 1759425052672 found 843B59F1 wanted F0FF7D00 checksum verify > > failed on 1759425052672 found 843B59F1 wanted F0FF7D00 checksum > > verify failed on 2182657212416 found CD8EFC0C wanted 70847071 > > checksum verify failed on 2182657212416 found CD8EFC0C wanted > > 70847071 checksum verify failed on 2898779357184 found 96395131 > > wanted 433D6E09 checksum verify failed on 2898779357184 found > > 96395131 wanted
Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?
Am Tue, 2 May 2017 05:01:02 + (UTC) schrieb Duncan <1i5t5.dun...@cox.net>: > Of course on-list I'm somewhat known for my arguments propounding the > notion that any filesystem that's too big to be practically > maintained (including time necessary to restore from backups, should > that be necessary for whatever reason) is... too big... and should > ideally be broken along logical and functional boundaries into a > number of individual smaller filesystems until such point as each one > is found to be practically maintainable within a reasonably practical > time frame. Don't put all the eggs in one basket, and when the bottom > of one of those baskets inevitably falls out, most of your eggs will > be safe in other baskets. =:^) Hehe... Yes, you're a fan of small filesystems. I'm more from the opposite camp, preferring one big filesystem to not mess around with size constraints of small filesystems fighting for the same volume space. It also gives such filesystems better chances for data locality of data staying in totally different parts across your fs mounts and can reduce head movement. Of course, much of this is not true if you use different devices per filesystem, or use SSDs, or SAN where you have no real control over the physical placement of image stripes anyway. But well... In an ideal world, subvolumes of btrfs would be totally independent of each other, just only share the same volume and dynamically allocating chunks of space from it. If one is broken, it is simply not usable and it should be destroyable. A garbage collector would grab the leftover chunks from the subvolume and free them, and you could recreate this subvolume from backup. In reality, shared extents will cross subvolume borders so it is probably not how things could work anytime in the near of far future. This idea is more like having thinly provisioned LVM volumes which allocate space as the filesystems on top need them, much like doing thinly provisioned images with a VM host system. The problem here is, unlike subvolumes, those chunks of space could never be given back to the host as it doesn't know if it is still in use. Of course, there's implementations available which allow thinning the images by passing through TRIM from the guest to the host (or by other means of communication channels between host and guest), but that is usually not giving good performance, if even supported. I tried once to exploit this in VirtualBox and hoped it would translate guest discards into hole punching requests on the host, and it's even documented to work that way... But (a) it was horrible slow, and (b) it was incredibly unstable to the point of being useless. OTOH, it's not announced as a stable feature and has to be enabled by manually editing the XML config files. But I still like the idea: Is it possible to make btrfs still work if one subvolume gets corrupted? Of course it should have ways of telling the user which other subvolumes are interconnected through shared extents so those would be also discarded upon corruption cleanup - at least if those extents couldn't be made any sense of any longer. Since corruption is an issue mostly of subvolumes being written to, snapshots should be mostly safe. Such a feature would also only make sense if btrfs had an online repair tool. BTW, are there plans for having an online repair tool in the future? Maybe one that only scans and fixes part of the filesystems (for obvious performance reasons, wrt Duncans idea of handling filesystems), i.e. those parts that the kernel discovered having corruptions? If I could then just delete and restore affected files, this would be even better than having independent subvolumes like above. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: backing up a collection of snapshot subvolumes
Am Tue, 25 Apr 2017 00:02:13 -0400 schrieb "J. Hart": > I have a remote machine with a filesystem for which I periodically > take incremental snapshots for historical reasons. These snapshots > are stored in an archival filesystem tree on a file server. Older > snapshots are removed and newer ones added on a rotational basis. I > need to be able to backup this archive by syncing it with a set of > backup drives. Due to the size, I need to back it up incrementally > rather than sending the entire content each time. Due to the > snapshot rotation, I need to be able to update the state of the > archive backup filesystem as a whole, in much the same manner that > rsync handles file trees. > > It seems that I cannot use "btrfs send", as the archive directory > contains the snapshots as subvolumes. > > I cannot use rsync as it treats subvolumes as simple directories, and > does not preserve subvolume attributes. Rsync also does not support > reflinks, so the snapshot directory content will no longer be > reflinked to other snapshots on the archive backup. I cannot use > hard links in the incrementals as hard links do not cross subvolume > boundaries. > > Thoughts anyone ? If this is for archival purpose only and storage efficiency and speed is your primary concern, try borgbackup. Borgbackup deploys its own deduplication and doesn't rely on btrfs snapshot capabilities. You can store your archives wherever you want (even on non-btrfs), and it won't store any data blocks twice. Upon restore, you could simply recreate the snapshot/subvolume from a similar tree and then rsync a tree restored from borgbackup back to this snapshot to get back a state with the benefits of btrfs snapshots. Borg also has an adapter for fuse to get a mounted view into the archives, tho, it is slow and probably takes a lot of RAM. But it is good enough for single file lookups and navigation. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: About free space fragmentation, metadata write amplification and (no)ssd
Am Tue, 11 Apr 2017 07:33:41 -0400 schrieb "Austin S. Hemmelgarn": > >> FWIW, it is possible to use a udev rule to change the rotational > >> flag from userspace. The kernel's selection algorithm for > >> determining is is somewhat sub-optimal (essentially, if it's not a > >> local disk that can be proven to be rotational, it assumes it's > >> non-rotational), so re-selecting this ends up being somewhat > >> important in certain cases (virtual machines for example). > > > > Just putting nossd in fstab seems convenient enough. > While that does work, there are other pieces of software that change > behavior based on the value of the rotational flag, and likewise make > misguided assumptions about what it means. Something similar happens when you put btrfs on bcache. It now assumes it is on SSD but in reality it isn't. Thus, I also deployed udev rules to force back nossd behavior. But maybe, in the bcache case using "nossd" instead would make more sense. Any ideas on this? -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs filesystem keeps allocating new chunks for no apparent reason
Am Mon, 10 Apr 2017 15:43:57 -0400 schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>: > On 2017-04-10 14:18, Kai Krakow wrote: > > Am Mon, 10 Apr 2017 13:13:39 -0400 > > schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>: > > > >> On 2017-04-10 12:54, Kai Krakow wrote: > [...] > [...] > >> [...] > >> [...] > [...] > >> [...] > >> [...] > [...] > [...] > >> The command-line also rejects a number of perfectly legitimate > >> arguments that BTRFS does understand too though, so that's not much > >> of a test. > > > > Which are those? I didn't encounter any... > I'm not sure there are any anymore, but I know that a handful (mostly > really uncommon ones) used to (and BTRFS is not alone in this > respect, some of the more esoteric ext4 options aren't accepted on > the kernel command-line either). I know at a minimum at some point > in the past alloc-start, check_int, and inode_cache did not work from > the kernel command-line. The post from Janos explains why: The difference is with the mount handler, depending on whether you use initrd or not. > >> I've just finished some quick testing though, and it looks > >> like you're right, BTRFS does not support this, which means I now > >> need to figure out what the hell was causing the IOPS counters in > >> collectd to change in rough correlation with remounting > >> (especially since it appears to happen mostly independent of the > >> options being changed). > > > > I think that noatime (which I remember you also used?), lazytime, > > and relatime are mutually exclusive: they all handle the inode > > updates. Maybe that is the effect you see? > They're not exactly exclusive. The lazytime option will prevent > changes to the mtime or atime fields in a file from forcing inode > write-out for up to 24 hours (if the inode would be written out for > some other reason (such as a file-size change or the inode being > evicted from the cache), then the timestamps will be too), but it > does not change the value of the timestamps. So if you have lazytime > enabled and use touch to update the mtime on anotherwise idle file, > the mtime will still be correct as far as userspace is concerned, as > long as you don't crash before the update hits the disk (but > userspace will only see the discrepancy _after_ the crash). Yes, I know all this. But I don't see why you still want noatime or relatime if you use lazytime, except for super-optimizing. Lazytime gives you POSIX conformity for a problem that the other options only tried to solve. > > Well, relatime is mostly the same thus not perfectly resembling the > > POSIX standard. I think the only software that relies on atime is > > mutt... > This very much depends on what you're doing. If you have a WORM > workload, then yeah, it's pretty much the same. If however you have > something like a database workload where a specific set of files get > internally rewritten regularly, then it actually has a measurable > impact. I think "impact" is a whole different story. I'm on your side here. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs filesystem keeps allocating new chunks for no apparent reason
Am Tue, 11 Apr 2017 01:45:32 +0200 schrieb "Janos Toth F.": > >> The command-line also rejects a number of perfectly legitimate > >> arguments that BTRFS does understand too though, so that's not much > >> of a test. > > > > Which are those? I didn't encounter any... > > I think this bug still stands unresolved (for 3+ years, probably > because most people use init-rd/fs without ever considering to omit it > in case they don't really need it at all): > Bug 61601 - rootflags=noatime causes kernel panic when booting > without initrd. The last time I tried it applied to Btrfs as well: > https://bugzilla.kernel.org/show_bug.cgi?id=61601#c18 Ah okay, so the difference is with the mount handler. I can only use initrd here because I have multi-device btrfs ontop of bcache as rootfs. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs filesystem keeps allocating new chunks for no apparent reason
Am Mon, 10 Apr 2017 13:13:39 -0400 schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>: > On 2017-04-10 12:54, Kai Krakow wrote: > > Am Mon, 10 Apr 2017 18:44:44 +0200 > > schrieb Kai Krakow <hurikha...@gmail.com>: > > > >> Am Mon, 10 Apr 2017 08:51:38 -0400 > >> schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>: > >> > [...] > [...] > >> [...] > [...] > [...] > >> > >> Did you put it in /etc/fstab only for the rootfs? If yes, it > >> probably has no effect. You would need to give it as rootflags on > >> the kernel cmdline. > > > > I did a "fgrep lazytime /usr/src/linux -ir" and it reveals only ext4 > > and f2fs know the flag. Kernel 4.10. > > > > So probably you're seeing a placebo effect. If you put lazytime for > > rootfs just only into fstab, it won't have an effect because on > > initial mount this file cannot be opened (for obvious reasons), and > > on remount, btrfs seems to happily accept lazytime but it has no > > effect. It won't show up in /proc/mounts. Try using it in rootflags > > kernel cmdline and you should see that the kernel won't accept the > > flag lazytime. > The command-line also rejects a number of perfectly legitimate > arguments that BTRFS does understand too though, so that's not much > of a test. Which are those? I didn't encounter any... > I've just finished some quick testing though, and it looks > like you're right, BTRFS does not support this, which means I now > need to figure out what the hell was causing the IOPS counters in > collectd to change in rough correlation with remounting (especially > since it appears to happen mostly independent of the options being > changed). I think that noatime (which I remember you also used?), lazytime, and relatime are mutually exclusive: they all handle the inode updates. Maybe that is the effect you see? > This is somewhat disappointing though, as supporting this would > probably help with the write-amplification issues inherent in COW > filesystems. -- Well, relatime is mostly the same thus not perfectly resembling the POSIX standard. I think the only software that relies on atime is mutt... -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs filesystem keeps allocating new chunks for no apparent reason
Am Mon, 10 Apr 2017 18:44:44 +0200 schrieb Kai Krakow <hurikha...@gmail.com>: > Am Mon, 10 Apr 2017 08:51:38 -0400 > schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>: > > > On 2017-04-10 08:45, Kai Krakow wrote: > > > Am Mon, 10 Apr 2017 08:39:23 -0400 > > > schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>: > > > > [...] > > > > > > Does btrfs really support lazytime now? > > > > > It appears to, I do see fewer writes with it than without it. At > > the very least, if it doesn't, then nothing complains about it. > > Did you put it in /etc/fstab only for the rootfs? If yes, it probably > has no effect. You would need to give it as rootflags on the kernel > cmdline. I did a "fgrep lazytime /usr/src/linux -ir" and it reveals only ext4 and f2fs know the flag. Kernel 4.10. So probably you're seeing a placebo effect. If you put lazytime for rootfs just only into fstab, it won't have an effect because on initial mount this file cannot be opened (for obvious reasons), and on remount, btrfs seems to happily accept lazytime but it has no effect. It won't show up in /proc/mounts. Try using it in rootflags kernel cmdline and you should see that the kernel won't accept the flag lazytime. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs filesystem keeps allocating new chunks for no apparent reason
Am Mon, 10 Apr 2017 08:51:38 -0400 schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>: > On 2017-04-10 08:45, Kai Krakow wrote: > > Am Mon, 10 Apr 2017 08:39:23 -0400 > > schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>: > > > >> They've been running BTRFS > >> with LZO compression, the SSD allocator, atime disabled, and mtime > >> updates deferred (lazytime mount option) the whole time, so it may > >> be a slightly different use case than the OP from this thread. > > > > Does btrfs really support lazytime now? > > > It appears to, I do see fewer writes with it than without it. At the > very least, if it doesn't, then nothing complains about it. Did you put it in /etc/fstab only for the rootfs? If yes, it probably has no effect. You would need to give it as rootflags on the kernel cmdline. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs filesystem keeps allocating new chunks for no apparent reason
Am Mon, 10 Apr 2017 08:39:23 -0400 schrieb "Austin S. Hemmelgarn": > They've been running BTRFS > with LZO compression, the SSD allocator, atime disabled, and mtime > updates deferred (lazytime mount option) the whole time, so it may be > a slightly different use case than the OP from this thread. Does btrfs really support lazytime now? -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: About free space fragmentation, metadata write amplification and (no)ssd
Am Sun, 9 Apr 2017 02:21:19 +0200 schrieb Hans van Kranenburg: > On 04/08/2017 11:55 PM, Peter Grandi wrote: > >> [ ... ] This post is way too long [ ... ] > > > > Many thanks for your report, it is really useful, especially the > > details. > > Thanks! > > >> [ ... ] using rsync with --link-dest to btrfs while still > >> using rsync, but with btrfs subvolumes and snapshots [1]. [ > >> ... ] Currently there's ~35TiB of data present on the example > >> filesystem, with a total of just a bit more than 9 > >> subvolumes, in groups of 32 snapshots per remote host (daily > >> for 14 days, weekly for 3 months, montly for a year), so > >> that's about 2800 'groups' of them. Inside are millions and > >> millions and millions of files. And the best part is... it > >> just works. [ ... ] > > > > That kind of arrangement, with a single large pool and very many > > many files and many subdirectories is a worst case scanario for > > any filesystem type, so it is amazing-ish that it works well so > > far, especially with 90,000 subvolumes. > > Yes, this is one of the reasons for this post. Instead of only hearing > about problems all day on the mailing list and IRC, we need some more > reports of success. > > The fundamental functionality of doing the cow snapshots, moo, and the > related subvolume removal on filesystem trees is so awesome. I have no > idea how we would have been able to continue this type of backup > system when btrfs was not available. Hardlinks and rm -rf was a total > dead end road. I'm absolutely no expert with arrays of sizes that you use but I also stopped using the hardlink-and-remove approach: It was slow to manage (rsync works slow for it, rm works slow for it) and it was error-probe (due to the nature of hardlinks). I used btrfs with snapshots and rsync for a while in my personal testbed, and experienced great slowness over time: rsync started to become slower and slower, full backup took 4 hours with huge %IO usage, maintaining the backup history was also slow (removing backups took a while), rebalancing was needed due to huge wasted space. I used rsync with --inplace and --no-whole-file to waste as few space as possible. What I first found was an adaptive rebalancer script which I still use for the main filesystem: https://www.spinics.net/lists/linux-btrfs/msg52076.html (thanks to Lionel) It works pretty well and has no such big IO overhead due to the adaptive multi-pass approach. But it still did not help the slowness. I now tested borgbackup for a while, and it's fast: It does the same job in 30 minutes or less instead of 4 hours, and it has much better backup density and comes with easy history maintenance, too. I can now store much more backup history in the same space. Full restore time is about the same as copying back with rsync. For a professional deployment I'm planning to use XFS as the storage backend and borgbackup as the backup frontend, because my findings showed that XFS allocation groups are spanning diagonally across the disk array, that is if you'd use simple JBOD of your iSCSI LUNs, XFS will spread writes across all the LUNs without you needing to do normal RAID striping, which should eliminate the need to migrate when adding more LUNs, and the underlaying storage layer on the NetApp side will probably already do RAID for redundancy anyways. Just feed more space to XFS using LVM. Borgbackup can do everything that btrfs can do for you but is targetting the job of doing backups only: It can compress, deduplicate, encrypt and do history thinning. The only downside I found is that only one backup job at a time can access the backup repository. So you'd have to use one backup repo per source machine. That way you cannot benefit from deduplication across multiple sources. But I'm sure NetApp can do that. OTOH, maybe backup duration drops to a point that you could serialize the backup of some machines. > OTOH, what we do with btrfs (taking a bulldozer and drive across all > the boundaries of sanity according to all recommendations and > warnings) on this scale of individual remotes is something that the > NetApp people should totally be jealous of. Backups management > (manual create, restore etc on top of the nightlies) is self service > functionality for our customers, and being able to implement the > magic behind the APIs with just a few commands like a btrfs sub snap > and some rsync gives the right amount of freedom and flexibility we > need. This is something I'm planning here, too: Self-service backups, do a btrfs snap, but then use borgbackup for archiving purposes. BTW: I think the 2M size comes from the assumption that SSDs manage their storage in groups of erase block sizes. The optimization here would be that btrfs deallocates (and maybe trims) only whole erase blocks which typically are 2M. This has a performance benefit. But if your underlying storage layer is RAID anyways, this no longer maps
Re: Shrinking a device - performance?
Am Mon, 27 Mar 2017 20:06:46 +0500 schrieb Roman Mamedov: > On Mon, 27 Mar 2017 16:49:47 +0200 > Christian Theune wrote: > > > Also: the idea of migrating on btrfs also has its downside - the > > performance of “mkdir” and “fsync” is abysmal at the moment. I’m > > waiting for the current shrinking job to finish but this is likely > > limited to the “find free space” algorithm. We’re talking about a > > few megabytes converted per second. Sigh. > > Btw since this is all on LVM already, you could set up lvmcache with > a small SSD-based cache volume. Even some old 60GB SSD would work > wonders for performance, and with the cache policy of "writethrough" > you don't have to worry about its reliability (much). That's maybe the best recommendation to speed things up. I'm using bcache here for the same reasons (speeding up random workloads) and it works wonders. Tho, for such big storage I'd maybe recommend a bigger SSD and a new one. Bigger SSDs tend to last much longer. Just don't use the whole of it to allow for better wear leveling and you'll get a final setup that can serve the system much longer than for the period of migration. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: backing up a file server with many subvolumes
Am Mon, 27 Mar 2017 08:57:17 +0300 schrieb Marat Khalili: > Just some consideration, since I've faced similar but no exactly same > problem: use rsync, but create snapshots on target machine. Blind > rsync will destroy deduplication of your snapshots and take huge > amount of storage, so it's not a solution. But you can rsync --inline > your snapshots in chronological order to some folder and re-take > snapshots of that folder, thus recreating your snapshots structure on > target. Obviously, it can/should be automated. I think it's --inplace and --no-whole-file... Apparently, rsync cannot detect moved files which was a big deal for me regarding deduplication, so I found another solution which is even faster. See my other reply. > On 26/03/17 06:00, J. Hart wrote: > > I have a Btrfs filesystem on a backup server. This filesystem has > > a directory to hold backups for filesystems from remote machines. > > In this directory is a subdirectory for each machine. Under each > > machine subdirectory is one directory for each filesystem > > (ex /boot, /home, etc) on that machine. In each filesystem > > subdirectory are incremental snapshot subvolumes for that > > filesystem. The scheme is something like this: > > > > /backup/// > > > > I'd like to try to back up (duplicate) the file server filesystem > > containing these snapshot subvolumes for each remote machine. The > > problem is that I don't think I can use send/receive to do this. > > "Btrfs send" requires "read-only" snapshots, and snapshots are not > > recursive as yet. I think there are too many subvolumes which > > change too often to make doing this without recursion practical. > > > > Any thoughts would be most appreciated. > > > > J. Hart > > > > -- > > To unsubscribe from this list: send the line "unsubscribe > > linux-btrfs" in the body of a message to majord...@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > To unsubscribe from this list: send the line "unsubscribe > linux-btrfs" in the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: backing up a file server with many subvolumes
Am Mon, 27 Mar 2017 07:53:17 -0400 schrieb "Austin S. Hemmelgarn": > > I'd like to try to back up (duplicate) the file server filesystem > > containing these snapshot subvolumes for each remote machine. The > > problem is that I don't think I can use send/receive to do this. > > "Btrfs send" requires "read-only" snapshots, and snapshots are not > > recursive as yet. I think there are too many subvolumes which > > change too often to make doing this without recursion practical. > > > > Any thoughts would be most appreciated. > In general, I would tend to agree with everyone else so far if you > have to keep your current setup. Use rsync with the --inplace option > to send data to a staging location, then snapshot that staging > location to do the actual backup. > > Now, that said, I could probably give some more specific advice if I > had a bit more info on how you're actually storing the backups. > There are three general ways you can do this with BTRFS and > subvolumes: 1. Send/receive of snapshots from the system being backed > up. 2. Use some other software to transfer the data into a staging > location on the backup server, then snapshot that. > 3. Use some other software to transfer the data, and have it handle > snapshots instead of using BTRFS, possibly having it create > subvolumes instead of directories at the top level for each system. If you decide for (3), I can recommend borgbackup. It allows variable block size deduplication across all backup sources, tho to fully get that potential, your backups can only be done serially not in parallel. Borgbackup cannot access the same repository with two processes in parallel, and deduplication is only per repository. Another recommendation for backups is the 3-2-1 rule: * have at least 3 different copies of your data (that means, your original data, the backup copy, and another backup copy, separated in a way they cannot fail for the same reason) * use at least 2 different media (that also means: don't backup btrfs to btrfs, and/or use 2 different backup techniques) * keep at least 1 external copy (maybe rsync to a remote location) The 3 copy rule can be deployed by using different physical locations, different device types, different media, and/or different backup programs. So it's kind of entangled with the 2 and 1 rule. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Confusion about snapshots containers
Am Wed, 29 Mar 2017 16:27:30 -0500 schrieb Tim Cuthbertson: > I have recently switched from multiple partitions with multiple > btrfs's to a flat layout. I will try to keep my question concise. > > I am confused as to whether a snapshots container should be a normal > directory or a mountable subvolume. I do not understand how it can be > a normal directory while being at the same level as, for example, a > rootfs subvolume. This is with the understanding that the rootfs is > NOT at the btrfs top level. > > Which should it be, a normal directory or a mountable subvolume > directly under btrfs top level? If either way can work, what are the > pros and cons of each? I think there is no exact standard you could follow. Many distributions seems to go for the standard to prepend subvolumes with "@" if they are meant to be mounted. However, I'm not doing so. Generally speaking, subvolumes organize your volume into logical containers which make sense to be snapshotted on their own. Snapshots won't propagate to sub-subvolumes, it is important to keep that in mind while designing your idea of a structure. I'm using it like this: In subvol=0 I have the following subvolumes: /* - contains distribution specific file systems /home - contains home directories /snapshots - contains snapshots I want to keep /other - misc stuff, i.e. a dump of the subvol structure in a txt - a copy of my restore script - some other supporting docs for restore - this subvolume is kept in sync with my backup volume This means: If I mount one of the rootfs, my home will not be part of this mount automatically because that subvolume is out of scope of the rootfs. Now I have the following subvolumes below these: /gentoo/rootfs - rootfs of my main distribution Note 1: Everything below (except subvolumes) should be maintained by the package manager. Note 2: currently I installed no other distributions Note 3: I could have called it main-system-rootfs /gentoo/usr - actually not a subvolume but a directory for volumes shareable with other distribution instances /gentoo/usr/portage - portage, shareable by other gentoo instances /gentoo/usr/src - the gentoo linux kernel sources, shareable The following are put below /gentoo/rootfs so they not need to be mounted separately: /gentoo/rootfs/var/log - log volume because I don't want it to be snapshotted /gentoo/rootfs/var/tmp - tmp volume because it makes no sense to be snapshotted /gentoo/rootfs/var/lib/machines - subvolume for keeping nspawn containers /gentoo/rootfs/var/lib/machines/* - different machines cloned from each other /gentoo/rootfs/usr/local - non-package manager stuff /home/myuser - my user home /home/myuser/.VirtualBox - VirtualBox machines because I want them snapshotted separately /etc/fstab now only mounts subvolumes outside of the scope of /gentoo/rootfs: LABEL=system /homebtrfs compress=lzo,subvol=home,noatime LABEL=system /usr/portage btrfs noauto,compress=lzo,subvol=gentoo/usr/portage,noatime,x-systemd.automount LABEL=system /usr/src btrfs noauto,compress=lzo,subvol=gentoo/usr/src,noatime,x-systemd.automount Additionally, I mount the subvol=0 for two special purposes: LABEL=system /mnt/btrfs-pool btrfs noauto,compress=lzo,subvolid=0,x-systemd.automount,noatime First: For managing all the subvolumes and have an untampered view (without tmpfs or special purpose mounts) to the volumes. Second: To take a clean backup of the whole system. Now, I can give the bootloader subvol=gentoo/rootfs to select which system to boot (or make it the default subvolume). Maybe you get the idea and find that idea helpful. PS: It can make sense to have var/lib/machines outside of the rootfs scope if you want to share it with other distributions. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Home storage with btrfs
Am Wed, 15 Mar 2017 23:26:32 +0100 schrieb Kai Krakow <hurikha...@gmail.com>: > Well, bugs can hit you with every filesystem. Nothing as complex as a Meh... I fooled myself. Find the mistake... ;-) SPOILER: "Nothing" should be "something". -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Home storage with btrfs
Am Wed, 15 Mar 2017 23:41:41 +0100 schrieb Kai Krakow <hurikha...@gmail.com>: > Am Wed, 15 Mar 2017 23:26:32 +0100 > schrieb Kai Krakow <hurikha...@gmail.com>: > > > Well, bugs can hit you with every filesystem. Nothing as complex as > > a > > Meh... I fooled myself. Find the mistake... ;-) > > SPOILER: > > "Nothing" should be "something". *doublefacepalm* Please forget what I wrote. The original sentence is correct. I should get some coffee or go to bed. :-\ -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Home storage with btrfs
Am Wed, 15 Mar 2017 07:55:51 + (UTC) schrieb Duncan <1i5t5.dun...@cox.net>: > Hérikz Nawarro posted on Mon, 13 Mar 2017 08:29:32 -0300 as excerpted: > > > Today is safe to use btrfs for home storage? No raid, just secure > > storage for some files and create snapshots from it. > > > I'll echo the others... but with emphasis on a few caveats the others > mentioned but didn't give the emphasis I thought they deserved: > > 1) Btrfs is, as I repeatedly put it in post after post, "stabilizing, > but not yet fully stable and mature." In general, that means it's > likely to work quite or even very well for you (as it has done for > us) if you don't try the too unusual or get too cocky, but get too > close to the edge and you just might find yourself over that edge. > Don't worry too much, tho, those edges are clearly marked if you're > paying attention, and just by asking here, you're already paying way > more attention than too many we see here... /after/ they've found > themselves over the edge. That's a _very_ good sign. =:^) Well, bugs can hit you with every filesystem. Nothing as complex as a file system can ever be proven bug free (except FAT maybe). But as a general-purpose-no-fancy-features-needed FS, btrfs should be on par with other FS these days. > 2) "Stabilizing, not fully stable and mature", means even more than > ever, if you value your data more than the time, hassle and resources > necessary to have backups, you HAVE them, tested and available for > practical use should it be necessary. This is totally not dependent on "stabilizing, not fully stable and mature". If you data matters to you, do backups. It's that simple. If you don't do backups, your data isn't important - by definition. > Of course any sysadmin (and that's what you are for at least your own > systems if you're making this choice) worth the name will tell you > the value of the data is really defined by the number of backups it > has, not by any arbitrary claims to value absent those backups. No > backups, you simply didn't value the data enough to have them, > whatever claims of value you might otherwise try to make. Backups, > you /did/ value the data. Yes. :-) > And of course the corollary to that first sysadmin's rule of backups > is that an untested as restorable backup isn't yet a backup, only a > potential backup, because the job isn't finished and it can't be > properly called a backup until you know you can restore from it if > necessary. Even more true. :-) > And lest anyone get the wrong idea, a snapshot is /not/ a backup for > purposes of the above rules. It's on the same filesystem and > hardware media and if that goes down... you've lost it just the > same. And since that filesystem is still stabilizing, you really > must be even more prepared for it to go down, even if the chances are > still quite good it won't. A good backup should follow the 3-2-1 rule: Have 3 different backup copies, 2 different media, and store at least 1 copy external/off-site. For customers, we usually deploy a strategy like this for Windows machines: Do one local backup using Windows Image Backup to a local NAS to backup from inside the VM, use a different software to do image backups from outside of the VM to the local NAS, mirror the "outside image" to a remote location (cloud storage). And keep some backup history. Overwriting the one existing backup with a new one won't help you anything. All involved software should be able to do efficient delta backups, otherwise mirroring offsite may be no fun. In linux, I'm using borgbackup and rsync to have something similar. Using borgbackup to a local storage, and syncing it offsite with rsync gives me the 2-1 rule part. You can get the third rule by using rsync to also mirror the local FS off the machine. But that's usually overkill for personal backups. Instead, I only have a third copy of most valuable data like photos, dev stuff, documents, etc. BTW: For me, different media also means different FS types. So a bug in one FS wouldn't easily hit the other. [snip] > 4) Keep the number of snapshots per subvolume under tight control as > already suggested. A few hundred, NOT a few thousand. Easy enough > if you do those snapshots manually, but easy enough to get thousands > if you're not paying attention to thin out the old ones and using an > automated tool such as snapper. Borgbackup is so fast and storage efficient that you could run it easily multiple times per day. That in turn means I don't need to rely on regular snapshots to undo mistakes. I only use snapshots before doing some knowingly risky stuff to have fast recovery. But that's all, nothing else should snapshots before (except you are doing more advanced stuff like container cloning, VM instance spawning, ...). > 5) Stay away from quotas. Either you need the feature and thus need > a more mature filesystem where it's actually stable and does what it > says on the label, or you don't, in which
Re: Hard crash on 4.9.5
Am Sat, 28 Jan 2017 15:50:38 -0500 schrieb Matt McKinnon: > This same file system (which crashed again with the same errors) is > also giving this output during a metadata or data balance: This looks somewhat familiar to the err=-17 that I am experiencing when using VirtualBox image on btrfs in CoW mode (compress=lzo). During IO intensive workloads, it results in "object already exists, err -17" (or similar, someone else also experienced it through another workload). The resulting btrfs check show the same errors, giving inodes without csum. Trying to continue using this file system in successive boots usually results in boot freezes or complete unmountable filesystem, broken beyond repair. I'm feeling that using the bfq elevator usually enables me to trigger this bug also without using VirtualBox, i.e. during normal system usage, and mostly during boot when IO load is very high. So I also stopped using bfq although it was giving me a much superior interactivity. Marking vbox images nocow and using standard elevators (cfq, deadline) exposes no such problems so far - even during excessive IO loads. EOM > Jan 27 19:42:47 my_machine kernel: [ 335.018123] BTRFS info (device > sda1): no csum found for inode 28472371 start 2191360 > Jan 27 19:42:47 my_machine kernel: [ 335.018128] BTRFS info (device > sda1): no csum found for inode 28472371 start 2195456 > Jan 27 19:42:47 my_machine kernel: [ 335.018491] BTRFS info (device > sda1): no csum found for inode 28472371 start 4018176 > Jan 27 19:42:47 my_machine kernel: [ 335.018496] BTRFS info (device > sda1): no csum found for inode 28472371 start 4022272 > Jan 27 19:42:47 my_machine kernel: [ 335.018499] BTRFS info (device > sda1): no csum found for inode 28472371 start 4026368 > Jan 27 19:42:47 my_machine kernel: [ 335.018502] BTRFS info (device > sda1): no csum found for inode 28472371 start 4030464 > Jan 27 19:42:47 my_machine kernel: [ 335.019443] BTRFS info (device > sda1): no csum found for inode 28472371 start 6156288 > Jan 27 19:42:47 my_machine kernel: [ 335.019688] BTRFS info (device > sda1): no csum found for inode 28472371 start 7933952 > Jan 27 19:42:47 my_machine kernel: [ 335.019693] BTRFS info (device > sda1): no csum found for inode 28472371 start 7938048 > Jan 27 19:42:47 my_machine kernel: [ 335.019754] BTRFS info (device > sda1): no csum found for inode 28472371 start 8077312 > Jan 27 19:42:47 my_machine kernel: [ 335.025485] BTRFS warning > (device sda1): csum failed ino 28472371 off 2191360 csum 4031061501 > expected csum 0 Jan 27 19:42:47 my_machine kernel: [ 335.025490] > BTRFS warning (device sda1): csum failed ino 28472371 off 2195456 > csum 2371784003 expected csum 0 Jan 27 19:42:47 my_machine kernel: > [ 335.025526] BTRFS warning (device sda1): csum failed ino 28472371 > off 4018176 csum 3812080098 expected csum 0 Jan 27 19:42:47 > my_machine kernel: [ 335.025531] BTRFS warning (device sda1): csum > failed ino 28472371 off 4022272 csum 2776681411 expected csum 0 Jan > 27 19:42:47 my_machine kernel: [ 335.025534] BTRFS warning (device > sda1): csum failed ino 28472371 off 4026368 csum 1179241675 expected > csum 0 Jan 27 19:42:47 my_machine kernel: [ 335.025540] BTRFS > warning (device sda1): csum failed ino 28472371 off 4030464 csum > 1256914217 expected csum 0 Jan 27 19:42:47 my_machine kernel: > [ 335.026142] BTRFS warning (device sda1): csum failed ino 28472371 > off 7933952 csum 2695958066 expected csum 0 Jan 27 19:42:47 > my_machine kernel: [ 335.026147] BTRFS warning (device sda1): csum > failed ino 28472371 off 7938048 csum 3260800596 expected csum 0 Jan > 27 19:42:47 my_machine kernel: [ 335.026934] BTRFS warning (device > sda1): csum failed ino 28472371 off 6156288 csum 4293116449 expected > csum 0 Jan 27 19:42:47 my_machine kernel: [ 335.033249] BTRFS > warning (device sda1): csum failed ino 28472371 off 8077312 csum > 4031878292 expected csum 0 > > Can these be ignored? > > > On 01/25/2017 04:06 PM, Liu Bo wrote: > > On Mon, Jan 23, 2017 at 03:03:55PM -0500, Matt McKinnon wrote: > >> Wondering what to do about this error which says 'reboot needed'. > >> Has happened a three times in the past week: > >> > > > > Well, I don't think btrfs's logic here is wrong, the following stack > > shows that a nfs client has sent a second unlink against the same > > inode while somehow the inode was not fully deleted by the first > > unlink. > > > > So it'd be good that you could add some debugging information to > > get us further. > > > > Thanks, > > > > -liubo > > > >> Jan 23 14:16:17 my_machine kernel: [ 2568.595648] BTRFS error > >> (device sda1): err add delayed dir index item(index: 23810) into > >> the deletion tree of the delayed node(root id: 257, inode id: > >> 2661433, errno: -17) Jan 23 14:16:17 my_machine kernel: > >> [ 2568.611010] [ cut here ] > >> Jan 23 14:16:17 my_machine kernel: [ 2568.615628] kernel BUG at > >>
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
Am Fri, 3 Mar 2017 07:19:06 -0500 schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>: > On 2017-03-03 00:56, Kai Krakow wrote: > > Am Thu, 2 Mar 2017 11:37:53 +0100 > > schrieb Adam Borowski <kilob...@angband.pl>: > > > >> On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote: > [...] > >> > >> Well, there's Qu's patch at: > >> https://www.spinics.net/lists/linux-btrfs/msg47283.html > >> but it doesn't apply cleanly nor is easy to rebase to current > >> kernels. > [...] > >> > >> Well, yeah. The current check is naive and wrong. It does have a > >> purpose, just fails in this, very common, case. > > > > I guess the reasoning behind this is: Creating any more chunks on > > this drive will make raid1 chunks with only one copy. Adding > > another drive later will not replay the copies without user > > interaction. Is that true? > > > > If yes, this may leave you with a mixed case of having a raid1 drive > > with some chunks not mirrored and some mirrored. When the other > > drives goes missing later, you are loosing data or even the whole > > filesystem although you were left with the (wrong) imagination of > > having a mirrored drive setup... > > > > Is this how it works? > > > > If yes, a real patch would also need to replay the missing copies > > after adding a new drive. > > > The problem is that that would use some serious disk bandwidth > without user intervention. The way from userspace to fix this is to > scrub the FS. It would essentially be the same from kernel space, > which means that if you had a multi-TB FS and this happened, you'd be > running at below capacity in terms of bandwidth for quite some time. > If this were to be implemented, it would have to be keyed off of the > per-chunk degraded check (so that _only_ the chunks that need it get > touched), and there would need to be a switch to disable it. Well, I'd expect that a replaced drive would involve reduced bandwidth for a while. Every traditional RAID does this. The key solution there is that you can limit bandwidth and/or define priorities (BG rebuild rate). Btrfs OTOH could be a lot more smarter, only rebuilding chunks that are affected. The kernel can already do IO priorities and some sort of bandwidth limiting should also be possible. I think IO throttling is already implemented in the kernel somewhere (at least with 4.10) and also in btrfs. So the basics are there. In a RAID setup, performance should never have priority over redundancy by default. If performance is an important factor, I suggest working with SSD writeback caches. This is already possible with different kernel techniques like mdcache or bcache. Proper hardware controllers also support this in hardware. It's cheap to have a mirrored SSD writeback cache of 1TB or so if your setup already contains a multiple terabytes array. Such a setup has huge performance benefits in setups we deploy (tho, not btrfs related). Also, adding/replacing a drive is usually not a totally unplanned event. Except for hot spares, a missing drive will be replaced at the time you arrive on-site. If performance is a factor, this can be done the same time as manually starting the process. So why not should it be done automatically? -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
Am Thu, 2 Mar 2017 11:37:53 +0100 schrieb Adam Borowski: > On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote: > > [1717713.408675] BTRFS warning (device dm-8): missing devices (1) > > exceeds the limit (0), writeable mount is not allowed > > [1717713.446453] BTRFS error (device dm-8): open_ctree failed > > > > [chris@f25s ~]$ uname -r > > 4.9.8-200.fc25.x86_64 > > > > I thought this was fixed. I'm still getting a one time degraded rw > > mount, after that it's no longer allowed, which really doesn't make > > any sense because those single chunks are on the drive I'm trying to > > mount. > > Well, there's Qu's patch at: > https://www.spinics.net/lists/linux-btrfs/msg47283.html > but it doesn't apply cleanly nor is easy to rebase to current kernels. > > > I don't understand what problem this proscription is trying to > > avoid. If it's OK to mount rw,degraded once, then it's OK to allow > > it twice. If it's not OK twice, it's not OK once. > > Well, yeah. The current check is naive and wrong. It does have a > purpose, just fails in this, very common, case. I guess the reasoning behind this is: Creating any more chunks on this drive will make raid1 chunks with only one copy. Adding another drive later will not replay the copies without user interaction. Is that true? If yes, this may leave you with a mixed case of having a raid1 drive with some chunks not mirrored and some mirrored. When the other drives goes missing later, you are loosing data or even the whole filesystem although you were left with the (wrong) imagination of having a mirrored drive setup... Is this how it works? If yes, a real patch would also need to replay the missing copies after adding a new drive. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Please add more info to dmesg output on I/O error
Am Wed, 1 Mar 2017 19:04:26 +0300 schrieb Timofey Titovets: > Hi, today i try move my FS from old HDD to new SSD > While processing i catch I/O error and device remove operation was > canceled > > Dmesg: > [ 1015.010241] blk_update_request: I/O error, dev sda, sector 81353664 > [ 1015.010246] BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0, > rd 23, flush 0, corrupt 0, gen 0 > [ 1015.010282] ata5: EH complete > [ 1017.016721] ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 > action 0x0 [ 1017.016730] ata5.00: irq_stat 0x4008 > [ 1017.016737] ata5.00: failed command: READ FPDMA QUEUED > [ 1017.016748] ata5.00: cmd 60/08:80:c0:5b:d9/00:00:04:00:00/40 tag 16 > ncq dma 4096 in >res 41/40:00:c0:5b:d9/00:00:04:00:00/40 Emask > 0x409 (media error) > [ 1017.016754] ata5.00: status: { DRDY ERR } > [ 1017.016757] ata5.00: error: { UNC } > [ 1017.029479] ata5.00: configured for UDMA/133 > [ 1017.029506] sd 4:0:0:0: [sda] tag#16 UNKNOWN(0x2003) Result: > hostbyte=0x00 driverbyte=0x08 > [ 1017.029511] sd 4:0:0:0: [sda] tag#16 Sense Key : 0x3 [current] > [ 1017.029516] sd 4:0:0:0: [sda] tag#16 ASC=0x11 ASCQ=0x4 > [ 1017.029520] sd 4:0:0:0: [sda] tag#16 CDB: opcode=0x28 28 00 04 d9 > 5b c0 00 00 08 00 > > At now, i fixed this problem by doing scrub FS and delete damaged > files, but scrub are slow, and if btrfs show me a more info on I/O > error, it's will be more helpful > i.e. something like i getting by scrub: > [ 1260.559180] BTRFS warning (device sdb1): i/o error at logical > 40569896960 on dev /dev/sda1, sector 81351616, root 309, inode 55135, > offset 71278592, length 4096, links 1 (path: > nefelim4ag/.config/skypeforlinux/Cache/data_3) > > Thanks. You should turn off SCT ERC with smartctl or set it to lower values, or if that doesn't work with your HDD firmware, increase the timeout of the scsi driver above 120s. This setup as it is, is not going to work correctly with btrfs in case of errors. # smartctl -l scterc,70,70 /dev/sdb should do the trick if supported. It applies an error correction timeout of 7 seconds for reading and writing, which is below the kernel scsi layer timeout of 30 seconds. Otherwise, your drive will fail to respond for two minutes until the kernel resets the drive. According to dmesg, this is what happened. NAS-ready drives usually support this setting, while desktop drives don't or at least default to standard desktop timeouts. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [4.7.2] btrfs_run_delayed_refs:2963: errno=-17 Object already exists
Am Tue, 14 Feb 2017 13:52:48 +0100 schrieb Marc Joliet: > Hi again, > > so, it seems that I've solved the problem: After having to > umount/mount the FS several times to get btrfs-cleaner to finish, I > thought of the "failed to load free space [...], rebuilding it now" > type errors and decided to try the clear_cache mount option. Since > then, my home server has been running its backups regularly again. > Furthermore, I was able back up my desktop again via send/recv (the > rsync based backup is still running, but I expect it to succeed). > The kernel log has also stayed clean. > > Kai, I'd be curious whether clear_cache will help in your case, too, > if you haven't tried it already. The FS in question has been rebuilt from scratch. AFAIR it wasn't even mountable back then, or at it froze the system short after mounting. I needed the system back in usable state, so I had to recreate and restore the system. Next time, if it happens again (finger crossed it does not), I'll give it a try. -- Regards, Kai Replies to list-only preferred. pgp8pHpMpE5QT.pgp Description: Digitale Signatur von OpenPGP
Re: [4.7.2] btrfs_run_delayed_refs:2963: errno=-17 Object already exists
Am Fri, 10 Feb 2017 23:15:03 +0100 schrieb Marc Joliet: > # btrfs filesystem df /media/MARCEC_BACKUP/ > Data, single: total=851.00GiB, used=831.36GiB > System, DUP: total=64.00MiB, used=120.00KiB > Metadata, DUP: total=13.00GiB, used=10.38GiB > Metadata, single: total=1.12GiB, used=0.00B > GlobalReserve, single: total=512.00MiB, used=0.00B > > Hmm, I take it that the single metadata is a leftover from running > --repair? It's more probably a remnant of an incomplete balance operation or an older mkfs version. I'd simply rebalance metadata to fix this. I don't think that btrfs-repair would migrate missing metadata duplicates back to single profile, it would more likely trigger recreating the missing duplicates. But I'm not sure. If it is a result of the repair operation, that could be an interesting clue. Could it explain "error -17" from your logs? But that would mean the duplicates were already missing before the repair operation and triggered that problem. So the question is, why are those duplicates missing in the first place as a result of normal operation? From your logs: ---8<--- snip Feb 02 22:49:14 thetick kernel: BTRFS: device label MARCEC_BACKUP devid 1 transid 283903 /dev/sdd2 Feb 02 22:49:19 thetick kernel: EXT4-fs (sdd1): mounted filesystem with ordered data mode. Opts: (null) Feb 03 00:18:52 thetick kernel: BTRFS info (device sdd2): use zlib compression Feb 03 00:18:52 thetick kernel: BTRFS info (device sdd2): disk space caching is enabled Feb 03 00:18:52 thetick kernel: BTRFS info (device sdd2): has skinny extents Feb 03 00:20:09 thetick kernel: BTRFS info (device sdd2): The free space cache file (3967375376384) is invalid. skip it Feb 03 01:05:58 thetick kernel: [ cut here ] Feb 03 01:05:58 thetick kernel: WARNING: CPU: 1 PID: 26544 at fs/btrfs/extent- tree.c:2967 btrfs_run_delayed_refs+0x26c/0x290 Feb 03 01:05:58 thetick kernel: BTRFS: Transaction aborted (error -17) --->8--- snap "error -17" being "object already exists". My only theory would be this has a direct connection to you finding the single metadata profile. Like in "the kernel thinks the objects already exists when it really didn't, and as a result the object is there only once now" aka "single metadata". But I'm no dev and no expert on the internals. -- Regards, Kai Replies to list-only preferred. pgp64OIVK8aRx.pgp Description: Digitale Signatur von OpenPGP
Re: BTRFS and cyrus mail server
Am Wed, 08 Feb 2017 19:38:06 +0100 schrieb Libor Klepáč: > Hello, > inspired by recent discussion on BTRFS vs. databases i wanted to ask > on suitability of BTRFS for hosting a Cyrus imap server spool. I > haven't found any recent article on this topic. > > I'm preparing migration of our mailserver to Debian Stretch, ie. > kernel 4.9 for now. We are using XFS for storage now. I will migrate > using imapsync to new server. Both are virtual machines running on > vmware on Dell hardware. Disks are on battery backed hw raid > controllers over vmfs. > > I'm considering using BTRFS, but I'm little concerned because of > reading this mailing list ;) > > I'm interested in using: > - compression (emails should compress well - right?) Not really... The small part that's compressible (headers and a few lines of text) are already small, so a sector (maybe 4k) is still a sector. Compression gains you no benefit here. That big parts of mails is already compressed (images, attachments). Mail spools only compress well if you're compressing mails to a solid archive (like 7zip or tgz). If you're compressing each mail individually, there's almost no gain because of file system slack. > - maybe deduplication (cyrus does it by hardlinking of same content > messages now) later It won't work that way. I'd stick to hardlinking. Only offline/nearline deduplication will help you. And it will have a hard time finding the duplicates. This would only properly work if Cyrus separates mail headers and bodies (I don't know if it does, dovecot doesn't which is what I use) because delivering to the spool usually adds some headers like "Delivered-To". This changes the byte offsets between similar mails so that deduplication will no longer work. > - snapshots for history Don't do snapshots too deep. I had similar plans but instead decided it would be better to use the following setup as a continuous backup strategy: Deliver mails to two spools, one being the user accessible spool, and one being the backup spool. Once per day you rename the backup spool and let it be recreated. Then store away the old backup store in whatever way you want (snapshots, traditional backup with retention, ...). > - send/receive for offisite backup It's not that stable that I'd use it in production... > - what about data inlining, should it be turned off? How much data can be inlined? I'm not sure, I never thought about that. > Our Cyrus pool consist of ~520GB of data in ~2,5million files, ~2000 > mailboxes. Similar numbers here, just more mailboxes and less space because we take care that customers remove their mails from our servers and store it in their own systems and backups. With a few exceptions, and those have really big mailboxes. > We have message size limit of ~25MB, so emails are not bigger than > that. 50 MB raw size here... (after 3-in-4 decoding this makes around 37 MB worth of attachments) > There are however bigger files, these are per mailbox > caches/index files of cyrus (some of them are around 300MB) - and > these are also files which are most modified. > Rest of files (messages) are usualy just writen once. I'm still struggling if I should try btrfs or stay with xfs. Xfs has a huge benefit of scaling very very well to parallel workloads and accross multiple devices. Btrfs does exactly that not very well yet (because of write-serialization etc). > > --- > I started using btrfs on backup server as a storage for 4 backuppc > run in containers (backups are then send away with btrbk), year ago. > After switching off data inlining i'm satisfied, everything works > (send/ receive is sometime slow, but i guess it's because of sata > disks on receive side). I've started to love borgbackup. It's very fast, efficient, and reliable. Not sure how good it works for VM images, but for delta backups in general it's very efficient and fast. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: better document btrfs receive security
Am Fri, 3 Feb 2017 08:48:58 -0500 schrieb "Austin S. Hemmelgarn": > +user who is running receive, and then move then into the final > destination Typo? s/then/them/? -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: dup vs raid1 in single disk
Am Thu, 19 Jan 2017 15:02:14 -0500 schrieb "Austin S. Hemmelgarn": > On 2017-01-19 13:23, Roman Mamedov wrote: > > On Thu, 19 Jan 2017 17:39:37 +0100 > > "Alejandro R. Mosteo" wrote: > > > >> I was wondering, from a point of view of data safety, if there is > >> any difference between using dup or making a raid1 from two > >> partitions in the same disk. This is thinking on having some > >> protection against the typical aging HDD that starts to have bad > >> sectors. > > > > RAID1 will write slower compared to DUP, as any optimization to > > make RAID1 devices work in parallel will cause a total performance > > disaster for you as you will start trying to write to both > > partitions at the same time, turning all linear writes into random > > ones, which are about two orders of magnitude slower than linear on > > spinning hard drives. DUP shouldn't have this issue, but still it > > will be twice slower than single, since you are writing everything > > twice. > As of right now, there will actually be near zero impact on write > performance (or at least, it's way less than the theoretical 50%) > because there really isn't any optimization to speak of in the > multi-device code. That will hopefully change over time, but it's > not likely to do so any time in the future since nobody appears to be > working on multi-device write performance. I think that's only true if you don't account the seek overhead. In single device RAID1 mode you will always seek half of the device while writing data, and even when reading between odd and even PIDs. In contrast, DUP mode doesn't guarantee your seeks to be shorter but from a statistical point of view, on the average it should be shorter. So it should yield better performance (tho I wouldn't expect it to be observable, depending on your workload). So, on devices having no seek overhead (aka SSD), it is probably true (minus bus bandwidth considerations). For HDD I'd prefer DUP. >From data safety point of view: It's more likely that adjacent and nearby sectors are bad. So DUP imposes a higher risk of written data being written to only bad sectors - which means data loss or even file system loss (if metadata hits this problem). To be realistic: I wouldn't trade space usage for duplicate data on an already failing disk, no matter if it's DUP or RAID1. HDD disk space is cheap, and using such a scenario is just waste of performance AND space - no matter what. I don't understand the purpose of this. It just results in fake safety. Better get two separate devices half the size. There's a better chance of getting a better cost/space ratio anyways, plus better performance and safety. > There's also the fact that you're writing more metadata than data > most of the time unless you're dealing with really big files, and > metadata is already DUP mode (unless you are using an SSD), so the > performance hit isn't 50%, it's actually a bit more than half the > ratio of data writes to metadata writes. > > > >> On a related note, I see this caveat about dup in the manpage: > >> > >> "For example, a SSD drive can remap the blocks internally to a > >> single copy thus deduplicating them. This negates the purpose of > >> increased redunancy (sic) and just wastes space" > > > > That ability is vastly overestimated in the man page. There is no > > miracle content-addressable storage system working at 500 MB/sec > > speeds all within a little cheap controller on SSDs. Likely most of > > what it can do, is just compress simple stuff, such as runs of > > zeroes or other repeating byte sequences. > Most of those that do in-line compression don't implement it in > firmware, they implement it in hardware, and even DEFLATE can get 500 > MB/second speeds if properly implemented in hardware. The firmware > may control how the hardware works, but it's usually hardware doing > heavy lifting in that case, and getting a good ASIC made that can hit > the required performance point for a reasonable compression algorithm > like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI > work. I still thinks it's a myth... The overhead of managing inline deduplication is just way too high to implement it without jumping through expensive hoops. Most workloads have almost zero deduplication potential. And even when, their temporal occurrence is spaced so far that an inline deduplicator won't catch it. If it would be all so easy, btrfs would already have it working in mainline. I don't even remember that those patches is still being worked on. With this in mind, I think dup metadata is still a good think to have even on SSD and I would always force to enable it. Potential for deduplication is only when using snapshots (which already are deduplicated when taken) or when handling user data on a file server in a multi-user environment. Users tend to copy their files all over the place - multiple directories of multiple gigabytes.
Re: BTRFS for OLTP Databases
Am Tue, 7 Feb 2017 22:25:29 +0100 schrieb Lionel Bouton <lionel-subscript...@bouton.name>: > Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit : > > On 2017-02-07 15:36, Kai Krakow wrote: > >> Am Tue, 7 Feb 2017 09:13:25 -0500 > >> schrieb Peter Zaitsev <p...@percona.com>: > >> > [...] > >> > >> Out of curiosity, I see one problem here: > >> > >> If you're doing snapshots of the live database, each snapshot > >> leaves the database files like killing the database in-flight. > >> Like shutting the system down in the middle of writing data. > >> > >> This is because I think there's no API for user space to subscribe > >> to events like a snapshot - unlike e.g. the VSS API (volume > >> snapshot service) in Windows. You should put the database into > >> frozen state to prepare it for a hotcopy before creating the > >> snapshot, then ensure all data is flushed before continuing. > > Correct. > >> > >> I think I've read that btrfs snapshots do not guarantee single > >> point in time snapshots - the snapshot may be smeared across a > >> longer period of time while the kernel is still writing data. So > >> parts of your writes may still end up in the snapshot after > >> issuing the snapshot command, instead of in the working copy as > >> expected. > > Also correct AFAICT, and this needs to be better documented (for > > most people, the term snapshot implies atomicity of the > > operation). > > Atomicity can be a relative term. If the snapshot atomicity is > relative to barriers but not relative to individual writes between > barriers then AFAICT it's fine because the filesystem doesn't make > any promise it won't keep even in the context of its snapshots. > Consider a power loss : the filesystems atomicity guarantees can't go > beyond what the hardware guarantees which means not all current in fly > write will reach the disk and partial writes can happen. Modern > filesystems will remain consistent though and if an application using > them makes uses of f*sync it can provide its own guarantees too. The > same should apply to snapshots : all the writes in fly can complete or > not on disk before the snapshot what matters is that both the snapshot > and these writes will be completed after the next barrier (and any > robust application will ignore all the in fly writes it finds in the > snapshot if they were part of a batch that should be atomically > commited). > > This is why AFAIK PostgreSQL or MySQL with their default ACID > compliant configuration will recover from a BTRFS snapshot in the > same way they recover from a power loss. This is what I meant in my other reply. But this is also why it should be documented. Wrongly implying that snapshots are single point in time snapshots is a wrong assumption with possibly horrible side effects one wouldn't expect. Taking a snapshot is like a power loss - even tho there is no power loss. So the database has to be properly configured. It is simply short sighted if you don't think about this fact. The documentation should really point that fact out. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS for OLTP Databases
Am Tue, 7 Feb 2017 10:43:11 -0500 schrieb "Austin S. Hemmelgarn": > > I mean that: > > You have a 128MB extent, you rewrite random 4k sectors, btrfs will > > not split 128MB extent, and not free up data, (i don't know > > internal algo, so i can't predict when this will hapen), and after > > some time, btrfs will rebuild extents, and split 128 MB exten to > > several more smaller. But when you use compression, allocator > > rebuilding extents much early (i think, it's because btrfs also > > operates with that like 128kb extent, even if it's a continuos > > 128MB chunk of data). > The allocator has absolutely nothing to do with this, it's a function > of the COW operation. Unless you're using nodatacow, that 128MB > extent will get split the moment the data hits the storage device > (either on the next commit cycle (at most 30 seconds with the default > commit cycle), or when fdatasync is called, whichever is sooner). In > the case of compression, it's still one extent (although on disk it > will be less than 128MB) and will be split at _exactly_ the same time > under _exactly_ the same circumstances as an uncompressed extent. > IOW, it has absolutely nothing to do with the extent handling either. I don't think that btrfs splits extents which are part of the snapshot. The extent in a snapshot will stay intact when writing to this extent in another snapshot. Of course, in the just written snapshot, the extent will be represented as a split extent mapping to the original extents data blocks plus the new data in the middle (thus resulting in three extents). This is also why small random writes without autodefrag result in a vast amount of small extents bringing the fs performance to a crawl. Do that multiple times on multiple snapshots, delete some of the original snapshots, and you're left with slack space, data blocks being inaccessible and won't be reclaimed into free space (because they are still part of the original extent), and which can only be reclaimed by a defrag operation - which would of course unshares data. Thus, if any of the above mentioned small extents is still shared with an extent originally much bigger, then it will still occupy its original space on the filesystem - even when its associated snapshot/subvolume no longer exists. Only when the last remaining tiny block of such an extent gets rewritten and the reference counter decreases to zero, the extent is given up and freed. To work around this, you can currently only unshare and recombine by doing defrag and dedupe on all snapshots. This will reclaim space sitting in parts of the original extents no longer referenced by a snapshot visible from the VFS layer. This is for performance reasons because btrfs is extent based. As far as I know, ZFS on the other side, works different. It uses block based storage for the snapshot feature and can easily throw away unused blocks. Only a second layer on top maps this back into extents. The underlying infrastructure, however, is block based storage, which also enables the volume pool to create block devices on the fly out of ZFS storage space. PS: All above given the fact I understood it right. ;-) -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS for OLTP Databases
Am Tue, 7 Feb 2017 15:27:34 -0500 schrieb "Austin S. Hemmelgarn": > >> I'm not sure about this one. I would assume based on the fact that > >> many other things don't work with nodatacow and that regular defrag > >> doesn't work on files which are currently mapped as executable code > >> that it does not, but I could be completely wrong about this too. > > > > Technically, there's nothing that prevents autodefrag to work for > > nodatacow files. The question is: is it really necessary? Standard > > file systems also have no autodefrag, it's not an issue there > > because they are essentially nodatacow. Simply defrag the database > > file once and you're done. Transactional MySQL uses huge data > > files, probably preallocated. It should simply work with > > nodatacow. > The thing is, I don't have enough knowledge of how defrag is > implemented in BTRFS to say for certain that ti doesn't use COW > semantics somewhere (and I would actually expect it to do so, since > that in theory makes many things _much_ easier to handle), and if it > uses COW somewhere, then it by definition doesn't work on NOCOW files. A dev would be needed on this. But from a non-dev point of view, the defrag operation itself is CoW: Blocks are rewritten to another location in contiguous order. Only metadata CoW should be needed for this operation. It should be nothing else than writing to a nodatacow snapshot... Just that the snapshot is more or less implicit and temporary. Hmm? *curious* -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS for OLTP Databases
Am Tue, 7 Feb 2017 09:13:25 -0500 schrieb Peter Zaitsev: > Hi Hugo, > > For the use case I'm looking for I'm interested in having snapshot(s) > open at all time. Imagine for example snapshot being created every > hour and several of these snapshots kept at all time providing quick > recovery points to the state of 1,2,3 hours ago. In such case (as I > think you also describe) nodatacow does not provide any advantage. Out of curiosity, I see one problem here: If you're doing snapshots of the live database, each snapshot leaves the database files like killing the database in-flight. Like shutting the system down in the middle of writing data. This is because I think there's no API for user space to subscribe to events like a snapshot - unlike e.g. the VSS API (volume snapshot service) in Windows. You should put the database into frozen state to prepare it for a hotcopy before creating the snapshot, then ensure all data is flushed before continuing. I think I've read that btrfs snapshots do not guarantee single point in time snapshots - the snapshot may be smeared across a longer period of time while the kernel is still writing data. So parts of your writes may still end up in the snapshot after issuing the snapshot command, instead of in the working copy as expected. How is this going to be addressed? Is there some snapshot aware API to let user space subscribe to such events and do proper preparation? Is this planned? LVM could be a user of such an API, too. I think this could have nice enterprise-grade value for Linux. XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But still, also this needs to be integrated with MySQL to properly work. I once (years ago) researched on this but gave up on my plans when I planned database backups for our web server infrastructure. We moved to creating SQL dumps instead, although there're binlogs which can be used to recover to a clean and stable transactional state after taking snapshots. But I simply didn't want to fiddle around with properly cleaning up binlogs which accumulate horribly much space usage over time. The cleanup process requires to create a cold copy or dump of the complete database from time to time, only then it's safe to remove all binlogs up to that point in time. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS for OLTP Databases
Am Tue, 7 Feb 2017 14:50:04 -0500 schrieb "Austin S. Hemmelgarn": > > Also does autodefrag works with nodatacow (ie with snapshot) or > > are these exclusive ? > I'm not sure about this one. I would assume based on the fact that > many other things don't work with nodatacow and that regular defrag > doesn't work on files which are currently mapped as executable code > that it does not, but I could be completely wrong about this too. Technically, there's nothing that prevents autodefrag to work for nodatacow files. The question is: is it really necessary? Standard file systems also have no autodefrag, it's not an issue there because they are essentially nodatacow. Simply defrag the database file once and you're done. Transactional MySQL uses huge data files, probably preallocated. It should simply work with nodatacow. On the other hand: Using snapshots clearly introduces fragmentation over time. If autodefrag kicks in (given, it is supported for nodatacow), it will slowly unshare all data over time. This somehow defeats the purpose of having snapshots in the first place for this scenario. In conclusion, I'd recommend to run some maintenance scripts from time to time, one to re-share identical blocks, and one to defragment the current workspace. The bees daemon comes into mind here... I haven't tried it but it sounds like it could fill a gap here: https://github.com/Zygo/bees Another option comes into mind: XFS now supports shared-extents copies. You could simply do a cold copy of the database with this feature resulting in the same effect as a snapshot, without seeing the other performance problems of btrfs. Tho, the fragmentation issue would remain, and I think there's no dedupe application for XFS yet. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS for OLTP Databases
Am Tue, 7 Feb 2017 10:06:34 -0500 schrieb "Austin S. Hemmelgarn": > 4. Try using in-line compression. This can actually significantly > improve performance, especially if you have slow storage devices and > a really nice CPU. Just a side note: With nodatacow there'll be no compression, I think. At least for files with "chattr +C" there'll be no compression. I thus think "nodatacow" has the same effect. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Very slow balance / btrfs-transaction
Am Mon, 6 Feb 2017 08:19:37 -0500 schrieb "Austin S. Hemmelgarn": > > MDRAID uses stripe selection based on latency and other measurements > > (like head position). It would be nice if btrfs implemented similar > > functionality. This would also be helpful for selecting a disk if > > there're more disks than stripesets (for example, I have 3 disks in > > my btrfs array). This could write new blocks to the most idle disk > > always. I think this wasn't covered by the above mentioned patch. > > Currently, selection is based only on the disk with most free > > space. > You're confusing read selection and write selection. MDADM and > DM-RAID both use a load-balancing read selection algorithm that takes > latency and other factors into account. However, they use a > round-robin write selection algorithm that only cares about the > position of the block in the virtual device modulo the number of > physical devices. Thanks for clearing that point. > As an example, say you have a 3 disk RAID10 array set up using MDADM > (this is functionally the same as a 3-disk raid1 mode BTRFS > filesystem). Every third block starting from block 0 will be on disks > 1 and 2, every third block starting from block 1 will be on disks 3 > and 1, and every third block starting from block 2 will be on disks 2 > and 3. No latency measurements are taken, literally nothing is > factored in except the block's position in the virtual device. I didn't know MDADM can use RAID10 on odd amounts of disks... Nice. I'll keep that in mind. :-) -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs receive leaves new subvolume modifiable during operation
Am Mon, 6 Feb 2017 07:30:31 -0500 schrieb "Austin S. Hemmelgarn": > > How about mounting the receiver below a directory only traversable > > by root (chmod og-rwx)? Backups shouldn't be directly accessible by > > ordinary users anyway. > There are perfectly legitimate reasons to have backups directly > accessible by users. If you're running automated backups for _their_ > systems _they_ absolutely should have at least read access to the > backups _once they're stable_. This is not what I tried to explain. The OPs question mixes the creation process with later access. The creation process, however, should always be isolated. See below, you're even agreeing. ;-) > This is not a common case, but it is > a perfectly legitimate one. In the same way, if you're storing > backups for your users (in this case you absolutely should not be4 > using send/receive for other reasons), then the use case dictates > that they have some way to access them. I don't deny that. But the OP should understand to properly isolate both operations from each other. This is best practice, this is how it should be done. > > If you want a backup becoming accessible, you can later snapshot it > > to an accessible location after send/receive finished without > > errors. > > > > An in-transit backup, however, should always be protected from > > possible disruptive access. This is an issue with any backup > > software. So place the receive within an inaccessible directory. > > This is not something the backup process should directly bother > > with for simplicity. > I agree on this point however. Doing a backup directly into the > final persistent storage location is never a good idea. It makes > error handling more complicated, it makes handling of multi-tier > storage more complicated (and usually less efficient), and it makes > security more difficult. Agreed. It complicates a lot of things. In conclusion: If done right, the original request isn't a problem, neither is anything wrong by design. It's a question of isolation of operations. This is simply one of the most basic principles of a safe and secure backup. Personally, if I use rsync for backups, I always rsync to a scratch location only accessible by the backup process. This scratch area may even be incomplete, inconsistent or broken in other ways. Only when rsync successfully finished, the scratch area will be snapshot to its final destination - which is accessible by its users/owners. This also has the benefit of the snapshots being self-contained deltas which can be removed without rewriting or reorganizing partial or complete backup history. That's a plus-point for backup safety, performance, and retention policies. Currently, I'm using borgbackup for my personal backups. It has a similar approach by using checkpoints for resuming a partial backup. Only a successful backup process creates the final checkpoint. The intermediate checkpoints can be thrown away at any time later. It currently stores a backup history of 95.8 TB (multiple months) on a 3 TB hard disk. Unfortunately, I don't yet sync this to an offsite location. My most important data (photos, mental work like programming) is stored in a different location through other means (Git, cloud sync). For customers, I prefer to use a local cache where the backup is stored, then it will be synced offsite using deduplication algorithms to reduce transfer overhead. A second, different backup software stores another local copy for fast recovery in case of disaster. There's only need to sync back from offsite storage in case of total local data loss. And there's a backup for the backup. If one doesn't work, there's always the other. In all cases, the intermediate storage is protected from tampering by the user, or even completely blocked to be accessed by the user. Only final and clean snapshots are made available. Also, error handling and cleanup is easy because errors don't leak or propagate into the final storage. You simply can clean caches, intermediate checkpoints, or scratch/staging areas. You can even loose it for whatever reason (hardware problems, storage errors etc). The only downside would be that the next backups takes longer to complete. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is it possible to have metadata-only device with no data?
Am Mon, 06 Feb 2017 00:42:01 +0300 schrieb Alexander Tomokhov: > Is it possible, having two drives to do raid1 for metadata but keep > data on a single drive only? -- No, but you could take a look into bcache which should get you something similar if used in write-around mode. Random access will become cached in bcache, which should most of the time be metadata, plus of course randomly accessed data from HDD. If you reduce the sequential cutoff trigger in bcache, it should cache mostly metadata only. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs receive leaves new subvolume modifiable during operation
Am Thu, 2 Feb 2017 10:52:26 + schrieb Graham Cobb: > On 02/02/17 00:02, Duncan wrote: > > If it's a workaround, then many of the Linux procedures we as > > admins and users use every day are equally workarounds. Setting > > 007 perms on a dir that doesn't have anything immediately security > > vulnerable in it, simply to keep other users from even potentially > > seeing or being able to write to something N layers down the subdir > > tree, is standard practice. > > No. There is no need to normally place a read-only snapshot below a > no-execute directory just to prevent write access to it. That is not > part of the admin's expectation. Wrong. If you tend to not be in control of the permissions below a mountpoint, you prevent access to it by restricting permissions on a parent directory of the mountpoint. It's that easy and it always has been. That is standard practice. While your backup is running, you have no control of it - thus use this standard practice! If you want to make selective or general access to the backups, snapshot them to a different location later. You should really learn how to work with scratch locations for a backup. The receiver is such a scratch location and should be handled like it. Scratch directories for backup locations are not a bug and not a workaround. It's just how you should handle it and how it works. By definition, the target directory cannot be read-only while the backup is running - so by definition you need other means to protect it. That in turn defines all your current locations for snapshot as scratch locations. Put your final snapshots somewhere else if you want your users access these after successful finishing a backup. You never want people to access in-transit or partial backups. In-transit and partial always goes to the scratch directory. Only then, after success of the backup, a backup should be made available in a final directory. Maybe your design of your backup is wrong, and not how btrfs handles that? Using scratch locations is a design you should really really always use for backups. Every sophisticated admin would agree that this should be best practice. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs receive leaves new subvolume modifiable during operation
Am Wed, 1 Feb 2017 17:43:32 + schrieb Graham Cobb: > On 01/02/17 12:28, Austin S. Hemmelgarn wrote: > > On 2017-02-01 00:09, Duncan wrote: > >> Christian Lupien posted on Tue, 31 Jan 2017 18:32:58 -0500 as > >> excerpted: > [...] > >> > >> I'm just a btrfs-using list regular not a dev, but AFAIK, the > >> behavior is likely to be by design and difficult to change, > >> because the send stream is simply a stream of userspace-context > >> commands for receive to act upon, and any other suitably > >> privileged userspace program could run the same commands. (If > >> your btrfs-progs is new enough receive even has a dump option, > >> that prints the metadata operations in human readable form, one > >> operation per line.) > >> > >> So making the receive snapshot read-only during the transfer would > >> prevent receive itself working. > > That's correct. Fixing this completely would require implementing > > receive on the kernel side, which is not a practical option for > > multiple reasons. > > I am with Christian on this. Both the effects he discovered go against > my expectation of how send/receive would or should work. > > This first bug is more serious because it appears to allow a > non-privileged user to disrupt the correct operation of receive, > creating a form of denial-of-service of a send/receive based backup > process. If I decided that I didn't want my pron collection (or my > incriminating emails) appearing in the backups I could just make sure > that I removed them from the receive snapshots while they were still > writeable. > > You may be right that fixing this would require receive in the kernel, > and that is undesirable, although it seems to me that it should be > possible to do something like allow receive to create the snapshot > with a special flag that would cause the kernel to treat it as > read-only to any requests not delivered through the same file > descriptor, or something like that (or, if that can't be done, at > least require root access to make any changes). In any case, I > believe it should be treated as a bug, even if low priority, with an > explicit warning about the possible corruption of receive-based > backups in the btrfs-receive man page. How about mounting the receiver below a directory only traversable by root (chmod og-rwx)? Backups shouldn't be directly accessible by ordinary users anyway. If you want a backup becoming accessible, you can later snapshot it to an accessible location after send/receive finished without errors. An in-transit backup, however, should always be protected from possible disruptive access. This is an issue with any backup software. So place the receive within an inaccessible directory. This is not something the backup process should directly bother with for simplicity. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Very slow balance / btrfs-transaction
Am Sat, 04 Feb 2017 20:50:03 + schrieb "Jorg Bornschein": > February 4, 2017 1:07 AM, "Goldwyn Rodrigues" > wrote: > > > Yes, please check if disabling quotas makes a difference in > > execution time of btrfs balance. > > Just FYI: With quotas disabled it took ~20h to finish the balance > instead of the projected >30 days. Therefore, in my case, there was a > speedup of factor ~35. > > > and thanks for the quick reply! (and for btrfs general!) > > > BTW: I'm wondering how much sense it makes to activate the underlying > bcache for my raid1 fs again. I guess btrfs chooses randomly (or > based predicted of disk latency?) which copy of a given extend to > load? As far as I know, it uses PID modulo only currently, no round-robin, no random value. There are no performance optimizations going into btrfs yet because there're still a lot of ongoing feature implementations. I think there were patches to include a rotator value in the stripe selection. They don't apply to the current kernel. I tried it once and didn't see any subjective difference for normal desktop workloads. But that's probably because I use RAID1 for metadata only. MDRAID uses stripe selection based on latency and other measurements (like head position). It would be nice if btrfs implemented similar functionality. This would also be helpful for selecting a disk if there're more disks than stripesets (for example, I have 3 disks in my btrfs array). This could write new blocks to the most idle disk always. I think this wasn't covered by the above mentioned patch. Currently, selection is based only on the disk with most free space. > I guess that would mean the effective cache size would only be > half of the actual cache-set size (+-additional overhead)? Or does > btrfs try a deterministically determined copy of each extend first? I'm currently using 500GB bcache, it helps a lot during system start - and probably also while using using the system. I think that bcache mostly caches metadata access which should improve a lot of btrfs performance issues. The downside of RAID1 profile is, that probably every second access is a cache-miss unless it has already been cached. Thus, it's only half-effective as it could be. I'm using write-back bcache caching, and RAID0 for data (I do daily backups with borgbackup, I can easily recover broken files). So writing with bcache is not such an issue for me. The cache is big enough that double metadata writes are no problem. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [4.7.2] btrfs_run_delayed_refs:2963: errno=-17 Object already exists
Am Thu, 02 Feb 2017 13:01:03 +0100 schrieb Marc Joliet <mar...@gmx.de>: > On Sunday 28 August 2016 15:29:08 Kai Krakow wrote: > > Hello list! > > Hi list > > > It happened again. While using VirtualBox the following crash > > happened, btrfs check found a lot of errors which it couldn't > > repair. Earlier that day my system crashed which may already > > introduced errors into my filesystem. Apparently, I couldn't create > > an image (not enough space available), I only can give this trace > > from dmesg: > > > > [44819.903435] [ cut here ] > > [44819.903443] WARNING: CPU: 3 PID: 2787 at > > fs/btrfs/extent-tree.c:2963 btrfs_run_delayed_refs+0x26c/0x290 > > [44819.903444] BTRFS: Transaction aborted (error -17) > > [44819.903445] Modules linked in: nls_iso8859_15 nls_cp437 vfat fat > > fuse rfcomm veth af_packet ipt_MASQUERADE nf_nat_masquerade_ipv4 > > iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat > > nf_conntrack bridge stp llc w83627ehf bnep hwmon_vid cachefiles > > btusb btintel bluetooth snd_hda_codec_hdmi snd_hda_codec_realtek > > snd_hda_codec_generic snd_hda_intel snd_hda_codec rfkill snd_hwdep > > snd_hda_core snd_pcm snd_timer coretemp hwmon snd r8169 mii > > kvm_intel kvm iTCO_wdt iTCO_vendor_support rtc_cmos irqbypass > > soundcore ip_tables uas usb_storage nvidia_drm(PO) vboxpci(O) > > vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nvidia_modeset(PO) > > nvidia(PO) efivarfs unix ipv6 [44819.903484] CPU: 3 PID: 2787 Comm: > > BrowserBlocking Tainted: P O4.7.2-gentoo #2 > > [44819.903485] Hardware name: To Be Filled By O.E.M. To Be Filled > > By O.E.M./Z68 Pro3, BIOS L2.16A 02/22/2013 [44819.903487] > > 8130af2d 8800b7d03d20 > > [44819.903489] 810865fa 880409374428 8800b7d03d70 > > 8803bf299760 [44819.903491] ffef > > 8803f677f000 8108666a [44819.903493] Call Trace: > > [44819.903496] [] ? dump_stack+0x46/0x59 > > [44819.903500] [] ? __warn+0xba/0xe0 > > [44819.903502] [] ? warn_slowpath_fmt+0x4a/0x50 > > [44819.903504] [] ? > > btrfs_run_delayed_refs+0x26c/0x290 [44819.903507] > > [] ? btrfs_release_path+0xe/0x80 [44819.903509] > > [] ? btrfs_start_dirty_block_groups+0x2da/0x420 > > [44819.903511] [] ? > > btrfs_commit_transaction+0x143/0x990 [44819.903514] > > [] ? kmem_cache_free+0x165/0x180 [44819.903516] > > [] ? btrfs_wait_ordered_range+0x7c/0x110 > > [44819.903518] [] ? btrfs_sync_file+0x286/0x360 > > [44819.903522] [] ? do_fsync+0x33/0x60 > > [44819.903524] [] ? SyS_fdatasync+0xa/0x10 > > [44819.903528] [] ? > > entry_SYSCALL_64_fastpath+0x13/0x8f [44819.903529] ---[ end trace > > 6944811e170a0e57 ]--- [44819.903531] BTRFS: error (device bcache2) > > in btrfs_run_delayed_refs:2963: errno=-17 Object already exists > > [44819.903533] BTRFS info (device bcache2): forced readonly > > I got the same error myself, with this stack trace: > > -- Logs begin at Fr 2016-04-01 17:07:28 CEST, end at Mi 2017-02-01 > 22:03:57 CET. -- > Feb 01 01:46:26 diefledermaus kernel: [ cut here > ] Feb 01 01:46:26 diefledermaus kernel: WARNING: CPU: 1 > PID: 16727 at fs/btrfs/extent-tree.c:2967 > btrfs_run_delayed_refs+0x278/0x2b0 Feb 01 01:46:26 diefledermaus > kernel: BTRFS: Transaction aborted (error -17) Feb 01 01:46:26 > diefledermaus kernel: BTRFS: error (device sdb2) in > btrfs_run_delayed_refs:2967: errno=-17 Object already exists Feb 01 > 01:46:27 diefledermaus kernel: BTRFS info (device sdb2): forced > readonly Feb 01 01:46:27 diefledermaus kernel: Modules linked in: msr > ctr ccm tun arc4 snd_hda_codec_idt applesmc snd_hda_codec_generic > input_polldev hwmon snd_hda_intel ath5k snd_hda_codec mac80211 > snd_hda_core ath snd_pcm cfg80211 snd_timer video acpi_cpufreq snd > backlight sky2 rfkill processor button soundcore sg usb_storage > sr_mod cdrom ata_generic pata_acpi uhci_hcd ahci libahci ata_piix > libata ehci_pci ehci_hcd Feb 01 01:46:27 diefledermaus kernel: CPU: 1 > PID: 16727 Comm: kworker/u4:0 Not tainted 4.9.6-gentoo #1 > Feb 01 01:46:27 diefledermaus kernel: Hardware name: Apple Inc. > Macmini2,1/Mac-F4208EAA, BIOS MM21.88Z.009A.B00.0706281359 > 06/28/07 Feb 01 01:46:27 diefledermaus kernel: Workqueue: > btrfs-extent-refs btrfs_extent_refs_helper > Feb 01 01:46:27 diefledermaus kernel: > 812cf739 c9000285fd60 > Feb 01 01:46:27 diefledermaus kernel: 8104908a > 8800428df1e0 c9000285fdb0 0020 > Feb 0
Re: mount option nodatacow for VMs on SSD?
Am Mon, 28 Nov 2016 01:38:29 +0100 schrieb Ulli Horlacher <frams...@rus.uni-stuttgart.de>: > On Sat 2016-11-26 (11:27), Kai Krakow wrote: > > > > I have vmware and virtualbox VMs on btrfs SSD. > > > As a side note: I don't think you can use "nodatacow" just for one > > subvolume while the other subvolumes of the same btrfs are mounted > > different. The wiki is just wrong here. > > > > The list of possible mount options in the wiki explicitly lists > > "nodatacow" as not working per subvolume - just globally for the > > whole fs. > > Thanks for pointing this out! > I have misunderstood this, first. You can, however, use chattr to make the subvolume root directory (that one where it is mounted) nodatacow (chattr +C) _before_ placing any files or directories in there. That way, newly created files and directories will inherit the flag. Take note that this flag can only applied to directories and empty (zero-sized) files. That way, you get the intended benefit and your next question applies a little less because: > Ok, then next question :-) > > What is better (for a single user workstation): using mount option > "autodefrag" or call "btrfs filesystem defragment -r" (-t ?) via > nightly cronjob? > > So far, I use neither. When using the above method to make your VM images nodatacow, the only fragmentation issue you need to handle is when doing snapshots. Snapshots are subject to copy-on-write. If you do heavy snapshotting, you'll be getting heavy fragmentation based on the write-patterns. I don't know if autodefrag will handle nodatacow files. You may want to use a dedupe utility after defragmentation, like duperemove (running it manually) or bees (a background daemon also trying to keep fragmentation low). If you are doing no or infrequent snapshots, I won't bother with manual defragging at all for your VM images since you're on SSD. If you aren't going to use snapshots at all, you even won't have to think about autodefrag, tho I still recommend to enable it (see post from Duncan). Manual defrag is a highly write-intensive operation, rewriting multiple gigabytes of data. I strongly recommend against using it on a daily basis for SSD. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mount option nodatacow for VMs on SSD?
Am Fri, 25 Nov 2016 09:28:40 +0100 schrieb Ulli Horlacher: > I have vmware and virtualbox VMs on btrfs SSD. > > I read in > https://btrfs.wiki.kernel.org/index.php/SysadminGuide#When_To_Make_Subvolumes > > certain types of data (databases, VM images and similar > typically big files that are randomly written internally) may require > CoW to be disabled for them. So for example such areas could be > placed in a subvolume, that is always mounted with the option > "nodatacow". > > Does this apply to SSDs, too? As a side note: I don't think you can use "nodatacow" just for one subvolume while the other subvolumes of the same btrfs are mounted different. The wiki is just wrong here. The list of possible mount options in the wiki explicitly lists "nodatacow" as not working per subvolume - just globally for the whole fs. -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: corrupt leaf, slot offset bad
Am Tue, 11 Oct 2016 07:09:49 -0700 schrieb Liu Bo: > On Tue, Oct 11, 2016 at 02:48:09PM +0200, David Sterba wrote: > > Hi, > > > > looks like a lot of random bitflips. > > > > On Mon, Oct 10, 2016 at 11:50:14PM +0200, a...@aron.ws wrote: > > > item 109 has a few strange chars in its name (and it's > > > truncated): 1-x86_64.pkg.tar.xz 0x62 0x14 0x0a 0x0a > > > > > > item 105 key (261 DIR_ITEM 54556048) itemoff 11723 > > > itemsize 72 location key (606286 INODE_ITEM 0) type FILE > > > namelen 42 datalen 0 name: > > > python2-gobject-3.20.1-1-x86_64.pkg.tar.xz item 106 key (261 > > > DIR_ITEM 56363628) itemoff 11660 itemsize 63 location key (894298 > > > INODE_ITEM 0) type FILE namelen 33 datalen 0 name: > > > unrar-1:5.4.5-1-x86_64.pkg.tar.xz item 107 key (261 DIR_ITEM > > > 66963651) itemoff 11600 itemsize 60 location key (1178 INODE_ITEM > > > 0) type FILE namelen 30 datalen 0 name: > > > glibc-2.23-5-x86_64.pkg.tar.xz item 108 key (261 DIR_ITEM > > > 68561395) itemoff 11532 itemsize 68 location key (660578 > > > INODE_ITEM 0) type FILE namelen 38 datalen 0 name: > > > squashfs-tools-4.3-4-x86_64.pkg.tar.xz item 109 key (261 DIR_ITEM > > > 76859450) itemoff 11483 itemsize 65 location key (2397184 > > > UNKNOWN.0 7091317839824617472) type 45 namelen 13102 datalen > > > 13358 name: 1-x86_64.pkg.tar.xzb > > > > namelen must be smaller than 255, but the number itself does not > > look like a bitflip (0x332e), the name looks like a fragment of. > > > > The location key is random garbage, likely an overwritten memory, > > 7091317839824617472 == 0x62696c010023 contains ascii 'bil', the > > key type is unknown but should be INODE_ITEM. > > > > > data > > > item 110 key (261 DIR_ITEM 9799832789237604651) itemoff > > > 11405 itemsize 62 > > > location key (388547 INODE_ITEM 0) type FILE > > > namelen 32 datalen 0 name: > > > intltool-0.51.0-1-any.pkg.tar.xz item 111 key (261 DIR_ITEM > > > 81211850) itemoff 11344 itemsize 131133 > > > > itemsize 131133 == 0x2003d is a clear bitflip, 0x3d == 61, > > corresponds to the expected item size. > > > > There's possibly other random bitflips in the keys or other > > structures. It's hard to estimate the damage and thus the scope of > > restorable data. > > It makes sense since this's a ssd we may have only one copy for > metadata. > > Thanks, > > -liubo >From this point of view it doesn't make sense to store only one copy of meta data on SSD... The bit flip probably happened in RAM when taking the other garbage into account, so dup meta data could have helped here. If the SSD firmware would collapse duplicate meta data into single blobs, that's perfectly fine. If the dup meta data arrives with bits flipped, it won't be deduplicated. So this is fine, too. BTW: I cannot believe that SSD firmwares really do the quite expensive job of deduplication other than maybe internal compression. Maybe there are some drives out there but most won't deduplicate. It's just too little gain for too much complexity. So I personally would always switch on duplicate meta data even for SSD. It shouldn't add to wear leveling too much if you do the usual SSD optimization anyways (like noatime). PS: I suggest doing an extensive memtest86 before trying any repairs on this system... Are you probably mixing different model DIMMs in dual channel slots? Most of the times I've seen bitflips, this was the culprit... -- Regards, Kai Replies to list-only preferred. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html