Re: System unable to mount partition after a power loss
On Thu, Dec 6, 2018 at 10:24 PM Doni Crosby wrote:
>
> All,
>
> I'm coming to you to see if there is a way to fix or at least recover
> most of the data I have from a btrfs filesystem. The system went down
> after both a breaker and the battery backup failed. I cannot currently
> mount the system, with the following error from dmesg:
>
> Note: The vda1 is just the entire disk being passed from the VM host
> to the VM, it's not an actual true virtual block device

This is qemu-kvm? What's the cache mode being used? It's possible the
usual write guarantees are thwarted by VM caching.

> btrfs check --recover also ends in a segmentation fault

I'm not familiar with a --recover option. The --repair option is flagged
with a warning in the man page:

    Warning
    Do not use --repair unless you are advised to do so by a developer
    or an experienced user.

> btrfs --version:
> btrfs-progs v4.7.3

Old version of progs. I suggest upgrading to 4.17.1 and running

btrfs insp dump-s -f /device/
btrfs rescue super-recover -v /device/
btrfs check --mode=lowmem /device/

These are all read-only commands. Please post the output to the list;
hopefully a developer will get around to looking at it.

It is safe to try:

mount -o ro,norecovery,usebackuproot /device/ /mnt/

If that works, I suggest updating your backup while it's still possible.

-- Chris Murphy
Re: Need help with potential ~45TB dataloss
On Tue, Dec 4, 2018 at 3:09 AM Patrick Dijkgraaf wrote:
>
> Hi Chris,
>
> See the output below. Any suggestions based on it?

If they're SATA drives, they may not support SCT ERC; and if they're
SAS, depending on what controller they're behind, smartctl might need a
hint to properly ask the drive for SCT ERC status.

The simplest way to know is to do 'smartctl -x' on one drive, assuming
they're all the same basic make/model other than size.

-- Chris Murphy
Re: experiences running btrfs on external USB disks?
On Mon, Dec 3, 2018 at 10:44 PM Tomasz Chmielewski wrote:
>
> I'm trying to use btrfs on an external USB drive, without much success.
>
> When the drive is connected for 2-3+ days, the filesystem gets remounted
> readonly, with BTRFS saying "IO failure":
>
> [77760.444607] BTRFS error (device sdb1): bad tree block start, want 378372096 have 0
> [77760.550933] BTRFS error (device sdb1): bad tree block start, want 378372096 have 0
> [77760.550972] BTRFS: error (device sdb1) in __btrfs_free_extent:6804: errno=-5 IO failure
> [77760.550979] BTRFS info (device sdb1): forced readonly
> [77760.551003] BTRFS: error (device sdb1) in btrfs_run_delayed_refs:2935: errno=-5 IO failure
> [77760.553223] BTRFS error (device sdb1): pending csums is 4096
>
> Note that there are no other kernel messages (i.e. that would indicate a
> problem with disk, cable disconnection etc.).
>
> The load on the drive itself can be quite heavy at times (i.e. 100% IO
> for 1-2 h and more) - can it contribute to the problem (i.e. btrfs
> thinks there is some timeout somewhere)?
>
> Running 4.19.6 right now, but was experiencing the issue also with 4.18
> kernels.
>
> # btrfs device stats /data
> [/dev/sda1].write_io_errs    0
> [/dev/sda1].read_io_errs     0
> [/dev/sda1].flush_io_errs    0
> [/dev/sda1].corruption_errs  0
> [/dev/sda1].generation_errs  0

Hard to say without a complete dmesg, but errno=-5 IO failure is pretty
much some kind of hardware problem in my experience. I haven't seen it
be a bug.

-- Chris Murphy
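Since those per-device counters are the first thing to check, here is a
minimal sketch (my own illustration, not an existing tool) that sums the
error counters from the plain-text `btrfs device stats` output format
quoted above, so a monitoring script could flag a device:

```python
# Sketch: sum the error counters from `btrfs device stats` text output.
# Assumes lines of the form "[/dev/sda1].write_io_errs   0", as quoted
# in the message above.

def total_errors(stats_text: str) -> int:
    """Return the sum of all *_errs counters found in the output."""
    total = 0
    for line in stats_text.splitlines():
        if "_errs" in line:
            # The last whitespace-separated field is the counter value.
            total += int(line.split()[-1])
    return total

sample = """\
[/dev/sda1].write_io_errs    0
[/dev/sda1].read_io_errs     0
[/dev/sda1].flush_io_errs    0
[/dev/sda1].corruption_errs  0
[/dev/sda1].generation_errs  0
"""

print(total_errors(sample))  # 0 -- all counters clean, as in the report
```

An all-zero total, as here, is consistent with the errors happening below
Btrfs (cable, bridge, controller) rather than being recorded by it.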
Re: Ran into "invalid block group size" bug, unclear how to proceed.
On Mon, Dec 3, 2018 at 8:32 PM Mike Javorski wrote:
>
> Need a bit of advice here ladies / gents. I am running into an issue
> which Qu Wenruo seems to have posted a patch for several weeks ago
> (see https://patchwork.kernel.org/patch/10694997/).
>
> Here is the relevant dmesg output which led me to Qu's patch.
>
> [   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2 block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104, invalid block group size, have 10804527104 expect (0, 10737418240]
> [   10.032493] BTRFS error (device sdb): failed to read block groups: -5
> [   10.053365] BTRFS error (device sdb): open_ctree failed
>
> This server has a 16 disk btrfs filesystem (RAID6) which I boot
> periodically to btrfs-send snapshots to. This machine is running
> ArchLinux and I had just updated to their latest 4.19.4 kernel
> package (from 4.18.10 which was working fine). I've tried updating to
> the 4.19.6 kernel that is in testing, but that doesn't seem to resolve
> the issue. From what I can see on kernel.org, the patch above is not
> pushed to stable or to Linus' tree.
>
> At this point the question is what to do. Is my FS toast? Could I
> revert to the 4.18.10 kernel and boot safely? I don't know if the 4.19
> boot process may have flipped some bits which would make reverting
> problematic.

That patch is not yet merged in linux-next, so to use it you'd need to
apply it yourself and compile a kernel. I can't tell for sure if it'd
help. But the less you change the file system, the better the chance of
saving it. I have no idea why there'd be a corrupt leaf just due to a
kernel version change, though.

Needless to say, raid56 just seems fragile once it runs into any kind
of trouble. I personally wouldn't boot off it at all. I would only
mount it from another system, ideally an installed system, but a live
system with the kernel versions you need would also work. That way you
can get more information without making changes, whereas booting from
it will almost immediately mount rw (if mount succeeds at all) and
write a bunch of changes to the file system.

Whether it's a case of 4.18.10 not detecting corruption that 4.19 sees,
or 4.19 already caused it, the best chance is to not mount it rw, and
not run check --repair, until you get some feedback from a developer.

The thing I'd like to see is

# btrfs rescue super-recover -v /anydevice/
# btrfs insp dump-s -f /anydevice/

The first command will tell us if all the supers are the same and valid
across all devices. And the second one, hopefully pointed at a device
with a valid super, will tell us if there's a log root value other than
0. Both of those are read-only commands.

-- Chris Murphy
Re: Need help with potential ~45TB dataloss
Also useful information for an autopsy, perhaps not for fixing, is to
know whether the SCT ERC value for every drive is less than the kernel's
SCSI driver block device command timeout value.

It's super important that the drive reports an explicit read failure
before the read command is considered failed by the kernel. If the drive
is still trying to do a read, and the kernel command timer times out,
it'll just do a reset of the whole link and we lose the outcome of the
hanging command. Only upon an explicit read error can Btrfs, or md RAID,
know which device and physical sector has a problem, and therefore how
to reconstruct the block and fix the bad sector with a write of known
good data.

smartctl -l scterc /device/

and

cat /sys/block/sda/device/timeout

Only if SCT ERC is enabled with a value below 30, or if the kernel
command timer is changed to be well above 30 (like 180, which is
absolutely crazy but a separate conversation), can we be sure that there
haven't just been resets going on for a while, preventing bad sectors
from being fixed up all along, which can contribute to the problem.

This comes up on the linux-raid (mainly md driver) list all the time,
and it contributes to lost RAID all the time. And arguably it leads to
unnecessary data loss even in the single device desktop/laptop use case
as well.

Chris Murphy
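The timing relationship above can be sketched numerically. Note the
units: smartctl reports SCT ERC in deciseconds (so 70 means 7.0
seconds), while the sysfs timeout is in seconds (default 30). The
function below is my own illustration of the rule just described, not an
existing tool:

```python
# Sketch of the SCT ERC vs. kernel command timer check described above.
# SCT ERC is reported by smartctl in deciseconds (70 == 7.0 s); the
# kernel's command timeout (/sys/block/sdX/device/timeout) is in
# seconds, 30 by default. Pass erc_deciseconds=None for a drive with
# SCT ERC disabled or unsupported.

def erc_configuration_ok(erc_deciseconds, kernel_timeout_seconds=30):
    """True if the drive will report an explicit read failure before
    the kernel gives up and resets the whole link."""
    if erc_deciseconds is None:
        # Drive may retry a bad sector for minutes; only a very large
        # kernel timeout (the "crazy" 180 s workaround) avoids resets.
        return kernel_timeout_seconds >= 180
    return erc_deciseconds / 10.0 < kernel_timeout_seconds

# Common NAS/enterprise default of 7.0 s vs. the 30 s kernel timeout:
print(erc_configuration_ok(70))    # True  -- drive errors out first
# SCT ERC disabled with the default 30 s kernel timeout:
print(erc_configuration_ok(None))  # False -- link resets likely
```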
Re: BTRFS Mount Delay Time Graph
On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton wrote:
>
> Le 03/12/2018 à 20:56, Lionel Bouton a écrit :
> > [...]
> > Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> > tuning of the io queue (switching between classic io-schedulers and
> > blk-mq ones in the virtual machines) and BTRFS mount options
> > (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> > in mount time (I managed to reduce the mount of IO requests
>
> Sent too quickly: I meant to write "managed to reduce by half the number
> of IO write requests for the same amount of data written"
>
> > by half on
> > one server in production though although more tests are needed to
> > isolate the cause).

Interesting. I wonder if it's ssd_spread or space_cache=v2 that reduces
the writes by half, or by how much for each? That's a major reduction in
writes, and suggests it might be possible to optimize further, to help
mitigate the wandering trees impact.

-- Chris Murphy
Re:
On Thu, Nov 22, 2018 at 11:41 PM Andy Leadbetter wrote:
>
> I have a failing 2TB disk that is part of a 4 disk RAID 6 system. I
> have added a new 2TB disk to the computer, and started a BTRFS replace
> for the old and new disk. The process starts correctly however some
> hours into the job, there is an error and kernel oops. relevant log
> below.

The relevant log is the entire dmesg, not a snippet. It's decently
likely there's more than one thing going on here.

We also need full output of 'smartctl -x' for all four drives, and also
'smartctl -l scterc' for all four drives, and also
'cat /sys/block/sda/device/timeout' for all four drives. And which
bcache mode you're using.

The call trace provided is from kernel 4.15, which is sufficiently long
ago that I think any dev working on raid56 might want to see where it's
getting tripped up on something a lot newer, and this is why:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v4.19.3&id2=v4.15.1

That's a lot of changes in just the raid56 code between 4.15 and 4.19.
And then in your call trace, btrfs_dev_replace_start is found in
dev-replace.c, which likewise has a lot of changes.

But then also, I think 4.15 might still be in the era where it was not
recommended to use 'btrfs dev replace' for raid56, only non-raid56. I'm
not sure if the problems with device replace were fixed, and if they
were, whether the fixes were kernel or progs side. Anyway, the latest I
recall, it was recommended on raid56 to 'btrfs dev add' then 'btrfs dev
remove'.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/dev-replace.c?id=v4.19.3&id2=v4.15.1

And that's only a few hundred changes for each. Check out inode.c -
there are over 2000 changes.

> The disks are configured on top of bcache, in 5 arrays with a small
> 128GB SSD cache shared. The system in this configuration has worked
> perfectly for 3 years, until 2 weeks ago csum errors started
> appearing. I have a crashplan backup of all files on the disk, so I
> am not concerned about data loss, but I would like to avoid rebuilding
> the system.

btrfs-progs 4.17 still considers raid56 experimental, not for production
use. And three years ago the current upstream kernel was 4.3, so I'm
gonna guess the kernel history of this file system goes back older than
that, very close to the birth of the raid56 code. And then adding bcache
to this mix just makes it all the more complicated.

>
> btrfs dev stats shows
> [/dev/bcache0].write_io_errs    0
> [/dev/bcache0].read_io_errs     0
> [/dev/bcache0].flush_io_errs    0
> [/dev/bcache0].corruption_errs  0
> [/dev/bcache0].generation_errs  0
> [/dev/bcache1].write_io_errs    0
> [/dev/bcache1].read_io_errs     20
> [/dev/bcache1].flush_io_errs    0
> [/dev/bcache1].corruption_errs  0
> [/dev/bcache1].generation_errs  14
> [/dev/bcache3].write_io_errs    0
> [/dev/bcache3].read_io_errs     0
> [/dev/bcache3].flush_io_errs    0
> [/dev/bcache3].corruption_errs  0
> [/dev/bcache3].generation_errs  19
> [/dev/bcache2].write_io_errs    0
> [/dev/bcache2].read_io_errs     0
> [/dev/bcache2].flush_io_errs    0
> [/dev/bcache2].corruption_errs  0
> [/dev/bcache2].generation_errs  2

3 of 4 drives have at least one generation error. While there are no
corruptions reported, generation errors can be really tricky to recover
from at all. If only one device had only read errors, this would be a
lot less difficult.

> I've tried the latest kernel, and the latest tools, but nothing will
> allow me to replace, or delete the failed disk.

If the file system is mounted, I would try to make a local backup ASAP
before you lose the whole volume. Whether it's an LVM pool of two drives
(linear/concat) with XFS, or Btrfs -dsingle -mraid1 (also basically a
concat), doesn't really matter; I'd get whatever you can off the drive.

I expect avoiding a rebuild in some form or another is very wishful
thinking and not very likely. Every change made to the file system,
whether repair attempts or other writes, decreases the chance of
recovery.

-- Chris Murphy
Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3
On Thu, Nov 22, 2018 at 6:07 AM Tomasz Chmielewski wrote:
>
> On 2018-11-22 21:46, Nikolay Borisov wrote:
>
> >> # echo w > /proc/sysrq-trigger
> >>
> >> # dmesg -c
> >> [  931.585611] sysrq: SysRq : Show Blocked State
> >> [  931.585715]   task            PC stack   pid father
> >> [  931.590168] btrfs-cleaner   D    0  1340     2 0x8000
> >> [  931.590175] Call Trace:
> >> [  931.590190]  __schedule+0x29e/0x840
> >> [  931.590195]  schedule+0x2c/0x80
> >> [  931.590199]  schedule_timeout+0x258/0x360
> >> [  931.590204]  io_schedule_timeout+0x1e/0x50
> >> [  931.590208]  wait_for_completion_io+0xb7/0x140
> >> [  931.590214]  ? wake_up_q+0x80/0x80
> >> [  931.590219]  submit_bio_wait+0x61/0x90
> >> [  931.590225]  blkdev_issue_discard+0x7a/0xd0
> >> [  931.590266]  btrfs_issue_discard+0x123/0x160 [btrfs]
> >> [  931.590299]  btrfs_discard_extent+0xd8/0x160 [btrfs]
> >> [  931.590335]  btrfs_finish_extent_commit+0xe2/0x240 [btrfs]
> >> [  931.590382]  btrfs_commit_transaction+0x573/0x840 [btrfs]
> >> [  931.590415]  ? btrfs_block_rsv_check+0x25/0x70 [btrfs]
> >> [  931.590456]  __btrfs_end_transaction+0x2be/0x2d0 [btrfs]
> >> [  931.590493]  btrfs_end_transaction_throttle+0x13/0x20 [btrfs]
> >> [  931.590530]  btrfs_drop_snapshot+0x489/0x800 [btrfs]
> >> [  931.590567]  btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs]
> >> [  931.590607]  cleaner_kthread+0x136/0x160 [btrfs]
> >> [  931.590612]  kthread+0x120/0x140
> >> [  931.590646]  ? btree_submit_bio_start+0x20/0x20 [btrfs]
> >> [  931.590658]  ? kthread_bind+0x40/0x40
> >> [  931.590661]  ret_from_fork+0x22/0x40
> >
> > It seems your filesystem is mounted with the DISCARD option, meaning
> > every delete will result in a discard; this is highly suboptimal for
> > ssd's. Try remounting the fs without the discard option and see if it
> > helps. Generally for discard you want to submit it in big batches
> > (what fstrim does) so that the ftl on the ssd can apply any
> > optimisations it might have up its sleeve.
>
> Spot on!
>
> Removed "discard" from fstab and added "ssd", rebooted - no more
> btrfs-cleaner running.
>
> Do you know if the issue you described ("discard this is highly
> suboptimal for ssd") affects other filesystems as well to a similar
> extent? I.e. if using ext4 on ssd?

Quite a lot of activity on ext4 and XFS is overwrites, so discard isn't
needed. And it might be that discard is subject to delays. On Btrfs,
it's almost immediate, to the degree that on a couple SSDs I've tested,
stale trees referenced exclusively by the most recent backup tree
entries in the superblock are already zeros. That functionally means no
automatic recoveries at mount time if there's a problem with any of the
current trees.

I was using it for about a year to no ill effect, BUT without a lot of
file deletions either. I wouldn't recommend it, and instead suggest
enabling fstrim.timer, which by default runs fstrim.service once a week
(which in turn issues fstrim, I think on all mounted volumes).

I am a bit more concerned about the read errors you had that were being
corrected automatically. The corruption suggests a firmware bug related
to trim. I'd check the affected SSD firmware revision and consider
updating it (only after a backup; it's plausible the firmware update is
not guaranteed to be data safe). Does the volume use DUP or raid1
metadata? I'm not sure how it's correcting for these problems otherwise.

-- Chris Murphy
Re: BTRFS on production: NVR 16+ IP Cameras
On Thu, Nov 15, 2018 at 10:39 AM Juan Alberto Cirez wrote:
>
> Is BTRFS mature enough to be deployed on a production system to underpin
> the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)?
>
> Based on our limited experience with BTRFS (1+ year) under the above
> scenario the answer seems to be no; but I wanted to ask the community
> at large for their experience before making a final decision to hold off
> on deploying BTRFS on production systems.
>
> Let us be clear: We think BTRFS has great potential, and as it matures
> we will continue to watch its progress, so that at some future point we
> can return to using it.
>
> The issue has been the myriad of problems we have encountered when
> deploying BTRFS as the storage fs for the NVR/VMS in cases where the
> camera count exceeds 10: Corrupted file systems, sudden read-only file
> system, re-balance kernel panics, broken partitions, etc.

Performance problems are separate from reliability problems. No matter
what, there shouldn't be corruptions or failures when your process is
writing through the Btrfs kernel driver. Period. So you've either got
significant hardware/firmware problems as the root cause, or your use
case is exposing Btrfs bugs.

But the burden is on you to provide sufficient details about the
hardware and storage stack configuration, including kernel and
btrfs-progs versions and the mkfs and mount options being used. Without
a way for a developer to reproduce your problem, it's unlikely the
source of the problem can be discovered and fixed.

> So, again, the question is: is BTRFS mature enough to be used in such
> use case and if so, what approach can be used to mitigate such issues.

What format are the cameras writing out in? It matters whether this is a
continuously appending format, or whether it's writing individual JPEG
files, one per frame, or whatever. What rate, what size, and any other
concurrent operations, etc.

-- Chris Murphy
Re: Where is my disk space ?
On Thu, Nov 8, 2018 at 2:27 AM, Barbet Alain wrote:
> Hi !
> Just to give you the end of the story: I moved my /var/lib/docker to
> my home (other partition), and my space came back...

I'm not sure why that would matter. Both btrfs du and regular du showed
only ~350M used in /var, which is about what I'd expect. And the 'btrfs
sub list' output doesn't show any subvolumes/snapshots for Docker.

The upstream Docker behavior on Btrfs is that it uses subvolumes and
snapshots for everything; quickly you'll see a lot of them. However,
many distributions override the default Docker behavior, e.g. with
Docker storage setup, and will cause it to always favor a particular
driver. For example the Docker overlay2 driver, which leverages kernel
overlayfs, will work on any file system including Btrfs. I'm not
exactly sure where the upper dirs are stored, but I'd be surprised if
they're not in /var.

Anyway, if you're using Docker, moving stuff around will almost
certainly break it. And as I'm an extreme expert in messing up Docker
storage, I can vouch for the strategy of stopping the docker daemon,
recursively deleting everything in /var/lib/docker/, and then starting
Docker. Now you get to go fetch all your images again.

And anyway, you shouldn't be storing any data in the containers; they
should be throwaway things. Important data should be stored elsewhere,
including any state information for the container. :-D Avoid container
misery by having a workflow that expects containers to be transient,
disposable objects.

-- Chris Murphy
Re: BTRFS did it's job nicely (thanks!)
On Mon, Nov 5, 2018 at 6:27 AM, Austin S. Hemmelgarn wrote:
> On 11/4/2018 11:44 AM, waxhead wrote:
>>
>> Sterling Windmill wrote:
>>>
>>> Out of curiosity, what led to you choosing RAID1 for data but RAID10
>>> for metadata?
>>>
>>> I've flip flopped between these two modes myself after finding out
>>> that BTRFS RAID10 doesn't work how I would've expected.
>>>
>>> Wondering what made you choose your configuration.
>>>
>>> Thanks!
>>
>> Sure,
>>
>> The "RAID"1 profile for data was chosen to maximize disk space
>> utilization since I got a lot of mixed size devices.
>>
>> The "RAID"10 profile for metadata was chosen simply because it *feels*
>> a bit faster for some of my (previous) workload, which was reading a
>> lot of small files (which I guess were embedded in the metadata).
>> While I never measured any performance increase, the system simply
>> felt smoother (which is strange since "RAID"10 should hog more disks
>> at once).
>>
>> I would love to try "RAID"10 for both data and metadata, but I have to
>> delete some files first (or add yet another drive).
>>
>> Would you like to elaborate a bit more yourself about how BTRFS
>> "RAID"10 does not work as you expected?
>>
>> As far as I know BTRFS' version of "RAID"10 means it ensures 2 copies
>> (1 replica) are striped over as many disks as it can (as long as there
>> is free space).
>>
>> So if I am not terribly mistaken, a "RAID"10 with 20 devices will
>> stripe over (20/2) x 2, and if you run out of space on 10 of the
>> devices it will continue to stripe over (5/2) x 2. So your stripe
>> width varies with the available space, essentially... I may be
>> terribly wrong about this (until someone corrects me, that is...)
>
> He's probably referring to the fact that instead of there being a
> roughly 50% chance of it surviving the failure of at least 2 devices,
> like classical RAID10 is technically able to do, it's currently
> functionally 100% certain it won't survive more than one device
> failing.
Right. Classic RAID10 is *two block device* copies, where you have
mirror1 drives and mirror2 drives, and each mirror pair becomes a single
virtual block device that is then striped across. If you lose a single
mirror1 drive, its mirror2 data is available and statistically unlikely
to also go away.

Whereas with Btrfs raid10, it's *two block group* copies. And it is the
block group that's striped. That means block group copy 1 is striped
across half the available drives (at the time the bg is allocated), and
block group copy 2 is striped across the other drives. When a drive
dies, there is no single remaining drive that contains all the missing
copies; they're distributed. Which means in a 2-drive failure you've got
a very good chance of losing two copies of either metadata or data or
both. While I'm not certain it's 100% not survivable, the real gotcha is
that it's possible, maybe even likely, that it'll mount and seem to work
fine, but as soon as it runs into two missing bg's, it'll face plant.

-- Chris Murphy
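The odds difference can be illustrated with a small enumeration plus a
simplified model. This is my own sketch of the block-group placement
described above, not actual btrfs allocator code, and the random-halves
model is an approximation:

```python
import itertools
import random

N_DRIVES = 20

# Classic RAID10: drives paired (0,1), (2,3), ... A 2-drive failure is
# fatal only when both failed drives are in the same mirror pair.
fatal = sum(1 for a, b in itertools.combinations(range(N_DRIVES), 2)
            if a // 2 == b // 2)
total = N_DRIVES * (N_DRIVES - 1) // 2
print(f"classic RAID10: {fatal}/{total} two-drive failures are fatal")
# -> 10/190, i.e. ~95% of two-drive failures are survivable

# Btrfs raid10, simplified: each block group stripes copy 1 across a
# random half of the drives and copy 2 across the other half. A 2-drive
# failure loses a block group when the two failed drives land in
# different halves for that group.
random.seed(1)
halves = [frozenset(random.sample(range(N_DRIVES), N_DRIVES // 2))
          for _ in range(100)]  # a filesystem with 100 block groups
failed = (0, 1)
lost = any((failed[0] in h) != (failed[1] in h) for h in halves)
print("btrfs raid10 model: data lost after 2-drive failure:", lost)
```

With any realistic number of block groups, some group almost surely puts
the two failed drives in opposite halves, which is why a second failure
is effectively always fatal even though no single pair of drives is
special.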
Re: Where is my disk space ?
Since you don't have any snapshots, you could also find this
conventionally:

# du -sh /*

Chris Murphy
Re: Where is my disk space ?
On Tue, Oct 30, 2018 at 4:44 PM, Barbet Alain wrote:
> Thanks for the answer!
> alian@alian:~> sudo btrfs sub list -ta /
> [sudo] Mot de passe de root :
> ID  gen   top level path
> --  ---   --------- ----
> 257 79379 5         /@
> 258 79386 257       @/var
> 259 79000 257       @/usr/local
> 260 79376 257       @/tmp
> 261 79001 257       @/srv
> 262 79062 257       @/root
> 263 79001 257       @/opt
> 264 78898 257       @/boot/grub2/x86_64-efi
> 265 78933 257       @/boot/grub2/i386-pc
>
> Yes it's openSUSE, but I don't see any snapper config enabled.
> For memory: I use docker, which filled my disk; I removed the
> subvolume, but it looks like something is missing somewhere.

Try

mount -o subvolid=5 /mnt
cd /mnt
btrfs fi du -s *

Maybe that will help reveal where it's hiding. It's possible btrfs fi du
does not cross bind mounts. I know the Total column does include amounts
in nested subvolumes.

-- Chris Murphy
Re: Salvage files from broken btrfs
On Tue, Oct 30, 2018 at 4:11 PM, Mirko Klingmann wrote:
> Hi all,
>
> my btrfs root file system on a SD card broke down and did not mount
> anymore.

It might mount with -o ro,nologreplay

Typically an SD card will break in a way that it can't write, and mount
will just hang (with mmcblk errors). Mounting with both ro and
nologreplay will ensure no writes are needed, allowing the mount to
succeed. Of course any changes that are in the log tree will be missing,
so recent transactions may be unrecoverable, but so far I've had good
luck recovering from broken SD cards this way.

-- Chris Murphy
Re: Where is my disk space ?
On Tue, Oct 30, 2018 at 9:17 AM, Barbet Alain wrote:
> Hi,
> I experienced a disk out of space issue:
> alian:~ # df -h
> Filesystem      Size  Used Avail Use% Mounted on
> devtmpfs        7.8G     0  7.8G   0% /dev
> tmpfs           7.8G   47M  7.8G   1% /dev/shm
> tmpfs           7.8G   18M  7.8G   1% /run
> tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
> /dev/sda6        41G   35G  5.1G  88% /
> /dev/sda6        41G   35G  5.1G  88% /var
> /dev/sda6        41G   35G  5.1G  88% /root
> /dev/sda6        41G   35G  5.1G  88% /srv
> /dev/sda6        41G   35G  5.1G  88% /opt
> /dev/sda6        41G   35G  5.1G  88% /boot/grub2/i386-pc
> /dev/sda6        41G   35G  5.1G  88% /usr/local
> /dev/sda6        41G   35G  5.1G  88% /tmp
> /dev/sda6        41G   35G  5.1G  88% /boot/grub2/x86_64-efi
> /dev/sda7       424G  200G  225G  48% /home
>
> It says I use 35G of 41G. But I have only 5.8G of data:
> alian:~ # btrfs fi du -s /
>      Total   Exclusive  Set shared  Filename
>    5.84GiB     5.84GiB       0.00B  /
> alian:/ # du -h --exclude ./home --max-depth=0
> 6.2G    .

I suspect there are snapshots taking up space that are not located in
the search path starting at /

What do you get for:

$ sudo btrfs sub list -ta /

Is this an openSUSE system? If snapper is enabled, you'll need to ask it
to delete some of the snapshots to free up space, rather than doing it
with btrfs user space tools.

> alian:/ # btrfs fi df /
> Data, single: total=35.00GiB, used=34.18GiB
> System, DUP: total=32.00MiB, used=16.00KiB
> Metadata, DUP: total=384.00MiB, used=216.75MiB
> GlobalReserve, single: total=22.05MiB, used=0.00B
>
> I tried running btrfs balance multiple times with various parameters
> but it doesn't change anything, nor does trying btrfs check in single
> user mode.
>
> Where are my 30G missing?

-- Chris Murphy
Re: Understanding "btrfs filesystem usage"
On Tue, Oct 30, 2018 at 10:10 AM, Ulli Horlacher wrote:
>
> On Mon 2018-10-29 (17:57), Remi Gauvin wrote:
>> On 2018-10-29 02:11 PM, Ulli Horlacher wrote:
>>
>> > I want to know how much free space is left and have problems
>> > interpreting the output of:
>> >
>> > btrfs filesystem usage
>> > btrfs filesystem df
>> > btrfs filesystem show
>>
>> In my not so humble opinion, the filesystem usage command has the
>> easiest to understand output. It lays out all the pertinent
>> information.
>>
>> You can clearly see 825GiB is allocated, with 494GiB used; therefore,
>> filesystem show is actually using the "Allocated" value as "Used".
>> Allocated can be thought of as "Reserved For".
>
> And what is "Device unallocated"? Not reserved?

That's a reasonable interpretation. Unallocated space is space that's
not used for anything: no data, no metadata, and it isn't referenced by
any block group. It's not a relevant number day to day; I'd say it's
advanced, leaning toward esoteric, knowledge of Btrfs internals.

At this point I'd like to see a simpler output by default, and have a
verbose option for advanced users, and an export option that spits out a
superset of all available information in a format parsable by scripts.
But I know there are other projects that depend on btrfs user space
output, rather than having something specifically invented for them
that's easily parsed and can be kept consistent and extensible, separate
from human user consumption. Oh well!

>> The disparity between 498GiB used and 823GiB is pretty high. This is
>> probably the result of using an SSD with an older kernel. If your
>> kernel is not very recent (sorry, I forget where this was fixed,
>> somewhere around 4.14 or 4.15), then consider mounting with the nossd
>> option.
>
> I am running kernel 4.4 (it is a Ubuntu 16.04 system)
> But /local is on a SSD. Should I really use the nossd mount option?!

Yes. But it's not a file system integrity suggestion, it's an
optimization.

>> You can improve this by running a balance.
>>
>> Something like:
>> btrfs balance start -dusage=55
>
> I run balance via cron weekly (adapted from
> https://software.opensuse.org/package/btrfsmaintenance)

With a newer kernel you can probably reduce this further, depending on
your workload and use case. And optionally follow it up with executing
fstrim, or just enable fstrim.timer (we don't recommend using the
discard mount option for most use cases, as it too aggressively discards
very recently stale Btrfs metadata and can make recovery from crashes
harder).

There is a trim bug that causes FITRIM to only get applied to
unallocated space on older file systems that have been balanced such
that block group logical addresses are outside the physical address
space of the device, which prevents the free space inside such block
groups from being trimmed. Looks like this will be fixed in kernel
4.20/5.0.

-- Chris Murphy
Re: Kernel crash related to LZO compression
On Thu, Oct 25, 2018 at 9:56 AM, Dmitry Katsubo wrote:
> Dear btrfs community,
>
> My excuses for the dumps from a rather old kernel (4.9.25);
> nevertheless I wonder about your opinion on the kernel crashes
> reported below.
>
> As I understand the situation (correct me if I am wrong), it happened
> that some data block became corrupted, which resulted in the following
> kernel trace during the boot:
>
> kernel BUG at /build/linux-fB36Cv/linux-4.9.25/fs/btrfs/extent_io.c:2318!
> invalid opcode: [#1] SMP
> Call Trace:
> [] ? end_bio_extent_readpage+0x4e9/0x680 [btrfs]
> [] ? end_compressed_bio_read+0x3b/0x2d0 [btrfs]
> [] ? btrfs_scrubparity_helper+0xce/0x2d0 [btrfs]
> [] ? process_one_work+0x141/0x380
> [] ? worker_thread+0x41/0x460
> [] ? kthread+0xb4/0xd0
> [] ? process_one_work+0x380/0x380
> [] ? kthread_park+0x50/0x50
> [] ? ret_from_fork+0x1b/0x28
>
> The problematic file turned out to be the one used by systemd-journald
> /var/log/journal/c496cea41ebc4700a0dfaabf64a21be4/system.journal
> which it was trying to read (or append to) during the boot, and that
> was causing the system crash (see attached bootN_dmesg.txt).
>
> I rebooted in safe mode and tried to copy the data from this partition
> to another location using btrfs-restore, however the kernel was
> crashing as well, with a slightly different symptom (see attached
> copyN_dmesg.txt):
>
> Call Trace:
> [] ? lzo_decompress_biovec+0x1b0/0x2b0 [btrfs]
> [] ? vmalloc+0x38/0x40
> [] ? end_compressed_bio_read+0x265/0x2d0 [btrfs]
> [] ? btrfs_scrubparity_helper+0xce/0x2d0 [btrfs]
> [] ? process_one_work+0x141/0x380
> [] ? worker_thread+0x41/0x460
> [] ? kthread+0xb4/0xd0
> [] ? ret_from_fork+0x1b/0x28
>
> Just to keep away from the problem, I've removed this file and also
> removed the "compress=lzo" mount option.
>
> Are there any updates / fixes done in that area? Is the lzo option
> safe to use?

It should be safe even with that kernel. I'm not sure this is
compression related. There is a corruption bug related to inline extents
that had been fairly elusive, but I think it's fixed now. I haven't run
into it though.

I would say the first step, no matter what, if you're using an older
kernel, is to boot current Fedora or Arch live or install media, mount
the Btrfs, try to read the problem files, and see if the problem still
happens. I can't even begin to estimate the tens of thousands of line
changes since kernel 4.9.

What profile are you using for this Btrfs? Is this a raid56? What do you
get for 'btrfs fi us ' ?

-- Chris Murphy
Re: Failover for unattached USB device
It can detect such problems in both metadata and data, and should be
able to avoid them in the first place due to always-on COW (as long as
you haven't disabled it). But there is some evidence that old Btrfs bugs
could induce corruption in metadata and not turn into a problem until
long afterward. A scrub only checks whether metadata and its checksum
match up (corruption detection elsewhere in the storage stack), so a
scrub most often can't find bugs that cause corruption.

Your best bet for sidestepping such problems is backups, and using the
most recent kernel you can. If you encounter some problem that might be
a bug, inevitably you'll need to test with a newer kernel version anyway
to see if it's still a bug. Each merge cycle involves thousands of lines
of changes just for Btrfs, and there's more to the storage stack in the
kernel than just Btrfs.

In your use case with mostly reads, where you probably also don't care
about write performance, you could consider mounting with notreelog.
This drops the use of the tree log, which is used to improve the
performance of operations that use fsync. With this option, transactions
calling fsync() fall back to sync(), so it's safer but slower.

-- Chris Murphy
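For context on what notreelog trades away: the path it affects is the
ordinary fsync() sequence an application uses to make one file durable.
A generic sketch of that pattern (not btrfs-specific code, and the
helper name is mine):

```python
import os
import tempfile

# Generic sketch of the fsync() pattern the tree log exists to speed
# up: write a file's data, then force it (and its metadata) to stable
# storage before continuing. On btrfs, -o notreelog turns each such
# fsync into a full transaction commit instead of a small log-tree
# write -- safer but slower, as noted above.

def durable_write(path: str, data: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # the call whose cost the notreelog option changes
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "state.bin")
    durable_write(p, b"hello")
    with open(p, "rb") as f:
        print(f.read())  # b'hello'
```

A workload that rarely calls fsync() (mostly reads, bulk copies) barely
notices the difference, which is why the suggestion fits this use case.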
Re: Failover for unattached USB device
On Wed, Oct 24, 2018 at 9:03 AM, Dmitry Katsubo wrote: > On 2018-10-17 00:14, Dmitry Katsubo wrote: >> >> As a workaround I can monitor dmesg output but: >> >> 1. It would be nice if I could tell btrfs that I would like to mount >> read-only >> after a certain error rate per minute is reached. >> 2. It would be nice if btrfs could detect that both drives are not >> available and >> unmount (as mount read-only won't help much) the filesystem. >> >> Kernel log for Linux v4.14.2 is attached. > > > I wonder if somebody could further advise on a workaround. I understand that > running a > btrfs volume over USB devices is not good, but I think btrfs could play some > role > as well. I think the best we can expect in the short term is that Btrfs goes read-only before the file system becomes corrupted in a way it can't recover from with a normal mount. And I'm not certain it is in this state of development right now for all cases. And I say the same thing for other file systems as well. Running Btrfs on USB devices is fine, so long as they're well behaved. I have such a setup with USB 3.0 devices. Perhaps I got a bit lucky, because there are a lot of known bugs with USB controllers, USB bridge chipsets, and USB hubs. Having user-definable switches for when to go read-only is, I think, misleading to the user, and very likely will mislead the file system. The file system needs to go read-only when it gets confused, period. It doesn't matter what the error rate is. The workaround is really to do the hard work of making the devices stable, not asking Btrfs to paper over known unstable hardware. In my case, I started out with rare disconnects and resets with directly attached drives. This was a couple of years ago. It was a Btrfs raid1 setup, and the drives would not go missing at the same time, but both would just drop off from time to time. Btrfs would complain of dropped writes; I vaguely remember it going read only. 
But normal mounts worked, sometimes with scary errors but always finding a good copy on the other drive, and doing passive fixups. Scrub would always fix up the rest. I'm still using those same file systems on those devices, but now they go through a dyconn USB 3.0 hub with a decently good power supply. I originally thought the drop offs were power related, so I explicitly looked for a USB hub that could supply at least 2 A, and this one is 12 VDC @ 2500 mA. A laptop drive will draw nearly 1 A on spin up, but at that point P = A * V. Laptop drives during read/write use 1.5 W to 2.5 W @ 5 VDC: 1.5-2.5 W = A * 5 V, therefore A = 0.3-0.5 A. And for 4 drives at possibly 0.5 A each (although my drives are all at the 1.6 W read/write figure), that's 2 A @ 5 V, which is easily within what the hub power supply can maintain (by my calculation it could do 6 A @ 5 V, not accounting for any resistance). Anyway, as it turns out I don't think it was power related, as the Intel NUC in question probably had just enough amps per port. What it really was, was an incompatibility between the Intel controller and the bridge chipset in the USB-SATA cases. And a USB hub is similar to an ethernet hub in that it actually reads the USB stream and rewrites it out, so hubs are actually pretty complicated little things, and having a good one matters. > > In particular I wonder if btrfs could detect that all devices in a RAID1 > volume became > inaccessible and instead of reporting an increasing "write error" counter to > the kernel log simply > render the volume read-only. "Inaccessible" could mean that if the same > block cannot be > written back to the minimum number of devices in the RAID volume, btrfs gives up. There are pending patches for something similar that you can find in the archives. I think the reason they haven't been merged yet is there haven't been enough comments and feedback (?). I think Anand Jain is the author of those patches, so you might dig around in the archives. 
In a way you have an ideal setup for testing them out. Just make sure you have backups... > > Maybe someone can advise some sophisticated way of quickly checking that the > filesystem is > healthy? 'btrfs check' without the --repair flag is safe and read-only, but takes a long time because it'll read all metadata. The fastest safe way is to mount it ro and read a directory that was recently written to, and see if there are any kernel errors. You could recursively copy files from a directory to /dev/null and then check kernel messages for any errors. So long as metadata is DUP, there is a good chance a bad copy of metadata can be automatically fixed up with a good copy. If there's only a single copy of metadata, or both copies get corrupted, then it's difficult. Usually recovery of data is possible, but depending on what's damaged, repair might not be possible. -- Chris Murphy
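The hub power-budget arithmetic earlier in this message can be sanity-checked with a few lines of Python (an editorial sketch; the wattages and the 12 V @ 2.5 A supply rating are the figures quoted in the message, not measurements):

```python
def amps(watts: float, volts: float) -> float:
    """P = A * V, so A = P / V."""
    return watts / volts

# Per-drive current at 5 V for the quoted 1.5-2.5 W read/write range.
low, high = amps(1.5, 5.0), amps(2.5, 5.0)
print(f"per-drive draw: {low:.1f}-{high:.1f} A @ 5 V")  # 0.3-0.5 A

# Worst case for four drives at 0.5 A each.
print(f"four drives: {4 * high:.1f} A @ 5 V")  # 2.0 A

# Hub supply rated 12 VDC @ 2.5 A = 30 W; at 5 V (ignoring
# conversion losses and resistance) that is 6 A available.
print(f"supply headroom: {amps(12.0 * 2.5, 5.0):.1f} A @ 5 V")  # 6.0 A
```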
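The quick health check described above, recursively reading everything and watching for errors, can be scripted. A sketch (the mount point is hypothetical; on Btrfs a csum mismatch surfaces to the reading process as an I/O error, with the details in the kernel log):

```python
import os

def read_all_files(root: str):
    """Recursively read every file under root, discarding the data
    (the moral equivalent of copying to /dev/null). Returns a list
    of (path, error-message) pairs for files that failed to read."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while f.read(1 << 20):  # 1 MiB at a time
                        pass
            except OSError as e:
                failures.append((path, str(e)))
    return failures

# Usage (hypothetical): mount ro, read, then check dmesg for errors.
# bad = read_all_files("/mnt")
```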
Re: reproducible builds with btrfs seed feature
On Tue, Oct 16, 2018 at 10:08 PM, Anand Jain wrote: > > So a possible solution for the reproducible builds: >usual mkfs.btrfs dev >Write the data >unmount; create btrfs-image with uuid/fsid/time sanitized; mark it as a > seed (RO). >check/verify the hash of the image. Gotcha. Generation/transid needs to be included in that list. Imagine a fast system vs a slow system. The slow system certainly will end up with higher transids for the latest completed transactions. But also, I don't know how the kernel code chooses block numbers, either physical (chunk allocation) or logical (extent allocation), and whether that could be made deterministic. Same for inode assignment. Another question that comes up later, when creating the sprout by removing the seed device, is how a script can know when all block groups have successfully copied from seed to sprout, and that the sprout can be unmounted. -- Chris Murphy
Re: CRC mismatch
On Tue, Oct 16, 2018 at 9:42 AM, Austin S. Hemmelgarn wrote: > On 2018-10-16 11:30, Anton Shepelev wrote: >> >> Hello, all >> >> What may be the reason for a CRC mismatch on a BTRFS file in >> a virtual machine: >> >> csum failed ino 175524 off 1876295680 csum 451760558 >> expected csum 1446289185 >> >> Shall I seek the culprit in the host machine or in the guest >> one? Supposing the host machine healthy, what operations on >> the guest might have caused a CRC mismatch? >> > Possible causes include: > > * On the guest side: > - Unclean shutdown of the guest system (not likely even if this did > happen). > - A kernel bug in the guest. > - Something directly modifying the block device (also not very likely). > > * On the host side: > - Unclean shutdown of the host system without properly flushing data from > the guest. Not likely unless you're using an actively unsafe caching mode > for the guest's storage back-end. > - At-rest data corruption in the storage back-end. > - A bug in the host-side storage stack. > - A transient error in the host-side storage stack. > - A bug in the hypervisor. > - Something directly modifying the back-end storage. > > Of these, the statistically most likely location for the issue is probably > the storage stack on the host. Is there still that O_DIRECT related "bug" (or more of a limitation) if the guest is using cache=none on the block device? Anton, what virtual machine tech are you using? qemu/kvm managed with virt-manager? The configuration affects host behavior; but the negative effect manifests inside the guest as corruption. If I remember correctly. -- Chris Murphy
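One detail worth knowing when poking at csum values like the ones above: Btrfs data checksums are CRC-32C (the Castagnoli polynomial), not the zlib/PNG CRC-32, so zlib.crc32 won't reproduce them. A minimal bitwise CRC-32C for reference (an editorial sketch; the kernel uses table-driven or hardware-accelerated code, and recomputing an on-disk csum would also require the raw block contents):

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0x82F63B78
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF

# Standard check value for CRC-32C:
print(hex(crc32c(b"123456789")))  # 0xe3069283
```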
Re: reproducible builds with btrfs seed feature
On Tue, Oct 16, 2018 at 2:13 AM, Anand Jain wrote: > > > On 10/14/2018 06:28 AM, Chris Murphy wrote: >> >> Is it practical and desirable to make Btrfs based OS installation >> images reproducible? Or is Btrfs simply too complex and >> non-deterministic? [1] >> >> The main three problems with Btrfs right now for reproducibility are: >> a. many objects have uuids other than the volume uuid; and mkfs only >> lets us set the volume uuid >> b. atime, ctime, mtime, otime; and no way to make them all the same >> c. non-deterministic allocation of file extents, compression, inode >> assignment, logical and physical address allocation >> >> I'm imagining reproducible image creation would be a mkfs feature that >> builds on Btrfs seed and --rootdir concepts to constrain Btrfs >> features to maybe make reproducible Btrfs volumes possible: >> >> - No raid >> - Either all objects needing uuids can have those uuids specified by >> switch, or possibly a defined set of uuids expressly for this use >> case, or possibly all of them can just be zeros (eek? not sure) >> - A flag to set all times the same >> - Possibly require that target block device is zero filled before >> creation of the Btrfs >> - Possibly disallow subvolumes and snapshots >> - Require the resulting image is seed/ro and maybe also a new >> compat_ro flag to enforce that such Btrfs file systems cannot be >> modified after the fact. >> - Enforce a consistent means of allocation and compression >> >> The end result is creating two Btrfs volumes would yield image files >> with matching hashes. > > >> If I had to guess, the biggest challenge would be allocation. But it's >> also possible that such an image may have problems with "sprouts". A >> non-removable sprout seems fairly straightforward and safe; but if a >> "reproducible build" type of seed is removed, it seems like removal >> needs to be smart enough to refresh *all* uuids found in the sprout: a >> hard break from the seed. > > > Right. 
The seed fsid will be gone in a detached sprout. I think already we get a new devid, volume uuid, and device uuid. The open question is whether any other uuids need to be refreshed, such as the chunk uuid, since that appears in every node and leaf. >> Any thoughts? Useful? Difficult to implement? > > Recently Nikolay sent a patch to change the fsid on a mounted btrfs. However, for > reproducible builds it also needs neutralized uuids, times, and bytenr(s); > furthermore, though the on-disk layout won't change without notice, the > block bytenr might. Seems like the mkfs population method of such a seed could be made very deterministic as to what the start logical address and physical address are. The vast majority of non-deterministic behavior comes from the nature of kernel code having to handle so many complex inputs and outputs, and negotiate them. > One question: why not have reproducible builds get the file data extents from the > image and stitch the hashes together to verify the hash? And there could be > a vfs ioctl to import and export filesystem images for better > supportability of use-cases similar to the reproducible builds. Perhaps. I don't know the reproducible build requirements very well, whether all they really care about is the hash of the data extents, and how important fs metadata really is. That is important when it comes to fuzzing file systems that have no metadata checksumming, like squashfs; of course you'd have to checksum the whole file system image. Another feature the mkfs variety of seed image would need: deduplication. As far as I know, deduplication is kernel code only. You'd want to be able to deduplicate, as well as compress, to have the smallest distributed seed possible. And mksquashfs does deduplication by default. -- Chris Murphy
Re: Spurious mount point
On Mon, Oct 15, 2018 at 3:26 PM, Anton Shepelev wrote: > Chris Murphy to Anton Shepelev: > >> > How can I track down the origin of this mount point: >> > >> > /dev/sda2 on /home/hana type btrfs >> > (rw,relatime,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot/hana) >> > >> > if it is not present in /etc/fstab? I shouldn't like to >> > find/grep thoughout the whole filesystem. >> >> Sounds like some service taking snapshots regularly is >> managing this. Maybe this is Mint or Ubuntu and you're >> using Timeshift? > > It is SUSE Linux and (probably) its tool called `snapper', > but I have not found a clue in its documentation. I wasn't aware that SUSE was now using the @ location for snapshots, or that it was using Btrfs for /home. For a while it's been XFS with a Btrfs sysroot. -- Chris Murphy
Re: Spurious mount point
On Mon, Oct 15, 2018 at 9:05 AM, Anton Shepelev wrote: > Hello, all > > How can I track down the origin of this mount point: > >/dev/sda2 on /home/hana type btrfs > (rw,relatime,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot/hana) > > if it is not present in /etc/fstab? I shouldn't like to > find/grep throughout the whole filesystem. > > -- > () ascii ribbon campaign - against html e-mail > /\ http://preview.tinyurl.com/qcy6mjc [archived] Sounds like some service taking snapshots regularly is managing this. Maybe this is Mint or Ubuntu and you're using Timeshift? Maybe it'll show up in the journal if you add the boot parameter 'systemd.log_level=debug' and reboot; then use 'journalctl -b | grep mount' and it should show all logged instances of mount events: systemd, udisks2, maybe others? -- Chris Murphy
Re: reproducible builds with btrfs seed feature
On Mon, Oct 15, 2018 at 6:29 AM, Austin S. Hemmelgarn wrote: > On 2018-10-13 18:28, Chris Murphy wrote: >> The end result is creating two Btrfs volumes would yield image files >> with matching hashes. > > So in other words, you care about matching the block layout _exactly_. Only because that's the easiest way to verify reproducibility without any ambiguity. If someone's compromised a build system such that everyone is getting the malicious payload, but they can hide it behind a subvolume or reflink that's not used by default, could someone plausibly cause selective use of their malicious payload? I dunno, I'll leave that for more crafty people. But even if it's a tiny bit of ambiguity, it's non-zero. Hashing a file that contains the entire file system is unambiguous. I think populating the image with --rootdir at mkfs time should be pretty deterministic. One stream in and out. No generations, no snapshots, no delayed allocation. It'd be quite similar to mksquashfs. I guess I'd have to try it a few times, and see if the only differences really are uuids and times, and not allocation-related things. -- Chris Murphy
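The verification step itself is trivial once images are byte-identical: stream the whole image file through a hash. A sketch (filenames are hypothetical):

```python
import hashlib

def image_digest(path: str, algo: str = "sha256") -> str:
    """Stream a filesystem image through a hash. Byte-identical
    images, and only byte-identical images, produce the same digest,
    which is the unambiguous reproducibility check discussed above."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Reproducibility check across two independent builds:
# assert image_digest("buildA/btrfsroot.img") == image_digest("buildB/btrfsroot.img")
```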
Re: reproducible builds with btrfs seed feature
On Sun, Oct 14, 2018 at 1:09 PM, Cerem Cem ASLAN wrote: > Thanks for the explanation, I got it now. I still think this is > related to my needs, so I'll keep an eye on this. > > What is the possible use case? I can think of only one scenario: You > have a rootfs that contains a distro installer and you want to > generate distro.img files which use Btrfs under the hood in different > locations and still have the same hash, so you can publish your > verified image hash by a single source (https://your-distro.org). The first step is accepting reproducible builds as a worthy goal in and of itself, independent of Btrfs. Specifically "Why does it matter?" found here https://reproducible-builds.org/ Btrfs does bring valuable features for installation images: always-on checksumming; the seed feature permits a straightforward way to set up a volatile overlay on a zram device; the ability to convert it to a non-volatile overlay, and boot either the seed or overlay; and even installation by adding the install target and removing both overlay and seed. And yet it remains compatible with a conventional copy to another file system if it's not desirable to use Btrfs as root. Win win. By subsetting Btrfs features we don't care about in the installation seed context, can we achieve reproducibility as a consequence, while retaining some of the more interesting features? Of course once sprouted, those limitations wouldn't apply. Basically it's a "btrfs seed device 2.0" idea. But Btrfs is so complicated it's maybe too much work, hence the question. -- Chris Murphy
Re: reproducible builds with btrfs seed feature
On Sun, Oct 14, 2018 at 6:20 AM, Cerem Cem ASLAN wrote: > I'm not sure I could fully understand the desired achievement but it > sounds like (or this would be an example of selective perception) it's > somehow related with "creating reproducible snapshots" > (https://unix.stackexchange.com/q/462451/65781), no? No, the idea is to be able to consistently reproduce a distro installer image (like an ISO file) with the same hash. Inside the ISO image is typically a root.img or squash.img which itself contains a file system like ext4 or squashfs, to act as the system root. And that root.img is the main thing I'm talking about here. There is work to make squashfs deterministic, as well as ext4. And I'm wondering if there are sane ways to constrain Btrfs features to make it likewise deterministic. For example: fallocate -l 5G btrfsroot.img losetup /dev/loop0 btrfsroot.img mkfs.btrfs -m single -d single -rseed --rootdir /tmp/ -T "20181010T1200" --uuidv $X --uuidc $Y --uuidd $Z ... shasum btrfsroot.img And then do it again, and the shasums should be the same. I realize today it's not that way. And that inode assignment and extent allocation (number, size, locality) are all factors in making Btrfs quickly non-deterministic, and also why I'm assuming this needs to be done in user space. That would be the point of the -rseed flag: set the seed flag, possibly set a compat_ro flag, fix generation/transid to 1, require the use of -T (similar to make_ext4) to set all timestamps to this value, and configurable uuids for everything that uses uuids, and whatever other constraints are necessary. -- Chris Murphy
Re: reproducible builds with btrfs seed feature
On Sat, Oct 13, 2018 at 4:28 PM, Chris Murphy wrote: > Is it practical and desirable to make Btrfs based OS installation > images reproducible? Or is Btrfs simply too complex and > non-deterministic? [1] > > The main three problems with Btrfs right now for reproducibility are: > a. many objects have uuids other than the volume uuid; and mkfs only > lets us set the volume uuid > b. atime, ctime, mtime, otime; and no way to make them all the same > c. non-deterministic allocation of file extents, compression, inode > assignment, logical and physical address allocation d. generation: just pick a consistent default, because the entire image is made with mkfs and then never rw mounted, so it's not a problem > - Possibly disallow subvolumes and snapshots There's no actual mechanism to do either of these with mkfs, so it's not a problem. And if a sprout is created, it's fine for newly created subvolumes to follow the usual behavior of having a unique UUID and incrementing generation. Thing is, the sprout will inherit the seed's preset chunk uuid, which, while it shouldn't cause a problem, is a kind of violation of uuid uniqueness; but ultimately I'm not sure how big of a problem it is for such uuids to spread. -- Chris Murphy
reproducible builds with btrfs seed feature
Is it practical and desirable to make Btrfs based OS installation images reproducible? Or is Btrfs simply too complex and non-deterministic? [1] The main three problems with Btrfs right now for reproducibility are: a. many objects have uuids other than the volume uuid; and mkfs only lets us set the volume uuid b. atime, ctime, mtime, otime; and no way to make them all the same c. non-deterministic allocation of file extents, compression, inode assignment, logical and physical address allocation I'm imagining reproducible image creation would be a mkfs feature that builds on Btrfs seed and --rootdir concepts to constrain Btrfs features to maybe make reproducible Btrfs volumes possible: - No raid - Either all objects needing uuids can have those uuids specified by switch, or possibly a defined set of uuids expressly for this use case, or possibly all of them can just be zeros (eek? not sure) - A flag to set all times the same - Possibly require that the target block device is zero filled before creation of the Btrfs - Possibly disallow subvolumes and snapshots - Require the resulting image is seed/ro and maybe also a new compat_ro flag to enforce that such Btrfs file systems cannot be modified after the fact. - Enforce a consistent means of allocation and compression The end result is creating two Btrfs volumes would yield image files with matching hashes. If I had to guess, the biggest challenge would be allocation. But it's also possible that such an image may have problems with "sprouts". A non-removable sprout seems fairly straightforward and safe; but if a "reproducible build" type of seed is removed, it seems like removal needs to be smart enough to refresh *all* uuids found in the sprout: a hard break from the seed. The competing file systems here are ext4 with the make_ext4 fork, and squashfs. At the moment I'm thinking it might be easier to teach squashfs integrity checking than to make Btrfs reproducible. 
But then I also think restricting Btrfs features, and applying some requirements to constrain Btrfs to make it reproducible, really enhances the Btrfs seed-sprout feature. Any thoughts? Useful? Difficult to implement? Squashfs might be a better fit for this use case *if* it can be taught about integrity checking. It does per file checksums for the purpose of deduplication but those checksums aren't retained for later integrity checking. [1] problems of reproducible system images https://reproducible-builds.org/docs/system-images/ [2] purpose and motivation for reproducible builds https://reproducible-builds.org/ [3] who is involved? https://reproducible-builds.org/who/#Qubes%20OS -- Chris Murphy
Re: errors reported by btrfs-check
What version of btrfs-progs?
fail to read only scrub, WARNING: at fs/btrfs/transaction.c:1847 cleanup_transaction
btrfs-progs 4.17.1 kernel 4.18.12 I've got another Samsung SDHC card that's gone read only, and any writes cause it to hang. So I've used blockdev --setro on all the partitions and the block device to make sure nothing can write to it, and then mounted with # mount -o ro,nologreplay /dev/mmcblk0p3 /mnt/sd There is a log tree entry in the super. Next I want to do a scrub to see if there are any corruptions detected. I'm curious whether any corruptions happened in transactions near the time the device failed and went read only. # btrfs scrub start -Bdr /mnt/sd And that results in a warning and call trace. Seems like a bug. [97696.976887] mmc0: new ultra high speed SDR104 SDHC card at address 59b4 [97696.980825] mmcblk0: mmc0:59b4 EB2MW 29.8 GiB [97696.996102] mmcblk0: p1 p2 p3 [100363.736000] r8169 :03:00.0: invalid large VPD tag 7f at offset 0 [103726.761878] BTRFS info (device mmcblk0p3): disabling log replay at mount time [103726.764476] BTRFS info (device mmcblk0p3): using free space tree [103726.767036] BTRFS info (device mmcblk0p3): has skinny extents [103726.811136] BTRFS info (device mmcblk0p3): enabling ssd optimizations [103780.058008] BTRFS warning (device mmcblk0p3): Skipping commit of aborted transaction. 
[103780.065633] [ cut here ] [103780.070470] BTRFS: Transaction aborted (error -28) [103780.070631] WARNING: CPU: 0 PID: 6670 at fs/btrfs/transaction.c:1847 cleanup_transaction+0x8a/0xd0 [btrfs] [103780.075561] Modules linked in: mmc_block veth xt_nat ipt_MASQUERADE xt_addrtype br_netfilter bridge stp llc ccm nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat arc4 intel_rapl iwlmvm intel_powerclamp coretemp mac80211 kvm_intel kvm snd_hda_codec_hdmi irqbypass btusb snd_hda_codec_realtek iwlwifi crct10dif_pclmul snd_hda_codec_generic btrtl btbcm crc32_pclmul ghash_clmulni_intel btintel bluetooth iTCO_wdt iTCO_vendor_support [103780.097996] intel_cstate cfg80211 snd_hda_intel snd_hda_codec ecdh_generic intel_xhci_usb_role_switch roles rfkill ir_rc6_decoder snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm rc_rc6_mce mei_txe mei ite_cir snd_timer snd rc_core pcc_cpufreq soundcore intel_int0002_vgpio i2c_i801 lpc_ich dm_crypt btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq i915 i2c_algo_bit crc32c_intel drm_kms_helper sdhci_pci drm cqhci r8169 sdhci mii mmc_core video pwm_lpss_platform pwm_lpss lz4 lz4_compress [103780.117383] CPU: 0 PID: 6670 Comm: btrfs-transacti Not tainted 4.18.12-300.fc29.x86_64 #1 [103780.123984] Hardware name: /NUC5PPYB, BIOS PYBSWCEL.86A.0074.2018.0709.1332 07/09/2018 [103780.130401] RIP: 0010:cleanup_transaction+0x8a/0xd0 [btrfs] [103780.136079] Code: 83 f8 01 77 63 f0 48 0f ba ab 08 23 00 00 02 72 1b 41 83 fc fb 0f 84 45 30 08 00 44 89 e6 48 c7 c7 70 6f 6c c0 e8 60 1f a8 ec <0f> 0b 44 89 e1 ba 37 07 00 00 4d 8d 65 28 48 89 ef 48 c7 c6 70 b0 
[103780.142943] RSP: 0018:b45401483dd0 EFLAGS: 00010286 [103780.150032] RAX: RBX: 90d8f4724000 RCX: 0006 [103780.157155] RDX: 0007 RSI: 0086 RDI: 90d93fc16930 [103780.164218] RBP: 90d928f99750 R08: 0038 R09: 0007 [103780.171238] R10: R11: 0001 R12: ffe4 [103780.178171] R13: 90d8c44ce600 R14: ffe4 R15: 90d8f4725df8 [103780.185134] FS: () GS:90d93fc0() knlGS: [103780.192265] CS: 0010 DS: ES: CR0: 80050033 [103780.199458] CR2: 55764829c720 CR3: 00017420a000 CR4: 001006f0 [103780.206756] Call Trace: [103780.214090] ? finish_wait+0x80/0x80 [103780.221519] btrfs_commit_transaction+0x86d/0x8b0 [btrfs] [103780.228946] ? join_transaction+0x22/0x3e0 [btrfs] [103780.236448] ? start_transaction+0x9c/0x3e0 [btrfs] [103780.243856] transaction_kthread+0x155/0x170 [btrfs] [103780.251166] ? btrfs_cleanup_transaction+0x550/0x550 [btrfs] [103780.258368] kthread+0x112/0x130 [103780.265481] ? kthread_create_worker_on_cpu+0x70/0x70 [103780.272535] ret_from_fork+0x35/0x40 [103780.279436] ---[ end trace 7470f1b607c73b6c ]--- [103780.285841] BTRFS warning (device mmcblk0p3): cleanup_transaction:1847: Aborting unused transaction(No space left). [103780.289891] BTRFS info (device mmcblk0p3): delayed_refs has NO entry -- Chris Murphy
Re: Scrub aborts due to corrupt leaf
On Wed, Oct 10, 2018 at 9:07 PM, Larkin Lowrey wrote: > On 10/10/2018 10:51 PM, Chris Murphy wrote: >> >> On Wed, Oct 10, 2018 at 8:12 PM, Larkin Lowrey >> wrote: >>> >>> On 10/10/2018 7:55 PM, Hans van Kranenburg wrote: >>>> >>>> On 10/10/2018 07:44 PM, Chris Murphy wrote: >>>>> >>>>> >>>>> I'm pretty sure you have to umount, and then clear the space_cache >>>>> with 'btrfs check --clear-space-cache=v1' and then do a one time mount >>>>> with -o space_cache=v2. >>>> >>>> The --clear-space-cache=v1 is optional, but recommended, if you are >>>> someone who do not likes to keep accumulated cruft. >>>> >>>> The v2 mount (rw mount!!!) does not remove the v1 cache. If you just >>>> mount with v2, the v1 data keeps being there, doing nothing any more. >>> >>> >>> Theoretically I have the v2 space_cache enabled. After a clean umount... >>> >>> # mount -onospace_cache /backups >>> [ 391.243175] BTRFS info (device dm-3): disabling free space tree >>> [ 391.249213] BTRFS error (device dm-3): cannot disable free space tree >>> [ 391.255884] BTRFS error (device dm-3): open_ctree failed >> >> "free space tree" is the v2 space cache, and once enabled it cannot be >> disabled with nospace_cache mount option. If you want to run with >> nospace_cache you'll need to clear it. >> >> >>> # mount -ospace_cache=v1 /backups/ >>> mount: /backups: wrong fs type, bad option, bad superblock on >>> /dev/mapper/Cached-Backups, missing codepage or helper program, or other >>> error >>> [ 983.501874] BTRFS info (device dm-3): enabling disk space caching >>> [ 983.508052] BTRFS error (device dm-3): cannot disable free space tree >>> [ 983.514633] BTRFS error (device dm-3): open_ctree failed >> >> You cannot go back and forth between v1 and v2. Once v2 is enabled, >> it's always used regardless of any mount option. You'll need to use >> btrfs check to clear the v2 cache if you want to use v1 cache. 
>> >> >>> # btrfs check --clear-space-cache v1 /dev/Cached/Backups >>> Opening filesystem to check... >>> couldn't open RDWR because of unsupported option features (3). >>> ERROR: cannot open file system >> >> You're missing the '=' symbol for the clear option, that's why it fails. >> > > # btrfs check --clear-space-cache=v2 /dev/Cached/Backups > Opening filesystem to check... > Checking filesystem on /dev/Cached/Backups > UUID: acff5096-1128-4b24-a15e-4ba04261edc3 > Clear free space cache v2 > Segmentation fault (core dumped) > > [ 109.686188] btrfs[2429]: segfault at 68 ip 555ff6394b1c sp > 7ffcc4733ab0 error 4 in btrfs[555ff637c000+ca000] > [ 109.696732] Code: ff e8 68 ed ff ff 8b 4c 24 58 4d 8b 8f c7 01 00 00 4c > 89 fe 85 c0 0f 44 44 24 40 45 31 c0 89 44 24 40 48 8b 84 24 90 00 00 00 <8b> > 40 68 49 29 87 d0 00 00 00 6a 00 55 48 8b 54 24 18 48 8b 7c 24 > > That's btrfs-progs v4.17.1 on 4.18.12-200.fc28.x86_64. > > I appreciate the help and advice from everyone who has contributed to this > thread. At this point, unless there is something for the project to gain > from tracking down this trouble, I'm just going to nuke the fs and start > over. Is this a 68T file system? Seems excessive. For now you should be able to use the new v2 space tree. I think Qu or some dev will want to know why you're getting a crash trying to clear the v2 space cache. Maybe try clearing the v1 first, then v2? While v1 is default right now, soonish the plan is to go to v2 by default but the inability to clear is a bug worth investigation. I've just tried it on several of my file systems and it clears without error and rebuilds at next mount with v2 option. If it is the 68T file system, I don't expect a btrfs-image is going to be easy to capture or deliver: you've got 95GiB of metadata! Compressed that's still a ~30-45GiB image. -- Chris Murphy
Re: Scrub aborts due to corrupt leaf
On Wed, Oct 10, 2018 at 8:12 PM, Larkin Lowrey wrote: > On 10/10/2018 7:55 PM, Hans van Kranenburg wrote: >> >> On 10/10/2018 07:44 PM, Chris Murphy wrote: >>> >>> >>> I'm pretty sure you have to umount, and then clear the space_cache >>> with 'btrfs check --clear-space-cache=v1' and then do a one time mount >>> with -o space_cache=v2. >> >> The --clear-space-cache=v1 is optional, but recommended, if you are >> someone who do not likes to keep accumulated cruft. >> >> The v2 mount (rw mount!!!) does not remove the v1 cache. If you just >> mount with v2, the v1 data keeps being there, doing nothing any more. > > > Theoretically I have the v2 space_cache enabled. After a clean umount... > > # mount -onospace_cache /backups > [ 391.243175] BTRFS info (device dm-3): disabling free space tree > [ 391.249213] BTRFS error (device dm-3): cannot disable free space tree > [ 391.255884] BTRFS error (device dm-3): open_ctree failed "free space tree" is the v2 space cache, and once enabled it cannot be disabled with nospace_cache mount option. If you want to run with nospace_cache you'll need to clear it. > > # mount -ospace_cache=v1 /backups/ > mount: /backups: wrong fs type, bad option, bad superblock on > /dev/mapper/Cached-Backups, missing codepage or helper program, or other > error > [ 983.501874] BTRFS info (device dm-3): enabling disk space caching > [ 983.508052] BTRFS error (device dm-3): cannot disable free space tree > [ 983.514633] BTRFS error (device dm-3): open_ctree failed You cannot go back and forth between v1 and v2. Once v2 is enabled, it's always used regardless of any mount option. You'll need to use btrfs check to clear the v2 cache if you want to use v1 cache. > > # btrfs check --clear-space-cache v1 /dev/Cached/Backups > Opening filesystem to check... > couldn't open RDWR because of unsupported option features (3). > ERROR: cannot open file system You're missing the '=' symbol for the clear option, that's why it fails. -- Chris Murphy
Re: Recovery options for damaged beginning of the filesystem
On Tue, Oct 9, 2018 at 10:47 PM, Shapranov Vladimir wrote: > Hi, > > I've got a filesystem with the first ~50Mb accidentally dd'ed. > > "btrfs check" fails with the following error (regardless of "-s"): > checksum verify failed on 21037056 found FC8A6557 wanted 2F51D090 > checksum verify failed on 21037056 found FC8A6557 wanted 2F51D090 > checksum verify failed on 21037056 found 1EDD5E47 wanted 222F7E7F > checksum verify failed on 21037056 found 1EDD5E47 wanted 222F7E7F > bytenr mismatch, want=21037056, have=13515002166904211737 > ERROR: cannot read chunk root > ERROR: cannot open file system > > "mount -o ro /dev/sdf1 /mnt/tmp" fails, while "mount -o ro,subvol=X /mnt/tmp" succeeds for "/" and a couple of subvolumes. What do you get for 'btrfs rescue super -v /dev/sdf1' ? I thought the kernel code would not mount a Btrfs if the first super is not present or valid (checksum match)? -- Chris Murphy
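For context on why the data may still be recoverable: if I have the offsets right, Btrfs keeps superblock mirrors at fixed locations of 64 KiB, 64 MiB, and 256 GiB, so wiping the first ~50 MB destroys only the primary copy on a large enough device. A quick sketch of which mirrors survive (the 50 MB figure is from the report above; the 1 TiB device size is an assumption for illustration):

```python
# Standard Btrfs superblock mirror offsets: 64 KiB, 64 MiB, 256 GiB.
SUPER_OFFSETS = [64 * 1024, 64 * 1024**2, 256 * 1024**3]

def surviving_supers(wiped_bytes: int, device_size: int):
    """Return offsets of superblock copies that were neither
    overwritten by the wipe nor located beyond the end of the device."""
    return [off for off in SUPER_OFFSETS
            if off >= wiped_bytes and off < device_size]

# ~50 MB wiped on a hypothetical 1 TiB device: the 64 KiB primary is
# gone, but the 64 MiB and 256 GiB mirrors remain.
print(surviving_supers(50 * 1000**2, 1024**4))
```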
Re: Scrub aborts due to corrupt leaf
On Wed, Oct 10, 2018 at 12:31 PM, Larkin Lowrey wrote: > Interesting, because I do not see any indications of any other errors. The > fs is backed by an mdraid array and the raid checks always pass with no > mismatches, edac-util doesn't report any ECC errors, smartd doesn't report > any SMART errors, and I never see any raid controller errors. I have the > console connected through serial to a logging console server so if there > were errors reported I would have seen them. I think Holger is referring to the multiple reports like this:

[ 817.883261] scsi_eh_0 S0 141 2 0x8000
[ 817.66] Call Trace:
[ 817.891391]  ? __schedule+0x253/0x860
[ 817.895094]  ? scsi_try_target_reset+0x90/0x90
[ 817.899631]  ? scsi_eh_get_sense+0x220/0x220
[ 817.904045]  schedule+0x28/0x80
[ 817.907260]  scsi_error_handler+0x1d2/0x5b0
[ 817.911514]  ? __schedule+0x25b/0x860
[ 817.915207]  ? scsi_eh_get_sense+0x220/0x220
[ 817.919547]  kthread+0x112/0x130
[ 817.922818]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 817.928015]  ret_from_fork+0x22/0x40

That isn't a SCSI controller or drive error itself; it's a capture of a thread that's in the state of handling scsi errors (maybe). I'm finding scsi_try_target_reset here at line 855: https://github.com/torvalds/linux/blob/master/drivers/scsi/scsi_error.c And also line 2143 for scsi_error_handler: https://github.com/torvalds/linux/blob/master/drivers/scsi/scsi_error.c Is the problem Btrfs on sysroot? Because if the sysroot file system is entirely error free, I'd expect to eventually get a lot more error information from the kernel even without sysrq+t rather than faceplanting. Can you post the entire dmesg? The posted one starts at ~815 seconds, and the problems definitely start before then, but as it is we have nothing really to go on. -- Chris Murphy
Re: Scrub aborts due to corrupt leaf
On Wed, Oct 10, 2018 at 10:04 AM, Holger Hoffstätte wrote: > On 10/10/18 17:44, Larkin Lowrey wrote: > (..) >> >> About once a week, or so, I'm running into the above situation where >> FS seems to deadlock. All IO to the FS blocks, there is no IO >> activity at all. I have to hard reboot the system to recover. There >> are no error indications except for the following which occurs well >> before the FS freezes up: >> >> BTRFS warning (device dm-3): block group 78691883286528 has wrong amount >> of free space >> BTRFS warning (device dm-3): failed to load free space cache for block >> group 78691883286528, rebuilding it now >> >> Do I have any options other the nuking the FS and starting over? > > > Unmount cleanly & mount again with -o space_cache=v2. I'm pretty sure you have to umount, and then clear the space_cache with 'btrfs check --clear-space-cache=v1' and then do a one time mount with -o space_cache=v2. But anyway, to me that seems premature because we don't even know what's causing the problem. a. Freezing means there's a kernel bug. Hands down. b. Is it freezing on the rebuild? Or something else? c. I think the devs would like to see the output from btrfs-progs v4.17.1, 'btrfs check --mode=lowmem' and see if it finds anything, in particular something not related to free space cache. Rebuilding either version of space cache requires successfully reading (and parsing) the extent tree. -- Chris Murphy
Re: CoW behavior when writing same content
On Tue, Oct 9, 2018 at 11:25 AM, Andrei Borzenkov wrote: > 09.10.2018 18:52, Chris Murphy wrote: >>> In this case is root/big_file and snapshot/big_file still share the same >>> data? >> >> You'll be left with three files. /big_file and root/big_file will >> share extents, > > How comes they share extents? This requires --reflink, is it default now? Good catch. It's not the default. I meant to write that initially only root/big_file and snapshot/big_file have shared extents. And the shared extents are lost when snapshot/big_file is "overwritten" by the copy into snapshot/. >> and snapshot/big_file will have its own extents. You'd >> need to copy with --reflink for snapshot/big_file to have shared >> extents with /big_file - or deduplicate. >> > This still overwrites the whole file in the sense original file content > of "snapshot/big_file" is lost. That new content happens to be identical > and that new content will probably be reflinked does not change the fact > that original file is gone. Agreed. -- Chris Murphy
Re: CoW behavior when writing same content
On Tue, Oct 9, 2018 at 8:48 AM, Gervais, Francois wrote: > Hi, > > If I have a snapshot where I overwrite a big file but which only a > small portion of it is different, will the whole file be rewritten in > the snapshot? Or only the different part of the file? Depends on how the application modifies files. Many applications write out a whole new file with a pseudorandom filename, fsync, then rename. > > Something like: > > $ dd if=/dev/urandom of=/big_file bs=1M count=1024 > $ cp /big_file root/ > $ btrfs sub snap root snapshot > $ cp /big_file snapshot/ > > In this case is root/big_file and snapshot/big_file still share the same data? You'll be left with three files. /big_file and root/big_file will share extents, and snapshot/big_file will have its own extents. You'd need to copy with --reflink for snapshot/big_file to have shared extents with /big_file - or deduplicate. -- Chris Murphy
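A sketch of the same experiment with reflink copies, which gives the shared-extents outcome described above (needs a reflink-capable filesystem such as Btrfs; paths as in the example):

```shell
dd if=/dev/urandom of=/big_file bs=1M count=1024
cp --reflink=always /big_file root/
btrfs subvolume snapshot root snapshot
cp --reflink=always /big_file snapshot/

# All three copies should now reference the same physical extents;
# compare with: filefrag -v /big_file root/big_file snapshot/big_file
```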
qgroups not enabled, but perf stats reports btrfs_qgroup_release_data and btrfs_qgroup_free_delayed_ref
[chris@flap ~]$ sudo perf stat -e 'btrfs:*' -a sleep 70
## And then I loaded a few sites in Firefox early on in those 70 seconds.

 Performance counter stats for 'system wide':

                 5      btrfs:btrfs_transaction_commit
                29      btrfs:btrfs_inode_new
                29      btrfs:btrfs_inode_request
                25      btrfs:btrfs_inode_evict
             1,602      btrfs:btrfs_get_extent
                 0      btrfs:btrfs_handle_em_exist
                 1      btrfs:btrfs_get_extent_show_fi_regular
                88      btrfs:btrfs_truncate_show_fi_regular
                19      btrfs:btrfs_get_extent_show_fi_inline
                 2      btrfs:btrfs_truncate_show_fi_inline
               189      btrfs:btrfs_ordered_extent_add
               189      btrfs:btrfs_ordered_extent_remove
                 9      btrfs:btrfs_ordered_extent_start
               592      btrfs:btrfs_ordered_extent_put
             1,207      btrfs:__extent_writepage
             1,203      btrfs:btrfs_writepage_end_io_hook
                25      btrfs:btrfs_sync_file
                 0      btrfs:btrfs_sync_fs
                 0      btrfs:btrfs_add_block_group
             1,508      btrfs:add_delayed_tree_ref
             1,498      btrfs:run_delayed_tree_ref
               379      btrfs:add_delayed_data_ref
               336      btrfs:run_delayed_data_ref
             1,887      btrfs:add_delayed_ref_head
             1,839      btrfs:run_delayed_ref_head
                 0      btrfs:btrfs_chunk_alloc
                 0      btrfs:btrfs_chunk_free
               794      btrfs:btrfs_cow_block
             6,982      btrfs:btrfs_space_reservation
                 0      btrfs:btrfs_trigger_flush
                 0      btrfs:btrfs_flush_space
               952      btrfs:btrfs_reserved_extent_alloc
                 0      btrfs:btrfs_reserved_extent_free
             1,005      btrfs:find_free_extent
             1,005      btrfs:btrfs_reserve_extent
               816      btrfs:btrfs_reserve_extent_cluster
                 1      btrfs:btrfs_find_cluster
                 0      btrfs:btrfs_failed_cluster_setup
                 1      btrfs:btrfs_setup_cluster
             5,952      btrfs:alloc_extent_state
             6,034      btrfs:free_extent_state
               374      btrfs:btrfs_work_queued
               362      btrfs:btrfs_work_sched
               362      btrfs:btrfs_all_work_done
               116      btrfs:btrfs_ordered_sched
                 0      btrfs:btrfs_workqueue_alloc
                 0      btrfs:btrfs_workqueue_destroy
                 0      btrfs:btrfs_qgroup_reserve_data
               201      btrfs:btrfs_qgroup_release_data
             1,839      btrfs:btrfs_qgroup_free_delayed_ref
                 0      btrfs:btrfs_qgroup_account_extents
                 0      btrfs:btrfs_qgroup_trace_extent
                 0      btrfs:btrfs_qgroup_account_extent
                 0      btrfs:qgroup_update_counters
                 0      btrfs:qgroup_update_reserve
                 0      btrfs:qgroup_meta_reserve
                 0      btrfs:qgroup_meta_convert
                 0      btrfs:qgroup_meta_free_all_pertrans
                 0      btrfs:btrfs_prelim_ref_merge
                 0      btrfs:btrfs_prelim_ref_insert
             2,663      btrfs:btrfs_inode_mod_outstanding_extents
                 0      btrfs:btrfs_remove_block_group
                 0      btrfs:btrfs_add_unused_block_group
                 0      btrfs:btrfs_skip_unused_block_group

      70.004723586 seconds time elapsed

[chris@flap ~]$

Seems like a lot of activity for just a few transactions, but what really caught my eye here is the qgroup reporting for a file system that has never had qgroups enabled. Is it expected?

Chris Murphy
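A quick way to pick out which qgroup tracepoints actually fired, assuming the output above is saved to a file (the sample lines here are copied from the report):

```shell
# Filter perf stat output for qgroup tracepoints with a nonzero count.
# Sample lines copied from the report above.
cat > perf-sample.txt <<'EOF'
         0      btrfs:btrfs_qgroup_reserve_data
       201      btrfs:btrfs_qgroup_release_data
     1,839      btrfs:btrfs_qgroup_free_delayed_ref
         0      btrfs:btrfs_qgroup_account_extents
EOF

# $1 is the count, $2 the tracepoint name.
awk '$2 ~ /qgroup/ && $1 != "0" { print $1, $2 }' perf-sample.txt
# → 201 btrfs:btrfs_qgroup_release_data
# → 1,839 btrfs:btrfs_qgroup_free_delayed_ref
```

Only btrfs_qgroup_release_data and btrfs_qgroup_free_delayed_ref fired; everything else qgroup-related was zero.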
Re: btrfs problems
On Thu, Sep 20, 2018 at 3:36 PM Adrian Bastholm wrote: > > Thanks a lot for the detailed explanation. > Aabout "stable hardware/no lying hardware". I'm not running any raid > hardware, was planning on just software raid. Yep. I'm referring to the drives, their firmware, cables, logic board, its firmware, the power supply, power, etc. Btrfs is by nature intolerant of corruption. Other file systems are more tolerant because they don't know about it (although recent versions of XFS and ext4 are now defaulting to checksummed metadata and journals). >three drives glued > together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would > this be a safer bet, or would You recommend running the sausage method > instead, with "-d single" for safety ? I'm guessing that if one of the > drives dies the data is completely lost > Another variant I was considering is running a raid1 mirror on two of > the drives and maybe a subvolume on the third, for less important > stuff RAID does not substantially reduce the chances of data loss. It's not anything like a backup. It's an uptime enhancer. If you have backups, and your primary storage dies, of course you can restore from backup no problem, but it takes time and while the restore is happening, you're not online - uptime is killed. If that's a negative, might want to run RAID so you can keep working during the degraded period, and instead of a restore you're doing a rebuild. But of course there is a chance of failure during the degraded period. So you have to have a backup anyway. At least with Btrfs/ZFS, there is another reason to run with some replication like raid1 or raid5 and that's so that if there's corruption or a bad sector, Btrfs doesn't just detect it, it can fix it up with the good copy. For what it's worth, make sure the drives have lower SCT ERC time than the SCSI command timer. This is the same for Btrfs as it is for md and LVM RAID. 
The command timer default is 30 seconds, and most drives ship with SCT ERC disabled and very high internal recovery times, well over 30 seconds. So either set SCT ERC to something like 70 deciseconds, or increase the command timer to something like 120 or 180 seconds (either one is absurdly high, but what you want is for the drive to eventually give up and report a discrete error message, which Btrfs can do something about, rather than the kernel doing a SATA link reset, in which case Btrfs can't do anything about it). -- Chris Murphy
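The two options above, sketched as commands (the device name is an example; smartctl is from smartmontools):

```shell
# Check whether the drive supports SCT ERC, and its current setting.
smartctl -l scterc /dev/sda

# Option 1: cap drive error recovery at 70 deciseconds (7 seconds) for
# both reads and writes. On many drives this does not persist across
# power cycles, so it's typically set from a udev rule or boot script.
smartctl -l scterc,70,70 /dev/sda

# Option 2: if the drive doesn't support SCT ERC, raise the kernel's
# SCSI command timer instead (default is 30 seconds).
echo 180 > /sys/block/sda/device/timeout
```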
Re: btrfs problems
On Thu, Sep 20, 2018 at 11:23 AM, Adrian Bastholm wrote: > On Mon, Sep 17, 2018 at 2:44 PM Qu Wenruo wrote: > >> >> Then I strongly recommend to use the latest upstream kernel and progs >> for btrfs. (thus using Debian Testing) >> >> And if anything went wrong, please report asap to the mail list. >> >> Especially for fs corruption, that's the ghost I'm always chasing for. >> So if any corruption happens again (although I hope it won't happen), I >> may have a chance to catch it. > > You got it >> > >> >> Anyway, enjoy your stable fs even it's not btrfs > >> > My new stable fs is too rigid. Can't grow it, can't shrink it, can't >> > remove vdevs from it , so I'm planning a comeback to BTRFS. I guess >> > after the dust settled I realize I like the flexibility of BTRFS. >> > > I'm back to btrfs. > >> From the code aspect, the biggest difference is the chunk layout. >> Due to the ext* block group usage, each block group header (except some >> sparse bg) is always used, thus btrfs can't use them. >> >> This leads to highly fragmented chunk layout. > > The only thing I really understood is "highly fragmented" == not good > . I might need to google these "chunk" thingies Chunks are synonymous with block groups. They're like a super extent, or an extent of extents. The block group is how Btrfs maps the logical addresses used most everywhere in Btrfs land to the device + physical location of extents. It's how a file can be referenced by one logical address alone, without needing to know either where the extent is located or how many copies there are. The block group allocation profile is what determines whether there's one copy, duplicate copies, or raid1, 10, 5, 6 copies of a chunk, and where the copies are located. It's also fundamental to how device add, remove, replace, file system resize, and balance all interrelate. 
>> If your primary concern is to make the fs as stable as possible, then >> keep snapshots to a minimal amount, avoid any functionality you won't >> use, like qgroup, routinely balance, RAID5/6. > > So, is RAID5 stable enough ? reading the wiki there's a big fat > warning about some parity issues, I read an article about silent > corruption (written a while back), and chris says he can't recommend > raid56 to mere mortals. Depends on how you define stable. In recent kernels it's stable on stable hardware, i.e. no lying hardware (it actually flushes when it claims it has), no power failures, and no failed devices. Of course it's designed to help protect against a clear loss of a device, but there's tons of stuff here that's just not finished, including ejecting bad devices from the array like md and lvm raids will do. Btrfs will just keep trying, through all the failures. There are some patches to moderate this but I don't think they're merged yet. You'd also want to be really familiar with how to handle degraded operation, if you're going to depend on it, and how to replace a bad device. Last I refreshed my memory on it, it's advised to use "btrfs device add" followed by "btrfs device remove" for raid56; whereas "btrfs replace" is preferred for all other profiles. I'm not sure if the "btrfs replace" issues with parity raid were fixed. Metadata as raid56 shows a lot more problem reports than metadata raid1, so there's something goofy going on in those cases. I'm not sure how well understood they are. But other people don't have problems with it. It's worth looking through the archives about some things. Btrfs raid56 isn't exactly perfectly COW; there is read-modify-write code, which means there can be overwrites. I vaguely recall that it's COW in the logical layer, but the physical writes can end up being RMW, and not for sure COW. -- Chris Murphy
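The replacement procedure mentioned above, sketched as a command sequence (device names are examples; at the time, add-then-remove was the advised path for raid56 rather than 'btrfs replace'):

```shell
# If the array only mounts degraded after losing a device:
mount -o degraded /dev/sdb /mnt

# Advised for raid56 at the time: add the new device first...
btrfs device add /dev/sde /mnt
# ...then remove the missing one, which triggers the rebuild.
btrfs device remove missing /mnt

# For other profiles, 'btrfs replace' is the preferred one-step path:
# btrfs replace start /dev/sdd /dev/sde /mnt
```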
Re: btrfs send hangs after partial transfer and blocks all IO
On Wed, Sep 19, 2018 at 1:41 PM, Jürgen Herrmann wrote: > Am 13.9.2018 14:35, schrieb Nikolay Borisov: >> >> On 13.09.2018 15:30, Jürgen Herrmann wrote: >>> >>> OK, I will install kdump later and perform a dump after the hang. >>> >>> One more noob question beforehand: does this dump contain sensitive >>> information, for example the luks encryption key for the disk etc? A >>> Google search only brings up one relevant search result which can only >>> be viewed with a redhat subscription... >> >> >> >> So a kdump will dump the kernel memory so it's possible that the LUKS >> encryption keys could be extracted from that image. Bummer, it's >> understandable why you wouldn't want to upload it :). In this case you'd >> have to install also the 'crash' utility to open the crashdump and >> extract the calltrace of the btrfs process. The rough process should be : >> >> >> crash 'path to vm linux' 'path to vmcore file', then once inside the >> crash utility : >> >> set , you can acquire the pid by issuing 'ps' >> which will give you a ps-like output of all running processes at the >> time of crash. After the context has been set you can run 'bt' which >> will give you a backtrace of the send process. >> >> >> >>> >>> Best regards, >>> Jürgen >>> >>> Am 13. September 2018 14:02:11 schrieb Nikolay Borisov >>> : >>> >>>> On 13.09.2018 14:50, Jürgen Herrmann wrote: >>>>> >>>>> I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also >>>>> after the hang because I started the loop before the hang. The dmesg >>>>> output should show the hanging tasks from second 346 on or so. Still >>>>> not >>>>> useful? >>>>> >>>> >>>> So from 346 it's evident that transaction commit is waiting for >>>> commit_root_sem to be acquired. So something else is holding it and not >>>> giving the transaction chance to finish committing. Now the only place >>>> where send acquires this lock is in find_extent_clone around the call >>>> to extent_from_logical. 
The latter basically does an extent tree search >>>> and doesn't loop so can't possibly deadlock. Furthermore I don't see any >>>> userspace processes being hung in kernel space. >>>> >>>> Additionally looking at the userspace processes they indicate that >>>> find_extent_clone has finished and are blocked in send_write_or_clone >>>> which does the write. But I guess this actually happens before the hang. >>>> >>>> >>>> So at this point without looking at the stacktrace of the btrfs send >>>> process after the hung has occurred I don't think much can be done >>> >>> > I know this is probably not the correct list to ask this question but maybe > someone of the devs can point me to the right list? > > I cannot get kdump to work. The crashkernel is loaded and everything is > setup for it afaict. I asked a question on this over at stackexchange but no > answer yet. > https://unix.stackexchange.com/questions/469838/linux-kdump-does-not-boot-second-kernel-when-kernel-is-crashing > > So i did a little digging and added some debug printk() statements to see > whats going on and it seems that panic() is never called. maybe the second > stack trace is the reason? > Screenshot is here: https://t-5.eu/owncloud/index.php/s/OegsikXo4VFLTJN > > Could someone please tell me where I can report this problem and get some > help on this topic? Try kexec mailing list. They handle kdump. http://lists.infradead.org/mailman/listinfo/kexec -- Chris Murphy
Re: inline extents
Adding fsdevel@, linux-ext4, and btrfs@ (which has a separate subject on this same issue) On Wed, Sep 19, 2018 at 7:45 PM, Dave Chinner wrote: >On Wed, Sep 19, 2018 at 10:23:38AM -0600, Chris Murphy wrote: >> Fedora 29 has a new feature to test if boot+startup fails, so the >> bootloader can do a fallback at next boot, to a previously working >> entry. Part of this means GRUB (the bootloader code, not the user >> space code) uses "save_env" to overwrite the 1024 data bytes with >> updated environment information. > > That's just broken. Illegal. Completely unsupportable. Doesn't > matter what the filesystem is, nobody is allowed to write directly > to the block device a filesystem owns. Yeah, the word I'm thinking of is abomination. However in their defense, grubenv and the 'save_env' command are old features: line 3638 @node Environment block http://git.savannah.gnu.org/cgit/grub.git/tree/docs/grub.texi "For safety reasons, this storage is only available when installed on a plain disk (no LVM or RAID), using a non-checksumming filesystem (no ZFS), and using BIOS or EFI functions (no ATA, USB or IEEE1275)." I haven't checked how it tests for this. But by now, it should list the supported file systems, rather than what's exempt. That's a shorter list. > ext4 has inline data, too, so there's every chance grub will corrupt > ext4 filesystems with its wonderful new feature. I'm not sure if > the ext4 metadata cksums cover the entire inode and inline data, but > if they do it's the same problem as btrfs. 
I don't see inline used with a default mkfs, but I do see metadata_csum (e2fsprogs-1.44.3-1.fc29.x86_64):

Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl

> >> For XFS, I'm not sure how the inline extent is saved, and whether >> metadata checksumming includes or excludes the inline extent. > > When XFS implements this, it will be like btrfs as the data will be > covered by the metadata CRCs for the inode, and so writing directly > to it would corrupt the inode and render it unreadable by the > filesystem. Good to know. > >> I'm also kinda ignoring the reflink ramifications of this behavior, >> for now. Let's just say even if there's no corruption I'm really >> suspicious of bootloader code writing anything, even what seems to be >> a simple overwrite of two sectors. > > You're not the only one > > Like I said, it doesn't matter what the filesystem is, overwriting > file data by writing directly to the block device is not > supportable. It's essentially a filesystem corruption vector, and > grub needs to have that functionality removed immediately. I'm in agreement with respect to the more complex file systems. We've already realized the folly of the bootloader being unable to do journal replay, ergo it doesn't really for sure have a complete picture of the file system anyway. That's suboptimal when it results in boot failure. But if it were going to use stale file system information, get a wrong idea of the file system, and then use that to do even 1024 bytes of writes? No, no, and no. Meanwhile, also in Fedoraland, it's one of the distros where grubenv and grub.cfg stuff is on the EFI System partition, which is FAT. This overwrite behavior will work there, but even this case is a kind of betrayal that the file is being modified, without its metadata being updated. 
I think it's an old era hack that by today's standards simply isn't good enough. I'm a little surprised that all UEFI implementations permit arbitrary writes from the pre-boot environment to arbitrary block devices, even with Secure Boot enabled. That seems specious. I know some of the file systems have reserve areas for bootloader stuff. I'm not sure if that's preferred over bootloaders just getting their own partition and controlling it stem to stern however they want. -- Chris Murphy
Re: GRUB writing to grubenv outside of kernel fs code
On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy wrote: > https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F > > Does anyone know if this is still a problem on Btrfs if grubenv has > xattr +C set? In which case it should be possible to overwrite and > there's no csums that are invalidated. I'm wrong.

$ sudo grub2-editenv --verbose grubenv create
[sudo] password for chris:
[chris@f29h ~]$ ll
-rw-r--r--. 1 root root 1024 Sep 18 13:37 grubenv
[chris@f29h ~]$ stat -f grubenv
  File: "grubenv"
    ID: ac9ba8ecdce5b017 Namelen: 255     Type: btrfs
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 46661632   Free: 37479747   Available: 37422535
Inodes: Total: 0          Free: 0
[chris@f29h ~]$ sudo filefrag -v grubenv
Filesystem type is: 9123683e
File size of grubenv is 1024 (1 block of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    4095:          0..      4095:   4096:             last,not_aligned,inline,eof
grubenv: 1 extent found
[chris@f29h ~]$

So it's an inline extent, which means nocow doesn't apply. It's metadata so it *must* be COW. And any overwrite would trigger a metadata checksum error. First I'd argue it should refuse to create the file on Btrfs. But if it does create grubenv, instead it should know that on Btrfs it must redirect it to the appropriate btrfs reserved area (no idea how this works) rather than to a file. -- Chris Murphy
Re: GRUB writing to grubenv outside of kernel fs code
On Tue, Sep 18, 2018 at 1:11 PM, Goffredo Baroncelli wrote: >> I think it's a problem, and near as I can tell it'll be a problem for >> all kinds of complex storage. I don't see how the bootloader itself >> can do an overwrite onto raid5 or raid6. > > >> That's certainly supported by GRUB for reading > Not yet, I am working on that [1] Sorry! I meant mdadm raid56. It definitely can read that format for some time and even degraded! It's pretty cool. But I see no way that it's sane to have the bootloader write to such a volume. I've run into some issues where grub2-mkconfig and grubby can change the grub.cfg, and then do a really fast reboot without cleanly unmounting the volume - and what happens? Can't boot. The bootloader can't do log replay so it doesn't see the new grub.cfg at all. If all you do is mount the volume and unmount, log replay happens, the fs metadata is all fixed up just fine, and now the bootloader can see it. This same problem can happen with the kernel and initramfs installations. (Hilariously the reason why this can happen is because of a process exempting itself from being forcibly killed by systemd *against* the documented advice of systemd devs that you should only do this for processes not on rootfs; but as a consequence of this process doing the wrong thing, systemd at reboot time ends up doing an unclean unmount and reboot because it won't kill the kill-exempt process.) So *already* we have file systems that are becoming too complicated for the bootloader to reliably read, because they cannot do journal relay, let alone have any chance of modifying (nor would I want them to do this). 
So yeah I'm very rapidly becoming opposed to grubenv on anything but super simple volumes like maybe ext4 without a journal (extents are nice); or even perhaps GRUB should just implement its own damn file system and we give it its own partition - similar to BIOS Boot - but probably a little bigger. > >> but is the bootloader overwrite of grubenv going to >> recompute parity and write to multiple devices? Eek! > > Recompute the parity should not be a big deal. Updating all the (b)trees > would be a too complex goal. I think it's just asking for trouble. Sometimes the best answer ends up being no, no, and definitely no. -- Chris Murphy
Re: GRUB writing to grubenv outside of kernel fs code
On Tue, Sep 18, 2018 at 1:01 PM, Andrei Borzenkov wrote: > 18.09.2018 21:57, Chris Murphy wrote: >> On Tue, Sep 18, 2018 at 12:16 PM, Andrei Borzenkov >> wrote: >>> 18.09.2018 08:37, Chris Murphy wrote: >> >>>> The patches aren't upstream yet? Will they be? >>>> >>> >>> I do not know. Personally I think much easier is to make grub location >>> independent of /boot, allowing grub be installed in separate partition. >>> This automatically covers all other cases (like MD, LVM etc). >> >> The only case where I'm aware of this happens is Fedora on UEFI where >> they write grubenv and grub.cfg on the FAT ESP. I'm pretty sure >> upstream expects grubenv and grub.cfg at /boot/grub and I haven't ever >> seen it elsewhere (except Fedora on UEFI). >> >> I'm not sure this is much easier. Yet another volume that would be >> persistently mounted? Where? A nested mount at /boot/grub? I'm not >> liking that at all. Even Windows and macOS have saner and simpler to >> understand booting methods than this. >> >> > That's exactly what Windows ended up with - separate boot volume with > bootloader related files. The OEM installer will absolutely install to a single partition. If you point it to a blank drive on BIOS it will preferentially create a "system" volume that's used for booting. But it's not mandatory. On UEFI, it doesn't create a "system" volume, just a "recovery" partition of ~500M and a "reserved" one of 16M. The reserved partition is blank unless you've done some resizing on the main volume. The recovery volume contains Winre.wim which is used for doing resets. If you blow away that partition, you can still boot, but you can't do resets. -- Chris Murphy
Re: GRUB writing to grubenv outside of kernel fs code
On Tue, Sep 18, 2018 at 12:25 PM, Austin S. Hemmelgarn wrote: > It actually is independent of /boot already. I've got it running just fine > on my laptop off of the EFI system partition (which is independent of my > /boot partition), and thus have no issues with handling of the grubenv file. > The problem is that all the big distros assume you want it in /boot, so they > have no option for putting it anywhere else. > > Actually installing it elsewhere is not hard though, you just pass > `--boot-directory=/wherever` to the `grub-install` script and turn off your > distributions automatic reinstall mechanism so it doesn't get screwed up by > the package manager when the GRUB package gets updated. You can also make > `/boot/grub` a symbolic link pointing to the real GRUB directory, so that > you don't have to pass any extra options to tools like grub-reboot or > grub-set-default. This is how Fedora builds their signed grubx64.efi to behave. But you cannot ever run grub-install on a Secure Boot enabled computer, or you now have to learn all about signing your own binaries. I don't even like doing that, let alone saner users. So for those distros that support Secure Boot, in practice you're stuck with the behavior of their prebuilt GRUB binary that goes on the ESP. -- Chris Murphy
Re: GRUB writing to grubenv outside of kernel fs code
On Tue, Sep 18, 2018 at 12:16 PM, Andrei Borzenkov wrote: > 18.09.2018 08:37, Chris Murphy wrote: >> The patches aren't upstream yet? Will they be? >> > > I do not know. Personally I think much easier is to make grub location > independent of /boot, allowing grub be installed in separate partition. > This automatically covers all other cases (like MD, LVM etc). The only case I'm aware of where this happens is Fedora on UEFI, where they write grubenv and grub.cfg on the FAT ESP. I'm pretty sure upstream expects grubenv and grub.cfg at /boot/grub and I haven't ever seen it elsewhere (except Fedora on UEFI). I'm not sure this is much easier. Yet another volume that would be persistently mounted? Where? A nested mount at /boot/grub? I'm not liking that at all. Even Windows and macOS have saner and simpler to understand booting methods than this. -- Chris Murphy
Re: GRUB writing to grubenv outside of kernel fs code
On Tue, Sep 18, 2018 at 11:15 AM, Goffredo Baroncelli wrote: > On 18/09/2018 06.21, Chris Murphy wrote: >> b. The bootloader code, would have to have sophisticated enough Btrfs >> knowledge to know if the grubenv has been reflinked or snapshot, >> because even if +C, it may not be valid to overwrite, and COW must >> still happen, and there's no way the code in GRUB can do full blow COW >> and update a bunch of metadata. > > And what if GRUB ignore the possibility of COWing and overwrite the data ? Is > it a so big problem that the data is changed in all the snapshots ? > It would be interested if the same problem happens for a swap file. I think it's an abomination :-) It totally perverts the idea of reflinks and snapshots and blurs the line between domains. Is it a user file or not and are these user space commands or not and are they reliable or do they have exceptions? I have a boot subvolume mounted at /boot, and this boot subvolume gets snapshotted, and if GRUB can overwrite grubenv, it overwrites the purported GRUB state information in every one of those boots, going back maybe months, even when these are read only subvolumes. I think it's a problem, and near as I can tell it'll be a problem for all kinds of complex storage. I don't see how the bootloader itself can do an overwrite onto raid5 or raid6. That's certainly supported by GRUB for reading, but is the bootloader overwrite of grubenv going to recompute parity and write to multiple devices? Eek! -- Chris Murphy
Re: GRUB writing to grubenv outside of kernel fs code
On Mon, Sep 17, 2018 at 11:24 PM, Andrei Borzenkov wrote: > 18.09.2018 07:21, Chris Murphy wrote: >> On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy >> wrote: >>> https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F >>> >>> Does anyone know if this is still a problem on Btrfs if grubenv has >>> xattr +C set? In which case it should be possible to overwrite and >>> there's no csums that are invalidated. >>> >>> I kinda wonder if in 2018 it's specious for, effectively out of tree >>> code, to be making modifications to the file system, outside of the >>> file system. >> >> a. The bootloader code (pre-boot, not user space setup stuff) would >> have to know how to read xattr and refuse to overwrite a grubenv >> lacking xattr +C. >> b. The bootloader code, would have to have sophisticated enough Btrfs >> knowledge to know if the grubenv has been reflinked or snapshot, >> because even if +C, it may not be valid to overwrite, and COW must >> still happen, and there's no way the code in GRUB can do full blow COW >> and update a bunch of metadata. >> >> So answering my own question, this isn't workable. And it seems the >> same problem for dm-thin. >> >> There are a couple of reserve locations in Btrfs at the start and I >> think after the first superblock, for bootloader embedding. Possibly >> one or both of those areas could be used for this so it's outside the >> file system. But other implementations are going to run into this >> problem too. >> > > That's what SUSE grub2 version does - it includes patches to redirect > writes on btrfs to reserved area. I am not sure how it behaves in case > of multi-device btrfs though. The patches aren't upstream yet? Will they be? They redirect writes to grubenv specifically? Or do they use the reserved areas like a hidden and fixed location for what grubenv would contain? I guess the user space grub-editenv could write to grubenv, which, even if COWed, GRUB can pick up as a change. 
But GRUB itself writes its changes to a reserved area. Hmmm. Complicated. -- Chris Murphy
Re: GRUB writing to grubenv outside of kernel fs code
On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy wrote: > https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F > > Does anyone know if this is still a problem on Btrfs if grubenv has > xattr +C set? In which case it should be possible to overwrite and > there's no csums that are invalidated. > > I kinda wonder if in 2018 it's specious for, effectively out of tree > code, to be making modifications to the file system, outside of the > file system. a. The bootloader code (pre-boot, not user space setup stuff) would have to know how to read xattr and refuse to overwrite a grubenv lacking xattr +C. b. The bootloader code would have to have sophisticated enough Btrfs knowledge to know if the grubenv has been reflinked or snapshotted, because even if +C, it may not be valid to overwrite, and COW must still happen, and there's no way the code in GRUB can do full blown COW and update a bunch of metadata. So answering my own question, this isn't workable. And it seems like the same problem exists for dm-thin. There are a couple of reserved locations in Btrfs at the start and I think after the first superblock, for bootloader embedding. Possibly one or both of those areas could be used for this so it's outside the file system. But other implementations are going to run into this problem too. I'm not sure how else to describe state, unless NVRAM is sufficiently wear resilient to take writes possibly every day, for every boot, to indicate boot success/fail. -- Chris Murphy
GRUB writing to grubenv outside of kernel fs code
https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F Does anyone know if this is still a problem on Btrfs if grubenv has xattr +C set? In which case it should be possible to overwrite and there's no csums that are invalidated. I kinda wonder if in 2018 it's specious for, effectively out of tree code, to be making modifications to the file system, outside of the file system. -- Chris Murphy
Re: btrfs problems
…there's a known highly tested backport, because they want "The Behavior" to be predictable, both good and bad. That is not a model well suited for a file system like Btrfs that's in a really active development state. It's better now than it was even a couple years ago, where I'd say: just don't use RHEL or Debian or anything with old kernels except for experimenting; it's not worth the hassle; you're inevitably gonna have to use a newer kernel because all the Btrfs devs are busy making metric shittonnes of fixes in the mainline version. Today, it's not as bad as that. But still 4.9 is old in Btrfs terms. Should it be stable? For *your* problem for sure because that's just damn strange and something very goofy is going on. But is it possible there's a whole series of bugs happening in sequence that results in this kind of corruption? No idea. Maybe. And that's the main reason why quite a lot of users on this list use Fedora, Arch, Gentoo - so they're using the newest stable or even mainline rc kernels. And so if you want to run any file system, including Btrfs, in production with older kernels, you pick a distro that's doing that work. And right now it's openSUSE and SUSE that have the most Btrfs developers supporting Btrfs on 4.9 and 4.14 kernels. Most of those users are getting distro support; I don't often see SUSE users on here. OpenZFS is a different strategy because they're using out of tree code. So you can run older kernels, and compile the current openzfs code base against your older kernel. In effect you're using an older distro kernel, but with a new file system code base supported by that upstream. -- Chris Murphy
Re: Move data and mount point to subvolume
On Sun, Sep 16, 2018 at 12:40 PM, Rory Campbell-Lange wrote: > Thanks very much for spotting my error, Chris. > > # mount | grep bkp > /dev/mapper/cdisk2 on /bkp type btrfs > (rw,noatime,compress=lzo,space_cache,subvolid=5,subvol=/) > > # btrfs subvol list /bkp > ID 258 gen 313636 top level 5 path backup > > I'm a bit confused about the difference between / and backup, which is > at /bkp/backup. top level, subvolid=5, subvolid=0, subvol=/, FS_TREE are all the same thing. This is the subvolume that's created at mkfs time; it has no name and it can't be deleted, and at mkfs time if you do # btrfs sub get-default you get ID 5 (FS_TREE) So long as you haven't changed the default subvolume, the top level subvolume is what gets mounted, unless you use the "-o subvol=" or "-o subvolid=" mount option. If you do # btrfs sub list -ta /bkp it might become a bit more clear what the layout is on disk. And for an even more verbose output you can do: # btrfs insp dump-t -t fs_tree /dev/### For this you need to specify the device, not the mountpoint; you don't need to umount, it's a read only command. Anything that's in the "top level" (the "file system root") will be listed. The first number is the inode; you'll see 256 is a special inode for subvolumes. You can do 'ls -li' and compare. Any subvolume you create is not FS_TREE, it is a "file tree". And note that each subvolume has its own pile of inode numbers, meaning files/directories only have unique inode numbers *in a given subvolume*. Those inode numbers start over in a new subvolume. Subvolumes share extent, chunk, csum, uuid and other trees, so a subvolume is not a completely isolated "file system". > > Anyhow I've verified I can snapshot /bkp/backup to another subvolume. > This means I don't need to move anything, simply remount /bkp at > /bkp/backup. Uhh, that's the reverse of what you said in the first message. I'm not sure what you want to do. 
It sounds like you want to mount the subvolume "backup" at /bkp/ so that all the other files/dirs on this Btrfs volume are not visible through the /bkp/ mount path? Anyway if you want to explicitly mount the subvolume "backup" somewhere, you use -o subvol=backup to specify "the subvolume named backup, not the top level subvolume". > > Presumably I can therefore remount /bkp at subvolume /backup? > > # btrfs subvolume show /bkp/backup | egrep -i 'name|uuid|subvol' > Name: backup > UUID: d17cf2ca-a6db-ca43-8054-1fd76533e84b > Parent UUID:- > Received UUID: - > Subvolume ID: 258 > > My fstab is presently > > UUID=da90602a-b98e-4f0b-959a-ce431ac0cdfa /bkp btrfs > noauto,noatime,compress=lzo 0 2 > > I guess it would now be > > UUID=d17cf2ca-a6db-ca43-8054-1fd76533e84b /bkp btrfs > noauto,noatime,compress=lzo 0 2 No, you can't mount by subvolume UUID. You continue to specify the volume UUID, but then add a mount option noauto,noatime,compress=lzo,subvol=backup or noauto,noatime,compress=lzo,subvolid=258 The advantage of subvolid is that it doesn't change when you rename the subvolume. > >> If you snapshot a subvolume, which itself contains subvolumes, the >> nested subvolumes are not snapshot. In the snapshot, the nested >> subvolumes are empty directories. >> >> > >> > # btrfs fi du -s /bkp/backup-subvol/backup >> > Total Exclusive Set shared Filename >> > ERROR: cannot check space of '/bkp/backup-subvol/backup': Inappropriate >> > ioctl for device >> >> That's a bug in older btrfs-progs. It's been fixed, but I'm not sure >> what version, maybe by 4.14? > > Sounds about right -- my version is 4.7.3. It's not dangerous to use it (maybe --repair is more dangerous, but don't use it without advice first, no matter the version). You just don't get new features and bug fixes. It's also not dangerous to use something much newer; again, if the user space tools are very new and the kernel is old, you just don't get certain features. -- Chris Murphy
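For the archives, the two working fstab variants would look like this (a sketch using the volume UUID and subvolume ID quoted above; only one of the two lines would actually go in fstab):

```
# mount the subvolume named "backup" by name
UUID=da90602a-b98e-4f0b-959a-ce431ac0cdfa  /bkp  btrfs  noauto,noatime,compress=lzo,subvol=backup   0 2
# or by ID, which survives a rename of the subvolume
UUID=da90602a-b98e-4f0b-959a-ce431ac0cdfa  /bkp  btrfs  noauto,noatime,compress=lzo,subvolid=258    0 2
```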
Re: btrfs problems
On Sun, Sep 16, 2018 at 7:58 AM, Adrian Bastholm wrote: > Hello all > Actually I'm not trying to get any help any more, I gave up BTRFS on > the desktop, but I'd like to share my efforts of trying to fix my > problems, in hope I can help some poor noob like me. There's almost no useful information provided for someone to even try to reproduce your results, isolate cause and figure out the bugs. No kernel version. No btrfs-progs version. No description of the hardware and how it's laid out, and what mkfs and mount options are being used. No one really has the time to speculate. > BTRFS check --repair is not recommended Right. So why did you run it anyway? man btrfs check: Warning Do not use --repair unless you are advised to do so by a developer or an experienced user It is always a legitimate complaint, despite this warning, if btrfs check --repair makes things worse, because --repair shouldn't ever make things worse. But Btrfs repairs are complicated, and that's why the warning is there. I suppose the devs could have made the flag --riskyrepair but I doubt this would really slow users down that much. Many of the --repair fixes weren't known to make things worse at the time, and edge cases where they made things worse kept popping up, so only in hindsight does it seem --repair maybe could have been called something different to catch the user's attention. But anyway, I see this same sort of thing on the linux-raid list all the time. People run into trouble, and they press full forward making all kinds of changes, each change increasing the chance of data loss. And then they come on the list with WTF messages. And it's always a lesson in patience for the list regulars and developers... if only you'd come to us with questions sooner. > Please have a look at the console logs. These aren't logs. It's a record of shell commands. Logs would include kernel messages, ideally all of them. Why is device 3 missing? We have no idea. 
Most of the Btrfs code is in the kernel, and problems are reported by the kernel, so we need kernel messages; user space messages aren't enough. Anyway, good luck with openzfs, cool project. -- Chris Murphy
Re: Move data and mount point to subvolume
> So I did this: > > btrfs subvol snapshot /bkp /bkp/backup-subvol > > strangely while /bkp/backup has lots of files in it, > /bkp/backup-subvol/backup has none. > > # btrfs subvol list /bkp > ID 258 gen 313585 top level 5 path backup > ID 4782 gen 313590 top level 5 path backup-subvol OK so previously you said "/bkp which is a top level subvolume. There are no other subvolumes." But in fact backup is already a subvolume. So now it's confusing what you were asking for in the first place; maybe you didn't realize backup is not a dir but a subvolume. If you snapshot a subvolume which itself contains subvolumes, the nested subvolumes are not snapshotted. In the snapshot, the nested subvolumes are empty directories. > > # btrfs fi du -s /bkp/backup-subvol/backup > Total Exclusive Set shared Filename > ERROR: cannot check space of '/bkp/backup-subvol/backup': Inappropriate > ioctl for device That's a bug in older btrfs-progs. It's been fixed, but I'm not sure what version, maybe by 4.14? > > Any ideas about what could be going on? > > In the mean time I'm trying: > > btrfs subvol create /bkp/backup-subvol > cp -prv --reflink=always /bkp/backup/* /bkp/backup-subvol/ Yeah, that will take a lot of writes that are not necessary, now that you see backup is a subvolume already. If you want a copy of it, just snapshot it. -- Chris Murphy
Re: Move data and mount point to subvolume
On Sun, Sep 16, 2018 at 5:14 AM, Rory Campbell-Lange wrote: > Hi > > We have a backup machine that has been happily running its backup > partitions on btrfs (on top of a luks encrypted disks) for a few years. > > Our backup partition is on /bkp which is a top level subvolume. > Data, RAID1: total=2.52TiB, used=1.36TiB > There are no other subvolumes. and > /dev/mapper/cdisk2 on /bkp type btrfs > (rw,noatime,compress=lzo,space_cache,subvolid=5,subvol=/) I like Hans' 2nd email advice to snapshot the top level subvolume. I would start out with: btrfs sub snap -r /bkp /bkp/toplevel.ro And that way I shouldn't be able to F this up irreversibly if I make a mistake. :-D And then do another snapshot that's rw: btrfs sub snap /bkp /bkp/bkpsnap cd /bkp/bkpsnap Now remove everything except "backupdir". Then move everything out of backupdir including any hidden files. Then rmdir backupdir. Then you can rename the snapshot/subvolume cd .. mv bkpsnap backup That's less metadata writes than creating a new subvolume, and reflink copying the backup dir, e.g. cp -a --reflink /bkp/backupdir /bkp/backupsubvol That could take a long time because all the metadata is fully read, modified (new inodes) and written out. But either way it should work. -- Chris Murphy
Re: state of btrfs snapshot limitations?
On Fri, Sep 14, 2018 at 3:05 PM, James A. Robinson wrote: > https://btrfs.wiki.kernel.org/index.php/Incremental_Backup > > talks about the basic snapshot capabilities of btrfs and led > me to look up what, if any, limits might apply. I find some > threads from a few years ago that talk about limiting the > number of snapshots for a volume to 100. It does seem variable and I'm not certain what the pattern is that triggers pathological behavior. There's a container thread about a year ago with someone using docker on Btrfs with more than 100K containers per day, but I don't know the turnover rate. That person does say it's deletion that's expensive but not intolerably so. My advice is you come up with as many strategies as you can implement. Because if one strategy starts to implode with terrible performance, you can just bail on it (or try fixing it, or submitting bug reports to make Btrfs better down the road, etc.), and yet you still have one or more other strategies that are still viable. By strategy, you might want to implement both your ideal and conservative approaches, and also something in the middle. Also, it's reasonable to mirror those strategies on a different storage stack, e.g. LVM thin volumes and XFS. LVM thin volumes are semi-cheap to create, and semi-cheap to delete; whereas Btrfs snapshots are almost free to create, and expensive to delete (varies depending on changes in it or the subvolume it's created from). But if the LVM thin pool's metadata pool runs out of space, it's big trouble. I expect to lose all the LVs if that ever happens. Also, this strategy doesn't have send/receive, so ordinary use of rsync is expensive since it reads and compares both source and destination. The first answer for this question contains a possible workaround depending on hard links. 
https://serverfault.com/questions/489289/handling-renamed-files-or-directories-in-rsync With Btrfs, the big scalability issue is the extent tree, which is shared among all snapshots and subvolumes. Therefore, the bigger the file system gets, in effect the more fragile the extent tree becomes. The other thing is that btrfs check is super slow with large volumes; some people have a dozen or more TiB file systems that take days to check. I also agree with the noatime suggestion from Hans. Note this is a per subvolume mount time option, so if you're using the subvol= or subvolid= mount options, you need noatime every time; once per file system isn't enough. -- Chris Murphy
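To make the per-mount noatime point concrete, here's a sketch of an fstab with two subvolume mounts of the same file system (placeholder UUID and subvolume names, not from this thread). Each line carries its own noatime; the option on the first mount doesn't propagate to the second:

```
UUID=<fs-uuid>  /data        btrfs  noatime,subvol=data       0 0
UUID=<fs-uuid>  /data/snaps  btrfs  noatime,subvol=snapshots  0 0
```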
Re: btrfs send hangs after partial transfer and blocks all IO
(resend to all) On Thu, Sep 13, 2018 at 9:44 AM, Nikolay Borisov wrote: > > > On 13.09.2018 18:30, Chris Murphy wrote: >> This is the 2nd or 3rd thread containing hanging btrfs send, with >> kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait" >> and the other I can't find at the moment. In that case though the hang >> is reproducible in 4.14.x and weirdly it only happens when a snapshot >> contains (perhaps many) reflinks. Scrub and check lowmem find nothing >> wrong. >> >> I have snapshots with a few reflinks (cp --reflink and also >> deduplication), and I see maybe 15-30 second hangs where nothing is >> apparently happening (in top or iotop), but I'm also not seeing any >> blocked tasks or high CPU usage. Perhaps in my case it's just >> recovering quickly. >> >> Are there any kernel config options in "# Debug Lockups and Hangs" >> that might hint at what's going on? Some of these are enabled in >> Fedora debug kernels, which are built practically daily, e.g. right >> now the latest in the build system is 4.19.0-0.rc3.git2.1 - which >> translates to git 54eda9df17f3. > > If it's a lock-related problem then you need Lock Debugging => Lock > debugging: prove locking correctness OK looks like that's under a different section as CONFIG_PROVE_LOCKING which is enabled on Fedora debug kernels. # Debug Lockups and Hangs CONFIG_LOCKUP_DETECTOR=y CONFIG_SOFTLOCKUP_DETECTOR=y # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0 CONFIG_HARDLOCKUP_DETECTOR_PERF=y CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y CONFIG_HARDLOCKUP_DETECTOR=y # CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0 # Lock Debugging (spinlocks, mutexes, etc...) CONFIG_LOCK_DEBUGGING_SUPPORT=y CONFIG_PROVE_LOCKING=y CONFIG_LOCK_STAT=y CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_LOCKDEP=y # CONFIG_DEBUG_LOCKDEP is not set # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set CONFIG_LOCK_TORTURE_TEST=m -- Chris Murphy
Re: btrfs send hangs after partial transfer and blocks all IO
This is the 2nd or 3rd thread containing hanging btrfs send, with kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait" and the other I can't find at the moment. In that case though the hang is reproducible in 4.14.x and weirdly it only happens when a snapshot contains (perhaps many) reflinks. Scrub and check lowmem find nothing wrong. I have snapshots with a few reflinks (cp --reflink and also deduplication), and I see maybe 15-30 second hangs where nothing is apparently happening (in top or iotop), but I'm also not seeing any blocked tasks or high CPU usage. Perhaps in my case it's just recovering quickly. Are there any kernel config options in "# Debug Lockups and Hangs" that might hint at what's going on? Some of these are enabled in Fedora debug kernels, which are built practically daily, e.g. right now the latest in the build system is 4.19.0-0.rc3.git2.1 - which translates to git 54eda9df17f3. Chris Murphy
Re: btrfs send hung in pipe_wait
On Sun, Sep 9, 2018 at 2:16 PM, Stefan Loewen wrote: > I'm not sure about the exact definition of "blocked" here, but I was > also surprised that there were no blocked tasks listed since I'm > definitely unable to kill (SIGKILL) that process. > On the other hand it wakes up hourly to transfer a few bytes. > The problem is definitely not, that I issued the sysrq too early. I > think it was after about 45min of no IO. Another one the devs have asked for in cases where things get slow or hang, but without explicit blocked task messages, is sysrq + t. But I'm throwing spaghetti at a wall at this point, none of it will fix the problem, and I haven't learned how to read these outputs. > So there is some problem with this "original" subvol. Maybe I should > describe how that came into existence. > Initially I had my data on a NTFS formatted drive. I then created a > btrfs partition on my second drive and rsynced all my stuff over into > the root subvol. > Then I noticed that having all my data in the root subvol was a bad > idea and created a "data" subvol and reflinked everything into it. > I deleted the data from the root subvol, made a snapshot of the "data" > subvol, tried sending that and ran into the problem we're discussing > here. That is interesting and useful information. I see nothing invalid about it at all. However, just for future reference it is possible to snapshot the top level (default) subvolume. By default, the top level subvolume (sometimes referred to as subvolid=5 or subvolid=0) is what is mounted if you haven't used 'btrfs sub set-default' to change it. You can snapshot that subvolume by snapshotting the mount point. e.g. mount /dev/sda1 /mnt btrfs sub snap /mnt /mnt/subvolume1 So now you have a readwrite subvolume called "subvolume1" which contains everything that was in the top level, which you can now delete if you're trying to keep things tidy and just have subvolumes and snapshots in the top level. 
Anyway, what you did is possibly relevant to the problem. But if it turns out it's the cause of the problem, it's definitely a bug. > > btrfs check in lowmem mode did not find any errors either: > > $ sudo btrfs check --mode=lowmem --progress /dev/sdb1 > Opening filesystem to check... > Checking filesystem on /dev/sdb1 > UUID: cd786597-3816-40e7-bf6c-d585265ad372 > [1/7] checking root items (0:00:30 elapsed, > 1047408 items checked) > [2/7] checking extents (0:03:55 elapsed, > 309170 items checked) > cache and super generation don't match, space cache will be invalidated > [3/7] checking free space cache(0:00:00 elapsed) > [4/7] checking fs roots(0:04:07 elapsed, 85373 > items checked) > [5/7] checking csums (without verifying data) (0:00:00 elapsed, > 253106 items checked) > [6/7] checking root refs done with fs roots in lowmem mode, skipping > [7/7] checking quota groups skipped (not enabled on this FS) > found 708354711552 bytes used, no error found > total csum bytes: 689206904 > total tree bytes: 2423865344 > total fs tree bytes: 1542914048 > total extent tree bytes: 129843200 > btree space waste bytes: 299191292 > file data blocks allocated: 31709967417344 > referenced 928531877888 OK good to know. -- Chris Murphy
Re: btrfs send hung in pipe_wait
I don't see any blocked tasks. I wonder if you were too fast with sysrq w? Maybe it takes a little bit for the blocked task to actually develop? I suggest also 'btrfs check --mode=lowmem' because that is a separate implementation of btrfs check that tends to catch different things than the original. It is slow, however. -- Chris Murphy
Re: btrfs send hung in pipe_wait
On Fri, Sep 7, 2018 at 11:07 AM, Stefan Loewen wrote: > List of steps: > - 3.8G iso lays in read-only subvol A > - I create subvol B and reflink-copy the iso into it. > - I create a read-only snapshot C of B > - I "btrfs send --no-data C > /somefile" > So you got that right, yes. OK I can't reproduce it. Sending A and C complete instantly with --no-data, and complete in the same time with a full send/receive. In my case I used a 4.9G ISO. I can't think of what local difference accounts for what you're seeing. There is really nothing special about --reflinks. The extent and csum data are identical to the original file, and that's the bulk of the metadata for a given file. What I can tell you is usually the developers want to see sysrq+w whenever there are blocked tasks. https://fedoraproject.org/wiki/QA/Sysrq You'll want to enable all sysrq functions. And next you'll want three ssh shells: 1. sudo journalctl -fk 2. sudo -i to become root, and then echo w > /proc/sysrq-trigger but do not hit return yet 3. sudo btrfs send... to reproduce the problem. Basically the thing is gonna hang soon after you reproduce the problem, so you want to get to shell #2 and just hit return rather than dealing with long delays typing that echo command out. And then the journal command is so your local terminal captures the sysrq output because you're gonna kill the VM instead of waiting it out. I have no idea how to read these things but someone might pick up this thread and have some idea why these tasks are hanging. 
> > Unfortunately I don't have any way to connect the drive to a SATA port > directly but I tried to switch out as much of the used setup as > possible (all changes active at the same time): > - I got the original (not the clone) HDD out of the enclosure and used > this adapter to connect it: > https://www.amazon.de/DIGITUS-Adapterkabel-40pol-480Mbps-schwarz/dp/B007X86VZK > - I used a different Notebook > - I ran the test natively on that notebook (instead of from > VirtualBox. I used VirtualBox for most of the tests as I have to > force-poweroff the PC everytime the btrfs-send hangs as it is not > killable) This problem only happens in VirtualBox? Or does it happen on bare metal also? And we've established it happens with two different source (send) devices, which means two different Btrfs volumes. All I can say is you need to keep changing things up, process of elimination. Rather tedious. Maybe you could try downloading a Fedora 28 ISO, make a boot stick out of it, and try to reproduce with the same drives. At least that's an easy way to isolate the OS from the equation. -- Chris Murphy
Re: compiling btrfs-progs 4.17.1 gives error "reiserfs/misc.h: No such file or directory"
On Fri, Sep 7, 2018 at 3:56 AM, Jürgen Herrmann wrote: > Hello! > > I'm having a problem with btrfs send which stops after several seconds. > The process hangs with 100% cpu time on one cpu. The system is still > responsive to input but no io is happening anymore so the system > basically becomes unuseable. What kernel? Latest stable is 4.18.6, but I want to make sure that's what you're using; someone else has reported btrfs send problems in another thread with 4.18.5 that sound similar. -- Chris Murphy
Re: btrfs send hung in pipe_wait
On Fri, Sep 7, 2018 at 6:47 AM, Stefan Loewen wrote: > Well... It seems it's not the hardware. > I ran a long SMART check which ran through without errors and > reallocation count is still 0. That only checks the drive, it's an internal test. It doesn't check anything else, including connections. Also you do have a log with a read error and a sector LBA reported. So there is a hardware issue, it could just be transient. > So I used clonezilla (partclone.btrfs) to mirror the drive to another > drive (same model). > Everything copied over just fine. No I/O error im dmesg. > > The new disk shows the same behavior. So now I'm suspicious of USB behavior. Like I said earlier, when I've got USB enclosed drives connected to my NUC, regardless of file system, I routinely get hangs and USB resets. I have to connect all of my USB enclosed drives to a good USB hub, or I have problems. > So I created another subvolume, reflinked stuff over and found that it > is enough to reflink one file, create a read-only snapshot and try to > btrfs-send that. It's not happening with every file, but there are > definitely multiple different files. The one I tested with is a 3.8GB > ISO file. > Even better: > 'btrfs send --no-data snap-one > /dev/null' > (snap-one containing just one iso file) hangs as well. Do you have a list of steps to make this clear? It sounds like first you copy a 3.8G ISO file to one subvolume, then reflink copy it into another subvolume, then snapshot that 2nd subvolume, and try to send the snapshot? But I want to be clear. I've got piles of reflinked files in snapshots and they send OK, although like I said I do sometimes get a 15-30 second hang during sends. 
> > copying the same file without --reflink, snapshotting and sending > works without problems. > > I guess that pretty much eleminates bad sectors and points towards > some problem with reflinks / btrfs metadata. That's pretty weird. I'll keep trying and see if I hit this. What happens if you downgrade to an older kernel? Either 4.14 or 4.17 or both. The send code is mainly in the kernel, whereas the receive code is mainly in the user space tools, so for this testing you don't need to downgrade user space tools. If there's a bug here, I expect it's kernel. -- Chris Murphy
Re: btrfs send hung in pipe_wait
On Thu, Sep 6, 2018 at 2:16 PM, Stefan Loewen wrote: > Data,single: Size:695.01GiB, Used:653.69GiB > /dev/sdb1 695.01GiB > Metadata,DUP: Size:4.00GiB, Used:2.25GiB > /dev/sdb1 8.00GiB > System,DUP: Size:40.00MiB, Used:96.00KiB > Does that mean Metadata is duplicated? Yes. Single copy for data. Duplicate for metadata+system, and there are no single chunks for metadata/system. > > Ok so to summarize and see if I understood you correctly: > There are bad sectors on disk. Running an extended selftest (smartctl -t > long) could find those and replace them with spare sectors. More likely if it finds a persistently failing sector, it will just record the first failing sector LBA in its log, and then abort. You'll see this info with 'smartctl -a' or with -x. It is possible to resume the test using the selective option and picking a 4K aligned 512 byte LBA value after the 4K sector with the defect. Just because only one is reported in dmesg doesn't mean there isn't another bad one. It's unlikely the long test is going to actually fix anything; it'll just give you more ammunition for getting a likely under warranty device replaced because it really shouldn't have any issues at this age. > If it does not I can try calculating the physical (4K) sector number and > write to that to make the drive notice and mark the bad sector. > Is there a way to find out which file I will be writing to beforehand? I'm not sure how to do it easily. >Or is > it easier to just write to the sector and then wait for scrub to tell me > (and the sector is broken anyways)? If it's a persistent read error, then it's lost. So you might as well overwrite it. If it's data, scrub will tell you what file is corrupted (and restore can help you recover the whole file, of course it'll have a 4K hole of zeros in it). If it's metadata, Btrfs will fix up the 4K hole with duplicate metadata. The gotcha is to make certain you've got the right LBA to write to. 
You can use dd to test this, by reading the suspect bad sector, and if you've got the right one, you'll get an I/O error in user space and dmesg will have a message like before with sector value. Use the dd skip= flag for reading, but make *sure* you use seek= when writing *and* make sure you always use bs=4096 count=1 so that if you make a mistake you limit the damage haha. > > For the drive: Not under warranty anymore. It's an external HDD that I had > lying around for years, mostly unused. Now I wanted to use it as part of my > small DIY NAS. Gotcha. Well you can read up on smartctl and smartd, and set it up for regular extended tests, and keep an eye on rapidly changing values. It might give you a 50/50 chance of an early heads up before it dies. I've got an old Hitachi/Apple laptop drive that years ago developed multiple bad sectors in different zones of the drive. They got remapped and I haven't had a problem with that drive since. *shrug* And in fact I did get a discrete error message from the drive for one of those and Btrfs overwrote that bad sector with a good copy (it's in a raid1 volume), so working as designed I guess. Since you didn't get a fix up message from Btrfs, either the whole thing just got confused with hanging tasks, or it's possible it's a data block. -- Chris Murphy
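Since mixing up skip= and seek= on a real disk is destructive, it's worth rehearsing on a scratch file first. A minimal sketch (the file and the pattern are made up for illustration) showing that skip= selects the input sector, seek= the output sector, and bs=4096 count=1 caps any mistake at one 4K sector:

```shell
# Build an 8-sector (32 KiB) scratch "disk" of zeros.
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=4096 count=8 status=none

# Plant a recognizable pattern in sector 3 (the 4th 4K block).
printf 'BADSECTOR' | dd of="$tmp" bs=4096 seek=3 conv=notrunc status=none

# Reading: skip= skips input sectors, so this reads only sector 3.
before=$(dd if="$tmp" bs=4096 skip=3 count=1 status=none | tr -d '\0')
echo "sector 3 before: $before"

# Writing: seek= positions the output, so only sector 3 is overwritten.
dd if=/dev/zero of="$tmp" bs=4096 seek=3 count=1 conv=notrunc status=none
after=$(dd if="$tmp" bs=4096 skip=3 count=1 status=none | tr -d '\0')
echo "sector 3 after is empty: ${after:-yes}"
rm -f "$tmp"
```

On the real device you'd substitute /dev/sdX and the computed 4K sector number, and conv=notrunc is unnecessary for a block device; everything else is the same shape.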
Re: btrfs send hung in pipe_wait
On Thu, Sep 6, 2018 at 12:36 PM, Stefan Loewen wrote: > Output of the commands is attached. fdisk Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes smart Sector Sizes: 512 bytes logical, 4096 bytes physical So clearly the enclosure is lying about the actual physical sector size of the drive. It's very common. But it means a fix for the bad sector, by writing to it, must be a 4K write. A 512 byte write to the reported LBA will fail because it is a RMW (read-modify-write), and the read will fail. So if you write to that sector, you'll get a read failure. Kinda confusing. So you can convert the LBA to a 4K value, and use dd to write to that "4K LBA" using bs=4096 and a count of 1, but only when you're ready to lose all 4096 bytes in that sector. If it's data, it's fine. It's the loss of one file, and scrub will find and report the path to the file so you know what was affected. If it's metadata, it could be a problem. What do you get for 'btrfs fi us ' for this volume? I'm wondering if DUP metadata is being used across the board with no single chunks. If so, then you can zero that sector, and Btrfs will detect the missing metadata in that chunk on scrub, and fix it up from a copy. But if you only have single copy metadata, it just depends what's on that block as to how recoverable or repairable this is. 195 Hardware_ECC_Recovered -O-RCK 100 100 000-0 196 Reallocated_Event_Count -O--CK 252 252 000-0 197 Current_Pending_Sector -O--CK 252 252 000-0 198 Offline_Uncorrectable CK 252 252 000-0 Interesting, no complaints there. Unexpected. 11 Calibration_Retry_Count -O--CK 100 100 000-8 200 Multi_Zone_Error_Rate -O-R-K 100 100 000-31 https://kb.acronis.com/content/9136 This is a low hour device, probably still under warranty? I'd get it swapped out. 
If you want more ammunition for arguing in favor of a swap out under warranty you could do

smartctl -t long /dev/sdb

That will take just under 4 hours to run (you can use the drive in the meantime, but it'll take a bit longer); and then after that

smartctl -x /dev/sdb

and see if it's found a bad sector or updated any of those smart values for the worse, in particular the offline values.

SCT (Get) Error Recovery Control command failed

OK so not configurable, it is whatever it is and we don't know what that is. Probably one of the really long recoveries.

> The broken-sector-theory sounds plausible and is compatible with my new
> findings:
> I suspected the problem to be in one specific directory, let's call it
> "broken_dir".
> I created a new subvolume and copied broken_dir over.
> - If I copied it with cp --reflink, made a snapshot and tried to btrfs-send
> that, it hung
> - If I rsynced broken_dir over I could snapshot and btrfs-send without a
> problem.

Yeah I'm not sure what it is, maybe a data block.

> But shouldn't btrfs scrub or check find such errors?

Nope. Btrfs expects the drive to complete the read command, but always second-guesses the content of the read by comparing it to checksums. So if the drive just supplied corrupt data, Btrfs would detect that and discreetly report it, and if there's a good copy it would self-heal. But it can't do that here, because the drive or USB bus also seems to hang in such a way that a bunch of tasks are also hung, and none of them are getting a clear pass/fail for the read. It just hangs. Arguably the device or the link should not hang.

So I'm still wondering if something else is going on, but this is just the most obvious first problem, and maybe it's being complicated by another problem we haven't figured out yet. Anyway, once this problem is solved, it'll become clear if there are additional problems or not.
In my case, I often get usb reset errors when I directly connect USB 3.0 drives to my Intel NUC, but I don't ever get them when plugging the drive into a dyconn hub. So if you don't already have a hub in between the drive and the computer, it might be worth considering. Basically the hub is going to read and completely rewrite the whole stream that goes through it (in both directions). -- Chris Murphy
Re: btrfs send hung in pipe_wait
On Thu, Sep 6, 2018 at 10:03 AM, Stefan Löwen wrote: > I have one subvolume (rw) and 2 snapshots (ro) of it. > > I just tested 'btrfs send > /dev/null' and that also shows no IO > after a while but also no significant CPU usage. > During this I tried 'ls' on the source subvolume and it hangs as well. > dmesg has some interesting messages I think (see attached dmesg.log) > OK you've got a different problem. [ 186.898756] sd 2:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK [ 186.898762] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 15 26 a0 d0 00 08 00 00 [ 186.898764] print_req_error: I/O error, dev sdb, sector 354853072 [ 187.109641] usb 2-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd [ 187.345245] usb 2-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd [ 187.657844] usb 2-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd [ 187.851336] usb 2-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd [ 188.026882] usb 2-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd [ 188.215881] usb 2-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd [ 188.247028] sd 2:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK [ 188.247041] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 15 26 a8 d0 00 08 00 00 [ 188.247048] print_req_error: I/O error, dev sdb, sector 354855120 This is a read error for a specific sector. So your drive has media problems. And I think that's the instigating problem here, from which a bunch of other tasks that depend on one or more reads completing but never do. But weirdly there also isn't any kind of libata reset. At least on SATA, by default we see a link reset after a command has not returned in 30 seconds. That reset would totally clear the drive's command queue, and then things either can recover or barf. But in your case, neither happens and it just sits there with hung tasks. 
[ 189.350360] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0 And that's the last we really see from Btrfs. After that, it's all just hung task traces and are rather unsurprising to me. Drives in USB cases add a whole bunch of complicating factors for troubleshooting and repair. Including often masking the actual logical and physical sector size, the min and max IO size, alignment offset, and all kinds of things. They can have all sorts of bugs. And I'm also not totally certain about the relationship between the usb reset messages and the bad sector. As far as I know the only way we can get a sector LBA expressly noted in dmesg along with the failed read(10) command, is if the drive has reported back to libata that discrete error with sense information. So I'm accepting that as a reliable error, rather than it being something like a cable. But the reset messages could possibly be something else in addition to that. Anyway, the central issue is sector 354855120 is having problems. I can't tell from the trace if it's transient or persistent. Maybe if it's transient, that would explain how you sometimes get send to start working again briefly but then it reverts to hanging. What do you get for: fdisk -l /dev/sdb smartctl -x /dev/sdb smartctl -l sct erc /dev/sdb Those are all read only commands, nothing is written or changed. -- Chris Murphy
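As a small sketch, the failing sector number can be pulled out of a dmesg line like the ones quoted above for use in later arithmetic (the sample line is copied from this thread; in real use you'd pipe dmesg into the sed):

```shell
line='print_req_error: I/O error, dev sdb, sector 354855120'
sector=$(printf '%s\n' "$line" | sed -n 's/.*sector \([0-9][0-9]*\).*/\1/p')
echo "$sector"   # prints 354855120

# real use, de-duplicated:
#   dmesg | sed -n 's/.*print_req_error:.*sector \([0-9][0-9]*\).*/\1/p' | sort -un
```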
Re: btrfs send hung in pipe_wait
On Thu, Sep 6, 2018 at 9:04 AM, Stefan Loewen wrote: > Update: > It seems like btrfs-send is not completely hung. It somewhat regularly > wakes up every hour to transfer a few bytes. I noticed this via a > periodic 'ls -l' on the snapshot file. These are the last outputs > (uniq'ed): > > -rw--- 1 root root 1492797759 Sep 6 08:44 intenso_white.snapshot > -rw--- 1 root root 1493087856 Sep 6 09:44 intenso_white.snapshot > -rw--- 1 root root 1773825308 Sep 6 10:44 intenso_white.snapshot > -rw--- 1 root root 1773976853 Sep 6 11:58 intenso_white.snapshot > -rw--- 1 root root 1774122301 Sep 6 12:59 intenso_white.snapshot > -rw--- 1 root root 1774274264 Sep 6 13:58 intenso_white.snapshot > -rw--- 1 root root 1774435235 Sep 6 14:57 intenso_white.snapshot > > I also monitor the /proc/3022/task/*/stack files with 'tail -f' (I > have no idea if this is useful) but there are no changes, even during > the short wakeups. I have a sort of "me too" here. I definitely see btrfs send just hang for no apparent reason, but in my case it's for maybe 15-30 seconds. Not an hour. Looking at top and iotop at the same time as the LED lights on the drives, there's definitely zero activity happening. I can make things happen during this time - like I can read a file or save a file from/to any location including the send source or receive destination. It really just behaves as if the send thread is saying "OK I'm gonna nap now, back in a bit" and then it is. So what I end up with on drives with a minimum read-write of 80M/s, is a send receive that's getting me a net of about 30M/s. I have around 100 snapshots on the source device. How many total snapshots do you have on your source? That does appear to affect performance for some things, including send/receive. -- Chris Murphy
nbdkit as a flexible alternative to loopback mounts
https://rwmj.wordpress.com/2018/09/04/nbdkit-as-a-flexible-alternative-to-loopback-mounts/

This is a pretty cool writeup. I can vouch that Btrfs will format, mount, write to, and scrub an 8EiB (virtual) disk, and that btrfs check works on it. The one thing I thought might cause a problem is the nbd device has a 1KiB sector size, but Btrfs (on x86_64) still uses a 4096 byte "sector" and it all seems to work fine despite that.

Anyway, maybe it's useful for some fstests instead of file-backed losetup devices?

-- Chris Murphy
Re: RAID1 & BTRFS critical (device sda2): corrupt leaf, bad key order
On Tue, Sep 4, 2018 at 10:22 AM, Etienne Champetier wrote:
> Do you have a procedure to copy all subvolumes & skip error ? (I have
> ~200 snapshots)

If they're already read-only snapshots, then script an iteration of btrfs send/receive to a new volume.

Btrfs seed-sprout would be ideal, however in this case I don't think it can help because a.) it's temporarily one file system, which could mean the corruption is inherited; and b.) I'm not sure it's multiple device aware, so either btrfstune -S1 might fail on 2+ device Btrfs volumes, or possibly it insists on a two device sprout in order to replicate a two device seed.

If they're not already read-only, it's tricky because it sounds like mounting rw is possibly risky, and taking read-only snapshots might fail anyway. There is no way to make read-only snapshots unless the volume can be written to; no way to force a rw subvolume to be treated as if it were read-only even if the volume is mounted read-only; and it takes a read-only subvolume for send to work.

-- Chris Murphy
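Scripting that iteration might look something like the sketch below. The snapshot and destination paths are hypothetical stand-ins, and it only prints each send/receive command so the plan can be reviewed before anything runs:

```shell
# Demo stand-ins for the real mount points (hypothetical paths):
SRC=$(mktemp -d)   # pretend: directory holding the read-only snapshots
DST=/mnt/new       # pretend: freshly formatted replacement volume
mkdir "$SRC/root-2018-09-01" "$SRC/root-2018-09-02"

for snap in "$SRC"/*; do
    [ -d "$snap" ] || continue
    # Dry run: print the command. To actually replicate, run:
    #   btrfs send "$snap" | btrfs receive "$DST"
    echo "btrfs send $snap | btrfs receive $DST"
done
```

A real version would also want to stop on the first send error (that's the "skip error" part the poster asked about: log the failing snapshot, continue with the rest).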
Re: Does ssd auto detected work for microSD cards?
On Mon, Sep 3, 2018 at 7:53 PM, GWB wrote: > Curious instance here, but perhaps this is the expected behaviour: > > mount | grep btrfs > /dev/sdb3 on / type btrfs (rw,ssd,subvol=@) > /dev/sdb3 on /home type btrfs (rw,ssd,subvol=@home) > /dev/sde1 on /media/gwb09/btrfs-32G-MicroSDc type btrfs > (rw,nosuid,nodev,uhelper=udisks2) > > This is on an Ubuntu 14 client. > > /dev/sdb is indeed an ssd, a Samsung 850 EVO 500Gig, where Ubuntu runs > on btrfs root. It appears btrfs did indeed auto detected an ssd > drive. However: > > /dev/sde is a micro SD card (32Gig Samsung) sitting in a USB 3 card > reader, inserted into a USB 3 card slot. But ssh is not detected. > > So is that the expected behavior? cat /sys/block/sde/queue/rotational That's what Btrfs uses for detection. I'm willing to bet the SD Card slot is not using the mmc driver, but instead USB and therefore always treated as a rotational device. > If not, does it make a difference? > > Would it be best to mount an sd card with ssd_spread? For the described use case, it probably doesn't make much of a difference. It sounds like these are fairly large contiguous files, ZFS send files. I think for both SDXC and eMMC, F2FS is probably more applicable overall than Btrfs due to its reduced wandering trees problem. But again for your use case it may not matter much. > Yet another side note: both btrfs and zfs are now "housed" at Oracle > (and most of java, correct?). Not really. The ZFS we care about now is OpenZFS, forked from Oracle's ZFS. And a bunch of people not related to Oracle do that work. And Btrfs has a wide assortment of developers: Facebook, SUSE, Fujitsu, Oracle, and more. -- Chris Murphy
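A quick way to see what the kernel reports for every block device (the sysfs path is as described above; the guard is there because the file can be absent for some virtual devices):

```shell
for f in /sys/block/*/queue/rotational; do
    [ -e "$f" ] || continue
    # 1 = treated as rotational (no implicit ssd mount option); 0 = non-rotational
    printf '%s: %s\n' "$f" "$(cat "$f")"
done
```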
Re: IO errors when building RAID1.... ?
On Mon, Sep 3, 2018 at 4:23 AM, Adam Borowski wrote:
> On Sun, Sep 02, 2018 at 09:15:25PM -0600, Chris Murphy wrote:
>> For > 10 years drive firmware handles bad sector remapping internally.
>> It remaps the sector logical address to a reserve physical sector.
>>
>> NTFS and ext[234] have a means of accepting a list of bad sectors, and
>> will avoid using them. Btrfs doesn't. But also ZFS, XFS, APFS, HFS+
>> and I think even FAT, lack this capability.
>
> FAT entry FF7 (FAT12)/FFF7 (FAT16)/...

Oh yeah, even Linux mkdosfs does have a -c option to check for bad sectors and presumably will remove them from use. It doesn't accept a separate list though, like badblocks + mke2fs.

-- Chris Murphy
Re: RAID1 & BTRFS critical (device sda2): corrupt leaf, bad key order
ead_io_errs0 > [/dev/sdb2].flush_io_errs 0 > [/dev/sdb2].corruption_errs 0 > [/dev/sdb2].generation_errs 0 > > device stats report no errors :( > > # btrfs fi df / > Data, RAID1: total=2.32TiB, used=2.23TiB > System, RAID1: total=96.00MiB, used=368.00KiB > Metadata, RAID1: total=22.00GiB, used=19.12GiB > GlobalReserve, single: total=512.00MiB, used=0.00B > > # btrfs scrub status / > scrub status for 4917db5e-fc20-4369-9556-83082a32d4cd > scrub started at Mon Sep 3 05:32:52 2018, interrupted after > 00:27:35, not running > total bytes scrubbed: 514.05GiB with 0 errors > > I've already tried 2 times to run btrfs scrub (after reboot), but it > stops before the end, with the previous dmesg error > > My question is what is the safest way to rebuild this BTRFS RAID1? > I haven't tried "btrfs check --repair" yet > (I can boot on a more up to date Linux live if it helps) Definitely do not run btrfs check --repair, that's the nearly last resort. It's vaguely possible this is a bug that's been fixed in a newer kernel version, so it's worth giving 4.17.x or 4.18.x a shot at it. That is at least safe. But I'm suspicious of "BTRFS: error (device sda2) in btrfs_run_delayed_refs:2930: errno=-5 IO failure" which is usually a hardware error. But I don't see any hardware related message in the dmesg snippet provided so you'd need to go through the whole thing looking for suspicious items why there was an IO failure. It's clear Btrfs did receive all or part of the leaf, determined it's corrupt, and the actual mystery is if that double message is for both drives even though only sda2 is named both times (the first two lines of your dmesg). There are some kinds of memory related corruption that newer versions of btrfs-progs can fix. I'm not sure if 4.4 is new enough, or if the particular corruption you're seeing is something btrfs check can fix, but I still wouldn't use --repair until Qu or another dev says to give it a shot. -- Chris Murphy
Re: IO errors when building RAID1.... ?
On Sat, Sep 1, 2018 at 1:03 AM, Pierre Couderc wrote: > > > On 08/31/2018 08:52 PM, Chris Murphy wrote: >> >> >> Bad sector which is failing write. This is fatal, there isn't anything >> the block layer or Btrfs (or ext4 or XFS) can do about it. Well, >> ext234 do have an option to scan for bad sectors and create a bad >> sector map which then can be used at mkfs time, and ext234 will avoid >> using those sectors. And also the md driver has a bad sector option >> for the same, and does remapping. But XFS and Btrfs don't do that. >> >> If the drive is under warranty, get it swapped out, this is definitely >> a warranty covered problem. >> >> >> >> > Thank you very much. > > Once upon a time...(I am old), there were lists of bad sectors, and the > software did avoid wrting in them. It seems to have disappeared. For which > reason ? Maybe because these errors occur so rarely, that it is not worth > the trouble ? For > 10 years drive firmware handles bad sector remapping internally. It remaps the sector logical address to a reserve physical sector. NTFS and ext[234] have a means of accepting a list of bad sectors, and will avoid using them. Btrfs doesn't. But also ZFS, XFS, APFS, HFS+ and I think even FAT, lack this capability. I'm not aware of any file system that once had bad sector tracking, that has since dropped the capability. -- Chris Murphy
Re: IO errors when building RAID1.... ?
If you want you can post the output from 'sudo smartctl -x /dev/sda' which will contain more information... but this is in some sense superfluous. The problem is very clearly a bad drive: the drive explicitly reported a write error to libata, and included the sector LBA affected, and only the drive firmware would know that. It's not likely a cable problem or something like that. And that the write error is reported at all means it's persistent, not transient.

-- Chris Murphy
Re: IO errors when building RAID1.... ?
kernel: BTRFS error (device sda1): bdev /dev/sda1 > errs: wr 0, rd 3, flush 0, corrupt 0, gen 0 > Aug 31 17:36:38 server kernel: sd 0:0:0:0: rejecting I/O to offline device > Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] killing request > Aug 31 17:36:38 server kernel: sd 0:0:0:0: rejecting I/O to offline device > Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1 > errs: wr 1, rd 3, flush 0, corrupt 0, gen 0 > Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1 > errs: wr 2, rd 3, flush 0, corrupt 0, gen 0 > Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1 > errs: wr 3, rd 3, flush 0, corrupt 0, gen 0 > Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1 > errs: wr 4, rd 3, flush 0, corrupt 0, gen 0 > Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1 > errs: wr 5, rd 3, flush 0, corrupt 0, gen 0 > Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1 > errs: wr 6, rd 3, flush 0, corrupt 0, gen 0 > Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1 > errs: wr 7, rd 3, flush 0, corrupt 0, gen 0 > Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] FAILED Result: > hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK > Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 00 61 > 9c 00 00 0a 00 00 > Aug 31 17:36:38 server kernel: blk_update_request: I/O error, dev sda, > sector 6396928 > Aug 31 17:36:38 server kernel: sd 0:0:0:0: rejecting I/O to offline device > Aug 31 17:36:38 server kernel: sd 0:0:0:0: rejecting I/O to offline device > > more than 100 identical lines... 
> > Aug 31 17:36:38 server kernel: sd 0:0:0:0: rejecting I/O to offline device > Aug 31 17:36:38 server kernel: ata1: EH complete > Aug 31 17:36:38 server kernel: ata1.00: detaching (SCSI 0:0:0:0) > Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache > Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] Synchronize Cache(10) > failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] Stopping disk > Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] Start/Stop Unit failed: > Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > Aug 31 17:36:38 server kernel: Buffer I/O error on dev sda1, logical block > 488378352, async page read > Aug 31 17:36:38 server kernel: scsi 0:0:0:0: rejecting I/O to dead device > Aug 31 17:36:38 server kernel: blk_update_request: I/O error, dev sda, > sector 6762624 > Aug 31 17:36:38 server kernel: BTRFS: error (device sda1) in > btrfs_commit_transaction:2227: errno=-5 IO failure (Error while writing out > transaction) > Aug 31 17:36:38 server kernel: BTRFS info (device sda1): forced readonly > Aug 31 17:36:38 server kernel: BTRFS warning (device sda1): Skipping commit > of aborted transaction. > Aug 31 17:36:38 server kernel: [ cut here ] > Aug 31 17:36:38 server kernel: WARNING: CPU: 1 PID: 159 at > /build/linux-cRtIym/linux-4.9.30/fs/btrfs/transaction.c:1850 > cleanup_transaction+0x1f0/0x2e0 [btrfs] > Aug 31 17:36:38 server kernel: BTRFS: Transaction aborted (error -5) > Aug 31 17:36:38 server kernel: Modules linked in: intel_rapl > x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass eeepc_wmi > asus_wmi crct10dif_pclmul sparse_keymap crc32_pclmul g > > These are hardware problems and aren't related to Btrfs. 
>sd 0:0:0:0: [sda] FAILED Result: > hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK > Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 00 61 > 9c 00 00 0a 00 00 > Aug 31 17:36:38 server kernel: blk_update_request: I/O error, dev sda, > sector 6396928 Bad sector which is failing write. This is fatal, there isn't anything the block layer or Btrfs (or ext4 or XFS) can do about it. Well, ext234 do have an option to scan for bad sectors and create a bad sector map which then can be used at mkfs time, and ext234 will avoid using those sectors. And also the md driver has a bad sector option for the same, and does remapping. But XFS and Btrfs don't do that. If the drive is under warranty, get it swapped out, this is definitely a warranty covered problem. -- Chris Murphy
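For reference, the ext2/3/4 flow mentioned above looks roughly like this. It's sketched against a scratch image file rather than a real device (which is also why badblocks finds nothing here), and assumes e2fsprogs is installed:

```shell
img=$(mktemp)
truncate -s 16M "$img"

# Read-only scan for bad blocks, saving the list to a file:
badblocks -o "$img.bad" "$img"

# mke2fs takes the list with -l and records it in the bad block inode,
# so those blocks are never handed out to files:
mkfs.ext4 -q -F -l "$img.bad" "$img"

# The recorded list can be inspected later with: dumpe2fs -b <device>
rm -f "$img" "$img.bad"
```

On a real drive you'd run badblocks against the block device before mkfs; but as noted above, modern firmware remapping makes this largely a historical workflow.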
bug: btrfs-progs scrub -R flag doesn't show per device stats
btrfs-progs v4.17.1

man btrfs-scrub:
-R print raw statistics per-device instead of a summary

However, on a two device Btrfs volume, -R does not show per device statistics. See screenshot: https://drive.google.com/open?id=1xmt_NHGlNJPc8I0F4_OZxgGe9b3quCnD

Additionally, the description of -d and -R doesn't help me distinguish between the two. -R says "instead of a summary", which suggests -d will summarize, but that isn't explicitly stated.

-- Chris Murphy
Re: How to erase a RAID1 (+++)?
And also, I'll argue this might have been a btrfs-progs bug as well, depending on what version was used and the command. Neither mkfs nor dev add should be able to use a type code 0x05 partition. At least libblkid correctly shows that it's 1KiB in size, so really Btrfs should not succeed at adding this device; it can't put any of the supers in the correct location.

[chris@f28h ~]$ sudo fdisk -l /dev/loop0
Disk /dev/loop0: 1 GiB, 1073741824 bytes, 2097152 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x7e255cce

Device       Boot  Start    End Sectors Size Id Type
/dev/loop0p1        2048 206847  204800 100M 83 Linux
/dev/loop0p2      206848 411647  204800 100M 83 Linux
/dev/loop0p3      411648 616447  204800 100M 83 Linux
/dev/loop0p4      616448 821247  204800 100M  5 Extended
/dev/loop0p5      618496 821247  202752  99M 83 Linux

[chris@f28h ~]$ sudo kpartx -a /dev/loop0
[chris@f28h ~]$ lsblk
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1G  0 loop
├─loop0p1 253:1    0  100M  0 part
├─loop0p2 253:2    0  100M  0 part
├─loop0p3 253:3    0  100M  0 part
├─loop0p4 253:4    0    1K  0 part
└─loop0p5 253:5    0   99M  0 part

[chris@f28h ~]$ sudo mkfs.btrfs /dev/loop0p4
btrfs-progs v4.17.1
See http://btrfs.wiki.kernel.org for more information.

probe of /dev/loop0p4 failed, cannot detect existing filesystem.
ERROR: use the -f option to force overwrite of /dev/loop0p4
[chris@f28h ~]$ sudo mkfs.btrfs /dev/loop0p4 -f
btrfs-progs v4.17.1
See http://btrfs.wiki.kernel.org for more information.

ERROR: mount check: cannot open /dev/loop0p4: No such file or directory
ERROR: cannot check mount status of /dev/loop0p4: No such file or directory
[chris@f28h ~]$

I guess that's a good sign in this case?

Chris Murphy
Re: How to erase a RAID1 (+++)?
On Thu, Aug 30, 2018 at 9:21 AM, Alberto Bursi wrote: > > On 8/30/2018 11:13 AM, Pierre Couderc wrote: >> Trying to install a RAID1 on a debian stretch, I made some mistake and >> got this, after installing on disk1 and trying to add second disk : >> >> >> root@server:~# fdisk -l >> Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors >> Units: sectors of 1 * 512 = 512 bytes >> Sector size (logical/physical): 512 bytes / 512 bytes >> I/O size (minimum/optimal): 512 bytes / 512 bytes >> Disklabel type: dos >> Disk identifier: 0x2a799300 >> >> Device Boot StartEndSectors Size Id Type >> /dev/sda1 * 2048 3907028991 3907026944 1.8T 83 Linux >> >> >> Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors >> Units: sectors of 1 * 512 = 512 bytes >> Sector size (logical/physical): 512 bytes / 512 bytes >> I/O size (minimum/optimal): 512 bytes / 512 bytes >> Disklabel type: dos >> Disk identifier: 0x9770f6fa >> >> Device Boot StartEndSectors Size Id Type >> /dev/sdb1 * 2048 3907029167 3907027120 1.8T 5 Extended >> >> >> And : >> >> root@server:~# btrfs fi show >> Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed >> Total devices 2 FS bytes used 1.10GiB >> devid1 size 1.82TiB used 4.02GiB path /dev/sda1 >> devid2 size 1.00KiB used 0.00B path /dev/sdb1 >> >> ... >> >> My purpose is a simple RAID1 main fs, with bootable flag on the 2 >> disks in prder to start in degraded mode >> How to get out ofr that...? >> >> Thnaks >> PC > > > sdb1 is an extended partition, you cannot format an extended partition. > > change sdb1 into primary partition or add a logical partition into it. Ahh you're correct. There is special treatment of 0x05, it's a logical container with the start address actually pointing to the address where the EBR is. And that EBR's first record contains the actual real extended partition information. So this represents two bugs in the installer: 1. If there's only one partition on a drive, it should be primary by default, not extended. 2. 
But if extended, it must point to an EBR, and the EBR must be created at that location. Obviously since there is no /dev/sdb2, this EBR is not present. -- Chris Murphy
Re: How to erase a RAID1 (+++)?
On Thu, Aug 30, 2018 at 3:13 AM, Pierre Couderc wrote:
> Trying to install a RAID1 on a debian stretch, I made some mistake and got
> this, after installing on disk1 and trying to add second disk :
>
> root@server:~# fdisk -l
> Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disklabel type: dos
> Disk identifier: 0x2a799300
>
> Device     Boot Start        End    Sectors Size Id Type
> /dev/sda1  *     2048 3907028991 3907026944 1.8T 83 Linux
>
> Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disklabel type: dos
> Disk identifier: 0x9770f6fa
>
> Device     Boot Start        End    Sectors Size Id Type
> /dev/sdb1  *     2048 3907029167 3907027120 1.8T  5 Extended

Extended partition type is not a problem if you're using GRUB as the bootloader; other bootloaders may not like this. Strictly speaking the type code 0x05 is incorrect, but GRUB ignores the type code, as does the kernel. GRUB also ignores the active bit (boot flag).

> And :
>
> root@server:~# btrfs fi show
> Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
>     Total devices 2 FS bytes used 1.10GiB
>     devid 1 size 1.82TiB used 4.02GiB path /dev/sda1
>     devid 2 size 1.00KiB used 0.00B path /dev/sdb1

That's odd; and I know you've moved on from this problem, but I would have liked to see the super for /dev/sdb1, and also the installer log for what commands were used for partitioning, including the mkfs and device add commands. For what it's worth, 'btrfs dev add' formats the device being added, it does not need to be formatted in advance, and it also resizes the file system properly.

> My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks in
> prder to start in degraded mode

Good luck with this.
The Btrfs archives are full of various limitations of Btrfs raid1. There is no automatic degraded mount for Btrfs. And if you persistently ask for degraded mount, you run the risk of other problems if there's merely a delayed discovery of one of the devices. Once a Btrfs volume is degraded, it does not automatically resume normal operation just because the formerly missing device becomes available. So... this is flat out not suitable for use cases where you need unattended raid1 degraded boot. -- Chris Murphy
Re: 14Gb of space lost after distro upgrade on BTFS root partition (long thread with logs)
On Tue, Aug 28, 2018 at 1:14 PM, Menion wrote: > You are correct, indeed in order to cleanup you need > > 1) someone realize that snapshots have been created > 2) apt-brtfs-snapshot is manually installed on the system > > Assuming also that the snapshots created during do-release-upgrade are > managed for auto cleanup Ha! I should have read all the emails. Anyway, good sleuthing. I think it's a good idea to file a bug report on it, so at the least other people can fix it manually. -- Chris Murphy
Re: 14Gb of space lost after distro upgrade on BTFS root partition (long thread with logs)
On Tue, Aug 28, 2018 at 8:56 AM, Menion wrote: > [sudo] password for menion: > ID gen top level path > -- --- - > 257 600627 5 /@ > 258 600626 5 /@home > 296 599489 5 > /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:29:55 > 297 599489 5 > /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:30:08 > 298 599489 5 > /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:33:30 > > So, there are snapshots, right? Yep. So you can use 'sudo btrfs fi du -s ' to get a report on how much exclusive space is being used by each of those snapshots and I'll bet it all adds up to about 10G or whatever you're missing. >The time stamp is when I have launched > do-release-upgrade, but it didn't ask anything about snapshot, neither > I asked for it. Yep, not sure what's creating them or what the cleanup policy is (if there is one). So it's worth asking in an Ubuntu forum what these snapshots are where they came from and what cleans them up so you don't run out of space, or otherwise how to configure it if you want more space just because. I mean, it's a neat idea. But also it needs to clean up after itself if for no other reason than to avoid user confusion :-) > If it is confirmed, how can I remove the unwanted snapshot, keeping > the current "visible" filesystem contents > Sorry, I am still learning BTRFS and I would like to avoid mistakes > Bye You can definitely use Btrfs specific tools to get rid of the snapshots and not piss off Btrfs at all. However, if you delete them behind the back of the thing that created them in the first place, it might get pissed off if they just suddenly go missing. Sometimes those tools want to do the cleanups because it's tracking the snapshots and what their purpose is. So if they just go away, it's like having the rug pulled out from under them. Anyway: 'sudo btrfs sub del ' will delete it. Also, I can't tell you for sure what sort of write amplification Btrfs contributes in your use case on eMMC compared to F2FS. 
Btrfs has a "wandering trees" problem that F2FS suffers from to a much smaller degree. It's not a big deal (probably) on other kinds of SSDs like SATA/SAS and NVMe. But on eMMC? If it were an SD Card I'd say you can keep using Btrfs, and maybe mitigate the wandering trees with compression to reduce overall writes. But if your eMMC is soldered onto a board, I might consider F2FS instead. And Btrfs for other things.

-- Chris Murphy
Re: DRDY errors are not consistent with scrub results
On Tue, Aug 28, 2018 at 5:04 PM, Cerem Cem ASLAN wrote: > What I want to achive is that I want to add the problematic disk as > raid1 and see how/when it fails and how BTRFS recovers these fails. > While the party goes on, the main system shouldn't be interrupted > since this is a production system. For example, I would never expect > to be ended up with such a readonly state while trying to add a disk > with "unknown health" to the system. Was it somewhat expected? I don't know. I also can't tell you how LVM or mdraid behave in the same situation either though. For sure I've come across bug reports where underlying devices go read only and the file system falls over totally and developers shrug and say they can't do anything. This situation is a little different and difficult. You're starting out with a one drive setup so the profile is single/DUP or single/single, and that doesn't change when adding. So the 2nd drive is actually *mandatory* for a brief period of time before you've made it raid1 or higher. It's a developer question what is the design, and if this is a bug: maybe the device being added should be written to with placeholder supers or even just zeros in all the places for 'dev add' metadata, and only if that succeeds, to then write real updated supers to all devices. It's possible the 'dev add' presently writes updated supers to all devices at the same time, and has a brief period where the state is fragile and if it fails, it goes read only to prevent damaging the file system. Anyway, without a call trace, no idea why it ended up read only. So I have to speculate. > > Although we know that disk is about to fail, it still survives. That's very tenuous rationalization, a drive that rejects even a single write is considered failed by the md driver. 
Btrfs is still very tolerant of this, so if it had successfully added and you were running in production, you should expect to see thousands of write errors dumped to the kernel log, because Btrfs still never ejects a bad drive. It keeps trying. And keeps reporting the failures. And all those errors being logged can end up causing more write demand if the logs are on the same volume as the failing device, even more errors to record, and you get an escalating situation with heavy log writing.

> Shouldn't we expect in such a scenario that when system tries to read
> or write some data from/to that BROKEN_DISK and when it recognizes it
> failed, it will try to recover the part of the data from GOOD_DISK and
> try to store that recovered data in some other part of the
> BROKEN_DISK?

Nope. Btrfs can only write supers to fixed locations on the drive, same as any other file system. Btrfs metadata could possibly go elsewhere because it doesn't have fixed locations, but Btrfs doesn't do bad sector tracking. So once it decides metadata goes in location X, if X reports a write error it will not try to write elsewhere, and insofar as I'm aware ext4 and XFS and LVM don't either; md does have an optional bad block map it will use for tracking bad sectors and remapping to known good sectors. Normally the drive firmware should do this, and when that fails the drive is considered toast for production purposes.

> Or did I misunderstood the whole thing?

Well in a way this is sorta user sabotage. It's a valid test, and I'd say ideally things should fail safely rather than fall over. But at the same time it's not wrong for developers to say: "look, if you add a bad device there's a decent chance we're going to face plant and go read only to avoid causing worse problems, so next time you should qualify the drive before putting it into production."
I'm willing to bet all the other file system devs would say something like that. Even if Btrfs devs think something better could happen, it's probably not a super high priority. -- Chris Murphy
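Since Btrfs keeps counting errors rather than ejecting the failing device, the per-device error counters are the thing to watch. A minimal monitoring sketch; the device name and counter values in the sample are hypothetical, and in practice you would pipe in live `btrfs device stats` output instead:

```shell
#!/bin/sh
# Flag nonzero error counters in `btrfs device stats` output.
# Reads the stats text on stdin so it can be demonstrated without a
# real btrfs mount; in practice:  btrfs device stats /mnt | check_stats
check_stats() {
    awk '
        /_errs/ {
            if ($NF + 0 > 0) { print "ERROR:", $0; bad = 1 }
        }
        END { exit bad }
    '
}

# Hypothetical sample output from a degrading device:
sample='[/dev/sdb1].write_io_errs   12
[/dev/sdb1].read_io_errs    0
[/dev/sdb1].flush_io_errs   0
[/dev/sdb1].corruption_errs 0
[/dev/sdb1].generation_errs 0'

printf '%s\n' "$sample" | check_stats || echo "device has logged errors"
```

The function exits nonzero if any counter is above zero, so it drops straight into a cron job or alerting hook.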
Re: DRDY errors are not consistent with scrub results
On Tue, Aug 28, 2018 at 12:50 PM, Cerem Cem ASLAN wrote: > I've successfully moved everything to another disk. (The only hard > part was configuring the kernel parameters, as my root partition was > on LVM which is on LUKS partition. Here are the notes, if anyone > needs: > https://github.com/ceremcem/smith-sync/blob/master/create-bootable-backup.md) > > Now I'm seekin for trouble :) I tried to convert my new system (booted > with new disk) into raid1 coupled with the problematic old disk. To do > so, I issued: > > sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ > /dev/mapper/master-root appears to contain an existing filesystem (btrfs). > ERROR: use the -f option to force overwrite of /dev/mapper/master-root > aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ -f > ERROR: error adding device '/dev/mapper/master-root': Input/output error > aea@aea3:/mnt$ sudo btrfs device add /dev/mapper/master-root /mnt/peynir/ > sudo: unable to open /var/lib/sudo/ts/aea: Read-only file system > > Now I ended up with a readonly file system. Isn't it possible to add a > device to a running system? Yes. The problem is the 2nd error message: ERROR: error adding device '/dev/mapper/master-root': Input/output error So you need to look in dmesg to see what Btrfs kernel messages occurred at that time. I'm gonna guess it's a failed write. You have a few of those in the smartctl log output. Any time a write failure happens, the operation is always fatal regardless of the file system. -- Chris Murphy
Re: Scrub aborts due to corrupt leaf
On Tue, Aug 28, 2018 at 7:42 AM, Qu Wenruo wrote: > > > On 2018/8/28 下午9:29, Larkin Lowrey wrote: >> On 8/27/2018 10:12 PM, Larkin Lowrey wrote: >>> On 8/27/2018 12:46 AM, Qu Wenruo wrote: >>>> >>>>> The system uses ECC memory and edac-util has not reported any errors. >>>>> However, I will run a memtest anyway. >>>> So it should not be the memory problem. >>>> >>>> BTW, what's the current generation of the fs? >>>> >>>> # btrfs inspect dump-super | grep generation >>>> >>>> The corrupted leaf has generation 2862, I'm not sure how recent did the >>>> corruption happen. >>> >>> generation 358392 >>> chunk_root_generation 357256 >>> cache_generation 358392 >>> uuid_tree_generation 358392 >>> dev_item.generation 0 >>> >>> I don't recall the last time I ran a scrub but I doubt it has been >>> more than a year. >>> >>> I am running 'btrfs check --init-csum-tree' now. Hopefully that clears >>> everything up. >> >> No such luck: >> >> Creating a new CRC tree >> Checking filesystem on /dev/Cached/Backups >> UUID: acff5096-1128-4b24-a15e-4ba04261edc3 >> Reinitialize checksum tree >> csum result is 0 for block 2412149436416 >> extent-tree.c:2764: alloc_tree_block: BUG_ON `ret` triggered, value -28 > > It's ENOSPC, meaning btrfs can't find enough space for the new csum tree > blocks. Seems bogus, there's >4TiB unallocated. >Label: none uuid: acff5096-1128-4b24-a15e-4ba04261edc3 >Total devices 1 FS bytes used 66.61TiB >devid 1 size 72.77TiB used 68.03TiB path /dev/mapper/Cached-Backups > >Data, single: total=67.80TiB, used=66.52TiB >System, DUP: total=40.00MiB, used=7.41MiB >Metadata, DUP: total=98.50GiB, used=95.21GiB >GlobalReserve, single: total=512.00MiB, used=0.00B Even if all metadata is only csum tree, and ~200GiB needs to be written, there's plenty of free space for it. -- Chris Murphy
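As a quick sanity check of that claim, the unallocated space per device can be computed from the `btrfs fi show` numbers. A rough sketch parsing the output quoted in this thread; it naively assumes size and used are reported in the same unit (TiB here):

```shell
#!/bin/sh
# Print per-device unallocated space (size minus chunk-allocated "used")
# from `btrfs fi show` text on stdin.  Normally:
#   btrfs fi show /mnt | unallocated
unallocated() {
    awk '
        $1 == "devid" {
            for (i = 1; i <= NF; i++) {
                if ($i == "size") size = $(i + 1) + 0   # "+ 0" strips the TiB suffix
                if ($i == "used") used = $(i + 1) + 0
            }
            printf "%s: %.2f TiB unallocated\n", $NF, size - used
        }
    '
}

# The figures quoted in this thread:
sample='Label: none  uuid: acff5096-1128-4b24-a15e-4ba04261edc3
	Total devices 1 FS bytes used 66.61TiB
	devid 1 size 72.77TiB used 68.03TiB path /dev/mapper/Cached-Backups'

printf '%s\n' "$sample" | unallocated
```

Which reports about 4.74 TiB unallocated, consistent with the ">4TiB unallocated" observation, so the -28 (ENOSPC) from check does look bogus.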
Re: 14Gb of space lost after distro upgrade on BTFS root partition (long thread with logs)
On Tue, Aug 28, 2018 at 3:34 AM, Menion wrote: > Hi all > I have run a distro upgrade on my Ubuntu 16.04 that runs ppa kernel > 4.17.2 with btrfsprogs 4.17.0 > The root filesystem is BTRFS single created by the Ubuntu Xenial > installer (so on kernel 4.4.0) on an internal mmc, located in > /dev/mmcblk0p3 > After the upgrade I have cleaned apt cache and checked the free space, > the results were odd, following some checks (shrinked), followed by > more comments: Do you know if you're using Timeshift? I'm not sure if it's enabled by default on Ubuntu when using Btrfs, but you may have snapshots. 'sudo btrfs sub list -at /' That should show all subvolumes (includes snapshots). > [48479.254106] BTRFS info (device mmcblk0p3): 17 enospc errors during balance Probably soft enospc errors it was able to work around. -- Chris Murphy
Re: Scrub aborts due to corrupt leaf
On Mon, Aug 27, 2018 at 8:12 PM, Larkin Lowrey wrote: > On 8/27/2018 12:46 AM, Qu Wenruo wrote: >> >> >>> The system uses ECC memory and edac-util has not reported any errors. >>> However, I will run a memtest anyway. >> >> So it should not be the memory problem. >> >> BTW, what's the current generation of the fs? >> >> # btrfs inspect dump-super | grep generation >> >> The corrupted leaf has generation 2862, I'm not sure how recent did the >> corruption happen. > > > generation 358392 > chunk_root_generation 357256 > cache_generation 358392 > uuid_tree_generation 358392 > dev_item.generation 0 > > I don't recall the last time I ran a scrub but I doubt it has been more than > a year. > > I am running 'btrfs check --init-csum-tree' now. Hopefully that clears > everything up. I'd expect --init-csum-tree only recreates the data csum tree, and will not assume a metadata leaf is correct and just recompute a csum for it. -- Chris Murphy
Re: DRDY errors are not consistent with scrub results
On Mon, Aug 27, 2018 at 6:49 PM, Cerem Cem ASLAN wrote: > Thanks for your guidance, I'll get the device replaced first thing in > the morning. > > Here is balance results which I think resulted not too bad: > > sudo btrfs balance start /mnt/peynir/ > WARNING: > > Full balance without filters requested. This operation is very > intense and takes potentially very long. It is recommended to > use the balance filters to narrow down the balanced data. > Use 'btrfs balance start --full-balance' option to skip this > warning. The operation will start in 10 seconds. > Use Ctrl-C to stop it. > 10 9 8 7 6 5 4 3 2 1 > Starting balance without any filters. > Done, had to relocate 18 out of 18 chunks > > I suppose this means I've not lost any data, but I'm very prone to due > to previous `smartctl ...` results. OK so nothing fatal anyway. We'd have to see any kernel messages that appeared during the balance to see if there were read or write errors, but presumably any failure means the balance fails so... might get you by for a while actually. -- Chris Murphy
Re: DRDY errors are not consistent with scrub results
On Mon, Aug 27, 2018 at 6:38 PM, Chris Murphy wrote: >> Metadata,single: Size:8.00MiB, Used:0.00B >> /dev/mapper/master-root 8.00MiB >> >> Metadata,DUP: Size:2.00GiB, Used:562.08MiB >> /dev/mapper/master-root 4.00GiB >> >> System,single: Size:4.00MiB, Used:0.00B >> /dev/mapper/master-root 4.00MiB >> >> System,DUP: Size:32.00MiB, Used:16.00KiB >> /dev/mapper/master-root 64.00MiB >> >> Unallocated: >> /dev/mapper/master-root 915.24GiB > > > OK this looks like it maybe was created a while ago, it has these > empty single chunk items that was common a while back. There is a low > risk to clean it up, but I still advise backup first: > > 'btrfs balance start -mconvert=dup ' You can skip this advice now, it really doesn't matter. But future Btrfs shouldn't have both single and DUP chunks like this one is showing, if you're using relatively recent btrfs-progs to create the file system. -- Chris Murphy
Re: DRDY errors are not consistent with scrub results
On Mon, Aug 27, 2018 at 6:05 PM, Cerem Cem ASLAN wrote: > Note that I've directly received this reply, not by mail list. I'm not > sure this is intended or not. I intended to do Reply to All but somehow this doesn't always work out between the user and Gmail, I'm just gonna assume gmail is being an asshole again. > Chris Murphy , 28 Ağu 2018 Sal, 02:25 > tarihinde şunu yazdı: >> >> On Mon, Aug 27, 2018 at 4:51 PM, Cerem Cem ASLAN >> wrote: >> > Hi, >> > >> > I'm getting DRDY ERR messages which causes system crash on the server: >> > >> > # tail -n 40 /var/log/kern.log.1 >> > Aug 24 21:04:55 aea3 kernel: [ 939.228059] lxc-bridge: port >> > 5(vethI7JDHN) entered disabled state >> > Aug 24 21:04:55 aea3 kernel: [ 939.300602] eth0: renamed from vethQ5Y2OF >> > Aug 24 21:04:55 aea3 kernel: [ 939.328245] IPv6: ADDRCONF(NETDEV_UP): >> > eth0: link is not ready >> > Aug 24 21:04:55 aea3 kernel: [ 939.328453] IPv6: >> > ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready >> > Aug 24 21:04:55 aea3 kernel: [ 939.328474] IPv6: >> > ADDRCONF(NETDEV_CHANGE): vethI7JDHN: link becomes ready >> > Aug 24 21:04:55 aea3 kernel: [ 939.328491] lxc-bridge: port >> > 5(vethI7JDHN) entered blocking state >> > Aug 24 21:04:55 aea3 kernel: [ 939.328493] lxc-bridge: port >> > 5(vethI7JDHN) entered forwarding state >> > Aug 24 21:04:59 aea3 kernel: [ 943.085647] cgroup: cgroup2: unknown >> > option "nsdelegate" >> > Aug 24 21:16:15 aea3 kernel: [ 1619.400016] perf: interrupt took too >> > long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to >> > 79750 >> > Aug 24 21:17:11 aea3 kernel: [ 1675.515815] perf: interrupt took too >> > long (3137 > 3132), lowering kernel.perf_event_max_sample_rate to >> > 63750 >> > Aug 24 21:17:13 aea3 kernel: [ 1677.080837] cgroup: cgroup2: unknown >> > option "nsdelegate" >> > Aug 25 22:38:31 aea3 kernel: [92955.512098] usb 4-2: USB disconnect, >> > device number 2 >> > Aug 26 02:14:21 aea3 kernel: [105906.035038] lxc-bridge: port >> > 4(vethCTKU4K) 
entered disabled state >> > Aug 26 02:15:30 aea3 kernel: [105974.107521] lxc-bridge: port >> > 4(vethO59BPD) entered disabled state >> > Aug 26 02:15:30 aea3 kernel: [105974.109991] device vethO59BPD left >> > promiscuous mode >> > Aug 26 02:15:30 aea3 kernel: [105974.109995] lxc-bridge: port >> > 4(vethO59BPD) entered disabled state >> > Aug 26 02:15:30 aea3 kernel: [105974.710490] lxc-bridge: port >> > 4(vethBAYODL) entered blocking state >> > Aug 26 02:15:30 aea3 kernel: [105974.710493] lxc-bridge: port >> > 4(vethBAYODL) entered disabled state >> > Aug 26 02:15:30 aea3 kernel: [105974.710545] device vethBAYODL entered >> > promiscuous mode >> > Aug 26 02:15:30 aea3 kernel: [105974.710598] IPv6: >> > ADDRCONF(NETDEV_UP): vethBAYODL: link is not ready >> > Aug 26 02:15:30 aea3 kernel: [105974.710600] lxc-bridge: port >> > 4(vethBAYODL) entered blocking state >> > Aug 26 02:15:30 aea3 kernel: [105974.710601] lxc-bridge: port >> > 4(vethBAYODL) entered forwarding state >> > Aug 26 02:16:35 aea3 kernel: [106039.674089] BTRFS: device fsid >> > 5b844c7a-0cbd-40a7-a8e3-6bc636aba033 devid 1 transid 984 /dev/dm-3 >> > Aug 26 02:17:21 aea3 kernel: [106085.352453] ata4.00: failed command: READ >> > DMA >> > Aug 26 02:17:21 aea3 kernel: [106085.352901] ata4.00: status: { DRDY ERR } >> > Aug 26 02:18:56 aea3 kernel: [106180.648062] ata4.00: exception Emask >> > 0x0 SAct 0x0 SErr 0x0 action 0x0 >> > Aug 26 02:18:56 aea3 kernel: [106180.648333] ata4.00: BMDMA stat 0x25 >> > Aug 26 02:18:56 aea3 kernel: [106180.648515] ata4.00: failed command: READ >> > DMA >> > Aug 26 02:18:56 aea3 kernel: [106180.648706] ata4.00: cmd >> > c8/00:08:80:9c:bb/00:00:00:00:00/e3 tag 0 dma 4096 in >> > Aug 26 02:18:56 aea3 kernel: [106180.648706] res >> > 51/40:00:80:9c:bb/00:00:00:00:00/03 Emask 0x9 (media error) >> > Aug 26 02:18:56 aea3 kernel: [106180.649380] ata4.00: status: { DRDY ERR } >> > Aug 26 02:18:56 aea3 kernel: [106180.649743] ata4.00: error: { UNC } >> >> Classic case of 
uncorrectable read error due to sector failure. >> >> >> >> > Aug 26 02:18:56 aea3 kernel: [106180.779311] ata4.00: configured for >> > UDMA/133 >> > Aug 26 02:18:56 aea3 kernel: [106180.779331] sd 3:0:0:0: [sda] tag#0 >> > FAILED Result: hostbyte=DID_OK driverbyte
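For what it's worth, the failing sector can be read straight out of those libata lines. The taskfile is printed as opcode/feature:count:lbal:lbam:lbah/hob-fields/device, and for a 28-bit command like READ DMA (0xc8) the LBA is the low nibble of the device byte followed by lbah:lbam:lbal; the res line reports the same fields for the sector that actually failed. A decoding sketch (assumes LBA28 only, and uses the taskfile string from the log above):

```shell
#!/bin/sh
# Decode the LBA from a libata "cmd aa/ff:nn:ll:mm:hh/..../dd" taskfile
# string, assuming a 28-bit (non-NCQ, non-LBA48) command.
lba28() {
    IFS='/:' read -r _op _feat _nsect lbal lbam lbah _h1 _h2 _h3 _h4 _h5 dev <<EOF
$1
EOF
    echo $(( ((0x$dev & 0x0f) << 24) | (0x$lbah << 16) | (0x$lbam << 8) | 0x$lbal ))
}

# The failed READ DMA quoted above:
lba28 "c8/00:08:80:9c:bb/00:00:00:00:00/e3"
```

With 512-byte sectors, that LBA could then be probed directly, e.g. `dd if=/dev/sda bs=512 skip=<lba> count=1 iflag=direct of=/dev/null`, or matched against the pending-sector entries in the smartctl error log.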
Re: Device Delete Stalls
And by 4.14 I actually mean 4.14.60 or 4.14.62 (based on the changelog). I don't think the single patch in 4.14.62 applies to your situation.
Re: Device Delete Stalls
On Thu, Aug 23, 2018 at 8:04 AM, Stefan Malte Schumacher wrote: > Hallo, > > I originally had RAID with six 4TB drives, which was more than 80 > percent full. So now I bought > a 10TB drive, added it to the Array and gave the command to remove the > oldest drive in the array. > > btrfs device delete /dev/sda /mnt/btrfs-raid > > I kept a terminal with "watch btrfs fi show" open and It showed that > the size of /dev/sda had been set to zero and that data was being > redistributed to the other drives. All seemed well, but now the > process stalls at 8GB being left on /dev/sda/. It also seems that the > size of the drive has been reset the original value of 3,64TiB. > > Label: none uuid: 1609e4e1-4037-4d31-bf12-f84a691db5d8 > Total devices 7 FS bytes used 8.07TiB > devid 1 size 3.64TiB used 8.00GiB path /dev/sda > devid 2 size 3.64TiB used 2.73TiB path /dev/sdc > devid 3 size 3.64TiB used 2.73TiB path /dev/sdd > devid 4 size 3.64TiB used 2.73TiB path /dev/sde > devid 5 size 3.64TiB used 2.73TiB path /dev/sdf > devid 6 size 3.64TiB used 2.73TiB path /dev/sdg > devid 7 size 9.10TiB used 2.50TiB path /dev/sdb > > I see no more btrfs worker processes and no more activity in iotop. > How do I proceed? I am using a current debian stretch which uses > Kernel 4.9.0-8 and btrfs-progs 4.7.3-1. > > How should I proceed? I have a Backup but would prefer an easier and > less time-comsuming way out of this mess. I'd let it keep running as long as you can tolerate it. In the meantime, update your backups, and keep using the file system normally, it should be safe to use. The block group migration can sometimes be slow with "btrfs dev del" compared to the replace operation, I can't explain why but it might be related to some combination of file and free space fragmentation as well as number of snapshots, and just general complexity of what is effectively a partial balance operation going on.
Next, you could do a sysrq + t, which dumps process state into the kernel message buffer; the buffer might not be big enough to contain the output. If you're using systemd, 'journalctl -k' will have it, and presumably syslog's messages will have it. I can't parse this output but a developer might find it useful to see what's going on and if it's just plain wrong. Or if it's just slow. Next, once you get sick of waiting, well you can force a reboot with 'reboot -f' or 'sysrq + b' but then what's the plan? Sure you could just try again but I don't know that this should give different results. It's either just slow, or it's a bug. And if it's a bug, maybe it's fixed in something newer, in which case I'd try a much newer kernel, 4.14 at the oldest, and ideally 4.18.4, at least to finish off this task. For what it's worth, the bulk of the delete operation is like a filtered balance, it's mainly relocating block groups, and that is supposed to be COW. So it should be safe to do an abrupt reboot. If you're not writing new information there's no information to lose; the worst case is Btrfs has a slightly older superblock than the latest generation for block group relocation and it starts from that point again. I've done quite a lot of jerkface reboot -f and sysrq + b with Btrfs and have never broken a file system so far (power failures, different story) but maybe I'm lucky and I have a bunch of well behaved devices. -- Chris Murphy
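To keep an eye on a slow delete without staring at `watch`, the remaining usage of the outgoing devid can be scraped from `btrfs fi show`. A sketch that reads the show text on stdin, so the (whitespace-normalized) sample from this thread can stand in for a live array:

```shell
#!/bin/sh
# Print the "used" figure for one devid from `btrfs fi show` output.
# Live polling would look something like (as root):
#   while sleep 300; do btrfs fi show /mnt/btrfs-raid | remaining 1; done
remaining() {
    awk -v id="$1" '$1 == "devid" && $2 == id {
        for (i = 1; i <= NF; i++) if ($i == "used") print $(i + 1)
    }'
}

# The stalled state quoted above: 8 GiB left on the outgoing devid 1.
sample='	devid 1 size 3.64TiB used 8.00GiB path /dev/sda
	devid 2 size 3.64TiB used 2.73TiB path /dev/sdc'

printf '%s\n' "$sample" | remaining 1
```

If that figure hasn't moved for hours and iotop shows nothing, that's a stronger hint it's a stall rather than just slow progress.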
Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota
On Fri, Aug 10, 2018 at 9:29 PM, Duncan <1i5t5.dun...@cox.net> wrote: > Chris Murphy posted on Fri, 10 Aug 2018 12:07:34 -0600 as excerpted: > >> But whether data is shared or exclusive seems potentially ephemeral, and >> not something a sysadmin should even be able to anticipate let alone >> individual users. > > Define "user(s)". The person who is saving their document on a network share, and they've never heard of Btrfs. > Arguably, in the context of btrfs tool usage, "user" /is/ the admin, I'm not talking about btrfs tools. I'm talking about rational, predictable behavior of a shared folder. If I try to drop a 1GiB file into my share and I'm denied, not enough free space, and behind the scenes it's because of a quota limit, I expect I can delete *any* file(s) amounting to 1GiB of free space and then I'll be able to drop that file successfully without error. But if I'm unwittingly deleting shared files, my quota usage won't go down, and I still can't save my file. So now I somehow need a secret incantation to discover only my exclusive files and delete enough of them in order to save this 1GiB file. It's weird, it's unexpected, I think it's a use case failure. Maybe Btrfs quotas aren't meant to work with Samba or NFS shares. *shrug* > > "Regular users" as you use the term, that is the non-admins who just need > to know how close they are to running out of their allotted storage > resources, shouldn't really need to care about btrfs tool usage in the > first place, and btrfs commands in general, including btrfs quota related > commands, really aren't targeted at them, and aren't designed to report > the type of information they are likely to find useful. Other tools will > be more appropriate. I'm not talking about any btrfs commands or even the term quota for regular users. I'm talking about saving a file, being denied, and how does the user figure out how to free up space? Anyway, it's a hypothetical scenario.
While I have Samba running on a Btrfs volume with various shares as subvolumes, I don't have quotas enabled. -- Chris Murphy
Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota
duplicated, is out of scope for the user. And we can't have quotas getting busted all of a sudden because the sysadmin decides to do -dconvert -mconvert raid1, without requiring the sysadmin to double everyone's quota before performing the operation. > >> >> >> In short: values representing quotas are user-oriented ("the numbers one >> bought"), not storage-oriented ("the numbers they actually occupy"). > > Well, if something is not possible or brings so big performance impact, > there will be no argument on how it should work in the first place. Yep! What is VFS disk quotas and does Btrfs use that at all? If not, why not? It seems to me there really should be a high level basic per directory quota implementation at the VFS layer, with a single kernel interface as well as a single user space interface, regardless of the file system. Additional file system specific quota features can of course have their own tools, but all of this re-invention of the wheel for basic directory quotas is a mystery to me. -- Chris Murphy
mount shows incorrect subvol when backed by bind mount
I've got another example of bind mounts resulting in confusing (incorrect) information in the mount command with Btrfs. In this case, it's Docker using bind mounts. Full raw version (expires in 7 days) https://paste.fedoraproject.org/paste/r8tr-3nuvoycwxf0bPUrmA/raw Relevant portion: mount shows: /dev/mmcblk0p3 on /var/lib/docker/containers type btrfs (rw,noatime,seclabel,compress-force=zstd,ssd,space_cache=v2,subvolid=265,subvol=/root/var/lib/docker/containers) /dev/mmcblk0p3 on /var/lib/docker/btrfs type btrfs (rw,noatime,seclabel,compress-force=zstd,ssd,space_cache=v2,subvolid=265,subvol=/root/var/lib/docker/btrfs) And from the detail fpaste, you can see there is no such subvolume docker/btrfs or docker/containers - and subvolid=265 is actually for rootfs. Anyway, mortals will be confused by this behavior. -- Chris Murphy
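One way to cut through this: findmnt reads /proc/self/mountinfo, which records the real root of each mount, so it reports the true backing subvolume even where `mount`'s output is misleading for bind mounts. On btrfs the SOURCE column shows the subvolume in brackets, e.g. `/dev/mmcblk0p3[/root]`. Demonstrated on / here, since the Docker paths above are machine-specific:

```shell
#!/bin/sh
# Show target, source (with [subvolume] suffix on btrfs), fs type and
# options for a mount point, straight from /proc/self/mountinfo.
findmnt -no TARGET,SOURCE,FSTYPE,OPTIONS /

# For a suspected bind mount, ask which mount actually contains a path:
findmnt -no SOURCE --target /var
```

Comparing the SOURCE (including the bracketed subvolume path) of the suspect mount against the parent mount makes the bind-mount relationship visible.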
Re: Unmountable root partition
On Tue, Jul 31, 2018 at 12:03 PM, Cerem Cem ASLAN wrote: > 3. mount -t btrfs /dev/mapper/foo--vg-root /mnt/foo > Gives the following error: > > mount: wrong fs type, bad option, bad superblock on ... > > 4. dmesg | tail > Outputs the following: > > > [17755.840916] sd 3:0:0:0: [sda] tag#0 FAILED Result: > hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > [17755.840919] sd 3:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 07 c0 02 00 00 > 02 00 > [17755.840921] blk_update_request: I/O error, dev sda, sector 507906 > [17755.840941] EXT4-fs (dm-4): unable to read superblock Are you sure this is the output for the command? Because you're explicitly asking for type btrfs, which fails, and then the kernel reports EXT4 superblock unreadable. What do you get if you omit -t btrfs and just let it autodetect? But yeah, this is an IO error from the device and there's nothing Btrfs can do about that unless there is DUP or raid1+ metadata available. Is it possible this LV was accidentally reformatted ext4? -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
ssd vs ssd_spread with sdcard or emmc
Hi, I'm not finding any recent advice for sdcard or eMMC media, both of which trigger the ssd mount option automatically. I seem to recall ssd has had some optimizations recently, but haven't heard much about ssd_spread. While sdcard and eMMC are rather different, it seems they have two things in common: they don't have the wear durability of even consumer SATA SSD let alone NVMe, and also they both suffer from dog slow writes. I'm unable to tell that ssd_spread performs any better write-wise on a Samsung EVO+ sdcard. So that leaves wear, and in particular whether wandering trees are at all affected by ssd_spread? -- Chris Murphy
Re: Healthy amount of free space?
Related on XFS list. https://www.spinics.net/lists/linux-xfs/msg20722.html
Re: Healthy amount of free space?
On Wed, Jul 18, 2018 at 12:01 PM, Austin S. Hemmelgarn wrote: > On 2018-07-18 13:40, Chris Murphy wrote: >> >> On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy >> wrote: >> >>> I don't know for sure, but based on the addresses reported before and >>> after dd for the fallocated tmp file, it looks like Btrfs is not using >>> the originally fallocated addresses for dd. So maybe it is COWing into >>> new blocks, but is just as quickly deallocating the fallocated blocks >>> as it goes, and hence doesn't end up in enospc? >> >> >> Previous thread is "Problem with file system" from August 2017. And >> there's these reproduce steps from Austin which have fallocate coming >> after the dd. >> >> truncate --size=4G ./test-fs >> mkfs.btrfs ./test-fs >> mkdir ./test >> mount -t auto ./test-fs ./test >> dd if=/dev/zero of=./test/test bs=65536 count=32768 >> fallocate -l 2147483650 ./test/test && echo "Success!" >> >> >> My test Btrfs is 2G not 4G, so I'm cutting the values of dd and >> fallocate in half. >> >> [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000 >> 1000+0 records in >> 1000+0 records out >> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s >> [chris@f28s btrfs]$ sync >> [chris@f28s btrfs]$ df -h >> FilesystemSize Used Avail Use% Mounted on >> /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs >> [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp >> >> >> Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over >> it, this fails, but I kinda expect that because there's only 1.1G free >> space. But maybe that's what you're saying is the bug, it shouldn't >> fail? > > Yes, you're right, I had things backwards (well, kind of, this does work on > ext4 and regular XFS, so it arguably should work here). I guess I'm confused what it even means to fallocate over a file with in-use blocks unless either -d or -p options are used. And from the man page, I don't grok the distinction between -d and -p either. 
But based on their descriptions I'd expect they both should work without ENOSPC. -- Chris Murphy
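My rough reading of the man page, sketched on an ordinary file: plain `fallocate -l` reserves blocks; `-d` scans for ranges that are already all zero and deallocates them (digging holes wherever it safely can); `-p` punches a hole over one explicit offset/length regardless of content. A quick demonstration of `-d`, assuming a filesystem with hole-punch support (ext4, XFS, Btrfs, tmpfs):

```shell
#!/bin/sh
# Write 1 MiB of real zeros, then let fallocate -d deallocate them.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=64K count=16 status=none
before=$(stat -c %b "$f")      # 512-byte blocks actually allocated
fallocate -d "$f"              # dig holes: deallocate the all-zero blocks
after=$(stat -c %b "$f")
echo "apparent size: $(stat -c %s "$f"), blocks: $before -> $after"
rm -f "$f"
```

The apparent size stays 1048576 throughout; only the allocated block count drops. That's why neither mode should need new space over an in-use file, and hitting ENOSPC there does look like a bug.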