Re: exclusive subvolume space missing
Tomasz Pala posted on Fri, 15 Dec 2017 09:22:14 +0100 as excerpted:

> I wonder how this one db-library behaves:
>
> $ find . -name \*.sqlite | xargs ls -gGhS | head -n1
> -rw-r--r-- 1 15M 2017-12-08 12:14 ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite
>
> $ ~/fiemap ./.mozilla/firefox/*.default/extension-data/ublock0.sqlite | head -n1
> File ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite has 128 extents:
>
> At least every $HOME/{.{,c}cache,tmp} should be +C...

Many admins will put tmp, and sometimes cache or selected parts of it, on
tmpfs anyway... thereby both automatically clearing it on reboot, and
allowing enforced size control as necessary.

>> And if possible, use nocow for this file.
>
> Actually, it should be officially advised to use +C for the entire /var
> tree and every other tree that might be exposed to hostile write
> patterns, like /home or /tmp (if held on btrfs).
>
> I'd say that from a security point of view nocow should be the default,
> unless specified for a mount or a specific file... Currently, if I mount
> with nocow, there is no way to whitelist trusted users or a secure
> location, and until btrfs-specific options can be handled per subvolume,
> there is really no alternative.

Nocow disables many of the reasons people run btrfs in the first place,
including checksumming and damage-detection, with auto-repair from other
copies where available (raid1/10 and dup modes primarily), as well as
btrfs transparent compression, for users using that.

Additionally, snapshotting, another feature people use btrfs for, turns
nocow into cow1 (cow the first time a block is written after a snapshot),
since snapshotting locks down the previous extent in order to maintain
the snapshotted reference.

And given that any user can create a snapshot any time they want (even if
you lock down the btrfs executable, if they're malevolent users and not
locked to only running specifically whitelisted executables, they can
always get a copy of the executable elsewhere), and /home or individual
user subvols may well be auto-snapshotted already, setting nocow isn't
likely to be of much security value at all.

So nocow is, as one regular wrote, most useful for "this really should go
on something other than btrfs, but I'm too lazy to set it up that way and
I'm already on btrfs, so the nocow band-aid is all I got. And yes, I try
using my screwdriver as a hammer too, because that's what I have there
too!"

In that sort of case, just use some other filesystem more appropriate to
the use-case, and you won't have to worry about btrfs issues,
cow-triggered or otherwise, in the first place.

--
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
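Duncan's tmpfs suggestion above boils down to a single fstab line; a minimal sketch (the mount point, size cap, and mode here are illustrative, not from the thread):

```
# hypothetical /etc/fstab entry: /tmp on tmpfs, cleared on every reboot,
# with an enforced size cap so runaway writers cannot fill RAM
tmpfs  /tmp  tmpfs  rw,nosuid,nodev,size=2g,mode=1777  0  0
```

The `size=` option is what provides the "enforced size control" mentioned above; without it tmpfs defaults to half of RAM.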
Re: exclusive subvolume space missing
On Tue, Dec 12, 2017 at 08:50:15 +0800, Qu Wenruo wrote:

> Even without snapshot, things can easily go crazy.
>
> This will write 128M file (max btrfs file extent size) and write it to disk.
> # xfs_io -f -c "pwrite 0 128M" -c "sync" /mnt/btrfs/file
>
> Then, overwrite the 1~128M range.
> # xfs_io -f -c "pwrite 1M 127M" -c "sync" /mnt/btrfs/file
>
> Guess your real disk usage, it's 127M + 128M = 255M.
>
> The point here, if there is any reference of a file extent, the whole
> extent won't be freed, even it's only 1M of a 128M extent.

OK, /this/ is scary. I guess nocow prevents this behaviour? I have +C
chattr'ed the file eating my space and it ceased.

> Are you pre-allocating the file before write using tools like dd?

I have no idea, this could be checked in the source of
http://pam-abl.sourceforge.net/

But this is plain Berkeley DB (5.3 in my case)... which scares me even more:

$ rpm -q --what-requires 'libdb-5.2.so()(64bit)' 'libdb-5.3.so()(64bit)' | wc -l
14

# ipoldek desc -B db5.3
Package:        db5.3-5.3.28.0-4.x86_64
Required(by):   apache1-base, apache1-mod_ssl, apr-util-dbm-db, bogofilter,
c-icap, c-icap-srv_url_check, courier-authlib, courier-authlib-authuserdb,
courier-imap, courier-imap-common, cyrus-imapd, cyrus-imapd-libs,
cyrus-sasl, cyrus-sasl-sasldb, db5.3-devel, db5.3-utils, dnshistory,
dsniff, evolution-data-server, evolution-data-server-libs, exim, gda-db,
ggz-server, heimdal-libs-common, hotkeys, inn, inn-libs, isync, jabberd,
jigdo, jigdo-gtk, jnettop, libetpan, libgda3, libgda3-devel, libhome,
libqxt, libsolv, lizardfs-master, maildrop, moc, mutt, netatalk,
nss_updatedb, ocaml-dbm, opensips, opensmtpd, pam-pam_abl, pam-pam_ccreds,
perl-BDB, perl-BerkeleyDB, perl-BerkeleyDB, perl-DB_File, perl-URPM,
perl-cyrus-imapd, php4-dba, php52-dba, php53-dba, php54-dba, php55-dba,
php56-dba, php70-dba, php70-dba, php71-dba, php71-dba, php72-dba,
php72-dba, postfix, python-bsddb, python-modules, python3-bsddb3, redland,
ruby-modules, sendmail, squid-session_acl, squid-time_quota_acl,
squidGuard, subversion-libs, swish-e, tomoe-svn, webalizer-base, wwwcount

OK, not many user applications here, as they mostly use sqlite. I wonder
how this one db-library behaves:

$ find . -name \*.sqlite | xargs ls -gGhS | head -n1
-rw-r--r-- 1 15M 2017-12-08 12:14 ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite

$ ~/fiemap ./.mozilla/firefox/*.default/extension-data/ublock0.sqlite | head -n1
File ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite has 128 extents:

At least every $HOME/{.{,c}cache,tmp} should be +C...

> And if possible, use nocow for this file.

Actually, it should be officially advised to use +C for the entire /var
tree and every other tree that might be exposed to hostile write patterns,
like /home or /tmp (if held on btrfs).

I'd say that from a security point of view nocow should be the default,
unless specified for a mount or a specific file... Currently, if I mount
with nocow, there is no way to whitelist trusted users or a secure
location, and until btrfs-specific options can be handled per subvolume,
there is really no alternative.

--
Tomasz Pala
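`~/fiemap` above is the poster's own helper tool. As a rough stand-in, here is a small shell sketch that summarizes a saved listing in the four-column format it prints (index, logical, physical, hex length after two header lines) — the function name and the assumption about the file layout are mine, not from the thread:

```shell
# summarize_fiemap FILE: count extents and compute the average extent size
# from a saved fiemap listing ("File ... has N extents:" line, a column
# header line, then "idx: logical physical length [flags]" rows)
summarize_fiemap() {
  tail -n +3 "$1" | while read -r idx logical physical length flags; do
    # convert the hex Length column to bytes; skip malformed rows
    [ -n "$length" ] && echo $((0x$length))
  done | awk '{ n++; t += $1 }
              END { printf "%d extents, %d bytes, avg %.0f bytes/extent\n",
                    n, t, t / n }'
}
```

Usage: `summarize_fiemap fiemap.txt`. A healthy file shows a few large extents; a trashed one, as later in this thread, shows hundreds of 4K entries.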
Re: exclusive subvolume space missing
On 2017年12月11日 19:40, Tomasz Pala wrote:
> On Mon, Dec 11, 2017 at 07:44:46 +0800, Qu Wenruo wrote:
>
>>> I could debug something before I'll clean this up, is there anything you
>>> want me to check/know about the files?
>>
>> fiemap result along with btrfs dump-tree -t2 result.
>
> fiemap attached, but dump-tree requires unmounted fs, doesn't it?

It doesn't. You can dump your tree with the fs mounted, although it may
affect the accuracy.

The good news is, in your case, it doesn't really need the extent tree,
as there is no shared extent here.

>>> - I've lost 3.6 GB during the night with reasonably small
>>> amount of writes, I guess it might be possible to trash entire
>>> filesystem within 10 minutes if doing this on purpose.
>>
>> That's a little complex.
>> To get into such situation, snapshot must be used and one must know
>> which file extent is shared and how it's shared.
>
> Hostile user might assume that any of his own files old enough were
> being snapshotted. Unless snapshots are not used at all...
>
> The 'obvious' solution would be for quotas to limit the data size including
> extents lost due to fragmentation, but this is not the real solution as
> users don't care about fragmentation. So we're back to square one.
>
>> But as I mentioned, XFS supports reflink, which means file extent can be
>> shared between several inodes.
>>
>> From the message I got from XFS guys, they free any unused space of a
>> file extent, so it should handle it quite well.
>
> Forgive my ignorance, as I'm not familiar with details, but isn't the
> problem 'solvable' by reusing space freed from the same extent for any
> single (i.e. the same) inode?

Not that easy. The extent tree design makes it a little tricky to do that.

So btrfs uses the current extent booking, the laziest way to delete
extents.

> This would certainly increase fragmentation of a file, but reduce extent
> usage significantly.
>
> Still, I don't comprehend the cause of my situation. If - after doing a
> defrag (after snapshotting whatever was already trashed) btrfs decides
> to allocate new extents for the file, why doesn't it use them
> efficiently as long as I'm not doing snapshots anymore?

Even without snapshot, things can easily go crazy.

This will write a 128M file (max btrfs file extent size) and write it to
disk.
# xfs_io -f -c "pwrite 0 128M" -c "sync" /mnt/btrfs/file

Then, overwrite the 1~128M range.
# xfs_io -f -c "pwrite 1M 127M" -c "sync" /mnt/btrfs/file

Guess your real disk usage, it's 127M + 128M = 255M.

The point here: if there is any reference to a file extent, the whole
extent won't be freed, even if it's only 1M of a 128M extent.

While defrag will basically read out the whole 128M file, and rewrite it.
Basically the same as:
# dd if=/mnt/btrfs/file of=/mnt/btrfs/file2
# rm /mnt/btrfs/file

In this case, it will cause a new 128M file extent, while the old
128M+127M extents lose all their references so they are freed.
As a result, it frees 127M.

> I'm attaching the second fiemap, the same file from the last snapshot
> taken. According to this one-liner:
>
> for i in `awk '{print $3}' fiemap`; do grep $i fiemap_old; done
>
> the current file doesn't share any physical locations with the old one.
> But still grows, so what does this situation have to do with snapshots
> anyway?

In your fiemap, all your file extents are exclusive, so not really
related to snapshots.

But the file is very fragmented. Most of the extents are 4K sized,
several 8K sized. And the final extent is 220K sized.

Are you pre-allocating the file before write using tools like dd?

If so, just as I explained above, it will at least *DOUBLE* on-disk space
usage, and cause tons of fragments.

It's recommended to use fallocate to prealloc the file instead of things
like dd. (A preallocated range acts much like nocow, although only for
the first write.)

And if possible, use nocow for this file.

> Oh, and BTW - 900+ extents for ~5 GB taken means there is about 5.5 MB
> occupied per extent. How is that possible?

Appending small writes with frequent fsync, or small random DIO.

Avoid such patterns, or at least use nocow.
Also avoid using dd to preallocate the file.

Another solution is autodefrag, but I doubt the effect.

Thanks,
Qu
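Qu's fallocate advice can be tried on any Linux filesystem; a minimal sketch (the file name is illustrative):

```shell
# Preallocate 1 MiB with fallocate rather than dd: the range is reserved
# without writing data blocks. On btrfs, per Qu, the first write into a
# preallocated range behaves much like nocow.
fallocate -l 1M prealloc.db
stat -c '%s' prealloc.db    # prints 1048576 (logical size in bytes)
```

A dd-based prealloc, by contrast, actually writes a full pass of data, which on btrfs creates extents that a later overwrite can only partially release.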
Re: exclusive subvolume space missing
On Mon, Dec 11, 2017 at 07:44:46 +0800, Qu Wenruo wrote:

>> I could debug something before I'll clean this up, is there anything you
>> want me to check/know about the files?
>
> fiemap result along with btrfs dump-tree -t2 result.

fiemap attached, but dump-tree requires unmounted fs, doesn't it?

>> - I've lost 3.6 GB during the night with reasonably small
>> amount of writes, I guess it might be possible to trash entire
>> filesystem within 10 minutes if doing this on purpose.
>
> That's a little complex.
> To get into such situation, snapshot must be used and one must know
> which file extent is shared and how it's shared.

Hostile user might assume that any of his own files old enough were
being snapshotted. Unless snapshots are not used at all...

The 'obvious' solution would be for quotas to limit the data size
including extents lost due to fragmentation, but this is not the real
solution as users don't care about fragmentation. So we're back to
square one.

> But as I mentioned, XFS supports reflink, which means file extent can be
> shared between several inodes.
>
> From the message I got from XFS guys, they free any unused space of a
> file extent, so it should handle it quite well.

Forgive my ignorance, as I'm not familiar with details, but isn't the
problem 'solvable' by reusing space freed from the same extent for any
single (i.e. the same) inode? This would certainly increase
fragmentation of a file, but reduce extent usage significantly.

Still, I don't comprehend the cause of my situation. If - after doing a
defrag (after snapshotting whatever was already trashed) btrfs decides
to allocate new extents for the file, why doesn't it use them
efficiently as long as I'm not doing snapshots anymore?

I'm attaching the second fiemap, the same file from the last snapshot
taken. According to this one-liner:

for i in `awk '{print $3}' fiemap`; do grep $i fiemap_old; done

the current file doesn't share any physical locations with the old one.
But still grows, so what does this situation have to do with snapshots
anyway?

Oh, and BTW - 900+ extents for ~5 GB taken means there is about 5.5 MB
occupied per extent. How is that possible?

--
Tomasz Pala

File log.14 has 933 extents:
#    Logical    Physical        Length  Flags
0:   00000000   00297a001000    1000
1:   00001000   00297aa01000    1000
2:   00002000   002979ffe000    1000
3:   00003000   00297d1fc000    1000
4:   00004000   00297e5f7000    1000
5:   00005000   00297d1fe000    1000
6:   00006000   00297c7f4000    1000
7:   00007000   00297dbf9000    1000
8:   00008000   00297eff3000    1000
9:   00009000   0029821c7000    1000
10:  0000a000   002982bbf000    1000
11:  0000b000   0029803e0000    1000
12:  0000c000   00297b400000    1000
13:  0000d000   002979601000    1000
14:  0000e000   002980dd5000    1000
15:  0000f000   0029821be000    1000
16:  00010000   00298715f000    1000
17:  00011000   002985d71000    1000
18:  00012000   00298537f000    1000
19:  00013000   002986760000    1000
20:  00014000   00298498d000    1000
21:  00015000   0029821b4000    1000
22:  00016000   0029817c7000    1000
23:  00017000   00298a2fa000    1000
24:  00018000   002988f1f000    1000
25:  00019000   00298d47f000    1000
26:  0001a000   00298c0af000    1000
27:  0001b000   00298a2ee000    1000
28:  0001c000   00298a2eb000    1000
29:  0001d000   0029905f2000    1000
30:  0001e000   00298f22a000    1000
31:  0001f000   00298de66000    1000
32:  00020000   00298ace3000    1000
33:  00021000   00298a2e9000    1000
34:  00022000   00298a2e7000    1000
35:  00023000   00298b6c3000    1000
36:  00024000   002990fd5000    1000
37:  00025000   002992d6c000    1000
38:  00026000   0029954db000    1000
39:  00027000   002993747000    1000
40:  00028000   002992d62000    1000
41:  00029000   002992389000
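The grep-in-a-loop one-liner quoted above does substring matching and is quadratic in the number of extents; a sorted exact-match intersection on the physical column is tighter. A sketch (the function name and temp paths are mine):

```shell
# shared_phys NEW OLD: print physical addresses present in both fiemap
# listings (column 3 is the physical address in the poster's ~/fiemap
# output); exact match, one pass per file instead of grep-per-extent
shared_phys() {
  awk '{print $3}' "$1" | sort -u > /tmp/phys.a
  awk '{print $3}' "$2" | sort -u > /tmp/phys.b
  comm -12 /tmp/phys.a /tmp/phys.b   # lines common to both sorted lists
}
```

An empty result confirms the observation above: the current file shares no physical locations with the snapshotted copy.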
Re: exclusive subvolume space missing
On 2017年12月11日 07:44, Qu Wenruo wrote:
>
> On 2017年12月10日 19:27, Tomasz Pala wrote:
>> On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:
>>
>>>> 1. is there any switch resulting in 'defrag only exclusive data'?
>>>
>>> IIRC, no.
>>
>> I have found a directory - pam_abl databases, which occupy 10 MB (yes,
>> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
>> defrag. After defragging files were not snapshotted again and I've lost
>> 3.6 GB again, so I got this fully reproducible.
>> There are 7 files, one of which is 99% of the space (10 MB). None of
>> them has nocow set, so they're riding all-btrfs.
>>
>> I could debug something before I'll clean this up, is there anything you
>> want me to check/know about the files?
>
> fiemap result along with btrfs dump-tree -t2 result.
>
> Both outputs have nothing related to file name/dir name, but only some
> "meaningless" bytenr, so it should be completely OK to share them.
>
>> The fragmentation impact is HUGE here, 1000-ratio is almost a DoS
>> condition which could be triggered by malicious user during a few hours
>> or faster
>
> You won't want to hear this:
> The biggest ratio in theory is, 128M / 4K = 32768.
>
>> - I've lost 3.6 GB during the night with reasonably small
>> amount of writes, I guess it might be possible to trash entire
>> filesystem within 10 minutes if doing this on purpose.
>
> That's a little complex.
> To get into such situation, snapshot must be used and one must know
> which file extent is shared and how it's shared.
>
> But yes, it's possible.
>
> While on the other hand, XFS, which also supports reflink, handles it
> quite well, so I'm wondering if it's possible for btrfs to follow its
> behavior.
>
>>>> 3. I guess there aren't, so how could I accomplish my target, i.e.
>>>>    reclaiming space that was lost due to fragmentation, without
>>>>    breaking snapshotted CoW where it would be not only pointless, but
>>>>    actually harmful?
>>>
>>> What about using old kernel, like v4.13?
>>
>> Unfortunately (I guess you had 3.13 in mind), I need the new ones and
>> will be pushing towards 4.14.
>
> No, I really mean v4.13.

My fault, it is v3.13. What a stupid error...

>
> From btrfs(5):
> ---
>    Warning
>        Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2 as
>        well as with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12
>        or ≥ 3.13.4 will break up the ref-links of CoW data (for
>        example files copied with cp --reflink, snapshots or
>        de-duplicated data). This may cause considerable increase of
>        space usage depending on the broken up ref-links.
> ---
>
>>>> 4. How can I prevent this from happening again? All the files that are
>>>>    written constantly (stats collector here, PostgreSQL database and
>>>>    logs on other machines), are marked with nocow (+C); maybe some new
>>>>    attribute to mark file as autodefrag? +t?
>>>
>>> Unfortunately, nocow only works if there is no other subvolume/inode
>>> referring to it.
>>
>> This shouldn't be my case anymore after defrag (==breaking links).
>> I guess no easy way to check refcounts of the blocks?
>
> No easy way unfortunately.
> It's either time consuming (used by qgroup) or complex (manually tree
> search and do the backref walk by yourself)
>
>>> But in my understanding, btrfs is not suitable for such conflicting
>>> situation, where you want to have snapshots of frequent partial updates.
>>>
>>> IIRC, btrfs is better for use case where either update is less frequent,
>>> or update is replacing the whole file, not just part of it.
>>>
>>> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which
>>> is pointing to /usr/bin and /usr/lib), but not for /var or /run.
>>
>> That is something coherent with my conclusions after 2 years on btrfs,
>> however I didn't expect a single file to eat 1000 times more space than
>> it should...
>>
>> I wonder how many other filesystems were trashed like this - I'm short
>> of ~10 GB on another system, many other users might be affected by that
>> (telling the Internet stories about btrfs running out of space).
>
> Firstly, no other filesystem supports snapshot.
> So it's pretty hard to get a baseline.
>
> But as I mentioned, XFS supports reflink, which means file extent can be
> shared between several inodes.
>
> From the message I got from XFS guys, they free any unused space of a
> file extent, so it should handle it quite well.
>
> But it's quite a hard work to achieve in btrfs, needs years development
> at least.
>
>> It is not a problem that I need to defrag a file, the problem is I don't
>> know:
>> 1. whether I need to defrag,
>> 2. *what* should I defrag
>> nor have a tool that would defrag smart - only the exclusive data or, in
>> general, the blocks that are worth defragging if space released from
>> extents is greater than space lost on
Re: exclusive subvolume space missing
On 2017年12月10日 19:27, Tomasz Pala wrote:
> On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:
>
>>> 1. is there any switch resulting in 'defrag only exclusive data'?
>>
>> IIRC, no.
>
> I have found a directory - pam_abl databases, which occupy 10 MB (yes,
> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
> defrag. After defragging files were not snapshotted again and I've lost
> 3.6 GB again, so I got this fully reproducible.
> There are 7 files, one of which is 99% of the space (10 MB). None of
> them has nocow set, so they're riding all-btrfs.
>
> I could debug something before I'll clean this up, is there anything you
> want me to check/know about the files?

fiemap result along with btrfs dump-tree -t2 result.

Both outputs have nothing related to file name/dir name, but only some
"meaningless" bytenr, so it should be completely OK to share them.

> The fragmentation impact is HUGE here, 1000-ratio is almost a DoS
> condition which could be triggered by malicious user during a few hours
> or faster

You won't want to hear this:
The biggest ratio in theory is, 128M / 4K = 32768.

> - I've lost 3.6 GB during the night with reasonably small
> amount of writes, I guess it might be possible to trash entire
> filesystem within 10 minutes if doing this on purpose.

That's a little complex.
To get into such situation, snapshot must be used and one must know
which file extent is shared and how it's shared.

But yes, it's possible.

While on the other hand, XFS, which also supports reflink, handles it
quite well, so I'm wondering if it's possible for btrfs to follow its
behavior.

>>> 3. I guess there aren't, so how could I accomplish my target, i.e.
>>>    reclaiming space that was lost due to fragmentation, without
>>>    breaking snapshotted CoW where it would be not only pointless, but
>>>    actually harmful?
>>
>> What about using old kernel, like v4.13?
>
> Unfortunately (I guess you had 3.13 in mind), I need the new ones and
> will be pushing towards 4.14.

No, I really mean v4.13.

From btrfs(5):
---
   Warning
       Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2 as
       well as with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12
       or ≥ 3.13.4 will break up the ref-links of CoW data (for
       example files copied with cp --reflink, snapshots or
       de-duplicated data). This may cause considerable increase of
       space usage depending on the broken up ref-links.
---

>>> 4. How can I prevent this from happening again? All the files that are
>>>    written constantly (stats collector here, PostgreSQL database and
>>>    logs on other machines), are marked with nocow (+C); maybe some new
>>>    attribute to mark file as autodefrag? +t?
>>
>> Unfortunately, nocow only works if there is no other subvolume/inode
>> referring to it.
>
> This shouldn't be my case anymore after defrag (==breaking links).
> I guess no easy way to check refcounts of the blocks?

No easy way unfortunately.
It's either time consuming (used by qgroup) or complex (manually tree
search and do the backref walk by yourself)

>> But in my understanding, btrfs is not suitable for such conflicting
>> situation, where you want to have snapshots of frequent partial updates.
>>
>> IIRC, btrfs is better for use case where either update is less frequent,
>> or update is replacing the whole file, not just part of it.
>>
>> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which
>> is pointing to /usr/bin and /usr/lib), but not for /var or /run.
>
> That is something coherent with my conclusions after 2 years on btrfs,
> however I didn't expect a single file to eat 1000 times more space than
> it should...
>
> I wonder how many other filesystems were trashed like this - I'm short
> of ~10 GB on another system, many other users might be affected by that
> (telling the Internet stories about btrfs running out of space).

Firstly, no other filesystem supports snapshot.
So it's pretty hard to get a baseline.

But as I mentioned, XFS supports reflink, which means file extent can be
shared between several inodes.

From the message I got from XFS guys, they free any unused space of a
file extent, so it should handle it quite well.

But it's quite a hard work to achieve in btrfs, needs years development
at least.

> It is not a problem that I need to defrag a file, the problem is I don't
> know:
> 1. whether I need to defrag,
> 2. *what* should I defrag
> nor have a tool that would defrag smart - only the exclusive data or, in
> general, the blocks that are worth defragging if space released from
> extents is greater than space lost on inter-snapshot duplication.
>
> I can't just defrag entire filesystem since it breaks links with
> snapshots. This change was a real deal-breaker here...

IIRC it's better to add an option to make defrag snapshot-aware.
(Don't break snapshot sharing but only to
Re: exclusive subvolume space missing
On Sun, Dec 10, 2017 at 12:27:38 +0100, Tomasz Pala wrote:

> I have found a directory - pam_abl databases, which occupy 10 MB (yes,
> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after

# df
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        64G   61G  2.8G  96% /

# btrfs fi du .
     Total   Exclusive  Set shared  Filename
     0.00B       0.00B           -  ./1/__db.register
  10.00MiB    10.00MiB           -  ./1/log.01
  16.00KiB       0.00B           -  ./1/hosts.db
  16.00KiB       0.00B           -  ./1/users.db
 168.00KiB       0.00B           -  ./1/__db.001
  40.00KiB       0.00B           -  ./1/__db.002
  44.00KiB       0.00B           -  ./1/__db.003
  10.28MiB    10.00MiB           -  ./1
     0.00B       0.00B           -  ./__db.register
  16.00KiB    16.00KiB           -  ./hosts.db
  16.00KiB    16.00KiB           -  ./users.db
  10.00MiB    10.00MiB           -  ./log.13
     0.00B       0.00B           -  ./__db.001
     0.00B       0.00B           -  ./__db.002
     0.00B       0.00B           -  ./__db.003
  20.31MiB    20.03MiB   284.00KiB  .

# btrfs fi defragment log.13

# df
/dev/sda2        64G   54G  9.4G  86% /

6.6 GB / 10 MB = 660:1 overhead within 1 day of uptime.

--
Tomasz Pala
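The 660:1 figure follows directly from the df delta; as a quick arithmetic check (decimal units, as in the post):

```shell
# space reclaimed by the defrag (Avail went 2.8G -> 9.4G) divided by the
# logical size of log.13 (10 MB), decimal units; prints 660:1
awk 'BEGIN { printf "%.0f:1\n", (9.4 - 2.8) * 1e9 / (10 * 1e6) }'
```

With binary GiB/MiB the ratio would come out slightly higher, around 676:1; either way, three orders of magnitude of overhead for one 10 MB file.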
Re: exclusive subvolume space missing
On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:

>> 1. is there any switch resulting in 'defrag only exclusive data'?
>
> IIRC, no.

I have found a directory - pam_abl databases, which occupy 10 MB (yes,
TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
defrag. After defragging files were not snapshotted again and I've lost
3.6 GB again, so I got this fully reproducible.
There are 7 files, one of which is 99% of the space (10 MB). None of
them has nocow set, so they're riding all-btrfs.

I could debug something before I'll clean this up, is there anything you
want me to check/know about the files?

The fragmentation impact is HUGE here, 1000-ratio is almost a DoS
condition which could be triggered by a malicious user during a few hours
or faster - I've lost 3.6 GB during the night with reasonably small
amount of writes, I guess it might be possible to trash entire
filesystem within 10 minutes if doing this on purpose.

>> 3. I guess there aren't, so how could I accomplish my target, i.e.
>>    reclaiming space that was lost due to fragmentation, without
>>    breaking snapshotted CoW where it would be not only pointless, but
>>    actually harmful?
>
> What about using old kernel, like v4.13?

Unfortunately (I guess you had 3.13 in mind), I need the new ones and
will be pushing towards 4.14.

>> 4. How can I prevent this from happening again? All the files that are
>>    written constantly (stats collector here, PostgreSQL database and
>>    logs on other machines), are marked with nocow (+C); maybe some new
>>    attribute to mark file as autodefrag? +t?
>
> Unfortunately, nocow only works if there is no other subvolume/inode
> referring to it.

This shouldn't be my case anymore after defrag (==breaking links).
I guess no easy way to check refcounts of the blocks?

> But in my understanding, btrfs is not suitable for such conflicting
> situation, where you want to have snapshots of frequent partial updates.
>
> IIRC, btrfs is better for use case where either update is less frequent,
> or update is replacing the whole file, not just part of it.
>
> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which
> is pointing to /usr/bin and /usr/lib), but not for /var or /run.

That is something coherent with my conclusions after 2 years on btrfs,
however I didn't expect a single file to eat 1000 times more space than
it should...

I wonder how many other filesystems were trashed like this - I'm short
of ~10 GB on another system, many other users might be affected by that
(telling the Internet stories about btrfs running out of space).

It is not a problem that I need to defrag a file, the problem is I don't
know:
1. whether I need to defrag,
2. *what* should I defrag
nor have a tool that would defrag smart - only the exclusive data or, in
general, the blocks that are worth defragging if space released from
extents is greater than space lost on inter-snapshot duplication.

I can't just defrag entire filesystem since it breaks links with
snapshots. This change was a real deal-breaker here...

Any way to feed the deduplication code with snapshots maybe? There are
directories and files in the same layout, this could be fast-tracked to
check and deduplicate.

--
Tomasz Pala
Re: exclusive subvolume space missing
On Sun, Dec 03, 2017 at 01:45:45 +0000, Duncan wrote:

> OTOH, it's also quite possible that people chose btrfs at least partly
> for other reasons, say the "storage pool" qualities, and would rather

Well, to name some:

1. filesystem-level backups via snapshot/send/receive - much cleaner and
   faster than rsyncs or other old-fashioned methods. This obviously
   requires the CoW-once feature;
   - caveat: for btrfs-killing usage patterns all the snapshots but the
     last one need to be removed;

2. block-level checksums with RAID1-awareness - in contrast to mdadm
   RAIDx, which chooses a random data copy from the underlying devices,
   this is much less susceptible to bit rot;
   - caveats: requires CoW enabled, RAID1 reading is dumb (even/odd PID
     instead of real balancing), no N-way mirroring nor write-mostly flag.

3. compression - there is no real alternative, however:
   - caveat: requires CoW enabled, which makes it not suitable for
     ...systemd journals, which compress with a great ratio (c.a. 1:10),
     nor for various databases, as they will be nocowed sooner or later;

4. storage pools you've mentioned - they are actually not much superior
   to an LVM-based approach; until one can create a subvolume with a
   different profile (e.g. 'disable RAID1 for /var/log/journal') it is
   still better to create separate filesystems, meaning one has to use
   LVM or (the hard way) partitioning.

Some of the drawbacks above are inherent to CoW and so shouldn't be
expected to be fixed internally, as the needs are conflicting, but their
impact might be nullified by some housekeeping.

--
Tomasz Pala
How is 'exclusive' in a parent qgroup computed? (was: Re: exclusive subvolume space missing)
02.12.2017 03:27, Qu Wenruo пишет: > > That's the difference between how sub show and quota works. > > For quota, it's per-root owner check. > Means even a file extent is shared between different inodes, if all > inodes are inside the same subvolume, it's counted as exclusive. > And if any of the file extent belongs to other subvolume, then it's > counted as shared. > Could you also explain how parent qgroup computes exclusive space? I.e. 10:~ # mkfs -t btrfs -f /dev/sdb1 btrfs-progs v4.13.3 See http://btrfs.wiki.kernel.org for more information. Performing full device TRIM /dev/sdb1 (1023.00MiB) ... Label: (null) UUID: b9b0643f-a248-4667-9e69-acf5baaef05b Node size: 16384 Sector size:4096 Filesystem size:1023.00MiB Block group profiles: Data: single8.00MiB Metadata: DUP 51.12MiB System: DUP 8.00MiB SSD detected: no Incompat features: extref, skinny-metadata Number of devices: 1 Devices: IDSIZE PATH 1 1023.00MiB /dev/sdb1 10:~ # mount -t btrfs /dev/sdb1 /mnt 10:~ # cd /mnt 10:/mnt # btrfs quota enable . 10:/mnt # btrfs su cre sub1 Create subvolume './sub1' 10:/mnt # dd if=/dev/urandom of=sub1/file1 bs=1K count=1024 1024+0 records in 1024+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00833739 s, 126 MB/s 10:/mnt # dd if=/dev/urandom of=sub1/file2 bs=1K count=1024 1024+0 records in 1024+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0179272 s, 58.5 MB/s 10:/mnt # btrfs subvolume snapshot sub1 sub2 Create a snapshot of 'sub1' in './sub2' 10:/mnt # dd if=/dev/urandom of=sub2/file2 bs=1K count=1024 conv=notrunc 1024+0 records in 1024+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0348762 s, 30.1 MB/s 10:/mnt # btrfs qgroup show --sync -p . qgroupid rfer excl parent -- 0/5 16.00KiB 16.00KiB --- 0/256 2.02MiB 1.02MiB --- 0/257 2.02MiB 1.02MiB --- So far so good. This is expected, each subvolume has 1MiB shared and 1MiB exclusive. 
10:/mnt # btrfs qgroup create 22/7 /mnt
10:/mnt # btrfs qgroup assign --rescan 0/256 22/7 /mnt
Quota data changed, rescan scheduled
10:/mnt # btrfs quota rescan -s /mnt
no rescan operation in progress
10:/mnt # btrfs qgroup assign --rescan 0/257 22/7 /mnt
Quota data changed, rescan scheduled
10:/mnt # btrfs quota rescan -s /mnt
no rescan operation in progress
10:/mnt # btrfs qgroup show --sync -p .
qgroupid         rfer         excl parent
--------         ----         ---- ------
0/5          16.00KiB     16.00KiB ---
0/256         2.02MiB      1.02MiB 22/7
0/257         2.02MiB      1.02MiB 22/7
22/7          3.03MiB      3.03MiB ---
10:/mnt #

Oops. The total for 22/7 is correct (1MiB shared + 2 * 1MiB exclusive),
but why is all the data treated as exclusive here? It does not match your
explanation ...
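[Editorial aside: a toy model of how I understand the accounting — my own sketch, not the kernel implementation. A level-0 qgroup counts an extent as exclusive when only that subvolume references it; a higher-level qgroup counts an extent as exclusive when no subvolume *outside the group* references it. Under that reading, the 3.03MiB excl for 22/7 is self-consistent:]

```python
# Toy model of qgroup accounting (a sketch, not btrfs code; metadata
# overhead ignored).  Each extent records which subvolumes reference it.
def rfer(extents, members):
    return sum(sz for sz, refs in extents if refs & members)

def excl(extents, members):
    # exclusive = referenced by no subvolume outside the qgroup's members
    return sum(sz for sz, refs in extents if refs and refs <= members)

MB = 1024 * 1024
extents = [
    (MB, {"sub1", "sub2"}),  # file1: still shared after the snapshot
    (MB, {"sub1"}),          # the old blocks of file2, held by sub1
    (MB, {"sub2"}),          # sub2's rewritten file2
]
for name, members in [("0/256", {"sub1"}), ("0/257", {"sub2"}),
                      ("22/7", {"sub1", "sub2"})]:
    print(name, rfer(extents, members) // MB, excl(extents, members) // MB)
# 0/256 and 0/257 each show rfer 2, excl 1 -- but for 22/7 the shared
# 1 MiB becomes exclusive *to the group*, because nothing outside 22/7
# references it, giving rfer 3, excl 3.
```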
Re: exclusive subvolume space missing
On Sun, Dec 3, 2017 at 3:47 AM, Adam Borowski wrote:
> I'd say that the only good use for nocow is "I wish I had placed this file
> on a non-btrfs, but it'd be too much hassle to repartition".
>
> If you snapshot nocow at all, you get the worst of both worlds.

I think it's better to have the option than not have it, but for the
regular Joe user I think it's a problem. And that's why I'm not such a
big fan of systemd-journald using chattr +C on journals by default when
on Btrfs.

I wouldn't mind it if systemd also made /var/log/journal/ a subvolume,
just like it automatically creates /var/lib/machines as a subvolume.
That way /var/log/journal would be immune to snapshots by default. Or
alternatively a rework of how journals are written to be more COW
friendly.

-- 
Chris Murphy
Re: exclusive subvolume space missing
On Fri, Dec 1, 2017 at 5:53 PM, Tomasz Pala wrote:
> # btrfs fi usage /
> Overall:
>     Device size:         128.00GiB
>     Device allocated:    117.19GiB
>     Device unallocated:   10.81GiB
>     Device missing:          0.00B
>     Used:                103.56GiB
>     Free (estimated):     11.19GiB  (min: 11.14GiB)
>     Data ratio:               1.98
>     Metadata ratio:           2.00
>     Global reserve:      146.08MiB  (used: 0.00B)
>
> Data,single: Size:1.19GiB, Used:1.18GiB
>    /dev/sda2    1.07GiB
>    /dev/sdb2  132.00MiB

This is asking for trouble. Two devices have single-copy data chunks; if
either of those drives dies, you lose that data. But the metadata
referring to those files will survive, and Btrfs will keep complaining
about them at every scrub until they're all deleted - there is no command
that makes this easy. You'd have to scrape the scrub output, which
includes paths to the missing files, and script something to delete them
all.

You should convert this with something like
'btrfs balance start -dconvert=raid1,soft '

-- 
Chris Murphy
Re: exclusive subvolume space missing
On 2017-12-02 17:33, Tomasz Pala wrote:
> OK, I seriously need to address that, as during the night I lost
> 3 GB again:
>
> On Sat, Dec 02, 2017 at 10:35:12 +0800, Qu Wenruo wrote:
>
>>> # btrfs fi sh /
>>> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
>>> Total devices 2 FS bytes used 44.10GiB
>    Total devices 2 FS bytes used 47.28GiB
>
>>> # btrfs fi usage /
>>> Overall:
>>> Used: 88.19GiB
>    Used: 94.58GiB
>>> Free (estimated): 18.75GiB (min: 18.75GiB)
>    Free (estimated): 15.56GiB (min: 15.56GiB)
>>>
>>> # btrfs dev usage /
> - output not changed
>
>>> # btrfs fi df /
>>> Data, RAID1: total=51.97GiB, used=43.22GiB
>    Data, RAID1: total=51.97GiB, used=46.42GiB
>>> System, RAID1: total=32.00MiB, used=16.00KiB
>>> Metadata, RAID1: total=2.00GiB, used=895.69MiB
>>> GlobalReserve, single: total=131.14MiB, used=0.00B
>    GlobalReserve, single: total=135.50MiB, used=0.00B
>>>
>>> # df
>>> /dev/sda2  64G  45G  19G  71% /
>    /dev/sda2  64G  48G  16G  76% /
>
>>> However the difference is on the active root fs:
>>>
>>> -0/291  24.29GiB   9.77GiB
>>> +0/291  15.99GiB  76.00MiB
>     0/291  19.19GiB   3.28GiB
>>
>> Since you have already showed the size of the snapshots, which hardly
>> goes beyond 1G, it may be possible that extent booking is the cause.
>>
>> And considering it's all exclusive, defrag may help in this case.
>
> I'm going to try defrag here, but have a bunch of questions before;
> as defrag would break CoW, I don't want to defrag files that span
> multiple snapshots, unless they have huge overhead:
> 1. is there any switch resulting in 'defrag only exclusive data'?

IIRC, no.

> 2. is there any switch resulting in 'defrag only extents fragmented
>    more than X' or 'defrag only fragments that would possibly be freed'?

No, neither.

> 3. I guess there aren't, so how could I accomplish my target, i.e.
>    reclaiming space that was lost due to fragmentation, without breaking
>    snapshotted CoW where it would be not only pointless, but actually
>    harmful?

What about using an old kernel, like v4.13?

> 4.
How can I prevent this from happening again? All the files that are
>    written constantly (the stats collector here, the PostgreSQL database
>    and logs on other machines) are marked with nocow (+C); maybe some
>    new attribute to mark a file as autodefrag? +t?

Unfortunately, nocow only works if there is no other subvolume/inode
referring to it. That is to say, if you're using snapshots, then NOCOW
won't help as much as you expected, but it's still much better than
normal data cow.

>
> For example, the largest file from the stats collector:
> Total       Exclusive   Set shared  Filename
> 432.00KiB   176.00KiB   256.00KiB   load/load.rrd
>
> but most of them have 'Set shared' == 0.
>
> 5. The stats collector has been running from the beginning; according
>    to the quota output it was not the issue until something happened.
>    If the problem was triggered by (guessing) a low-space condition,
>    and it results in even more space lost, there is a positive feedback
>    loop that is dangerous, as it makes any filesystem unstable ("once
>    you run out of space, you won't recover"). Does it mean btrfs is
>    simply not suitable (yet?) for frequent-update usage patterns, like
>    RRD files?

Hard to say the cause.

But in my understanding, btrfs is not suitable for such a conflicting
situation, where you want to have snapshots of frequent partial updates.

IIRC, btrfs is better for use cases where either updates are less
frequent, or an update replaces the whole file, not just part of it.

So btrfs is good for a root filesystem like /etc and /usr (and /bin and
/lib, which point to /usr/bin and /usr/lib), but not for /var or /run.

>
> 6. Or maybe some extra steps should be taken just before taking a
>    snapshot? I guess 'defrag exclusive' would be perfect here -
>    reclaiming space before it is locked inside a snapshot.

Yes, this sounds perfectly reasonable.

Thanks,
Qu

> Rationale behind this is obvious: since the snapshot-aware defrag was
> removed, allow defragging snapshot-exclusive data only.
> This would of course result in partial file defragmentation, but that
> should be enough for pathological cases like mine.
Re: exclusive subvolume space missing
On Sun, Dec 03, 2017 at 01:45:45AM +, Duncan wrote:
> Tomasz Pala posted on Sat, 02 Dec 2017 18:18:19 +0100 as excerpted:
>>> I got ~500 small files (100-500 kB) updated partially in regular
>>> intervals:
>>>
>>> # du -Lc **/*.rrd | tail -n1
>>> 105M    total
>
> FWIW, I've no idea what rrd files, or rrdcached (from the grandparent
> post) are (other than that a quick google suggests that it's...
> round-robin-database...

Basically: preallocate a file, whose size never changes from then on.
Every few minutes, write several bytes into the file, slowly advancing.
This is indeed the worst possible case for btrfs, and nocow doesn't help
in the slightest, as the database doesn't wrap around before a typical
snapshot interval.

> Meanwhile, /because/ nocow has these complexities along with others (nocow
> automatically turns off data checksumming and compression for the files
> too), and the fact that they nullify some of the big reasons people might
> choose btrfs in the first place, I actually don't recommend setting
> nocow in the first place -- if usage is such that a file needs nocow,
> my thinking is that btrfs isn't a particularly good hosting choice for
> that file in the first place, a more traditional rewrite-in-place
> filesystem is likely to be a better fit.

I'd say that the only good use for nocow is "I wish I had placed this file
on a non-btrfs, but it'd be too much hassle to repartition".

If you snapshot nocow at all, you get the worst of both worlds.

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Mozilla's Hippocritic Oath: "Keep trackers off your trail"
⣾⠁⢰⠒⠀⣿⡁ blah blah evading "tracking technology" blah blah
⢿⡄⠘⠷⠚⠋⠀ "https://click.e.mozilla.org/?qs=e7bb0dcf14b1013fca3820..."
⠈⠳⣄ (same for all links)
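[Editorial aside: Adam's point — that nocow buys nothing for a slowly-advancing writer under periodic snapshots — can be sketched as a toy model. The write counts and intervals below are invented for illustration.]

```python
# Toy model of cow1 semantics for a nocow file: a block is CoWed only
# on its *first* write after each snapshot.  An RRD-style writer never
# revisits a block between snapshots, so every write is a first write
# and nocow saves nothing.  Numbers here are illustrative.
def cow1_events(writes, snapshot_every):
    cowed = set()   # blocks already CoWed since the last snapshot
    events = 0
    for i, block in enumerate(writes):
        if i and i % snapshot_every == 0:
            cowed.clear()           # snapshot pins old extents; reset
        if block not in cowed:
            events += 1
            cowed.add(block)
    return events

# Advancing writer, hourly snapshots, 12 writes per hour: 120 distinct
# blocks -> 120 CoW events, exactly as if the file were plain cow.
print(cow1_events(list(range(120)), 12))   # 120

# Contrast: one hot block rewritten 120 times CoWs only once per
# snapshot interval.
print(cow1_events([7] * 120, 12))          # 10
```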
Re: exclusive subvolume space missing
Tomasz Pala posted on Sat, 02 Dec 2017 18:18:19 +0100 as excerpted: > On Sat, Dec 02, 2017 at 17:28:12 +0100, Tomasz Pala wrote: > >>> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from >> [...] >>> Now make various small changes to the file, say under 16 KiB each. These >>> will each be COWed elsewhere as one might expect. by default 16 KiB at >>> a time I believe (might be 4 KiB, as it was back when the default leaf >> >> I got ~500 small files (100-500 kB) updated partially in regular >> intervals: >> >> # du -Lc **/*.rrd | tail -n1 >> 105Mtotal FWIW, I've no idea what rrd files, or rrdcached (from the grandparent post) are (other than that a quick google suggests that it's... round-robin-database... and the database bit alone sounds bad in this context as database-file rewrites are known to be a worst-case for cow-based filesystems), but it sounds like you suspect that they have this rewrite-most pattern that could explain your problem... >>> But here's the kicker. Even without a snapshot locking that original 100 >>> MiB extent in place, if even one of the original 16 KiB blocks isn't >>> rewritten, that entire 100 MiB extent will remain locked in place, as the >>> original 16 KiB blocks that have been changed and thus COWed elsewhere >>> aren't freed one at a time, the full 100 MiB extent only gets freed, all >>> at once, once no references to it remain, which means once that last >>> block of the extent gets rewritten. > > OTOH - should this happen with nodatacow files? As I mentioned before, > these files are chattred +C (however this was not their initial state > due to https://bugzilla.kernel.org/show_bug.cgi?id=189671 ). > Am I wrong thinking, that in such case they should occupy twice their > size maximum? Or maybe there is some tool that could show me the real > space wasted by file, including extents count etc? Nodatacow... isn't as simple as the name might suggest. 
For one thing, snapshots depend on COW and lock the extents they reference in-place, so while a file might be set nocow and that setting is retained, the first write to a block after a snapshot *MUST* cow that block... because the snapshot has the existing version referenced and it can't change without changing the snapshot as well, and that would of course defeat the purpose of snapshots. Tho the attribute is retained and further writes to the same already cowed block won't cow it again. FWIW, on this list that behavior is often referred to as cow1, cow only the first time that a block is written after a snapshot locks the previous version in place. The effect of cow1 depends on the frequency and extent of block rewrites vs. the frequency of snapshots of the subvolume they're on. As should be obvious if you think about it, once you've done the cow1, further rewrites to the same block before further snapshots won't cow further, so if only a few blocks are repeatedly rewritten multiple times between snapshots, the effect should be relatively small. Similarly if snapshots happen far more frequently than block rewrites, since in that case most of the snapshots won't have anything changed (for that file anyway) since the last one. However, if most of the file gets rewritten between snapshots and the snapshot frequency is often enough to be a major factor, the effect can be practically as bad as if the file weren't nocow in the first place. If I knew a bit more about rrd's rewrite pattern... and your snapshot pattern... Second, as you alluded, for btrfs files must be set nocow before anything is written to them. Quoting the chattr (1) manpage: "If it is set on a file which already has data blocks, it is undefined when the blocks assigned to the file will be fully stable." Not being a dev I don't read the code to know what that means in practice, but it could well be effectively cow1, which would yield the maximum 2X size you assumed. 
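[Editorial aside: Duncan's "depends on rewrite frequency vs. snapshot frequency" argument can be put into rough numbers. This is a deliberately worst-case back-of-the-envelope model, not actual btrfs accounting: assume each snapshot pins one copy of every block rewritten before the next snapshot, on top of the live file.]

```python
# Worst-case toy model of cow1 space retention for a nocow file under
# periodic snapshots (my simplification, not real btrfs accounting).
def pinned_blocks(file_blocks, rewritten_between_snapshots, snapshots):
    per_interval = min(rewritten_between_snapshots, file_blocks)
    return file_blocks + snapshots * per_interval

# Most of the file rewritten between each of 10 snapshots: practically
# as bad as plain cow -- 11x the file size stays pinned on disk.
print(pinned_blocks(100, 100, 10))  # 1100

# Only a couple of hot blocks rewritten between snapshots: the effect
# is small, as noted above.
print(pinned_blocks(100, 2, 10))    # 120
```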
But I think it's best to take "undefined" at its meaning, and assume the
worst case of "no effect at all" for size-calculation purposes, unless you
really /did/ set it at file creation, before the file had content.

And the easiest way to do /that/, and something that might be worthwhile
doing anyway if you think unreclaimed still-referenced extents are your
problem, is to set the nocow flag on the /directory/, then copy the files
into it, taking care to actually create them new: that is, use
--reflink=never, or copy the files to a different filesystem, perhaps
tmpfs, and back, so they /have/ to be created new. Of course with the
rewriter (rrdcached, apparently) shut down for the process.

Then, once the files are safely back in place and the filesystem synced
so the data is actually on disk, you can delete the old copies (which
will continue to serve as backups until then), and sync the filesystem
again. While snapshots will of course continue to keep extents they
reference locked, for unsnapshotted files at least, this process should
clear up any still-referenced old extents.
Re: exclusive subvolume space missing
On Sat, Dec 02, 2017 at 17:28:12 +0100, Tomasz Pala wrote:
>> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
> [...]
>> Now make various small changes to the file, say under 16 KiB each. These
>> will each be COWed elsewhere as one might expect. by default 16 KiB at
>> a time I believe (might be 4 KiB, as it was back when the default leaf
>
> I got ~500 small files (100-500 kB) updated partially in regular
> intervals:
>
> # du -Lc **/*.rrd | tail -n1
> 105M    total
>
>> But here's the kicker. Even without a snapshot locking that original 100
>> MiB extent in place, if even one of the original 16 KiB blocks isn't
>> rewritten, that entire 100 MiB extent will remain locked in place, as the
>> original 16 KiB blocks that have been changed and thus COWed elsewhere
>> aren't freed one at a time; the full 100 MiB extent only gets freed, all
>> at once, once no references to it remain, which means once that last
>> block of the extent gets rewritten.

OTOH - should this happen with nodatacow files? As I mentioned before,
these files are chattr'ed +C (however this was not their initial state,
due to https://bugzilla.kernel.org/show_bug.cgi?id=189671 ).
Am I wrong in thinking that in such a case they should occupy twice their
size at most? Or maybe there is some tool that could show me the real
space wasted by a file, including extent counts etc.?

-- 
Tomasz Pala
Re: exclusive subvolume space missing
On Fri, 01 Dec 2017 18:57:08 -0800, Duncan wrote:

> OK, is this supposed to be raid1 or single data, because the above shows
> metadata as all raid1, while some data is single tho most is raid1, and
> while old mkfs used to create unused single chunks on raid1 that had to
> be removed manually via balance, those single data chunks aren't unused.

It is supposed to be RAID1; the single data chunks were leftovers from my
previous attempts to gain some space by converting to the single profile.
Which miserably failed BTW (would it have been smarter with the "soft"
option?), but I've already managed to clear this.

> Assuming the intent is raid1, I'd recommend doing...
>
> btrfs balance start -dconvert=raid1,soft /

Yes, this was the way to go. It also reclaimed the 8 GB. I assume the
failing -dconvert=single somehow locked that 8 GB, so this issue should
be addressed in btrfs-tools, to report such a locked-out region.

You've already noted that the single-profile data occupied much less
space by itself. So this was the first issue; the second is the running
overhead that accumulates over time. Since yesterday, when I had 19 GB
free, I've lost 4 GB already. The scenario you've described is very
probable:

> btrfs balance start -dusage=N /
[...]
> allocated value toward usage. I too run relatively small btrfs raid1s
> and would suggest trying N=5, 20, 40, 70, until the spread between

There were no effects above N=10 (both dusage and musage).

> consuming your space either, as I'd suspect they might if the problem were
> for instance atime updates, so while noatime is certainly recommended and

I have used noatime by default for years, so that's not the source of the
problem here.

> The other possibility that comes to mind here has to do with btrfs COW
> write patterns...
> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
[...]
> Now make various small changes to the file, say under 16 KiB each. These
> will each be COWed elsewhere as one might expect.
by default 16 KiB at a time I believe (might be 4 KiB, as it was back
when the default leaf

I got ~500 small files (100-500 kB) updated partially in regular
intervals:

# du -Lc **/*.rrd | tail -n1
105M    total

> But here's the kicker. Even without a snapshot locking that original 100
> MiB extent in place, if even one of the original 16 KiB blocks isn't
> rewritten, that entire 100 MiB extent will remain locked in place, as the
> original 16 KiB blocks that have been changed and thus COWed elsewhere
> aren't freed one at a time; the full 100 MiB extent only gets freed, all
> at once, once no references to it remain, which means once that last
> block of the extent gets rewritten.
>
> So perhaps you have a pattern where files of several MiB get mostly
> rewritten, taking more space for the rewrites due to COW, but one or
> more blocks remain as originally written, locking the original extent
> in place at its full size, thus taking twice the space of the original
> file.
>
> Of course worst-case is rewrite the file minus a block, then rewrite
> that minus a block, then rewrite... in which case the total space
> usage will end up being several times the size of the original file!
>
> Luckily few people have this sort of usage pattern, but if you do...
>
> It would certainly explain the space eating...

Did anyone investigate how this is related to RRD rewrites? I don't use
rrdcached, and never thought that 100 MB of data might trash the entire
filesystem...

best regards,
-- 
Tomasz Pala
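[Editorial aside: Duncan's worst case — "rewrite the file minus a block, then that minus a block, then rewrite..." — can be simulated with a toy model of extent booking. This is a sketch of the rule he describes, not btrfs code.]

```python
# Toy model of btrfs extent booking (a sketch, not actual btrfs code):
# an extent is freed only once *no* file block references any part of
# it.  Pass k rewrites the whole file except its first k blocks, so a
# single surviving block pins each generation's big extent.
def disk_blocks(file_blocks, passes):
    extents = {0: file_blocks}      # extent id -> allocated size
    refs = {0: file_blocks}         # live block references per extent
    owner = [0] * file_blocks       # which extent holds each file block
    for k in range(1, passes + 1):
        blocks = range(k, file_blocks)      # rewrite file minus k blocks
        extents[k] = refs[k] = len(blocks)  # one new extent per pass
        for b in blocks:
            old = owner[b]
            refs[old] -= 1
            if refs[old] == 0:              # fully unreferenced: freed
                del extents[old], refs[old]
            owner[b] = k
    return sum(extents.values())

# A 100-block file after 5 such rewrites occupies 100+99+98+97+96+95
# = 585 blocks on disk -- nearly 6x the logical file size.
print(disk_blocks(100, 5))  # 585
```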
Re: exclusive subvolume space missing
OK, I seriously need to address that, as during the night I lost
3 GB again:

On Sat, Dec 02, 2017 at 10:35:12 +0800, Qu Wenruo wrote:

>> # btrfs fi sh /
>> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
>> Total devices 2 FS bytes used 44.10GiB
   Total devices 2 FS bytes used 47.28GiB

>> # btrfs fi usage /
>> Overall:
>> Used: 88.19GiB
   Used: 94.58GiB
>> Free (estimated): 18.75GiB (min: 18.75GiB)
   Free (estimated): 15.56GiB (min: 15.56GiB)
>>
>> # btrfs dev usage /
- output not changed

>> # btrfs fi df /
>> Data, RAID1: total=51.97GiB, used=43.22GiB
   Data, RAID1: total=51.97GiB, used=46.42GiB
>> System, RAID1: total=32.00MiB, used=16.00KiB
>> Metadata, RAID1: total=2.00GiB, used=895.69MiB
>> GlobalReserve, single: total=131.14MiB, used=0.00B
   GlobalReserve, single: total=135.50MiB, used=0.00B
>>
>> # df
>> /dev/sda2  64G  45G  19G  71% /
   /dev/sda2  64G  48G  16G  76% /

>> However the difference is on the active root fs:
>>
>> -0/291  24.29GiB   9.77GiB
>> +0/291  15.99GiB  76.00MiB
    0/291  19.19GiB   3.28GiB

> Since you have already showed the size of the snapshots, which hardly
> goes beyond 1G, it may be possible that extent booking is the cause.
>
> And considering it's all exclusive, defrag may help in this case.

I'm going to try defrag here, but have a bunch of questions first;
as defrag would break CoW, I don't want to defrag files that span
multiple snapshots, unless they have huge overhead:

1. Is there any switch resulting in 'defrag only exclusive data'?
2. Is there any switch resulting in 'defrag only extents fragmented more
   than X' or 'defrag only fragments that would possibly be freed'?
3. I guess there aren't, so how could I accomplish my target, i.e.
   reclaiming space that was lost due to fragmentation, without breaking
   snapshotted CoW where it would be not only pointless, but actually
   harmful?
4. How can I prevent this from happening again?
All the files that are written constantly (the stats collector here, the
PostgreSQL database and logs on other machines) are marked with nocow
(+C); maybe some new attribute to mark a file as autodefrag? +t?

For example, the largest file from the stats collector:

Total       Exclusive   Set shared  Filename
432.00KiB   176.00KiB   256.00KiB   load/load.rrd

but most of them have 'Set shared' == 0.

5. The stats collector has been running from the beginning; according to
   the quota output it was not the issue until something happened. If the
   problem was triggered by (guessing) a low-space condition, and it
   results in even more space lost, there is a positive feedback loop
   that is dangerous, as it makes any filesystem unstable ("once you run
   out of space, you won't recover"). Does it mean btrfs is simply not
   suitable (yet?) for frequent-update usage patterns, like RRD files?
6. Or maybe some extra steps should be taken just before taking a
   snapshot? I guess 'defrag exclusive' would be perfect here -
   reclaiming space before it is locked inside a snapshot. The rationale
   behind this is obvious: since snapshot-aware defrag was removed, allow
   defragging only snapshot-exclusive data. This would of course result
   in partial file defragmentation, but that should be enough for
   pathological cases like mine.

-- 
Tomasz Pala
Re: exclusive subvolume space missing
Tomasz Pala posted on Sat, 02 Dec 2017 01:53:39 +0100 as excerpted:

> # btrfs fi usage /
> Overall:
>     Device size:         128.00GiB
>     Device allocated:    117.19GiB
>     Device unallocated:   10.81GiB
>     Device missing:          0.00B
>     Used:                103.56GiB
>     Free (estimated):     11.19GiB  (min: 11.14GiB)
>     Data ratio:               1.98
>     Metadata ratio:           2.00
>     Global reserve:      146.08MiB  (used: 0.00B)
>
> Data,single: Size:1.19GiB, Used:1.18GiB
>    /dev/sda2    1.07GiB
>    /dev/sdb2  132.00MiB
>
> Data,RAID1: Size:55.97GiB, Used:50.30GiB
>    /dev/sda2   55.97GiB
>    /dev/sdb2   55.97GiB
>
> Metadata,RAID1: Size:2.00GiB, Used:908.61MiB
>    /dev/sda2    2.00GiB
>    /dev/sdb2    2.00GiB
>
> System,RAID1: Size:32.00MiB, Used:16.00KiB
>    /dev/sda2   32.00MiB
>    /dev/sdb2   32.00MiB
>
> Unallocated:
>    /dev/sda2    4.93GiB
>    /dev/sdb2    5.87GiB

OK, is this supposed to be raid1 or single data? Because the above shows
metadata as all raid1, while some data is single tho most is raid1, and
while old mkfs used to create unused single chunks on raid1 that had to
be removed manually via balance, those single data chunks aren't unused.
Which means that if it's supposed to be raid1, you don't have redundancy
on that single data.

Assuming the intent is raid1, I'd recommend doing...

btrfs balance start -dconvert=raid1,soft /

Probably disable quotas at least temporarily while you do so, tho, as
they don't scale well with balance and make it take much longer.

That should go reasonably fast as it's only a bit over 1 GiB on the one
device, and 132 MiB on the other (from your btrfs device usage), and the
soft option allows it to skip chunks that don't need conversion. It
should kill those single entries and even up usage on both devices, along
with making the filesystem much more tolerant of loss of one of the two
devices.

Other than that, what we can see from the above is that it's a relatively
small filesystem, 64 GiB each on a pair of devices, raid1 but for the
above. We also see that the spread between allocated chunks and chunk
usage isn't /too/ bad, that being a somewhat common problem.
However, given the relatively small 64 GiB per device pair-device raid1 filesystem, there is some slack, about 5 GiB worth, in that raid1 data, that you can recover. btrfs balance start -dusage=N / Where N represents a percentage full, so 0-100. Normally, smaller values of N complete much faster, with the most effect if they're enough, because at say 10% usage, 10 90% empty chunks can be rewritten into a single 100% full chunk. The idea is to start with a small N value since it completes fast, and redo with higher values as necessary to shrink the total data chunk allocated value toward usage. I too run relatively small btrfs raid1s and would suggest trying N=5, 20, 40, 70, until the spread between used and total is under 2 gigs, under a gig if you want to go that far (nominal data chunk size is a gig so even a full balance will be unlikely to get you a spread less than that). Over 70 likely won't get you much so isn't worth it. That should return the excess to unallocated, leaving the filesystem able to use the freed space for data or metadata chunks as necessary, tho you're unlikely to see an increase in available space in (non-btrfs) df or similar. If the unallocated value gets down below 1 GiB you may have issues trying to free space since balance will want space to write the chunk it's going to write into to free the others, so you probably want to keep an eye on this and rebalance if it gets under 2-3 gigs free space, assuming of course that there's slack between used and total that /can/ be freed by a rebalance. FWIW the same can be done with metadata using -musage=, with metadata chunks being 256 MiB nominal, but keep in mind that global reserve is allocated from metadata space but doesn't count as used, so you typically can't get the spread down below half a GiB or so. And in most cases it's data chunks that get the big spread, not metadata, so it's much more common to have to do -d for data than -m for metadata. 
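[Editorial aside: the effect of `btrfs balance start -dusage=N` described above can be sketched numerically. This is a toy model with invented chunk-usage percentages, not real btrfs behavior in detail.]

```python
# Toy model of `btrfs balance start -dusage=N` (illustrative numbers):
# data chunks at or below N percent usage are rewritten, packing their
# live data into as few fresh chunks as possible; the emptied chunks
# return to the unallocated pool.
import math

CHUNK = 1024  # nominal 1 GiB data chunk, in MiB

def chunks_after_balance(usage_pcts, n):
    kept = [u for u in usage_pcts if u > n]
    moved_mib = sum(u for u in usage_pcts if u <= n) / 100 * CHUNK
    return len(kept) + math.ceil(moved_mib / CHUNK)

chunks = [95, 90, 10, 8, 7, 5, 40]   # hypothetical chunk usage in %
print(len(chunks), "chunks ->", chunks_after_balance(chunks, 10))
# 7 chunks -> 4: the four nearly-empty chunks compact into one, so
# roughly 3 GiB of allocated-but-unused space goes back to unallocated.
```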
All that said, the numbers don't show a runaway spread between total and used, so while this might help, it's not going to fix the primary space being eaten problem of the thread, as I had hoped it might. Additionally, at 2 GiB total per device, metadata chunks aren't runaway consuming your space either, as I'd suspect they might if the problem were for instance atime updates, so while noatime is certainly recommended and might help some, it doesn't appear to be a primary contributor to the problem either. The other possibility that comes to mind here has to do with btrfs COW write patterns... Suppose you start with a 100 MiB file (I'm adjusting the sizes down from the GiB+ example typically used due to the filesystem size
Re: exclusive subvolume space missing
On 2017-12-02 10:21, Tomasz Pala wrote:
> On Sat, Dec 02, 2017 at 09:47:19 +0800, Qu Wenruo wrote:
>
>>> Actually I should rephrase the problem:
>>>
>>> "snapshot has taken 8 GB of space despite nothing has altered source
>>> subvolume"
>
> Actually, after:
>
> # btrfs balance start -v -dconvert=raid1 /
> ctrl-c on block group 35G/113G
> # btrfs balance start -v -dconvert=raid1,soft /
> # btrfs balance start -v -dusage=55 /
> Done, had to relocate 1 out of 56 chunks
> # btrfs balance start -v -musage=55 /
> Done, had to relocate 2 out of 55 chunks
>
> and waiting a few minutes after ...the 8 GB I've lost yesterday is back:
>
> # btrfs fi sh /
> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
> Total devices 2 FS bytes used 44.10GiB
> devid 1 size 64.00GiB used 54.00GiB path /dev/sda2
> devid 2 size 64.00GiB used 54.00GiB path /dev/sdb2
>
> # btrfs fi usage /
> Overall:
>     Device size:         128.00GiB
>     Device allocated:    108.00GiB
>     Device unallocated:   20.00GiB
>     Device missing:          0.00B
>     Used:                 88.19GiB
>     Free (estimated):     18.75GiB  (min: 18.75GiB)
>     Data ratio:               2.00
>     Metadata ratio:           2.00
>     Global reserve:      131.14MiB  (used: 0.00B)
>
> Data,RAID1: Size:51.97GiB, Used:43.22GiB
>    /dev/sda2  51.97GiB
>    /dev/sdb2  51.97GiB
>
> Metadata,RAID1: Size:2.00GiB, Used:895.69MiB
>    /dev/sda2   2.00GiB
>    /dev/sdb2   2.00GiB
>
> System,RAID1: Size:32.00MiB, Used:16.00KiB
>    /dev/sda2  32.00MiB
>    /dev/sdb2  32.00MiB
>
> Unallocated:
>    /dev/sda2  10.00GiB
>    /dev/sdb2  10.00GiB
>
> # btrfs dev usage /
> /dev/sda2, ID: 1
>    Device size:     64.00GiB
>    Device slack:       0.00B
>    Data,RAID1:      51.97GiB
>    Metadata,RAID1:   2.00GiB
>    System,RAID1:    32.00MiB
>    Unallocated:     10.00GiB
>
> /dev/sdb2, ID: 2
>    Device size:     64.00GiB
>    Device slack:       0.00B
>    Data,RAID1:      51.97GiB
>    Metadata,RAID1:   2.00GiB
>    System,RAID1:    32.00MiB
>    Unallocated:     10.00GiB
>
> # btrfs fi df /
> Data, RAID1: total=51.97GiB, used=43.22GiB
> System, RAID1: total=32.00MiB, used=16.00KiB
> Metadata, RAID1: total=2.00GiB, used=895.69MiB
> GlobalReserve, single: total=131.14MiB,
used=0.00B
>
> # df
> /dev/sda2  64G  45G  19G  71% /
>
> However the difference is on the active root fs:
>
> -0/291  24.29GiB   9.77GiB
> +0/291  15.99GiB  76.00MiB
>
> Still, 45G used, while there is 25G of data...
>
>> Then please provide correct qgroup numbers.
>>
>> The correct numbers should be got by:
>> # btrfs quota enable
>> # btrfs quota rescan -w
>> # btrfs qgroup show -prce --sync
>
> OK, just added the --sort=excl:
>
> qgroupid       rfer       excl  max_rfer  max_excl  parent  child
> --------       ----       ----  --------  --------  ------  -----
> 0/5        16.00KiB   16.00KiB      none      none  ---     ---
> 0/361      22.57GiB    7.00MiB      none      none  ---     ---
> 0/358      22.54GiB    7.50MiB      none      none  ---     ---
> 0/343      22.36GiB    7.84MiB      none      none  ---     ---
> 0/345      22.49GiB    8.05MiB      none      none  ---     ---
> 0/357      22.50GiB    9.27MiB      none      none  ---     ---
> 0/360      22.57GiB   10.27MiB      none      none  ---     ---
> 0/344      22.48GiB   11.09MiB      none      none  ---     ---
> 0/359      22.55GiB   12.57MiB      none      none  ---     ---
> 0/362      22.59GiB   22.96MiB      none      none  ---     ---
> 0/302      12.87GiB   31.23MiB      none      none  ---     ---
> 0/428      15.96GiB   38.68MiB      none      none  ---     ---
> 0/294      11.09GiB   47.86MiB      none      none  ---     ---
> 0/336      21.80GiB   49.59MiB      none      none  ---     ---
> 0/300      12.56GiB   51.43MiB      none      none  ---     ---
> 0/342      22.31GiB   52.93MiB      none      none  ---     ---
> 0/333      21.71GiB   54.54MiB      none      none  ---     ---
> 0/363      22.63GiB   58.83MiB      none      none  ---     ---
> 0/370      23.27GiB   59.46MiB      none      none  ---     ---
> 0/305      13.01GiB   61.47MiB      none      none  ---     ---
> 0/331      21.61GiB   61.49MiB      none      none  ---     ---
> 0/334      21.78GiB   62.95MiB      none      none  ---     ---
> 0/306      13.04GiB   64.11MiB      none      none  ---     ---
> 0/304      12.96GiB   64.90MiB      none      none  ---
Re: exclusive subvolume space missing
On Sat, Dec 02, 2017 at 09:47:19 +0800, Qu Wenruo wrote:

>> Actually I should rephrase the problem:
>>
>> "snapshot has taken 8 GB of space despite nothing has altered source
>> subvolume"

Actually, after:

# btrfs balance start -v -dconvert=raid1 /
ctrl-c on block group 35G/113G
# btrfs balance start -v -dconvert=raid1,soft /
# btrfs balance start -v -dusage=55 /
Done, had to relocate 1 out of 56 chunks
# btrfs balance start -v -musage=55 /
Done, had to relocate 2 out of 55 chunks

and waiting a few minutes after ...the 8 GB I've lost yesterday is back:

# btrfs fi sh /
Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
Total devices 2 FS bytes used 44.10GiB
devid 1 size 64.00GiB used 54.00GiB path /dev/sda2
devid 2 size 64.00GiB used 54.00GiB path /dev/sdb2

# btrfs fi usage /
Overall:
    Device size:         128.00GiB
    Device allocated:    108.00GiB
    Device unallocated:   20.00GiB
    Device missing:          0.00B
    Used:                 88.19GiB
    Free (estimated):     18.75GiB  (min: 18.75GiB)
    Data ratio:               2.00
    Metadata ratio:           2.00
    Global reserve:      131.14MiB  (used: 0.00B)

Data,RAID1: Size:51.97GiB, Used:43.22GiB
   /dev/sda2  51.97GiB
   /dev/sdb2  51.97GiB

Metadata,RAID1: Size:2.00GiB, Used:895.69MiB
   /dev/sda2   2.00GiB
   /dev/sdb2   2.00GiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/sda2  32.00MiB
   /dev/sdb2  32.00MiB

Unallocated:
   /dev/sda2  10.00GiB
   /dev/sdb2  10.00GiB

# btrfs dev usage /
/dev/sda2, ID: 1
   Device size:     64.00GiB
   Device slack:       0.00B
   Data,RAID1:      51.97GiB
   Metadata,RAID1:   2.00GiB
   System,RAID1:    32.00MiB
   Unallocated:     10.00GiB

/dev/sdb2, ID: 2
   Device size:     64.00GiB
   Device slack:       0.00B
   Data,RAID1:      51.97GiB
   Metadata,RAID1:   2.00GiB
   System,RAID1:    32.00MiB
   Unallocated:     10.00GiB

# btrfs fi df /
Data, RAID1: total=51.97GiB, used=43.22GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=895.69MiB
GlobalReserve, single: total=131.14MiB, used=0.00B

# df
/dev/sda2  64G  45G  19G  71% /

However the difference is on the active root fs:

-0/291  24.29GiB   9.77GiB
+0/291  15.99GiB  76.00MiB

Still, 45G used, while there is
(if I counted this correctly) 25G of data...

> Then please provide correct qgroup numbers.
>
> The correct number should be get by:
> # btrfs quota enable
> # btrfs quota rescan -w
> # btrfs qgroup show -prce --sync

OK, just added the --sort=excl:

qgroupid         rfer         excl     max_rfer     max_excl parent  child
--------         ----         ----     --------     -------- ------  -----
0/5          16.00KiB     16.00KiB         none         none ---     ---
0/361        22.57GiB      7.00MiB         none         none ---     ---
0/358        22.54GiB      7.50MiB         none         none ---     ---
0/343        22.36GiB      7.84MiB         none         none ---     ---
0/345        22.49GiB      8.05MiB         none         none ---     ---
0/357        22.50GiB      9.27MiB         none         none ---     ---
0/360        22.57GiB     10.27MiB         none         none ---     ---
0/344        22.48GiB     11.09MiB         none         none ---     ---
0/359        22.55GiB     12.57MiB         none         none ---     ---
0/362        22.59GiB     22.96MiB         none         none ---     ---
0/302        12.87GiB     31.23MiB         none         none ---     ---
0/428        15.96GiB     38.68MiB         none         none ---     ---
0/294        11.09GiB     47.86MiB         none         none ---     ---
0/336        21.80GiB     49.59MiB         none         none ---     ---
0/300        12.56GiB     51.43MiB         none         none ---     ---
0/342        22.31GiB     52.93MiB         none         none ---     ---
0/333        21.71GiB     54.54MiB         none         none ---     ---
0/363        22.63GiB     58.83MiB         none         none ---     ---
0/370        23.27GiB     59.46MiB         none         none ---     ---
0/305        13.01GiB     61.47MiB         none         none ---     ---
0/331        21.61GiB     61.49MiB         none         none ---     ---
0/334        21.78GiB     62.95MiB         none         none ---     ---
0/306        13.04GiB     64.11MiB         none         none ---     ---
0/304        12.96GiB     64.90MiB         none         none ---     ---
0/303        12.94GiB     68.39MiB         none         none ---     ---
0/367        23.20GiB     68.52MiB         none         none ---     ---
0/366        23.22GiB     69.79MiB         none         none ---     ---
0/364        22.63GiB     72.03MiB
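[Editor's note: the "fi usage" figures in the report above are internally consistent once the RAID1 data ratio is applied. A minimal sanity-check calculation (illustrative only, hard-coding the GiB figures from that report; not a btrfs tool):

```python
# Illustrative check of the "btrfs fi usage" arithmetic above (RAID1 = 2 copies).
# All figures in GiB, copied from the report; variable names are ad hoc.
data_size, data_used = 51.97, 43.22          # Data,RAID1
meta_size, meta_used = 2.00, 895.69 / 1024   # Metadata,RAID1 (MiB -> GiB)
unallocated_per_dev, n_devices = 10.00, 2

fs_bytes_used = data_used + meta_used        # ~44.1GiB, matches "fi sh"
raw_used = 2 * fs_bytes_used                 # both RAID1 copies: ~88.19GiB
# Free (estimated) = raw unallocated / 2 + unused room inside data chunks
free_estimated = (unallocated_per_dev * n_devices) / 2 + (data_size - data_used)

print(round(fs_bytes_used, 2), round(raw_used, 2), round(free_estimated, 2))
```

So "Used: 88.19GiB" is simply twice "FS bytes used 44.10GiB", and the 18.75GiB free estimate is 10GiB of unallocated space (counted once) plus 8.75GiB of slack in the data chunks.]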
Re: exclusive subvolume space missing
On 2017年12月02日 09:43, Tomasz Pala wrote:
> On Sat, Dec 02, 2017 at 09:05:50 +0800, Qu Wenruo wrote:
>
>>> qgroupid         rfer         excl
>>>
>>> 0/260        12.25GiB      3.22GiB   from 170712 - first snapshot
>>> 0/312        17.54GiB      4.56GiB   from 170811
>>> 0/366        25.59GiB      2.44GiB   from 171028
>>> 0/370        23.27GiB     59.46MiB   from 171118 - prev snapshot
>>> 0/388        21.69GiB      7.16GiB   from 171125 - last snapshot
>>> 0/291        24.29GiB      9.77GiB   default subvolume
>>
>> You may need to manually sync the filesystem (trigger a transaction
>> commitment) to update qgroup accounting.
>
> The data I've pasted were just calculated.
>
>>> # btrfs quota enable /
>>> # btrfs qgroup show /
>>> WARNING: quota disabled, qgroup data may be out of date
>>> [...]
>>> # btrfs quota enable /    - for the second time!
>>> # btrfs qgroup show /
>>> WARNING: qgroup data inconsistent, rescan recommended
>>
>> Please wait the rescan, or any number is not correct.
>
> Here I was pointing out that the first "quota enable" resulted in a
> "quota disabled" warning until I enabled it once again.
>
>> It's highly recommended to read btrfs-quota(8) and btrfs-qgroup(8) to
>> ensure you understand all the limitation.
>
> I probably won't understand them all, but this is not an issue of my
> concern as I don't use it. There is simply no other way I am aware of
> that could show me per-subvolume stats. Well, no straightforward way,
> as the hard way I'm using (btrfs send) confirms the problem.

Unfortunately, send doesn't count everything.

The most common case is that send doesn't count extent booking space.

Try the following commands:

# fallocate -l 1G
# mkfs.btrfs -f
# mount
# btrfs subv create /subv1
# xfs_io -f -c "pwrite 0 128M" -c "sync" /subv1/file1
# xfs_io -f -c "fpunch 0 127M" -c "sync" /subv1/file1
# btrfs subv snapshot -r /subv1 /snapshot
# btrfs send /snapshot

You will only get about 1M of data, while the file still takes 128M of
space on-disk.

Btrfs extent booking will only free the whole extent if and only if there
is no inode referring to *ANY* part of the extent.
Even if only 1M of a 128M file extent is used, it still takes 128M of
space on-disk. And that's what send can't tell you. And that's also what
qgroup can tell you.

That's also why I need *CORRECT* qgroup numbers to further investigate
the problem.

>
> You could simply remove all the quota results I've posted and there will
> still be the underlying problem, that the 25 GB of data I got occupies
> 52 GB.

If you only want to know why your "25G" of data occupies 52G on disk, the
above is one of the possible explanations.

(And I think I should put it into btrfs(5), although I highly doubt
whether users will really read it.)

You could try to defrag, but I'm not sure if defrag works well in the
multi-subvolume case.

> At least one recent snapshot, that was taken after some minor (<100 MB)
> changes from the subvolume, that has undergone some minor changes since
> then, occupied 8 GB during one night when the entire system was idling.

The only possible method to fully isolate all the disturbing factors is
to get rid of snapshots. Build the subvolume from scratch (not even cp
--reflink from another subvolume), then test what's happening.

Only in that case can you trust vanilla du (if you don't do any reflink).

Although you can always trust qgroup numbers, such a subvolume built from
scratch makes the exclusive number equal the referenced one, making
debugging a little easier.

Thanks,
Qu

>
> This was crosschecked on files metadata (mtimes compared) and 'du'
> results.
>
> As a last resort I've rebalanced the disk (once again), this time with
> -dconvert=raid1 (to get rid of the single residue).
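[Editor's note: the extent-booking rule Qu describes can be captured in a toy model (illustrative only, not btrfs code): an extent's disk space is released only when no inode references any byte of it, so a surviving 1M tail keeps the whole 128M extent booked while send only sees the 1M:

```python
# Toy model of btrfs extent booking (illustrative, not real btrfs logic).
# An extent is freed only when *no* inode references *any* part of it.
MIB = 1024 * 1024

extents = {1: 128 * MIB}              # extent id -> on-disk extent size
refs = {1: [(127 * MIB, 1 * MIB)]}    # after "fpunch 0 127M": one 1M live range

def on_disk_usage(extents, refs):
    # The whole extent stays allocated while at least one reference exists.
    return sum(size for eid, size in extents.items() if refs.get(eid))

def referenced_bytes(refs):
    # What send would transfer: only the live (still-referenced) ranges.
    return sum(length for ranges in refs.values() for _, length in ranges)

print(on_disk_usage(extents, refs) // MIB)   # 128 - still fully booked
print(referenced_bytes(refs) // MIB)         # 1   - what send sees
```

Dropping the last reference (e.g. deleting the file in every subvolume) is what finally frees the full 128M.]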
Re: exclusive subvolume space missing
On 2017年12月02日 09:23, Tomasz Pala wrote:
> On Sat, Dec 02, 2017 at 08:27:56 +0800, Qu Wenruo wrote:
>
>> I assume there is a program eating up the space.
>> Not btrfs itself.
>
> Very doubtful. I've encountered an ext3 "eating" problem once, that
> couldn't be found by lsof on a 3.4.75 kernel, but the space was
> returning after killing Xorg. The system I'm having the problem on now
> is very recent, the space doesn't return after reboot/emergency and
> doesn't sum up with files.

Unlike vanilla df or "fi usage" or "fi df", btrfs quota only counts
on-disk extents. That's to say, reserved space won't contribute to qgroup
numbers.

Unless one is using an anonymous file, which is opened but unlinked, so
no one can access it except the owner. (Which I doubt is your case.)

Which should make quota the best tool to debug your problem.
(As long as you follow the various limitations of btrfs quota, especially
that you need to sync or use the --sync option to show qgroup numbers.)

>
>>> Now, the weird part for me is exclusive data count:
>>>
>>> # btrfs sub sh ./snapshot-171125
>>> [...]
>>> Subvolume ID: 388
>>> # btrfs fi du -s ./snapshot-171125
>>>      Total   Exclusive  Set shared  Filename
>>>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
>>
>> That's the difference between how sub show and quota works.
>>
>> For quota, it's a per-root owner check.
>
> Just to be clear: I've enabled quota _only_ to see subvolume usage on
> spot. And exclusive data - the more detailed approach I've described in
> the e-mail I sent a minute ago.
>
>> Means even if a file extent is shared between different inodes, if all
>> inodes are inside the same subvolume, it's counted as exclusive.
>> And if any of the file extents belongs to another subvolume, then it's
>> counted as shared.
>
> Good to know, but this is an almost UID0-only system. There are system
> users (vendor provided) and 2 ssh accounts for su, but nobody uses this
> machine for daily work. The quota values were the last tool I could find
> to debug.
>
>> For fi du, it's a per-inode owner check. (The exact behavior is a
>> little more complex, I'll skip such corner cases to make it a little
>> easier to understand.)
>>
>> That's to say, if one file extent is shared by different inodes, then
>> it's counted as shared, no matter if these inodes belong to different
>> or the same subvolume.
>>
>> That's to say, "fi du" has a looser condition for "shared" calculation,
>> and that should explain why you have 20+G shared.
>
> There shouldn't be many multi-inode extents inside a single subvolume,
> as this is a mostly fresh system, with no containers, no deduplication;
> snapshots are taken from the same running system before or after some
> more important change is done. By 'change' I mean altering text config
> files mostly (plus etckeeper's git metadata), so the volume of
> difference is extremely low. Actually most of the diffs between
> subvolumes come from updating distro packages. There were not many
> reflink copies made on this partition, only one kernel source compiled
> (.ccache files removed today). So this partition is as clean as it could
> be after almost 5 months in use.
>
> Actually I should rephrase the problem:
>
> "snapshot has taken 8 GB of space despite nothing has altered source
> subvolume"

Then please provide correct qgroup numbers.

The correct numbers should be obtained by:

# btrfs quota enable
# btrfs quota rescan -w
# btrfs qgroup show -prce --sync

Rescan and --sync are important to get the correct numbers.
(While rescan can take a long, long time to finish.)

And furthermore, please ensure that all deleted files are really deleted.
Btrfs delays file and subvolume deletion, so you may need to sync several
times or use "btrfs subv sync" to ensure deleted files are deleted.
(Vanilla du won't tell you if such delayed file deletion is really done.)

Thanks,
Qu
Re: exclusive subvolume space missing
On Sat, Dec 02, 2017 at 09:05:50 +0800, Qu Wenruo wrote:

>> qgroupid         rfer         excl
>>
>> 0/260        12.25GiB      3.22GiB   from 170712 - first snapshot
>> 0/312        17.54GiB      4.56GiB   from 170811
>> 0/366        25.59GiB      2.44GiB   from 171028
>> 0/370        23.27GiB     59.46MiB   from 171118 - prev snapshot
>> 0/388        21.69GiB      7.16GiB   from 171125 - last snapshot
>> 0/291        24.29GiB      9.77GiB   default subvolume
>
> You may need to manually sync the filesystem (trigger a transaction
> commitment) to update qgroup accounting.

The data I've pasted were just calculated.

>> # btrfs quota enable /
>> # btrfs qgroup show /
>> WARNING: quota disabled, qgroup data may be out of date
>> [...]
>> # btrfs quota enable /    - for the second time!
>> # btrfs qgroup show /
>> WARNING: qgroup data inconsistent, rescan recommended
>
> Please wait the rescan, or any number is not correct.

Here I was pointing out that the first "quota enable" resulted in a
"quota disabled" warning until I enabled it once again.

> It's highly recommended to read btrfs-quota(8) and btrfs-qgroup(8) to
> ensure you understand all the limitation.

I probably won't understand them all, but this is not an issue of my
concern as I don't use it. There is simply no other way I am aware of
that could show me per-subvolume stats. Well, no straightforward way, as
the hard way I'm using (btrfs send) confirms the problem.

You could simply remove all the quota results I've posted and there will
still be the underlying problem, that the 25 GB of data I got occupies
52 GB.

At least one recent snapshot, that was taken after some minor (<100 MB)
changes from the subvolume, that has undergone some minor changes since
then, occupied 8 GB during one night when the entire system was idling.

This was crosschecked on files metadata (mtimes compared) and 'du'
results.

As a last resort I've rebalanced the disk (once again), this time with
-dconvert=raid1 (to get rid of the single residue).
--
Tomasz Pala
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: exclusive subvolume space missing
On Sat, Dec 02, 2017 at 08:27:56 +0800, Qu Wenruo wrote:

> I assume there is a program eating up the space.
> Not btrfs itself.

Very doubtful. I've encountered an ext3 "eating" problem once, that
couldn't be found by lsof on a 3.4.75 kernel, but the space was returning
after killing Xorg. The system I'm having the problem on now is very
recent, the space doesn't return after reboot/emergency and doesn't sum
up with files.

>> Now, the weird part for me is exclusive data count:
>>
>> # btrfs sub sh ./snapshot-171125
>> [...]
>> Subvolume ID: 388
>> # btrfs fi du -s ./snapshot-171125
>>      Total   Exclusive  Set shared  Filename
>>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
>
> That's the difference between how sub show and quota works.
>
> For quota, it's a per-root owner check.

Just to be clear: I've enabled quota _only_ to see subvolume usage on
spot. And exclusive data - the more detailed approach I've described in
the e-mail I sent a minute ago.

> Means even if a file extent is shared between different inodes, if all
> inodes are inside the same subvolume, it's counted as exclusive.
> And if any of the file extents belongs to another subvolume, then it's
> counted as shared.

Good to know, but this is an almost UID0-only system. There are system
users (vendor provided) and 2 ssh accounts for su, but nobody uses this
machine for daily work. The quota values were the last tool I could find
to debug.

> For fi du, it's a per-inode owner check. (The exact behavior is a little
> more complex, I'll skip such corner cases to make it a little easier to
> understand.)
>
> That's to say, if one file extent is shared by different inodes, then
> it's counted as shared, no matter if these inodes belong to different or
> the same subvolume.
>
> That's to say, "fi du" has a looser condition for "shared" calculation,
> and that should explain why you have 20+G shared.
There shouldn't be many multi-inode extents inside a single subvolume, as
this is a mostly fresh system, with no containers, no deduplication;
snapshots are taken from the same running system before or after some
more important change is done. By 'change' I mean altering text config
files mostly (plus etckeeper's git metadata), so the volume of difference
is extremely low. Actually most of the diffs between subvolumes come from
updating distro packages. There were not many reflink copies made on this
partition, only one kernel source compiled (.ccache files removed today).
So this partition is as clean as it could be after almost 5 months in
use.

Actually I should rephrase the problem:

"snapshot has taken 8 GB of space despite nothing has altered source
subvolume"

--
Tomasz Pala
Re: exclusive subvolume space missing
>>> Now, the weird part for me is exclusive data count:
>>>
>>> # btrfs sub sh ./snapshot-171125
>>> [...]
>>> Subvolume ID: 388
>>> # btrfs fi du -s ./snapshot-171125
>>>      Total   Exclusive  Set shared  Filename
>>>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
>>>
>>> How is that possible? This doesn't even remotely relate to the
>>> 7.15 GiB from qgroup. ~The same amount differs in total:
>>> 28.75-21.50=7.25 GiB. And the same happens with other snapshots, much
>>> more exclusive data shown in qgroup than actually found in files. So
>>> if not files, where is that space wasted? Metadata?
>>
>> Personally, I'd trust qgroups' output about as far as I could spit
>> Belgium(*).
>
> Well, there is something wrong here, as after removing the .ccache
> directories inside all the snapshots the 'excl' values decreased
> ...except for the last snapshot (the list below is short by ~40
> snapshots that have 2 GB excl in total):
>
> qgroupid         rfer         excl
>
> 0/260        12.25GiB      3.22GiB   from 170712 - first snapshot
> 0/312        17.54GiB      4.56GiB   from 170811
> 0/366        25.59GiB      2.44GiB   from 171028
> 0/370        23.27GiB     59.46MiB   from 171118 - prev snapshot
> 0/388        21.69GiB      7.16GiB   from 171125 - last snapshot
> 0/291        24.29GiB      9.77GiB   default subvolume

You may need to manually sync the filesystem (trigger a transaction
commitment) to update qgroup accounting.

>
> [~/test/snapshot-171125]# du -sh .
> 15G     .
>
> After changing back to ro I tested how much data really has changed
> between the previous and last snapshot:
>
> [~/test]# btrfs send -p snapshot-171118 snapshot-171125 | pv > /dev/null
> At subvol snapshot-171125
> 74.2MiB 0:00:32 [2.28MiB/s]
>
> This means there can't be 7 GiB of exclusive data in the last snapshot.

Mentioned before, sync the fs first before checking the qgroup numbers.
Or use the --sync option along with qgroup show.
>
> Well, even btrfs send -p snapshot-170712 snapshot-171125 | pv > /dev/null
> 5.68GiB 0:03:23 [28.6MiB/s]
>
> I've created a new snapshot right now to compare it with 171125:
> 75.5MiB 0:00:43 [1.73MiB/s]
>
> OK, I could even compare all the snapshots in sequence:
>
> # for i in snapshot-17*; btrfs prop set $i ro true
> # p=''; for i in snapshot-17*; do [ -n "$p" ] && btrfs send -p "$p" "$i" | pv > /dev/null; p="$i"; done
> 1.7GiB 0:00:15 [ 114MiB/s]
> 1.03GiB 0:00:38 [27.2MiB/s]
> 155MiB 0:00:08 [19.1MiB/s]
> 1.08GiB 0:00:47 [23.3MiB/s]
> 294MiB 0:00:29 [ 9.9MiB/s]
> 324MiB 0:00:42 [7.69MiB/s]
> 82.8MiB 0:00:06 [12.7MiB/s]
> 64.3MiB 0:00:05 [11.6MiB/s]
> 137MiB 0:00:07 [19.3MiB/s]
> 85.3MiB 0:00:13 [6.18MiB/s]
> 62.8MiB 0:00:19 [3.21MiB/s]
> 132MiB 0:00:42 [3.15MiB/s]
> 102MiB 0:00:42 [2.42MiB/s]
> 197MiB 0:00:50 [3.91MiB/s]
> 321MiB 0:01:01 [5.21MiB/s]
> 229MiB 0:00:18 [12.3MiB/s]
> 109MiB 0:00:11 [ 9.7MiB/s]
> 139MiB 0:00:14 [9.32MiB/s]
> 573MiB 0:00:35 [15.9MiB/s]
> 64.1MiB 0:00:30 [2.11MiB/s]
> 172MiB 0:00:11 [14.9MiB/s]
> 98.9MiB 0:00:07 [14.1MiB/s]
> 54MiB 0:00:08 [6.17MiB/s]
> 78.6MiB 0:00:02 [32.1MiB/s]
> 15.1MiB 0:00:01 [12.5MiB/s]
> 20.6MiB 0:00:00 [ 23MiB/s]
> 20.3MiB 0:00:00 [ 23MiB/s]
> 110MiB 0:00:14 [7.39MiB/s]
> 62.6MiB 0:00:11 [5.67MiB/s]
> 65.7MiB 0:00:08 [7.58MiB/s]
> 731MiB 0:00:42 [ 17MiB/s]
> 73.7MiB 0:00:29 [ 2.5MiB/s]
> 322MiB 0:00:53 [6.04MiB/s]
> 105MiB 0:00:35 [2.95MiB/s]
> 95.2MiB 0:00:36 [2.58MiB/s]
> 74.2MiB 0:00:30 [2.43MiB/s]
> 75.5MiB 0:00:46 [1.61MiB/s]
>
> This is 9.3 GB of total diffs between all the snapshots I got.
> Plus 15 GB of initial snapshot means there is about 25 GB used,
> while df reports twice the amount, way too much for overhead:
> /dev/sda2        64G   52G   11G  84% /
>
> # btrfs quota enable /
> # btrfs qgroup show /
> WARNING: quota disabled, qgroup data may be out of date
> [...]
> # btrfs quota enable /    - for the second time!
> # btrfs qgroup show /
> WARNING: qgroup data inconsistent, rescan recommended

Please wait for the rescan, or any number is not correct.
(Although it will only be less than the actual occupied space.)

It's highly recommended to read btrfs-quota(8) and btrfs-qgroup(8) to
ensure you understand all the limitations.

> [...]
> 0/428        15.96GiB     19.23MiB   newly created (now) snapshot
>
> Assuming the qgroups output is bogus and the space isn't physically
> occupied (which is coherent with btrfs fi du output and my expectation)
> the question remains: why is that bogus-excl removed from available
> space as reported by df or btrfs fi df/usage? And how to reclaim it?

Already explained the difference in another thread.

Thanks,
Qu

>
> [~/test]# btrfs device usage /
> /dev/sda2, ID: 1
>    Device size:     64.00GiB
>    Device slack:       0.00B
>    Data,single:      1.07GiB
>    Data,RAID1:      55.97GiB
>    Metadata,RAID1:   2.00GiB
>
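[Editor's note: the per-snapshot diff sizes in the quoted list can be totalled mechanically. A small sketch of the arithmetic (illustrative; `parse_size` and `total_gib` are ad-hoc helpers for pv's human-readable units, not part of any btrfs tool):

```python
# Sum pv-style transfer sizes such as "1.7GiB" or "155MiB" (illustrative).
UNITS = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}

def parse_size(s):
    # Convert a human-readable size string into bytes.
    for suffix, factor in UNITS.items():
        if s.endswith(suffix):
            return float(s[: -len(suffix)]) * factor
    return float(s)  # plain bytes

def total_gib(sizes):
    # Total of a list of size strings, expressed in GiB.
    return sum(parse_size(s) for s in sizes) / 1024**3

# First few entries from the quoted list above:
print(round(total_gib(["1.7GiB", "1.03GiB", "155MiB", "1.08GiB"]), 2))  # 3.96
```

Run over the full list this is how the "9.3 GB of total diffs" figure is reached.]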
Re: exclusive subvolume space missing
On Fri, Dec 01, 2017 at 21:36:14 +0000, Hugo Mills wrote:

>    The thing I'd first go looking for here is some rogue process
> writing lots of data. I've had something like this happen to me
> before, a few times. First, I'd look for large files with "du -ms /* |
> sort -n", then work down into the tree until you find them.

I already did a handful of searches (mounting the parent node in a
separate directory and diving into the default working subvolume in order
to unhide possible things covered by any other mounts on top of the
actual root fs). This is how it looks:

[~/test/@]# du -sh .
15G     .

>    If that doesn't show up anything unusually large, then lsof to look
> for open but deleted files (orphans) which are still being written to
> by some process.

No (deleted) files; the only activity in iotop are internals...

  174 be/4 root       15.64 K/s    3.67 M/s  0.00 %  5.88 % [btrfs-transacti]
 1439 be/4 root        0.00 B/s 1173.22 K/s  0.00 %  0.00 % [kworker/u8:8]

Only systemd-journald is writing, but /var/log is mounted on a separate
ext3 partition (with journald restarted after the mount); this is also
confirmed by looking into the separate mount. Anyway it can't be
opened-but-deleted files, as the usage doesn't change after booting into
emergency.

The worst thing is that the 8 GB was lost during the night, when nothing
except for a stats collector was running. As already said, this is not
the classical "Linux eats my HDD" problem.

>    This is very likely _not_ to be a btrfs problem, but instead some
> runaway process writing lots of crap very fast. Log files are probably
> the most plausible location, but not the only one.

That would be visible in iostat or /proc/diskstats - it isn't. The free
space disappears without being physically written, which means it is some
allocation problem.

I also created a list of files modified between the snapshots with:

find test/@ -xdev -newer some_reference_file_inside_snapshot

and there is nothing bigger than a few MBs.
I've changed the snapshots to rw and removed some data from all the
instances: 4.8 GB in two ISO images and a 5 GB-limited .ccache directory.
After this I got 11 GB freed, so the numbers are fine.

# btrfs fi usage /
Overall:
    Device size:                 128.00GiB
    Device allocated:            117.19GiB
    Device unallocated:           10.81GiB
    Device missing:                  0.00B
    Used:                        103.56GiB
    Free (estimated):             11.19GiB      (min: 11.14GiB)
    Data ratio:                       1.98
    Metadata ratio:                   2.00
    Global reserve:              146.08MiB      (used: 0.00B)

Data,single: Size:1.19GiB, Used:1.18GiB
   /dev/sda2    1.07GiB
   /dev/sdb2  132.00MiB

Data,RAID1: Size:55.97GiB, Used:50.30GiB
   /dev/sda2   55.97GiB
   /dev/sdb2   55.97GiB

Metadata,RAID1: Size:2.00GiB, Used:908.61MiB
   /dev/sda2    2.00GiB
   /dev/sdb2    2.00GiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/sda2   32.00MiB
   /dev/sdb2   32.00MiB

Unallocated:
   /dev/sda2    4.93GiB
   /dev/sdb2    5.87GiB

>> Now, the weird part for me is exclusive data count:
>>
>> # btrfs sub sh ./snapshot-171125
>> [...]
>> Subvolume ID: 388
>> # btrfs fi du -s ./snapshot-171125
>>      Total   Exclusive  Set shared  Filename
>>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
>>
>> How is that possible? This doesn't even remotely relate to the
>> 7.15 GiB from qgroup. ~The same amount differs in total:
>> 28.75-21.50=7.25 GiB. And the same happens with other snapshots, much
>> more exclusive data shown in qgroup than actually found in files. So
>> if not files, where is that space wasted? Metadata?
>
>    Personally, I'd trust qgroups' output about as far as I could spit
> Belgium(*).
Well, there is something wrong here, as after removing the .ccache
directories inside all the snapshots the 'excl' values decreased
...except for the last snapshot (the list below is short by ~40 snapshots
that have 2 GB excl in total):

qgroupid         rfer         excl

0/260        12.25GiB      3.22GiB   from 170712 - first snapshot
0/312        17.54GiB      4.56GiB   from 170811
0/366        25.59GiB      2.44GiB   from 171028
0/370        23.27GiB     59.46MiB   from 171118 - prev snapshot
0/388        21.69GiB      7.16GiB   from 171125 - last snapshot
0/291        24.29GiB      9.77GiB   default subvolume

[~/test/snapshot-171125]# du -sh .
15G     .

After changing back to ro I tested how much data really has changed
between the previous and last snapshot:

[~/test]# btrfs send -p snapshot-171118 snapshot-171125 | pv > /dev/null
At subvol snapshot-171125
74.2MiB 0:00:32 [2.28MiB/s]

This means there can't be 7 GiB of exclusive data in the last snapshot.

Well, even btrfs send -p snapshot-170712 snapshot-171125 | pv > /dev/null
5.68GiB 0:03:23 [28.6MiB/s]

I've created a new snapshot right now to
Re: exclusive subvolume space missing
On 2017年12月02日 00:15, Tomasz Pala wrote:
> Hello,
>
> I got a problem with btrfs running out of space (not THE
> Internet-wide, well known issues with interpretation).
>
> The problem is: something eats the space while not running anything that
> justifies this. There were 18 GB free space available, suddenly it
> dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
> with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
> right now, just as I'm writing this e-mail:
>
> /dev/sda2        64G   63G  452M 100% /
> /dev/sda2        64G   63G  365M 100% /
> /dev/sda2        64G   63G  316M 100% /
> /dev/sda2        64G   63G  287M 100% /
> /dev/sda2        64G   63G  268M 100% /
> /dev/sda2        64G   63G  239M 100% /
> /dev/sda2        64G   63G  230M 100% /
> /dev/sda2        64G   63G  182M 100% /
> /dev/sda2        64G   63G  163M 100% /
> /dev/sda2        64G   64G  153M 100% /
> /dev/sda2        64G   64G  143M 100% /
> /dev/sda2        64G   64G   96M 100% /
> /dev/sda2        64G   64G   88M 100% /
> /dev/sda2        64G   64G   57M 100% /
> /dev/sda2        64G   64G   25M 100% /
>
> while my rough calculations show, that there should be at least 10 GB of
> free space. After enabling quotas it is somehow confirmed:
>
> # btrfs qgroup sh --sort=excl /
> qgroupid         rfer         excl
>
> 0/5          16.00KiB     16.00KiB
> [30 snapshots with about 100 MiB excl]
> 0/333        24.53GiB    305.79MiB
> 0/298        13.44GiB    312.74MiB
> 0/327        23.79GiB    427.13MiB
> 0/331        23.93GiB    930.51MiB
> 0/260        12.25GiB      3.22GiB
> 0/312        19.70GiB      4.56GiB
> 0/388        28.75GiB      7.15GiB
> 0/291        30.60GiB      9.01GiB   <- this is the running one
>
> This is about 30 GB total excl (didn't find a switch to sum this up). I
> know I can't just add 'excl' to get usage, so tried to pinpoint the
> exact files that occupy space in 0/388 exclusively (this is the last
> snapshot taken; all of the snapshots are created from the running fs).

I assume there is a program eating up the space.
Not btrfs itself.

>
> Now, the weird part for me is exclusive data count:
>
> # btrfs sub sh ./snapshot-171125
> [...]
> Subvolume ID: 388
> # btrfs fi du -s ./snapshot-171125
>      Total   Exclusive  Set shared  Filename
>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125

That's the difference between how sub show and quota works.

For quota, it's a per-root owner check.

Means even if a file extent is shared between different inodes, if all
inodes are inside the same subvolume, it's counted as exclusive.
And if any of the file extents belongs to another subvolume, then it's
counted as shared.

For fi du, it's a per-inode owner check. (The exact behavior is a little
more complex, I'll skip such corner cases to make it a little easier to
understand.)

That's to say, if one file extent is shared by different inodes, then
it's counted as shared, no matter if these inodes belong to different or
the same subvolume.

That's to say, "fi du" has a looser condition for "shared" calculation,
and that should explain why you have 20+G shared.

Thanks,
Qu

>
> How is that possible? This doesn't even remotely relate to the 7.15 GiB
> from qgroup. ~The same amount differs in total: 28.75-21.50=7.25 GiB.
> And the same happens with other snapshots, much more exclusive data
> shown in qgroup than actually found in files. So if not files, where
> is that space wasted? Metadata?
>
> btrfs-progs-4.12 running on Linux 4.9.46.
>
> best regards,
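[Editor's note: Qu's two ownership rules can be contrasted with a toy model (illustrative only, not btrfs code). qgroup asks whether the set of subvolumes referencing an extent is exactly the one being measured; fi du asks whether more than one inode references it:

```python
# Toy model contrasting qgroup vs "fi du" exclusive accounting.
# Each extent: (size, list of (subvolume, inode) references).
extents = [
    (100, [("subv1", "a"), ("subv1", "b")]),   # two inodes, same subvolume
    (200, [("subv1", "c"), ("snap1", "c2")]),  # shared across subvolumes
    (50,  [("subv1", "d")]),                   # plain single-reference file
]

def qgroup_exclusive(extents, subvol):
    # qgroup: exclusive if *all* referencing inodes live in this subvolume.
    return sum(size for size, refs in extents
               if {s for s, _ in refs} == {subvol})

def fi_du_exclusive(extents, subvol):
    # fi du: exclusive only if a *single* inode references the extent.
    return sum(size for size, refs in extents
               if len(refs) == 1 and refs[0][0] == subvol)

print(qgroup_exclusive(extents, "subv1"))  # 150: multi-inode extent counts
print(fi_du_exclusive(extents, "subv1"))   # 50:  multi-inode extent is shared
```

The 100-unit extent is the interesting case: reflinked inside one subvolume, qgroup calls it exclusive while fi du calls it shared, which is exactly the 7+ GiB discrepancy discussed in this thread.]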
Re: exclusive subvolume space missing
On Fri, Dec 01, 2017 at 05:15:55PM +0100, Tomasz Pala wrote:
> Hello,
>
> I got a problem with btrfs running out of space (not THE
> Internet-wide, well known issues with interpretation).
>
> The problem is: something eats the space while not running anything that
> justifies this. There were 18 GB free space available, suddenly it
> dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
> with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
> right now, just as I'm writing this e-mail:
>
> /dev/sda2        64G   63G  452M 100% /
> /dev/sda2        64G   63G  365M 100% /
> /dev/sda2        64G   63G  316M 100% /
> /dev/sda2        64G   63G  287M 100% /
> /dev/sda2        64G   63G  268M 100% /
> /dev/sda2        64G   63G  239M 100% /
> /dev/sda2        64G   63G  230M 100% /
> /dev/sda2        64G   63G  182M 100% /
> /dev/sda2        64G   63G  163M 100% /
> /dev/sda2        64G   64G  153M 100% /
> /dev/sda2        64G   64G  143M 100% /
> /dev/sda2        64G   64G   96M 100% /
> /dev/sda2        64G   64G   88M 100% /
> /dev/sda2        64G   64G   57M 100% /
> /dev/sda2        64G   64G   25M 100% /
>
> while my rough calculations show, that there should be at least 10 GB of
> free space. After enabling quotas it is somehow confirmed:
>
> # btrfs qgroup sh --sort=excl /
> qgroupid         rfer         excl
>
> 0/5          16.00KiB     16.00KiB
> [30 snapshots with about 100 MiB excl]
> 0/333        24.53GiB    305.79MiB
> 0/298        13.44GiB    312.74MiB
> 0/327        23.79GiB    427.13MiB
> 0/331        23.93GiB    930.51MiB
> 0/260        12.25GiB      3.22GiB
> 0/312        19.70GiB      4.56GiB
> 0/388        28.75GiB      7.15GiB
> 0/291        30.60GiB      9.01GiB   <- this is the running one
>
> This is about 30 GB total excl (didn't find a switch to sum this up). I
> know I can't just add 'excl' to get usage, so tried to pinpoint the
> exact files that occupy space in 0/388 exclusively (this is the last
> snapshot taken; all of the snapshots are created from the running fs).

   The thing I'd first go looking for here is some rogue process
writing lots of data. I've had something like this happen to me
before, a few times.
First, I'd look for large files with "du -ms /* | sort -n", then work
down into the tree until you find them.

   If that doesn't show up anything unusually large, then lsof to look
for open but deleted files (orphans) which are still being written to
by some process.

   This is very likely _not_ to be a btrfs problem, but instead some
runaway process writing lots of crap very fast. Log files are probably
the most plausible location, but not the only one.

> Now, the weird part for me is exclusive data count:
>
> # btrfs sub sh ./snapshot-171125
> [...]
> Subvolume ID: 388
> # btrfs fi du -s ./snapshot-171125
>      Total   Exclusive  Set shared  Filename
>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
>
> How is that possible? This doesn't even remotely relate to the 7.15 GiB
> from qgroup. ~The same amount differs in total: 28.75-21.50=7.25 GiB.
> And the same happens with other snapshots, much more exclusive data
> shown in qgroup than actually found in files. So if not files, where
> is that space wasted? Metadata?

   Personally, I'd trust qgroups' output about as far as I could spit
Belgium(*).

   Hugo.

(*) No offence intended to Belgium.

--
Hugo Mills             | I used to live in hope, but I got evicted.
hugo@... carfax.org.uk |
http://carfax.org.uk/  | PGP: E2AB1DE4 |
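[Editor's note: Hugo's first step, "du -ms /* | sort -n" and then descending, is easy to script. A rough Python analogue (illustrative; `du_sorted` is an ad-hoc helper) that ranks top-level entries by apparent size:

```python
# Rough Python analogue of `du -ms <root>/* | sort -n` (illustrative only).
# Note: this sums apparent file sizes, so reflinked/shared extents are
# counted once per file - which is exactly why du alone could not explain
# the missing space in this thread.
import os

def du_sorted(root):
    totals = []
    for entry in os.scandir(root):
        size = 0
        if entry.is_file(follow_symlinks=False):
            size = entry.stat(follow_symlinks=False).st_size
        elif entry.is_dir(follow_symlinks=False):
            for dirpath, _, files in os.walk(entry.path):
                for name in files:
                    path = os.path.join(dirpath, name)
                    if not os.path.islink(path):
                        size += os.path.getsize(path)
        totals.append((size, entry.name))
    return sorted(totals)  # smallest first, like `sort -n`

# Example: du_sorted("/") would list top-level trees by apparent size.
```
]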
Re: exclusive subvolume space missing
Tomasz Pala posted on Fri, 01 Dec 2017 17:15:55 +0100 as excerpted:

> Hello,
>
> I got a problem with btrfs running out of space (not THE
> Internet-wide, well known issues with interpretation).
>
> The problem is: something eats the space while not running anything that
> justifies this. There were 18 GB free space available, suddenly it
> dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
> with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
> right now, just as I'm writing this e-mail:
>
> /dev/sda2        64G   63G  452M 100% /
> /dev/sda2        64G   63G  365M 100% /
> /dev/sda2        64G   63G  316M 100% /
> /dev/sda2        64G   63G  287M 100% /
> /dev/sda2        64G   63G  268M 100% /
> /dev/sda2        64G   63G  239M 100% /
> /dev/sda2        64G   63G  230M 100% /
> /dev/sda2        64G   63G  182M 100% /
> /dev/sda2        64G   63G  163M 100% /
> /dev/sda2        64G   64G  153M 100% /
> /dev/sda2        64G   64G  143M 100% /
> /dev/sda2        64G   64G   96M 100% /
> /dev/sda2        64G   64G   88M 100% /
> /dev/sda2        64G   64G   57M 100% /
> /dev/sda2        64G   64G   25M 100% /

Scary.

> while my rough calculations show, that there should be at least 10 GB of
> free space. After enabling quotas it is somehow confirmed:

I don't use quotas so won't claim working knowledge or an explanation of
that side of things, however...

> btrfs-progs-4.12 running on Linux 4.9.46.

Until quite recently btrfs quotas were too buggy to recommend for use.
While the known blocker-level bugs are now fixed, scaling and real-world
performance are still an issue, and AFAIK the fixes didn't make 4.9 and
may not be backported, as the feature was simply known to be broken
beyond reliable usability at that point.

Based on comments in other threads here, I /think/ the critical quota
fixes hit 4.10, but of course, not being an LTS, 4.10 is long out of
support. I'd suggest either turning off and forgetting about quotas,
since it doesn't appear you actually need them, or upgrading to at least
4.13 and keeping current, or the LTS 4.14 if you want to stay on the same
kernel series for a while.
As for the scaling and performance issues, during normal/generic
filesystem use things are generally fine; it's the various btrfs
maintenance commands such as balance, snapshot deletion, and btrfs check
that have the scaling issues, and they have /some/ scaling issues even
without quotas; it's just that quotas make the problem *much* worse.

One workaround for balance and snapshot deletion is to temporarily
disable quotas while the job is running, then reenable (and rescan if
necessary; as I don't use the feature here, I'm not sure whether it is).
That can literally turn a job that was looking to take /weeks/ due to the
scaling issue into a job of hours.

Unfortunately, the sorts of conditions that would trigger running a btrfs
check don't lend themselves to the same sort of workaround, so not having
quotas on at all is the only workaround there.

As to your space-being-eaten problem, the output of btrfs filesystem
usage (and perhaps btrfs device usage if it's a multi-device btrfs) could
be really helpful here, much more so than quota reports if it's a btrfs
issue, or to help eliminate btrfs as the problem if it's not.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
exclusive subvolume space missing
Hello,

I got a problem with btrfs running out of space (not THE Internet-wide,
well known issues with interpretation).

The problem is: something eats the space while not running anything that
justifies this. There were 18 GB of free space available; suddenly it
dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
right now, just as I'm writing this e-mail:

/dev/sda2   64G  63G  452M 100% /
/dev/sda2   64G  63G  365M 100% /
/dev/sda2   64G  63G  316M 100% /
/dev/sda2   64G  63G  287M 100% /
/dev/sda2   64G  63G  268M 100% /
/dev/sda2   64G  63G  239M 100% /
/dev/sda2   64G  63G  230M 100% /
/dev/sda2   64G  63G  182M 100% /
/dev/sda2   64G  63G  163M 100% /
/dev/sda2   64G  64G  153M 100% /
/dev/sda2   64G  64G  143M 100% /
/dev/sda2   64G  64G   96M 100% /
/dev/sda2   64G  64G   88M 100% /
/dev/sda2   64G  64G   57M 100% /
/dev/sda2   64G  64G   25M 100% /

while my rough calculations show that there should be at least 10 GB of
free space. After enabling quotas it is somehow confirmed:

# btrfs qgroup sh --sort=excl /
qgroupid         rfer         excl
--------         ----         ----
0/5          16.00KiB     16.00KiB
[30 snapshots with about 100 MiB excl each]
0/333        24.53GiB    305.79MiB
0/298        13.44GiB    312.74MiB
0/327        23.79GiB    427.13MiB
0/331        23.93GiB    930.51MiB
0/260        12.25GiB      3.22GiB
0/312        19.70GiB      4.56GiB
0/388        28.75GiB      7.15GiB
0/291        30.60GiB      9.01GiB  <- this is the running one

This is about 30 GB total excl (didn't find a switch to sum this up). I
know I can't just add 'excl' to get usage, so I tried to pinpoint the
exact files that occupy space in 0/388 exclusively (this is the last
snapshot taken; all of the snapshots are created from the running fs).

Now, the weird part for me is the exclusive data count:

# btrfs sub sh ./snapshot-171125
[...]
        Subvolume ID:           388

# btrfs fi du -s ./snapshot-171125
     Total   Exclusive  Set shared  Filename
  21.50GiB    63.35MiB    20.77GiB  snapshot-171125

How is that possible? This doesn't even remotely relate to the 7.15 GiB
from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
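[Editorial aside: there is indeed no built-in switch to total the excl
column, but a short awk filter over the qgroup output can do it. A
sketch, assuming sizes are printed with KiB/MiB/GiB suffixes as in the
report above:]

```shell
# Sum the "excl" column (3rd field) of `btrfs qgroup show` output,
# normalising KiB/MiB/GiB suffixes to GiB.
sum_excl() {
  awk '
    NR > 2 {                          # skip the two header lines
      v = $3 + 0                      # numeric part of the excl field
      if ($3 ~ /KiB/) v /= 1048576
      else if ($3 ~ /MiB/) v /= 1024
      total += v
    }
    END { printf "%.2f GiB\n", total }'
}
# Usage: btrfs qgroup show --sort=excl / | sum_excl
```

As the post goes on to note, this total cannot simply be read as "space
used": excl only counts extents referenced by exactly one qgroup.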
And the same happens with other snapshots: much more exclusive data is
shown in qgroup than is actually found in the files. So if not in files,
where is that space wasted? Metadata?

btrfs-progs-4.12 running on Linux 4.9.46.

best regards,
-- 
Tomasz Pala