Re: exclusive subvolume space missing
Tomasz Pala posted on Sat, 02 Dec 2017 18:18:19 +0100 as excerpted:

> On Sat, Dec 02, 2017 at 17:28:12 +0100, Tomasz Pala wrote:
>
>>> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
>> [...]
>>> Now make various small changes to the file, say under 16 KiB each. These
>>> will each be COWed elsewhere as one might expect, by default 16 KiB at
>>> a time I believe (might be 4 KiB, as it was back when the default leaf
>>
>> I got ~500 small files (100-500 kB) updated partially in regular
>> intervals:
>>
>> # du -Lc **/*.rrd | tail -n1
>> 105M    total

FWIW, I've no idea what rrd files, or rrdcached (from the grandparent
post), are (other than that a quick google suggests it's a...
round-robin database... and the database bit alone sounds bad in this
context, as database-file rewrites are a known worst case for COW-based
filesystems), but it sounds like you suspect that they have this
rewrite-most pattern that could explain your problem...

>>> But here's the kicker. Even without a snapshot locking that original 100
>>> MiB extent in place, if even one of the original 16 KiB blocks isn't
>>> rewritten, that entire 100 MiB extent will remain locked in place, as the
>>> original 16 KiB blocks that have been changed and thus COWed elsewhere
>>> aren't freed one at a time. The full 100 MiB extent only gets freed, all
>>> at once, once no references to it remain, which means once that last
>>> block of the extent gets rewritten.
>
> OTOH - should this happen with nodatacow files? As I mentioned before,
> these files are chattred +C (however this was not their initial state
> due to https://bugzilla.kernel.org/show_bug.cgi?id=189671 ).
> Am I wrong thinking that in such a case they should occupy twice their
> size maximum? Or maybe there is some tool that could show me the real
> space wasted by a file, including extent counts etc.?

Nodatacow... isn't as simple as the name might suggest.
For one thing, snapshots depend on COW and lock the extents they
reference in place, so while a file might be set nocow and that setting
is retained, the first write to a block after a snapshot *MUST* COW that
block, because the snapshot has the existing version referenced and it
can't change without changing the snapshot as well, which would of
course defeat the purpose of snapshots. Tho the attribute is retained,
and further writes to the same already-COWed block won't COW it again.
FWIW, on this list that behavior is often referred to as cow1: COW only
the first time a block is written after a snapshot locks the previous
version in place.

The effect of cow1 depends on the frequency and extent of block rewrites
vs. the frequency of snapshots of the subvolume they're on. As should be
obvious if you think about it, once you've done the cow1, further
rewrites to the same block before further snapshots won't COW further,
so if only a few blocks are repeatedly rewritten multiple times between
snapshots, the effect should be relatively small. Similarly if snapshots
happen far more frequently than block rewrites, since in that case most
of the snapshots won't have anything changed (for that file anyway)
since the last one.

However, if most of the file gets rewritten between snapshots and the
snapshot frequency is often enough to be a major factor, the effect can
be practically as bad as if the file weren't nocow in the first place.

If I knew a bit more about rrd's rewrite pattern... and your snapshot
pattern...

Second, as you alluded, for btrfs, files must be set nocow before
anything is written to them. Quoting the chattr(1) manpage: "If it is
set on a file which already has data blocks, it is undefined when the
blocks assigned to the file will be fully stable." Not being a dev I
don't read the code to know what that means in practice, but it could
well be effectively cow1, which would yield the maximum 2X size you
assumed.
But I think it's best to take "undefined" at its meaning, and assume the
worst case, "no effect at all", for size-calculation purposes, unless
you really /did/ set it at file creation, before the file had content.

And the easiest way to do /that/, and something that might be worthwhile
doing anyway if you think unreclaimed still-referenced extents are your
problem, is to set the nocow flag on the /directory/, then copy the
files into it, taking care to actually create them new: that is, use
--reflink=never, or copy the files to a different filesystem, perhaps
tmpfs, and back, so they /have/ to be created new. Of course with the
rewriter (rrdcached, apparently) shut down for the process.

Then, once the files are safely back in place and the filesystem synced
so the data is actually on disk, you can delete the old copies (which
will continue to serve as backups until then), and sync the filesystem
again. While snapshots will of course continue to keep extents they
reference locked, for unsnapshotted files at least, this process should
clear up any still-referenced extents.
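The re-create-as-nocow procedure described above can be sketched in a few
shell commands. This is only a sketch under assumptions: /data/rrd is a
hypothetical path standing in for wherever the rrd files live, and the
rewriter (rrdcached) must be stopped before starting:

```shell
# /data/rrd is a hypothetical example path; stop rrdcached first.
mkdir /data/rrd.new
chattr +C /data/rrd.new                # files created inside inherit +C (nocow)
cp -a --reflink=never /data/rrd/. /data/rrd.new/   # force truly new extents
sync                                   # make sure the new copies are on disk
mv /data/rrd /data/rrd.old             # old copies double as backups for now
mv /data/rrd.new /data/rrd
rm -r /data/rrd.old                    # delete old copies once satisfied
sync                                   # let btrfs actually free the old extents
```

The --reflink=never on the copy is the important part: a plain same-filesystem
copy could reflink and keep referencing the old (COW-written) extents.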
Re: btrfs-transacti hammering the system
01.12.2017 21:04, Austin S. Hemmelgarn wrote:
> On 2017-12-01 12:13, Andrei Borzenkov wrote:
>> 01.12.2017 20:06, Hans van Kranenburg wrote:
>>>
>>> Additional tips (forgot to ask for your /proc/mounts before):
>>> * Use the noatime mount option, so that only accessing files does not
>>>   lead to changes in metadata,
>>
>> Is not 'lazytime' default today?

Sorry, it was relatime that is today's default, I mixed them up.

>> It gives you correct atime + no extra
>> metadata update caused by an update of atime only.
>
> Unless things have changed since the last time this came up, BTRFS does
> not support the 'lazytime' mount option (but it doesn't complain about
> it either).

Actually, since v2.27 "lazytime" is interpreted by the mount command
itself and converted into the MS_LAZYTIME flag, so it should be
available for every FS.

bor@10:~> sudo mkfs -t ext4 /dev/sdb1
mke2fs 1.43.7 (16-Oct-2017)
...
bor@10:~> sudo mount -t ext4 -o lazytime /dev/sdb1 /mnt
bor@10:~> tail /proc/self/mountinfo
...
224 66 8:17 / /mnt rw,relatime shared:152 - ext4 /dev/sdb1 rw,lazytime,data=ordered
bor@10:~> sudo umount /dev/sdb1
bor@10:~> sudo mkfs -t btrfs -f /dev/sdb1
btrfs-progs v4.13.3
...
bor@10:~> sudo mount -t btrfs -o lazytime /dev/sdb1 /mnt
bor@10:~> tail /proc/self/mountinfo
...
224 66 0:88 / /mnt rw,relatime shared:152 - btrfs /dev/sdb1 rw,lazytime,space_cache,subvolid=5,subvol=/

> Also, lazytime is independent from noatime, and using both can have
> benefits (lazytime will still have to write out the inode for every file
> read on the system every 24 hours, but with noatime it only has to write
> out the inode for files that have changed).

OK, that's true.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: exclusive subvolume space missing
On Sat, Dec 02, 2017 at 17:28:12 +0100, Tomasz Pala wrote:

>> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
> [...]
>> Now make various small changes to the file, say under 16 KiB each. These
>> will each be COWed elsewhere as one might expect, by default 16 KiB at
>> a time I believe (might be 4 KiB, as it was back when the default leaf
>
> I got ~500 small files (100-500 kB) updated partially in regular
> intervals:
>
> # du -Lc **/*.rrd | tail -n1
> 105M    total
>
>> But here's the kicker. Even without a snapshot locking that original 100
>> MiB extent in place, if even one of the original 16 KiB blocks isn't
>> rewritten, that entire 100 MiB extent will remain locked in place, as the
>> original 16 KiB blocks that have been changed and thus COWed elsewhere
>> aren't freed one at a time. The full 100 MiB extent only gets freed, all
>> at once, once no references to it remain, which means once that last
>> block of the extent gets rewritten.

OTOH - should this happen with nodatacow files? As I mentioned before,
these files are chattred +C (however this was not their initial state
due to https://bugzilla.kernel.org/show_bug.cgi?id=189671 ).
Am I wrong thinking that in such a case they should occupy twice their
size maximum? Or maybe there is some tool that could show me the real
space wasted by a file, including extent counts etc.?

-- 
Tomasz Pala
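On the question of a tool showing per-file space waste and extent counts:
two existing commands come close. This is a sketch using the load/load.rrd
example path from later in the thread; filefrag is from e2fsprogs and
btrfs filesystem du needs a reasonably recent btrfs-progs:

```shell
# List every extent a file currently maps to, with offsets and lengths:
filefrag -v load/load.rrd

# Split a file's usage into Total / Exclusive / Set shared; "Exclusive"
# is the space pinned by this file alone (not shared via snapshots):
btrfs filesystem du load/load.rrd
```

Neither shows dead space still pinned inside partially rewritten extents,
which is exactly the gap this thread is about, but together they give the
extent count and the exclusive-vs-shared split.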
Re: exclusive subvolume space missing
On Fri, 01 Dec 2017 18:57:08 -0800, Duncan wrote:

> OK, is this supposed to be raid1 or single data, because the above shows
> metadata as all raid1, while some data is single tho most is raid1, and
> while old mkfs used to create unused single chunks on raid1 that had to
> be removed manually via balance, those single data chunks aren't unused.

It is supposed to be RAID1; the single data chunks were leftovers from my
previous attempts to gain some space by converting to the single profile.
That failed miserably, BTW (would it have been smarter with the "soft"
option?), but I've already managed to clear it up.

> Assuming the intent is raid1, I'd recommend doing...
>
> btrfs balance start -dconvert=raid1,soft /

Yes, this was the way to go. It also reclaimed the 8 GB. I assume the
failing -dconvert=single somehow locked that 8 GB, so this issue should
be addressed in btrfs-tools, to report such a locked-out region. You've
already noted that the single-profile data itself occupied much less.

So this was the first issue; the second is running overhead that
accumulates over time. Since yesterday, when I had 19 GB free, I've
already lost 4 GB. The scenario you've described is very probable:

> btrfs balance start -dusage=N /
[...]
> allocated value toward usage. I too run relatively small btrfs raid1s
> and would suggest trying N=5, 20, 40, 70, until the spread between

There were no effects above N=10 (with both dusage and musage).

> consuming your space either, as I'd suspect they might if the problem were
> for instance atime updates, so while noatime is certainly recommended and

I have used noatime by default for years, so that's not the source of
the problem here.

> The other possibility that comes to mind here has to do with btrfs COW
> write patterns...
>
> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
[...]
> Now make various small changes to the file, say under 16 KiB each. These
> will each be COWed elsewhere as one might expect,
> by default 16 KiB at
> a time I believe (might be 4 KiB, as it was back when the default leaf

I got ~500 small files (100-500 kB) updated partially in regular
intervals:

# du -Lc **/*.rrd | tail -n1
105M    total

> But here's the kicker. Even without a snapshot locking that original 100
> MiB extent in place, if even one of the original 16 KiB blocks isn't
> rewritten, that entire 100 MiB extent will remain locked in place, as the
> original 16 KiB blocks that have been changed and thus COWed elsewhere
> aren't freed one at a time. The full 100 MiB extent only gets freed, all
> at once, once no references to it remain, which means once that last
> block of the extent gets rewritten.
>
> So perhaps you have a pattern where files of several MiB get mostly
> rewritten, taking more space for the rewrites due to COW, but one or
> more blocks remain as originally written, locking the original extent
> in place at its full size, thus taking twice the space of the original
> file.
>
> Of course the worst case is: rewrite the file minus a block, then rewrite
> that minus a block, then rewrite... in which case the total space
> usage will end up being several times the size of the original file!
>
> Luckily few people have this sort of usage pattern, but if you do...
>
> It would certainly explain the space eating...

Has anyone investigated how this relates to RRD rewrites? I don't use
rrdcached; I never thought that 100 MB of data might trash an entire
filesystem...

best regards,
-- 
Tomasz Pala
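The space accounting in the quoted worst case can be checked with some
quick shell arithmetic. This is a toy model of the behavior described
above (extent freed only all at once), not btrfs internals:

```shell
block=$((16 * 1024))                          # 16 KiB COW granularity
blocks=$((100 * 1024 * 1024 / block))         # 6400 blocks in a 100 MiB extent

# Rewrite all but one block: the whole original extent stays pinned,
# plus one new 16 KiB extent per rewritten block.
pinned=$((blocks * block))
cow=$(( (blocks - 1) * block ))
echo "allocated: $(( (pinned + cow) / 1024 / 1024 )) MiB"   # 199 MiB, ~2x the file

# Rewrite the final original block: the old extent is freed all at once.
echo "allocated: $(( blocks * block / 1024 / 1024 )) MiB"   # back to 100 MiB
```

One stale 16 KiB block is thus enough to hold ~100 MiB hostage, which is
why a few hundred partially rewritten rrd files can eat gigabytes.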
Re: snaprotate
On Sat 2017-12-02 (13:53), Ulli Horlacher wrote:

> Being a Netapp user for a long time, I have always missed btrfs snapshots
> the way Netapp creates them.
>
> I have now written snaprotate:
>
> http://fex.belwue.de/snaprotate.html

Uuuppps... I just found that I posted this on 2017-09-09 already!
Sorry for the duplicate! But the documentation is better now :-)

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart         E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<20171202125356.ga1...@rus.uni-stuttgart.de>
snaprotate
Being a Netapp user for a long time, I have always missed btrfs snapshots
the way Netapp creates them.

I have now written snaprotate:

http://fex.belwue.de/snaprotate.html

snaprotate creates and manages btrfs readonly snapshots similar to
Netapp. Snapshots have names like hourly, daily, weekly, single and a
date_time prefix. Snapshots are stored in a .snapshot/ directory in the
subvolume root. Example:

/local/home/.snapshot/2017-09-09_1200.hourly

You create a snapshot with:

snaprotate <class> <count> <subvolume>

<class> is your snapshot name
<count> is the maximum number of snapshots for this class
<subvolume> is the btrfs subvolume directory you want to snapshot

If <count> is exceeded, then the oldest snapshot will be deleted.
You can use snaprotate either as root or as a normal user (for your own
subvolumes).

Example usages:

snaprotate single 3 /data             # create and rotate "single" snapshot
snaprotate test 0 /tmp                # delete all "test" snapshots
snaprotate daily 7 /home /local/share # create and rotate "daily" snapshot
snaprotate -l /home                   # list snapshots

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart         E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<20171202125356.ga1...@rus.uni-stuttgart.de>
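The rotate-and-prune behavior described above can be sketched in a few
lines of shell. This is an illustration of the idea only, not the actual
snaprotate code; class, count, and the /home path are example values
taken from the usage above:

```shell
class=daily; count=7; vol=/home        # example values from the usage above
snapdir="$vol/.snapshot"
mkdir -p "$snapdir"

# Create a read-only snapshot named <date>_<time>.<class>:
btrfs subvolume snapshot -r "$vol" "$snapdir/$(date +%F_%H%M).$class"

# Prune: date-prefixed names sort chronologically, so everything except
# the newest $count entries is the oldest surplus (GNU head -n -K).
ls -1d "$snapdir"/*."$class" | head -n -"$count" | while read -r snap; do
    btrfs subvolume delete "$snap"
done
```

Because the names sort chronologically, no timestamps need parsing; the
list order alone decides which snapshots are oldest.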
Re: exclusive subvolume space missing
OK, I seriously need to address this, as during the night I lost 3 GB
again (quoted lines are yesterday's values, unquoted lines are today's):

On Sat, Dec 02, 2017 at 10:35:12 +0800, Qu Wenruo wrote:

>> # btrfs fi sh /
>> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
>> Total devices 2 FS bytes used 44.10GiB
   Total devices 2 FS bytes used 47.28GiB

>> # btrfs fi usage /
>> Overall:
>> Used: 88.19GiB
   Used: 94.58GiB
>> Free (estimated): 18.75GiB (min: 18.75GiB)
   Free (estimated): 15.56GiB (min: 15.56GiB)
>>
>> # btrfs dev usage / - output not changed
>> # btrfs fi df /
>> Data, RAID1: total=51.97GiB, used=43.22GiB
   Data, RAID1: total=51.97GiB, used=46.42GiB
>> System, RAID1: total=32.00MiB, used=16.00KiB
>> Metadata, RAID1: total=2.00GiB, used=895.69MiB
>> GlobalReserve, single: total=131.14MiB, used=0.00B
   GlobalReserve, single: total=135.50MiB, used=0.00B
>>
>> # df
>> /dev/sda2    64G  45G  19G  71% /
   /dev/sda2    64G  48G  16G  76% /

>> However the difference is on the active root fs:
>>
>> -0/291   24.29GiB    9.77GiB
>> +0/291   15.99GiB   76.00MiB
   0/291    19.19GiB    3.28GiB
>
> Since you have already showed the size of the snapshots, which hardly
> goes beyond 1G, it may be possible that extent booking is the cause.
>
> And considering it's all exclusive, defrag may help in this case.

I'm going to try defrag here, but I have a bunch of questions first; as
defrag would break CoW, I don't want to defrag files that span multiple
snapshots, unless they have huge overhead:

1. Is there any switch meaning 'defrag only exclusive data'?

2. Is there any switch meaning 'defrag only extents fragmented more
   than X' or 'defrag only fragments that could possibly be freed'?

3. I guess there aren't, so how could I accomplish my goal, i.e.
   reclaiming space that was lost due to fragmentation, without breaking
   snapshotted CoW where it would be not only pointless, but actually
   harmful?

4. How can I prevent this from happening again?
All the files that are written constantly (the stats collector here,
PostgreSQL databases and logs on other machines) are marked nocow (+C);
maybe some new attribute to mark a file as autodefrag? +t?

For example, the largest file from the stats collector:

     Total   Exclusive  Set shared  Filename
 432.00KiB   176.00KiB   256.00KiB  load/load.rrd

but most of them have 'Set shared' == 0.

5. The stats collector has been running from the beginning; according to
   the quota output, it was not the issue until something happened. If
   the problem was triggered by (guessing) a low-space condition, and it
   results in even more space lost, there is a dangerous positive
   feedback loop that makes any filesystem unstable ("once you run out
   of space, you won't recover"). Does this mean btrfs is simply not
   suitable (yet?) for frequent-update usage patterns, like RRD files?

6. Or maybe some extra steps should be taken just before taking a
   snapshot? I guess 'defrag exclusive' would be perfect here -
   reclaiming space before it gets locked inside a snapshot. The
   rationale is obvious: since snapshot-aware defrag was removed, allow
   defragging snapshot-exclusive data only. This would of course result
   in partial file defragmentation, but that should be enough for
   pathological cases like mine.

-- 
Tomasz Pala
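Until something like 'defrag only exclusive data' exists, one crude
approximation is to let btrfs fi du pick out the files whose Exclusive
share is already large, and defragment only those, since their extents
are mostly not shared with snapshots anyway. A sketch, assuming a
btrfs-progs with the --raw size option on fi du; the rrd path and the
1 MiB threshold are made-up examples:

```shell
# Columns of btrfs fi du are: Total  Exclusive  "Set shared"  Filename,
# so with --raw (bytes) the exclusive size is field 2 and the name field 4.
# Keep files with over 1 MiB exclusive data and defragment just those.
btrfs filesystem du --raw /var/lib/collectd/rrd/*.rrd \
  | awk '$2 + 0 > 1048576 { print $4 }' \
  | xargs -r btrfs filesystem defragment
```

This is only an approximation: a file can show little exclusive data yet
still pin large partially-rewritten extents, which fi du cannot see.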