Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Thu, Feb 11, 2021 at 09:19:07AM -0500, Phillip Susi wrote:
>
> Phillip Susi writes:
>
> > Wait, what do you mean the inode nr changes?  I thought the whole point
> > of the block donating thing was that you get a contiguous set of blocks
> > in the new file, then transfer those blocks back to the old inode so
> > that the inode number and timestamps of the file don't change.
>
> I just tested this with e4defrag and the inode nr does not change.
> Oddly, it refused to improve my archived journals which had 12-15
> fragments.  I finally found /var/log/btmp.1 which despite being less
> than 8mb had several hundred fragments.  e4defrag got it down to 1
> fragment, but for some reason, it is still described by 3 separate
> entries in the extent tree.
>
> Looking at the archived journals though, I wonder why am I seeing so
> many unwritten areas?  Just the last extent of this file has nearly 4 mb
> that were never written to.  This system has never had an unexpected
> shutdown.  Attached is the extent map.

The mid-journal unwritten areas are likely entry arrays.  They grow
exponentially and get filled in as more entries are appended containing
their respective objects.  If you're unfamiliar with the format, there's
a chain of entry arrays constructed per recurring data object.

At the end of the journal, it's currently expected to find some
unwritten space due to the 8MiB fallocate.  A future version will
likely truncate this off while archiving.

I added a journal object layout introspection feature to jio [0], which
might be interesting for you to correlate the extent list with the
application-level object list.  You can access the feature by running
`jio report layout`; it will produce a .layout file in the cwd for
every journal it opened.

Here's a sample:

---8<---8<---8<---8<
Layout for "user-1000.journal"

Legend:
 ?   OBJECT_UNUSED
 d   OBJECT_DATA
 f   OBJECT_FIELD
 e   OBJECT_ENTRY
 D   OBJECT_DATA_HASH_TABLE
 F   OBJECT_FIELD_HASH_TABLE
 A   OBJECT_ENTRY_ARRAY
 t   OBJECT_TAG
 |N|  object spans N page boundaries (page size used=4096)
 |    single page boundary
 +N   N bytes of alignment padding
 +    single byte of alignment padding

F|5344 D|448|1834896 d81+7 f50+6 d74+6 f48 d82+6 f55+ d84+4 f57+7 d79+
f50+6 d104 f47+ d73+7 f44+4 d73+7 f44+4 d73+7 f44+4 d72 f45+3 d76+4
f44+4 d75+5 f48 d90+6 f54+2 d80 f54+2 d84+4 f55+ d123+5 f55+ d82+6 f56
d87+ f58+6 d93+3 f53+3 d|94+2 f54+2 d91+5 f59+5 d119+ f62+2 d107+5
f66+6 d105+7 f48 d108+4 f51+5 d82+6 f49+7 e480 A56 d97+7 d107+5 e480
A56 A56 A56 A56 A56 A56 A56 A56 A56 A56 A56 A56 A56 A56 A56 A56 A56
A56 A56 A56 A56 A56 A56 A56 d142+2 d70+2 d107+5 e|480 d74+6 d148+4
d107+5 e480 A56 d78+2 d122+6 d72 d107+5 e480 A88 d79+ d73+7 d107+5 e480
A88 A88 A88 A56 A88 A88 A88 A88 A88 A88 A88 A88 A|88 A88 A88 A88 A88
A88 A88 A88 A88 d97+7 d107+5 e480 A88 A56 A56 d107+5 e480 A56 d107+5
e480 A56 A56 d107+5 e480 A88 A56
---8<---8<---8<---8<

Regards,
Vito Caputo

[0] git://git.pengaru.com/jio (clone recursively w/--recursive)

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
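[An aside, not from the original mail: without decoding FIEMAP, the data/hole picture of a sparse or preallocated file can be approximated by walking it with lseek(SEEK_DATA/SEEK_HOLE). This is a minimal sketch with illustrative names, and on some filesystems preallocated-but-unwritten ranges may be reported as data rather than holes:]

```python
import os

def data_holes(path):
    """Walk a file with SEEK_DATA/SEEK_HOLE and return a list of
    (kind, start, length) tuples, kind being "data" or "hole"."""
    segs = []
    with open(path, "rb") as f:
        fd = f.fileno()
        end = os.fstat(fd).st_size
        off = 0
        while off < end:
            try:
                data = os.lseek(fd, off, os.SEEK_DATA)
            except OSError:
                # no data past off: the rest is a trailing hole
                segs.append(("hole", off, end - off))
                break
            if data > off:
                segs.append(("hole", off, data - off))
            # nearest hole at or after the data run (EOF counts as one)
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            segs.append(("data", data, hole - data))
            off = hole
    return segs
```

Running it over an archived journal would show the holes lining up with the unwritten extents in the FIEMAP listing.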
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Colin Guthrie writes:

> I think the defaults are more complex than just "each journal file can
> grow to 128M" no?

Not as far as I can see.

> I mean there is SystemMaxUse= which defaults to 10% of the partition on
> which journal files live (this is for all journal files, not just the
> SystemMaxFileSize= which refers to just one file).

That controls when to delete old journals, not when to rotate a
journal.  It looks like you can manually request a rotation, and you
can set a time based rotation, but it defaults to off, so that leaves
rotating once the file reaches the max size (128M).
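[For reference, the knobs under discussion live in journald.conf; a sketch of the relevant section, with comments reflecting the defaults as understood in this thread (see journald.conf(5) for the authoritative semantics):]

```
# /etc/systemd/journald.conf
[Journal]
#SystemMaxUse=       # cap on total disk usage of all journals; ~10% of the fs
#SystemMaxFileSize=  # per-file cap; a file is rotated once it reaches this
#MaxFileSec=         # time based rotation interval
```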
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Phillip Susi wrote on 11/02/2021 16:29:
> Colin Guthrie writes:
>> Are those journal files suffixed with a ~?  Only ~ suffixed journals
>> represent a dirty journal file (i.e. from an unexpected shutdown).
>
> Nope.
>
>> Journals rotate for other reasons too (e.g. user request, overall space
>> requirements etc.) which might explain this wasted space?
>
> I've made no requests to rotate and config is default, which afaics
> means only rotate when the log hits max size of 128MB.  Thus I wouldn't
> expect to really see any holes in the log, especially in the middle.

I think the defaults are more complex than just "each journal file can
grow to 128M" no?

I mean there is SystemMaxUse= which defaults to 10% of the partition on
which journal files live (this is for all journal files, not just the
SystemMaxFileSize= which refers to just one file).

The default semantics are described in man journald.conf(5)

Again, could be a red herring, so just my first thought.

Col

--
Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job:
  Tribalogic Limited http://www.tribalogic.net/
Open Source:
  Mageia Contributor http://www.mageia.org/
  PulseAudio Hacker http://www.pulseaudio.org/
  Trac Hacker http://trac.edgewall.org/
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Colin Guthrie writes:

> Are those journal files suffixed with a ~?  Only ~ suffixed journals
> represent a dirty journal file (i.e. from an unexpected shutdown).

Nope.

> Journals rotate for other reasons too (e.g. user request, overall space
> requirements etc.) which might explain this wasted space?

I've made no requests to rotate and config is default, which afaics
means only rotate when the log hits max size of 128MB.  Thus I wouldn't
expect to really see any holes in the log, especially in the middle.
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Phillip Susi wrote on 11/02/2021 14:19:
> Looking at the archived journals though, I wonder why am I seeing so
> many unwritten areas?  Just the last extent of this file has nearly 4 mb
> that were never written to.  This system has never had an unexpected
> shutdown.  Attached is the extent map.

Are those journal files suffixed with a ~?  Only ~ suffixed journals
represent a dirty journal file (i.e. from an unexpected shutdown).

Journals rotate for other reasons too (e.g. user request, overall space
requirements etc.) which might explain this wasted space?

Just a thought.

Col

--
Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job:
  Tribalogic Limited http://www.tribalogic.net/
Open Source:
  Mageia Contributor http://www.mageia.org/
  PulseAudio Hacker http://www.pulseaudio.org/
  Trac Hacker http://trac.edgewall.org/
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Phillip Susi writes:

> Wait, what do you mean the inode nr changes?  I thought the whole point
> of the block donating thing was that you get a contiguous set of blocks
> in the new file, then transfer those blocks back to the old inode so
> that the inode number and timestamps of the file don't change.

I just tested this with e4defrag and the inode nr does not change.
Oddly, it refused to improve my archived journals which had 12-15
fragments.  I finally found /var/log/btmp.1 which despite being less
than 8mb had several hundred fragments.  e4defrag got it down to 1
fragment, but for some reason, it is still described by 3 separate
entries in the extent tree.

Looking at the archived journals though, I wonder why am I seeing so
many unwritten areas?  Just the last extent of this file has nearly 4 mb
that were never written to.  This system has never had an unexpected
shutdown.  Attached is the extent map.

Filesystem type is: ef53
File size of system@13a67b4b418d4869b37247eda6ebe494-00151338-0005b9ee46d7d4a9.journal is 117440512 (28672 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:  length: expected: flags:
   0:       0..     0: 1712667.. 1712667:     1:
   1:       1..  2047: 1591168.. 1593214:  2047: 1712668:
   2:    2048..  2132: 3012608.. 3012692:    85: 1593215:
   3:    2133..  2139: 3012693.. 3012699:     7:           unwritten
   4:    2140..  4095: 3012700.. 3014655:  1956:
   5:    4096..  6143: 3041280.. 3043327:  2048: 3014656:
   6:    6144..  8191: 3010560.. 3012607:  2048: 3043328:
   7:    8192..  9011: 3002368.. 3003187:   820: 3012608:
   8:    9012..  9013: 3003188.. 3003189:     2:           unwritten
   9:    9014.. 10239: 3003190.. 3004415:  1226:
  10:   10240.. 11255: 3024896.. 3025911:  1016: 3004416:
  11:   11256.. 11268: 3025912.. 3025924:    13:           unwritten
  12:   11269.. 11348: 3025925.. 3026004:    80:
  13:   11349.. 11352: 3026005.. 3026008:     4:           unwritten
  14:   11353.. 11360: 3026009.. 3026016:     8:
  15:   11361.. 11364: 3026017.. 3026020:     4:           unwritten
  16:   11365.. 11373: 3026021.. 3026029:     9:
  17:   11374.. 11376: 3026030.. 3026032:     3:           unwritten
  18:   11377.. 11642: 3026033.. 3026298:   266:
  19:   11643.. 11688: 3026299.. 3026344:    46:           unwritten
  20:   11689.. 11961: 3026345.. 3026617:   273:
  21:   11962.. 11962: 3026618.. 3026618:     1:           unwritten
  22:   11963.. 12287: 3026619.. 3026943:   325:
  23:   12288.. 12347: 3033088.. 3033147:    60: 3026944:
  24:   12348.. 12381: 3033148.. 3033181:    34:           unwritten
  25:   12382.. 12466: 3033182.. 3033266:    85:
  26:   12467.. 12503: 3033267.. 3033303:    37:           unwritten
  27:   12504.. 13007: 3033304.. 3033807:   504:
  28:   13008.. 13024: 3033808.. 3033824:    17:           unwritten
  29:   13025.. 13044: 3033825.. 3033844:    20:
  30:   13045.. 13061: 3033845.. 3033861:    17:           unwritten
  31:   13062.. 13081: 3033862.. 3033881:    20:
  32:   13082.. 13098: 3033882.. 3033898:    17:           unwritten
  33:   13099.. 13642: 3033899.. 3034442:   544:
  34:   13643.. 13648: 3034443.. 3034448:     6:           unwritten
  35:   13649.. 13655: 3034449.. 3034455:     7:
  36:   13656.. 13660: 3034456.. 3034460:     5:           unwritten
  37:   13661.. 13667: 3034461.. 3034467:     7:
  38:   13668.. 13673: 3034468.. 3034473:     6:           unwritten
  39:   13674.. 13680: 3034474.. 3034480:     7:
  40:   13681.. 13685: 3034481.. 3034485:     5:           unwritten
  41:   13686.. 13692: 3034486.. 3034492:     7:
  42:   13693.. 13698: 3034493.. 3034498:     6:           unwritten
  43:   13699.. 14276: 3034499.. 3035076:   578:
  44:   14277.. 14277: 3035077.. 3035077:     1:           unwritten
  45:   14278.. 14458: 3035078.. 3035258:   181:
  46:   14459.. 14529: 3035259.. 3035329:    71:           unwritten
  47:   14530.. 14570: 3035330.. 3035370:    41:
  48:   14571.. 14641: 3035371.. 3035441:    71:           unwritten
  49:   14642.. 14928: 3035442.. 3035728:   287:
  50:   14929.. 15002: 3035729.. 3035802:    74:           unwritten
  51:   15003.. 15837: 3035803.. 303
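[An aside, not from the original mail: listings like the one above can be summarized mechanically. A hypothetical little parser for `filefrag -v` style rows, counting extents and totalling the blocks flagged unwritten:]

```python
import re

# Matches filefrag -v rows such as:
#    3:    2133..   2139: 3012693.. 3012699:      7:           unwritten
ROW = re.compile(
    r"^\s*(\d+):\s*(\d+)\.\.\s*(\d+):\s*(\d+)\.\.\s*(\d+):\s*(\d+):"
    r"(?:\s*(\d+):)?\s*(.*)$")

def summarize(listing):
    """Return (extent_count, unwritten_blocks) for a filefrag -v listing."""
    extents = unwritten = 0
    for line in listing.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header or unrelated line
        extents += 1
        if "unwritten" in m.group(8):
            unwritten += int(m.group(6))  # length column, in blocks
    return extents, unwritten
```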
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Lennart Poettering writes:

> inode, and then donate the old blocks over. This means the inode nr
> changes, which is something I don't like. Semantically it's only
> marginally better than just creating a new file from scratch.

Wait, what do you mean the inode nr changes?  I thought the whole point
of the block donating thing was that you get a contiguous set of blocks
in the new file, then transfer those blocks back to the old inode so
that the inode number and timestamps of the file don't change.
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Di, 09.02.21 10:17, Phillip Susi (ph...@thesusis.net) wrote:

>
> Chris Murphy writes:
>
> > And I agree 8MB isn't a big deal. Does anyone complain about journal
> > fragmentation on ext4 or xfs? If not, then we come full circle to my
> > second email in the thread which is don't defragment when nodatacow,
> > only defragment when datacow. Or use BTRFS_IOC_DEFRAG_RANGE and
> > specify 8MB length. That does seem to consistently no op on nodatacow
> > journals which have 8MB extents.
>
> Ok, I agree there.
>
> > The reason I'm dismissive is because the nodatacow fragment case is
> > the same as ext4 and XFS; the datacow fragment case is both
> > spectacular and non-deterministic. The workload will matter where
>
> Your argument seems to be that it's no worse than ext4 and so if we
> don't defrag there, why on btrfs?  Lennart seems to be arguing that the
> only reason systemd doesn't defrag on ext4 is because the ioctl is
> harder to use.

It's not just harder to use, it's uglier: you have to create a new
inode, and then donate the old blocks over. This means the inode nr
changes, which is something I don't like. Semantically it's only
marginally better than just creating a new file from scratch.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Mo, 08.02.21 22:13, Chris Murphy (li...@colorremedies.com) wrote:

> On Mon, Feb 8, 2021 at 7:56 AM Phillip Susi wrote:
> >
> > Chris Murphy writes:
> >
> > >> It sounds like you are arguing that it is better to do the wrong thing
> > >> on all SSDs rather than do the right thing on ones that aren't broken.
> > >
> > > No I'm suggesting there isn't currently a way to isolate
> > > defragmentation to just HDDs.
> >
> > Yes, but it sounded like you were suggesting that we shouldn't even try,
> > not just that it isn't 100% accurate.  Sure, some SSDs will be stupid
> > and report that they are rotational, but most aren't stupid, so it's a
> > good idea to disable the defragmentation on drives that report that they
> > are non rotational.
>
> So far I've seen, all USB devices report rotational. All USB flash
> drives, and any SSD in an enclosure.
>
> Maybe some way of estimating rotational based on latency standard
> deviation, and stick that in sysfs, instead of trusting device
> reporting. But in the meantime, the imperfect rule could be do not
> defragment unless it's SCSI/SATA/SAS and it reports it's rotational.

btrfs itself has a knob declaring whether something is ssd or not ssd,
configurable via the mount option. Of course, one would bind any higher
level logic to that same thing, and thus make it btrfs' own problem, or
the admin's.

Lennart

--
Lennart Poettering, Berlin
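[For reference, the per-device rotational hint being discussed lives in sysfs. A minimal sketch with the queue directory parameterized so it can point at any device; as noted above, USB enclosures routinely misreport, so treat the result as a hint only:]

```python
import os

def is_rotational(queue_dir="/sys/block/sda/queue"):
    """Read the kernel's rotational flag for a block device queue.

    The sysfs file contains a single '0' (non-rotational/SSD) or '1'
    (rotational/HDD).  Returns True when the flag can't be read, the
    conservative assumption for a defrag decision.  The default path
    is just an example."""
    try:
        with open(os.path.join(queue_dir, "rotational")) as f:
            return f.read().strip() == "1"
    except OSError:
        return True
```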
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Chris Murphy writes:

> And I agree 8MB isn't a big deal. Does anyone complain about journal
> fragmentation on ext4 or xfs? If not, then we come full circle to my
> second email in the thread which is don't defragment when nodatacow,
> only defragment when datacow. Or use BTRFS_IOC_DEFRAG_RANGE and
> specify 8MB length. That does seem to consistently no op on nodatacow
> journals which have 8MB extents.

Ok, I agree there.

> The reason I'm dismissive is because the nodatacow fragment case is
> the same as ext4 and XFS; the datacow fragment case is both
> spectacular and non-deterministic. The workload will matter where

Your argument seems to be that it's no worse than ext4 and so if we
don't defrag there, why on btrfs?  Lennart seems to be arguing that the
only reason systemd doesn't defrag on ext4 is because the ioctl is
harder to use.  Maybe it should defrag as well, so he's asking for
actual performance data to evaluate whether the defrag is pointless or
whether maybe ext4 should also start doing a defrag.  At least I think
that's his point.  Personally I agree (and showed the calculations in a
previous post) that 8 MB/fragment is only going to have a negligible
impact on performance and so isn't worth bothering with a defrag, but
he has asked for real world data...

> And also, only defragmenting on rotation strikes me as leaving
> performance on the table, right? If there is concern about fragmented

No, because fragmentation only causes additional latency on HDD, not
SSD.

> But it sounds to me like you want to learn what the performance is of
> journals defragmented with BTRFS_IOC_DEFRAG specifically? I don't think
> it's interesting because you're still better off leaving nodatacow
> journals alone, and something still has to be done in the datacow

Except that you're not.  Your definition of better off appears to be
only on SSD and only because it is preferable to have fewer writes
than less fragmentation.  On HDD defragmenting is a good thing.
Lennart seems to want real world performance data to evaluate just
*how* good and whether it's worth the bother, at least for HDD.  For
SSDs, I believe he agreed that it may as well be shut off there since
it provides no benefit, but your patch kills it on HDDs as well.

> Is there a test mode for journald to just dump a bunch of random stuff
> into the journal to age it? I don't want to wait weeks to get a dozen
> journal files.

The cause of the fragmentation is slowly appending to the file over
time, so if you dump a bunch of data in too quickly you would eliminate
the fragmentation.  You might try:

while true ; do logger "This is a test log message to act as filler" ; sleep 1 ; done

To speed things up a little bit.
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Mon, Feb 8, 2021 at 8:20 AM Phillip Susi wrote:
>
> Chris Murphy writes:
>
> > I showed that the archived journals have way more fragmentation than
> > active journals. And the fragments in active journals are
> > insignificant, and can even be reduced by fully allocating the journal
>
> Then clearly this is a problem with btrfs: it absolutely should not be
> making the files more fragmented when asked to defrag them.

I've asked. We'll see.

> > file to final size rather than appending - which has a good chance of
> > fragmenting the file on any file system, not just Btrfs.
>
> And yet, you just said the active journal had minimal fragmentation.

Yes, the extents are consistently 8MB in the nodatacow case, old and
new file system alike. Same as ext4 and XFS.

> That seems to mean that the 8mb fallocates that journald does is working
> well.  Sure, you could probably get fewer fragments by fallocating the
> whole 128 mb at once, but there are tradeoffs to that that are not worth
> it.  One fragment per 8 mb isn't a big deal.  Ideally a filesystem will
> manage to do better than that (didn't btrfs have a persistent
> reservation system for this purpose?), but it certainly should not
> commonly do worse.

I don't think any of the file systems guarantee a contiguous block
range upon fallocate, they only guarantee that writes to fallocated
space will succeed. i.e. it's a space reservation. But yeah in
practice, 8MB is small enough that chances are you'll see one 8MB
extent.

And I agree 8MB isn't a big deal. Does anyone complain about journal
fragmentation on ext4 or xfs? If not, then we come full circle to my
second email in the thread which is don't defragment when nodatacow,
only defragment when datacow. Or use BTRFS_IOC_DEFRAG_RANGE and
specify 8MB length. That does seem to consistently no op on nodatacow
journals which have 8MB extents.

> > Further, even *despite* this worse fragmentation of the archived
> > journals, bcc-tools fileslower shows no meaningful latency as a
> > result. I wrote this in the previous email. I don't understand what
> > you want me to show you.
>
> *Of course* it showed no meaningful latency because you did the test on
> an SSD, which has no meaningful latency penalty due to fragmentation.
> The question is how bad is it on HDD.

The reason I'm dismissive is because the nodatacow fragment case is
the same as ext4 and XFS; the datacow fragment case is both
spectacular and non-deterministic. The workload will matter where
these random 4KiB journal writes end up on an HDD. I've seen journals
with hundreds to thousands of extents. I'm not sure what we learn from
me doing a single isolated test on an HDD.

And also, only defragmenting on rotation strikes me as leaving
performance on the table, right? If there is concern about fragmented
archived journals, then isn't there concern about fragmented active
journals?

But it sounds to me like you want to learn what the performance is of
journals defragmented with BTRFS_IOC_DEFRAG specifically? I don't
think it's interesting because you're still better off leaving
nodatacow journals alone, and something still has to be done in the
datacow case. It's two extremes. What the performance is doesn't
matter, it's not going to tell you anything you can't already infer
from the two layouts.

> > And since journald offers no ability to disable the defragment on
> > Btrfs, I can't really do a longer term A/B comparison can I?
>
> You proposed a patch to disable it. Test before and after the patch.

Is there a test mode for journald to just dump a bunch of random stuff
into the journal to age it? I don't want to wait weeks to get a dozen
journal files.

> > I did provide data. That you don't like what the data shows: archived
> > journals have more fragments than active journals, is not my fault.
> > The existing "optimization" is making things worse, in addition to
> > adding a pile of unnecessary writes upon journal rotation.
>
> If it is making things worse, that is definitely a bug in btrfs.  It
> might be nice to avoid the writes on SSD though since there is no
> benefit there.

Agreed.

--
Chris Murphy
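[An illustration of the fallocate-as-reservation point above: growing a file in 8 MiB steps is just repeated posix_fallocate() calls extending the reservation, which reserves space without writing it. A minimal sketch, not journald's actual code:]

```python
import os

CHUNK = 8 * 1024 * 1024  # journald grows journal files in 8 MiB steps

def grow(fd, want):
    """Extend the file's reserved size to at least `want` bytes,
    one CHUNK at a time, the way an appending writer might."""
    size = os.fstat(fd).st_size
    while size < want:
        # reserve [size, size + CHUNK); no data is written
        os.posix_fallocate(fd, size, CHUNK)
        size += CHUNK
    return size
```

Whether each reserved chunk ends up physically contiguous is up to the filesystem's allocator, which is exactly the point being made above.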
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Mon, Feb 8, 2021 at 7:56 AM Phillip Susi wrote:
>
> Chris Murphy writes:
>
> >> It sounds like you are arguing that it is better to do the wrong thing
> >> on all SSDs rather than do the right thing on ones that aren't broken.
> >
> > No I'm suggesting there isn't currently a way to isolate
> > defragmentation to just HDDs.
>
> Yes, but it sounded like you were suggesting that we shouldn't even try,
> not just that it isn't 100% accurate.  Sure, some SSDs will be stupid
> and report that they are rotational, but most aren't stupid, so it's a
> good idea to disable the defragmentation on drives that report that they
> are non rotational.

So far I've seen, all USB devices report rotational. All USB flash
drives, and any SSD in an enclosure.

Maybe some way of estimating rotational based on latency standard
deviation, and stick that in sysfs, instead of trusting device
reporting. But in the meantime, the imperfect rule could be do not
defragment unless it's SCSI/SATA/SAS and it reports it's rotational.

--
Chris Murphy
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Mo, 08.02.21 10:09, Phillip Susi (ph...@thesusis.net) wrote:

> That's a fair point: if btrfs isn't any worse than other filesystems,
> then why is it the only one that gets a defrag?

As answered elsewhere:

1. only btrfs has a cow mode, where fragmentation is through the roof
   for randomly written files

2. only btrfs has a somewhat nice API for this (i.e. a single best
   effort ioctl with no params). (ext4 has a defrag API, but it's
   weird, and xfs I never checked, I never used it)

3. no one was annoyed by journal performance on non-btrfs enough to
   determine if this is worth it.

--
Lennart Poettering, Berlin
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Sa, 06.02.21 12:51, Chris Murphy (li...@colorremedies.com) wrote:

> The original commit description only mentions COW, it doesn't mention
> being predicated on nodatacow. In effect commit
> f27a386430cc7a27ebd06899d93310fb3bd4cee7 is obviated by commit
> 3a92e4ba470611ceec6693640b05eb248d62e32d four months later. I don't
> think they were ever intended to be used together, and combining them
> seems accidental.

Nah, both commits are for a common goal: make access time behaviour
OK'ish on btrfs, where it otherwise is terrible (on rotating media
particularly). It's optimized for access times, not for minimal iops.

I'd be totally open to revisit this all, and take iops more into
account, but again, we'd need a bit of profiling that compares access
times, iops, and stuff with and without this, on rotating and ssd.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Sa, 06.02.21 19:47, Chris Murphy (li...@colorremedies.com) wrote:

> On Fri, Feb 5, 2021 at 8:23 AM Phillip Susi wrote:
> >
> > Chris Murphy writes:
> >
> > > But it gets worse. The way systemd-journald is submitting the journals
> > > for defragmentation is making them more fragmented than just leaving
> > > them alone.
> >
> > Wait, doesn't it just create a new file, fallocate the whole thing, copy
> > the contents, and delete the original?
>
> Same inode, so no. As to the logic, I don't know. I'll ask upstream to
> document it.
>
> > How can that possibly make fragmentation *worse*?
>
> I'm only seeing this pattern with journald journals, and
> BTRFS_IOC_DEFRAG. But I'm also seeing it with all archived journals.
>
> Meanwhile, active journals exhibit no different pattern from ext4 and
> xfs, no worse fragmentation.

That's not surprising, these file systems don't have a defrag ioctl
with similar generic semantics.

> Is there a VFS API for handling these issues? Should there be? I really
> don't think any application, including journald, should be having to
> micromanage these kinds of things on a case by case basis. General
> problems like this need general solutions.

We don't micromanage. We call a simple, extremely generic ioctl, that
takes exactly zero parameters, asking btrfs to do its best.

> > It sounds like you are arguing that it is better to do the wrong thing
> > on all SSDs rather than do the right thing on ones that aren't broken.
>
> No I'm suggesting there isn't currently a way to isolate
> defragmentation to just HDDs.

We could add one. For example, the $SYSTEMD_JOURNAL_DEFRAG env var I
proposed in that other mail could have a special value besides yes/no
of "ssd" or so, where we'd use btrfs own understanding if it's backed
by ssd or rotating media, as controlled with the ssd/nossd mount
option.

(though ideally we'd have a better way to query it than to parse out
the mount options string)

Lennart

--
Lennart Poettering, Berlin
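[For reference, the zero-parameter call described above looks roughly like this from userspace. The request number is recomputed here from the kernel's _IOW(0x94, 2, struct btrfs_ioctl_vol_args) definition rather than taken from a header, so treat it as an assumption; the call only does anything on btrfs, and other filesystems reject it:]

```python
import fcntl

def _IOW(type_, nr, size):
    """Recompute the kernel's _IOW() macro: dir | size | type | nr."""
    return (1 << 30) | (size << 16) | (type_ << 8) | nr

# struct btrfs_ioctl_vol_args is 4096 bytes (s64 fd + 4088-byte name)
BTRFS_IOC_DEFRAG = _IOW(0x94, 2, 4096)

def defrag(path):
    """Ask btrfs to defragment `path` with a NULL argument, best effort.
    Returns False where the filesystem rejects the ioctl (non-btrfs)."""
    with open(path, "r+b") as f:
        try:
            fcntl.ioctl(f.fileno(), BTRFS_IOC_DEFRAG, 0)  # 0 ~ NULL arg
            return True
        except OSError:
            return False
```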
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fr, 05.02.21 17:44, Chris Murphy (li...@colorremedies.com) wrote:

> On Fri, Feb 5, 2021 at 3:55 PM Lennart Poettering wrote:
> >
> > On Fr, 05.02.21 20:58, Maksim Fomin (ma...@fomin.one) wrote:
> >
> > > > You know, we issue the btrfs ioctl, under the assumption that if the
> > > > file is already perfectly defragmented it's a NOP. Are you suggesting
> > > > it isn't a NOP in that case?
> > >
> > > So, what is the reason for defragmenting journal if BTRFS is
> > > detected? This does not happen at other filesystems. I have read
> > > this thread but has not found a clear answer to this question.
> >
> > btrfs like any file system fragments files with nocow a bit. Without
> > nocow (i.e. with cow) it fragments files horribly, given our write
> > pattern (which is: append something to the end, and update a few
> > pointers in the beginning). By upstream default we set nocow, some
> > downstreams/users undo that however. (this is done via tmpfiles,
> > i.e. journald doesn't actually set nocow ever).
>
> I don't see why it's upstream's problem to solve downstream decisions.
> If they want to (re)enable datacow, then they can also setup some kind
> of service to defragment /var/log/journal/ on a schedule, or they can
> use autodefrag.

There are good reasons to enable cow, even if we default to nocow.
RAID, checksumming, compression, all that. It's not clear that nocow
is perfect and cow is terrible or vice versa — in reality it's a very
blurry line, and hence we should support both modes, even if we pick a
default we think is in average the better choice. But because we
support both modes and because defragmentation of an unfragmented file
should be a NOP we issue the defrag ioctl too.

Moreover, if we didn't issue the defrag ioctl, there would be no way
to get it. I mean, to turn this into something constructive: please
send a patch that adds an env var $SYSTEMD_JOURNAL_BTRFS_DEFRAG which
when set to 0 will turn off the defrag. If you want to disable this
locally, then I am happy to merge a patch that makes that
configurable.

> > When we archive a journal file (i.e. stop writing to it) we know it
> > will never receive any further writes. It's a good time to undo the
> > fragmentation (we make no distinction whether heavily fragmented,
> > little fragmented or not at all fragmented on this) and thus for the
> > future make access behaviour better, given that we'll still access the
> > file regularly (because archiving in journald doesn't mean we stop
> > reading it, it just means we stop writing it — journalctl always
> > operates on the full data set). defragmentation happens in the bg once
> > triggered, it's a simple ioctl you can invoke on a file. if the file
> > is not fragmented it shouldn't do anything.
>
> ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=16777216, flags=0,
> extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0
>
> What 'len' value does journald use?

We don't call BTRFS_IOC_DEFRAG_RANGE. Instead, we call
BTRFS_IOC_DEFRAG with a NULL parameter.

> > other file systems simply have no such ioctl, and they never fragment
> > as terribly as btrfs can fragment. hence we don't call that ioctl.
>
> I did explain how to avoid the fragmentation in the first place, to
> obviate the need to defragment.
>
> 1. nodatacow. journald does this already
> 2. fallocate the intended final journal file size from the start,
>    instead of growing them in 8MB increments.

Not an option, as mentioned. We maintain a bunch of journal files in
parallel, and if we would allocate them 100% in advance, then we'd
have really shitty behaviour since we'd allocate a ton of space on
disk we don't actually use, but nonetheless already took away from
everything else.

> 3. Don't reflink copy (including snapshot) the journals. This arguably
>    is not journald's responsibility but as it creates both the journal/
>    directory and $MACHINEID directory, it could make one or both of them
>    as subvolumes instead to ensure they're not subject to snapshotting
>    from above.

That's nonsense. People do recursive snapshots. nspawn does, machined
does, and so do others. Also, even if recursive snapshots didn't
exist, I am pretty sure people might be annoyed if we just fuck with
their backup strategy, and exclude some files.

> > I'd even be fine dropping it entirely, if someone actually can
> > show the benefits of having the files unfragmented when archived
> > don't outweigh the downside of generating some iops when executing
> > the defragmentation.
>
> I showed that the archived journals have way more fragmentation than
> active journals.

Can you report this to the btrfs maintainers? Apparently
defragmentation is broken on your btrfs then? (I don't see that here
btw)

> And the fragments in active journals are insignificant, and can even
> be reduced by fully allocating the journal file to final size rather
> than appending - which has a good chance of fragmenting the file on
> any file system, not just Btrfs.

Yeah, b
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Chris Murphy writes: > I showed that the archived journals have way more fragmentation than > active journals. And the fragments in active journals are > insignificant, and can even be reduced by fully allocating the journal Then clearly this is a problem with btrfs: it absolutely should not be making the files more fragmented when asked to defrag them. > file to final size rather than appending - which has a good chance of > fragmenting the file on any file system, not just Btrfs. And yet, you just said the active journal had minimal fragmentation. That seems to mean that the 8mb fallocates that journald does is working well. Sure, you could proabbly get fewer fragments by fallocating the whole 128 mb at once, but there are tradeoffs to that that are not worth it. One fragment per 8 mb isn't a big deal. Ideally a filesystem will manage to do better than that ( didn't btrfs have a persistent reservation system for this purpose? ), but it certainly should not commonly do worse. > Further, even *despite* this worse fragmentation of the archived > journals, bcc-tools fileslower shows no meaningful latency as a > result. I wrote this in the previous email. I don't understand what > you want me to show you. *Of course* it showed no meaningful latency because you did the test on an SSD, which has no meaningful latency penalty due to fragmentation. The question is how bad is it on HDD. > And since journald offers no ability to disable the defragment on > Btrfs, I can't really do a longer term A/B comparison can I? You proposed a patch to disable it. Test before and after the patch. > I did provide data. That you don't like what the data shows: archived > journals have more fragments than active journals, is not my fault. > The existing "optimization" is making things worse, in addition to > adding a pile of unnecessary writes upon journal rotation. If it is making things worse, that is definately a bug in btrfs. 
It might be nice to avoid the writes on SSD though, since there is no benefit there.

> Conversely, you have not provided data proving that nodatacow
> fallocated files on Btrfs are any more fragmented than fallocated
> files on ext4 or XFS.

That's a fair point: if btrfs isn't any worse than other filesystems, then why is it the only one that gets a defrag?

___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Chris Murphy writes:

>> It sounds like you are arguing that it is better to do the wrong thing
>> on all SSDs rather than do the right thing on ones that aren't broken.
>
> No I'm suggesting there isn't currently a way to isolate
> defragmentation to just HDDs.

Yes, but it sounded like you were suggesting that we shouldn't even try, not just that it isn't 100% accurate. Sure, some SSDs will be stupid and report that they are rotational, but most aren't stupid, so it's a good idea to disable the defragmentation on drives that report that they are non-rotational.
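The sysfs check being debated above is straightforward to sketch. This is illustrative only: the helper names and the `sysfs` parameter are mine, and real code would first have to resolve which block device backs /var/log/journal, which this skips.

```python
from pathlib import Path

def is_rotational(dev: str, sysfs: str = "/sys") -> bool:
    """Return True if the kernel reports the block device as rotating media.

    Note the caveat from the thread: some SSDs (mis)report themselves as
    rotational, so this is a heuristic, not a guarantee.
    """
    flag = Path(sysfs) / "block" / dev / "queue" / "rotational"
    try:
        return flag.read_text().strip() == "1"
    except OSError:
        # If sysfs is unavailable, err on the side of the current behaviour
        # (defragment), since that is what journald does unconditionally today.
        return True

def should_defrag_on_archive(dev: str, sysfs: str = "/sys") -> bool:
    # Defragment archived journals only where it can plausibly help.
    return is_rotational(dev, sysfs)
```

The open question in the thread is not whether this check is possible, but whether devices that lie about being rotational make it worthless; Phillip's position is that a heuristic that is right for most devices beats none at all.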
Re: [systemd-devel] consider dropping defrag of journals on btrfs
06.02.2021 00:33, Phillip Susi writes:

> Lennart Poettering writes:
>
>> journalctl gives you one long continuous log stream, joining everything
>> available, archived or not, into one big interleaved stream.
>
> If you ask for everything, yes... but if you run journalctl -b then
> shouldn't it only read back until it finds the start of the current
> boot?

Ever tried "systemctl status" on an HDD with a large amount of archived journal data? It can easily take minutes ...
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fri, Feb 5, 2021 at 8:23 AM Phillip Susi wrote:

> Chris Murphy writes:
>
> > But it gets worse. The way systemd-journald is submitting the journals
> > for defragmentation is making them more fragmented than just leaving
> > them alone.
>
> Wait, doesn't it just create a new file, fallocate the whole thing, copy
> the contents, and delete the original?

Same inode, so no. As to the logic, I don't know. I'll ask upstream to document it.

> How can that possibly make
> fragmentation *worse*?

I'm only seeing this pattern with journald journals, and BTRFS_IOC_DEFRAG. But I'm also seeing it with all archived journals. Meanwhile, active journals exhibit no different pattern from ext4 and xfs, no worse fragmentation.

Consider other storage technologies where COW and snapshots come into play. For example, anything based on device-mapper thin provisioning is going to run into these issues. How it allocates physical extents isn't up to the file system. Duplicate a file and delete the original, and you might get a more fragmented file as well. The physical layout is entirely decoupled from the file system - the filesystem could tell you "no fragmentation" and yet it is highly fragmented, or vice versa. These problems are not unique to Btrfs. Is there a VFS API for handling these issues? Should there be? I really don't think any application, including journald, should be having to micromanage these kinds of things on a case by case basis. General problems like this need general solutions.

> > All of those archived files have more fragments (post defrag) than
> > they had when they were active. And here is the FIEMAP for the 96MB
> > file which has 92 fragments.
>
> How the heck did you end up with nearly 1 frag per mb?

I didn't do anything special, it's a default configuration. I'll ask Btrfs developers about it. Maybe it's one of those artifacts of FIEMAP I mentioned previously.
Maybe it's not that badly fragmented to a drive that's going to reorder reads anyway, to be more efficient about it.

> > If you want an optimization that's actually useful on Btrfs,
> > /var/log/journal/ could be a nested subvolume. That would prevent any
> > snapshots above from turning the nodatacow journals into datacow
> > journals, which does significantly increase fragmentation (it would in
> > the exact same case if it were a reflink copy on XFS for that matter).
>
> Wouldn't that mean that when you take snapshots, they don't include the
> logs?

That's a snapshot/rollback regime design and policy question. If you snapshot the subvolume that contains the journals, the journals will be in the snapshot. The user space tools do not have an option for recursive snapshots, so snapshotting does end at subvolume boundaries. If you want the journals snapshotted, then their enclosing subvolume would need to be snapshotted.

> That seems like an anti feature that violates the principle of
> least surprise. If I make a snapshot of my root, I *expect* it to
> contain my logs.

You can only roll back that which you snapshot. If you snapshot a root without excluding journals, then when you roll back, you roll back the journals. That's data loss. (open)SUSE has a snapshot/rollback regime configured and enabled by default out of the box. Logs are excluded from it, same as the bootloader. (Although I'll also note they default to volatile systemd journals, and use rsyslogd for persistent logs.) Fedora meanwhile does have persistent journald journals in the root subvolume, but there's no snapshot/rollback regime enabled out of the box. I'm inclined to have them excluded, not so much to avoid cow of the nodatacow journals, but to avoid discontinuity in the journals upon rollback.

> > I don't get the iops thing at all. What we care about in this case is
> > latency.
> > A least noticeable latency of around 150ms seems reasonable
> > as a starting point; that's where users notice a delay between a key
> > press and a character appearing. However, if I check for 10ms latency
> > (using bcc-tools fileslower) when reading all of the above journals at
> > once:
> >
> > $ sudo journalctl -D
> > /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
> >
> > Not a single report. None. Nothing took even 10ms. And those journals
> > are more fragmented than your 20 in a 100MB file.
> >
> > I don't have any hard drives to test this on. This is what, 10% of the
> > market at this point? The best you can do there is the same as on SSD.
>
> The above sounded like great data, but not if it was done on SSD.

Right. But also I can't disable the defragmentation in order to do a proper test on HDD.

> > You can't depend on sysfs to conditionally do defragmentation on only
> > rotational media, too many fragile media claim to be rotating.
>
> It sounds like you are arguing that it is better to do the wrong thing
> on all SSDs rather than do the right thing on ones that aren't broken.

No, I'm suggesting there isn't currently a way to isolate defragmentation to just HDDs.
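On the nested-subvolume idea mentioned above: systemd's tmpfiles.d already has a line type for this. The `v` type creates the path as a btrfs subvolume when the parent is on btrfs, falling back to a plain directory on other filesystems. A hypothetical drop-in to make the journal directory a subvolume, so snapshots from above stop at it, might look like the following (the mode and ownership mirror systemd's stock journal tmpfiles entry; double-check against your distribution's shipped config before relying on it):

```
# /etc/tmpfiles.d/journal-subvolume.conf  (illustrative sketch)
# 'v' = create as btrfs subvolume if possible, else as a directory
v /var/log/journal 2755 root systemd-journal -
```

Note this only helps fresh installs: tmpfiles will not convert an existing /var/log/journal directory into a subvolume, so existing systems would need a manual migration.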
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fri, Feb 05, 2021 at 05:44:03PM -0700, Chris Murphy wrote:

> On Fri, Feb 5, 2021 at 3:55 PM Lennart Poettering
> wrote:
> >
> > On Fr, 05.02.21 20:58, Maksim Fomin (ma...@fomin.one) wrote:
> >
> > > > You know, we issue the btrfs ioctl, under the assumption that if the
> > > > file is already perfectly defragmented it's a NOP. Are you suggesting
> > > > it isn't a NOP in that case?
> > >
> > > So, what is the reason for defragmenting the journal if BTRFS is
> > > detected? This does not happen at other filesystems. I have read
> > > this thread but have not found a clear answer to this question.
> >
> > btrfs like any file system fragments files with nocow a bit. Without
> > nocow (i.e. with cow) it fragments files horribly, given our write
> > pattern (which is: append something to the end, and update a few
> > pointers in the beginning). By upstream default we set nocow, some
> > downstreams/users undo that however. (this is done via tmpfiles,
> > i.e. journald doesn't actually set nocow ever).
>
> I don't see why it's upstream's problem to solve downstream decisions.
> If they want to (re)enable datacow, then they can also setup some kind
> of service to defragment /var/log/journal/ on a schedule, or they can
> use autodefrag.

It seems cooperative to me that applications advise the filesystem of appropriate optimization opportunities. Taking a step back and looking at what journald is doing, and how and when these journal files are accessed, it doesn't strike me as illogical to tell the fs when archiving that it's a good time to defragment the file.

> > When we archive a journal file (i.e. stop writing to it) we know it
> > will never receive any further writes.
> > It's a good time to undo the
> > fragmentation (we make no distinction whether heavily fragmented,
> > little fragmented or not at all fragmented on this) and thus for the
> > future make access behaviour better, given that we'll still access the
> > file regularly (because archiving in journald doesn't mean we stop
> > reading it, it just means we stop writing it — journalctl always
> > operates on the full data set). defragmentation happens in the bg once
> > triggered, it's a simple ioctl you can invoke on a file. if the file
> > is not fragmented it shouldn't do anything.
>
> ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=16777216, flags=0,
> extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0
>
> What 'len' value does journald use?

journald uses BTRFS_IOC_DEFRAG; there is no range argument, it's the whole file.

I'm inclined to agree with Lennart on this looking more like a btrfs issue than a journald issue, based on your claims. journald is arguably Doing The Right Thing by advising btrfs of a defrag opportunity. If btrfs can't usefully defragment the file vs. its layout, it should NOOP the ioctl. If it's producing more fragmented files post-defrag, how is that not a btrfs bug?

Some things I didn't see being considered in your comparisons are filesystem free space, age, and concurrent use. If your comparisons are on fresh filesystems, fragmentation tends to be much lower, as the business of finding contiguous blocks of free space is trivial. Once the filesystem has aged enough to churn through the available space, fragmentation increases substantially. When journald is the only writer on an otherwise idle filesystem, it's less likely to have its allocations interrupted by allocations to other writers.

To make meaningful measurements of fragmentation and the necessity of telling the fs "hey, now's a good time to defrag this file I'm no longer going to write to", you need to look at more worst case scenarios, not best case.
On a different note, I feel like there's an unnecessarily combative tone to this discussion. Maybe it's just me, but it deterred me from participating up until this point.

Regards, Vito Caputo
Re: [systemd-devel] consider dropping defrag of journals on btrfs
More data points.

1. An ext4 file system with a 112M system.journal; it has 15 extents. From FIEMAP we can pretty much see it's really made from 14 8MB extents, consistent with multiple appends. And it's the exact same behavior seen on Btrfs with nodatacow journals. https://pastebin.com/6vuufwXt

2. A Btrfs file system with a 24MB system.journal, nodatacow, 4 extents. The fragments are consistent with #1 as a result of nodatacow journals. https://pastebin.com/Y18B2m4h

3. Continuing from #2, 'journalctl --rotate' strace shows this results in:

ioctl(31, BTRFS_IOC_DEFRAG) = 0

filefrag shows the result, 17 extents. But this is misleading, because 9 of them are in the same position as before, so it seems to be a minimalist defragment. Btrfs did what was requested, but with both limited impact and efficacy, at least on nodatacow files having minimal fragmentation to begin with. https://pastebin.com/1ufErVMs

4. Continuing from #3, 'btrfs fi defrag -l 32M' pointed at this same file results in a single-extent file. strace shows this uses

ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=33554432, flags=0, extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0

and filefrag shows the single extent mapping: https://pastebin.com/429fZmNB

While this is a numeric improvement (no fragmentation), again there's no proven advantage to defragmenting nodatacow journals on Btrfs. It's just needlessly contributing to write amplification.

--

The original commit description only mentions COW; it doesn't mention being predicated on nodatacow. In effect commit f27a386430cc7a27ebd06899d93310fb3bd4cee7 is obviated by commit 3a92e4ba470611ceec6693640b05eb248d62e32d four months later. I don't think they were ever intended to be used together, and combining them seems accidental. Defragmenting datacow files makes some sense on rotating media. But that's the exception, not the rule.
-- Chris Murphy
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fri, Feb 5, 2021 at 3:55 PM Lennart Poettering wrote:
>
> On Fr, 05.02.21 20:58, Maksim Fomin (ma...@fomin.one) wrote:
>
> > > You know, we issue the btrfs ioctl, under the assumption that if the
> > > file is already perfectly defragmented it's a NOP. Are you suggesting
> > > it isn't a NOP in that case?
> >
> > So, what is the reason for defragmenting the journal if BTRFS is
> > detected? This does not happen at other filesystems. I have read
> > this thread but have not found a clear answer to this question.
>
> btrfs like any file system fragments files with nocow a bit. Without
> nocow (i.e. with cow) it fragments files horribly, given our write
> pattern (which is: append something to the end, and update a few
> pointers in the beginning). By upstream default we set nocow, some
> downstreams/users undo that however. (this is done via tmpfiles,
> i.e. journald doesn't actually set nocow ever).

I don't see why it's upstream's problem to solve downstream decisions. If they want to (re)enable datacow, then they can also set up some kind of service to defragment /var/log/journal/ on a schedule, or they can use autodefrag.

> When we archive a journal file (i.e. stop writing to it) we know it
> will never receive any further writes. It's a good time to undo the
> fragmentation (we make no distinction whether heavily fragmented,
> little fragmented or not at all fragmented on this) and thus for the
> future make access behaviour better, given that we'll still access the
> file regularly (because archiving in journald doesn't mean we stop
> reading it, it just means we stop writing it — journalctl always
> operates on the full data set). defragmentation happens in the bg once
> triggered, it's a simple ioctl you can invoke on a file. if the file
> is not fragmented it shouldn't do anything.

ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=16777216, flags=0, extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0

What 'len' value does journald use?
> other file systems simply have no such ioctl, and they never fragment
> as terribly as btrfs can fragment. hence we don't call that ioctl.

I did explain how to avoid the fragmentation in the first place, to obviate the need to defragment.

1. nodatacow. journald does this already.

2. fallocate the intended final journal file size from the start, instead of growing the files in 8MB increments.

3. Don't reflink copy (including snapshot) the journals. This arguably is not journald's responsibility, but as it creates both the journal/ directory and the $MACHINEID directory, it could make one or both of them subvolumes instead, to ensure they're not subject to snapshotting from above.

> I'd even be fine dropping it
> entirely, if someone actually can show the benefits of having the
> files unfragmented when archived don't outweigh the downside of
> generating some iops when executing the defragmentation.

I showed that the archived journals have way more fragmentation than active journals. And the fragments in active journals are insignificant, and can even be reduced by fully allocating the journal file to final size rather than appending - which has a good chance of fragmenting the file on any file system, not just Btrfs.

Further, even *despite* this worse fragmentation of the archived journals, bcc-tools fileslower shows no meaningful latency as a result. I wrote this in the previous email. I don't understand what you want me to show you.

And since journald offers no ability to disable the defragment on Btrfs, I can't really do a longer term A/B comparison, can I?

> i.e. someone
> does some profiling, on both ssd and rotating media. Apparently no one
> who cares about this wants to do such research though, and hence I
> remain deeply unimpressed. Let's not try to do such optimizations
> without any data that actually shows it betters things.

I did provide data.
That you don't like what the data shows - archived journals have more fragments than active journals - is not my fault. The existing "optimization" is making things worse, in addition to adding a pile of unnecessary writes upon journal rotation.

Conversely, you have not provided data proving that nodatacow fallocated files on Btrfs are any more fragmented than fallocated files on ext4 or XFS. 2-17 fragments on ext4: https://pastebin.com/jiPhrDzG https://pastebin.com/UggEiH2J

That behavior is no different for nodatacow fallocated journals on Btrfs. There's no point in defragmenting these no matter the file system. I don't have to profile this on HDD; I know that even in the best case you're not likely (and certainly not guaranteed) to get fewer fragments than this. Defrag on Btrfs is for the thousands-of-fragments case, which is what you get with datacow journals.

-- Chris Murphy
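The "fallocate the final size up front" alternative from Chris's list can be sketched with posix_fallocate; the function names and constants are mine, for illustration:

```python
import os

APPEND_CHUNK = 8 * 2**20   # journald's current growth increment
FINAL_SIZE = 128 * 2**20   # the eventual rotation threshold

def grow_in_chunks(fd: int, target: int, chunk: int = APPEND_CHUNK) -> None:
    """Current behaviour: extend the file chunk by chunk as it fills.
    Each fallocate may land in a different free-space region, so the file
    tends toward roughly one fragment per chunk (the 14x8MB extents seen
    in the ext4 FIEMAP earlier in the thread)."""
    size = os.fstat(fd).st_size
    while size < target:
        os.posix_fallocate(fd, size, chunk)
        size += chunk

def preallocate_final(fd: int, target: int = FINAL_SIZE) -> None:
    """Proposed alternative: one up-front allocation, giving the allocator
    the best chance of a single contiguous extent, at the cost of
    reserving space the file may never use."""
    os.posix_fallocate(fd, 0, target)
```

The tradeoff is exactly the one Lennart raises later in the thread: up-front allocation minimizes fragments, but journals are often rotated well before reaching final size, so the reserved-but-unused space eats into the disk-usage budget.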
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fr, 05.02.21 20:58, Maksim Fomin (ma...@fomin.one) wrote:

> > You know, we issue the btrfs ioctl, under the assumption that if the
> > file is already perfectly defragmented it's a NOP. Are you suggesting
> > it isn't a NOP in that case?
>
> So, what is the reason for defragmenting the journal if BTRFS is
> detected? This does not happen at other filesystems. I have read
> this thread but have not found a clear answer to this question.

btrfs like any file system fragments files with nocow a bit. Without nocow (i.e. with cow) it fragments files horribly, given our write pattern (which is: append something to the end, and update a few pointers in the beginning). By upstream default we set nocow, some downstreams/users undo that however. (this is done via tmpfiles, i.e. journald doesn't actually set nocow ever).

When we archive a journal file (i.e. stop writing to it) we know it will never receive any further writes. It's a good time to undo the fragmentation (we make no distinction whether heavily fragmented, little fragmented or not at all fragmented on this) and thus for the future make access behaviour better, given that we'll still access the file regularly (because archiving in journald doesn't mean we stop reading it, it just means we stop writing it — journalctl always operates on the full data set). defragmentation happens in the bg once triggered, it's a simple ioctl you can invoke on a file. if the file is not fragmented it shouldn't do anything.

other file systems simply have no such ioctl, and they never fragment as terribly as btrfs can fragment. hence we don't call that ioctl.

i'd be fine avoiding the ioctl if we knew for sure the file is at worst mildly fragmented, but apparently btrfs is too broken to be able to implement something like that. I'd even be fine dropping it entirely, if someone actually can show the benefits of having the files unfragmented when archived don't outweigh the downside of generating some iops when executing the defragmentation.
i.e. someone does some profiling, on both ssd and rotating media. Apparently no one who cares about this wants to do such research though, and hence I remain deeply unimpressed. Let's not try to do such optimizations without any data that actually shows it betters things.

Lennart

-- Lennart Poettering, Berlin
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fr, 05.02.21 16:16, Phillip Susi (ph...@thesusis.net) wrote:

> Lennart Poettering writes:
>
> > Nope. We always interleave stuff. We currently open all journal files
> > in parallel. The system one and the per-user ones, the current ones
> > and the archived ones.
>
> Wait... every time you look at the journal at all, it has to read back
> through ALL of the archived journals, even if you are only interested in
> information since the last boot that just happened 5 minutes ago?

no, we do not iterate through them. we just read some metadata off the header.

Lennart

-- Lennart Poettering, Berlin
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Lennart Poettering writes:

> journalctl gives you one long continuous log stream, joining everything
> available, archived or not, into one big interleaved stream.

If you ask for everything, yes... but if you run journalctl -b then shouldn't it only read back until it finds the start of the current boot?
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fr, 05.02.21 20:43, Dave Howorth (syst...@howorth.org.uk) wrote:

> 128 MB files, and I might allocate an extra MB or two for overhead, I
> don't know. So when it first starts there'll be 128 MB allocated and
> 384 MB free. In stable state there'll be 512 MB allocated and nothing
> free. One 128 MB allocated and slowly being used. 384 MB full of
> archive files. You always have between 384 MB and 512 MB of logs
> stored. I don't understand where you're getting your numbers from.

As mentioned elsewhere: we typically have to remove two "almost 128M" files to get space for "exactly 128M" of guaranteed space. And you know, each user gets their own journal. Hence, once a single user logs a single line another 128M are gone, and if another user then does it, bam, another 128M is gone. We can't eat space away like that.

> If you can't figure out which parts of an archived file are useful and
> which aren't then why are you keeping them? Why not just delete them?
> And if you can figure it out then why not do so and compact the useful
> information into the minimum storage?

We archive for multiple reasons: because the file was dirty when we started up (in which case there apparently was an abnormal shutdown of the system or journald), or because we rotate and start a new file (or a time change or whatnot).

In the first ("dirty") case we don't touch the file at all, because it's likely corrupt and we don't want to corrupt it further. We just rename it so that it gets "~" at the end. When we archive the "clean" way we mark the file internally as archived, but before that we sync everything to disk, so that we know for sure it's all in a good state, and then we don't touch it anymore.

"journalctl" will process all these files, regardless if "dirty" archived or "clean" archived. It tries hard to make the best of these files, and has various code paths to make sure we don't get confused by half-written files, and can use as much as possible of the parts that were written correctly.
hence, that's why we don't delete corrupted files: because we use as much of them as we can. Why? Because usually the logs shortly before your system died abnormally are the most interesting.

> > Because fs metadata, and because we don't always write files in
> > full. I mean, we often do not, because we start a new file *before*
> > the file would grow beyond the threshold. this typically means that
> > it's typically not enough to delete a single file to get the space we
> > need for a full new one, we usually need to delete two.
>
> Why would you start a new file before the old one is full?

Various reasons: the user asked for rotation or vacuuming, abnormal shutdown, a time change (we want individual files to be monotonically ordered), …

Lennart

-- Lennart Poettering, Berlin
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Maksim Fomin writes:

> I would say it depends on whether the defragmentation issues are a
> feature of btrfs. As Chris mentioned, if the root fs is snapshotted,
> 'defragmenting' the journal can actually increase fragmentation. This
> is an example where the problem is caused by a feature (not a bug) in
> btrfs. For example, my 'system.journal' file is currently 16 MB and
> according to filefrag it has 1608 extents (consequence of snapshotted
> rootfs?). It looks too much, if I am not missing some technical

Holy smokes! How did btrfs manage to butcher that poor file that badly? It shouldn't be possible for it to be *that* bad. I mean, that's only an average of 10 KB per fragment!
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Dave Howorth writes:

> PS I'm subscribed to the list. I don't need a copy.

FYI, rather than ask others to go out of their way when replying to you, you should configure your mail client to set the Reply-To: header to point to the mailing list address, so that other people's mail clients do what you want automatically.
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Lennart Poettering writes:

> Nope. We always interleave stuff. We currently open all journal files
> in parallel. The system one and the per-user ones, the current ones
> and the archived ones.

Wait... every time you look at the journal at all, it has to read back through ALL of the archived journals, even if you are only interested in information since the last boot that just happened 5 minutes ago?
Re: [systemd-devel] consider dropping defrag of journals on btrfs
‐‐‐ Original Message ‐‐‐ On Friday, February 5, 2021 3:23 PM, Lennart Poettering wrote:

> On Do, 04.02.21 12:51, Chris Murphy (li...@colorremedies.com) wrote:
>
> > On Thu, Feb 4, 2021 at 6:49 AM Lennart Poettering
> > lenn...@poettering.net wrote:
> >
> > > You want to optimize write patterns, I understand, i.e. minimize
> > > iops. Hence start with profiling iops, i.e. what defrag actually costs,
> > > and then weigh that against the reduced access time when accessing the
> > > files. In particular on rotating media.
> >
> > A nodatacow journal on Btrfs is no different than a journal on ext4 or
> > xfs. So I don't understand why you think you also need to defragment
> > the file, only on Btrfs. You cannot do better than you already are
> > with a nodatacow file. That file isn't going to get any more fragmented
> > in use than it was at creation.
>
> You know, we issue the btrfs ioctl, under the assumption that if the
> file is already perfectly defragmented it's a NOP. Are you suggesting
> it isn't a NOP in that case?

So, what is the reason for defragmenting the journal if BTRFS is detected? This does not happen at other filesystems. I have read this thread but have not found a clear answer to this question.

> > But it gets worse. The way systemd-journald is submitting the journals
> > for defragmentation is making them more fragmented than just leaving
> > them alone.
>
> Sounds like a bug in btrfs? systemd is not the place to hack around
> btrfs bugs?

I would say it depends on whether the defragmentation issues are a feature of btrfs. As Chris mentioned, if the root fs is snapshotted, 'defragmenting' the journal can actually increase fragmentation. This is an example where the problem is caused by a feature (not a bug) in btrfs. For example, my 'system.journal' file is currently 16 MB and according to filefrag it has 1608 extents (a consequence of the snapshotted rootfs?).
It looks like too much, if I am not missing some technical details (perhaps a filefrag 'extent' is not a real extent in the case of this fs?). Even if it is a bug in btrfs, it would make sense to temporarily disable the policy of 'defragmenting only on BTRFS' in systemd.

I am interested in this issue because for some time (probably from late 2017 till late 2019) I had strange issues with systemd-journald crashing at boot time because of archiving/defragmenting the journal. The setup was as follows: btrfs on an external hd (not ssd) with full disk encryption. After mistaken disconnection of the mounted disk (but not in all such cases) systemd-journald caused a very long stall of the boot process because of the following loop: systemd-journald tries to archive/defragment journal files -> it crashes for some reason -> systemd restarts systemd-journald -> it starts archiving/defragmenting journal files -> it crashes again -> systemd restarts systemd-journald (my understanding of the logs after boot). Eventually this loop breaks and the boot process continues. After login I see that the journal data is fine - at least there is no evidence of journal data corruption, so I presume it was caused by the archiving/defragmentation policy on btrfs.

I used this disk with an ext4 filesystem from 2014 to 2017 and never had any problem like that. Eventually I decided to buy a better disk and this problem vanished, but why systemd defragments the journal only on btrfs remained a mystery to me.
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fri, 5 Feb 2021 17:44:14 +0100 Lennart Poettering wrote:

> On Fr, 05.02.21 16:06, Dave Howorth (syst...@howorth.org.uk) wrote:
>
> > On Fri, 5 Feb 2021 16:23:02 +0100
> > Lennart Poettering wrote:
> > > I don't think that makes much sense: we rotate and start new
> > > files for a multitude of reasons, such as size overrun, time
> > > jumps, abnormal shutdown and so on. If we'd always leave a fully
> > > allocated file around people would hate us...
> >
> > I'm not sure about that. The file is eventually going to grow to
> > 128 MB so if there isn't space for it, I might as well know right
> > now as later. And it's not like the space will be available for
> > anything else, it's left free for exactly this log file.
>
> let's say you assign 500M space to journald. If you allocate 128M at a
> time, this means the effective unused space is anything between 1M and
> 255M, leaving just 256M of logs around. it's probably surprising that
> you only end up with 255M of logs when you asked for 500M. I'd claim
> that's really shitty behaviour.

If you assign 500 MB for something that accommodates multiples of 128 MB then you're not very bright :) 512 MB by contrast can accommodate four 128 MB files, and I might allocate an extra MB or two for overhead, I don't know. So when it first starts there'll be 128 MB allocated and 384 MB free. In stable state there'll be 512 MB allocated and nothing free: one 128 MB file allocated and slowly being used, and 384 MB full of archive files. You always have between 384 MB and 512 MB of logs stored. I don't understand where you're getting your numbers from.

BTW, I expect my linux systems to stay up from when they're booted until I tell them to stop, and that's usually quite a while.

> > Or are you talking about left over files after some exceptional
> > event that are only part full? If so, then just deallocate the
> > unwanted empty space from them after you've recovered from the
> > exceptional event.
> > Nah, it doesn't work like this: if a journal file isn't marked clean, > i.e. was left in some half-written state we won't touch it, but just > archive it and start a new one. We don't know how much was correctly > written and how much was not, hence we can't sensibly truncate it. The > kernel after all is entirely free to decide in which order it syncs > written blocks to disk, and hence it quite often happens that stuff at > the end got synced while stuff in the middle didn't. If you can't figure out which parts of an archived file are useful and which aren't then why are you keeping them? Why not just delete them? And if you can figure it out then why not do so and compact the useful information into the minimum storage? > > > Also, we vacuum old journals when allocating and the size > > > constraints are hit. i.e. if we detect that adding 8M to journal > > > file X would mean the space used by all journals together would > > > be above the configured disk usage limits we'll delete the oldest > > > journal files we can, until we can allocate 8M again. And we do > > > this each time. If we'd allocate the full file all the time this > > > means we'll likely remove ~256M of logs whenever we start a new > > > file. And that's just shitty behaviour. > > > > No it's not; it's exactly what happens most of the time, because all > > the old log files are exactly the same size because that's why they > > were rolled over. So freeing just one of those gives exactly the > > right size space for the new log file. I don't understand why you > > would want to free two? > > Because fs metadata, and because we don't always write files in > full. I mean, we often do not, because we start a new file *before* > the file would grow beyond the threshold. this typically means that > it's typically not enough to delete a single file to get the space we > need for a full new one, we usually need to delete two. Why would you start a new file before the old one is full?
Modulo truly exceptional events. It's a genuine question - I don't think I've ever seen it. And sure fs metadata - that just means allocate a bit extra beyond the round number. > actually it's even worse: btrfs lies in "df": it only updates counters > with uncontrolled latency, hence we might actually delete more than > necessary. Sorry dunno much about btrfs. I'm planning to get rid of it here soon. > Lennart PS I'm subscribed to the list. I don't need a copy. ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
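The disagreement over budget arithmetic above can be made concrete. A small model (my own illustration, not journald's actual bookkeeping) of Lennart's "1M to 255M unused" worst case for whole-file preallocation, versus Dave's approach of choosing a budget that is an exact multiple of the file size:

```python
def worst_case_unused(budget_mb: int, prealloc_mb: int) -> int:
    """Worst-case MB of budget holding no log data at a rotation boundary.

    With whole-file preallocation, a freshly rotated file reserves
    prealloc_mb up front, and the vacuum step may have had to free
    almost a whole additional file to make room for it, so up to
    (2 * prealloc_mb - 1) MB of the budget can be 'wasted'.
    """
    return 2 * prealloc_mb - 1

# Lennart's scenario: 500 MB budget, 128 MB files -> up to 255 MB unused.
assert worst_case_unused(500, 128) == 255

# Dave's scenario: pick the budget as a multiple of the file size.
budget, filesz = 512, 128
assert budget % filesz == 0     # four whole 128 MB files fit exactly
assert budget - filesz == 384   # at least 384 MB of archived logs retained
```

This is one way to reconcile the two sets of numbers: the "shitty behaviour" only arises when the configured budget is not a multiple of the preallocation size.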
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fr, 05.02.21 16:06, Dave Howorth (syst...@howorth.org.uk) wrote: > On Fri, 5 Feb 2021 16:23:02 +0100 > Lennart Poettering wrote: > > I don't think that makes much sense: we rotate and start new files for > > a multitude of reasons, such as size overrun, time jumps, abnormal > > shutdown and so on. If we'd always leave a fully allocated file around > > people would hate us... > > I'm not sure about that. The file is eventually going to grow to 128 MB > so if there isn't space for it, I might as well know right now as > later. And it's not like the space will be available for anything else, > it's left free for exactly this log file. let's say you assign 500M space to journald. If you allocate 128M at a time, this means the effective unused space is anything between 1M and 255M, leaving just 256M of logs around. it's probably surprising that you only end up with 255M of logs when you asked for 500M. I'd claim that's really shitty behaviour. > Or are you talking about left over files after some exceptional event > that are only part full? If so, then just deallocate the unwanted empty > space from them after you've recovered from the exceptional event. Nah, it doesn't work like this: if a journal file isn't marked clean, i.e. was left in some half-written state we won't touch it, but just archive it and start a new one. We don't know how much was correctly written and how much was not, hence we can't sensibly truncate it. The kernel after all is entirely free to decide in which order it syncs written blocks to disk, and hence it quite often happens that stuff at the end got synced while stuff in the middle didn't. > > Also, we vacuum old journals when allocating and the size constraints > > are hit. i.e. if we detect that adding 8M to journal file X would mean > > the space used by all journals together would be above the configured > > disk usage limits we'll delete the oldest journal files we can, until > > we can allocate 8M again. And we do this each time. 
If we'd allocate > > the full file all the time this means we'll likely remove ~256M of > > logs whenever we start a new file. And that's just shitty behaviour. > > No it's not; it's exactly what happens most of the time, because all > the old log files are exactly the same size because that's why they > were rolled over. So freeing just one of those gives exactly the right > size space for the new log file. I don't understand why you would want > to free two? Because fs metadata, and because we don't always write files in full. I mean, we often do not, because we start a new file *before* the file would grow beyond the threshold. this typically means that it's typically not enough to delete a single file to get the space we need for a full new one, we usually need to delete two. actually it's even worse: btrfs lies in "df": it only updates counters with uncontrolled latency, hence we might actually delete more than necessary. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fri, 5 Feb 2021 16:23:02 +0100 Lennart Poettering wrote: > I don't think that makes much sense: we rotate and start new files for > a multitude of reasons, such as size overrun, time jumps, abnormal > shutdown and so on. If we'd always leave a fully allocated file around > people would hate us... I'm not sure about that. The file is eventually going to grow to 128 MB so if there isn't space for it, I might as well know right now as later. And it's not like the space will be available for anything else, it's left free for exactly this log file. Or are you talking about left over files after some exceptional event that are only part full? If so, then just deallocate the unwanted empty space from them after you've recovered from the exceptional event. > Also, we vacuum old journals when allocating and the size constraints > are hit. i.e. if we detect that adding 8M to journal file X would mean > the space used by all journals together would be above the configured > disk usage limits we'll delete the oldest journal files we can, until > we can allocate 8M again. And we do this each time. If we'd allocate > the full file all the time this means we'll likely remove ~256M of > logs whenever we start a new file. And that's just shitty behaviour. No it's not; it's exactly what happens most of the time, because all the old log files are exactly the same size because that's why they were rolled over. So freeing just one of those gives exactly the right size space for the new log file. I don't understand why you would want to free two?
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Fr, 05.02.21 10:24, Phillip Susi (ph...@thesusis.net) wrote: > > Lennart Poettering writes: > > > You are focussing only on the one-time iops generated during archival, > > and are ignoring the extra latency during access that fragmented files > > cost. Show me that the iops reduction during the one-time operation > > matters and the extra latency during access doesn't matter and we can > > look into making changes. But without anything resembling any form of > > profiling we are just blind people in the fog... > > I'm curious why you seem to think that latency accessing old logs is so > important. I would think that old logs tend to be accessed very > rarely. On such a rare occasion, a few extra ms doesn't seem very > important to me. Even if it's on a 5400 rpm drive, typical latency is > what? 8 ms? Even with a fragment every 8 MB, that's only going to add > up to an extra 128 ms to read and parse a 128 MB log file. Even with no > fragments it's going to take over 1 second to read that file, so we're > only talking about a ~11% slowdown here, on an operation that is rare > and you're going to be spending far more time actually looking at the > log than it took to read off the disk. journalctl gives you one long continuous log stream, joining everything available, archived or not, into one big interleaved stream. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Lennart Poettering writes: > You are focussing only on the one-time iops generated during archival, > and are ignoring the extra latency during access that fragmented files > cost. Show me that the iops reduction during the one-time operation > matters and the extra latency during access doesn't matter and we can > look into making changes. But without anything resembling any form of > profiling we are just blind people in the fog... I'm curious why you seem to think that latency accessing old logs is so important. I would think that old logs tend to be accessed very rarely. On such a rare occasion, a few extra ms doesn't seem very important to me. Even if it's on a 5400 rpm drive, typical latency is what? 8 ms? Even with a fragment every 8 MB, that's only going to add up to an extra 128 ms to read and parse a 128 MB log file. Even with no fragments it's going to take over 1 second to read that file, so we're only talking about a ~11% slowdown here, on an operation that is rare and you're going to be spending far more time actually looking at the log than it took to read off the disk.
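Phillip's estimate above is easy to check. A back-of-the-envelope model (my own sketch; the 8 ms seek time and the ~120 MB/s sequential rate are assumptions, the latter not stated in his mail):

```python
def extra_latency_ms(file_mb: int, mb_per_fragment: int, seek_ms: int) -> int:
    """Extra rotating-disk latency, assuming one seek per fragment."""
    return (file_mb // mb_per_fragment) * seek_ms

def sequential_read_ms(file_mb: int, mb_per_s: int) -> float:
    """Time to stream the file with no seeks at all."""
    return file_mb / mb_per_s * 1000

seek = extra_latency_ms(128, 8, 8)     # 16 fragments * 8 ms = 128 ms
assert seek == 128

read = sequential_read_ms(128, 120)    # ~1.07 s, consistent with "over 1 second"
slowdown = seek / read
assert 0.10 < slowdown < 0.13          # roughly the ~11% Phillip cites
```

So the figures in the mail are internally consistent: on a 5400 rpm drive the fully fragmented worst case costs about a tenth of the total read time.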
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Chris Murphy writes: > But it gets worse. The way systemd-journald is submitting the journals > for defragmentation is making them more fragmented than just leaving > them alone. Wait, doesn't it just create a new file, fallocate the whole thing, copy the contents, and delete the original? How can that possibly make fragmentation *worse*? > All of those archived files have more fragments (post defrag) than > they had when they were active. And here is the FIEMAP for the 96MB > file which has 92 fragments. How the heck did you end up with nearly one fragment per MB? > If you want an optimization that's actually useful on Btrfs, > /var/log/journal/ could be a nested subvolume. That would prevent any > snapshots above from turning the nodatacow journals into datacow > journals, which does significantly increase fragmentation (it would in > the exact same case if it were a reflink copy on XFS for that matter). Wouldn't that mean that when you take snapshots, they don't include the logs? That seems like an anti-feature that violates the principle of least surprise. If I make a snapshot of my root, I *expect* it to contain my logs. > I don't get the iops thing at all. What we care about in this case is > latency. A least noticeable latency of around 150ms seems reasonable > as a starting point, that's where users realize a delay between a key > press and a character appearing. However, if I check for 10ms latency > (using bcc-tools fileslower) when reading all of the above journals at > once: > > $ sudo journalctl -D > /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager > > Not a single report. None. Nothing took even 10ms. And those journals > are more fragmented than your 20 in a 100MB file. > > I don't have any hard drives to test this on. This is what, 10% of the > market at this point? The best you can do there is the same as on SSD. The above sounded like great data, but not if it was done on SSD. Of course it doesn't cause latency on an SSD. 
I don't know about market trends, but I stopped trusting my data to SSDs a few years ago when my ext4 fs kept being corrupted and it appeared that the FTL of the drive was randomly swapping the contents of different sectors around when I found things like the contents of a text file in a block of the inode table or a directory. > You can't depend on sysfs to conditionally do defragmentation on only > rotational media, too many fragile media claim to be rotating. It sounds like you are arguing that it is better to do the wrong thing on all SSDs rather than do the right thing on ones that aren't broken. > Looking at the two original commits, I think they were always in > conflict with each other, happening within months of each other. They > are independent ways of dealing with the same problem, where only one > of them is needed. And the best of the two is fallocate+nodatacow > which makes the journals behave the same as on ext4 where you also > don't do defragmentation. This makes sense.
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Do, 04.02.21 12:51, Chris Murphy (li...@colorremedies.com) wrote: > On Thu, Feb 4, 2021 at 6:49 AM Lennart Poettering > wrote: > > > You want to optimize write patterns, I understand, i.e. minimize > > iops. Hence start with profiling iops, i.e. what defrag actually costs > > and then weigh that against the reduced access time when accessing the > > files. In particular on rotating media. > > A nodatacow journal on Btrfs is no different than a journal on ext4 or > xfs. So I don't understand why you think you *also* need to defragment > the file, only on Btrfs. You cannot do better than you already are > with a nodatacow file. That file isn't going to get any more fragmented > in use than it was at creation. You know, we issue the btrfs ioctl, under the assumption that if the file is already perfectly defragmented it's a NOP. Are you suggesting it isn't a NOP in that case? > If you want to do better, maybe stop appending in 8MB increments? > Every time you append it's another extent. Since apparently the > journal files can max out at 128MB before they are rotated, why aren't > they created 128MB from the very start? That would have a decent > chance of getting you a file that's 1-4 extents, and it's not going to > have more extents than that. You know, there are certainly "perfect" ways to adjust our writing scheme to match some specific file system on some specific storage matching some specific user pattern. Thing is, though, what might be ideal for some fs and some user might be terrible for another fs or another user. We try to find some compromise in the middle, that might not result in "perfect" behaviour everywhere, but at least reasonable behaviour. > Presumably the currently active journal not being fragmented is more > important than archived journals, because searches will happen on > recent events more than old events. Right? Nope. We always interleave stuff. We currently open all journal files in parallel. 
The system one and the per-user ones, the current ones and the archived ones. > So if you're going to say > fragmentation matters at all, maybe stop intentionally fragmenting the > active journal? We are not *intentionally* fragmenting. Please don't argue on that level. Not helpful, man. > Just fallocate the max size it's going to be right off > the bat? Doesn't matter what file system it is. Once that 128MB > journal is full, leave it alone, and rotate to a new 128M file. The > append is what's making them fragmented. I don't think that makes much sense: we rotate and start new files for a multitude of reasons, such as size overrun, time jumps, abnormal shutdown and so on. If we'd always leave a fully allocated file around people would hate us... The 8M increase is a middle ground: we don't allocate space for each log message, and we don't allocate space for everything at once. We allocate medium-sized chunks at a time. Also, we vacuum old journals when allocating and the size constraints are hit. i.e. if we detect that adding 8M to journal file X would mean the space used by all journals together would be above the configured disk usage limits we'll delete the oldest journal files we can, until we can allocate 8M again. And we do this each time. If we'd allocate the full file all the time this means we'll likely remove ~256M of logs whenever we start a new file. And that's just shitty behaviour. > But it gets worse. The way systemd-journald is submitting the journals > for defragmentation is making them more fragmented than just leaving > them alone. Sounds like a bug in btrfs? systemd is not the place to hack around btrfs bugs? > If you want an optimization that's actually useful on Btrfs, > /var/log/journal/ could be a nested subvolume. 
That would prevent any > snapshots above from turning the nodatacow journals into datacow > journals, which does significantly increase fragmentation (it would in > the exact same case if it were a reflink copy on XFS for that > matter). Not sure what the point of that would be... at least when systemd does snapshots (i.e. systemd-nspawn --template= and so on) they are of course recursive, so what'd be the point of doing a subvolume there? > > Somehow I think you are missing what I am asking for: some data that > > actually shows your optimization is worth it: i.e. that leaving the > > files fragmented doesn't hurt access to the journal badly, and that the > > number of iops is substantially lowered at the same time. > > I don't get the iops thing at all. What we care about in this case is > latency. A least noticeable latency of around 150ms seems reasonable > as a starting point, that's where users realize a delay between a key > press and a character appearing. However, if I check for 10ms latency > (using bcc-tools fileslower) when reading all of the above journals at > once: > > $ sudo journalctl -D > /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager > > Not a single report. None. Nothing took even 10ms. And those journals > are more fragmented than your 20 in a 100MB file.
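The vacuum-before-allocate behaviour Lennart describes in this message can be sketched as a simple loop (a simplified model of my own, not systemd's actual code; real journald also weighs free-space floors, file ages, and sealed files):

```python
from collections import deque

def vacuum_then_allocate(archived: deque, active_size: int,
                         budget: int, chunk: int = 8) -> int:
    """Drop oldest archived journals until `chunk` more MB fits the budget.

    `archived` holds archived journal sizes in MB, oldest first.
    Returns the active file's new size after growing by one chunk.
    """
    while archived and sum(archived) + active_size + chunk > budget:
        archived.popleft()   # vacuum: delete the oldest journal file
    return active_size + chunk

# 500 MB budget, three 128 MB archives, active file at 120 MB:
# growing by one 8 MB chunk forces exactly the oldest archive out.
archives = deque([128, 128, 128])
new_size = vacuum_then_allocate(archives, 120, 500)
assert new_size == 128
assert list(archives) == [128, 128]
```

This also illustrates Lennart's objection to 128 MB whole-file preallocation: with `chunk = 128` the same loop would have to vacuum two 128 MB archives at once instead of one.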
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Thu, Feb 4, 2021 at 6:49 AM Lennart Poettering wrote: > You want to optimize write patterns, I understand, i.e. minimize > iops. Hence start with profiling iops, i.e. what defrag actually costs > and then weigh that against the reduced access time when accessing the > files. In particular on rotating media. A nodatacow journal on Btrfs is no different than a journal on ext4 or xfs. So I don't understand why you think you *also* need to defragment the file, only on Btrfs. You cannot do better than you already are with a nodatacow file. That file isn't going to get any more fragmented in use than it was at creation. If you want to do better, maybe stop appending in 8MB increments? Every time you append it's another extent. Since apparently the journal files can max out at 128MB before they are rotated, why aren't they created 128MB from the very start? That would have a decent chance of getting you a file that's 1-4 extents, and it's not going to have more extents than that. Presumably the currently active journal not being fragmented is more important than archived journals, because searches will happen on recent events more than old events. Right? So if you're going to say fragmentation matters at all, maybe stop intentionally fragmenting the active journal? Just fallocate the max size it's going to be right off the bat? Doesn't matter what file system it is. Once that 128MB journal is full, leave it alone, and rotate to a new 128M file. The append is what's making them fragmented. But it gets worse. The way systemd-journald is submitting the journals for defragmentation is making them more fragmented than just leaving them alone. https://drive.google.com/file/d/1FhffN4WZZT9gZTnG5VWongWJgPG_nlPF/view?usp=sharing All of those archived files have more fragments (post defrag) than they had when they were active. And here is the FIEMAP for the 96MB file which has 92 fragments. 
https://drive.google.com/file/d/1Owsd5DykNEkwucIPbKel0qqYyS134-tB/view?usp=sharing I don't know if it's a bug with the submitted target size by sd-journald, or if it's a bug in Btrfs. But it doesn't really matter. There is no benefit to defragmenting nodatacow journals that were fallocated upon creation. If you want an optimization that's actually useful on Btrfs, /var/log/journal/ could be a nested subvolume. That would prevent any snapshots above from turning the nodatacow journals into datacow journals, which does significantly increase fragmentation (it would in the exact same case if it were a reflink copy on XFS for that matter). > No, but doing this once in a big linear stream when the journal is > archived might not be so bad if then later on things are much faster > to access for all future because the files aren't fragmented. Ok, well, in practice it's worse than doing nothing, so I'm suggesting doing nothing. > Somehow I think you are missing what I am asking for: some data that > actually shows your optimization is worth it: i.e. that leaving the > files fragmented doesn't hurt access to the journal badly, and that the > number of iops is substantially lowered at the same time. I don't get the iops thing at all. What we care about in this case is latency. A least noticeable latency of around 150ms seems reasonable as a starting point, that's where users realize a delay between a key press and a character appearing. However, if I check for 10ms latency (using bcc-tools fileslower) when reading all of the above journals at once: $ sudo journalctl -D /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager Not a single report. None. Nothing took even 10ms. And those journals are more fragmented than your 20 in a 100MB file. I don't have any hard drives to test this on. This is what, 10% of the market at this point? The best you can do there is the same as on SSD. 
You can't depend on sysfs to conditionally do defragmentation on only rotational media, too many fragile media claim to be rotating. And by the way, I use Btrfs on SD card on a Raspberry Pi Zero of all things. The cards last longer than with other file systems due to net lower write amplification from native compression. I wouldn't be surprised if the cards fail sooner if I weren't using compression. But who knows, maybe Btrfs write amplification compared to ext4 and xfs constant journaling ends up being a wash. There are a number of embedded use cases for Btrfs as well. Is compressed F2FS better? Probably. They have a solution for the wandering trees problem, but also no snapshots or data checksumming. But I also don't think any of that is super relevant to the overall topic, I just provide this as a counter-argument to the idea that Btrfs isn't appropriate for small cheap storage devices. > The thing is that we tend to have few active > files and many archived > files, and since we interleave stuff our access patterns are pretty > bad already, so we don't want to spend even more time on paying for > extra bad access patterns because the archived files are fragmented.
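Chris's point about append-growth can be demonstrated directly: journald extends the active file with fallocate in 8 MiB steps, and each extension is a fresh allocation request, hence a potential new extent. A minimal sketch of the two growth strategies he contrasts (illustration only; actual extent counts depend on the filesystem and free-space layout, so they cannot be asserted portably):

```python
import os
import tempfile

MB = 1024 * 1024

def grow_incrementally(path: str, total: int, step: int = 8 * MB) -> None:
    """Extend the file `step` bytes at a time, as journald does."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        size = 0
        while size < total:
            os.posix_fallocate(fd, size, min(step, total - size))
            size += step
    finally:
        os.close(fd)

def grow_at_once(path: str, total: int) -> None:
    """Reserve the full target size in a single allocation request."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, total)
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    a = os.path.join(d, "a.journal")
    b = os.path.join(d, "b.journal")
    grow_incrementally(a, 32 * MB)   # many small allocation requests
    grow_at_once(b, 32 * MB)         # one large allocation request
    assert os.path.getsize(a) == os.path.getsize(b) == 32 * MB
```

Both files end up the same size; the dispute in the thread is purely about how many extents each allocation pattern leaves behind, which `filefrag -v` (or the FIEMAP dumps Chris posted) would show.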
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Lennart Poettering writes: > Well, at least on my system here there are still like 20 fragments per > file. That's not nothin? In a 100 MB file? It could be better, but I very much doubt you're going to notice a difference after defragmenting that. I may be the nut that rescued the old ext2 defrag utility from the dustbin of history, but even I have to admit that it isn't really important to use and there is a reason why the Linux community abandoned it.
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Mi, 03.02.21 23:11, Chris Murphy (li...@colorremedies.com) wrote: > On Wed, Feb 3, 2021 at 9:46 AM Lennart Poettering > wrote: > > > > Performance is terrible if cow is used on journal files while we write > > them. > > I've done it for a year on NVMe. The latency is so low, it doesn't > matter. Maybe do it on rotating media... > > It would be great if we could turn datacow back on once the files are > > archived, and then take benefit of compression/checksumming and > > stuff. not sure if there's any sane API for that in btrfs besides > > rewriting the whole file, though. Anyone knows? > > A compressed file results in a completely different encoding and > extent size, so it's a complete rewrite of the whole file, regardless > of the cow/nocow status. > > Without compression it'd be a rewrite because in effect it's a > different extent type that comes with checksums. i.e. a reflink copy > of a nodatacow file can only be a nodatacow file; a reflink copy of a > datacow file can only be a datacow file. The conversion between them > is basically 'cp --reflink=never' and you get a complete rewrite. > > But you get a complete rewrite of extents by submitting for > defragmentation too, depending on the target extent size. > > It is possible to do what you want by no longer setting nodatacow on > the enclosing dir. Create a 0 length journal file, set nodatacow on > that file, then fallocate it. That gets you a nodatacow active > journal. And then you can just duplicate it in place with a new name, > and the result will be datacow and automatically compressed if > compression is enabled. > > But the write hit has already happened by writing journal data into > this journal file during its lifetime. Just rename it on rotate. > That's the least IO impact possible at this point. Defragmenting it > means even more writes, and not much of a gain if any, unless it's > datacow which isn't the journald default. 
You are focussing only on the one-time iops generated during archival, and are ignoring the extra latency during access that fragmented files cost. Show me that the iops reduction during the one-time operation matters and the extra latency during access doesn't matter and we can look into making changes. But without anything resembling any form of profiling we are just blind people in the fog... Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Mi, 03.02.21 22:51, Chris Murphy (li...@colorremedies.com) wrote: > > > Since systemd-journald sets nodatacow on /var/log/journal the journals > > > don't really fragment much. I typically see 2-4 extents for the life > > > of the journal, depending on how many times it's grown, in what looks > > > like 8MiB increments. The defragment isn't really going to make any > > > improvement on that, at least not worth submitting it for additional > > > writes on SSD. While laptop and desktop SSD/NVMe can handle such a > > > small amount of extra writes with no meaningful impact to wear, it > > > probably does have an impact on much more low end flash like USB > > > sticks, eMMC, and SD Cards. So I figure, let's just drop the > > > defragmentation step entirely. > > > > Quite frankly, given how iops-expensive btrfs is, one probably > > shouldn't choose btrfs for such small devices anyway. It's really not > > where btrfs shines, last time I looked. > > Btrfs aggressively delays metadata and data allocation, so I don't > agree that it's expensive. It's not a matter of agreeing or not. Last time people showed me benchmarks (which admittedly was 2 or 3 years ago), the number of iops for typical workloads is typically twice as much as on ext4. Which I don't really want to criticize, it's just the way that it is. I mean, maybe they managed to lower the iops since then, but it's not a matter of "agreeing", it's a matter of showing benchmarks that indicate this is not a problem anymore. > But in any case, reading a journal file and rewriting it out, which is > what defragment does, doesn't really have any benefit given the file > doesn't fragment much anyway due to (a) nodatacow and (b) fallocate, > which is what systemd-journald does on Btrfs. Well, at least on my system here there are still like 20 fragments per file. That's not nothin? > > Did you actually check the iops this generates? > > I don't understand the relevance. 
You want to optimize write patterns, I understand, i.e. minimize iops. Hence start with profiling iops, i.e. what defrag actually costs and then weigh that against the reduced access time when accessing the files. In particular on rotating media. > > Not sure it's worth doing these kinds of optimizations without any hard > > data how expensive this really is. It would be premature. > > Submitting the journal for defragment in effect duplicates the > journal. Read all extents, and rewrite those blocks to a new location. > It's doubling the writes for that journal file. It's not like the > defragment is free. No, but doing this once in a big linear stream when the journal is archived might not be so bad if then later on things are much faster to access for all future because the files aren't fragmented. > Somehow I think you're missing what I've been asking for, which is to stop > the unnecessary defragment step because it's not an optimization. It > doesn't meaningfully reduce fragmentation at all, it just adds write > amplification. Somehow I think you are missing what I am asking for: some data that actually shows your optimization is worth it: i.e. that leaving the files fragmented doesn't hurt access to the journal badly, and that the number of iops is substantially lowered at the same time. The thing is that we tend to have few active files and many archived files, and since we interleave stuff our access patterns are pretty bad already, so we don't want to spend even more time on paying for extra bad access patterns because the archived files are fragmented. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Wed, Feb 3, 2021 at 9:46 AM Lennart Poettering wrote: > > Performance is terrible if cow is used on journal files while we write > them. I've done it for a year on NVMe. The latency is so low, it doesn't matter. > It would be great if we could turn datacow back on once the files are > archived, and then take benefit of compression/checksumming and > stuff. not sure if there's any sane API for that in btrfs besides > rewriting the whole file, though. Anyone knows? A compressed file results in a completely different encoding and extent size, so it's a complete rewrite of the whole file, regardless of the cow/nocow status. Without compression it'd be a rewrite because in effect it's a different extent type that comes with checksums. i.e. a reflink copy of a nodatacow file can only be a nodatacow file; a reflink copy of a datacow file can only be a datacow file. The conversion between them is basically 'cp --reflink=never' and you get a complete rewrite. But you get a complete rewrite of extents by submitting for defragmentation too, depending on the target extent size. It is possible to do what you want by no longer setting nodatacow on the enclosing dir. Create a 0 length journal file, set nodatacow on that file, then fallocate it. That gets you a nodatacow active journal. And then you can just duplicate it in place with a new name, and the result will be datacow and automatically compressed if compression is enabled. But the write hit has already happened by writing journal data into this journal file during its lifetime. Just rename it on rotate. That's the least IO impact possible at this point. Defragmenting it means even more writes, and not much of a gain if any, unless it's datacow which isn't the journald default. -- Chris Murphy
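The recipe Chris gives (create a zero-length file, set nodatacow on it, then fallocate) maps to the FS_IOC_SETFLAGS ioctl, i.e. chattr +C. A hedged sketch (the constants are taken from linux/fs.h for 64-bit Linux; the flag is honoured only on Btrfs and only while the file still has no data blocks, so this function is illustrative and is not exercised against a real Btrfs mount here):

```python
import fcntl
import os
import struct

# From <linux/fs.h>; values assumed correct for 64-bit Linux.
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL = 0x00800000  # the chattr +C attribute

def create_nocow_journal(path: str, size: int) -> None:
    """Create an empty file, mark it nodatacow, then preallocate it.

    Order matters: FS_NOCOW_FL only takes effect while the file has no
    data blocks, so the flag must be set before fallocate reserves any.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o640)
    try:
        buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("l", 0))
        flags = struct.unpack("l", buf)[0]
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("l", flags | FS_NOCOW_FL))
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
```

On any other filesystem the SETFLAGS call fails or the flag is ignored, which is exactly why journald performs this dance only on Btrfs.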
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Wed, Feb 3, 2021 at 9:41 AM Lennart Poettering wrote:
>
> On Di, 05.01.21 10:04, Chris Murphy (li...@colorremedies.com) wrote:
>
> > f27a386430cc7a27ebd06899d93310fb3bd4cee7
> > journald: whenever we rotate a file, btrfs defrag it
> >
> > Since systemd-journald sets nodatacow on /var/log/journal the journals
> > don't really fragment much. I typically see 2-4 extents for the life
> > of the journal, depending on how many times it's grown, in what looks
> > like 8MiB increments. The defragment isn't really going to make any
> > improvement on that, at least not worth submitting it for additional
> > writes on SSD. While laptop and desktop SSD/NVMe can handle such a
> > small amount of extra writes with no meaningful impact to wear, it
> > probably does have an impact on much more low-end flash like USB
> > sticks, eMMC, and SD cards. So I figure, let's just drop the
> > defragmentation step entirely.
>
> Quite frankly, given how iops-expensive btrfs is, one probably
> shouldn't choose btrfs for such small devices anyway. It's really not
> where btrfs shines, last time I looked.

Btrfs aggressively delays metadata and data allocation, so I don't agree that it's expensive. There is a wandering-trees problem that can result in write amplification, but that's a different problem. And native compression has been shown to significantly reduce overall writes.

But in any case, reading a journal file and rewriting it out, which is what defragment does, doesn't really have any benefit given the file doesn't fragment much anyway due to (a) nodatacow and (b) fallocate, which is what systemd-journald does on Btrfs. It'd make more sense to defragment only if the file is datacow. At least then it also gets compressed, which isn't the case when it's nodatacow.

> > Further, since they are nodatacow, they can't be submitted for
> > compression. There was a quasi-bug in Btrfs, now fixed, where
> > nodatacow files submitted for defragmentation were compressed. So we no
> > longer get that unintended benefit. This strengthens the case to just
> > drop the defragment step upon rotation, no other changes.
> >
> > What do you think?
>
> Did you actually check the iops this generates?

I don't understand the relevance.

> Not sure it's worth doing these kinds of optimizations without any hard
> data on how expensive this really is. It would be premature.

Submitting the journal for defragment in effect duplicates the journal: read all extents, and rewrite those blocks to a new location. It's doubling the writes for that journal file. It's not like the defragment is free.

> That said, if there's actual reason to optimize the iops here then we
> could make this smart: there's actually an API for querying
> fragmentation: we could defrag only if we notice the fragmentation is
> really too high.

FIEMAP isn't going to work in the case the files are being compressed. The Btrfs extent size becomes 128KiB in that case, and it looks like massive fragmentation. So that needs to be made smarter first.

I don't have a problem submitting the journal for a one-time defragment upon rotation if it's datacow, i.e. if an empty journal-nocow.conf exists. But by default, the combination of fallocate and nodatacow already avoids all meaningful fragmentation, so long as the journals aren't being snapshot. If they are, well, that too is a different problem. If the user does that and we're still defragmenting the files, it'll explode their space consumption, because defragment is not snapshot-aware: it results in all shared extents becoming unshared.

> But quite frankly, this sounds like polishing things after the horse
> already left the stable: if you want to optimize iops, then don't use
> btrfs. If you bought into btrfs, then apparently you are OK with the
> extra iops it generates, hence also the defrag costs.

Somehow I think you're missing what I've been asking for, which is to stop the unnecessary defragment step because it's not an optimization. It doesn't meaningfully reduce fragmentation at all, it just adds write amplification.

--
Chris Murphy
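[A FIEMAP-based check of the kind discussed above would first have to discount the 128KiB extents that btrfs compression produces. A crude heuristic sketch; the threshold, and the assumption that adjacent 128KiB compressed extents are usually physically contiguous, are mine, not from the thread:]

```python
# btrfs caps compressed extents at 128KiB.
BTRFS_COMPRESSED_EXTENT_MAX = 128 * 1024

def looks_fragmented(extent_lengths, threshold=8):
    # FIEMAP reports one record per extent. For a compressed file almost
    # every record is exactly 128KiB even when the extents sit next to
    # each other on disk, so counting raw records overstates fragmentation.
    # Ignore those records and only count the remainder (crude heuristic;
    # a real check would also compare physical offsets for contiguity).
    real = [n for n in extent_lengths if n != BTRFS_COMPRESSED_EXTENT_MAX]
    return len(real) > threshold
```

With this, a compressed archived journal reported as hundreds of 128KiB extents would not be submitted for defragmentation, while a genuinely scattered nodatacow file would be.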
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Di, 26.01.21 21:00, Chris Murphy (li...@colorremedies.com) wrote:

> On Tue, Jan 5, 2021 at 10:04 AM Chris Murphy wrote:
> >
> > f27a386430cc7a27ebd06899d93310fb3bd4cee7
> > journald: whenever we rotate a file, btrfs defrag it
> >
> > Since systemd-journald sets nodatacow on /var/log/journal the journals
> > don't really fragment much. I typically see 2-4 extents for the life
> > of the journal, depending on how many times it's grown, in what looks
> > like 8MiB increments. The defragment isn't really going to make any
> > improvement on that, at least not worth submitting it for additional
> > writes on SSD. While laptop and desktop SSD/NVMe can handle such a
> > small amount of extra writes with no meaningful impact to wear, it
> > probably does have an impact on much more low-end flash like USB
> > sticks, eMMC, and SD cards. So I figure, let's just drop the
> > defragmentation step entirely.
> >
> > Further, since they are nodatacow, they can't be submitted for
> > compression. There was a quasi-bug in Btrfs, now fixed, where
> > nodatacow files submitted for defragmentation were compressed. So we no
> > longer get that unintended benefit. This strengthens the case to just
> > drop the defragment step upon rotation, no other changes.
> >
> > What do you think?
>
> A better idea.
>
> Default behavior: journals are nodatacow and are not defragmented.
>
> If `/etc/tmpfiles.d/journal-nocow.conf` exists, do the reverse.
> Journals are datacow, and files are defragmented (and compressed, if
> it's enabled).

Performance is terrible if cow is used on journal files while we write them.

It would be great if we could turn datacow back on once the files are archived, and then take benefit of compression/checksumming and stuff. Not sure if there's any sane API for that in btrfs besides rewriting the whole file, though. Anyone knows? Just dropping FS_NOCOW_FL on the existing file doesn't work, IIRC; it can only be changed while a file is empty, last time I looked.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Di, 05.01.21 10:04, Chris Murphy (li...@colorremedies.com) wrote:

> f27a386430cc7a27ebd06899d93310fb3bd4cee7
> journald: whenever we rotate a file, btrfs defrag it
>
> Since systemd-journald sets nodatacow on /var/log/journal the journals
> don't really fragment much. I typically see 2-4 extents for the life
> of the journal, depending on how many times it's grown, in what looks
> like 8MiB increments. The defragment isn't really going to make any
> improvement on that, at least not worth submitting it for additional
> writes on SSD. While laptop and desktop SSD/NVMe can handle such a
> small amount of extra writes with no meaningful impact to wear, it
> probably does have an impact on much more low-end flash like USB
> sticks, eMMC, and SD cards. So I figure, let's just drop the
> defragmentation step entirely.

Quite frankly, given how iops-expensive btrfs is, one probably shouldn't choose btrfs for such small devices anyway. It's really not where btrfs shines, last time I looked.

> Further, since they are nodatacow, they can't be submitted for
> compression. There was a quasi-bug in Btrfs, now fixed, where
> nodatacow files submitted for defragmentation were compressed. So we no
> longer get that unintended benefit. This strengthens the case to just
> drop the defragment step upon rotation, no other changes.
>
> What do you think?

Did you actually check the iops this generates?

Not sure it's worth doing these kinds of optimizations without any hard data on how expensive this really is. It would be premature.

That said, if there's actual reason to optimize the iops here, then we could make this smart: there's actually an API for querying fragmentation; we could defrag only if we notice the fragmentation is really too high.

But quite frankly, this sounds like polishing things after the horse already left the stable: if you want to optimize iops, then don't use btrfs. If you bought into btrfs, then apparently you are OK with the extra iops it generates, hence also the defrag costs.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Tue, Jan 5, 2021 at 10:04 AM Chris Murphy wrote:
>
> f27a386430cc7a27ebd06899d93310fb3bd4cee7
> journald: whenever we rotate a file, btrfs defrag it
>
> Since systemd-journald sets nodatacow on /var/log/journal the journals
> don't really fragment much. I typically see 2-4 extents for the life
> of the journal, depending on how many times it's grown, in what looks
> like 8MiB increments. The defragment isn't really going to make any
> improvement on that, at least not worth submitting it for additional
> writes on SSD. While laptop and desktop SSD/NVMe can handle such a
> small amount of extra writes with no meaningful impact to wear, it
> probably does have an impact on much more low-end flash like USB
> sticks, eMMC, and SD cards. So I figure, let's just drop the
> defragmentation step entirely.
>
> Further, since they are nodatacow, they can't be submitted for
> compression. There was a quasi-bug in Btrfs, now fixed, where
> nodatacow files submitted for defragmentation were compressed. So we no
> longer get that unintended benefit. This strengthens the case to just
> drop the defragment step upon rotation, no other changes.
>
> What do you think?

A better idea.

Default behavior: journals are nodatacow and are not defragmented.

If `/etc/tmpfiles.d/journal-nocow.conf` exists, do the reverse. Journals are datacow, and files are defragmented (and compressed, if it's enabled).

--
Chris Murphy