Re: btrfs, journald logs, fragmentation, and fallocate
>> [ ... ] these extents are all over the place, they're not
>> contiguous at all. 4K here, 4K there, 4K over there, back to
>> 4K here next to this one, 4K over there...12K over there, 500K
>> unwritten, 4K over there. This seems not so consequential on
>> SSD, [ ... ]

> Indeed there were recent reports that the 'ssd' mount option
> causes that, IIRC by Hans van Kranenburg [ ... ]

The report included news that "sometimes" the 'ssd' option is
automatically switched on at mount even on hard disks. I had promised
to put a summary of the issue on the Btrfs wiki, but I regret that I
haven't yet done that.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs, journald logs, fragmentation, and fallocate
> [ ... ] Instead, you can use raw files (preferably sparse unless
> there's both nocow and no snapshots). Btrfs does natively everything
> you'd gain from qcow2, and does it better: you can delete the master
> of a cloned image, deduplicate them, deduplicate two unrelated images;
> you can turn on compression, etc.

Uhm, I understand this argument in the general case (not specifically
as to QCOW2 images), and it has some merit, but it is "controversial",
as there are two counterarguments:

* Application-specific file formats can better match
  application-specific requirements.

* Putting advanced functionality into the filesystem code makes it
  more complex and less robust, and Btrfs is a bit of a major example
  of the consequences.

I count compression and deduplication among the things that I reckon
make a filesystem too complex. As to snapshots, I draw a distinction
between filetree snapshots and file snapshots: the first clones a tree
as of the snapshot moment, and is a system-management feature; the
second provides per-file update rollback. One sort of implies the
other, but using the per-file rollback *systematically*, that is as a
feature an application can rely on, seems a bit dangerous to me.

> Once you pay the btrfs performance penalty,

Uhmmm, Btrfs has a small or negative performance penalty as a general
purpose filesystem, and many (more or less well conceived) tests show
it performs up there with the best. The only two real costs I
attribute to it are the huge CPU cost of doing checksumming all the
time, which is unavoidable if one wants checksumming, and that
checksumming usually requires metadata duplication, that is at least
the 'dup' profile for metadata, and that is indeed a bit expensive.

> you may as well actually use its features,

The features that I think Btrfs gives that are worth using are
checksumming, metadata duplication, and filetree snapshots.

> which make qcow2 redundant and harmful.
My impression is that in almost all cases QCOW2 is harmful, because it
trades more IOPS and complexity for less disk space, and disk space is
cheap while IOPS and complexity are expensive, but of course a lot of
people know better :-). My preferred VM setup is a small, essentially
read-only, non-QCOW2 image for '/' and everything else mounted via
NFSv4, from the VM host itself or a NAS server, but again lots of
people know better and use multi-terabyte-sized QCOW2 images :-).
Re: btrfs, journald logs, fragmentation, and fallocate
Goffredo Baroncelli posted on Fri, 28 Apr 2017 19:05:21 +0200 as
excerpted:

> After some thinking I adopted a different strategy: I used journald
> as collector, then I forward all the logs to rsyslogd, which uses a
> "log append" format. Journald never writes on the root filesystem,
> only in tmp.

Great minds think alike. =:^)

Only here it's syslog-ng that does the permanent writes. I just
couldn't see journald's crazy (for btrfs) write pattern going to
permanent storage.

And AFAIK, journald has no pre-write filtering mechanism at all, only
post-write display-time filtering, so even "log-spam" that I don't
want/need logged gets written to it, while if I see something spamming
continuously (I run git kernels and kde, and do get such spammers
occasionally) I set up a syslog-ng spam filter to kill it, so it never
actually gets written to permanent storage at all.

But the tmpfs journals and btrfs traditional logs give me the best of
both worlds: per-boot journals with all the extra metadata, the last
ten journal entries when I do systemctl status on a unit, etc, and a
nice filtered and ordered multi-boot log that I can use traditional
text-based log-administration tools on.

The only part of it I'm not happy with is that journald apparently
can't keep separate user and system journals when set to temporary
only -- everything goes to the system journal. Which eventually means
that much of the stdout/stderr debugging spew that kde-based apps like
to spew out ends up in the system journal and (would be in the) log.
But that's a journald "documented bug-feature", and I can and do
syslog-ng filter it before it actually hits the written system log (or
console log display).

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
RE: btrfs, journald logs, fragmentation, and fallocate
> -----Original Message-----
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Goffredo Baroncelli
> Sent: Saturday, 29 April 2017 3:05 AM
> To: Chris Murphy <li...@colorremedies.com>
> Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
> Subject: Re: btrfs, journald logs, fragmentation, and fallocate
>
> In the past I faced the same problems; I collected some data here
> http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
> Unfortunately the journald files are very bad, because first the data
> is written (appended), then the index fields are updated.
> Unfortunately these indexes live just after the last write, so
> fragmentation is unavoidable.

Perhaps a better idea for COW filesystems is to store the index in a
separate file, and/or rewrite the last 1 MB block (or part thereof) of
the data file every time data is appended? That way the data file will
use 1 MB extents and hopefully avoid ridiculous amounts of metadata.

Paul.
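Paul's tail-rewrite idea can be sketched in a few lines. This is a
hedged illustration, not journald's actual code: the file path and the
helper name are hypothetical, and a real implementation would also
handle the separate index file he mentions. The point it demonstrates
is that each small append is turned into one contiguous rewrite of the
trailing 1 MB-aligned region, so a COW filesystem allocates one large
replacement extent instead of a 4 KiB sliver per append.

```python
import os

BLOCK = 1024 * 1024  # rewrite granularity Paul suggests (1 MB)

def append_rewriting_tail(path, data):
    """Append `data`, but rewrite the whole trailing partial block in a
    single write, so COW allocates one large extent per append instead
    of a tiny one."""
    with open(path, "ab") as f:
        pass  # make sure the file exists
    size = os.path.getsize(path)
    start = (size // BLOCK) * BLOCK  # start of the last partial block
    with open(path, "r+b") as f:
        f.seek(start)
        tail = f.read()          # existing bytes of the partial block
        f.seek(start)
        f.write(tail + data)     # one contiguous rewrite of the tail

# usage: repeated small appends each become one larger sequential write
path = "/tmp/journal.dat"       # hypothetical demo path
if os.path.exists(path):
    os.remove(path)
append_rewriting_tail(path, b"entry one\n")
append_rewriting_tail(path, b"entry two\n")
```

The trade-off is write amplification (up to 1 MB rewritten per append)
in exchange for large extents and far less extent metadata.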
Re: btrfs, journald logs, fragmentation, and fallocate
> [ ... ] these extents are all over the place, they're not
> contiguous at all. 4K here, 4K there, 4K over there, back to
> 4K here next to this one, 4K over there...12K over there, 500K
> unwritten, 4K over there. This seems not so consequential on
> SSD, [ ... ]

Indeed there were recent reports that the 'ssd' mount option causes
that, IIRC by Hans van Kranenburg (around 2017-04-17), who also
noticed issues with the wandering trees in certain situations (around
2017-04-08).
Re: btrfs, journald logs, fragmentation, and fallocate
On Fri, Apr 28, 2017 at 11:41:00AM -0600, Chris Murphy wrote:
> The same behavior happens with NTFS in qcow2 files. They quickly end
> up with 100,000+ extents unless set nocow. It's like the worst case
> scenario.

You should never use qcow2 on btrfs, especially if snapshots are
involved. They both do roughly the same thing, and layering
fragmentation upon fragmentation ɪꜱ ɴᴏᴛ ᴘʀᴇᴛᴛʏ. Layering syncs is bad,
too.

Instead, you can use raw files (preferably sparse unless there's both
nocow and no snapshots). Btrfs does natively everything you'd gain
from qcow2, and does it better: you can delete the master of a cloned
image, deduplicate them, deduplicate two unrelated images; you can
turn on compression, etc.

Once you pay the btrfs performance penalty, you may as well actually
use its features, which make qcow2 redundant and harmful.

Meow!
--
Don't be racist. White, amber or black, all beers should be judged
based solely on their merits. Heck, even if occasionally a cider
applies for a beer's job, why not? On the other hand, corpo lager is
not a race.
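The "raw, preferably sparse" alternative above is trivial to create:
a plain truncate gives the image its full logical size while allocating
(almost) no blocks until the guest writes. A minimal sketch, with a
hypothetical path; on btrfs one would typically also `chattr +C` the
image and clone it with `cp --reflink`:

```python
import os

# Create a 1 GiB sparse raw disk image: full logical size, but blocks
# are only allocated as the guest actually writes them.
image, size = "/tmp/disk.raw", 1 << 30   # hypothetical demo path

with open(image, "wb") as f:
    f.truncate(size)  # one big hole; no data blocks written

st = os.stat(image)
print(st.st_size)          # logical size: 1073741824
print(st.st_blocks * 512)  # physical allocation: close to zero
```

Compare qcow2, which reimplements the same allocate-on-write idea in
its own metadata layer on top of whatever the filesystem already does.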
Re: btrfs, journald logs, fragmentation, and fallocate
On Fri, Apr 28, 2017 at 1:39 PM, Peter Grandi wrote:
> In a particularly demented setup I had to decastrophize with
> great pain a Zimbra QCOW2 disk image (XFS on NFS on XFS on
> RAID6) containing an ever-growing Maildir email archive that
> ended up with over a million widely scattered microextents:
>
> http://www.sabi.co.uk/blog/1101Jan.html?110116#110116

Related Btrfs thread "File system corruption, btrfsck abort" involves
5 concurrently used VMs with guests using ext4, NTFS, HFS+, Btrfs,
LVM, pointing to qcow2 files on Btrfs for backing. And it's resulting
in problems...

--
Chris Murphy
Re: btrfs, journald logs, fragmentation, and fallocate
On Fri, Apr 28, 2017 at 11:53 AM, Peter Grandi wrote:
> Well, depends, but probably the single file: it is more likely
> that the 20,000 fragments will actually be contiguous, and that
> there will be less metadata IO than for 40,000 separate journal
> files.

You can see from the examples I posted that these extents are all over
the place, they're not contiguous at all. 4K here, 4K there, 4K over
there, back to 4K here next to this one, 4K over there...12K over
there, 500K unwritten, 4K over there. This seems not so consequential
on SSD; at least if it impacts performance it's not so bad I care. On
a hard drive, it's totally noticeable. And that's why journald went
with chattr +C by default a few versions ago when on Btrfs. And it
does help *if* the parent is never snapshot, which on a snapshotting
filesystem can't really be guaranteed. Inadvertent snapshotting could
be inhibited by putting the journals in their own subvolume though.

Anyway, it's difficult to consider Btrfs a general purpose filesystem
if other general purpose workloads, like journal files, are causing a
problem like wandering tree. Hence the subject of what to do about it,
and that may mean short term and long term. I can't speak for systemd
developers but if there's a different way to write to the journals
that'd be better for Btrfs and no worse for ext4 and XFS, it might be
considered.

--
Chris Murphy
Re: btrfs, journald logs, fragmentation, and fallocate
On Fri, Apr 28, 2017 at 11:46 AM, Peter Grandi wrote:
> So there are three layers of silliness here:
>
> * Writing large files slowly to a COW filesystem and
>   snapshotting it frequently.
> * A filesystem that does delayed allocation instead of
>   allocate-ahead, and does not have psychic code.
> * Working around that by using no-COW and preallocation
>   with a fixed size regardless of snapshot frequency.
>
> The primary problem here is that there is no way to have slow
> small writes and frequent snapshots without generating small
> extents: if a file is written at a rate of 1MiB/hour and gets
> snapshot every hour the extent size will not be larger than 1MiB
> *obviously*.

Sure. But in my example, no snapshotting, and +C is inhibited (i.e. I
set /etc/tmpfiles.d/journal-nocow.conf which stops systemd from the
new behavior of setting +C on journals). That's resulting in a 19000+
fragment journal file. In fact snapshotting does not make it worse
though. If it's nocow, then yes snapshotting makes it worse than plain
nocow, but no worse than cow.

What I'm trying to get at is that default Btrfs behavior and
(previous) default journald behavior have a misalignment resulting in
a lot of fragmentation. Is there a better way around this than merely
setting journals to nocow *and* making sure they stay nocow by
preventing snapshotting? If there's nothing better to be done, then
I'll just re-recommend to systemd folks that the directory containing
journals should be made a subvolume to isolate it from inadvertent
snapshotting. If people want to snapshot it anyway there's nothing we
can do about that.

> Filesystem-level snapshots are not designed to snapshot slowly
> growing files, but to snapshot changing collections of
> files. There are harsh tradeoffs involved. Application-level
> snapshots (also known as log rotations :->) are needed for
> special cases and finer grained policies.
>
> The secondary problem is that a fixed preallocation of 8MiB is
> good only if in between snapshots the file grows by a little
> less than 8MiB or by substantially more.

Just to be clear, none of my own examples involve journals being
snapshot. There are no shared extents for any of those files.

--
Chris Murphy
Re: btrfs, journald logs, fragmentation, and fallocate
>> The gotcha though is there's a pile of data in the journal
>> that would never make it to rsyslogd. If you use journalctl
>> -o verbose you can see some of this.

> You can send *all the info* to rsyslogd via imjournal
> http://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html
> In my setup all the data are stored in json format in the
> /var/log/cee.log file:
> $ head /var/log/cee.log
> 2017-04-28T18:41:41.931273+02:00 venice liblogging-stdlog: @cee:
> { "PRIORITY": "6", "_BOOT_ID": "a86d74bab91f44dc974c76aceb97141f",
> "_MACHINE_ID": [ ... ]

Ahh the horror, the horror, I will never be able to unsee that. The
UNIX way of doing things is truly dead.

>> The same behavior happens with NTFS in qcow2 files. They
>> quickly end up with 100,000+ extents unless set nocow.
>> It's like the worst case scenario.

In a particularly demented setup I had to decastrophize with great
pain a Zimbra QCOW2 disk image (XFS on NFS on XFS on RAID6) containing
an ever-growing Maildir email archive that ended up with over a
million widely scattered microextents:

http://www.sabi.co.uk/blog/1101Jan.html?110116#110116
Re: btrfs, journald logs, fragmentation, and fallocate
On 2017-04-28 19:41, Chris Murphy wrote:
> On Fri, Apr 28, 2017 at 11:05 AM, Goffredo Baroncelli wrote:
>
>> In the past I faced the same problems; I collected some data here
>> http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
>> Unfortunately the journald files are very bad, because first the
>> data is written (appended), then the index fields are updated.
>> Unfortunately these indexes live just after the last write, so
>> fragmentation is unavoidable.
>>
>> After some thinking I adopted a different strategy: I used journald
>> as collector, then I forward all the logs to rsyslogd, which uses a
>> "log append" format. Journald never writes on the root filesystem,
>> only in tmp.
>
> The gotcha though is there's a pile of data in the journal that would
> never make it to rsyslogd. If you use journalctl -o verbose you can
> see some of this.

You can send *all the info* to rsyslogd via imjournal:
http://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html

In my setup all the data are stored in json format in the
/var/log/cee.log file:

$ head /var/log/cee.log
2017-04-28T18:41:41.931273+02:00 venice liblogging-stdlog: @cee: { "PRIORITY": "6", "_BOOT_ID": "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": "e84907d099904117b355a99c98378dca", "_HOSTNAME": "venice.bhome", "_SYSTEMD_SLICE": "system.slice", "_UID": "0", "_GID": "0", "_CAP_EFFECTIVE": "3f", "_TRANSPORT": "syslog", "SYSLOG_FACILITY": "23", "SYSLOG_IDENTIFIER": "liblogging-stdlog", "MESSAGE": " [origin software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed", "_PID": "737", "_COMM": "rsyslogd", "_EXE": "\/usr\/sbin\/rsyslogd", "_CMDLINE": "\/usr\/sbin\/rsyslogd -n", "_SYSTEMD_CGROUP": "\/system.slice\/rsyslog.service", "_SYSTEMD_UNIT": "rsyslog.service", "_SYSTEMD_INVOCATION_ID": "18b9a8b27f9143728adef972db7b394c", "_SOURCE_REALTIME_TIMESTAMP": "1493397701931255", "msg": "[origin software=\"rsyslogd\" swVersion=\"8.24.0\"
x-pid=\"737\" x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed" }

2017-04-28T18:41:42.058549+02:00 venice liblogging-stdlog: @cee: { "PRIORITY": "6", "_BOOT_ID": "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": "e84907d099904117b355a99c98378dca", "_HOSTNAME": "venice.bhome", "_SYSTEMD_SLICE": "system.slice", "_UID": "0", "_GID": "0", "_CAP_EFFECTIVE": "3f", "_TRANSPORT": "syslog", "SYSLOG_FACILITY": "23", "SYSLOG_IDENTIFIER": "liblogging-stdlog", "MESSAGE": " [origin software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed", "_PID": "737", "_COMM": "rsyslogd", "_EXE": "\/usr\/sbin\/rsyslogd", "_CMDLINE": "\/usr\/sbin\/rsyslogd -n", "_SYSTEMD_CGROUP": "\/system.slice\/rsyslog.service", "_SYSTEMD_UNIT": "rsyslog.service", "_SYSTEMD_INVOCATION_ID": "18b9a8b27f9143728adef972db7b394c", "_SOURCE_REALTIME_TIMESTAMP": "1493397702058441", "msg": "[origin software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed" }

[...]

All the info is stored with the same keys/values as journald uses.
I developed a utility (called clp), which allows querying the log by
key, filtering by boot nr, by date. For example, to show all the logs
related to rsyslog:

$ clp log -t full-details _SYSTEMD_CGROUP=/system.slice/rsyslog.service
2017-04-21 19:12:29.579748
MESSAGE= [origin software="rsyslogd" swVersion="8.24.0" x-pid="804" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
PRIORITY=6
SYSLOG_FACILITY=23
SYSLOG_IDENTIFIER=liblogging-stdlog
_BOOT_ID=d77198380c9344248e01166fbd8d60df
_CAP_EFFECTIVE=3f
_CMDLINE=/usr/sbin/rsyslogd -n
_COMM=rsyslogd
_EXE=/usr/sbin/rsyslogd
_GID=0
_HOSTNAME=venice.bhome
_LOGFILEINITLINE=2017-04-21T19:12:29.579768+02:00 venice liblogging-stdlog:
_LOGFILELINENUMBER=1
_LOGFILENAME=/var/log/cee.log.7.gz
_LOGFILETIMESTAMP=1492794749579768
_MACHINE_ID=e84907d099904117b355a99c98378dca
_PID=804
_SOURCE_REALTIME_TIMESTAMP=1492794749579748
_SYSTEMD_CGROUP=/system.slice/rsyslog.service
_SYSTEMD_INVOCATION_ID=8f9cb6c871be4158a3ccb374f4323027
_SYSTEMD_SLICE=system.slice
_SYSTEMD_UNIT=rsyslog.service
_TRANSPORT=syslog
_UID=0
msg=[origin software="rsyslogd" swVersion="8.24.0" x-pid="804"
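The @cee records above are just ordinary JSON after the cookie, so any
standard tooling can query them by key. A minimal sketch of the idea
(clp's actual implementation isn't shown in the thread; the sample line
here is abbreviated from the records above):

```python
import json

# One syslog line in the @cee convention: everything after the "@cee:"
# cookie is a plain JSON object carrying the journald metadata keys.
line = ('2017-04-28T18:41:41.931273+02:00 venice liblogging-stdlog: '
        '@cee: { "PRIORITY": "6", "_SYSTEMD_UNIT": "rsyslog.service", '
        '"_PID": "737", "MESSAGE": "rsyslogd was HUPed" }')

def parse_cee(line):
    """Split off the @cee: cookie and decode the structured payload."""
    _prefix, _, payload = line.partition("@cee:")
    return json.loads(payload)

record = parse_cee(line)
# Filtering by journald metadata keys, e.g. by unit, then works on the
# flat log file just as it does with journalctl:
print(record["_SYSTEMD_UNIT"])   # rsyslog.service
```

This is what makes the "log append" format searchable with both grep
and structured queries.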
Re: btrfs, journald logs, fragmentation, and fallocate
> [ ... ] And that makes me wonder whether metadata
> fragmentation is happening as a result. But in any case,
> there's a lot of metadata being written for each journal
> update compared to what's being added to the journal file.
> [ ... ]

That's the "wandering trees" problem in COW filesystems, and
manifestations of it in Btrfs have also been reported before. If there
is a workload that triggers a lot of "wandering trees" updates, then
perhaps a filesystem that has "wandering trees" should not be used
:-).

> [ ... ] worse, a single file with 20,000 fragments; or 40,000
> separate journal files? *shrug* [ ... ]

Well, depends, but probably the single file: it is more likely that
the 20,000 fragments will actually be contiguous, and that there will
be less metadata IO than for 40,000 separate journal files.

The deeper "strategic" issue is that storage systems, and filesystems
in particular, have very anisotropic performance envelopes, and
mismatches between the envelopes of application and filesystem can be
very expensive:

http://www.sabi.co.uk/blog/15-two.html?151023#151023
Re: btrfs, journald logs, fragmentation, and fallocate
> Old news is that systemd-journald journals end up pretty
> heavily fragmented on Btrfs due to COW.

This has been discussed before in detail indeed here, but also here:

http://www.sabi.co.uk/blog/15-one.html?150203#150203

> While journald uses chattr +C on journal files now, COW still
> happens if the subvolume the journal is in gets snapshot. e.g.
> a week old system.journal has 19000+ extents. [ ... ] It
> appears to me (see below URLs pointing to example journals)
> that journald fallocated in 8MiB increments but then ends up
> doing 4KiB writes; [ ... ]

So there are three layers of silliness here:

* Writing large files slowly to a COW filesystem and
  snapshotting it frequently.
* A filesystem that does delayed allocation instead of
  allocate-ahead, and does not have psychic code.
* Working around that by using no-COW and preallocation
  with a fixed size regardless of snapshot frequency.

The primary problem here is that there is no way to have slow small
writes and frequent snapshots without generating small extents: if a
file is written at a rate of 1MiB/hour and gets snapshot every hour,
the extent size will not be larger than 1MiB *obviously*.

Filesystem-level snapshots are not designed to snapshot slowly growing
files, but to snapshot changing collections of files. There are harsh
tradeoffs involved. Application-level snapshots (also known as log
rotations :->) are needed for special cases and finer grained
policies.

The secondary problem is that a fixed preallocation of 8MiB is good
only if in between snapshots the file grows by a little less than 8MiB
or by substantially more.
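The primary-problem bound above is simple arithmetic and can be stated
as a one-liner; a small sketch (the function name is mine, not from the
thread), using Peter's own numbers:

```python
MiB = 1024 * 1024

def max_extent_bytes(write_rate_mib_per_hr, snapshot_interval_hr):
    """Upper bound on extent size under COW plus snapshots: data
    written between two snapshots can never merge with the extents
    already frozen by the previous snapshot."""
    return int(write_rate_mib_per_hr * snapshot_interval_hr * MiB)

# Peter's example: 1 MiB/hour, snapshot every hour -> extents <= 1 MiB,
# regardless of how big a preallocation journald asks for.
print(max_extent_bytes(1, 1) // MiB)   # 1

# Which also shows why a fixed 8 MiB fallocate is mismatched: at that
# write rate, ~7 MiB of each chunk is still unwritten when the first
# snapshot after the fallocate freezes it.
print(8 - max_extent_bytes(1, 1) // MiB)   # 7
```

The only knobs are the write rate and the snapshot interval; no
allocation strategy changes the bound.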
Re: btrfs, journald logs, fragmentation, and fallocate
On Fri, Apr 28, 2017 at 11:05 AM, Goffredo Baroncelli wrote:
> In the past I faced the same problems; I collected some data here
> http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
> Unfortunately the journald files are very bad, because first the data
> is written (appended), then the index fields are updated.
> Unfortunately these indexes live just after the last write, so
> fragmentation is unavoidable.
>
> After some thinking I adopted a different strategy: I used journald
> as collector, then I forward all the logs to rsyslogd, which uses a
> "log append" format. Journald never writes on the root filesystem,
> only in tmp.

The gotcha though is there's a pile of data in the journal that would
never make it to rsyslogd. If you use journalctl -o verbose you can
see some of this. There's a bunch of extra metadata in the journal.
And then also filtering based on that metadata is useful rather than
being limited to grep on a syslog file. Which, you know, is fine for
many use cases. I guess I'm just interested in whether there's an
enhancement that can be done to make journals more compatible with
Btrfs or vice versa. It's not a huge problem anyway.

> The thing became interesting when I discovered that searching in a
> rsyslog file is faster than journalctl (on rotational media).
> Unfortunately I don't have any data to support this.

Yes, on drives all of these scattered extents cause a lot of head
seeking. And I also suspect it's a lot of metadata spread out
everywhere too, to account for all of these extents. That's why they
moved to chattr +C to make them nocow.

An idea I had on the systemd list was to automatically make the
journal directory a Btrfs subvolume, similar to how systemd already
creates a /var/lib/machines subvolume for nspawn containers. This
prevents the journals from being caught up in a snapshot of the parent
subvolume that typically contains the journals (root fs). There's no
practical use I can think of for snapshotting logs. You'd really want
the logs to always be linear, contiguous, and never get rolled back.
Even if something in the system does get rolled back, you'd want the
logs to show that and continue on, rather than being rolled back
themselves.

So the super simple option would be to continue with +C on journals,
and then a separate subvolume to prevent COW from ever happening
inadvertently.

The same behavior happens with NTFS in qcow2 files. They quickly end
up with 100,000+ extents unless set nocow. It's like the worst case
scenario.

--
Chris Murphy
Re: btrfs, journald logs, fragmentation, and fallocate
On 2017-04-28 18:16, Chris Murphy wrote:
> Old news is that systemd-journald journals end up pretty heavily
> fragmented on Btrfs due to COW. While journald uses chattr +C on
> journal files now, COW still happens if the subvolume the journal is
> in gets snapshot. e.g. a week old system.journal has 19000+ extents.
>
> The news is I started a systemd thread.
>
> This is the start:
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
>
> Where it gets interesting, two messages by Andrei Borzenkov: He
> evaluates existing code and does some tests on ext4 and XFS.
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038728.html
>
> And then the question.
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038735.html
>
> Given what journald is doing, is what Btrfs is doing expected? Is
> there something it could do better to be more like ext4 and XFS in
> the same situation? Or is it out of scope for Btrfs?

In the past I faced the same problems; I collected some data here:
http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
Unfortunately the journald files are very bad, because first the data
is written (appended), then the index fields are updated.
Unfortunately these indexes live just after the last write, so
fragmentation is unavoidable.

After some thinking I adopted a different strategy: I used journald as
collector, then I forward all the logs to rsyslogd, which uses a "log
append" format. Journald never writes on the root filesystem, only in
tmp.

The thing became interesting when I discovered that searching in a
rsyslog file is faster than journalctl (on rotational media).
Unfortunately I don't have any data to support this. However if
someone is interested I can share more details.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
btrfs, journald logs, fragmentation, and fallocate
Old news is that systemd-journald journals end up pretty heavily
fragmented on Btrfs due to COW. While journald uses chattr +C on
journal files now, COW still happens if the subvolume the journal is
in gets snapshot. e.g. a week old system.journal has 19000+ extents.

The news is I started a systemd thread.

This is the start:
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html

Where it gets interesting, two messages by Andrei Borzenkov: He
evaluates existing code and does some tests on ext4 and XFS.
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038728.html

And then the question.
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038735.html

Given what journald is doing, is what Btrfs is doing expected? Is
there something it could do better to be more like ext4 and XFS in the
same situation? Or is it out of scope for Btrfs?

It appears to me (see below URLs pointing to example journals) that
journald fallocates in 8MiB increments but then ends up doing 4KiB
writes; there's a lot of these unused (unwritten) 8MiB extents that
appear in both filefrag and btrfs-debug -f outputs. The +C idea just
rearranges the deck chairs; it's not solving the underlying problem
except in the case where the containing subvolume is never snapshot.

And in the COW case, I'm seeing about 30 metadata nodes being written
out for what amounts to less than a 4KiB journal append. Each time.
And that makes me wonder whether metadata fragmentation is happening
as a result. But in any case, there's a lot of metadata being written
for each journal update compared to what's being added to the journal
file.

And then that makes me wonder if a better optimization on Btrfs would
be having each write be a separate file. The small updates would have
data inline. Which is worse, a single file with 20,000 fragments; or
40,000 separate journal files? *shrug*

At least those individual files would be subject to compression with
+c; whereas right now the open-endedness of the active journal means
it has not a single compressed extent. Only once rotated do journals
get compressed (via the defragmentation journald does only on Btrfs).
Journals contain highly compressible data.

Anyway, two example journals. The parent directory has chattr +c; both
journals inherited it. The first URL is filefrag -v, the 2nd is
btrfs-debug -f; for each journal.

This is a rotated journal. Upon rotation on Btrfs, journald
defragments the file, which ends up compressing it when chattr +c.
https://da.gd/4NKyq
https://da.gd/zEeYW

This is an active system.journal. No compressed extents (the writes I
think are too small).
https://da.gd/cBjX
https://da.gd/YXuI

Extra credit if you've followed this far... The rotated log has piles
of unwritten items in it that are making it fairly inefficient even
with compression. Just using cat to write its contents to a new file,
compression goes from a 1.27 ratio, to 5.70. Here are the results
after catting that file:
https://da.gd/rE8KT
https://da.gd/PD5qI

--
Chris Murphy
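The fallocate-then-dribble pattern described above (8 MiB
preallocations, 4 KiB appends, large unwritten tails) is easy to
reproduce on any Linux filesystem. A hedged sketch, not journald's
actual code; the path is hypothetical and the sizes are the ones from
the thread:

```python
import os

PREALLOC = 8 * 1024 * 1024   # journald's fallocate increment
WRITE = 4 * 1024             # typical small journal append

path = "/tmp/fallocate-demo.journal"   # hypothetical demo path
fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC)

# Preallocate 8 MiB up front: the space is reserved, but the extent is
# marked unwritten until data actually lands in it.
os.posix_fallocate(fd, 0, PREALLOC)

# Then dribble in 4 KiB writes; only a small prefix of the reservation
# ever gets written, which is why filefrag shows big "unwritten" tails
# on these files (and why they compress so poorly until rewritten).
for i in range(16):
    os.pwrite(fd, b"\0" * WRITE, i * WRITE)

print(os.fstat(fd).st_size)  # 8388608: logical size set by the fallocate
os.close(fd)
```

Tools like `filefrag -v` on the resulting file would show the written
prefix and the unwritten remainder of the preallocated extent.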