Re: [systemd-devel] journal fragmentation on Btrfs
On Sat, 22.04.17 15:29, Andrei Borzenkov (arvidj...@gmail.com) wrote:

> 18.04.2017 07:27, Chris Murphy wrote:
> > On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov wrote:
> >> 18.04.2017 06:50, Chris Murphy wrote:
> >>>> What exactly does "changes" mean? A write() syscall?
> >>>
> >>> filefrag reported entries increase; it's using FIEMAP.
> >>
> >> So far it sounds like btrfs allocates a new extent on every write to
> >> the journal file. Each journal record itself is relatively small indeed.
> >
> > Hence why it would be better if there's no fsync, so that it can
> > accumulate these and do its own commit (30s default for Btrfs).
>
> It is not related to fsync. I made some tests. Journald does not appear
> to preallocate the file nor mmap the whole file (at least as far as I
> can see from the source); when it appends a new record it basically does
>
>     fallocate(fd, end_of_file, new_size)
>     mmap(fd, end_of_file, new_size)
>     write to new size
>
> This results in a large number of extents, as each fallocate() ends up
> in a new extent.
>
> I can easily reproduce it with a small program that uses a similar
> pattern; actually mmap is a red herring. Just fallocate'ing a file in
> small increments gives a file consisting of an overly large number of
> extents. How exactly those extents get distributed across the device
> probably depends on overall filesystem activity.
>
> This is different from simply writing to the file at the end, which
> still results in several extents, but significantly larger ones.
>
> BTW, you get the same pattern from direct IO. Writing a 100M file in 4K
> blocks using cached writes gives me 7 extents of size between 25M and
> 500K. Writing the same with direct IO results in 25600 extents (same as
> growing the file in 4K steps with fallocate).

BTW, we are not really married to any particular fancy semantics of
fallocate(). We call it mostly so that our later writes to the file
blocks using mmap() will not result in SIGBUS. There was also the hope
that letting the fs know in advance that we are about to append the
specified number of bytes to the end of the file through mmap() would
be a good thing, not a bad thing...

Or in other words, we really don't need fallocate() to actually go to
disk or actually write anything. All we want is to *reserve* some space
for us... Maybe this is something to report to the btrfs folks? It
appears to me their implementation of fallocate() does more than it has
to according to the docs.

Lennart

--
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] journal fragmentation on Btrfs
22.04.2017 15:29, Andrei Borzenkov wrote:
> [...]
>
> BTW, you get the same pattern from direct IO. Writing a 100M file in 4K
> blocks using cached writes gives me 7 extents of size between 25M and
> 500K. Writing the same with direct IO results in 25600 extents (same as
> growing the file in 4K steps with fallocate).

For comparison: on ext4, both direct IO and fallocate end in 2-3
extents. On xfs, fallocate gives 2 extents (the first being very small)
and direct IO 1 extent.
Re: [systemd-devel] journal fragmentation on Btrfs
18.04.2017 07:27, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov wrote:
>> 18.04.2017 06:50, Chris Murphy wrote:
>>>> What exactly does "changes" mean? A write() syscall?
>>>
>>> filefrag reported entries increase; it's using FIEMAP.
>>
>> So far it sounds like btrfs allocates a new extent on every write to
>> the journal file. Each journal record itself is relatively small indeed.
>
> Hence why it would be better if there's no fsync, so that it can
> accumulate these and do its own commit (30s default for Btrfs).

It is not related to fsync. I made some tests. Journald does not appear
to preallocate the file nor mmap the whole file (at least as far as I
can see from the source); when it appends a new record it basically does

    fallocate(fd, end_of_file, new_size)
    mmap(fd, end_of_file, new_size)
    write to new size

This results in a large number of extents, as each fallocate() ends up
in a new extent.

I can easily reproduce it with a small program that uses a similar
pattern; actually mmap is a red herring. Just fallocate'ing a file in
small increments gives a file consisting of an overly large number of
extents. How exactly those extents get distributed across the device
probably depends on overall filesystem activity.

This is different from simply writing to the file at the end, which
still results in several extents, but significantly larger ones.

BTW, you get the same pattern from direct IO. Writing a 100M file in 4K
blocks using cached writes gives me 7 extents of size between 25M and
500K. Writing the same with direct IO results in 25600 extents (same as
growing the file in 4K steps with fallocate).
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, 17.04.17 21:50, Chris Murphy (li...@colorremedies.com) wrote:

> >> Why do I see so many changes to the journal file, once every 2-5
> >> seconds? This adds 4096 byte blocks to the file each time, and when
> >> cow, that'd explain why there are so many fragments.
> >
> > What exactly does "changes" mean? A write() syscall?
>
> filefrag reported entries increase; it's using FIEMAP.

As mentioned, we write to the file via mmap() as we receive the log
messages, and then issue ftruncate()s to propagate mtime inotify events
which other clients can watch for live log views. And then, 5min after a
write we issue sync(), but at most once every 5min.

> Also with stat I see the times (all three) change on the file. If I go
> to GNOME Terminal and just sudo some command, that itself causes the
> current system.journal file to get all three times modified. It
> happens immediately, there's no delay. So if I'm doing something like
> drm.debug=0x1e, which is spitting a bunch of stuff to dmesg and thus
> the journal, it's just constantly writing stuff to the journal. This
> is without anything running journalctl -f or reading the journal.

The "sudo" command logs each invocation, hence, yes, of course, the log
files will get updated.

> >> #Storage=auto
> >> #Compress=yes
> >> #Seal=yes
> >> #SplitMode=uid
> >> #SyncIntervalSec=5m
> >
> > This controls how often systemd calls fsync() on the currently active
> > journal file. Do you see fsync() every 3 seconds?
>
> I have no idea if it's fsync or what. How can I tell?

You can do "strace -p `pidof systemd-journald` -e sync"...

> Also, I don't think these journal files are being compressed.
>
> Using the btrfs-progs/btrfs-debugfs script on a few user journal
> files, I'm seeing massive compression ratios. Maybe I'll try
> Compress=No and see if there's a change.

As documented, Compress= will only compress large objects stored in the
journal, but not the general journal structure. This means journal
files usually remain highly compressible. Random access and compression
don't easily mix, and we valued random access more.

Lennart

--
Lennart Poettering, Red Hat
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, 17.04.17 13:49, Chris Murphy (li...@colorremedies.com) wrote:

> On Mon, Apr 17, 2017 at 11:27 AM, Andrei Borzenkov wrote:
> > 17.04.2017 19:25, Chris Murphy wrote:
> >> This explains one system's fragmented journals; but the other system
> >> isn't snapshotting journals and I haven't figured out why they're so
> >> fragmented. No snapshots, and they are all +C at create time
> >> (systemd-journald default on Btrfs). Is it possible to prevent
> >> journald from setting +C on /var/log/journal and
> >> /var/log/journal/? If I remove them, at next boot they get
> >> reset, so any new journals created inherit that.
> >
> > Yes, should be possible by creating an empty
> > /etc/tmpfiles.d/journal-nocow.conf.
>
> OK super.
>
> How about inhibiting the defragmentation on rotate? I'm suspicious one
> of the things I'm seeing is due to ssd optimization mount options, but
> I need to see the predefrag state of the files.

You can't turn off the defrag-on-archive logic. But you can configure
journald to use much larger journal files, so that archival never
happens...

> Why do I see so many changes to the journal file, once every 2-5
> seconds? This adds 4096 byte blocks to the file each time, and when
> cow, that'd explain why there are so many fragments.

We write to the journal files through mmap. If you see writes every 2-5
seconds then this indicates that there's something logging every 2-5s...

> #Storage=auto
> #Compress=yes
> #Seal=yes
> #SplitMode=uid
> #SyncIntervalSec=5m
> #RateLimitIntervalSec=30s
> #RateLimitBurst=1000
>
> A change every 5m is not what I'm seeing with stat. I have no crit,
> emerg, or alert messages happening. Just a bunch of drm debug messages
> which are constant. But if the flush should only happen every 5
> minutes, I'm confused.

SyncIntervalSec= configures the max time after each write that journald
will sync(). Or in other words, it means that sync() is called once
every 5min if you have a constant stream of log messages, but if you
have a long phase of no messages we'll not call it at all either.

Lennart

--
Lennart Poettering, Red Hat
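[Editorial aside: the two knobs Lennart mentions live in journald.conf(5); a sketch of a configuration along those lines, with illustrative values only, not recommendations:]

```ini
# /etc/systemd/journald.conf (illustrative values only)
[Journal]
# Larger files mean archival -- and the defrag that comes with it --
# happens rarely or never:
SystemMaxFileSize=512M
# Upper bound between sync() calls while messages keep arriving;
# during quiet phases no sync() happens at all:
SyncIntervalSec=5m
```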
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov wrote:
> strace -p $(pgrep systemd-journal)
>
> You will not see actual writes as the file is memory mapped, but it
> definitely does not do any fsync() every so often.

https://paste.fedoraproject.org/paste/oVT-tsU2sBOdTJaZxGua-15M1UNdIGYhyRLivL9gydE=

That's just a partial, but the complete output captured for a couple of
minutes doesn't contain an fsync. Then I did this for 8 minutes:

strace -c -f -p $(pgrep systemd-journal)

https://paste.fedoraproject.org/paste/Uzc2KhkkaqLOU8USLd38B15M1UNdIGYhyRLivL9gydE=

So 6 fsyncs in 8 minutes; more than 1 per 5 minutes, but not nearly as
many as I thought. So maybe, as you say, it's just memory-mapped
activity I'm seeing with stat and filefrag.

--
Chris Murphy
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov wrote:
>> I have no idea if it's fsync or what. How can I tell?
>
> strace -p $(pgrep systemd-journal)
>
> You will not see actual writes as the file is memory mapped, but it
> definitely does not do any fsync() every so often.

Also found this: https://lwn.net/Articles/306046/

Not sure how to enable and use it though.

--
Chris Murphy
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov wrote:
> 18.04.2017 06:50, Chris Murphy wrote:
>>> What exactly does "changes" mean? A write() syscall?
>>
>> filefrag reported entries increase; it's using FIEMAP.
>
> So far it sounds like btrfs allocates a new extent on every write to
> the journal file. Each journal record itself is relatively small indeed.

Hence why it would be better if there's no fsync, so that it can
accumulate these and do its own commit (30s default for Btrfs).

It is likely that the ssd allocation option on these ssds is a factor
in the fragmentation, because it tries to allocate into a unique 2MB
section based on the expected erase block size. There's a lot of
discussion going on right now on the Btrfs list about whether these
assumptions are still true, and in what cases maybe we should be using
nossd on higher-end SSDs and NVMe.

What's for sure, though, is that with any of these allocators, nocow is
not good for lower-end SSDs like SD cards; all that does is ask to
write to the same LBA over and over and over again, for a journal. And
it just increases write amplification unnecessarily. So I'm beginning
to think that on SSDs it's better if journald set +c rather than +C on
journals. But there's still some researching to do.

I definitely think /var/log/journal/ should be a subvolume, to avoid
its contents being snapshot. That does make the fragmentation problem
worse. And also I think the defragmentation feature should be disabled,
at least on SSD, or should include zlib compression. The write
amplification on SSD is worse than just leaving the file fragmented.

> [...]
>
> strace -p $(pgrep systemd-journal)
>
> You will not see actual writes as the file is memory mapped, but it
> definitely does not do any fsync() every so often.
>
> Is it possible that the btrfs behavior you observe is specific to
> memory-mapped file handling?

Maybe. But even after a reboot I see the same extent entries in the
file. Granted, a good deal of these 1-block entries have addresses that
follow one after the other, so they often make up larger contiguous
extents, but they still have separate entries.

>> Also, I don't think these journal files are being compressed.
>>
>> Using the btrfs-progs/btrfs-debugfs script on a few user journal
>> files, I'm seeing massive compression ratios. Maybe I'll try
>> Compress=No and see if there's a change.
>
> Only actual message payload above some threshold (I think 256 or 512
> bytes, not sure) is compressed; everything else is not. For average
> syslog-type messages the payload is far too small. This is really only
> interesting when you store core dumps or similar.

Interesting, I see. Thanks. I'll try strace and see what's going on.

--
Chris Murphy
Re: [systemd-devel] journal fragmentation on Btrfs
18.04.2017 06:50, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 9:42 PM, Andrei Borzenkov wrote:
>> 17.04.2017 22:49, Chris Murphy wrote:
>>> [...]
>>>
>>> Why do I see so many changes to the journal file, once every 2-5
>>> seconds? This adds 4096 byte blocks to the file each time, and when
>>> cow, that'd explain why there are so many fragments.
>>
>> What exactly does "changes" mean? A write() syscall?
>
> filefrag reported entries increase; it's using FIEMAP.

So far it sounds like btrfs allocates a new extent on every write to
the journal file. Each journal record itself is relatively small indeed.

> Also with stat I see the times (all three) change on the file. [...]
>
>>> #Storage=auto
>>> #Compress=yes
>>> #Seal=yes
>>> #SplitMode=uid
>>> #SyncIntervalSec=5m
>>
>> This controls how often systemd calls fsync() on the currently active
>> journal file. Do you see fsync() every 3 seconds?
>
> I have no idea if it's fsync or what. How can I tell?

strace -p $(pgrep systemd-journal)

You will not see actual writes as the file is memory mapped, but it
definitely does not do any fsync() every so often.

Is it possible that the btrfs behavior you observe is specific to
memory-mapped file handling?

> Also, I don't think these journal files are being compressed.
>
> Using the btrfs-progs/btrfs-debugfs script on a few user journal
> files, I'm seeing massive compression ratios. Maybe I'll try
> Compress=No and see if there's a change.

Only actual message payload above some threshold (I think 256 or 512
bytes, not sure) is compressed; everything else is not. For average
syslog-type messages the payload is far too small. This is really only
interesting when you store core dumps or similar.

> file: user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-00059b73-00054d51b3f442ff.journal
> extents 64 disk size 294912 logical size 8388608 ratio 28.44
> file: user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-0002ec5b-00054d4ebb7114e7.journal
> extents 64 disk size 278528 logical size 8388608 ratio 30.12
> file: user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-06e5-00054c3c32607483.journal
> extents 320 disk size 5206016 logical size 41943040 ratio 8.06
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, Apr 17, 2017 at 9:42 PM, Andrei Borzenkov wrote:
> 17.04.2017 22:49, Chris Murphy wrote:
>> On Mon, Apr 17, 2017 at 11:27 AM, Andrei Borzenkov wrote:
>>> 17.04.2017 19:25, Chris Murphy wrote:
>>>> [...]
>>>
>>> Yes, should be possible by creating an empty
>>> /etc/tmpfiles.d/journal-nocow.conf.
>>
>> OK super.
>>
>> How about inhibiting the defragmentation on rotate? I'm suspicious one
>> of the things I'm seeing is due to ssd optimization mount options, but
>> I need to see the predefrag state of the files.
>>
>> Why do I see so many changes to the journal file, once every 2-5
>> seconds? This adds 4096 byte blocks to the file each time, and when
>> cow, that'd explain why there are so many fragments.
>
> What exactly does "changes" mean? A write() syscall?

filefrag reported entries increase; it's using FIEMAP.

Also with stat I see the times (all three) change on the file. If I go
to GNOME Terminal and just sudo some command, that itself causes the
current system.journal file to get all three times modified. It happens
immediately, there's no delay. So if I'm doing something like
drm.debug=0x1e, which is spitting a bunch of stuff to dmesg and thus
the journal, it's just constantly writing stuff to the journal. This is
without anything running journalctl -f or reading the journal.

>> #Storage=auto
>> #Compress=yes
>> #Seal=yes
>> #SplitMode=uid
>> #SyncIntervalSec=5m
>
> This controls how often systemd calls fsync() on the currently active
> journal file. Do you see fsync() every 3 seconds?

I have no idea if it's fsync or what. How can I tell?

Also, I don't think these journal files are being compressed.

Using the btrfs-progs/btrfs-debugfs script on a few user journal files,
I'm seeing massive compression ratios. Maybe I'll try Compress=No and
see if there's a change.

file: user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-00059b73-00054d51b3f442ff.journal
extents 64 disk size 294912 logical size 8388608 ratio 28.44
file: user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-0002ec5b-00054d4ebb7114e7.journal
extents 64 disk size 278528 logical size 8388608 ratio 30.12
file: user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-06e5-00054c3c32607483.journal
extents 320 disk size 5206016 logical size 41943040 ratio 8.06

--
Chris Murphy
Re: [systemd-devel] journal fragmentation on Btrfs
17.04.2017 22:49, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 11:27 AM, Andrei Borzenkov wrote:
>> 17.04.2017 19:25, Chris Murphy wrote:
>>> [...]
>>
>> Yes, should be possible by creating an empty
>> /etc/tmpfiles.d/journal-nocow.conf.
>
> OK super.
>
> How about inhibiting the defragmentation on rotate? I'm suspicious one
> of the things I'm seeing is due to ssd optimization mount options, but
> I need to see the predefrag state of the files.
>
> Why do I see so many changes to the journal file, once every 2-5
> seconds? This adds 4096 byte blocks to the file each time, and when
> cow, that'd explain why there are so many fragments.

What exactly does "changes" mean? A write() syscall?

> #Storage=auto
> #Compress=yes
> #Seal=yes
> #SplitMode=uid
> #SyncIntervalSec=5m

This controls how often systemd calls fsync() on the currently active
journal file. Do you see fsync() every 3 seconds?

> #RateLimitIntervalSec=30s
> #RateLimitBurst=1000
>
> A change every 5m is not what I'm seeing with stat. I have no crit,
> emerg, or alert messages happening. Just a bunch of drm debug messages
> which are constant. But if the flush should only happen every 5
> minutes, I'm confused.
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, Apr 17, 2017 at 11:27 AM, Andrei Borzenkov wrote:
> 17.04.2017 19:25, Chris Murphy wrote:
>> This explains one system's fragmented journals; but the other system
>> isn't snapshotting journals and I haven't figured out why they're so
>> fragmented. No snapshots, and they are all +C at create time
>> (systemd-journald default on Btrfs). Is it possible to prevent
>> journald from setting +C on /var/log/journal and
>> /var/log/journal/? If I remove them, at next boot they get
>> reset, so any new journals created inherit that.
>
> Yes, should be possible by creating an empty
> /etc/tmpfiles.d/journal-nocow.conf.

OK super.

How about inhibiting the defragmentation on rotate? I'm suspicious one
of the things I'm seeing is due to ssd optimization mount options, but
I need to see the predefrag state of the files.

Why do I see so many changes to the journal file, once every 2-5
seconds? This adds 4096 byte blocks to the file each time, and when
cow, that'd explain why there are so many fragments.

#Storage=auto
#Compress=yes
#Seal=yes
#SplitMode=uid
#SyncIntervalSec=5m
#RateLimitIntervalSec=30s
#RateLimitBurst=1000

A change every 5m is not what I'm seeing with stat. I have no crit,
emerg, or alert messages happening. Just a bunch of drm debug messages
which are constant. But if the flush should only happen every 5
minutes, I'm confused.

--
Chris Murphy
Re: [systemd-devel] journal fragmentation on Btrfs
Here's an example rotated log (Btrfs, NVMe, no compression, default ssd
mount option). As you can see it takes up more space on disk than it
contains data, so there's a lot of slack space for some reason, despite
/etc/systemd/journald.conf being unmodified and thus Compress=yes.

file: system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 41 disk size 143511552 logical size 100663296 ratio 0.70

$ sudo btrfs fi defrag -c system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal

And now:

file: system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 768 disk size 21504000 logical size 100663296 ratio 4.68

That's nearly 1/7th the original on-disk size. The existing defrag
without compression is probably just increasing write amplification on
SSDs. If it's badly fragmented, just leave it alone.

This also works on nocow journals with +C set, although I'm not sure
whether this is intended behavior (I thought nocow implies no
compression), so I've asked about that on the Btrfs list.

Chris Murphy
Re: [systemd-devel] journal fragmentation on Btrfs
17.04.2017 19:25, Chris Murphy wrote:
> This explains one system's fragmented journals; but the other system
> isn't snapshotting journals and I haven't figured out why they're so
> fragmented. No snapshots, and they are all +C at create time
> (systemd-journald default on Btrfs). Is it possible to prevent
> journald from setting +C on /var/log/journal and
> /var/log/journal/? If I remove them, at next boot they get
> reset, so any new journals created inherit that.

Yes, should be possible by creating an empty
/etc/tmpfiles.d/journal-nocow.conf.
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, Apr 17, 2017 at 3:57 AM, Lennart Poettering wrote:
>> I do manual snapshots before software updates, which means new writes
>> to these files are subject to COW, but additional writes to the same
>> extents are overwrites and are not COW because of chattr +C. I've used
>> this same strategy for a long time, since systemd-journald defaults to
>> +C for journal files; but I've not seen them get this fragmented this
>> quickly.
>
> IIRC NOCOW only has an effect if set right after the file is created,
> before the first write to it is done. Or in other words, you cannot
> retroactively make a file NOCOW. This means that if you in one way or
> another make a COW copy of a file (through reflinking — implicit or
> not, note that "cp" reflinks by default — or through snapshotting or
> something else) the file is COW and you'll get fragmentation.

Correct. There are three states for files on Btrfs: cow (normal), nocow
(+C), and a snapshot of a nocow (+C) file, which is "cowandthennocow"
or whatever you want to call it. But yes, a snapshot of a nocow file
does fragment a ton, but then becomes nocow again and won't fragment
more.

This explains one system's fragmented journals; but the other system
isn't snapshotting journals and I haven't figured out why they're so
fragmented. No snapshots, and they are all +C at create time
(systemd-journald default on Btrfs). Is it possible to prevent journald
from setting +C on /var/log/journal and /var/log/journal/? If
I remove them, at next boot they get reset, so any new journals created
inherit that.

Anyway, snapshots of journals on Btrfs should be avoided for other
reasons. The autocleaning features (SystemMaxUse=, SystemKeepFree=, as
well as --vacuum-size=) don't work correctly when there are snapshots
of journals. Even when journald deletes journals, their extents are
pinned by snapshots, so they still take up the same space. Basically
journald could get into a situation where it deletes all the journals
it sees, but no space is freed up because those journals are stuck in a
snapshot.

> I am not entirely sure what to recommend you. Ultimately whether btrfs
> fragments or not is probably something you have to discuss with the
> btrfs folks. We do try to make the best of btrfs, by managing the COW
> flag, but this only helps you to a limited degree as
> snapshots/reflinks will fuck things up anyway...

Definitely. An easy solution would be for journald to create
/var/log/journal/ as a subvolume instead of a directory. This would
make journals immune to snapshots of the containing subvolume
(typically the root fs). Of course systemd already makes subvolumes
behind the scenes for other sane reasons, like /var/lib/machines.

Snapshotting logs strikes me as an invalid use case anyway. You'd want
logs immune to rollback; rolling them back would defeat troubleshooting
and auditing. Logs should be linear and continuous, not rolled back.
The snapshotting is arguably a mistake, due to a lack of user
understanding of the consequences. It is admittedly esoteric.

> We also ask btrfs to defrag the file as soon as we mark it as
> archived... I'd even be willing to extend on that, and defrag the file
> on other events too, for example if it ends up being too heavily
> fragmented. But last time I looked btrfs didn't have any nice API for
> that, that would have a clear focus on a single file only...

The biggest issue with them is that they take up a lot of space and
defragment very inconsistently. Depending on kernel version they can
become magnificently larger.

Speaking of which, even with Compress=yes (the default), the journal
files are highly compressible. By copying some to a Btrfs volume with
the compress mount option (this does not force compression; it gives up
easily on already-compressed data), I'm finding 4-6x smaller files. So
the journals are highly compressible.

This is the last line for a couple of journals, from
btrfs-progs/btrfs-debugfs:

file: system@01b44589014542e3b48df31f152c0916-ca2b-00054546539416e8.journal
extents 384 disk size 9691136 logical size 50331648 ratio 5.19
file: system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 768 disk size 21504000 logical size 100663296 ratio 4.68

If there is a way to apply this compression when rotating logs
(read-compress-write), then defragmentation isn't needed on Btrfs, and
all file systems gain the benefit of much smaller logs.

--
Chris Murphy
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, 17 Apr 2017 16:01:48 +0200, Kai Krakow wrote:

> > We also ask btrfs to defrag the file as soon as we mark it as
> > archived...
>
> This makes sense. And I've learned that journal on btrfs works much
> better if you use many small files vs. a few big files. I've currently
> set the journal size limit to 8 MB for that reason which gives me very
> good performance.

Hmm, well, I just looked: I eventually stopped doing that, probably when you introduced defragging of archived journals. But I see no journal file bigger than 128M, which seems to work well.

--
Regards,
Kai
Replies to list-only preferred.
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, 17 Apr 2017 11:57:21 +0200, Lennart Poettering wrote:

> On Sun, 16.04.17 14:30, Chris Murphy (li...@colorremedies.com) wrote:
>
> > Hi,
> >
> > This is on a Fedora 26 workstation (systemd-233-3.fc26.x86_64)
> > that's maybe a couple weeks old and was clean installed. Drive is
> > NVMe.
> >
> > # filefrag *
> > system.journal: 9283 extents found
> > user-1000.journal: 3437 extents found
> > # lsattr
> > C-- ./system.journal
> > C-- ./user-1000.journal
> >
> > I do manual snapshots before software updates, which means new
> > writes to these files are subject to COW, but additional writes to
> > the same extents are overwrites and are not COW because of chattr
> > +C. I've used this same strategy for a long time, since
> > systemd-journald defaults to +C for journal files; but I've not
> > seen them get this fragmented this quickly.
>
> IIRC NOCOW only has an effect if set right after the file is created,
> before the first write to it is done. Or in other words, you cannot
> retroactively make a file NOCOW. This means that if you in one way or
> another make a COW copy of a file (through reflinking — implicit or
> not, note that "cp" reflinks by default — or through snapshotting or
> something else) the file is COW and you'll get fragmentation.

To mark a file nocow, it has to exist with zero bytes and must never have been written to. The nocow attribute (chattr +C) is inherited from the directory when a file is created. So the best way to go is to set +C on the directory; all future journal files will then be nocow.

You can still do snapshots: nocow doesn't prohibit that and doesn't make journals cow again. What happens is that btrfs simply unshares extents as soon as you write to the snapshot. The newly created extent itself behaves like nocow again. If the extents are big enough, this shouldn't introduce any serious fragmentation, just waste space. Btrfs won't split extents upon unsharing them during a write.
It may, however, "replace" only part of the unshared extent, thus making three new extents: two sharing the old copy, one holding the new data. But since journals are append-only, that should be no problem. It's just that the data is written so slowly that writes almost never get combined into a single write, resulting in many extents.

> I am not entirely sure what to recommend you. Ultimately whether btrfs
> fragments or not, is probably something you have to discuss with the
> btrfs folks. We do try to make the best of btrfs, by managing the COW
> flag, but this only helps you to a limited degree as
> snapshots/reflinks will fuck things up anyway...

Well, usually you shouldn't have to manage the cow flag at all: just set it once on the newly created journal directory and everything is fine. And even then, people may not want this, so they could easily unset the flag on the directory and rotate the journal.

> We also ask btrfs to defrag the file as soon as we mark it as
> archived...

This makes sense. And I've learned that journal on btrfs works much better if you use many small files vs. a few big files. I've currently set the journal size limit to 8 MB for that reason, which gives me very good performance.

> I'd even be willing to extend on that, and defrag the file
> on other events too, for example if it ends up being too heavily
> fragmented.

Since the append behavior of btrfs is so bad wrt journal files, it should be enough to simply let btrfs defrag the previously written journal block upon appending to the file. Lennart, I think you are hinting to the OS that the file is going to grow, and thus extending it to 8 MB beyond the current end of file to continue writing. That would be a good event to let btrfs defrag the old 8 MB block (and just that, not the complete file). If this works well, you could maybe skip defragging the complete file upon rotation, which should improve disk IO performance during rotation.
I think the default extent size hint for defragging with btrfs defrag has lately been set to 32 MB, so it would be enough to do the above step every 32 MB.

> But last time I looked btrfs didn't have any nice API for
> that, that would have a clear focus on a single file only...

The high number of extents may not be an indicator of fragmentation when btrfs compression is used. Compressed data is organized in logical 128k units, which are reported as separate fragments to filefrag; in reality they may be laid out contiguously on disk, so there is no fragmentation. It would be interesting to see the block map of this.

--
Regards,
Kai
Replies to list-only preferred.
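[Editorial aside: btrfs does expose a defrag ioctl focused on a single file, BTRFS_IOC_DEFRAG_RANGE, which takes a start offset, a length, and an extent-size threshold, which would fit Kai's defrag-the-previous-block idea. A hedged sketch of calling it; the ioctl number and struct layout are taken from linux/btrfs.h as I recall them, and on a non-btrfs filesystem the call simply fails:]

```python
import fcntl, os, struct, tempfile

# _IOW(BTRFS_IOCTL_MAGIC=0x94, 16, struct btrfs_ioctl_defrag_range_args)
BTRFS_IOC_DEFRAG_RANGE = 0x40309410

def defrag_range(fd, start, length, extent_thresh=32 * 1024 * 1024):
    """Ask btrfs to defrag [start, start+length) of one file, with a
    target extent size hint; returns False where unsupported."""
    # struct btrfs_ioctl_defrag_range_args: start u64, len u64,
    # flags u64, extent_thresh u32, compress_type u32, unused u32[4]
    args = struct.pack("QQQLL4L", start, length, 0, extent_thresh, 0,
                       0, 0, 0, 0)
    try:
        fcntl.ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, args)
        return True
    except OSError:
        return False  # e.g. ENOTTY on non-btrfs filesystems

fd, path = tempfile.mkstemp()       # stand-in for an open journal file
ok = defrag_range(fd, 0, 32 * 1024 * 1024)
os.close(fd)
print("defragged" if ok else "ioctl refused (not btrfs?)")
```

Journald itself, IIRC, calls the simpler whole-file BTRFS_IOC_DEFRAG on archival; the range variant is what a defrag-as-you-append scheme would need.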
Re: [systemd-devel] journal fragmentation on Btrfs
On Sun, 16.04.17 14:30, Chris Murphy (li...@colorremedies.com) wrote:

> Hi,
>
> This is on a Fedora 26 workstation (systemd-233-3.fc26.x86_64) that's
> maybe a couple weeks old and was clean installed. Drive is NVMe.
>
> # filefrag *
> system.journal: 9283 extents found
> user-1000.journal: 3437 extents found
> # lsattr
> C-- ./system.journal
> C-- ./user-1000.journal
>
> I do manual snapshots before software updates, which means new writes
> to these files are subject to COW, but additional writes to the same
> extents are overwrites and are not COW because of chattr +C. I've used
> this same strategy for a long time, since systemd-journald defaults to
> +C for journal files; but I've not seen them get this fragmented this
> quickly.

IIRC NOCOW only has an effect if set right after the file is created, before the first write to it is done. Or in other words, you cannot retroactively make a file NOCOW. This means that if you in one way or another make a COW copy of a file (through reflinking — implicit or not, note that "cp" reflinks by default — or through snapshotting or something else) the file is COW and you'll get fragmentation.

I am not entirely sure what to recommend you. Ultimately, whether btrfs fragments or not is probably something you have to discuss with the btrfs folks. We do try to make the best of btrfs, by managing the COW flag, but this only helps you to a limited degree as snapshots/reflinks will fuck things up anyway...

We also ask btrfs to defrag the file as soon as we mark it as archived... I'd even be willing to extend on that, and defrag the file on other events too, for example if it ends up being too heavily fragmented. But last time I looked btrfs didn't have any nice API for that, that would have a clear focus on a single file only...

Lennart

--
Lennart Poettering, Red Hat
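[Editorial aside: the NOCOW flag Lennart describes is the same bit that chattr +C flips, settable programmatically via the FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctls. A minimal sketch, assuming the usual 64-bit linux/fs.h constant values; on filesystems that reject the flag it degrades to a best-effort report:]

```python
import array, fcntl, os, tempfile

# linux/fs.h constants (64-bit Linux); FS_NOCOW_FL is the chattr +C bit
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL = 0x00800000

def set_nocow(path):
    """Best-effort chattr +C; only btrfs honors the flag, and for
    regular files only while they are still empty."""
    fd = os.open(path, os.O_RDONLY)
    try:
        flags = array.array("l", [0])
        fcntl.ioctl(fd, FS_IOC_GETFLAGS, flags, True)
        flags[0] |= FS_NOCOW_FL
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, flags, True)
        return True
    except OSError:
        return False  # non-btrfs filesystems may reject the ioctl
    finally:
        os.close(fd)

# Set +C on the (still empty) directory so newly created files
# inherit nocow, per the inheritance behavior discussed in the thread.
journal_dir = tempfile.mkdtemp()    # stand-in for /var/log/journal
supported = set_nocow(journal_dir)
print("nocow set" if supported else "not supported on this filesystem")
```

This mirrors why setting +C on the directory up front works and retrofitting it onto a written file does not: the flag only takes effect on data that has not been written yet.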
[systemd-devel] journal fragmentation on Btrfs
Hi,

This is on a Fedora 26 workstation (systemd-233-3.fc26.x86_64) that's maybe a couple weeks old and was clean installed. Drive is NVMe.

# filefrag *
system.journal: 9283 extents found
user-1000.journal: 3437 extents found
# lsattr
C-- ./system.journal
C-- ./user-1000.journal

I do manual snapshots before software updates, which means new writes to these files are subject to COW, but additional writes to the same extents are overwrites and are not COW because of chattr +C. I've used this same strategy for a long time, since systemd-journald defaults to +C for journal files; but I've not seen them get this fragmented this quickly.

Meanwhile on a Fedora 25 Server, which has systemd-231-14.fc25.x86_64 and is SD-card based, I've made a modification where /var/log is a nested subvolume, so that when I snapshot the root subvolume the contents of /var/log are not snapshotted. These files should therefore always be no-COW, and yet they too are rather fragmented.

# filefrag *
system@00054c130c57bb79-5df6c2871d1edf1e.journal~: 1 extent found
system@00054cb3cd18d71b-6a815220d62cc6ea.journal~: 1 extent found
system@01b44589014542e3b48df31f152c0916-0001-000542e1fb4550e7.journal: 1 extent found
system@01b44589014542e3b48df31f152c0916-ca2b-00054546539416e8.journal: 1 extent found
system@01b44589014542e3b48df31f152c0916-000198f3-000547aac217c85b.journal: 1 extent found
system.journal: 2992 extents found
user-1000@00054c130a314ee9-4bb9fd0a9268dc1c.journal~: 1 extent found
user-1000@ac4b2e5ded7d4e0dbcac6fc45430c857-05a9-000542e1fe209094.journal: 1 extent found
user-1000@ac4b2e5ded7d4e0dbcac6fc45430c857-cafe-0005454b13a0349f.journal: 1 extent found
user-1000@ac4b2e5ded7d4e0dbcac6fc45430c857-0001abe0-0005482397f286a5.journal: 1 extent found
user-1000.journal: 405 extents found

What's going on is that there are many 4096-byte extents. Maybe this is a consequence of frequent fsync?
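[Editorial aside: the many 4096-byte extents are consistent with the append pattern discussed elsewhere in the thread: journald fallocate()s a small tail, mmap()s the grown file, and writes the record through the mapping. A toy reproduction of that growth pattern; the file and record contents are made up for illustration, and whether each fallocate() step becomes its own extent depends on the filesystem:]

```python
import mmap, os, tempfile

fd, path = tempfile.mkstemp()   # stand-in for an active journal file
size = 0
for record in [b"journal entry %d\n" % i for i in range(64)]:
    # 1. Allocate the new tail first, so writing through the mapping
    #    cannot SIGBUS (the stated reason journald calls fallocate).
    os.posix_fallocate(fd, size, len(record))
    new_size = size + len(record)
    # 2. Map the grown file and write the record through the mapping.
    with mmap.mmap(fd, new_size) as m:
        m[size:new_size] = record
    size = new_size
# On btrfs, each of these tiny fallocate() extensions can end up as its
# own extent, which is where filefrag's thousands of extents come from.
os.close(fd)
```

On ext4 the same pattern tends to get merged by delayed allocation, which matches Andrei's observation that the extent explosion is a btrfs-specific reaction to incremental fallocate(), not to fsync.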
On the plus side, even after a 'reboot -f' or forced power off, I get pretty much everything within the last few seconds in the journal on the next boot. That's pretty good. Maybe doing better is too much hassle; e.g. no fsyncing on Btrfs, just letting its normal 30s commit time apply, and if things start crashing journald could start fsyncing... some sort of dynamic trigger. There could be 8000 things higher priority than this, though; this isn't broken.

Output from
# filefrag -v system.journal
# btrfs-debugfs -f system.journal
https://drive.google.com/open?id=0B_2Asp8DGjJ9UEdyVFRfU0c2V2s

--
Chris Murphy