Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-24 Thread Lennart Poettering
On Sat, 22.04.17 15:29, Andrei Borzenkov (arvidj...@gmail.com) wrote:

> 18.04.2017 07:27, Chris Murphy пишет:
> > On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov  
> > wrote:
> >> On 18.04.2017 06:50, Chris Murphy wrote:
> > 
> What exactly does "changes" mean? Write() syscall?
> >>>
> >>> filefrag reported entries increase, it's using FIEMAP.
> >>>
> >>
> >> So far it sounds like btrfs allocates a new extent on every write to
> >> the journal file. Each journal record itself is relatively small indeed.
> > 
> > Hence why it would be better if there's no fsync, so that btrfs can
> > accumulate these and flush them in its own commit (30s default).
> > 
> 
> It is not related to fsync. I ran some tests. Journald does not appear
> to preallocate the file or mmap the whole file (at least as far as I
> can see from the source); when it appends a new record it basically
> does
> 
> fallocate (fd, end_of_file, new_size)
> mmap (fd, end_of_file, new_size)
> write to new size
> 
> This results in a large number of extents, as each fallocate() ends up
> in a new extent.
> 
> I can easily reproduce it with a small program that uses a similar
> pattern; actually mmap is also a red herring. Just fallocate'ing the
> file in small increments gives a file consisting of an overly large
> number of extents. How exactly those extents get distributed across
> the device probably depends on overall filesystem activity.
> 
> This is different from simply writing to the end of the file, which
> still results in several extents, but significantly larger ones.
> 
> BTW you get the same pattern with direct IO. Writing a 100M file in 4K
> blocks using cached writes gives me 7 extents of sizes between 500K
> and 25M. Writing the same with direct IO results in 25600 extents (the
> same as growing the file in 4K steps with fallocate).

BTW, we are not really married to any particular fancy semantics of
fallocate(). We call it mostly so that our later writes to the file
blocks using mmap() will not result in SIGBUS. There was also the hope
that letting the fs know in advance that we are about to append the
specified number of bytes to the end of the file through mmap() would
be a good thing, not a bad thing... Or in other words, we really don't
need fallocate() to actually go to disk and write anything. All we
want is to *reserve* some space for us...

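To illustrate the SIGBUS point (just a sketch; error handling omitted,
file name arbitrary):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("demo.journal", O_RDWR | O_CREAT | O_TRUNC, 0644);

    /* Reserve backing blocks first. Without this, the memcpy() below
     * would fault with SIGBUS, because the mapped page would lie
     * beyond the end of the (empty) file. */
    posix_fallocate(fd, 0, 4096);

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    memcpy(p, "log entry\n", 10);   /* the actual "write" */

    munmap(p, 4096);
    close(fd);
    return 0;
}
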
Maybe this is something to report to the btrfs folks? It appears to me
their implementation of fallocate() does more than it has to according
to the docs.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-22 Thread Andrei Borzenkov
On 22.04.2017 15:29, Andrei Borzenkov wrote:
> On 18.04.2017 07:27, Chris Murphy wrote:
>> On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov  
>> wrote:
>>> On 18.04.2017 06:50, Chris Murphy wrote:
>>
> What exactly does "changes" mean? Write() syscall?

 filefrag reported entries increase, it's using FIEMAP.

>>>
>>> So far it sounds like btrfs allocates a new extent on every write to
>>> the journal file. Each journal record itself is relatively small indeed.
>>
>> Hence why it would be better if there's no fsync, so that btrfs can
>> accumulate these and flush them in its own commit (30s default).
>>
> 
> It is not related to fsync. I ran some tests. Journald does not appear
> to preallocate the file or mmap the whole file (at least as far as I
> can see from the source); when it appends a new record it basically
> does
> 
> fallocate (fd, end_of_file, new_size)
> mmap (fd, end_of_file, new_size)
> write to new size
> 
> This results in a large number of extents, as each fallocate() ends up
> in a new extent.
> 
> I can easily reproduce it with a small program that uses a similar
> pattern; actually mmap is also a red herring. Just fallocate'ing the
> file in small increments gives a file consisting of an overly large
> number of extents. How exactly those extents get distributed across
> the device probably depends on overall filesystem activity.
> 
> This is different from simply writing to the end of the file, which
> still results in several extents, but significantly larger ones.
> 
> BTW you get the same pattern with direct IO. Writing a 100M file in 4K
> blocks using cached writes gives me 7 extents of sizes between 500K
> and 25M. Writing the same with direct IO results in 25600 extents (the
> same as growing the file in 4K steps with fallocate).
> 

For comparison: on ext4, both direct IO and fallocate end in 2-3
extents. On xfs, fallocate gives 2 extents (the first being very small)
and direct IO gives 1 extent.


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-22 Thread Andrei Borzenkov
On 18.04.2017 07:27, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov  
> wrote:
>> On 18.04.2017 06:50, Chris Murphy wrote:
> 
What exactly does "changes" mean? Write() syscall?
>>>
>>> filefrag reported entries increase, it's using FIEMAP.
>>>
>>
>> So far it sounds like btrfs allocates a new extent on every write to
>> the journal file. Each journal record itself is relatively small indeed.
> 
> Hence why it would be better if there's no fsync, so that btrfs can
> accumulate these and flush them in its own commit (30s default).
> 

It is not related to fsync. I ran some tests. Journald does not appear
to preallocate the file or mmap the whole file (at least as far as I
can see from the source); when it appends a new record it basically
does

fallocate (fd, end_of_file, new_size)
mmap (fd, end_of_file, new_size)
write to new size

This results in a large number of extents, as each fallocate() ends up
in a new extent.

I can easily reproduce it with a small program that uses a similar
pattern; actually mmap is also a red herring. Just fallocate'ing the
file in small increments gives a file consisting of an overly large
number of extents. How exactly those extents get distributed across
the device probably depends on overall filesystem activity.

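Roughly like this (a quick sketch; step size, total size and file name
are arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* grow the file in 4K fallocate() steps, as journald does on
     * append; 25600 * 4K = 100M */
    off_t size = 0;
    for (int i = 0; i < 25600; i++) {
        int r = posix_fallocate(fd, size, 4096);
        if (r != 0) {
            fprintf(stderr, "fallocate: %s\n", strerror(r));
            return 1;
        }
        size += 4096;
    }

    close(fd);
    return 0;
}
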
This is different from simply writing to the end of the file, which
still results in several extents, but significantly larger ones.

BTW you get the same pattern with direct IO. Writing a 100M file in 4K
blocks using cached writes gives me 7 extents of sizes between 500K and
25M. Writing the same with direct IO results in 25600 extents (the same
as growing the file in 4K steps with fallocate).


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-18 Thread Lennart Poettering
On Mon, 17.04.17 21:50, Chris Murphy (li...@colorremedies.com) wrote:

> >> Why do I see so many changes to the journal file, once every 2-5
> >> seconds? This adds 4096-byte blocks to the file each time, and when
> >> they're COW, that'd explain why there are so many fragments.
> >>
> >
> >
> > What exactly does "changes" mean? Write() syscall?
> 
> filefrag reported entries increase, it's using FIEMAP.

As mentioned, we write to the file via mmap() as we receive the log
messages, and then issue ftruncate()'s to propagate mtime inotify
events, which other clients can watch for live log views. And then,
5min after a write, we issue sync(), but at most once every 5min.

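A live viewer hence needs little more than this (a sketch; the path is
an example, and a real client watches all journal directories, not a
single file):

#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void) {
    char buf[4096];

    /* Wake up whenever journald's writes/ftruncate()s touch something
     * below the journal directory. */
    int fd = inotify_init1(0);
    inotify_add_watch(fd, "/var/log/journal", IN_MODIFY | IN_ATTRIB);

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf)); /* blocks until an event */
        if (n <= 0)
            break;
        printf("journal changed, re-read new entries\n");
    }
    return 0;
}
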
> Also with stat I see the times (all three) change on the file. If I go
> to GNOME Terminal and just sudo some command, that itself causes the
> current system.journal file to get all three times modified. It
> happens immediately, there's no delay. So if I'm doing something like
> drm.debug=0x1e, which is spitting a bunch of stuff to dmesg and thus
> the journal, it's just constantly writing stuff to the journal. This
> is without anything running journalctl -f or reading the journal.

The "sudo" command logs each invocation, hence, yes, of course, the
log files will get updated.

> >
> >> #Storage=auto
> >> #Compress=yes
> >> #Seal=yes
> >> #SplitMode=uid
> >> #SyncIntervalSec=5m
> >
> > This controls how often systemd calls fsync() on the currently active
> > journal file. Do you see fsync() every 3 seconds?
> 
> I have no idea if it's fsync or what. How can I tell?

You can do "strace -p `pidof systemd-journald` -e sync"...

> Also, I don't think these journal files are being compressed.
> 
> Using the btrfs-progs/btrfs-debugfs script on a few user journal
> files, I'm seeing massive compression ratios. Maybe I'll try
> Compress=No and see if there's a change.

As documented, Compress= only compresses large objects stored in the
journal, not the general journal structure. This means journal files
usually remain highly compressible. Random access and compression
don't easily mix, and we valued random access more.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-18 Thread Lennart Poettering
On Mon, 17.04.17 13:49, Chris Murphy (li...@colorremedies.com) wrote:

> On Mon, Apr 17, 2017 at 11:27 AM, Andrei Borzenkov  
> wrote:
> > On 17.04.2017 19:25, Chris Murphy wrote:
> >> This explains one system's fragmented journals; but the other system
> >> isn't snapshotting journals and I haven't figured out why they're so
> >> fragmented. No snapshots, and they are all +C at create time
> >> (systemd-journald default on Btrfs). Is it possible to prevent
> >> journald from setting +C on /var/log/journal and
> >> /var/log/journal/? If I remove them, at next boot they get
> >> reset, so any new journals created inherit that.
> >>
> >
> > Yes, should be possible by creating empty
> > /etc/tmpfiles.d/journal-nocow.conf.
> 
> OK super.
> 
> How about inhibiting the defragmentation on rotate? I'm suspicious one
> of the things I'm seeing is due to ssd optimization mount options, but
> I need to see the predefrag state of the files.

You can't turn off the defrag-on-archive logic. But you can configure
journald to use much larger journal files, so that archival never
happens...

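For example, in /etc/systemd/journald.conf (illustrative values; the
per-file size limit defaults to an eighth of SystemMaxUse=, hence
raise both):

[Journal]
SystemMaxUse=4G
SystemMaxFileSize=4G
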
> Why do I see so many changes to the journal file, once every 2-5
> seconds? This adds 4096-byte blocks to the file each time, and when
> they're COW, that'd explain why there are so many fragments.

We write to the journal files through mmap. If you see writes every
2-5 seconds then this indicates that there's something logging every 2-5s...

> 
> #Storage=auto
> #Compress=yes
> #Seal=yes
> #SplitMode=uid
> #SyncIntervalSec=5m
> #RateLimitIntervalSec=30s
> #RateLimitBurst=1000
> 
> A change every 5m is not what I'm seeing with stat. I have no crit,
> emerg, or alert messages happening. Just a bunch of drm debug messages
> which are constant. But if the flush should only happen every 5
> minutes, I'm confused.

SyncIntervalSec= configures the max time after each write that journald
will sync(). Or in other words, it means that sync() is called once
every 5min if you have a constant stream of log messages, but if you
have a long phase of no messages we'll not call it at all either.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov  wrote:

> strace -p $(pgrep systemd-journal)
>
> You will not see actual writes as the file is memory mapped, but it
> definitely does not fsync() every few seconds.

https://paste.fedoraproject.org/paste/oVT-tsU2sBOdTJaZxGua-15M1UNdIGYhyRLivL9gydE=

That's just a partial capture, but the complete output, captured for a
couple of minutes, doesn't contain an fsync.

Then I ran this for 8 minutes: strace -c -f -p $(pgrep systemd-journal)

https://paste.fedoraproject.org/paste/Uzc2KhkkaqLOU8USLd38B15M1UNdIGYhyRLivL9gydE=

So 6 fsyncs in 8 minutes; more than 1 per 5 minutes, but not nearly as
many as I thought. So maybe, as you say, it's just memory-mapped
activity I'm seeing with stat and filefrag.


-- 
Chris Murphy


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov  wrote:

>> I have no idea if it's fsync or what. How can I tell?
>>
>
> strace -p $(pgrep systemd-journal)
>
> You will not see actual writes as the file is memory mapped, but it
> definitely does not fsync() every few seconds.

Also found this.

https://lwn.net/Articles/306046/

Not sure how to enable and use it though.

-- 
Chris Murphy


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov  wrote:
> On 18.04.2017 06:50, Chris Murphy wrote:

>>> What exactly does "changes" mean? Write() syscall?
>>
>> filefrag reported entries increase, it's using FIEMAP.
>>
>
> So far it sounds like btrfs allocates a new extent on every write to
> the journal file. Each journal record itself is relatively small indeed.

Hence why it would be better if there's no fsync, so that btrfs can
accumulate these and flush them in its own commit (30s default).

It is likely that the ssd allocator option on these SSDs is a factor in
the fragmentation, because it tries to allocate into a unique 2MB
section based on the expected erase block size. There's a lot of
discussion going on right now on the Btrfs list about whether these
assumptions are still true, and in which cases we should instead be
using nossd on higher-end SSDs and NVMe.

What's for sure though is that with any of these allocators, nocow is
not good for lower-end SSDs like SD cards; all that does is ask to
write to the same LBA over and over and over again, for a journal. And
it just increases write amplification unnecessarily. So I'm beginning
to think that on SSDs it's better if journald did +c rather than +C
on journals. But there's still some researching to do.

I definitely think /var/log/journal/ should be a subvolume
to avoid its contents being snapshotted. Snapshots do make the
fragmentation problem worse.

And I also think the defragmentation feature should be disabled, at
least on SSDs, or should include zlib compression. The write
amplification on SSD is worse than just leaving the file fragmented.



>
>> Also with stat I see the times (all three) change on the file. If I go
>> to GNOME Terminal and just sudo some command, that itself causes the
>> current system.journal file to get all three times modified. It
>> happens immediately, there's no delay. So if I'm doing something like
>> drm.debug=0x1e, which is spitting a bunch of stuff to dmesg and thus
>> the journal, it's just constantly writing stuff to the journal. This
>> is without anything running journalctl -f or reading the journal.
>>
>>>
 #Storage=auto
 #Compress=yes
 #Seal=yes
 #SplitMode=uid
 #SyncIntervalSec=5m
>>>
>>> This controls how often systemd calls fsync() on the currently active
>>> journal file. Do you see fsync() every 3 seconds?
>>
>> I have no idea if it's fsync or what. How can I tell?
>>
>
> strace -p $(pgrep systemd-journal)
>
> You will not see actual writes as the file is memory mapped, but it
> definitely does not fsync() every few seconds.
>
> Is it possible that the btrfs behavior you observe is specific to
> memory-mapped file handling?

Maybe. But even after a reboot I see the same extent entries in the
file. Granted, a good deal of these 1-block entries have addresses that
are one after the other, so they often make up larger contiguous
extents, but they still have separate entries.


>
>> Also, I don't think these journal files are being compressed.
>>
>> Using the btrfs-progs/btrfs-debugfs script on a few user journal
>> files, I'm seeing massive compression ratios. Maybe I'll try
>> Compress=No and see if there's a change.
>>
>
> Only actual message payload above some threshold (I think 256 or 512
> bytes, not sure) is compressed; everything else is not. For average
> syslog-type messages the payload is far too small. This is really only
> interesting when you store core dumps or similar.

Interesting, I see. Thanks.

I'll try strace and see what's going on.


-- 
Chris Murphy


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Andrei Borzenkov
On 18.04.2017 06:50, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 9:42 PM, Andrei Borzenkov  wrote:
>> On 17.04.2017 22:49, Chris Murphy wrote:
>>> On Mon, Apr 17, 2017 at 11:27 AM, Andrei Borzenkov  
>>> wrote:
On 17.04.2017 19:25, Chris Murphy wrote:
> This explains one system's fragmented journals; but the other system
> isn't snapshotting journals and I haven't figured out why they're so
> fragmented. No snapshots, and they are all +C at create time
> (systemd-journald default on Btrfs). Is it possible to prevent
> journald from setting +C on /var/log/journal and
> /var/log/journal/? If I remove them, at next boot they get
> reset, so any new journals created inherit that.
>

 Yes, should be possible by creating empty
 /etc/tmpfiles.d/journal-nocow.conf.
>>>
>>> OK super.
>>>
>>> How about inhibiting the defragmentation on rotate? I'm suspicious one
>>> of the things I'm seeing is due to ssd optimization mount options, but
>>> I need to see the predefrag state of the files.
>>>
>>> Why do I see so many changes to the journal file, once every 2-5
>>> seconds? This adds 4096-byte blocks to the file each time, and when
>>> they're COW, that'd explain why there are so many fragments.
>>>
>>
>>
>> What exactly does "changes" mean? Write() syscall?
> 
> filefrag reported entries increase, it's using FIEMAP.
> 

So far it sounds like btrfs allocates a new extent on every write to
the journal file. Each journal record itself is relatively small indeed.

> Also with stat I see the times (all three) change on the file. If I go
> to GNOME Terminal and just sudo some command, that itself causes the
> current system.journal file to get all three times modified. It
> happens immediately, there's no delay. So if I'm doing something like
> drm.debug=0x1e, which is spitting a bunch of stuff to dmesg and thus
> the journal, it's just constantly writing stuff to the journal. This
> is without anything running journalctl -f or reading the journal.
> 
>>
>>> #Storage=auto
>>> #Compress=yes
>>> #Seal=yes
>>> #SplitMode=uid
>>> #SyncIntervalSec=5m
>>
>> This controls how often systemd calls fsync() on the currently active
>> journal file. Do you see fsync() every 3 seconds?
> 
> I have no idea if it's fsync or what. How can I tell?
> 

strace -p $(pgrep systemd-journal)

You will not see actual writes as the file is memory mapped, but it
definitely does not fsync() every few seconds.

Is it possible that the btrfs behavior you observe is specific to
memory-mapped file handling?

> Also, I don't think these journal files are being compressed.
> 
> Using the btrfs-progs/btrfs-debugfs script on a few user journal
> files, I'm seeing massive compression ratios. Maybe I'll try
> Compress=No and see if there's a change.
> 

Only actual message payload above some threshold (I think 256 or 512
bytes, not sure) is compressed; everything else is not. For average
syslog-type messages the payload is far too small. This is really only
interesting when you store core dumps or similar.

> file: 
> user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-00059b73-00054d51b3f442ff.journal
> extents 64 disk size 294912 logical size 8388608 ratio 28.44
> file: 
> user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-0002ec5b-00054d4ebb7114e7.journal
> extents 64 disk size 278528 logical size 8388608 ratio 30.12
> file: 
> user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-06e5-00054c3c32607483.journal
> extents 320 disk size 5206016 logical size 41943040 ratio 8.06
> 



Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 9:42 PM, Andrei Borzenkov  wrote:
> On 17.04.2017 22:49, Chris Murphy wrote:
>> On Mon, Apr 17, 2017 at 11:27 AM, Andrei Borzenkov  
>> wrote:
>>> On 17.04.2017 19:25, Chris Murphy wrote:
 This explains one system's fragmented journals; but the other system
 isn't snapshotting journals and I haven't figured out why they're so
 fragmented. No snapshots, and they are all +C at create time
 (systemd-journald default on Btrfs). Is it possible to prevent
 journald from setting +C on /var/log/journal and
 /var/log/journal/? If I remove them, at next boot they get
 reset, so any new journals created inherit that.

>>>
>>> Yes, should be possible by creating empty
>>> /etc/tmpfiles.d/journal-nocow.conf.
>>
>> OK super.
>>
>> How about inhibiting the defragmentation on rotate? I'm suspicious one
>> of the things I'm seeing is due to ssd optimization mount options, but
>> I need to see the predefrag state of the files.
>>
>> Why do I see so many changes to the journal file, once every 2-5
>> seconds? This adds 4096-byte blocks to the file each time, and when
>> they're COW, that'd explain why there are so many fragments.
>>
>
>
> What exactly does "changes" mean? Write() syscall?

filefrag reported entries increase, it's using FIEMAP.

Also with stat I see the times (all three) change on the file. If I go
to GNOME Terminal and just sudo some command, that itself causes the
current system.journal file to get all three times modified. It
happens immediately, there's no delay. So if I'm doing something like
drm.debug=0x1e, which is spitting a bunch of stuff to dmesg and thus
the journal, it's just constantly writing stuff to the journal. This
is without anything running journalctl -f or reading the journal.

>
>> #Storage=auto
>> #Compress=yes
>> #Seal=yes
>> #SplitMode=uid
>> #SyncIntervalSec=5m
>
> This controls how often systemd calls fsync() on the currently active
> journal file. Do you see fsync() every 3 seconds?

I have no idea if it's fsync or what. How can I tell?

Also, I don't think these journal files are being compressed.

Using the btrfs-progs/btrfs-debugfs script on a few user journal
files, I'm seeing massive compression ratios. Maybe I'll try
Compress=No and see if there's a change.

file: 
user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-00059b73-00054d51b3f442ff.journal
extents 64 disk size 294912 logical size 8388608 ratio 28.44
file: 
user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-0002ec5b-00054d4ebb7114e7.journal
extents 64 disk size 278528 logical size 8388608 ratio 30.12
file: 
user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-06e5-00054c3c32607483.journal
extents 320 disk size 5206016 logical size 41943040 ratio 8.06

-- 
Chris Murphy


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 11:27 AM, Andrei Borzenkov  wrote:
> On 17.04.2017 19:25, Chris Murphy wrote:
>> This explains one system's fragmented journals; but the other system
>> isn't snapshotting journals and I haven't figured out why they're so
>> fragmented. No snapshots, and they are all +C at create time
>> (systemd-journald default on Btrfs). Is it possible to prevent
>> journald from setting +C on /var/log/journal and
>> /var/log/journal/? If I remove them, at next boot they get
>> reset, so any new journals created inherit that.
>>
>
> Yes, should be possible by creating empty
> /etc/tmpfiles.d/journal-nocow.conf.

OK super.

How about inhibiting the defragmentation on rotate? I'm suspicious one
of the things I'm seeing is due to ssd optimization mount options, but
I need to see the predefrag state of the files.

Why do I see so many changes to the journal file, once every 2-5
seconds? This adds 4096-byte blocks to the file each time, and when
they're COW, that'd explain why there are so many fragments.

#Storage=auto
#Compress=yes
#Seal=yes
#SplitMode=uid
#SyncIntervalSec=5m
#RateLimitIntervalSec=30s
#RateLimitBurst=1000

A change every 5m is not what I'm seeing with stat. I have no crit,
emerg, or alert messages happening. Just a bunch of drm debug messages
which are constant. But if the flush should only happen every 5
minutes, I'm confused.


-- 
Chris Murphy


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Chris Murphy
Here's an example rotated log (Btrfs, NVMe, no filesystem compression,
default ssd mount option). As you can see, it takes up more space on
disk than it contains data, so there's a lot of slack space for some
reason, despite /etc/systemd/journald.conf being unmodified and thus
Compress=Yes.

file: 
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 41 disk size 143511552 logical size 100663296 ratio 0.70

$ sudo btrfs fi defrag -c
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal

And now:

file: 
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 768 disk size 21504000 logical size 100663296 ratio 4.68

That's nearly 1/7th the original size. The existing defrag without
compression is probably just increasing write amplification on SSDs.
If it's badly fragmented, just leave it alone.

This also works on nocow journals with +C set, although I'm not sure
whether this is intended behavior (I thought nocow implied no
compression), so I've asked about that on the Btrfs list.

Chris Murphy


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Andrei Borzenkov
On 17.04.2017 19:25, Chris Murphy wrote:
> This explains one system's fragmented journals; but the other system
> isn't snapshotting journals and I haven't figured out why they're so
> fragmented. No snapshots, and they are all +C at create time
> (systemd-journald default on Btrfs). Is it possible to prevent
> journald from setting +C on /var/log/journal and
> /var/log/journal/? If I remove them, at next boot they get
> reset, so any new journals created inherit that.
> 

Yes, should be possible by creating empty
/etc/tmpfiles.d/journal-nocow.conf.




Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 3:57 AM, Lennart Poettering
 wrote:

>> I do manual snapshots before software updates, which means new writes
>> to these files are subject to COW, but additional writes to the same
>> extents are overwrites and are not COW because of chattr +C. I've used
>> this same strategy for a long time, since systemd-journald defaults to
>> +C for journal files; but I've not seen them get this fragmented this
>> quickly.
>>
>
> IIRC NOCOW only has an effect if set right after the file is created
> before the first write to it is done. Or in other words, you cannot
> retroactively make a file NOCOW. This means that if you in one way or
> another make a COW copy of a file (through reflinking — implicit or
> not, note that "cp" reflinks by default — or through snapshotting or
> something else) the file is COW and you'll get fragmentation.

Correct.

There are three states for files on Btrfs: cow (normal), nocow (+C),
and a snapshot of a nocow (+C) file, which is "cow and then nocow" or
whatever you want to call it. But yes, a snapshot of a nocow file does
fragment a ton at first, then becomes nocow again and won't fragment
further.

This explains one system's fragmented journals; but the other system
isn't snapshotting journals and I haven't figured out why they're so
fragmented. No snapshots, and they are all +C at create time
(systemd-journald default on Btrfs). Is it possible to prevent
journald from setting +C on /var/log/journal and
/var/log/journal/? If I remove them, at next boot they get
reset, so any new journals created inherit that.

Anyway, snapshots of journals on Btrfs should be avoided for other
reasons too. The autocleaning features (SystemMaxUse=, SystemKeepFree=,
as well as --vacuum-size=) don't work correctly when there are
snapshots of journals. Even when journald deletes journals, their
extents are pinned by snapshots, so they still take up the same space.
Basically, journald could get into a situation where it deletes all the
journals it sees, but no space is freed up because those journals are
stuck in a snapshot.


> I am not entirely sure what to recommend you. Ultimately whether btrfs
> fragments or not, is probably something you have to discuss with the
> btrfs folks. We do try to make the best of btrfs, by managing the COW
> flag, but this only helps you to a limited degree as
> snapshots/reflinks will fuck things up anyway...

Definitely.

An easy solution would be for journald to create
/var/log/journal/ as a subvolume instead of a directory.
This will make journals immune to snapshots of the containing
subvolume (typically the root fs). Of course systemd already makes
subvolumes behind the scenes for other sane reasons, like
/var/lib/machines.

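Something an admin can already do by hand (sketch; journald would do
the equivalent at directory-creation time): move the existing
directory aside, run "btrfs subvolume create /var/log/journal", and
copy the old contents back in.
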
Snapshotting logs strikes me as an invalid use case anyway. Anyone
would want logs immune to rollback; rolling them back would defeat
troubleshooting and auditing. Logs should be linear and continuous,
not rolled back. The snapshotting is arguably a mistake, due to a lack
of user understanding of the consequences. It is admittedly esoteric.


> We also ask btrfs to defrag the file as soon as we mark it as
> archived... I'd even be willing to extend on that, and defrag the file
> on other events too, for example if it ends up being too heavily
> fragmented. But last time I looked btrfs didn't have any nice API for
> that, that would have a clear focus on a single file only...

The biggest issue with them is that they take up a lot of space and
defragment very inconsistently. Depending on the kernel version they
can become magnificently larger.

Speaking of which, even with Compress=Yes (the default), the journal
files are highly compressible. By copying some to a Btrfs volume with
the compress mount option (this does not force compression; it gives up
easily on already-compressed data), I'm finding 4-6x smaller files.
This is the last line for a couple of journals, from
btrfs-progs/btrfs-debugfs:

file: 
system@01b44589014542e3b48df31f152c0916-ca2b-00054546539416e8.journal
extents 384 disk size 9691136 logical size 50331648 ratio 5.19
file: 
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 768 disk size 21504000 logical size 100663296 ratio 4.68

If there were a way to apply this compression when rotating logs
(read, compress, write), defragmentation wouldn't be needed on Btrfs,
and all file systems would gain the benefit of much smaller logs.


-- 
Chris Murphy


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Kai Krakow
On Mon, 17 Apr 2017 16:01:48 +0200, Kai Krakow wrote:

> > We also ask btrfs to defrag the file as soon as we mark it as
> > archived...  
> 
> This makes sense. And I've learned that the journal on btrfs works much
> better if you use many small files vs. a few big files. I've currently
> set the journal size limit to 8 MB for that reason, which gives me very
> good performance.

Hmm well, I just looked: I eventually stopped doing that, probably when
you introduced defragging of the archived journals. But I see no
journal file bigger than 128M, which seems to work well.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Kai Krakow
On Mon, 17 Apr 2017 11:57:21 +0200, Lennart Poettering wrote:

> On Sun, 16.04.17 14:30, Chris Murphy (li...@colorremedies.com) wrote:
> 
> > Hi,
> > 
> > This is on a Fedora 26 workstation (systemd-233-3.fc26.x86_64)
> > that's maybe a couple weeks old and was clean installed. Drive is
> > NVMe.
> > 
> > 
> > # filefrag *
> > system.journal: 9283 extents found
> > user-1000.journal: 3437 extents found
> > # lsattr
> > C-- ./system.journal
> > C-- ./user-1000.journal
> > 
> > I do manual snapshots before software updates, which means new
> > writes to these files are subject to COW, but additional writes to
> > the same extents are overwrites and are not COW because of chattr
> > +C. I've used this same strategy for a long time, since
> > systemd-journald defaults to +C for journal files; but I've not
> > seen them get this fragmented this quickly.
> >  
> 
> IIRC NOCOW only has an effect if set right after the file is created
> before the first write to it is done. Or in other words, you cannot
> retroactively make a file NOCOW. This means that if you in one way or
> another make a COW copy of a file (through reflinking — implicit or
> not, note that "cp" reflinks by default — or through snapshotting or
> something else) the file is COW and you'll get fragmentation.

To mark a file nocow, it has to exist with zero bytes and never have
been written to. The nocow attribute (chattr +C) is inherited from the
directory upon creation of a file. So the best way to go is setting +C
on the directory; all future journal files would then be nocow.

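In practice that's a one-time "chattr +C /var/log/journal" (verify with
"lsattr -d /var/log/journal"); files created in the directory
afterwards inherit the flag.
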
You can still do snapshots, nocow doesn't prohibit that and doesn't
make journals cow again. What happens is that btrfs simply unshares
extents as soon as you write to the snapshot. The newly created extent
itself will behave like nocow again. If the extents are big enough,
this shouldn't introduce any serious fragmentation, just waste space.
Btrfs won't split extents upon unsharing them during a write. It may,
however, "replace" only part of the unshared extent, thus making three
new ones: two sharing the old copy, one holding the new data. But since
journals are append-only, that should be no problem. It's just that the
data is written so slowly that writes almost never get combined into a
single write, resulting in many extents.

> I am not entirely sure what to recommend you. Ultimately whether btrfs
> fragments or not, is probably something you have to discuss with the
> btrfs folks. We do try to make the best of btrfs, by managing the COW
> flag, but this only helps you to a limited degree as
> snapshots/reflinks will fuck things up anyway...

Well, usually you shouldn't have to manage the cow flag at all: just
set it once for the newly created journal directory and everything is
fine. And even then, people may not want this, so they could easily
unset the flag on the directory and rotate the journal.

> We also ask btrfs to defrag the file as soon as we mark it as
> archived...

This makes sense. And I've learned that the journal on btrfs works much
better if you use many small files vs. a few big files. I've currently
set the journal size limit to 8 MB for that reason, which gives me very
good performance.

> I'd even be willing to extend on that, and defrag the file
> on other events too, for example if it ends up being too heavily
> fragmented.

Since the append behavior of btrfs is so bad wrt journal files, it
should be enough to simply let btrfs defrag the previously written
journal block when appending to the file: Lennart, I think you are
hinting to the OS that the file is going to grow, and thus extending it
to 8 MB beyond the current end of file to continue writing. That would
be a good event to let btrfs defrag the old 8 MB block (and just that,
not the complete file). If this works well, you could maybe skip
defragging the complete file upon rotation, which should improve disk
IO performance during rotation.

I think the default extent size hint for defragging with btrfs defrag
has been set to 32 MB lately, so it would be enough to maybe do the
above step every 32 MB.

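(The target extent size can also be passed to the defrag command
explicitly, e.g. "btrfs filesystem defragment -t 32M <file>".)
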
> But last time I looked btrfs didn't have any nice API for
> that, that would have a clear focus on a single file only...

The high number of extents may not be an indicator of fragmentation
when btrfs compression is used. Compressed data is organized in logical
128k units, which are reported as separate fragments to filefrag; in
reality they may be laid out contiguously on disk, so there is no
actual fragmentation. It would be interesting to see the blockmap of
this.

-- 
Regards,
Kai

Replies to list-only preferred.




Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Lennart Poettering
On Sun, 16.04.17 14:30, Chris Murphy (li...@colorremedies.com) wrote:

> Hi,
> 
> This is on a Fedora 26 workstation (systemd-233-3.fc26.x86_64) that's
> maybe a couple weeks old and was clean installed. Drive is NVMe.
> 
> 
> # filefrag *
> system.journal: 9283 extents found
> user-1000.journal: 3437 extents found
> # lsattr
> C-- ./system.journal
> C-- ./user-1000.journal
> 
> I do manual snapshots before software updates, which means new writes
> to these files are subject to COW, but additional writes to the same
> extents are overwrites and are not COW because of chattr +C. I've used
> this same strategy for a long time, since systemd-journald defaults to
> +C for journal files; but I've not seen them get this fragmented this
> quickly.
>

IIRC NOCOW only has an effect if set right after the file is created
before the first write to it is done. Or in other words, you cannot
retroactively make a file NOCOW. This means that if you in one way or
another make a COW copy of a file (through reflinking — implicit or
not, note that "cp" reflinks by default — or through snapshotting or
something else) the file is COW and you'll get fragmentation.

I am not entirely sure what to recommend you. Ultimately whether btrfs
fragments or not, is probably something you have to discuss with the
btrfs folks. We do try to make the best of btrfs, by managing the COW
flag, but this only helps you to a limited degree as
snapshots/reflinks will fuck things up anyway...

We also ask btrfs to defrag the file as soon as we mark it as
archived... I'd even be willing to extend on that, and defrag the file
on other events too, for example if it ends up being too heavily
fragmented. But last time I looked btrfs didn't have any nice API for
that, that would have a clear focus on a single file only...

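The defrag request itself is trivial btw - a sketch, assuming the plain
single-file defrag ioctl and an example path:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

int main(void) {
    /* Ask btrfs to defragment one file; with a NULL argument the
     * kernel picks its default range and extent-size settings. */
    int fd = open("/var/log/journal/system.journal", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ioctl(fd, BTRFS_IOC_DEFRAG, NULL) < 0)
        perror("BTRFS_IOC_DEFRAG");
    close(fd);
    return 0;
}
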
Lennart

-- 
Lennart Poettering, Red Hat


[systemd-devel] journal fragmentation on Btrfs

2017-04-16 Thread Chris Murphy
Hi,

This is on a Fedora 26 workstation (systemd-233-3.fc26.x86_64) that's
maybe a couple weeks old and was clean installed. Drive is NVMe.


# filefrag *
system.journal: 9283 extents found
user-1000.journal: 3437 extents found
# lsattr
C-- ./system.journal
C-- ./user-1000.journal

I do manual snapshots before software updates, which means new writes
to these files are subject to COW, but additional writes to the same
extents are overwrites and are not COW because of chattr +C. I've used
this same strategy for a long time, since systemd-journald defaults to
+C for journal files; but I've not seen them get this fragmented this
quickly.

Meanwhile, on a Fedora 25 Server (systemd-231-14.fc25.x86_64, SD card
based), I've made a modification where /var/log is a nested subvolume,
so that when I snapshot the root subvolume the contents of /var/log are
not snapshotted. Therefore these files should always be no-COW, and yet
they too are rather fragmented.

# filefrag *
system@00054c130c57bb79-5df6c2871d1edf1e.journal~: 1 extent found
system@00054cb3cd18d71b-6a815220d62cc6ea.journal~: 1 extent found
system@01b44589014542e3b48df31f152c0916-0001-000542e1fb4550e7.journal:
1 extent found
system@01b44589014542e3b48df31f152c0916-ca2b-00054546539416e8.journal:
1 extent found
system@01b44589014542e3b48df31f152c0916-000198f3-000547aac217c85b.journal:
1 extent found
system.journal: 2992 extents found
user-1000@00054c130a314ee9-4bb9fd0a9268dc1c.journal~: 1 extent found
user-1000@ac4b2e5ded7d4e0dbcac6fc45430c857-05a9-000542e1fe209094.journal:
1 extent found
user-1000@ac4b2e5ded7d4e0dbcac6fc45430c857-cafe-0005454b13a0349f.journal:
1 extent found
user-1000@ac4b2e5ded7d4e0dbcac6fc45430c857-0001abe0-0005482397f286a5.journal:
1 extent found
user-1000.journal: 405 extents found

What's going on is that there are many 4096-byte extents. Maybe this
is a consequence of frequent fsync?

On the plus side, even after a 'reboot -f' or a forced power off, I get
pretty much everything within the last few seconds in the journal on
the next boot. That's pretty good. Maybe doing better is too much
hassle - like not fsyncing on Btrfs and just letting its normal 30s
commit interval apply; if things start crashing, journald could start
fsyncing... some sort of dynamic trigger.

There could be 8000 things higher priority than this, though; this
isn't broken.

Output from
# filefrag -v system.journal
# btrfs-debugfs -f system.journal

https://drive.google.com/open?id=0B_2Asp8DGjJ9UEdyVFRfU0c2V2s


-- 
Chris Murphy