Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Chris Murphy
On Fri, Feb 5, 2021 at 3:55 PM Lennart Poettering
 wrote:
>
> On Fr, 05.02.21 20:58, Maksim Fomin (ma...@fomin.one) wrote:
>
> > > You know, we issue the btrfs ioctl, under the assumption that if the
> > > file is already perfectly defragmented it's a NOP. Are you suggesting
> > > it isn't a NOP in that case?
> >
> > So, what is the reason for defragmenting the journal if BTRFS is
> > detected? This does not happen on other filesystems. I have read
> > this thread but have not found a clear answer to this question.
>
> btrfs, like any file system, fragments files a bit with nocow. Without
> nocow (i.e. with cow) it fragments files horribly, given our write
> pattern (which is: append something to the end, and update a few
> pointers in the beginning). By upstream default we set nocow; some
> downstreams/users undo that, however. (This is done via tmpfiles,
> i.e. journald itself never actually sets nocow.)

I don't see why it's upstream's problem to solve downstream decisions.
If they want to (re)enable datacow, then they can also setup some kind
of service to defragment /var/log/journal/ on a schedule, or they can
use autodefrag.


> When we archive a journal file (i.e. stop writing to it) we know it
> will never receive any further writes. It's a good time to undo the
> fragmentation (we make no distinction here between heavily fragmented,
> lightly fragmented or not fragmented at all) and thus make future
> access behaviour better, given that we'll still access the
> file regularly (because archiving in journald doesn't mean we stop
> reading it, it just means we stop writing it — journalctl always
> operates on the full data set). Defragmentation happens in the
> background once triggered; it's a simple ioctl you can invoke on a
> file. If the file is not fragmented it shouldn't do anything.

ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=16777216, flags=0,
extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0

What 'len' value does journald use?
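
For context, a minimal C sketch of how the BTRFS_IOC_DEFRAG_RANGE ioctl
shown in that strace can be issued from userspace. The start, len and
extent_thresh values simply mirror the strace above; this is an
illustration, not journald's code, and the 'len' journald really uses is
exactly the open question here.

/* defrag_range.c - illustrative only; build with: cc -o defrag_range defrag_range.c */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file-on-btrfs>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct btrfs_ioctl_defrag_range_args args;
    memset(&args, 0, sizeof(args));
    args.start = 0;
    args.len = 16ULL * 1024 * 1024;          /* 16 MiB, as in the strace above */
    args.extent_thresh = 32 * 1024 * 1024;   /* skip extents already >= 32 MiB */
    args.compress_type = 0;                  /* no recompression (BTRFS_COMPRESS_NONE) */

    if (ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &args) < 0)
        perror("BTRFS_IOC_DEFRAG_RANGE");

    close(fd);
    return 0;
}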

> other file systems simply have no such ioctl, and they never fragment
> as terribly as btrfs can. Hence we don't call that ioctl.

I did explain how to avoid the fragmentation in the first place, to
obviate the need to defragment.

1. nodatacow. journald does this already.
2. fallocate the intended final journal file size from the start,
instead of growing it in 8MB increments (points 1 and 2 are sketched
below).
3. Don't reflink copy (including snapshot) the journals. This is
arguably not journald's responsibility, but since it creates both the
journal/ directory and the $MACHINEID directory, it could create one or
both of them as subvolumes instead to ensure they're not subject to
snapshotting from above.
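
A rough sketch of points 1 and 2, assuming a 128 MiB final size and that
the NOCOW attribute is set while the file is still empty (it has no
effect once data has been written). journald gets the attribute via
tmpfiles.d rather than from code, so this is an illustration of the
technique, not its implementation.

/* prealloc_nocow.c - illustrative only */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

#define FINAL_SIZE (128ULL * 1024 * 1024)    /* assumed final journal size */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <new-file-on-btrfs>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR | O_CREAT | O_EXCL, 0640);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* chattr +C equivalent; only meaningful on btrfs, may fail elsewhere. */
    int attrs = 0;
    if (ioctl(fd, FS_IOC_GETFLAGS, &attrs) == 0) {
        attrs |= FS_NOCOW_FL;
        if (ioctl(fd, FS_IOC_SETFLAGS, &attrs) < 0)
            perror("FS_IOC_SETFLAGS");       /* non-fatal */
    }

    /* One allocation for the whole file: typically just a few extents,
     * instead of one new extent per 8MB append. */
    if (fallocate(fd, 0, 0, FINAL_SIZE) < 0)
        perror("fallocate");

    close(fd);
    return 0;
}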


> I'd even be fine dropping it
> entirely, if someone can actually show that the benefits of having the
> files unfragmented when archived don't outweigh the downside of
> generating some iops when executing the defragmentation.

I showed that the archived journals have way more fragmentation than
active journals. And the fragments in active journals are
insignificant, and can even be reduced by fully allocating the journal
file to its final size rather than appending - which has a good chance of
fragmenting the file on any file system, not just Btrfs.

Further, even *despite* this worse fragmentation of the archived
journals, bcc-tools fileslower shows no meaningful latency as a
result. I wrote this in the previous email. I don't understand what
you want me to show you.

And since journald offers no ability to disable the defragmentation on
Btrfs, I can't really do a longer-term A/B comparison, can I?


> i.e. someone
> does some profiling, on both ssd and rotating media. Apparently no one
> who cares about this wants to do such research though, and
> hence I remain deeply unimpressed. Let's not try to do such
> optimizations without any data that actually shows it improves things.

I did provide data. That you don't like what the data shows (archived
journals have more fragments than active journals) is not my fault.
The existing "optimization" is making things worse, in addition to
adding a pile of unnecessary writes upon journal rotation.

Conversely, you have not provided data proving that nodatacow
fallocated files on Btrfs are any more fragmented than fallocated
files on ext4 or XFS.

2-17 fragments on ext4:
https://pastebin.com/jiPhrDzG
https://pastebin.com/UggEiH2J

That behavior is no different for nodatacow fallocated journals on
Btrfs. There's no point in defragmenting these no matter the file
system. I don't have to profile this on an HDD; I know that even in the
best case you're not likely (and certainly not guaranteed) to get
fewer fragments than this. Defrag on Btrfs is for the
thousands-of-fragments case, which is what you get with datacow journals.
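
The fragment counts being compared here come from filefrag; for
completeness, the same number can be read programmatically via the
FIEMAP ioctl, as in this small sketch (assumes a filesystem that
implements FIEMAP, which btrfs, ext4 and XFS all do).

/* count_extents.c - rough equivalent of `filefrag <file>` */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct fiemap fm;
    memset(&fm, 0, sizeof(fm));
    fm.fm_start = 0;
    fm.fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
    fm.fm_extent_count = 0;             /* 0: only report how many extents exist */

    if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
        perror("FS_IOC_FIEMAP");
        close(fd);
        return 1;
    }

    printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
    close(fd);
    return 0;
}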



-- 
Chris Murphy


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Lennart Poettering
On Fr, 05.02.21 20:58, Maksim Fomin (ma...@fomin.one) wrote:

> > You know, we issue the btrfs ioctl, under the assumption that if the
> > file is already perfectly defragmented it's a NOP. Are you suggesting
> > it isn't a NOP in that case?
>
> So, what is the reason for defragmenting the journal if BTRFS is
> detected? This does not happen on other filesystems. I have read
> this thread but have not found a clear answer to this question.

btrfs, like any file system, fragments files a bit with nocow. Without
nocow (i.e. with cow) it fragments files horribly, given our write
pattern (which is: append something to the end, and update a few
pointers in the beginning). By upstream default we set nocow; some
downstreams/users undo that, however. (This is done via tmpfiles,
i.e. journald itself never actually sets nocow.)

When we archive a journal file (i.e. stop writing to it) we know it
will never receive any further writes. It's a good time to undo the
fragmentation (we make no distinction here between heavily fragmented,
lightly fragmented or not fragmented at all) and thus make future
access behaviour better, given that we'll still access the
file regularly (because archiving in journald doesn't mean we stop
reading it, it just means we stop writing it — journalctl always
operates on the full data set). Defragmentation happens in the
background once triggered; it's a simple ioctl you can invoke on a
file. If the file is not fragmented it shouldn't do anything.

other file systems simply have no such ioctl, and they never fragment
as terribly as btrfs can. Hence we don't call that ioctl.

I'd be fine avoiding the ioctl if we knew for sure the file is at
worst mildly fragmented, but apparently btrfs is too broken to be able
to implement something like that. I'd even be fine dropping it
entirely, if someone can actually show that the benefits of having the
files unfragmented when archived don't outweigh the downside of
generating some iops when executing the defragmentation, i.e. someone
does some profiling, on both ssd and rotating media. Apparently no one
who cares about this wants to do such research though, and
hence I remain deeply unimpressed. Let's not try to do such
optimizations without any data that actually shows it improves things.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Lennart Poettering
On Fr, 05.02.21 16:16, Phillip Susi (ph...@thesusis.net) wrote:

>
> Lennart Poettering writes:
>
> > Nope. We always interleave stuff. We currently open all journal files
> > in parallel. The system one and the per-user ones, the current ones
> > and the archived ones.
>
> Wait... every time you look at the journal at all, it has to read back
> through ALL of the archived journals, even if you are only interested in
> information since the last boot that just happened 5 minutes ago?

No, we do not iterate through them. We just read some metadata off the header.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Lennart Poettering writes:

> journalctl gives you one long continuous log stream, joining everything
> available, archived or not, into one big interleaved stream.

If you ask for everything, yes... but if you run journalctl -b then
shouldn't it only read back until it finds the start of the current
boot?


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Lennart Poettering
On Fr, 05.02.21 20:43, Dave Howorth (syst...@howorth.org.uk) wrote:

> 128 MB files, and I might allocate an extra MB or two for overhead, I
> don't know. So when it first starts there'll be 128 MB allocated and
> 384 MB free. In stable state there'll be 512 MB allocated and nothing
> free. One 128 MB allocated and slowly being used. 384 MB full of
> archive files. You always have between 384 MB and 512 MB of logs
> stored. I don't understand where you're getting your numbers from.

As mentioned elsewhere: we typically have to remove two "almost 128M"
files to get space for "exactly 128M" of guaranteed space.

And you know, each user gets their own journal. Hence, once a single
user logs a single line another 128M is gone, and if another user then
does it, bam, another 128M is gone.

We can't eat space away like that.

> If you can't figure out which parts of an archived file are useful and
> which aren't then why are you keeping them? Why not just delete them?
> And if you can figure it out then why not do so and compact the useful
> information into the minimum storage?

We archive for multiple reasons: because the file was dirty when we
started up (in which case there apparently was an abnormal shutdown of
the system or journald), or because we rotate and start a new file (or
a time change or whatnot). In the first ("dirty") case we don't touch
the file at all, because it's likely corrupt and we don't want to
corrupt it further. We just rename it so that it gets "~" at the
end. When we archive the "clean" way we mark the file internally as
archived, but before that we sync everything to disk, so that we know
for sure it's all in a good state, and then we don't touch it anymore.

"journalctl" will process all these files, regardless if "dirty"
archived or "clean" archived. It tries hard to make the best of these
files, and varirous codepaths to make sure we don't get confused by
half-written files, and can use as much as possible of the parts that
were written correctly.

Hence, that's why we don't delete corrupted files: we use as
much of them as we can. Why? Because usually the logs from shortly before
your system died abnormally are the most interesting.

> > Because fs metadata, and because we don't always write files in
> > full. I mean, we often do not, because we start a new file *before*
> > the file would grow beyond the threshold. This typically means that
> > it's not enough to delete a single file to get the space we
> > need for a full new one; we usually need to delete two.
>
> Why would you start a new file before the old one is full?

Various reasons: the user asked for rotation or vacuuming, because of an
abnormal shutdown, because of a time change (we want individual files to be
monotonically ordered), …

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Maksim Fomin writes:
> I would say it depends on whether the defragmentation issues are a feature
> of btrfs. As Chris mentioned, if the root fs is snapshotted,
> 'defragmenting' the journal can actually increase fragmentation. This
> is an example where the problem is caused by a feature (not a bug) of
> btrfs. For example, my 'system.journal' file is currently 16 MB and
> according to filefrag it has 1608 extents (a consequence of the
> snapshotted rootfs?). That looks like too much, if I am not missing
> some technical details.

Holy smokes!  How did btrfs manage to butcher that poor file that badly?
It shouldn't be possible for it to be *that* bad.  I mean, that's only
an average of 10 KB per fragment!


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Dave Howorth writes:

> PS I'm subscribed to the list. I don't need a copy.

FYI, rather than ask others to go out of their way when replying to you,
you should configure your mail client to set the Reply-To: header to
point to the mailing list address so that other people's mail clients do
what you want automatically.


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Lennart Poettering writes:

> Nope. We always interleave stuff. We currently open all journal files
> in parallel. The system one and the per-user ones, the current ones
> and the archived ones.

Wait... every time you look at the journal at all, it has to read back
through ALL of the archived journals, even if you are only interested in
information since the last boot that just happened 5 minutes ago?



Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Maksim Fomin
‐‐‐ Original Message ‐‐‐
On Friday, February 5, 2021 3:23 PM, Lennart Poettering 
 wrote:

> On Do, 04.02.21 12:51, Chris Murphy (li...@colorremedies.com) wrote:
>
> > On Thu, Feb 4, 2021 at 6:49 AM Lennart Poettering
> > lenn...@poettering.net wrote:
> >
> > > You want to optimize write patterns I understand, i.e. minimize
> > > iops. Hence start with profiling iops, i.e. what defrag actually costs,
> > > and then weigh that against the reduced access time when accessing the
> > > files. In particular on rotating media.
> >
> > A nodatacow journal on Btrfs is no different than a journal on ext4 or
> > xfs. So I don't understand why you think you also need to defragment
> > the file, only on Btrfs. You cannot do better than you already are
> > with a nodatacow file. That file isn't going to get any more fragmented
> > in use than it was at creation.
>
> You know, we issue the btrfs ioctl, under the assumption that if the
> file is already perfectly defragmented it's a NOP. Are you suggesting
> it isn't a NOP in that case?

So, what is the reason for defragmenting the journal if BTRFS is detected? This
does not happen on other filesystems. I have read this thread but have not found
a clear answer to this question.

> > But it gets worse. The way systemd-journald is submitting the journals
> > for defragmentation is making them more fragmented than just leaving
> > them alone.
>
> Sounds like a bug in btrfs? systemd is not the place to hack around
> btrfs bugs?

I would say it depends on whether the defragmentation issues are a feature of
btrfs. As Chris mentioned, if the root fs is snapshotted, 'defragmenting' the
journal can actually increase fragmentation. This is an example where the
problem is caused by a feature (not a bug) of btrfs. For example, my
'system.journal' file is currently 16 MB and according to filefrag it has 1608
extents (a consequence of the snapshotted rootfs?). That looks like too much, if
I am not missing some technical details (perhaps a filefrag 'extent' is not a
real extent in the case of this fs?). Even if it is a bug in btrfs, it would
make sense to temporarily disable the 'defragment only on BTRFS' policy in
systemd.

I am interested in this issue because for some time (probably from late 2017
till late 2019) I had strange issues with systemd-journald crashing at boot
time because of archiving/defragmenting the journal. The setup was as follows:
btrfs on an external hd (not ssd) with full disk encryption. After a mistaken
disconnection of the mounted disk (but not in all such cases) systemd-journald
caused a very long lock-up of the boot process because of the following loop:
systemd-journald tries to archive/defragment journal files -> it crashes for
some reason -> systemd restarts systemd-journald -> it starts
archiving/defragmenting journal files -> it crashes again -> systemd restarts
systemd-journald (my understanding of the logs after boot). Eventually this
loop breaks and the boot process continues. After login I see that the journal
data is fine - at least there is no evidence of journal data corruption - so I
presume the problem was caused by the archiving/defragmentation policy on
btrfs. I used this disk with an ext4 filesystem from 2014 to 2017 and never had
any problem like that. Eventually I decided to buy a better disk and the
problem has vanished since then, but why systemd defragments the journal only
on btrfs remained a mystery to me.



Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Dave Howorth
On Fri, 5 Feb 2021 17:44:14 +0100
Lennart Poettering  wrote:
> On Fr, 05.02.21 16:06, Dave Howorth (syst...@howorth.org.uk) wrote:
> 
> > On Fri, 5 Feb 2021 16:23:02 +0100
> > Lennart Poettering  wrote:  
> > > I don't think that makes much sense: we rotate and start new
> > > files for a multitude of reasons, such as size overrun, time
> > > jumps, abnormal shutdown and so on. If we'd always leave a fully
> > > allocated file around people would hate us...  
> >
> > I'm not sure about that. The file is eventually going to grow to
> > 128 MB so if there isn't space for it, I might as well know right
> > now as later. And it's not like the space will be available for
> > anything else, it's left free for exactly this log file.  
> 
> let's say you assign 500M space to journald. If you allocate 128M at a
> time, this means the effective unused space is anything between 1M and
> 255M, leaving just 256M of logs around. it's probably surprising that
> you only end up with 255M of logs when you asked for 500M. I'd claim
> that's really shitty behaviour.

If you assign 500 MB for something that accommodates multiples of 128
MB then you're not very bright :) 512 MB by contrast can accommodate
four 128 MB files, and I might allocate an extra MB or two for overhead, I
don't know. So when it first starts there'll be 128 MB allocated and
384 MB free. In steady state there'll be 512 MB allocated and nothing
free: one 128 MB file allocated and slowly being filled, and 384 MB full of
archived files. You always have between 384 MB and 512 MB of logs
stored. I don't understand where you're getting your numbers from.

BTW, I expect my linux systems to stay up from when they're booted
until I tell them to stop, and that's usually quite a while.

> > Or are you talking about left over files after some exceptional
> > event that are only part full? If so, then just deallocate the
> > unwanted empty space from them after you've recovered from the
> > exceptional event.  
> 
> Nah, it doesn't work like this: if a journal file isn't marked clean,
> i.e. was left in some half-written state we won't touch it, but just
> archive it and start a new one. We don't know how much was correctly
> written and how much was not, hence we can't sensibly truncate it. The
> kernel after all is entirely free to decide in which order it syncs
> written blocks to disk, and hence it quite often happens that stuff at
> the end got synced while stuff in the middle didn't.

If you can't figure out which parts of an archived file are useful and
which aren't then why are you keeping them? Why not just delete them?
And if you can figure it out then why not do so and compact the useful
information into the minimum storage?

> > > Also, we vacuum old journals when allocating and the size
> > > constraints are hit. i.e. if we detect that adding 8M to journal
> > > file X would mean the space used by all journals together would
> > > be above the configured disk usage limits we'll delete the oldest
> > > journal files we can, until we can allocate 8M again. And we do
> > > this each time. If we'd allocate the full file all the time this
> > > means we'll likely remove ~256M of logs whenever we start a new
> > > file. And that's just shitty behaviour.  
> >
> > No it's not; it's exactly what happens most of the time, because all
> > the old log files are exactly the same size because that's why they
> > were rolled over. So freeing just one of those gives exactly the
> > right size space for the new log file. I don't understand why you
> > would want to free two?  
> 
> Because fs metadata, and because we don't always write files in
> full. I mean, we often do not, because we start a new file *before*
> the file would grow beyond the threshold. This typically means that
> it's not enough to delete a single file to get the space we
> need for a full new one; we usually need to delete two.

Why would you start a new file before the old one is full? Modulo truly
exceptional events. It's a genuine question - I don't think I've ever
seen it. And sure fs metadata - that just means allocate a bit extra
beyond the round number.

> actually it's even worse: btrfs lies in "df": it only updates counters
> with uncontrolled latency, hence we might actually delete more than
> necessary.

Sorry dunno much about btrfs. I'm planning to get rid of it here soon.

> Lennart

PS I'm subscribed to the list. I don't need a copy.


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Lennart Poettering
On Fr, 05.02.21 16:06, Dave Howorth (syst...@howorth.org.uk) wrote:

> On Fri, 5 Feb 2021 16:23:02 +0100
> Lennart Poettering  wrote:
> > I don't think that makes much sense: we rotate and start new files for
> > a multitude of reasons, such as size overrun, time jumps, abnormal
> > shutdown and so on. If we'd always leave a fully allocated file around
> > people would hate us...
>
> I'm not sure about that. The file is eventually going to grow to 128 MB
> so if there isn't space for it, I might as well know right now as
> later. And it's not like the space will be available for anything else,
> it's left free for exactly this log file.

let's say you assign 500M space to journald. If you allocate 128M at a
time, this means the effective unused space is anything between 1M and
255M, leaving just 256M of logs around. it's probably surprising that
you only end up with 255M of logs when you asked for 500M. I'd claim
that's really shitty behaviour.

> Or are you talking about left over files after some exceptional event
> that are only part full? If so, then just deallocate the unwanted empty
> space from them after you've recovered from the exceptional event.

Nah, it doesn't work like this: if a journal file isn't marked clean,
i.e. was left in some half-written state, we won't touch it, but just
archive it and start a new one. We don't know how much was correctly
written and how much was not, hence we can't sensibly truncate it. The
kernel after all is entirely free to decide in which order it syncs
written blocks to disk, and hence it quite often happens that stuff at
the end got synced while stuff in the middle didn't.

> > Also, we vacuum old journals when allocating and the size constraints
> > are hit. i.e. if we detect that adding 8M to journal file X would mean
> > the space used by all journals together would be above the configured
> > disk usage limits we'll delete the oldest journal files we can, until
> > we can allocate 8M again. And we do this each time. If we'd allocate
> > the full file all the time this means we'll likely remove ~256M of
> > logs whenever we start a new file. And that's just shitty behaviour.
>
> No it's not; it's exactly what happens most of the time, because all
> the old log files are exactly the same size because that's why they
> were rolled over. So freeing just one of those gives exactly the right
> size space for the new log file. I don't understand why you would want
> to free two?

Because fs metadata, and because we don't always write files in
full. I mean, we often do not, because we start a new file *before*
the file would grow beyond the threshold. This typically means that
it's not enough to delete a single file to get the space we
need for a full new one; we usually need to delete two.

actually it's even worse: btrfs lies in "df": it only updates counters
with uncontrolled latency, hence we might actually delete more than
necessary.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Dave Howorth
On Fri, 5 Feb 2021 16:23:02 +0100
Lennart Poettering  wrote:
> I don't think that makes much sense: we rotate and start new files for
> a multitude of reasons, such as size overrun, time jumps, abnormal
> shutdown and so on. If we'd always leave a fully allocated file around
> people would hate us...

I'm not sure about that. The file is eventually going to grow to 128 MB
so if there isn't space for it, I might as well know right now as
later. And it's not like the space will be available for anything else,
it's left free for exactly this log file.

Or are you talking about left over files after some exceptional event
that are only part full? If so, then just deallocate the unwanted empty
space from them after you've recovered from the exceptional event.

> Also, we vacuum old journals when allocating and the size constraints
> are hit. i.e. if we detect that adding 8M to journal file X would mean
> > the space used by all journals together would be above the configured
> disk usage limits we'll delete the oldest journal files we can, until
> we can allocate 8M again. And we do this each time. If we'd allocate
> the full file all the time this means we'll likely remove ~256M of
> logs whenever we start a new file. And that's just shitty behaviour.

No it's not; it's exactly what happens most of the time, because all
the old log files are exactly the same size because that's why they
were rolled over. So freeing just one of those gives exactly the right
size space for the new log file. I don't understand why you would want
to free two?


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Lennart Poettering
On Fr, 05.02.21 10:24, Phillip Susi (ph...@thesusis.net) wrote:

>
> Lennart Poettering writes:
>
> > You are focussing only on the one-time iops generated during archival,
> > and are ignoring the extra latency during access that fragmented files
> > cost. Show me that the iops reduction during the one-time operation
> > matters and the extra latency during access doesn't matter and we can
> > look into making changes. But without anything resembling any form of
> > profiling we are just blind people in the fog...
>
> I'm curious why you seem to think that latency accessing old logs is so
> important.  I would think that old logs tend to be accessed very
> rarely.  On such a rare occasion, a few extra ms doesn't seem very
> important to me.  Even if it's on a 5400 rpm drive, typical latency is
> what?  8 ms?  Even with a fragment every 8 MB, that's only going to add
> up to an extra 128 ms to read and parse a 128 MB log file. Even with no
> fragments it's going to take over 1 second to read that file, so we're
> only talking about a ~11% slow down here, on an operation that is rare
> and you're going to be spending far more time actually looking at the
> log than it took to read off the disk.

journalctl gives you one long continuous log stream, joining everything
available, archived or not, into one big interleaved stream.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Lennart Poettering writes:

> You are focussing only on the one-time iops generated during archival,
> and are ignoring the extra latency during access that fragmented files
> cost. Show me that the iops reduction during the one-time operation
> matters and the extra latency during access doesn't matter and we can
> look into making changes. But without anything resembling any form of
> profiling we are just blind people in the fog...

I'm curious why you seem to think that latency accessing old logs is so
important.  I would think that old logs tend to be accessed very
rarely.  On such a rare occasion, a few extra ms doesn't seem very
important to me.  Even if it's on a 5400 rpm drive, typical latency is
what?  8 ms?  Even with a fragment every 8 MB, that's only going to add
up to an extra 128 ms to read and parse a 128 MB log file.  Even with no
fragments it's going to take over 1 second to read that file, so we're
only talking about a ~11% slow down here, on an operation that is rare
and you're going to be spending far more time actually looking at the
log than it took to read off the disk.
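
Spelling out that estimate with assumed numbers (8 ms per seek and
roughly 100 MiB/s sequential throughput for a 5400 rpm drive; both are
assumptions, not measurements):

/* latency_estimate.c - back-of-the-envelope numbers from the paragraph above */
#include <stdio.h>

int main(void)
{
    const double file_mib = 128.0, frag_every_mib = 8.0;
    const double seek_ms = 8.0, throughput_mib_s = 100.0;         /* assumptions */

    double fragments = file_mib / frag_every_mib;                 /* 16 fragments */
    double seek_penalty_ms = fragments * seek_ms;                 /* 128 ms */
    double sequential_ms = file_mib / throughput_mib_s * 1000.0;  /* ~1.3 s */

    printf("seek penalty: %.0f ms on top of a %.0f ms sequential read (~%.0f%% slower)\n",
           seek_penalty_ms, sequential_ms,
           100.0 * seek_penalty_ms / sequential_ms);
    return 0;
}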


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Chris Murphy writes:

> But it gets worse. The way systemd-journald is submitting the journals
> for defragmentation is making them more fragmented than just leaving
> them alone.

Wait, doesn't it just create a new file, fallocate the whole thing, copy
the contents, and delete the original?  How can that possibly make
fragmentation *worse*?
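
For reference, the rewrite-style approach assumed here would look
roughly like the sketch below: copy into a preallocated temporary file
and rename it over the original. This is an illustration of that
assumption, not what BTRFS_IOC_DEFRAG_RANGE actually does internally.

/* rewrite_compact.c - illustrative sketch only */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <tmpfile-on-same-fs>\n", argv[0]);
        return 1;
    }

    int in = open(argv[1], O_RDONLY);
    if (in < 0) { perror("open source"); return 1; }

    struct stat st;
    if (fstat(in, &st) < 0) { perror("fstat"); return 1; }

    int out = open(argv[2], O_WRONLY | O_CREAT | O_EXCL, st.st_mode & 07777);
    if (out < 0) { perror("open tmp"); return 1; }

    /* Reserve the full size up front so the copy lands in few extents. */
    if (st.st_size > 0 && fallocate(out, 0, 0, st.st_size) < 0)
        perror("fallocate");

    /* Plain read/write copy; copy_file_range() is avoided because on btrfs
     * it may reflink and keep the old extents, defeating the purpose. */
    char buf[64 * 1024];
    ssize_t n;
    while ((n = read(in, buf, sizeof(buf))) > 0)
        if (write(out, buf, (size_t)n) != n) { perror("write"); return 1; }
    if (n < 0) { perror("read"); return 1; }

    if (fsync(out) < 0 || rename(argv[2], argv[1]) < 0) {
        perror("finalize");
        return 1;
    }

    close(in);
    close(out);
    return 0;
}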

> All of those archived files have more fragments (post defrag) than
> they had when they were active. And here is the FIEMAP for the 96MB
> file which has 92 fragments.

How the heck did you end up with nearly 1 fragment per MB?

> If you want an optimization that's actually useful on Btrfs,
> /var/log/journal/ could be a nested subvolume. That would prevent any
> snapshots above from turning the nodatacow journals into datacow
> journals, which does significantly increase fragmentation (it would in
> the exact same case if it were a reflink copy on XFS for that matter).

Wouldn't that mean that when you take snapshots, they don't include the
logs?  That seems like an anti-feature that violates the principle of
least surprise.  If I make a snapshot of my root, I *expect* it to
contain my logs.

> I don't get the iops thing at all. What we care about in this case is
> latency. A least noticeable latency of around 150ms seems reasonable
> as a starting point, that's where users realize a delay between a key
> press and a character appearing. However, if I check for 10ms latency
> (using bcc-tools fileslower) when reading all of the above journals at
> once:
>
> $ sudo journalctl -D
> /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
>
> Not a single report. None. Nothing took even 10ms. And those journals
> are more fragmented than your 20 in a 100MB file.
>
> I don't have any hard drives to test this on. This is what, 10% of the
> market at this point? The best you can do there is the same as on SSD.

The above sounded like great data, but not if it was done on an SSD.  Of
course it doesn't cause latency on an SSD.  I don't know about market
trends, but I stopped trusting my data to SSDs a few years ago when my
ext4 fs kept being corrupted and it appeared that the FTL of the drive
was randomly swapping the contents of different sectors around; I
found things like the contents of a text file in a block of the inode
table or a directory.

> You can't depend on sysfs to conditionally do defragmentation on only
> rotational media, too many fragile media claim to be rotating.

It sounds like you are arguing that it is better to do the wrong thing
on all SSDs rather than do the right thing on ones that aren't broken.

> Looking at the two original commits, I think they were always in
> conflict with each other, happening within months of each other. They
> are independent ways of dealing with the same problem, where only one
> of them is needed. And the best of the two is fallocate+nodatacow
> which makes the journals behave the same as on ext4 where you also
> don't do defragmentation.

This makes sense.


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Lennart Poettering
On Do, 04.02.21 12:51, Chris Murphy (li...@colorremedies.com) wrote:

> On Thu, Feb 4, 2021 at 6:49 AM Lennart Poettering
>  wrote:
>
> > You want to optimize write patterns I understand, i.e. minimize
> > iops. Hence start with profiling iops, i.e. what defrag actually costs,
> > and then weigh that against the reduced access time when accessing the
> > files. In particular on rotating media.
>
> A nodatacow journal on Btrfs is no different than a journal on ext4 or
> xfs. So I don't understand why you think you *also* need to defragment
> the file, only on Btrfs. You cannot do better than you already are
> with a nodatacow file. That file isn't going to get any more fragmented
> in use than it was at creation.

You know, we issue the btrfs ioctl, under the assumption that if the
file is already perfectly defragmented it's a NOP. Are you suggesting
it isn't a NOP in that case?

> If you want to do better, maybe stop appending in 8MB increments?
> Every time you append it's another extent. Since apparently the
> journal files can max out at 128MB before they are rotated, why aren't
> they created 128MB from the very start? That would have a decent
> chance of getting you a file that's 1-4 extents, and it's not going to
> have more extents than that.

You know, there are certainly "perfect" ways to adjust our writing
scheme to match some specific file system on some specific storage
matching some specific user pattern. Thing is, though, what might be
ideal for some fs and some user might be terrible for another fs or
another user. We try to find some compromise in the middle that might
not result in "perfect" behaviour everywhere, but at least reasonable
behaviour.

> Presumably the currently active journal not being fragmented is more
> important than archived journals, because searches will happen on
> recent events more than old events. Right?

Nope. We always interleave stuff. We currently open all journal files
in parallel. The system one and the per-user ones, the current ones
and the archived ones.

> So if you're going to say
> fragmentation matters at all, maybe stop intentionally fragmenting the
> active journal?

We are not *intentionally* fragmenting. Please don't argue on that
level. Not helpful, man.

> Just fallocate the max size it's going to be right off
> the bat? Doesn't matter what file system it is. Once that 128MB
> journal is full, leave it alone, and rotate to a new 128M file. The
> append is what's making them fragmented.

I don't think that makes much sense: we rotate and start new files for
a multitude of reasons, such as size overrun, time jumps, abnormal
shutdown and so on. If we'd always leave a fully allocated file around
people would hate us...

The 8M increase is a middle ground: we don't allocate space for each
log message, and we don't allocate space for everything at once. We
allocate medium-sized chunks at a time.

Also, we vacuum old journals when allocating and the size constraints
are hit, i.e. if we detect that adding 8M to journal file X would mean
the space used by all journals together would be above the configured
disk usage limits, we'll delete the oldest journal files we can, until
we can allocate 8M again. And we do this each time. If we'd allocate
the full file all the time this means we'll likely remove ~256M of
logs whenever we start a new file. And that's just shitty behaviour.
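
The vacuum-then-grow policy described here can be sketched as below,
against a tiny in-memory model; the constants and helper names are
assumptions for illustration, not journald internals. Note how making
room for a single 8M growth step can mean dropping a whole ~128M
archived file.

/* vacuum_sketch.c - illustrative model of the described policy */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MIB        (1024ULL * 1024)
#define GROW_STEP  (8 * MIB)       /* journals grow in 8M steps */
#define MAX_USE    (500 * MIB)     /* assumed overall usage limit */

/* Oldest-first archived journal sizes, plus the active file. */
static uint64_t archived[] = { 127 * MIB, 126 * MIB, 128 * MIB };
static size_t n_archived = 3;
static uint64_t active = 120 * MIB;

static uint64_t total_usage(void)
{
    uint64_t sum = active;
    for (size_t i = 0; i < n_archived; i++)
        sum += archived[i];
    return sum;
}

static bool delete_oldest_archived(void)
{
    if (n_archived == 0)
        return false;
    for (size_t i = 1; i < n_archived; i++)   /* drop index 0, shift the rest */
        archived[i - 1] = archived[i];
    n_archived--;
    return true;
}

/* Vacuum until appending another 8M would stay within MAX_USE. */
static bool make_room_for_growth(void)
{
    while (total_usage() + GROW_STEP > MAX_USE)
        if (!delete_oldest_archived())
            return false;
    return true;
}

int main(void)
{
    if (make_room_for_growth()) {
        active += GROW_STEP;
        printf("grew active file by 8M; total usage now %llu MiB\n",
               (unsigned long long)(total_usage() / MIB));
    } else {
        printf("cannot grow without exceeding the limit\n");
    }
    return 0;
}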

> But it gets worse. The way systemd-journald is submitting the journals
> for defragmentation is making them more fragmented than just leaving
> them alone.

Sounds like a bug in btrfs? systemd is not the place to hack around
btrfs bugs?

> If you want an optimization that's actually useful on Btrfs,
> /var/log/journal/ could be a nested subvolume. That would prevent any
> snapshots above from turning the nodatacow journals into datacow
> journals, which does significantly increase fragmentation (it would in
> the exact same case if it were a reflink copy on XFS for that
> matter).

Not sure what the point of that would be... at least when systemd does
snapshots (i.e. systemd-nspawn --template= and so on) they are of
course recursive, so what'd be the point of doing a subvolume there?

> > Somehow I think you are missing what I am asking for: some data that
> > actually shows your optimization is worth it: i.e. that leaving the
> > files fragmented doesn't hurt access to the journal badly, and that the
> > number of iops is substantially lowered at the same time.
>
> I don't get the iops thing at all. What we care about in this case is
> latency. A least noticeable latency of around 150ms seems reasonable
> as a starting point, that's where users realize a delay between a key
> press and a character appearing. However, if I check for 10ms latency
> (using bcc-tools fileslower) when reading all of the above journals at
> once:
>
> $ sudo journalctl -D
> /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
>
> Not a single report. None. Nothing took even 10ms. And those journals
> are more fragmented than your 20 in a 100MB file.

Re: [systemd-devel] Still confused with socket activation

2021-02-05 Thread Benjamin Berg
On Thu, 2021-02-04 at 22:16 +0300, Andrei Borzenkov wrote:
> 03.02.2021 22:25, Benjamin Berg wrote:
> > Requires= actually has the difference that the unit must become
> > part of
> > the transaction (if it is not active already). So you get a hard
> > failure and appropriate logging if the unit cannot be added to the
> > transaction for some reason.
> > 
> 
> Oh, I said "documented" :) systemd documentation does not even define
> what "transaction" is. You really need to know low level implementation
> details to use it in this way.
> 
> But thank you, I missed this subtlety. Of course another reason could be
> stop behavior.

Oh, good point! I really had not been considering the implication on
stop behaviour. :)

Benjamin

> > > Care to show more complete example and explain why Wants does not
> > > work in this case?
> > 
> > Wants= would work fine. I think it boils down to whether you find
> > the
> > extra assertions useful. The Requires= documentation actually
> > suggests
> > using Wants= exactly to avoid this.
> > 
> > Benjamin
> > 
> 
> 


