Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-20 Thread Lennart Poettering
On Wed, 18.06.14 07:51, Tollef Fog Heen (tfh...@err.no) wrote:

 
 ]] Lennart Poettering 
 
 (trimming Ccs)
 
  I am invoking fallocate() in advance, because we write those files with
  mmap() and that of course would normally trigger SIGBUS already for the
  most boring of reasons, such as disk full/quota full or so. Hence,
  before we do anything like that, we invoke fallocate() to ensure that
  the space is actually available... As far as I can see, that's pretty much
  in line with what fallocate() is supposed to be useful for; the man page
  says this explicitly:
 
 OOI, why not use write + read-only mmap for reading?  In testing
 performance for another project, that's significantly faster than rw
 mmap.

Well, non-mmap stuff might indeed be quicker, but this isn't really
optimized to the last bit of speed but more to keep the code simple. And
the code is quite simple now, as we can use the same codepaths to locate
things for appending and for finding them...

Lennart

-- 
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-17 Thread Duncan
Goffredo Baroncelli posted on Mon, 16 Jun 2014 20:47:49 +0200 as
excerpted:

 Tests #6 and #7 suggested that the fsync(2) and posix_fallocate(3)
 calls aren't the root cause of the problem. Even without these the
 system.journal file still fragments.
 
 [1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html

Thanks for injecting some concrete facts in the form of test results into 
things. =:^)

So I was barking up the wrong tree with fallocate.

But I had the base problem correct -- fragmentation -- and good 
recommendations for working around it -- NOCOW and/or (auto)defrag.
(And for that matter, for those like me where /var/log is a dedicated 
partition, simply putting /var/log on something other than btrfs would 
nicely work around the issue too.)

The experts are working on it now, so I'll step out for the most part as 
what else I could add at this point would be simply noise.  But I did 
consider it important to acknowledge that I had been wrong about the 
fallocate.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-17 Thread Duncan
Lennart Poettering posted on Mon, 16 Jun 2014 00:39:39 +0200 as excerpted:

 Well, quite frankly I am not entirely sure why fallocate()

I was barking up the wrong tree with fallocate().  Sorry.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-17 Thread Duncan
Lennart Poettering posted on Mon, 16 Jun 2014 00:43:34 +0200 as excerpted:

  At least here, I interpreted that remark as primarily sarcastic
  commentary on the systemd devs' apparent attitude, which can be
  (controversially) summarized as: Systemd doesn't have problems
  because it's perfect.  Therefore, any problems you have with systemd
  must instead be with other components which systemd depends on.
 
 Come on, sorry, but this is fud. Really... ;-)
 
 Interestingly, I never commented on anything in this area, and neither
 did anybody else from the systemd side afaics. The entire btrfs defrag
 thing I wasn't aware of before this thread started on the systemd ML a
 few days ago. I am not sure where you take your ideas about our
 attitude from. God, with behaviour like that you just make us ignore
 you, Duncan.

Sorry.

As you'll note, I said "can be (controversially) summarized"...

I never stated that *I* held that position personally, as I was taking 
the third-person observer position.  As such, I've seen that attitude 
expressed by others in multiple threads when systemd comes up (as Josef 
alluded to as well), and I simply interpreted the remark to which I 
was alluding ("And that's now a btrfs problem :/") as a sarcastic 
reference to that attitude... which again I never claimed as my own.

Thus the "whoosh", in reference to what I interpreted as sarcasm, which 
certainly doesn't require full agreement with in order to understand.  
(Tho as I mentioned elsewhere, I can certainly see where they're coming 
from given the recent kernel debug kerfuffle and the like... and I'm 
certainly not alone there... but I'm /trying/ to steer a reasonably 
neutral path while appreciating both sides.)

In fact, I specifically stated elsewhere that I recently switched 
to systemd -- by choice, as I'm on Gentoo which still defaults to openrc 
-- myself.  Certainly I would not have done so if I believed systemd was 
as bad as all that, and the fact that I HAVE done so definitely implies a 
rather large amount of both trust and respect in the systemd devs, or I'd 
not be willing to run their code.

But nevertheless I can see the viewpoint from both sides now, and do 
try to maintain a reasonable neutrality.  I guess I should have made that 
more explicit in the original post, but as they say, hindsight is 20/20. 
=:^\

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-17 Thread Martin
On 17/06/14 09:17, Duncan wrote:
[...]
 
 But nevertheless I can see the viewpoint from both sides now, and do 
 try to maintain a reasonable neutrality.  I guess I should have made that 
 more explicit in the original post, but as they say, hindsight is 20/20. 
 =:^\

Hey! All good for a giggle and excellent for friendly attention to get
things properly fixed :-)

And this thread may even hit the news for Phoronix and LWN and others,
all in true FLOSS open development and thorough debug :-P

And... And... We get a few bits of systemd, btrfs, and filesystem
interactions fixed for how things /should/ work best.


Thanks (and for others also) for stirring a rapid chase! ;-)

Regards,
Martin

(A fellow Gentoo-er ;-) )

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-17 Thread Goffredo Baroncelli
Hi Kai,


I investigated a bit why readahead doesn't defrag the journal.
I also put the mailing list in CC for the record.

On 06/17/2014 03:33 AM, Kai Krakow wrote:
[...]
 Instead, for me, the readahead collector catches access to my system
 journal and thus defragments it. That's not the case for your system
 which explains why it won't be defragmented for you when enabling
 readahead.

The readahead program skips files larger than 10M. If you look at its man page, 
it documents the switch --file-size-max. The default value is 
READAHEAD_FILE_SIZE_MAX, which is equal to 10M.
So readahead defrags system.journal only when it is smaller than this value, 
and that happens only a few times.

This behavior seems reasonable to me, to avoid blocking the system when a big 
file is touched during boot.
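
Just to illustrate, the check boils down to something like this (a sketch of 
the behaviour described above, not the actual readahead-collect code; the 
constant name is the one documented in the man page):

#include <stdbool.h>
#include <sys/stat.h>

#define FILE_SIZE_MAX (10ULL * 1024 * 1024)  /* stand-in for READAHEAD_FILE_SIZE_MAX (10M) */

/* Sketch: skip files above the threshold, so a huge file touched during
 * boot does not get packed/defragmented by readahead. */
static bool readahead_should_consider(const struct stat *st)
{
        return (unsigned long long) st->st_size <= FILE_SIZE_MAX;
}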

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-17 Thread Chris Murphy

On Jun 16, 2014, at 7:13 PM, cwillu cwi...@cwillu.com wrote:

 It's not a mmap problem, it's a small writes with an msync or fsync
 after each one problem.
 
 For the case of sequential writes (via write or mmap), padding writes
 to page boundaries would help, if the wasted space isn't an issue.
 Another approach, again assuming all other writes are appends, would
 be to periodically (but frequently enough that the pages are still in
 cache) read a chunk of the file and write it back in-place, with or
 without an fsync. On the other hand, if you can afford to lose some
 logs on a crash, not fsyncing/msyncing after each write will also
 eliminate the fragmentation.
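
For illustration, a rough sketch of that read-back-and-rewrite-in-place idea
quoted above (purely hypothetical, not anything journald currently does; chunk
size and error handling are arbitrary):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Re-read the recently appended tail of the file and write it back to the
 * same offsets, so a COW filesystem gets a chance to lay it out again as
 * one contiguous extent while the data is still in the page cache. */
static int rewrite_tail(int fd, uint64_t file_size, uint64_t tail_len)
{
        char buf[64 * 1024];
        uint64_t off = file_size > tail_len ? file_size - tail_len : 0;

        while (off < file_size) {
                size_t n = file_size - off < sizeof(buf) ? (size_t) (file_size - off) : sizeof(buf);
                ssize_t r = pread(fd, buf, n, (off_t) off);
                if (r <= 0)
                        return -1;
                if (pwrite(fd, buf, (size_t) r, (off_t) off) != r)
                        return -1;
                off += (uint64_t) r;
        }
        return fsync(fd);   /* optional, as the quoted mail notes */
}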

Normally I'd be willing to give up ~30 seconds of journal to not have 
fragmented journals. But then if I use systemd.log_level=debug I'd like that to 
trigger more frequent flushing to make sure as little of the journaling is lost 
as possible. Does that make sense?


Chris Murphy
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-17 Thread Kai Krakow
Goffredo Baroncelli kreij...@libero.it schrieb:

 I investigated a bit why readahead doesn't defrag the journal.
 I also put the mailing list in CC for the record.
 
 On 06/17/2014 03:33 AM, Kai Krakow wrote:
 [...]
 Instead, for me, the readahead collector catches access to my system
 journal and thus defragments it. That's not the case for your system
 which explains why it won't be defragmented for you when enabling
 readahead.
 
 The readahead program skips files larger than 10M. If you look at its man
 page, it documents the switch --file-size-max. The default value is
 READAHEAD_FILE_SIZE_MAX, which is equal to 10M. So readahead defrags
 system.journal only when it is smaller than this value, and that happens
 only a few times.
 
 This behavior seems reasonable to me, to avoid blocking the system when a big
 file is touched during boot.

Ah okay, good catch... Well, systemd does not generate massive amounts of 
logs for me. Maybe you want to play around with SystemMaxUse in 
journald.conf to let the file size stay below 10M?
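
Something along these lines, for example (just a sketch, values picked
arbitrarily; note SystemMaxUse caps the total disk usage of the journal, while
SystemMaxFileSize caps the size of the individual journal files, which may be
the more direct knob here):

# /etc/systemd/journald.conf (sketch)
[Journal]
SystemMaxUse=100M
SystemMaxFileSize=8M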

Actually, for servers using syslog, I set the log file rotation to kick in 
after 10M worth of logging (or after 7 days, whichever comes first) so our 
rsync-diffed backups stay small for log files; otherwise it would start 
creating huge diffs just for backups when I do the rsync --backup way of 
doing backups.

So that may actually be a not too bad option. I'm not using it with my home 
system because I do snapshot-based backups to btrfs and diffs are just much 
smaller (using rsync --no-whole-file --inplace).

We recently switched to zfs based backup storage for our servers, and in 
that process switched to using rsync --no-whole-file --inplace, too. So we 
won't need the small log rotation scheme any longer but I feel comfortable 
with it so I stick to this.

BTW: I checked our systemd-enabled servers and fragmentation of journal 
files is very very high there, too. Way above 5000 extents. The journals are 
on XFS, so it suffers from that problem, too (which has already been pointed 
out somewhere in this thread multiple times). But to my surprise, this does 
not affect the performance of journalctl there. The bad performance is only 
for fragmentation on btrfs, not xfs. IMHO, this pushes the pointer a little 
bit more towards systemd, tho I couldn't suggest how to fix it. Time for the 
filesystem hackers to kick in.

BTW: The servers are virtualized, so by default systemd-readahead does 
nothing there because of ConditionVirtualization in the service files. OTOH, 
it won't defragment there anyway because xfs is not supported for this in 
systemd, although it would probably be possible because there is xfs_fsr. 
But my feeling is that xfs_fsr is more or less just locking the file, 
copying it, then replacing the original. It cannot work on files in use. This 
is different for btrfs, which only chokes on defragmenting files currently 
mmap'ed (e.g., binaries being executed, read: busy text files). Following 
that thought, I wonder why it seems to defragment system.journal for me, 
because after all I learned here, systemd writes to the files via mmap().

-- 
Replies to list only preferred.

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-17 Thread Kai Krakow
Chris Murphy li...@colorremedies.com schrieb:

 It's not a mmap problem, it's a small writes with an msync or fsync
 after each one problem.
 
 For the case of sequential writes (via write or mmap), padding writes
 to page boundaries would help, if the wasted space isn't an issue.
 Another approach, again assuming all other writes are appends, would
 be to periodically (but frequently enough that the pages are still in
 cache) read a chunk of the file and write it back in-place, with or
 without an fsync. On the other hand, if you can afford to lose some
 logs on a crash, not fsyncing/msyncing after each write will also
 eliminate the fragmentation.
 
 Normally I'd be willing to give up ~30 seconds of journal to not have
 fragmented journals. But then if I use systemd.log_level=debug I'd like
 that to trigger more frequent flushing to make sure as little of the
 journaling is lost as possible. Does that make sense?

AFAIR systemd-journald already does explicit flushing when more important 
log entries hit the log. Well, of course debug is one of the least 
important log levels in that chain. But using the same code paths, it should 
be trivial to implement a configurable trigger which also forces explicit 
flushing...

-- 
Replies to list only preferred.

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-17 Thread Kai Krakow
Filipe Brandenburger filbran...@google.com schrieb:

 On Mon, Jun 16, 2014 at 6:13 PM, cwillu cwi...@cwillu.com wrote:
 For the case of sequential writes (via write or mmap), padding writes
 to page boundaries would help, if the wasted space isn't an issue.
 Another approach, again assuming all other writes are appends, would
 be to periodically (but frequently enough that the pages are still in
 cache) read a chunk of the file and write it back in-place, with or
 without an fsync. On the other hand, if you can afford to lose some
 logs on a crash, not fsyncing/msyncing after each write will also
 eliminate the fragmentation.
 
 I was wondering if something could be done in btrfs to improve
 performance under this workload... Something like a defrag on demand
 for a case where mostly appends are happening.
 
 When there are small appends with fsync/msync, they become new
 fragments (as expected), but once the writes go past a block boundary,
 btrfs could defragment the previous block in background, since it's
 not really expected to change again.
 
 That could potentially achieve performance close to chattr +C without
 the drawbacks of disabling copy-on-write.

I thought about something like that, too. I'm pretty sure it really doesn't 
matter if your 500G image file is split across thousands of extents - as long 
as at least chunks of extents are kept together and rebuilt as one extent. That 
means, instead of letting autodefrag work on the whole file, just let it 
operate on a chunk of it within some sane boundaries - maybe 8MB chunks - 
of course without splitting existing extents if those already cross a chunk 
boundary. That way, it would still reduce head movements a lot while 
maintaining good performance during defragmentation. Your idea would be the 
missing companion to that (it is some sort of slow-growing-file detection).
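
As a rough sketch of the range-defrag part of that idea (this just drives the
existing BTRFS_IOC_DEFRAG_RANGE ioctl from userspace; whether autodefrag could
be taught to do something like this internally is a different question):

#include <linux/btrfs.h>
#include <string.h>
#include <sys/ioctl.h>

/* Defragment only one chunk of a file, e.g. the most recently filled 8MB,
 * instead of rewriting the whole thing. */
static int defrag_range(int fd, __u64 start, __u64 len)
{
        struct btrfs_ioctl_defrag_range_args args;

        memset(&args, 0, sizeof(args));
        args.start = start;   /* e.g. the previous 8MB boundary */
        args.len = len;       /* e.g. 8 * 1024 * 1024 */

        return ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &args);
}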

If I remember correctly, MacOSX implements a similar adaptive defragmentation 
strategy for its HFS+ filesystem, tho the underlying semantics are probably 
quite different. And it acts upon opening the file instead of upon writing to 
the file, so it is probably limited to smallish files only (which I don't 
think makes so much sense on its own; for small files, locality to 
semantically similar files is much more important, i.e. files needed during 
boot, files needed for starting a specific application).

If, after those strategies, it is still important to get your file's chunks 
cleanly laid out one after another, one could still run a manual defrag 
which does the whole job.

BTW: Is it possible to physically relocate files in btrfs? I think that is 
more important than defragmentation. Is such a thing possible with the 
defrag IOCTL? My current understanding is that defragmenting a file just 
rewrites it somewhere random as a contiguous block - which is not always the 
best option, because it can hurt boot performance a lot and thus reverses the 
effect one is trying to achieve. It feels a bit like playing the 
lottery when I defragment my boot files only to learn that the boot process 
is now slower instead of faster. :-\

-- 
Replies to list only preferred.

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-17 Thread Lennart Poettering
On Mon, 16.06.14 09:05, Josef Bacik (jba...@fb.com) wrote:

 So you are doing all the right things from what I can tell, I'm just
 a little confused about when you guys run fsync.  From what I can
 tell it's only when you open the journal file and when you switch it
 to offline.  I didn't look too much past this point so I don't
 know how often these things happen.  Are you taking an individual
 message, writing it, updating the head of the file and then
 fsync'ing?  Or are you getting a good bit of dirty log data and
 fsyncing occasionally?

The latter. Basically when opening a file for writing we mark it in the
header as online, then fsync() it. When we close a file we fsync() it,
then change the header to offline, and sync() again. Also, 5min after
each write we will also put things to offline, until the next write,
when we will put things to online again. Finally, if something is logged
at priorities EMERG, ALERT or CRIT we will sync immediately (which
actually should never happen in real life, unless something is really
broken -- a simple way to check if anything like this got written is
"journalctl -p crit").

Also, we rotate and start a new file every now and then, when we hit a
size limit, but that happens very seldom.

Putting this together: we should normally fsync() only very
infrequently. 
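
In very rough pseudo-C the pattern described above looks something like this
(a sketch only, with made-up names, not the actual journal-file.c code):

#include <stdint.h>
#include <unistd.h>

/* "state" stands for the online/offline flag kept in the mmap'ed file header. */
static void set_online(int fd, uint32_t *state)
{
        *state = 1;    /* mark the file online in the header */
        fsync(fd);     /* make that state change durable before appending */
}

static void set_offline(int fd, uint32_t *state)
{
        fsync(fd);     /* flush everything appended so far */
        *state = 0;    /* then flip the header to offline */
        fsync(fd);     /* and sync once more */
}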

 What would cause btrfs problems is if you fallocate(), write a small
 chunk, fsync, write a small chunk again, fsync again etc.  Fallocate
 saves you the first write around, but if the next write is within
 the same block as the previous write we'll end up triggering cow and
 enter fragmented territory.  If this is what is what journald is
 doing then that would be good to know, if not I'd like to know what
 is happening since we shouldn't be fragmenting this badly.

Hmm, the only way I see that that would happen is if a lot of stuff is
logged at these super-high log levels mentioned above. But then again,
that never really should happen in real-life.

Could anyone who's experiencing the slowdowns have a look at the
journalctl output mentioned above? Do you have more than a few lines
printed like that?

Lennart

-- 
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-17 Thread Kai Krakow
Lennart Poettering lenn...@poettering.net schrieb:

 [...] Finally, if something is logged
 at priorities EMERG, ALERT or CRIT we will sync immediately (which
 actually should never happen in real-life, unless something is really
 broken -- a simple way to check if anything like this got written is
 journalctl -p crit).
[...]
 Hmm, the only way I see that that would happen is if a lot of stuff is
 logged at these super-high log levels mentioned above. But then again,
 that never really should happen in real-life.

Unless you use those HP utilities:

# journalctl -p crit
-- Logs begin at Mo 2014-05-05 02:26:26 CEST, end at Di 2014-06-17 23:33:48 
CEST. --
Mai 20 10:17:21 jupiter hp-systray[1366]: hp-systray[1366]: error: option -s 
not recognized
-- Reboot --
Mai 22 22:37:15 jupiter hp-systray[1413]: hp-systray[1413]: error: option -s 
not recognized
-- Reboot --
Jun 03 08:49:17 jupiter hp-systray[1501]: hp-systray[1501]: error: option -s 
not recognized
-- Reboot --
Jun 09 13:45:31 jupiter hp-systray[1397]: hp-systray[1397]: error: option -s 
not recognized
-- Reboot --
Jun 11 00:21:16 jupiter hp-systray[1405]: hp-systray[1405]: error: option -s 
not recognized
-- Reboot --
Jun 14 00:30:49 jupiter hp-systray[1434]: hp-systray[1434]: error: option -s 
not recognized
-- Reboot --
Jun 14 22:54:43 jupiter hp-systray[1416]: hp-systray[1416]: error: option -s 
not recognized
-- Reboot --
Jun 14 23:11:11 jupiter hp-systray[1437]: hp-systray[1437]: error: option -s 
not recognized
-- Reboot --
Jun 15 14:53:44 jupiter hp-systray[1447]: hp-systray[1447]: error: option -s 
not recognized
-- Reboot --
Jun 15 16:53:39 jupiter hp-systray[1404]: hp-systray[1404]: error: option -s 
not recognized
-- Reboot --
Jun 15 17:38:40 jupiter hp-systray[1397]: hp-systray[1397]: error: option -s 
not recognized
-- Reboot --
Jun 15 18:34:37 jupiter hp-systray[1375]: hp-systray[1375]: error: option -s 
not recognized
-- Reboot --
Jun 17 20:38:32 jupiter hp-systray[1406]: hp-systray[1406]: error: option -s 
not recognized

 Could anyone who's experiencing the slowdowns have a look at the
 journalctl output mentioned above? Do you have more than a few lines
 printed like that?

Nasty message but no slowdowns. But it's printed just during desktop session 
start so it's no problem. Still, I wonder what this utility tries to do 
(seems to put itself into autostart with an -s switch but without 
supporting it, probably a packaging and/or KDE issue). At least HP thinks: 
Oh boy - that's really critical! Whoohoo. Why not let the system explode? 
:~)

-- 
Replies to list only preferred.

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-17 Thread Duncan
Kai Krakow posted on Tue, 17 Jun 2014 23:02:14 +0200 as excerpted:

  I'm pretty sure it really
 doesn't matter if your 500G image file is split across thousands of extents -
 as long as at least chunks of extents are kept together and rebuilt as
 one extent.

It's worth noting that in btrfs terms, chunk has a specific meaning.  
Btrfs allocates chunks from free space, 1 GiB at a time for data 
chunks, 256 MiB at a time for metadata.[1]  And that btrfs-specific 
meaning does tend to distort your example case, somewhat, tho I expect 
most folks understand what you mean.  The problem is that we're running 
out of generic terms that mean chunk-like and are thus experiencing 
namespace collision!

Anyway, btrfs chunks, data chunks being of interest here, get allocated 
on demand.  The practical effect is thus that the maximum btrfs extent 
size is 1 GiB.  As such, the least-extents-possible ideal for that 500 
GiB image file would be 500 extents.

 That means, instead of letting autodefrag work on the whole
 file just let it operate on a chunk of it within some sane boundaries -
 maybe 8MB chunks, - of course without splitting existing extents if
 those already cross a chunk boundary.

As stated, chunk has a special meaning in btrfs.  Data chunks are 1 GiB 
in size and no extent can cross a btrfs chunk boundary.

 BTW: Is it possible to physically relocate files in btrfs?

Currently, ENOTIMPLEMENTED.  AFAIK it's possible and is on the list, but 
so are a number of other nice-to-haves, so it might be awhile.  
Actually just checked the wiki.  The closest specific feature point 
listed says hot-data-tracking and moving to faster devices, noting that 
there's actually some current work on the generic (not-btrfs-specific) 
feature, to be made available via VFS.  Your specific use-case will 
probably be part of that general implementation.

---
[1] Btrfs chunk allocation: 1 GiB data chunks size, 256 MiB metadata 
chunk size by default, but there are various exceptions.  The first of 
course is when space gets tight.  The last allocation will be whatever is 
left.  Second, there's mixed-data/metadata mode, the default when the 
filesystem is <= 1 GiB.  Mixed chunks are (like metadata) 256 MiB but 
contain data and metadata mixed, sacrificing speed efficiency for more 
efficient space management.  Third, it's worth noting that the single-
device-btrfs default is dup-mode-metadata (with the same default for 
mixed-mode), so while they're 256 MiB each, two get allocated at once, 
thus allocating metadata half a GiB at a time.  Multi-device-btrfs 
metadata defaults to raid1-mode, still allocated in chunk pairs but one 
on each of two separate devices.  Data mode always defaults to single, 
tho on multi-device-btrfs both data and metadata can be (individually) 
set to various raid modes, each with its own specific allocation pattern.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-17 Thread Tollef Fog Heen
]] Lennart Poettering 

(trimming Ccs)

 I am invoking fallocate() in advance, because we write those files with
 mmap() and that of course would normally trigger SIGBUS already for the
 most boring of reasons, such as disk full/quota full or so. Hence,
 before we do anything like that, we invoke fallocate() to ensure that
 the space is actually available... As far as I can see, that's pretty much
 in line with what fallocate() is supposed to be useful for; the man page
 says this explicitly:

OOI, why not use write + read-only mmap for reading?  In testing
performance for another project, that's significantly faster than rw
mmap.

-- 
Tollef Fog Heen
UNIX is user friendly, it's just picky about who its friends are
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-16 Thread Lennart Poettering
On Mon, 16.06.14 10:17, Russell Coker (russ...@coker.com.au) wrote:

  I am not really following though why this trips up btrfs though. I am
  not sure I understand why this breaks btrfs COW behaviour. I mean,
  fallocate() isn't necessarily supposed to write anything really, it's
  mostly about allocating disk space in advance. I would claim that
  journald's usage of it is very much within the entire reason why it
  exists...
 
 I don't believe that fallocate() makes any difference to fragmentation on 
 BTRFS.  Blocks will be allocated when writes occur so regardless of an 
 fallocate() call the usage pattern in systemd-journald will cause 
 fragmentation.

journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
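
A toy model of that pattern, for illustration only (made-up field names, not
the actual journald code, which writes through the mapping rather than with
pwrite()):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical header living at offset 0 of the file, accessed via mmap(). */
struct toy_header {
        uint64_t tail_offset;   /* where the next entry goes */
        uint64_t n_entries;
};

static int toy_append(int fd, struct toy_header *h, const void *entry, size_t len)
{
        /* append the new entry at the end... */
        if (pwrite(fd, entry, len, (off_t) h->tail_offset) != (ssize_t) len)
                return -1;
        if (fsync(fd) < 0)      /* ...make sure it is written... */
                return -1;
        /* ...then update the offsets stored at the beginning of the file */
        h->tail_offset += len;
        h->n_entries++;
        return 0;
}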

Lennart

-- 
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-16 Thread Andrey Borzenkov
On Mon, Jun 16, 2014 at 2:14 PM, Lennart Poettering
lenn...@poettering.net wrote:
 On Mon, 16.06.14 10:17, Russell Coker (russ...@coker.com.au) wrote:

  I am not really following though why this trips up btrfs though. I am
  not sure I understand why this breaks btrfs COW behaviour. I mean,
  fallocate() isn't necessarily supposed to write anything really, it's
  mostly about allocating disk space in advance. I would claim that
  journald's usage of it is very much within the entire reason why it
  exists...

 I don't believe that fallocate() makes any difference to fragmentation on
 BTRFS.  Blocks will be allocated when writes occur so regardless of an
 fallocate() call the usage pattern in systemd-journald will cause
 fragmentation.

 journald's write pattern looks something like this: append something to
 the end, make sure it is written, then update a few offsets stored at
 the beginning of the file to point to the newly appended data. This is
 of course not easy to handle for COW file systems. But then again, it's
 probably not too different from access patterns of other database or
 database-like engines...


... which traditionally experienced severe sequential read performance
degradation in this case. As I understand it, this is exactly what happens
- readahead attempts to preload files, which gives us heavy random read
access.

The only real remedy was to defragment the files. It should work
relatively well for the journal, where files are mostly write-once, at the
expense of additional read/write activity.
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-16 Thread Goffredo Baroncelli
Hi Lennart,

On 06/16/2014 12:13 AM, Lennart Poettering wrote:
 I am not really following though why this trips up btrfs though. I am
 not sure I understand why this breaks btrfs COW behaviour. I mean,
 fallocate() isn't necessarily supposed to write anything really, it's
 mostly about allocating disk space in advance. I would claim that
 journald's usage of it is very much within the entire reason why it
 exists...

I performed several tests, trying different setups [1]. One of these was 
replacing the posix_fallocate() with a truncate, to check where the problem is. 
The conclusion was that *posix_fallocate() is NOT the problem*.

In another reply you stated that systemd-journald appends some data at the end 
of the file, then updates some data in the middle. I think this is the reason 
why the file becomes quickly fragmented.


[1] Let me revise the English, then I will post the results.


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-16 Thread Goffredo Baroncelli
Hi all,

in this blog [1] I collected all the results of the tests which I performed in 
order to investigate this performance problem between systemd and btrfs a bit. 
I had to put these results in a blog because there are several images. Below is 
a brief summary.

I took an old machine (a P4 2.5GHz with 512MB of RAM) with a fresh 
installation of Fedora 20, and I measured the boot time over several reboots 
(up to 70).  I tested the following scenarios:

1) standard (without defragmenting any file, without readahead)
2) defragment the journal file at the end of the boot
3) defragment the journal file before the flushing
4) mark as NOCOW the journald log file
5) enable the systemd-readahead
6) remove the fsync(2) call from journald
7) remove the posix_fallocate(3) call from journald
8) do a defrag when posix_fallocate(3) is called
9) do a defrag when the journal log file is opened

Test #1 highlights the problem. It shows that the boot time may require up 
to 50 seconds. Over the reboots the number of extents of the file 
system.journal increases up to 6000. Defragmenting the system.journal file 
decreases the boot time by ~20 seconds. My conclusion is that on BTRFS the 
fragmentation of this file increases the boot time.

Tests #6 and #7 suggested that the fsync(2) and posix_fallocate(3) calls 
aren't the root cause of the problem. Even without these the system.journal 
file still fragments.

Test #4 suggested that marking the system.journal file NOCOW reduces its 
fragmentation and thus the boot time.

Tests #2, #3 and #9 suggested that performing a periodic defrag significantly 
reduces the fragmentation of system.journal and thus the boot time.

Test #5 revealed that the readahead capability of systemd was not effective 
because it seems that the system.journal file was unaffected (but other 
*.journal files were). Further investigation is required.

BR
G.Baroncelli



[1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html

On 06/16/2014 06:32 PM, Goffredo Baroncelli wrote:
 Hi Lennart,
 
 On 06/16/2014 12:13 AM, Lennart Poettering wrote:
 I am not really following though why this trips up btrfs though. I am
 not sure I understand why this breaks btrfs COW behaviour. I mean,
 fallocate() isn't necessarily supposed to write anything really, it's
 mostly about allocating disk space in advance. I would claim that
 journald's usage of it is very much within the entire reason why it
 exists...
 
 I performed several tests, trying different setups [1]. One of these was 
 replacing the posix_fallocate() with a truncate, to check where the 
 problem is. The conclusion was that *posix_fallocate() is NOT the problem*.
 
 In another reply you stated that systemd-journald appends some data at the 
 end of the file, then updates some data in the middle. I think this is the 
 reason why the file becomes quickly fragmented.
 
 
 [1] Let me revise the English, then I will post the results.
 
 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-16 Thread Martin
On 16/06/14 17:05, Josef Bacik wrote:
 
 On 06/16/2014 03:14 AM, Lennart Poettering wrote:
 On Mon, 16.06.14 10:17, Russell Coker (russ...@coker.com.au) wrote:

 I am not really following though why this trips up btrfs though. I am
 not sure I understand why this breaks btrfs COW behaviour. I mean,

 I don't believe that fallocate() makes any difference to
 fragmentation on
 BTRFS.  Blocks will be allocated when writes occur so regardless of an
 fallocate() call the usage pattern in systemd-journald will cause
 fragmentation.

 journald's write pattern looks something like this: append something to
 the end, make sure it is written, then update a few offsets stored at
 the beginning of the file to point to the newly appended data. This is
 of course not easy to handle for COW file systems. But then again, it's
 probably not too different from access patterns of other database or
 database-like engines...

Even though this appears to be a problem case for btrfs/COW, is there a
more favourable write/access sequence that is easily implemented and
works well for both ext4-like fs /and/ COW fs?

Database-like writing is known to be 'difficult' for filesystems: can a data
log be a simpler case?


 Was waiting for you to show up before I said anything since most systemd
 related emails always devolve into how evil you are rather than what is
 actually happening.

Ouch! Hope you two know each other!! :-P :-)


[...]
 since we shouldn't be fragmenting this badly.
 
 Like I said what you guys are doing is fine, if btrfs falls on its face
 then it's not your fault.  I'd just like an exact idea of when you guys
 are fsync'ing so I can replicate in a smaller way.  Thanks,

Good if COW can be so resilient. I have about 2GBytes of data logging
files and I must defrag those as part of my backups to stop the system
fragmenting to a standstill (I use cp -a to copy the files to a new area,
which defragments them, and restart the data logger software on that).


Random thoughts:

Would using a second small file just for the mmap-ed pointers help avoid
repeated rewriting of random offsets in the log file causing excessive
fragmentation?

Align the data writes to 16kByte or 64kByte boundaries/chunks?

Are mmap-ed files a similar problem to using a swap file and so should
the same btrfs file swap code be used for both?


Not looked over the code so all random guesses...

Regards,
Martin




___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-16 Thread Martin
On 16/06/14 19:47, Goffredo Baroncelli wrote:
 Hi all,
 
 in this blog [1] I collected all the results of the tests which I
 performed in order to investigate a bit this performance problem
 between systemd and btrfs. I had to put these results in a blog,
 because there are several images. Below a brief summary.

 Test #1 highlights the problem. It shows that the boot time may
 require up to 50 seconds. Over the reboots the number of extents of
 the file system.journal increases up to 6000. Defragmenting the
 system.journal file decreases the boot time by ~20 seconds. My
 conclusion is that on BTRFS the fragmentation of this file increases
 the boot time.
 
 Tests #6 and #7 suggested that the fsync(2) and posix_fallocate(3)
 calls aren't the root cause of the problem. Even without these the
 system.journal file still fragments.


 [1]
 http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html

Very good demonstration and summary, thanks.

The charts very clearly show the correspondence between time-to-boot and
the level of fragmentation.

... And I thought Linux was getting ever faster to boot!


Regards,
Martin



___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-16 Thread Kai Krakow
Andrey Borzenkov arvidj...@gmail.com schrieb:

 journald's write pattern looks something like this: append something to
 the end, make sure it is written, then update a few offsets stored at
 the beginning of the file to point to the newly appended data. This is
 of course not easy to handle for COW file systems. But then again, it's
 probably not too different from access patterns of other database or
 database-like engines...

 
 ... which traditionally experienced severe sequential read performance
 degradation in this case. As I understand this is exactly what happens
 - readahead attempts to preload files which gives us heavy random read
 access.
 
 The only real remedy was to defragment files. It should work
 relatively well for journal where files are mostly write once at the
 expense of additional read/write activity.

This is an interesting point because it means: Readahead may hurt systemd 
boot performance a lot if it is hitting a heavily fragmented file like those 
affected journal files.

So, readahead should probably ignore these files? It may look obvious but I 
don't think so...

On my system, readahead will process system.journal and thus defragment it, 
so it won't fragment too much and by that won't slow the boot down. This may 
also explain why in Goffredo's tests the readahead process never got into 
recording the system.journal - because the boot process was finished 
(readahead-done) before any process started reading system.journal - but 
still the high fragmentation of the other journals causes very high IO and 
the boot process becomes slower and slower.

The root fix should be to make journald not cause high fragmentation. But 
I think Andrey's point should be given some thought, too.

-- 
Replies to list only preferred.

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Goffredo Baroncelli
On 06/14/2014 11:37 PM, Kai Krakow wrote:
 Goffredo Baroncelli kreij...@libero.it schrieb:
 
 On 06/14/2014 10:13 PM, Kai Krakow wrote:
 Take care to enable all needed services to enable defrag... If your
 services make use of journal file loading these files should also become
 part of the process. You can check with
 /usr/lib/systemd/systemd-readahead analyze. The state is stored in
 /.readahead.

 I have enabled all the services (collect, replay, done), but I was unable
 to see any gain.
 
 I don't think you can actually enable the done service because it is 
 static... But if it placed a symlink in /etc/systemd that is probably wrong 
 and you should undo it. It is automatically activated a few seconds after 
 booting is done to store the collected data for replaying on next reboot.

I checked the output of systemd-analyze: systemd-readahead-done is called at 
the end of the chart. The sym-links exist only for the collect and replay 
services, which were called at the beginning.

 
 Check with find /etc/systemd -name '*readahead*' -type l...
 
 I don't know why but system.journal is not considered by readahead:

 # /usr/lib/systemd/systemd-readahead analyze | grep journal


100% ( 1)  770: /etc/systemd/journald.conf
 50% ( 1)  4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal
 50% ( 1)  4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal
 
 I tried that on my system and it looks vice-versa somehow:
 
 $ sudo /usr/lib/systemd/systemd-readahead analyze|fgrep journal
100% ( 1)  812: /etc/systemd/journald.conf
100% ( 1)  8388608: 
 /var/log/journal/121b87ca633e8ac001665668001b/system.journal
 
 Strange... ;-)

Maybe this is a reason for my degraded performance. However, I tried on two 
machines (both Fedora) and after several reboots; I also tried removing the 
/.readahead file...
Nothing.


On another fedora machine, I found

$ sudo /lib/systemd/systemd-readahead analyze | grep journal
   100% ( 1)   220384: /usr/lib/systemd/systemd-journald
   100% ( 1)      770: /etc/systemd/journald.conf
   100% ( 1)    24544: /usr/lib64/rsyslog/imjournal.so
   100% ( 1)   122880: /usr/lib64/libsystemd-journal.so.0.11.3
    50% ( 1)  4194304: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-1000@0004f1b1019bc549-09d9ae048ff8eeef.journal~
   100% ( 1)  8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-1000@0004f1b7dcfd728c-38950862a134194d.journal~
    50% ( 1)  4194304: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-995@0004f1c3fa738010-b0829bd837e30f61.journal~
    50% ( 1)  4194304: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-995.journal
   100% ( 1)  8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/system@0004f6ed6604ebc7-bb2f3b30ac6eb32c.journal~
   100% ( 1)  8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-1000@0004f6ed66a05b08-9d244b02cc294baf.journal~
    77% ( 1)  6508544: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/system@0004f6ed6d02e69f-d9c815aa06264d7a.journal~
   100% ( 1)  8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-1000@0004f6ed6db6dd7d-cc84da57d5b01680.journal~
   100% ( 1)  8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-1000.journal
   100% ( 1)      121: /var/lib/rsyslog/imjournal.state


So some journal files are present, but not the current one (system.journal)



 To get back to your performance problem: Did you try systemctl enable 
 bootchart and then (after a reboot) look at the png generated in /run/log 
 (at least in my system it is placed there, look at the configs if in doubt). 
 It can give some interesting clues about why a service starts late or takes 
 long to execute.
Unfortunately bootchart doesn't work on fedora, because CONFIG_SCHEDSTATS is 
not enabled in the kernel.

 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Kai Krakow
Goffredo Baroncelli kreij...@libero.it schrieb:

 I have enabled all the services (collect, replay, done), but I was
 unable to see any gain.
 
 I don't think you can actually enable the done service because it is
 static... But if it placed a symlink in /etc/systemd that is probably
 wrong and you should undo it. It is automatically activated a few seconds
 after booting is done to store the collected data for replaying on next
 reboot.
 
 I checked the output of systemd-analyze: systemd-readahead-done is called
 at the end of the chart. The sym-links exist only for the collect and
 replay services, which were called at the beginning.

Then all is good.

 I don't know why but system.journal is not considered by readahead:

 # /usr/lib/systemd/systemd-readahead analyze | grep journal


100% ( 1)  770: /etc/systemd/journald.conf
 50% ( 1)  4194304:
 /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal
 50% ( 1)  4194304:
 /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal
 
 I tried that on my system and it looks vice-versa somehow:
 
 $ sudo /usr/lib/systemd/systemd-readahead analyze|fgrep journal
100% ( 1)  812: /etc/systemd/journald.conf
100% ( 1)  8388608:
 /var/log/journal/121b87ca633e8ac001665668001b/system.journal
 
 Strange... ;-)
 
 May be this is a reason of my degraded performance. However I tried on two
 machines (all fedora) and after several reboot; I tried also removing the
 /.readahead file... Nothing.

I'm not sure but generally it is said that readahead utilities may need a 
few reboots before fully working. Did you try that? Will it take the journal 
into account then?

Maybe there is something special about your setup and your system.journal 
does not get touched during boot because of this? You said you also use 
rsyslogd. Maybe logs are redirected there instead of written to both logging 
mechanisms? Maybe most logging by system daemons is turned off or they write 
directly to log files?

I tuned all my daemons and services to not write to any log files at all, 
instead log everything to stderr/stdout (which is captured by systemd) or 
into the syslogger which is journald as the only mechanism in my system. I 
just prefer to have everything in one place, and journalctl is a great way 
to filter the logs or create custom views. Maybe that's a reason why it is 
taken into account by readahead on my system? I can only guess but maybe 
worth a try for you...

 To get back to your performance problem: Did you try systemctl enable
 bootchart and then (after a reboot) look at the png generated in
 /run/log (at least in my system it is placed there, look at the configs
 if in doubt). It can give some interesting clues about why a service
 starts late or takes long to execute.
 Unfortunately bootchart doesn't work on fedora, because CONFIG_SCHEDSTATS
 is not enabled in the kernel.

Well, I'm on Gentoo so my kernel is custom built by design. But I checked my 
config, I even have no config option CONFIG_SCHEDSTATS but bootchart works 
for me:

$ zfgrep SCHED /proc/config.gz 
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
CONFIG_IOSCHED_BFQ=y
CONFIG_DEFAULT_IOSCHED=deadline
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_HRTICK=y
# CONFIG_NET_SCHED is not set
CONFIG_USB_EHCI_TT_NEWSCHED=y

BTW: After migrating to systemd, I had a boot time of 15s to DM on 
rotational with several heavy services enabled. I did not measure time to 
desktop back then. But after I did a btrfs balance (I use multidevice 
btrfs), the time to DM went up to around 25s and never went back no matter 
what I tried (even when I disabled some services). I switched to autologin 
DM since then so at least systemd-readahead catches some of the KDE 
services, too, now. But it still feels slow. I'm probably redoing my btrfs 
in the future, switching to 16k leaf size in that process. But I'll wait 
with that until I migrate to bigger hard disks. Currently I have still 
plenty of space left. According to bootchart I never had a problem with time 
spent processing the journal. But since a few kernel versions, auto-suspend 
works great for me, so I usually do not switch the system off, just leave it 
running and it suspends to RAM after 2 hours. As a nice side effect, it 
wakes up instantly (especially since kernel 3.15), and I save a lot of 
power. Still waiting for the systemd auto-wakeup facility to run timer jobs 
like a backup. ;-)

-- 
Replies to list only preferred.

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Goffredo Baroncelli
On 06/15/2014 12:50 PM, Kai Krakow wrote:
 Goffredo Baroncelli kreij...@libero.it schrieb:
 
 I have enabled all the services (collect, replay, done), but I was
 unable to see any gain.

 I don't think you can actually enable the done service because it is
 static... But if it placed a symlink in /etc/systemd that is probably
 wrong and you should undo it. It is automatically activated a few seconds
 after booting is done to store the collected data for replaying on next
 reboot.

 I checked the output of systemd-analyze: systemd-readahead-done is called
 at the end of the chart. The sym-links exist only for the collect and
 replay services, which were called at the beginning.
 
 Then all is good.
 
 I don't know why but system.journal is not considered by readahead:

 # /usr/lib/systemd/systemd-readahead analyze | grep journal


100% ( 1)  770: /etc/systemd/journald.conf
 50% ( 1)  4194304:
 /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal
 50% ( 1)  4194304:
 /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal

 I tried that on my system and it looks vice-versa somehow:

 $ sudo /usr/lib/systemd/systemd-readahead analyze|fgrep journal
100% ( 1)  812: /etc/systemd/journald.conf
100% ( 1)  8388608:
 /var/log/journal/121b87ca633e8ac001665668001b/system.journal

 Strange... ;-)

 May be this is a reason of my degraded performance. However I tried on two
 machines (all fedora) and after several reboot; I tried also removing the
 /.readahead file... Nothing.
 
 I'm not sure but generally it is said that readahead utilities may need a 
 few reboots before fully working. Did you try that? Will it take the journal 
 into account then?

I hope that today I will publish my results. Anyway, I performed about 50-70 
reboots on each configuration.

 
 Maybe there is something special about your setup and your system.journal 
 does not become touched during boot because of this? You said you also use 
 rsyslogd. Maybe logs are redirected there instead of written to both logging 
 mechanisms? Maybe most logging by system daemons is turned off or they write 
 directly to log files?
 
 I tuned all my daemons and services to not write to any log files at all, 
 instead log everything to stderr/stdout (which is captured by systemd) or 
 into the syslogger which is journald as the only mechanism in my system. I 
 just prefer to have everything in one place, and journalctl is a great way 
 to filter the logs or create custom views. Maybe that's a reason why it is 
 taken into account by readahead on my system? I can only guess but maybe 
 worth a try for you...

I don't have any idea... definitely I will investigate a bit. 

 
 To get back to your performance problem: Did you try systemctl enable
 bootchart and then (after a reboot) look at the png generated in
 /run/log (at least in my system it is placed there, look at the configs
 if in doubt). It can give some interesting clues about why a service
 starts late or takes long to execute.
 Unfortunately bootchart doesn't work on fedora, because CONFIG_SCHEDSTATS
 is not enabled in the kernel.
 
 Well, I'm on Gentoo so my kernel is custom built by design. But I checked my 
 config, I even have no config option CONFIG_SCHEDSTATS but bootchart works 
 for me:

I am talking about systemd-bootchart. I have to look if Fedora has the classic 
bootchart.



 
 $ zfgrep SCHED /proc/config.gz 
 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
 CONFIG_CGROUP_SCHED=y
 CONFIG_FAIR_GROUP_SCHED=y
 CONFIG_RT_GROUP_SCHED=y
 CONFIG_SCHED_AUTOGROUP=y
 CONFIG_IOSCHED_NOOP=y
 CONFIG_IOSCHED_DEADLINE=y
 CONFIG_IOSCHED_CFQ=y
 CONFIG_CFQ_GROUP_IOSCHED=y
 CONFIG_IOSCHED_BFQ=y
 CONFIG_DEFAULT_IOSCHED=deadline
 CONFIG_SCHED_OMIT_FRAME_POINTER=y
 CONFIG_SCHED_SMT=y
 CONFIG_SCHED_MC=y
 CONFIG_SCHED_HRTICK=y
 # CONFIG_NET_SCHED is not set
 CONFIG_USB_EHCI_TT_NEWSCHED=y
 
 BTW: After migrating to systemd, I had a boot time of 15s to DM on 
 rotational with serveral heavy services enabled. I did not measure time to 
 desktop back then. But after I did a btrfs balance (I use multidevice 
 btrfs), the time to DM went up to around 25s and never went back no matter 
 what I tried (even when I disabled some services). I switched to autologin 
 DM since then so at least systemd-readahead catches some of the KDE 
 services, too, now. But it still feels slow. I'm probably redoing my btrfs 
 in the future, switching to 16k leaf size in that process. But I'll wait 
 with that until I migrate to bigger hard disks. Currently I have still 
 plenty of space left. According to bootchart I never had a problem with time 
 spent processing the journal. But since a few kernel versions, auto-suspend 
 works great for me, so I usually do not switch the system off, just leave it 
 running and it suspends to RAM after 2 hours. As a nice side effect, it 
 wakes up instantly (especially since kernel 3.15), and I save a lot of 
 power. Still waiting for 

Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Kai Krakow
Goffredo Baroncelli kreij...@libero.it schrieb:

 To get back to your performance problem: Did you try systemctl enable
 bootchart and then (after a reboot) look at the png generated in
 /run/log (at least in my system it is placed there, look at the configs
 if in doubt). It can give some interesting clues about why a service
 starts late or takes long to execute.
 Unfortunately bootchart doesn't work on fedora, because
 CONFIG_SCHEDSTATS is not enabled in the kernel.
 
 Well, I'm on Gentoo so my kernel is custom built by design. But I checked
 my config, I even have no config option CONFIG_SCHEDSTATS but bootchart
 works for me:
 
 I am talking about systemd-bootchart. I have to look if Fedora has the
 classic bootchart.

I simply did systemctl enable bootchart.service which is probably part of 
bootchart2. It worked for me. I will look into systemd-bootchart; I never 
realized it existed as a standalone solution... Thanks for the pointer.

-- 
Replies to list only preferred.

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Lennart Poettering
On Sat, 14.06.14 09:52, Goffredo Baroncelli (kreij...@libero.it) wrote:

  Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
  block has been rewritten to a new location and is now an extent unto 
  itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
  single 4 KiB block in size.
 
 Several people pointed to fallocate as the problem. But I don't
 understand the reason.

BTW, the reason we use fallocate() in journald is not about trying to
optimize anything. It's only used for one reason: to avoid SIGBUS on
disk/quota full, since we actually write everything to the files using
mmap(). I mean, writing things with mmap() is always problematic, and
handling write errors is awfully difficult, but at least two of the most
common reasons for failure we'd like to protect against in advance, under
the assumption that disk/quota full will be reported immediately by the
fallocate(), and the mmap writes later on will then necessarily succeed.
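
To make that concrete, the pattern is essentially the following (a
stripped-down sketch for illustration only, not the actual journal-file.c
code; the file name is made up):

/* sketch: reserve space first, then write through an mmap'ed window */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHUNK (8u * 1024 * 1024)   /* grow the file in 8 MiB steps */

int main(void) {
        int fd = open("test.journal", O_RDWR | O_CREAT, 0640);
        if (fd < 0) { perror("open"); return 1; }

        /* Reserve the space up front: disk full / quota full shows up here
         * as an error return, instead of as SIGBUS during the mmap writes. */
        int r = posix_fallocate(fd, 0, CHUNK);
        if (r != 0) { fprintf(stderr, "posix_fallocate: %s\n", strerror(r)); return 1; }

        /* Writes then go through the mapping; the range is already allocated,
         * so they are not expected to fail for lack of disk space. */
        char *p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        memcpy(p, "some journal entry\n", 19);

        munmap(p, CHUNK);
        close(fd);
        return 0;
}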

I am not really following why this trips up btrfs, though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...

Anyway, happy to change these things around if necessary, but first I'd
like to have a very good explanation why fallocate() wouldn't be the
right thing to invoke here, and a suggestion what we should do instead
to cover this usecase...

Lennart

-- 
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Lennart Poettering
On Wed, 11.06.14 20:32, Chris Murphy (li...@colorremedies.com) wrote:

  systemd has a very stupid journal write pattern. It checks if there
  is space in the file for the write, and if not it fallocates the
  small amount of space it needs (it does *4 byte* fallocate calls!)

Not really the case. 

http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-file.c#n354

We allocate 8mb at minimum.

  and then does the write to it.  All this does is fragment the crap
  out of the log files because the filesystems cannot optimise the
  allocation patterns.

Well, it would be good if you'd tell me what to do instead...

I am invoking fallocate() in advance, because we write those files with
mmap() and that of course would normally trigger SIGBUS already on the
most boring of reasons, such as disk full/quota full or so. Hence,
before we do anything like that, we invoke fallocate() to ensure that
the space is actually available... As far as I can see, that is pretty much
in line with what fallocate() is supposed to be useful for, the man page
says this explicitly:

 ...After a successful call to posix_fallocate(), subsequent writes
  to bytes in the specified range are guaranteed not to fail because
  of lack of disk space.

Happy to be informed that the man page is wrong. 

I am also happy to change our code, if it really is the wrong thing to
do. Note however that I generally favour correctness and relying on
documented behaviour, instead of nebulous optimizations whose effects
might change with different file systems or kernel versions...

  Yup, it fragments journal files on XFS, too.
  
  http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
  
  IIRC, the systemd developers consider this a filesystem problem and
  so refused to change the systemd code to be nice to the filesystem
  allocators, even though they don't actually need to use fallocate...

What? No need to be a dick. Nobody ever pinged me about this. And yeah, I
think I have a very good reason to use fallocate(). The only reason in
fact the man page explicitly mentions.

Lennart

-- 
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Lennart Poettering
On Sun, 15.06.14 05:43, Duncan (1i5t5.dun...@cox.net) wrote:

 The base problem isn't fallocate per se, rather, tho it's the trigger in 
 this case.  The base problem is that for COW-based filesystems, *ANY* 
 rewriting of existing file content results in fragmentation.
 
 It just so happens that the only reason there's existing file content to 
 be rewritten (as opposed to simply appending) in this case, is because of 
 the fallocate.  The rewrite of existing file content is the problem, but 
 the existing file content is only there in this case because of the 
 fallocate.
 
 Taking a step back...
 
 On a non-COW filesystem, allocating 8 MiB ahead and writing into it 
 rewrites into the already allocated location, thus guaranteeing extents 
 of 8 MiB each, since once the space is allocated it's simply rewritten in-
 place.  Thus, on a non-COW filesystem, pre-allocating in something larger 
 than single filesystem blocks when an app knows the data is eventually 
 going to be written in to fill that space anyway is a GOOD thing, which 
 is why systemd is doing it.

Nope, that's not why we do it. We do it to avoid SIGBUS on disk full...

 But on a COW-based filesystem fallocate is the exact opposite, a BAD 
 thing, because an fallocate forces the file to be written out at that 
 size, effectively filled with nulls/blanks.  Then the actual logging 
 comes along and rewrites those nulls/blanks with actual data, but it's 
 now a rewrite, which on a COW, copy-on-write, based filesystem, the 
 rewritten block is copied elsewhere, it does NOT overwrite the existing 
 null/blank block, and elsewhere by definition means detached from the 
 previous blocks, thus in an extent all by itself.

Well, quite frankly I am not entirely sure why fallocate() would be of any
use like that on COW file systems, if this is really how it is
implemented... I mean, as I understood fallocate() -- and as the man
page suggests -- it is something for reserving space on disk, not for
writing out anything. This is why journald is invoking it, to reserve
the space, so that later write accesses to it will not require any
reservation anymore, and hence are unlikely to fail.

Lennart

-- 
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Lennart Poettering
On Sat, 14.06.14 12:59, Kai Krakow (hurikha...@gmail.com) wrote:

 
  Duncan 1i5t5.dun...@cox.net wrote:
 
  As they say, Whoosh!
  
  At least here, I interpreted that remark as primarily sarcastic
  commentary on the systemd devs' apparent attitude, which can be
  (controversially) summarized as: Systemd doesn't have problems because
  it's perfect.  Therefore, any problems you have with systemd must instead
  be with other components which systemd depends on.
 
 Come on, sorry, but this is fud. Really... ;-)

Interestingly, I never commented on anything in this area, and neither
did anybody else from the systemd side afaics. The entire btrfs defrag
thing I wasn't aware of before this thread started on the systemd ML a
few days ago. I am not sure where you take your ideas about our
attitude from. God, with behaviour like that you just make us ignore
you, Duncan.

Lennart

-- 
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Zbigniew Jędrzejewski-Szmek
On Sun, Jun 15, 2014 at 11:40:42AM +0200, Goffredo Baroncelli wrote:
 Unfortunately bootchart doesn't work on fedora, because
 CONFIG_SCHEDSTATS is not enabled in the kernel.

It's being investigated, unfortunately with no progress since April:

https://bugzilla.redhat.com/show_bug.cgi?id=1013225

Zbyszek
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Chris Murphy

On Jun 15, 2014, at 4:34 PM, Lennart Poettering lenn...@poettering.net wrote:

 On Wed, 11.06.14 20:32, Chris Murphy (li...@colorremedies.com) wrote:
 
 systemd has a very stupid journal write pattern. It checks if there
 is space in the file for the write, and if not it fallocates the
 small amount of space it needs (it does *4 byte* fallocate calls!)
 
 Not really the case. 
 
 http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-file.c#n354
 
 We allocate 8mb at minimum.
 
 and then does the write to it.  All this does is fragment the crap
 out of the log files because the filesystems cannot optimise the
 allocation patterns.
 
 Well, it would be good if you'd tell me what to do instead...
 
 I am invoking fallocate() in advance, because we write those files with
 mmap() and that of course would normally trigger SIGBUS already on the
 most boring of reasons, such as disk full/quota full or so. Hence,
 before we do anything like that, we invoke fallocate() to ensure that
 the space is actually available... As far as I can see, that is pretty much
 in line with what fallocate() is supposed to be useful for, the man page
 says this explicitly:
 
 ...After a successful call to posix_fallocate(), subsequent writes
  to bytes in the specified range are guaranteed not to fail because
  of lack of disk space.
 
 Happy to be informed that the man page is wrong. 
 
 I am also happy to change our code, if it really is the wrong thing to
 do. Note however that I generally favour correctness and relying on
 documented behaviour, instead of nebulous optimizations whose effects
 might change with different file systems or kernel versions...
 
 Yup, it fragments journal files on XFS, too.
 
 http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
 
 IIRC, the systemd developers consider this a filesystem problem and
 so refused to change the systemd code to be nice to the filesystem
 allocators, even though they don't actually need to use fallocate...
 
 What? No need to be a dick. Nobody ever pinged me about this. And yeah, I
 think I have a very good reason to use fallocate(). The only reason in
 fact the man page explicitly mentions.
 
 Lennart

For what it's worth, I did not write what is attributed to me above.  I was 
quoting Dave Chinner, and I've confirmed the original attribution correctly 
made it onto the systemd-devel@ list.

I don't know whether some people on this distribution list are even subscribed 
to systemd-devel@ so those subsequent responses aren't likely being posted to 
systemd-devel@ but rather to linux-btrfs@.


Chris Murphy
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-14 Thread Goffredo Baroncelli
On 06/14/2014 04:53 AM, Duncan wrote:
 Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
 excerpted:
 
 On 06/13/2014 01:24 AM, Dave Chinner wrote:
 On Thu, Jun 12, 2014 at 12:37:13PM +, Duncan wrote:

 FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
 actually pretty much equally bad without NOCOW set on the file.

 So maybe it's been fixed in systemd since the last time I looked.
 Yup:

 http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-
 file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58

 The reason it was changed? To save a syscall per append, not to
 prevent fragmentation of the file, which was the problem everyone was
 complaining about...

 thanks for pointing that out. However I am performing my tests on a fedora
 20 with systemd-208, which seems to have this change

 Why?  Because btrfs data blocks are 4 KiB.  With COW, the effect for
 either 4 byte or 8 MiB file allocations is going to end up being the
 same, forcing (repeated until full) rewrite of each 4 KiB block into
 its own extent.

 I am reaching the conclusion that fallocate is not the problem. The
 fallocate increases the file size by about 8MB, which is enough for some
 logging. So it is not called very often.
 
 But... 
 
 If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with 
 nodatacow), then an fallocate of 8 MiB will increase the file size by 8 
 MiB and write that out.  So far so good as at that point the 8 MiB should 
 be a single extent.  But then, data gets written into 4 KiB blocks of 
 that 8 MiB one at a time, and because btrfs is COW, the new data in the 
 block must be written to a new location.
 
 Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
 block has been rewritten to a new location and is now an extent unto 
 itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
 single 4 KiB block in size.

Several people pointed to fallocate as the problem. But I don't understand the 
reason.
1) 8MB is quite a huge value, so fallocate is called (at worst) 1 time during 
the boot. Often never, because the logs are less than 8MB.
2) it is true that btrfs rewrites each 4kb page almost 2 times with fallocate. 
But the first time is a big write of 8MB; the second write would 
happen in any case. What I mean is that without the fallocate journald 
would make small writes in any case.

To be honest, I struggle to see the gain of having a fallocate on a COW 
filesystem... maybe I don't understand the fallocate() call very well.

 
 =:^(
 
 Tho as I already stated, for file sizes upto 128 MiB or so anyway[2], the 
 btrfs autodefrag mount option should at least catch that and rewrite 
 (again), this time sequentially.
 
 I have to investigate more what happens when the log are copied from
 /run to /var/log/journal: this is when journald seems to slow all.
 
 That's an interesting point.
 
 At least in theory, during normal operation journald will write to 
 /var/log/journal, but there's a point during boot at which it flushes the 
 information accumulated during boot from the volatile /run location to 
 the non-volatile /var/log location.  /That/ write, at least, should be 
  sequential, since there will be > 4 KiB of journal accumulated that needs 
 to be transferred at once.  However, if it's being handled by the forced 
 pre-write fallocate described above, then that's not going to be the 
 case, as it'll then be a rewrite of already fallocated file blocks and 
 thus will get COWed exactly as I described above.
 
 =:^(
 
 
  I have prepared a PC which reboots continuously; I am collecting the time
  required to finish the boot vs the fragmentation of the system.journal
  file vs the number of boots. The results are dramatic: after 20 reboots,
  the boot time increases by 20-30 seconds. Doing a defrag of
  system.journal reduces the boot time to the original one, but after
  another 20 reboots, the boot time still requires 20-30 seconds more

 It is a slow PC, but I saw the same behavior also on a more modern pc
 (i5 with 8GB).

 For both PC the HD is a mechanical one...
 
 The problem's duplicable.  That's the first step toward a fix. =:^)
I hope so
 
 And that's now a btrfs problem :/

 Are you sure ?
 
 As they say, Whoosh!
 
[...] 
 Another alternative is that distros will start setting /var/log/journal 
 NOCOW in their setup scripts by default when it's btrfs, thus avoiding 
 the problem.  (Altho if they do automated snapshotting they'll also have 
 to set it as its own subvolume, to avoid the first-write-after-snapshot-
 is-COW problem.)  Well, that, and/or set autodefrag in the default mount 
 options.

Note that this also removes the checksums, which are very useful in a 
RAID configuration.

 
 Meanwhile, there's some focus on making btrfs behave better with such 
 rewrite-pattern files, but while I think the problem can be made /some/ 
 better, hopefully enough that the defaults bother far fewer people in far 

Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-14 Thread Kai Krakow
Duncan 1i5t5.dun...@cox.net wrote:

 As they say, Whoosh!
 
 At least here, I interpreted that remark as primarily sarcastic
 commentary on the systemd devs' apparent attitude, which can be
 (controversially) summarized as: Systemd doesn't have problems because
 it's perfect.  Therefore, any problems you have with systemd must instead
 be with other components which systemd depends on.

Come on, sorry, but this is fud. Really... ;-)

 IOW, it's a btrfs problem now in practice, not because it is so in a
 technical sense, but because systemd defines it as such and is unlikely
 to budge, so the only way to achieve progress is for btrfs to deal with
 it.

I think that systemd is even one of the early supporters of btrfs because it 
will defragment readahead files on boot from btrfs. I'd suggest the problem 
is to be found in the different semantics with COW filesystems. And if 
someone loudly complains to the systemd developers how bad they are at doing 
their stuff - hmm, well, I would be disappointed/offended, too, as a 
programmer, because a lot of very well done work has been put into systemd and 
I'd start ignoring such people. In Germany we have a saying for this: Wie 
man in den Wald hineinruft, so schallt es heraus [1] (roughly: the way you 
shout into the forest is the way it echoes back). They are doing many 
things right that had not been adapted to modern systems in the last twenty 
years (or so) by the legacy init systems.

So let's start with my journals, on btrfs:

$ sudo filefrag *
system@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found
system@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-0008e4cc-0004f82580cdcb45.journal: 5 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-00097959-0004f89c2e8aff87.journal: 5 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-000a166d-0004f98d7e04157c.journal: 5 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-000aad59-0004fa379b9a1fdf.journal: 5 extents found
system@ec16f60db38f43619f8337153a1cc024-0001-0004fae8e5057259.journal: 5 extents found
system@ec16f60db38f43619f8337153a1cc024-92b1-0004fb59b1d034ad.journal: 5 extents found
system.journal: 9 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-00085b7a-0004f77d25ebba04.journal: 2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-0008e7fb-0004f83c7bf18294.journal: 2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-00097fe4-0004f8ae69c198ca.journal: 2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-000a1a7e-0004f9966e9c69d8.journal: 2 extents found
user-500.journal: 2 extents found

I don't think these are too bad values, eh?

Well, how did I accomplish that?

First, I've set the journal directories nocow. Of course, systemd should do 
this by default. I'm not sure if this is a packaging or systemd code issue, 
tho. But I think the systemd devs agree that for COW filesystems, the journal 
directories should be set nocow. After all, the journal is a transactional 
database - it does not need COW protection at all costs. And I think they 
have their own checksumming protection. So, why let systemd bother with 
that? A lot of other software has the same semantic problems with btrfs, too 
(e.g. MySQL), where nobody shouts about the inabilities of the programmers. So 
why for systemd? Just because it's intrusive by its nature for being a 
radically and newly designed init system and thus requires some learning by 
its users/admins/packagers? Really? Come on... As admin and/or packager you 
have to stay with current technologies and developments anyways. It's only 
important to hide the details from the users.
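
For reference, setting that flag from code is roughly the following ioctl
dance -- a sketch of what chattr +C does under the hood, not systemd code.
The path is just an example, and it only affects files created in the
directory afterwards; existing files keep their layout:

/* sketch: mark a directory NOCOW so that new files in it inherit the flag */
#include <fcntl.h>
#include <linux/fs.h>       /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_NOCOW_FL */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
        const char *path = argc > 1 ? argv[1] : "/var/log/journal";

        int fd = open(path, O_RDONLY | O_DIRECTORY);
        if (fd < 0) { perror("open"); return 1; }

        unsigned flags = 0;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) { perror("FS_IOC_GETFLAGS"); return 1; }

        flags |= FS_NOCOW_FL;   /* the same bit that `chattr +C` sets */
        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) { perror("FS_IOC_SETFLAGS"); return 1; }

        close(fd);
        return 0;
}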

Back to the extents counts: What I did next was implementing a defrag job 
that regularly defrags the journal (actually, the complete log directory as 
other log files suffer the same problem):

$ cat /usr/local/sbin/defrag-logs.rb 
#!/bin/sh
exec btrfs filesystem defragment -czlib -r /var/log

It can be easily converted into a timer job with systemd. This is left as an 
exercise to the reader.
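
That said, a rough, untested sketch of such a unit pair could look about like
this (the unit names are only a suggestion):

# /etc/systemd/system/defrag-logs.service  (sketch)
[Unit]
Description=Defragment /var/log on btrfs

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/defrag-logs.rb

# /etc/systemd/system/defrag-logs.timer  (sketch)
[Unit]
Description=Daily defrag of /var/log

[Timer]
OnCalendar=daily

[Install]
WantedBy=timers.target

After that it only needs a one-time "systemctl enable defrag-logs.timer".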

BTW: Actually, that job isn't currently executed on my system which makes 
the numbers above pretty impressive... However, autodefrag is turned on 
which may play into the mix. I'm not sure. I stopped automatically running 
those defrag jobs a while ago (I have a few more).
 
 An arguably fairer and more impartial assessment of this particular
 situations suggests that neither btrfs, which as a COW-based filesystem,
 like all COW-based filesystems has the existing-file-rewrite as a major
 technical challenge that it must deal with /somehow/, nor systemd, which
 in choosing to use fallocate is specifically putting itself in that
 existing-file-rewrite class, 

Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-14 Thread Goffredo Baroncelli
On 06/14/2014 12:59 PM, Kai Krakow wrote:
[...]
 
 I think that systemd is even one of the early supporters of btrfs because it 
 will defragment readahead files on boot from btrfs. 

I know that systemd does readahead, but it is the first time that I heard that 
it does defrag too. Could you elaborate?

 I'd suggest the problem 
 is to be found in the different semantics with COW filesystems. 
[--cut--cut--cut--]

The problem is due to the interaction between BTRFS and systemd. Both behave 
differently from their predecessors.

[--cut--cut--cut--]

 So let's start with my journals, on btrfs:
 
 $ sudo filefrag *
 system@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found
 system@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found
 system@806cd49faa074a49b6cde5ff6fca8adc-0008e4cc-0004f82580cdcb45.journal: 5 extents found
 system@806cd49faa074a49b6cde5ff6fca8adc-00097959-0004f89c2e8aff87.journal: 5 extents found
 system@806cd49faa074a49b6cde5ff6fca8adc-000a166d-0004f98d7e04157c.journal: 5 extents found
 system@806cd49faa074a49b6cde5ff6fca8adc-000aad59-0004fa379b9a1fdf.journal: 5 extents found
 system@ec16f60db38f43619f8337153a1cc024-0001-0004fae8e5057259.journal: 5 extents found
 system@ec16f60db38f43619f8337153a1cc024-92b1-0004fb59b1d034ad.journal: 5 extents found
 system.journal: 9 extents found
 user-500@e4209c6628ed4a65954678b8011ad73f-00085b7a-0004f77d25ebba04.journal: 2 extents found
 user-500@e4209c6628ed4a65954678b8011ad73f-0008e7fb-0004f83c7bf18294.journal: 2 extents found
 user-500@e4209c6628ed4a65954678b8011ad73f-00097fe4-0004f8ae69c198ca.journal: 2 extents found
 user-500@e4209c6628ed4a65954678b8011ad73f-000a1a7e-0004f9966e9c69d8.journal: 2 extents found
 user-500.journal: 2 extents found
 
 I don't think these are too bad values, eh?
 
 Well, how did I accomblish that?
 
 First, I've set the journal directories nocow. 

If you use nocow, you lose the btrfs ability to rebuild a RAID array discarding 
the wrong sector. With the systemd journal checksum, you can say that some data is 
wrong, but BTRFS with its checksum (when used in raid mode) is able to 
*correct* the data too.

[...]


 
 Back to the extents counts: What I did next was implementing a defrag job 
 that regularly defrags the journal (actually, the complete log directory as 
 other log files suffer the same problem):
 
 $ cat /usr/local/sbin/defrag-logs.rb 
 #!/bin/sh
 exec btrfs filesystem defragment -czlib -r /var/log

I think that this is a more viable solution; but what seems to me strange is 
the fact that the fragmentation level of my rsyslog file is a lot lower.

[...]


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-14 Thread Kai Krakow
Goffredo Baroncelli kreij...@libero.it wrote:

 On 06/14/2014 12:59 PM, Kai Krakow wrote:
 [...]
 
 I think that systemd is even one of the early supporters of btrfs because
 it will defragment readahead files on boot from btrfs.
 
 I know that systemd does readahead, but it is the first time that I heard
 that it does defrag too. Could you elaborate?

Look at src/readahead/readahead-collect.c. It works for btrfs and ext4. With 
ext4 it will also relocate the files. Not sure if it does for btrfs. If it 
does, the question is: where to relocate in a multi devices file system?

That means, you have to enable systemd-readahead-collect, tho.

-- 
Replies to list only preferred.

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-14 Thread Kai Krakow
Goffredo Baroncelli kreij...@libero.it wrote:

 First, I've set the journal directories nocow.
 
  If you use nocow, you lose the btrfs ability to rebuild a RAID array
  discarding the wrong sector. With the systemd journal checksum, you can
  say that some data is wrong, but BTRFS with its checksum (when used in raid
  mode) is able to *correct* the data too.

The decision is up to you. For me it was easy:

First: I do not care much if some of my logs become broken. I have backups 
anyway with a backlog several weeks long.

Second: I don't use data RAID (at least not yet), just data striping. So the 
btrfs checksum won't help me much - it has about the same meaning as the 
journal checksums.

YMMV... For me, nocow'ing the journal was the clear winner. If you persist 
on using cow for journal files, I'd stick with the defragger:

 Back to the extents counts: What I did next was implementing a defrag job
 that regularly defrags the journal (actually, the complete log directory
 as other log files suffer the same problem):
 
 $ cat /usr/local/sbin/defrag-logs.rb
 #!/bin/sh
 exec btrfs filesystem defragment -czlib -r /var/log
 
 I think that this is a more viable solution; but what seems to me strange
 is the fact that the fragmentation level of my rsyslog file is a lot
 lower.

Are you sure? As Duncan pointed out on many occasions, if your journal file 
is compressed by btrfs (for whatever reason, depends on your mount options 
and defragging habits), then filefrag does not show correct values. It 
shows the count of compression blocks. Only if you are sure about that, it 
is worth considering further steps.

If you are sure about the journals not being compressed (and the rsyslog 
file, too), you can compare values. I think, only then, you could say that 
systemd's allocation strategy for journal files does no good on btrfs. Maybe 
then it's time to fix that by patching systemd to use another strategy, 
either by configuration or auto-detection.

Most of those strategies will probably result in coalescing writes to the 
journal into bigger blocks, which means buffering, which in turn is usually 
not wanted for log files. That in turn means that cowed btrfs is probably 
the wrong medium for storing journal files. You may want to play with 
SyncIntervalSec in journald.conf...
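
For example (journald.conf excerpt, just a sketch; the value is only an
illustration -- larger intervals mean fewer, larger write bursts, at the price
of possibly losing a bit more of the most recent log on a crash):

# /etc/systemd/journald.conf (excerpt, sketch)
[Journal]
SyncIntervalSec=15min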

BTW: I'm not sure if autodefrag would kick in on journal files because those 
appends are not random writes. I think I remember that autodefrag was about 
detecting random writes and then rewriting the file as a whole to reduce 
extent fragmentation. Read: I'm not sure if autodefrag would have an effect 
on this. A btrfs dev may comment on that.

-- 
Replies to list only preferred.

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-14 Thread Goffredo Baroncelli
On 06/14/2014 04:03 PM, Kai Krakow wrote:
 Goffredo Baroncelli kreij...@libero.it wrote:
 
 On 06/14/2014 12:59 PM, Kai Krakow wrote:
 [...]

 I think that systemd is even one of the early supporters of btrfs because
 it will defragment readahead files on boot from btrfs.

  I know that systemd does readahead, but it is the first time that I heard
  that it does defrag too. Could you elaborate?
 
 Look at src/readahead/readahead-collect.c. It works for btrfs and ext4. With 
 ext4 it will also relocate the files. Not sure if it does for btrfs. 

I am looking at the source, and yes, it does. To be honest it seems to 
defragment only on btrfs.

 If it 
 does, the question is: where to relocate in a multi devices file system?

Systemd uses the defrag capability of btrfs.
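
From what I can tell it boils down to a single ioctl per file, roughly like
this (a simplified sketch of my reading of the source, not the actual code):

/* sketch: ask btrfs to defragment one file */
#include <fcntl.h>
#include <linux/btrfs.h>     /* BTRFS_IOC_DEFRAG (needs reasonably recent kernel headers) */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
        if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* a NULL argument means "defragment the whole file with defaults";
         * on non-btrfs file systems this simply fails with ENOTTY */
        if (ioctl(fd, BTRFS_IOC_DEFRAG, NULL) < 0)
                perror("BTRFS_IOC_DEFRAG");

        close(fd);
        return 0;
}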

 
 That means, you have to enable systemd-readahead-collect, tho.
I have to admit that I disabled it. I will make some tests also with readahead 
enabled.


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-14 Thread Kai Krakow
Goffredo Baroncelli kreij...@libero.it wrote:

 I am looking at the source, and yes, it does. To be honest it seems to
 defragment only on btrfs.

Oh that seems true... Then defragmenting and relocating is probably a todo 
item. I'm sure I've read about it somewhere.

 If it
 does, the question is: where to relocate in a multi devices file system?
 
 Systemd uses the defrag capability of btrfs.

man:systemd-readahead says it relocates and defrags if supported. Scanning 
through the source, it only defrags.

 That means, you have to enable systemd-readahead-collect, tho.
 I have to admit that I disabled it. I will make some tests also with
 readahead enabled.

Take care to enable all needed services to enable defrag... If your services 
make use of journal file loading these files should also become part of the 
process. You can check with /usr/lib/systemd/systemd-readahead analyze. 
The state is stored in /.readahead.

-- 
Replies to list only preferred.

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-14 Thread Goffredo Baroncelli
On 06/14/2014 10:13 PM, Kai Krakow wrote:
 Goffredo Baroncelli kreij...@libero.it wrote:
 
 I am looking at the source, and yes, it does. To be honest it seems to
 defragment only on btrfs.
 
 Oh that seems true... Then defragmenting and relocating is probably a todo 
 item. I'm sure I've read about it somewhere.
 
 If it
 does, the question is: where to relocate in a multi devices file system?

 Systemd uses the defrag capability of btrfs.
 
 man:systemd-readahead says it relocates and defrags if supported. Scanning 
 through the source, it only defrags.
 
 That means, you have to enable systemd-readahead-collect, tho.
  I have to admit that I disabled it. I will make some tests also with
  readahead enabled.
 
 Take care to enable all needed services to enable defrag... If your services 
 make use of journal file loading these files should also become part of the 
 process. You can check with /usr/lib/systemd/systemd-readahead analyze. 
 The state is stored in /.readahead.
 
I have enabled all the services (collect, replay, done), but I was unable to 
see any gain. 

I don't know why but system.journal is not considered by readahead:

# /usr/lib/systemd/systemd-readahead analyze | grep journal


   100% ( 1)  770: /etc/systemd/journald.conf
    50% ( 1)  4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal
    50% ( 1)  4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-14 Thread Kai Krakow
Goffredo Baroncelli kreij...@libero.it wrote:

 On 06/14/2014 10:13 PM, Kai Krakow wrote:
 Take care to enable all needed services to enable defrag... If your
 services make use of journal file loading these files should also become
 part of the process. You can check with
 /usr/lib/systemd/systemd-readahead analyze. The state is stored in
 /.readahead.
 
 I have enabled all the services (collect, replay, done), but I was unable
 to see any gain.

I don't think you can actually enable the done service because it is 
static... But if it placed a symlink in /etc/systemd that is probably wrong 
and you should undo it. It is automatically activated a few seconds after 
booting is done to store the collected data for replaying on next reboot.

Check with find /etc/systemd -name '*readahead*' -type l...

 I don't know why but system.journal is not considered by readahead:
 
 # /usr/lib/systemd/systemd-readahead analyze | grep journal
 
 
100% ( 1)  770: /etc/systemd/journald.conf
 50% ( 1)  4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal
 50% ( 1)  4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal

I tried that on my system and it looks vice-versa somehow:

$ sudo /usr/lib/systemd/systemd-readahead analyze|fgrep journal
   100% ( 1)  812: /etc/systemd/journald.conf
   100% ( 1)  8388608: /var/log/journal/121b87ca633e8ac001665668001b/system.journal

Strange... ;-)

To get back to your performance problem: Did you try systemctl enable 
bootchart and then (after a reboot) look at the png generated in /run/log 
(at least in my system it is placed there, look at the configs if in doubt). 
It can give some interesting clues about why a service starts late or takes 
long to execute.

-- 
Replies to list only preferred.

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-14 Thread Duncan
Goffredo Baroncelli posted on Sat, 14 Jun 2014 09:52:39 +0200 as
excerpted:

 On 06/14/2014 04:53 AM, Duncan wrote:
 Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
 excerpted:
 
 thanks for pointing that out. However I am performing my tests on a fedora
 20 with systemd-208, which seems to have this change

 I am reaching the conclusion that fallocate is not the problem. The
 fallocate increases the file size by about 8MB, which is enough for some
 logging. So it is not called very often.

Right.

 But...

Exactly, _but_...

 [A]n fallocate of 8 MiB will increase the file size
 by 8 MiB and write that out.  So far so good as at that point the 8 MiB
 should be a single extent.  But then, data gets written into 4 KiB
 blocks of that 8 MiB one at a time, and because btrfs is COW, the new
 data in the block must be written to a new location.
 
 Which effectively means that by the time the 8 MiB is filled, each 4
 KiB block has been rewritten to a new location and is now an extent
 unto itself.  So now that 8 MiB is composed of 2048 new extents, each
 one a single 4 KiB block in size.
 
 Several people pointed to fallocate as the problem. But I don't understand
 the reason.
 1) 8MB is quite a huge value, so fallocate is called (at worst) 1 time
 during the boot. Often never, because the logs are less than 8MB.
 2) it is true that btrfs rewrites each 4kb page almost 2 times with
 fallocate. But the first time is a big write of 8MB; the
 second write would happen in any case. What I mean is that without the
 fallocate journald would make small writes in any case.
 
 To be honest, I struggle to see the gain of having a fallocate on a COW
 filesystem... maybe I don't understand the fallocate()
 call very well.

The base problem isn't fallocate per se, rather, tho it's the trigger in 
this case.  The base problem is that for COW-based filesystems, *ANY* 
rewriting of existing file content results in fragmentation.

It just so happens that the only reason there's existing file content to 
be rewritten (as opposed to simply appending) in this case, is because of 
the fallocate.  The rewrite of existing file content is the problem, but 
the existing file content is only there in this case because of the 
fallocate.

Taking a step back...

On a non-COW filesystem, allocating 8 MiB ahead and writing into it 
rewrites into the already allocated location, thus guaranteeing extents 
of 8 MiB each, since once the space is allocated it's simply rewritten in-
place.  Thus, on a non-COW filesystem, pre-allocating in something larger 
than single filesystem blocks when an app knows the data is eventually 
going to be written in to fill that space anyway is a GOOD thing, which 
is why systemd is doing it.

But on a COW-based filesystem fallocate is the exact opposite, a BAD 
thing, because an fallocate forces the file to be written out at that 
size, effectively filled with nulls/blanks.  Then the actual logging 
comes along and rewrites those nulls/blanks with actual data, but it's 
now a rewrite, which on a COW, copy-on-write, based filesystem, the 
rewritten block is copied elsewhere, it does NOT overwrite the existing 
null/blank block, and elsewhere by definition means detached from the 
previous blocks, thus in an extent all by itself.

Once the full 2048 original blocks composing that 8 MiB are filled in 
with actual data, because they were rewrites from null/blank blocks that 
fallocate had already forced to be allocated, that's now 2048 separate 
extents, 2048 separate file fragments, where without the forced fallocate, 
the writes would have all been appends, and there would have been at 
least /some/ chance of some of those 2048 separate blocks being written 
at close enough to the same time that they would have been written 
together as a single extent.  So while the 8 MiB might not have been a 
single extent as opposed to 2048 separate extents, it might have been 
perhaps 512 or 1024 extents, instead of the 2048 that it ended up being 
because fallocate meant that each block was a rewrite into an existing 
file, not a new append-write at the end of an existing file.
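
To make the pattern concrete, this is roughly the I/O shape being described
(an illustration only, not journald code; the file name is made up; compare
the filefrag output on a btrfs vs. a non-COW filesystem scratch directory):

/* sketch: preallocate 8 MiB, then dribble 4 KiB rewrites into it */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (8 * 1024 * 1024)   /* preallocated up front, like journald */
#define BLOCK 4096                /* then overwritten one block at a time */

int main(void) {
        char block[BLOCK];
        memset(block, 'x', sizeof(block));

        int fd = open("frag-demo.journal", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        int r = posix_fallocate(fd, 0, CHUNK);
        if (r != 0) { fprintf(stderr, "posix_fallocate: %s\n", strerror(r)); return 1; }

        /* fill the preallocated range block by block, syncing in between the
         * way a logger trickles entries out over time; on COW filesystems
         * each of these small rewrites can end up as its own extent */
        for (off_t off = 0; off < CHUNK; off += BLOCK) {
                if (pwrite(fd, block, BLOCK, off) != BLOCK) { perror("pwrite"); return 1; }
                fsync(fd);
        }

        close(fd);
        return 0;   /* inspect the result with: filefrag frag-demo.journal */
}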

 [...]
 Another alternative is that distros will start setting /var/log/journal
 NOCOW in their setup scripts by default when it's btrfs, thus avoiding
 the problem.  (Altho if they do automated snapshotting they'll also
 have to set it as its own subvolume, to avoid the
 first-write-after-snapshot-
 is-COW problem.)  Well, that, and/or set autodefrag in the default
 mount options.
 
 Note that this also removes the checksums, which are very useful
 in a RAID configuration.

Well, it can be.  But this is only log data, not executable or the like 
data, and (as Kai K points out) journald has its own checksumming method 
in any case.

Besides which, you still haven't explained why you can't either set the 
autodefrag mount option and be done with it, or run a systemd-timer-
triggered or cron-triggered defrag script to defrag them 

Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-11 Thread Chris Murphy

On Jun 11, 2014, at 3:28 PM, Goffredo Baroncelli kreij...@libero.it wrote:
 
 If someone is able to suggest me how FRAGMENT the log file, I can try to 
 collect more scientific data.

So long as you're not using compression, filefrag will show you fragments of 
systemd-journald journals. I can vouch for the behavior you experience without 
xattr +C or autodefrag, but further it also causes much slowness when reading 
journal contents. Like if I want to search all boots for a particular error 
message to see how far back it started, this takes quite a bit longer than on 
other file systems. So far I'm not experiencing this problem with autodefrag or 
any other negative side effects, but my understanding is this code is still in 
flux.

Since the journals have their own checksumming I'm not overly concerned about 
setting xattr +C.

Chris Murphy
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-11 Thread Chris Murphy
On Jun 11, 2014, at 7:21 PM, Dave Chinner da...@fromorbit.com wrote:

 On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
 Hi all,
 
  I would like to share my experience about a slowness of systemd when used 
  on BTRFS.
  
  My boot time was very high (about ~50 seconds); most of the time it was due to 
  NetworkManager which took about 30-40 seconds to start (this data came from 
  systemd-analyze plot).
  
  I made several attempts to address this issue. Also I noticed that sometimes 
  this problem disappeared; but I was never able to understand why.
 
 However this link
 
  https://bugzilla.redhat.com/show_bug.cgi?id=1006386
 
  suggested to me that the problem could be due to a bad interaction between 
 systemd and btrfs. NetworkManager was innocent. 
 
 systemd has a very stupid journal write pattern. It checks if there
 is space in the file for the write, and if not it fallocates the
 small amount of space it needs (it does *4 byte* fallocate calls!)
 and then does the write to it.  All this does is fragment the crap
 out of the log files because the filesystems cannot optimise the
 allocation patterns.
 
 Yup, it fragments journal files on XFS, too.
 
 http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
 
 IIRC, the systemd developers consider this a filesystem problem and
 so refused to change the systemd code to be nice to the filesystem
 allocators, even though they don't actually need to use fallocate...
 
 Cheers,
 
 Dave.
 
 -- 
 Dave Chinner
 da...@fromorbit.com

On Jun 11, 2014, at 7:37 PM, Dave Chinner da...@fromorbit.com wrote:
 
  BTW, the systemd list is subscriber only, so they aren't going to
 see anything that we comment on from a cross-post to the btrfs list.


Unless a subscriber finds something really interesting, quotes it, and cross 
posts it.

Chris Murphy
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel