Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On Wed, 18.06.14 07:51, Tollef Fog Heen (tfh...@err.no) wrote:

]] Lennart Poettering (trimming Ccs) I am invoking fallocate() in advance, because we write those files with mmap() and that of course would normally trigger SIGBUS already for the most boring of reasons, such as disk full/quota full or so. Hence, before we do anything like that, we invoke fallocate() to ensure that the space is actually available... As far as I can see, that's pretty much in line with what fallocate() is supposed to be useful for; the man page says this explicitly: OOI, why not use write + read-only mmap for reading? In testing performance for another project, that's significantly faster than rw mmap.

Well, non-mmap stuff might indeed be quicker, but this isn't really optimized for the last bit of speed; it's more about keeping the code simple. And the code is quite simple now, as we can use the same codepaths to locate things for appending and for finding them... Lennart -- Lennart Poettering, Red Hat ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Goffredo Baroncelli posted on Mon, 16 Jun 2014 20:47:49 +0200 as excerpted: The tests #6 and #7 suggested that the fsync(2) and posix_fallocate(3) calls aren't the root cause of the problem. Even without these the system.journal file still fragments. [1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html Thanks for injecting some concrete facts in the form of test results into things. =:^) So I was barking up the wrong tree with fallocate. But I had the base problem correct -- fragmentation -- and good recommendations for working around it -- NOCOW and/or (auto)defrag. (And for that matter, for those like me where /var/log is a dedicated partition, simply putting /var/log on something other than btrfs would nicely work around the issue too.) The experts are working on it now, so I'll step out for the most part, as anything else I could add at this point would simply be noise. But I did consider it important to acknowledge that I had been wrong about the fallocate. -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Lennart Poettering posted on Mon, 16 Jun 2014 00:39:39 +0200 as excerpted: Well, quite frankly I am not entirely sure why fallocate() I was barking up the wrong tree with fallocate(). Sorry. -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Lennart Poettering posted on Mon, 16 Jun 2014 00:43:34 +0200 as excerpted: At least here, I interpreted that remark as primarily sarcastic commentary on the systemd devs' apparent attitude, which can be (controversially) summarized as: Systemd doesn't have problems because it's perfect. Therefore, any problems you have with systemd must instead be with other components which systemd depends on. Come on, sorry, but this is fud. Really... ;-) Interestingly, I never commented on anything in this area, and neither did anybody else from the systemd side afaics. The entire btrfs defrag thing I wasn't aware of before this thread started on the systemd ML a few days ago. I am not sure where you take your ideas about our attitude from. God, with behaviour like that you just make us ignore you, Duncan. Sorry. As you'll note, I said "can be (controversially)"... I never stated that *I* held that position personally, as I was taking the third-person observer position. As such, I've seen that attitude expressed by others in multiple threads when systemd comes up (as Josef alluded to as well), and I simply interpreted the remark to which I was alluding ("And that's now a btrfs problem :/") as a sarcastic reference to that attitude... which again I never claimed as my own. Thus the whoosh, in reference to what I interpreted as sarcasm, which certainly doesn't require full agreement in order to understand. (Tho as I mentioned elsewhere, I can certainly see where they're coming from given the recent kernel debug kerfuffle and the like... and I'm certainly not alone there... but I'm /trying/ to steer a reasonably neutral path while appreciating both sides.) In fact, I specifically stated elsewhere that I recently switched to systemd -- by choice, as I'm on Gentoo, which still defaults to openrc -- myself.
Certainly I would not have done so if I believed systemd was as bad as all that, and the fact that I HAVE done so definitely implies a rather large amount of both trust and respect for the systemd devs, or I'd not be willing to run their code. But nevertheless I can see the viewpoint from both sides now, and do try to maintain a reasonable neutrality. I guess I should have made that more explicit in the original post, but as they say, hindsight is 20/20. =:^\ -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On 17/06/14 09:17, Duncan wrote: [...] But never-the-less I can see the viewpoint from both sides now, and do try to maintain a reasonable neutrality. I guess I should have made that more explicit in the original post, but as they say, hindsight is 20/20. =:^\ Hey! All good for a giggle and excellent for friendly attention to get things properly fixed :-) And this thread may even hit the news for Phoronix and LWN and others, all in true FLOSS open development and thorough debug :-P And... And... We get a few bits of systemd, btrfs, and filesystem interactions fixed for how things /should/ work best. Thanks (and for others also) for stirring a rapid chase! ;-) Regards, Martin (A fellow Gentoo-er ;-) )
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Hi Kai, I investigated a bit why readahead doesn't defrag the journal. I also put the mailing list in CC for the record. On 06/17/2014 03:33 AM, Kai Krakow wrote: [...] Instead, for me, the readahead collector catches access to my system journal and thus defragments it. That's not the case for your system which explains why it won't be defragmented for you when enabling readahead. The readahead program skips files greater than 10M. If you look at its man page, it documents the switch --file-size-max. The default value is READAHEAD_FILE_SIZE_MAX, which is equal to 10M. So readahead defrags system.journal only when it is smaller than this value, which happens only a few times. This behavior seems reasonable to me, to avoid blocking a system when a big file is touched during the boot. BR G.Baroncelli -- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it) Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On Jun 16, 2014, at 7:13 PM, cwillu cwi...@cwillu.com wrote: It's not a mmap problem, it's a small writes with an msync or fsync after each one problem. For the case of sequential writes (via write or mmap), padding writes to page boundaries would help, if the wasted space isn't an issue. Another approach, again assuming all other writes are appends, would be to periodically (but frequently enough that the pages are still in cache) read a chunk of the file and write it back in-place, with or without an fsync. On the other hand, if you can afford to lose some logs on a crash, not fsyncing/msyncing after each write will also eliminate the fragmentation. Normally I'd be willing to give up ~30 seconds of journal to not have fragmented journals. But then if I use systemd.log_level=debug I'd like that to trigger more frequent flushing to make sure as little of the journaling is lost as possible. Does that make sense? Chris Murphy
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Goffredo Baroncelli kreij...@libero.it schrieb: I investigated a bit why readahead doesn't defrag the journal. I also put the mailing list in CC for the record. On 06/17/2014 03:33 AM, Kai Krakow wrote: [...] Instead, for me, the readahead collector catches access to my system journal and thus defragments it. That's not the case for your system which explains why it won't be defragmented for you when enabling readahead. The readahead program skips files greater than 10M. If you look at its man page, it documents the switch --file-size-max. The default value is READAHEAD_FILE_SIZE_MAX, which is equal to 10M. So readahead defrags system.journal only when it is smaller than this value, which happens only a few times. This behavior seems reasonable to me, to avoid blocking a system when a big file is touched during the boot. Ah okay, good catch... Well, systemd does not generate massive amounts of logs for me. Maybe you want to play around with SystemMaxUse in journald.conf to let the file size stay below 10M? Actually, for servers using syslog, I set the log file rotation to kick in after 10M worth of logging (or after 7 days, whichever comes first) so our rsync-diffed backups stay small for log files, as otherwise rsync would start creating huge diffs for backups when I do the rsync --backup way of doing backups. So that may actually be a not too bad option. I'm not using it on my home system because I do snapshot-based backups to btrfs and the diffs are just much smaller (using rsync --no-whole-file --inplace). We recently switched to zfs-based backup storage for our servers, and in that process switched to using rsync --no-whole-file --inplace, too. So we won't need the small log rotation scheme any longer, but I feel comfortable with it so I'll stick with it. BTW: I checked our systemd-enabled servers and fragmentation of journal files is very very high there, too. Way above 5000 extents.
The journals are on xfs, so it suffers that problem, too (which has already been pointed out somewhere in this thread multiple times). But to my surprise, this does not affect the performance of journalctl there. The bad performance is only for fragmentation on btrfs, not xfs. IMHO, this pushes the pointer a little bit more towards systemd, tho I couldn't suggest how to fix it. Time for the filesystem hackers to kick in. BTW: The servers are virtualized, so by default systemd-readahead does nothing there because of ConditionVirtualization in the service files. OTOH, it won't defragment there anyways because xfs is not supported for this in systemd, although it would probably be possible because there is xfs_fsr. But my feeling is that xfs_fsr is more or less just locking the file, copying it, then replacing the original. It cannot work on files in use. This is different for btrfs, which only chokes on defragmenting files currently mmap'ed (e.g., binaries being executed, read: busy text files). Following that thought, I wonder why it seems to defragment system.journal for me, because after all I learned here, systemd writes to the files mmap'ed. -- Replies to list only preferred.
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Chris Murphy li...@colorremedies.com schrieb: It's not a mmap problem, it's a small writes with an msync or fsync after each one problem. For the case of sequential writes (via write or mmap), padding writes to page boundaries would help, if the wasted space isn't an issue. Another approach, again assuming all other writes are appends, would be to periodically (but frequently enough that the pages are still in cache) read a chunk of the file and write it back in-place, with or without an fsync. On the other hand, if you can afford to lose some logs on a crash, not fsyncing/msyncing after each write will also eliminate the fragmentation. Normally I'd be willing to give up ~30 seconds of journal to not have fragmented journals. But then if I use systemd.log_level=debug I'd like that to trigger more frequent flushing to make sure as little of the journaling is lost as possible. Does that make sense? AFAIR systemd-journald already does explicit flushing when more important log entries hit the log. Well, of course debug is one of the least important log levels in that chain. But using the same code paths, it should be trivial to implement a configurable trigger which also forces explicit flushing... -- Replies to list only preferred.
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Filipe Brandenburger filbran...@google.com schrieb: On Mon, Jun 16, 2014 at 6:13 PM, cwillu cwi...@cwillu.com wrote: For the case of sequential writes (via write or mmap), padding writes to page boundaries would help, if the wasted space isn't an issue. Another approach, again assuming all other writes are appends, would be to periodically (but frequently enough that the pages are still in cache) read a chunk of the file and write it back in-place, with or without an fsync. On the other hand, if you can afford to lose some logs on a crash, not fsyncing/msyncing after each write will also eliminate the fragmentation. I was wondering if something could be done in btrfs to improve performance under this workload... Something like a defrag on demand for a case where mostly appends are happening. When there are small appends with fsync/msync, they become new fragments (as expected), but once the writes go past a block boundary, btrfs could defragment the previous block in background, since it's not really expected to change again. That could potentially achieve performance close to chattr +C without the drawbacks of disabling copy-on-write. I thought about something like that, too. I'm pretty sure it really doesn't matter if your 500G image file is split across many extents - as long as at least chunks of extents are kept together and rebuilt as one extent. That means, instead of letting autodefrag work on the whole file, just let it operate on a chunk of it within some sane boundaries - maybe 8MB chunks - of course without splitting existing extents if those already cross a chunk boundary. That way, it would still reduce head movements a lot while maintaining good performance during defragmentation. Your idea would be the missing companion to that (it is some sort of slow-growing-file detection). If I remember correctly, MacOSX implements a similar adaptive defragmentation strategy for its HFS+ filesystem, tho the underlying semantics are probably quite different.
And it acts upon opening the file instead of upon writing to it, so it is probably limited to smallish files only (which I don't think makes so much sense on its own; for small files, locality to semantically similar files is much more important, i.e. files needed during boot, or files needed for starting a specific application). If, after those strategies, it is still important to get your file's chunks cleanly aligned one after another, one could still run a manual defrag which does the complete story. BTW: Is it possible to physically relocate files in btrfs? I think that is more important than defragmentation. Is such a thing possible with the defrag IOCTL? My current understanding is that defragmenting a file just rewrites it somewhere random as a contiguous block - which is not always the best option, because it can hurt boot performance a lot and thus reverse the effect of what one tried to achieve. It feels a bit like playing the lottery when I defragment my boot files only to learn that the boot process is now slower instead of faster. :-\ -- Replies to list only preferred.
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On Mon, 16.06.14 09:05, Josef Bacik (jba...@fb.com) wrote: So you are doing all the right things from what I can tell, I'm just a little confused about when you guys run fsync. From what I can tell it's only when you open the journal file and when you switch it to offline. I didn't look too much past this point so I don't know how often these things happen. Are you taking an individual message, writing it, updating the head of the file and then fsync'ing? Or are you getting a good bit of dirty log data and fsyncing occasionally? The latter. Basically when opening a file for writing we mark it in the header as online, then fsync() it. When we close a file we fsync() it, then change the header to offline, and sync() again. Also, 5min after each write we will also put things to offline, until the next write, when we will put things to online again. Finally, if something is logged at priorities EMERG, ALERT or CRIT we will sync immediately (which actually should never happen in real life, unless something is really broken -- a simple way to check if anything like this got written is journalctl -p crit). Also, we rotate and start a new file every now and then, when we hit a size limit, but that is usually very seldom. Putting this together: we should normally fsync() only very infrequently. What would cause btrfs problems is if you fallocate(), write a small chunk, fsync, write a small chunk again, fsync again, etc. Fallocate saves you the first write around, but if the next write is within the same block as the previous write we'll end up triggering cow and enter fragmented territory. If this is what journald is doing then that would be good to know; if not, I'd like to know what is happening, since we shouldn't be fragmenting this badly. Hmm, the only way I see that happening is if a lot of stuff is logged at these super-high log levels mentioned above. But then again, that never really should happen in real life.
Could anyone who's experiencing the slowdowns have a look at the journalctl output mentioned above? Do you have more than a few lines printed like that? Lennart -- Lennart Poettering, Red Hat
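The open/close cadence Lennart describes above can be sketched like this. The header layout and helper names are invented for illustration and are not journald's actual on-disk format:

```c
/* Rough sketch of the sync pattern described above: a state byte in the
 * file header is flipped to "online" (then fsync'ed) when the file is
 * opened for writing, and back to "offline" (fsync the data, then fsync
 * the header change) when it is closed or idle.  Layout and names are
 * invented, not journald's real format. */
#include <fcntl.h>
#include <unistd.h>

enum { STATE_OFFLINE = 0, STATE_ONLINE = 1 };

static int set_state(int fd, unsigned char state)
{
    /* byte 0 of the file plays the role of the header state field */
    if (pwrite(fd, &state, 1, 0) != 1)
        return -1;
    return fsync(fd);
}

int journal_open(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0640);
    if (fd < 0)
        return -1;
    if (set_state(fd, STATE_ONLINE) < 0) {   /* mark online, then fsync */
        close(fd);
        return -1;
    }
    return fd;
}

int journal_close(int fd)
{
    int r = fsync(fd);                        /* flush appended entries */
    if (r == 0)
        r = set_state(fd, STATE_OFFLINE);     /* flip header, sync again */
    close(fd);
    return r;
}
```

The point of the two-step close is crash detection: a file found "online" at startup was not cleanly closed. Between open and close, ordinary appends need no fsync at all, which is why Lennart says journald should normally fsync only very infrequently.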
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Lennart Poettering lenn...@poettering.net schrieb: [...] Finally, if something is logged at priorities EMERG, ALERT or CRIT we will sync immediately (which actually should never happen in real life, unless something is really broken -- a simple way to check if anything like this got written is journalctl -p crit). [...] Hmm, the only way I see that happening is if a lot of stuff is logged at these super-high log levels mentioned above. But then again, that never really should happen in real life. Unless you use those HP utilities:

# journalctl -p crit
-- Logs begin at Mo 2014-05-05 02:26:26 CEST, end at Di 2014-06-17 23:33:48 CEST. --
Mai 20 10:17:21 jupiter hp-systray[1366]: hp-systray[1366]: error: option -s not recognized
-- Reboot --
Mai 22 22:37:15 jupiter hp-systray[1413]: hp-systray[1413]: error: option -s not recognized
-- Reboot --
Jun 03 08:49:17 jupiter hp-systray[1501]: hp-systray[1501]: error: option -s not recognized
-- Reboot --
Jun 09 13:45:31 jupiter hp-systray[1397]: hp-systray[1397]: error: option -s not recognized
-- Reboot --
Jun 11 00:21:16 jupiter hp-systray[1405]: hp-systray[1405]: error: option -s not recognized
-- Reboot --
Jun 14 00:30:49 jupiter hp-systray[1434]: hp-systray[1434]: error: option -s not recognized
-- Reboot --
Jun 14 22:54:43 jupiter hp-systray[1416]: hp-systray[1416]: error: option -s not recognized
-- Reboot --
Jun 14 23:11:11 jupiter hp-systray[1437]: hp-systray[1437]: error: option -s not recognized
-- Reboot --
Jun 15 14:53:44 jupiter hp-systray[1447]: hp-systray[1447]: error: option -s not recognized
-- Reboot --
Jun 15 16:53:39 jupiter hp-systray[1404]: hp-systray[1404]: error: option -s not recognized
-- Reboot --
Jun 15 17:38:40 jupiter hp-systray[1397]: hp-systray[1397]: error: option -s not recognized
-- Reboot --
Jun 15 18:34:37 jupiter hp-systray[1375]: hp-systray[1375]: error: option -s not recognized
-- Reboot --
Jun 17 20:38:32 jupiter hp-systray[1406]: hp-systray[1406]: error: option -s not recognized

Could anyone who's experiencing the slowdowns have a look at the journalctl output mentioned above? Do you have more than a few lines printed like that? Nasty message, but no slowdowns. And it's printed just during desktop session start, so it's no problem. Still, I wonder what this utility tries to do (it seems to put itself into autostart with a -s switch without supporting it; probably a packaging and/or KDE issue). At least HP thinks: Oh boy - that's really critical! Whoohoo. Why not let the system explode? :~) -- Replies to list only preferred.
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Kai Krakow posted on Tue, 17 Jun 2014 23:02:14 +0200 as excerpted: I'm pretty sure it really doesn't matter if your 500G image file is split across many extents - as long as at least chunks of extents are kept together and rebuilt as one extent. It's worth noting that in btrfs terms, chunk has a specific meaning. Btrfs allocates chunks from free space, 1 GiB at a time for data chunks, 256 MiB at a time for metadata.[1] And that btrfs-specific meaning does tend to distort your example case somewhat, tho I expect most folks understand what you mean. The problem is that we're running out of generic terms that mean chunk-like and are thus experiencing namespace collision! Anyway, btrfs chunks, data chunks being of interest here, get allocated on demand. The practical effect is thus that the maximum btrfs extent size is 1 GiB. As such, the least-extents-possible ideal for that 500 GiB image file would be 500 extents. That means, instead of letting autodefrag work on the whole file just let it operate on a chunk of it within some sane boundaries - maybe 8MB chunks, - of course without splitting existing extents if those already cross a chunk boundary. As stated, chunk has a special meaning in btrfs. Data chunks are 1 GiB in size and no extent can cross a btrfs chunk boundary. BTW: Is it possible to physically relocate files in btrfs? Currently, ENOTIMPLEMENTED. AFAIK it's possible and is on the list, but so are a number of other nice-to-haves, so it might be awhile. Actually just checked the wiki. The closest specific feature point listed says hot-data-tracking and moving to faster devices, noting that there's actually some current work on the generic (not-btrfs-specific) feature, to be made available via VFS. Your specific use-case will probably be part of that general implementation. --- [1] Btrfs chunk allocation: 1 GiB data chunk size, 256 MiB metadata chunk size by default, but there are various exceptions. The first of course is when space gets tight.
The last allocation will be whatever is left. Second, there's mixed-data/metadata mode, the default when the filesystem is <= 1 GiB. Mixed chunks are (like metadata) 256 MiB but contain data and metadata mixed, sacrificing speed efficiency for more efficient space management. Third, it's worth noting that the single-device-btrfs default is dup-mode metadata (with the same default for mixed-mode), so while they're 256 MiB each, two get allocated at once, thus allocating metadata half a GiB at a time. Multi-device-btrfs metadata defaults to raid1 mode, still allocated in chunk pairs but one on each of two separate devices. Data mode always defaults to single, tho on multi-device btrfs both data and metadata can be (individually) set to various raid modes, each with its own specific allocation pattern. -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
]] Lennart Poettering (trimming Ccs) I am invoking fallocate() in advance, because we write those files with mmap() and that of course would normally trigger SIGBUS already for the most boring of reasons, such as disk full/quota full or so. Hence, before we do anything like that, we invoke fallocate() to ensure that the space is actually available... As far as I can see, that's pretty much in line with what fallocate() is supposed to be useful for; the man page says this explicitly: OOI, why not use write + read-only mmap for reading? In testing performance for another project, that's significantly faster than rw mmap. -- Tollef Fog Heen UNIX is user friendly, it's just picky about who its friends are
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On Mon, 16.06.14 10:17, Russell Coker (russ...@coker.com.au) wrote: I am not really following why this trips up btrfs, though. I am not sure I understand why this breaks btrfs COW behaviour. I mean, fallocate() isn't necessarily supposed to write anything really, it's mostly about allocating disk space in advance. I would claim that journald's usage of it is very much within the entire reason why it exists... I don't believe that fallocate() makes any difference to fragmentation on BTRFS. Blocks will be allocated when writes occur, so regardless of an fallocate() call the usage pattern in systemd-journald will cause fragmentation. journald's write pattern looks something like this: append something to the end, make sure it is written, then update a few offsets stored at the beginning of the file to point to the newly appended data. This is of course not easy to handle for COW file systems. But then again, it's probably not too different from access patterns of other database or database-like engines... Lennart -- Lennart Poettering, Red Hat
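That append-then-update-the-header pattern is easy to reproduce in miniature. The sketch below uses an invented 8-byte header (it is not journald's real format); on a COW filesystem each pwrite() to offset 0 relocates that block, which is where the fragmentation pressure comes from:

```c
/* Minimal reproduction of the write pattern described above: append an
 * entry at the end of the file, fsync, then update an offset stored at
 * the head of the file to point at it.  Header layout is invented for
 * illustration. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define HEADER_SIZE 8  /* room for one uint64 tail pointer */

int append_entry(int fd, const void *data, size_t len)
{
    off_t end = lseek(fd, 0, SEEK_END);        /* append at the tail... */
    if (end < 0)
        return -1;
    if (end < HEADER_SIZE)                     /* keep space for the header */
        end = lseek(fd, HEADER_SIZE, SEEK_SET);
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0)
        return -1;

    uint64_t tail = (uint64_t)end;             /* ...then point the header at it */
    if (pwrite(fd, &tail, sizeof(tail), 0) != sizeof(tail))
        return -1;
    return fsync(fd);
}
```

On an overwrite-in-place filesystem the header rewrite hits the same disk block every time; on a COW filesystem it allocates a new block on each sync, so the file's extent count grows with the number of synced entries.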
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On Mon, Jun 16, 2014 at 2:14 PM, Lennart Poettering lenn...@poettering.net wrote: On Mon, 16.06.14 10:17, Russell Coker (russ...@coker.com.au) wrote: I am not really following though why this trips up btrfs though. I am not sure I understand why this breaks btrfs COW behaviour. I mean, fallocate() isn't necessarily supposed to write anything really, it's mostly about allocating disk space in advance. I would claim that journald's usage of it is very much within the entire reason why it exists... I don't believe that fallocate() makes any difference to fragmentation on BTRFS. Blocks will be allocated when writes occur so regardless of an fallocate() call the usage pattern in systemd-journald will cause fragmentation. journald's write pattern looks something like this: append something to the end, make sure it is written, then update a few offsets stored at the beginning of the file to point to the newly appended data. This is of course not easy to handle for COW file systems. But then again, it's probably not too different from access patterns of other database or database-like engines... ... which traditionally experienced severe sequential read performance degradation in this case. As I understand this is exactly what happens - readahead attempts to preload files which gives us heavy random read access. The only real remedy was to defragment files. It should work relatively well for journal where files are mostly write once at the expense of additional read/write activity.
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Hi Lennart, On 06/16/2014 12:13 AM, Lennart Poettering wrote: I am not really following why this trips up btrfs, though. I am not sure I understand why this breaks btrfs COW behaviour. I mean, fallocate() isn't necessarily supposed to write anything really, it's mostly about allocating disk space in advance. I would claim that journald's usage of it is very much within the entire reason why it exists... I performed several tests, trying different setups [1]. One of these was replacing the posix_fallocate() with a truncate, to check where the problem is. The conclusion was that *posix_fallocate() is NOT the problem*. In another reply you stated that systemd-journald appends some data at the end of the file, then updates some data in the middle. I think this is the reason why the file becomes quickly fragmented. [1] Let me revise the English, then I will post the results. -- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it) Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Hi all, in this blog [1] I collected all the results of the tests which I performed in order to investigate this performance problem between systemd and btrfs. I had to put these results in a blog because there are several images. Below is a brief summary.

I took an old machine (a P4 2.5GHz with 512MB of RAM) with a fresh installation of Fedora 20, and I measured the boot time over several reboots (up to 70). I tested the following scenarios:

1) standard (without defragmenting any file, without readahead)
2) defragment the journal file at the end of the boot
3) defragment the journal file before the flushing
4) mark the journald log file as NOCOW
5) enable systemd-readahead
6) remove the fsync(2) call from journald
7) remove the posix_fallocate(3) call from journald
8) do a defrag when posix_fallocate(3) is called
9) do a defrag when the journal log file is opened

Test #1 highlights the problem. It shows that the boot time may reach up to 50 seconds. Across the reboots, the number of extents of the file system.journal increases up to 6000. Defragmenting the system.journal file decreases the boot time by ~20 seconds. My conclusion is that on BTRFS the fragmentation of this file increases the boot time.

Tests #6 and #7 suggested that the fsync(2) and posix_fallocate(3) calls aren't the root cause of the problem. Even without these the system.journal file still fragments.

Test #4 suggested that marking the system.journal file NOCOW reduces its fragmentation and so the boot time.

Tests #2, #3 and #9 suggested that performing a periodic defrag reduces the fragmentation of system.journal significantly, and so the boot time.

Test #5 revealed that the readahead capability of systemd was not effective, because it seems that the system.journal file was unaffected (though other *.journal files were). Further investigation is required.
BR G.Baroncelli

[1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html

On 06/16/2014 06:32 PM, Goffredo Baroncelli wrote: [...]

-- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it) Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On 16/06/14 17:05, Josef Bacik wrote: On 06/16/2014 03:14 AM, Lennart Poettering wrote: On Mon, 16.06.14 10:17, Russell Coker (russ...@coker.com.au) wrote: I am not really following though why this trips up btrfs. I am not sure I understand why this breaks btrfs COW behaviour. I mean, I don't believe that fallocate() makes any difference to fragmentation on BTRFS. Blocks will be allocated when writes occur so regardless of an fallocate() call the usage pattern in systemd-journald will cause fragmentation. journald's write pattern looks something like this: append something to the end, make sure it is written, then update a few offsets stored at the beginning of the file to point to the newly appended data. This is of course not easy to handle for COW file systems. But then again, it's probably not too different from access patterns of other database or database-like engines... Even though this appears to be a problem case for btrfs/COW, is there a more favourable write/access sequence possible that is easily implemented and that is favourable for both ext4-like fs /and/ COW fs? Database-like writing is known to be 'difficult' for filesystems: can a data log be a simpler case? Was waiting for you to show up before I said anything since most systemd related emails always devolve into how evil you are rather than what is actually happening. Ouch! Hope you two know each other!! :-P :-) [...] since we shouldn't be fragmenting this badly. Like I said what you guys are doing is fine, if btrfs falls on its face then it's not your fault. I'd just like an exact idea of when you guys are fsync'ing so I can replicate in a smaller way. Thanks, Good if COW can be so resilient. I have about 2GBytes of data logging files and I must defrag those as part of my backups to stop the system from fragmenting to a halt (I use cp -a to defrag the files to a new area and restart the data software logger on that).
Random thoughts: Would using a second small file just for the mmap-ed pointers help avoid repeated rewriting of random offsets in the log file causing excessive fragmentation? Align the data writes to 16kByte or 64kByte boundaries/chunks? Are mmap-ed files a similar problem to using a swap file and so should the same btrfs file swap code be used for both? Not looked over the code so all random guesses... Regards, Martin
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On 16/06/14 19:47, Goffredo Baroncelli wrote: Hi all, in this blog [1] I collected all the results of the tests which I performed in order to investigate this performance problem between systemd and btrfs. I had to put these results in a blog, because there are several images. Below is a brief summary. Test #1 highlights the problem. It shows that the boot may require up to 50 seconds. During the reboots the number of extents of the file system.journal increases up to 6000. Defragmenting the system.journal file decreases the boot time by ~20 seconds. My conclusion is that on BTRFS the fragmentation of this file increases the boot time. Tests #6 and #7 suggested that the fsync(2) and posix_fallocate(3) calls aren't the root cause of the problem. Even without these the system.journal file still fragments. [1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html Very good demonstration and summary, thanks. The charts very clearly show the correspondence between time-to-boot and the level of fragmentation. ... And I thought Linux was getting ever faster to boot! Regards, Martin
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Andrey Borzenkov arvidj...@gmail.com schrieb: journald's write pattern looks something like this: append something to the end, make sure it is written, then update a few offsets stored at the beginning of the file to point to the newly appended data. This is of course not easy to handle for COW file systems. But then again, it's probably not too different from access patterns of other database or database-like engines... ... which traditionally experienced severe sequential read performance degradation in this case. As I understand this is exactly what happens - readahead attempts to preload files which gives us heavy random read access. The only real remedy was to defragment files. It should work relatively well for the journal, where files are mostly write-once, at the expense of additional read/write activity. This is an interesting point because it means: Readahead may hurt systemd boot performance a lot if it is hitting a heavily fragmented file like those affected journal files. So, readahead should probably ignore these files? It may look obvious but I don't think so... On my system, readahead will process system.journal and thus defragment it, so it won't fragment too much and by that won't slow the boot down. This may also explain why in Goffredo's tests the readahead process never got into recording the system.journal - because the boot process was finished (readahead-done) before any process started reading system.journal - but still the high fragmentation of the other journals causes very high IO and the boot process becomes slower and slower. The root fix should be to make journald not cause high fragmentation. But I think Andrey's point should be given some thought, too. -- Replies to list only preferred.
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On 06/14/2014 11:37 PM, Kai Krakow wrote: Goffredo Baroncelli kreij...@libero.it schrieb: On 06/14/2014 10:13 PM, Kai Krakow wrote: Take care to enable all needed services to enable defrag... If your services make use of journal file loading these files should also become part of the process. You can check with /usr/lib/systemd/systemd-readahead analyze. The state is stored in /.readahead. I have enabled all the services (collect, replay, done), but I was unable to see any gain. I don't think you can actually enable the done service because it is static... But if it placed a symlink in /etc/systemd that is probably wrong and you should undo it. It is automatically activated a few seconds after booting is done to store the collected data for replaying on next reboot. I checked the output of systemd-analyze: systemd-readahead-done is called at the end of the chart. The sym-links exist only for the collect and replay services, which were called at the beginning. Check with find /etc/systemd -name '*readahead*' -type l... I don't know why but system.journal is not considered by readahead: # /usr/lib/systemd/systemd-readahead analyze | grep journal 100% ( 1) 770: /etc/systemd/journald.conf 50% ( 1) 4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal 50% ( 1) 4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal I tried that on my system and it looks vice-versa somehow: $ sudo /usr/lib/systemd/systemd-readahead analyze|fgrep journal 100% ( 1) 812: /etc/systemd/journald.conf 100% ( 1) 8388608: /var/log/journal/121b87ca633e8ac001665668001b/system.journal Strange... ;-) Maybe this is a reason for my degraded performance. However I tried on two machines (all fedora) and after several reboots; I tried also removing the /.readahead file... Nothing.
On another fedora machine, I found $ sudo /lib/systemd/systemd-readahead analyze | grep journal 100% ( 1) 220384: /usr/lib/systemd/systemd-journald 100% ( 1) 770: /etc/systemd/journald.conf 100% ( 1)24544: /usr/lib64/rsyslog/imjournal.so 100% ( 1) 122880: /usr/lib64/libsystemd-journal.so.0.11.3 50% ( 1) 4194304: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-1000@0004f1b1019bc549-09d9ae048ff8eeef.journal~ 100% ( 1) 8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-1000@0004f1b7dcfd728c-38950862a134194d.journal~ 50% ( 1) 4194304: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-995@0004f1c3fa738010-b0829bd837e30f61.journal~ 50% ( 1) 4194304: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-995.journal 100% ( 1) 8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/system@0004f6ed6604ebc7-bb2f3b30ac6eb32c.journal~ 100% ( 1) 8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-1000@0004f6ed66a05b08-9d244b02cc294baf.journal~ 77% ( 1) 6508544: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/system@0004f6ed6d02e69f-d9c815aa06264d7a.journal~ 100% ( 1) 8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-1000@0004f6ed6db6dd7d-cc84da57d5b01680.journal~ 100% ( 1) 8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-1000.journal 100% ( 1) 121: /var/lib/rsyslog/imjournal.state So some journal files are present, but not the current one (system.journal) To get back to your performance problem: Did you try systemctl enable bootchart and then (after a reboot) look at the png generated in /run/log (at least in my system it is placed there, look at the configs if in doubt). It can give some interesting clues about why a service starts late or takes long to execute. Unfortunately bootchart doesn't work on fedora, because CONFIG_SCHEDSTATS is not enabled in the kernel. 
-- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it) Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Goffredo Baroncelli kreij...@libero.it schrieb: I have enabled all the services (collect, replay, done), but I was unable to see any gain. I don't think you can actually enable the done service because it is static... But if it placed a symlink in /etc/systemd that is probably wrong and you should undo it. It is automatically activated a few seconds after booting is done to store the collected data for replaying on next reboot. I checked the output of systemd-analyze: systemd-readahead-done is called at the end of the chart. The sym-links exist only for the collect and replay services, which were called at the beginning. Then all is good. I don't know why but system.journal is not considered by readahead: # /usr/lib/systemd/systemd-readahead analyze | grep journal 100% ( 1) 770: /etc/systemd/journald.conf 50% ( 1) 4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal 50% ( 1) 4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal I tried that on my system and it looks vice-versa somehow: $ sudo /usr/lib/systemd/systemd-readahead analyze|fgrep journal 100% ( 1) 812: /etc/systemd/journald.conf 100% ( 1) 8388608: /var/log/journal/121b87ca633e8ac001665668001b/system.journal Strange... ;-) May be this is a reason of my degraded performance. However I tried on two machines (all fedora) and after several reboot; I tried also removing the /.readahead file... Nothing. I'm not sure but generally it is said that readahead utilities may need a few reboots before fully working. Did you try that? Will it take the journal into account then? Maybe there is something special about your setup and your system.journal does not become touched during boot because of this? You said you also use rsyslogd. Maybe logs are redirected there instead of written to both logging mechanisms? Maybe most logging by system daemons is turned off or they write directly to log files? 
I tuned all my daemons and services to not write to any log files at all, instead log everything to stderr/stdout (which is captured by systemd) or into the syslogger which is journald as the only mechanism in my system. I just prefer to have everything in one place, and journalctl is a great way to filter the logs or create custom views. Maybe that's a reason why it is taken into account by readahead on my system? I can only guess but maybe worth a try for you... To get back to your performance problem: Did you try systemctl enable bootchart and then (after a reboot) look at the png generated in /run/log (at least in my system it is placed there, look at the configs if in doubt). It can give some interesting clues about why a service starts late or takes long to execute. Unfortunately bootchart doesn't work on fedora, because CONFIG_SCHEDSTATS is not enabled in the kernel. Well, I'm on Gentoo so my kernel is custom built by design. But I checked my config, I even have no config option CONFIG_SCHEDSTATS but bootchart works for me: $ zfgrep SCHED /proc/config.gz CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y CONFIG_CGROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y CONFIG_RT_GROUP_SCHED=y CONFIG_SCHED_AUTOGROUP=y CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y CONFIG_CFQ_GROUP_IOSCHED=y CONFIG_IOSCHED_BFQ=y CONFIG_DEFAULT_IOSCHED=deadline CONFIG_SCHED_OMIT_FRAME_POINTER=y CONFIG_SCHED_SMT=y CONFIG_SCHED_MC=y CONFIG_SCHED_HRTICK=y # CONFIG_NET_SCHED is not set CONFIG_USB_EHCI_TT_NEWSCHED=y BTW: After migrating to systemd, I had a boot time of 15s to DM on rotational with several heavy services enabled. I did not measure time to desktop back then. But after I did a btrfs balance (I use multidevice btrfs), the time to DM went up to around 25s and never went back no matter what I tried (even when I disabled some services). I switched to autologin DM since then so at least systemd-readahead catches some of the KDE services, too, now. But it still feels slow.
I'm probably redoing my btrfs in the future, switching to 16k leaf size in that process. But I'll wait with that until I migrate to bigger hard disks. Currently I have still plenty of space left. According to bootchart I never had a problem with time spent processing the journal. But since a few kernel versions, auto-suspend works great for me, so I usually do not switch the system off, just leave it running and it suspends to RAM after 2 hours. As a nice side effect, it wakes up instantly (especially since kernel 3.15), and I save a lot of power. Still waiting for the systemd auto-wakeup facility to run timer jobs like a backup. ;-) -- Replies to list only preferred.
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On 06/15/2014 12:50 PM, Kai Krakow wrote: Goffredo Baroncelli kreij...@libero.it schrieb: I have enabled all the services (collect, replay, done), but I was unable to see any gain. I don't think you can actually enable the done service because it is static... But if it placed a symlink in /etc/systemd that is probably wrong and you should undo it. It is automatically activated a few seconds after booting is done to store the collected data for replaying on next reboot. I checked the output of systemd-analyze: systemd-readahead-done is called at the end of the chart. The sym-links exist only for the collect and replay services, which were called at the beginning. Then all is good. I don't know why but system.journal is not considered by readahead: # /usr/lib/systemd/systemd-readahead analyze | grep journal 100% ( 1) 770: /etc/systemd/journald.conf 50% ( 1) 4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal 50% ( 1) 4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal I tried that on my system and it looks vice-versa somehow: $ sudo /usr/lib/systemd/systemd-readahead analyze|fgrep journal 100% ( 1) 812: /etc/systemd/journald.conf 100% ( 1) 8388608: /var/log/journal/121b87ca633e8ac001665668001b/system.journal Strange... ;-) May be this is a reason of my degraded performance. However I tried on two machines (all fedora) and after several reboot; I tried also removing the /.readahead file... Nothing. I'm not sure but generally it is said that readahead utilities may need a few reboots before fully working. Did you try that? Will it take the journal into account then? I hope that today I will publish my results. Anyway I performed about 50-70 reboot on each configuration. Maybe there is something special about your setup and your system.journal does not become touched during boot because of this? You said you also use rsyslogd. Maybe logs are redirected there instead of written to both logging mechanisms? 
Maybe most logging by system daemons is turned off or they write directly to log files? I tuned all my daemons and services to not write to any log files at all, instead log everything to stderr/stdout (which is captured by systemd) or into the syslogger which is journald as the only mechanism in my system. I just prefer to have everything in one place, and journalctl is a great way to filter the logs or create custom views. Maybe that's a reason why it is taken into account by readahead on my system? I can only guess but maybe worth a try for you... I don't have any idea... definitely I will investigate a bit. To get back to your performance problem: Did you try systemctl enable bootchart and then (after a reboot) look at the png generated in /run/log (at least in my system it is placed there, look at the configs if in doubt). It can give some interesting clues about why a service starts late or takes long to execute. Unfortunately bootchart doesn't work on fedora, because CONFIG_SCHEDSTATS is not enabled in the kernel. Well, I'm on Gentoo so my kernel is custom built by design. But I checked my config, I even have no config option CONFIG_SCHEDSTATS but bootchart works for me: I am talking about systemd-bootchart. I have to look if Fedora has the classic bootchart. $ zfgrep SCHED /proc/config.gz CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y CONFIG_CGROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y CONFIG_RT_GROUP_SCHED=y CONFIG_SCHED_AUTOGROUP=y CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y CONFIG_CFQ_GROUP_IOSCHED=y CONFIG_IOSCHED_BFQ=y CONFIG_DEFAULT_IOSCHED=deadline CONFIG_SCHED_OMIT_FRAME_POINTER=y CONFIG_SCHED_SMT=y CONFIG_SCHED_MC=y CONFIG_SCHED_HRTICK=y # CONFIG_NET_SCHED is not set CONFIG_USB_EHCI_TT_NEWSCHED=y BTW: After migrating to systemd, I had a boot time of 15s to DM on rotational with serveral heavy services enabled. I did not measure time to desktop back then. 
But after I did a btrfs balance (I use multidevice btrfs), the time to DM went up to around 25s and never went back no matter what I tried (even when I disabled some services). I switched to autologin DM since then so at least systemd-readahead catches some of the KDE services, too, now. But it still feels slow. I'm probably redoing my btrfs in the future, switching to 16k leaf size in that process. But I'll wait with that until I migrate to bigger hard disks. Currently I have still plenty of space left. According to bootchart I never had a problem with time spent processing the journal. But since a few kernel versions, auto-suspend works great for me, so I usually do not switch the system off, just leave it running and it suspends to RAM after 2 hours. As a nice side effect, it wakes up instantly (especially since kernel 3.15), and I save a lot of power. Still waiting for the systemd auto-wakeup facility to run timer jobs like a backup. ;-)
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Goffredo Baroncelli kreij...@libero.it schrieb: To get back to your performance problem: Did you try systemctl enable bootchart and then (after a reboot) look at the png generated in /run/log (at least in my system it is placed there, look at the configs if in doubt). It can give some interesting clues about why a service starts late or takes long to execute. Unfortunately bootchart doesn't work on fedora, because CONFIG_SCHEDSTATS is not enabled in the kernel. Well, I'm on Gentoo so my kernel is custom built by design. But I checked my config, I even have no config option CONFIG_SCHEDSTATS but bootchart works for me: I am talking about systemd-bootchart. I have to look if Fedora has the classic bootchart. I simply did systemctl enable bootchart.service which is probably part of bootchart2. It worked for me. I will look into systemd-bootchart, never realized it existed as a standalone solution... Thanks for the pointer. -- Replies to list only preferred.
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On Sat, 14.06.14 09:52, Goffredo Baroncelli (kreij...@libero.it) wrote: Which effectively means that by the time the 8 MiB is filled, each 4 KiB block has been rewritten to a new location and is now an extent unto itself. So now that 8 MiB is composed of 2048 new extents, each one a single 4 KiB block in size. Several people pointed at fallocate as the problem. But I don't understand the reason. BTW, the reason we use fallocate() in journald is not about trying to optimize anything. It's only used for one reason: to avoid SIGBUS on disk/quota full, since we actually write everything to the files using mmap(). I mean, writing things with mmap() is always problematic, and handling write errors is awfully difficult, but at least two of the most common reasons for failure we'd like to protect against in advance, under the assumption that disk/quota full will be reported immediately by the fallocate(), and the mmap writes later on will then necessarily succeed. I am not really following though why this trips up btrfs. I am not sure I understand why this breaks btrfs COW behaviour. I mean, fallocate() isn't necessarily supposed to write anything really, it's mostly about allocating disk space in advance. I would claim that journald's usage of it is very much within the entire reason why it exists... Anyway, happy to change these things around if necessary, but first I'd like to have a very good explanation why fallocate() wouldn't be the right thing to invoke here, and a suggestion what we should do instead to cover this usecase... Lennart -- Lennart Poettering, Red Hat
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On Wed, 11.06.14 20:32, Chris Murphy (li...@colorremedies.com) wrote: systemd has a very stupid journal write pattern. It checks if there is space in the file for the write, and if not it fallocates the small amount of space it needs (it does *4 byte* fallocate calls!) Not really the case. http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-file.c#n354 We allocate 8 MB at minimum. and then does the write to it. All this does is fragment the crap out of the log files because the filesystems cannot optimise the allocation patterns. Well, it would be good if you'd tell me what to do instead... I am invoking fallocate() in advance, because we write those files with mmap() and that of course would normally have triggered SIGBUS already on the most boring of reasons, such as disk full/quota full or so. Hence, before we do anything like that, we invoke fallocate() to ensure that the space is actually available... As far as I can see, that's pretty much in line with what fallocate() is supposed to be useful for, the man page says this explicitly: ...After a successful call to posix_fallocate(), subsequent writes to bytes in the specified range are guaranteed not to fail because of lack of disk space. Happy to be informed that the man page is wrong. I am also happy to change our code, if it really is the wrong thing to do. Note however that I generally favour correctness and relying on documented behaviour, instead of nebulous optimizations whose effects might change with different file systems or kernel versions... Yup, it fragments journal files on XFS, too. http://oss.sgi.com/archives/xfs/2014-03/msg00322.html IIRC, the systemd developers consider this a filesystem problem and so refused to change the systemd code to be nice to the filesystem allocators, even though they don't actually need to use fallocate... What? No need to be a dick. Nobody ever pinged me about this. And yeah, I think I have a very good reason to use fallocate().
The only reason in fact the man page explicitly mentions. Lennart -- Lennart Poettering, Red Hat
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On Sun, 15.06.14 05:43, Duncan (1i5t5.dun...@cox.net) wrote: The base problem isn't fallocate per se, rather, tho it's the trigger in this case. The base problem is that for COW-based filesystems, *ANY* rewriting of existing file content results in fragmentation. It just so happens that the only reason there's existing file content to be rewritten (as opposed to simply appending) in this case, is because of the fallocate. The rewrite of existing file content is the problem, but the existing file content is only there in this case because of the fallocate. Taking a step back... On a non-COW filesystem, allocating 8 MiB ahead and writing into it rewrites into the already allocated location, thus guaranteeing extents of 8 MiB each, since once the space is allocated it's simply rewritten in-place. Thus, on a non-COW filesystem, pre-allocating in something larger than single filesystem blocks when an app knows the data is eventually going to be written in to fill that space anyway is a GOOD thing, which is why systemd is doing it. Nope, that's not why we do it. We do it to avoid SIGBUS on disk full... But on a COW-based filesystem fallocate is the exact opposite, a BAD thing, because an fallocate forces the file to be written out at that size, effectively filled with nulls/blanks. Then the actual logging comes along and rewrites those nulls/blanks with actual data, but it's now a rewrite, which on a COW, copy-on-write, based filesystem, the rewritten block is copied elsewhere, it does NOT overwrite the existing null/blank block, and elsewhere by definition means detached from the previous blocks, thus in an extent all by itself. Well, quite frankly I am not entirely sure why fallocate() would be any useful like that on COW file systems, if this is really how it is implemented... I mean, as I understood fallocate() -- and as the man page suggests -- it is something for reserving space on disk, not for writing out anything.
This is why journald is invoking it, to reserve the space, so that later write accesses to it will not require any reservation anymore, and hence are unlikely to fail. Lennart -- Lennart Poettering, Red Hat
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On Sat, 14.06.14 12:59, Kai Krakow (hurikha...@gmail.com) wrote: Duncan 1i5t5.dun...@cox.net schrieb: As they say, Whoosh! At least here, I interpreted that remark as primarily sarcastic commentary on the systemd devs' apparent attitude, which can be (controversially) summarized as: Systemd doesn't have problems because it's perfect. Therefore, any problems you have with systemd must instead be with other components which systemd depends on. Come on, sorry, but this is FUD. Really... ;-) Interestingly, I never commented on anything in this area, and neither did anybody else from the systemd side afaics. The entire btrfs defrag thing I wasn't aware of before this thread started on the systemd ML a few days ago. I am not sure where you take your ideas about our attitude from. God, with behaviour like that you just make us ignore you, Duncan. Lennart -- Lennart Poettering, Red Hat
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On Sun, Jun 15, 2014 at 11:40:42AM +0200, Goffredo Baroncelli wrote: Unfortunately bootchart doesn't work on fedora, because CONFIG_SCHEDSTATS is not enabled in the kernel. It's being investigated, unfortunately with no progress since April: https://bugzilla.redhat.com/show_bug.cgi?id=1013225 Zbyszek
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On Jun 15, 2014, at 4:34 PM, Lennart Poettering lenn...@poettering.net wrote: [...] Yup, it fragments journal files on XFS, too. http://oss.sgi.com/archives/xfs/2014-03/msg00322.html IIRC, the systemd developers consider this a filesystem problem and so refused to change the systemd code to be nice to the filesystem allocators, even though they don't actually need to use fallocate... What? No need to be a dick. Nobody ever pinged me about this.
And yeah, I think I have a very good reason to use fallocate(). The only reason in fact the man page explicitly mentions. Lennart For what it's worth, I did not write what is attributed to me above. I was quoting Dave Chinner, and I've confirmed the original attribution correctly made it onto the systemd-devel@ list. I don't know whether some people on this distribution list are even subscribed to systemd-devel@ so those subsequent responses aren't likely being posted to systemd-devel@ but rather to linux-btrfs@. Chris Murphy
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On 06/14/2014 04:53 AM, Duncan wrote: Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as excerpted: On 06/13/2014 01:24 AM, Dave Chinner wrote: On Thu, Jun 12, 2014 at 12:37:13PM +, Duncan wrote:

FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think actually pretty much equally bad without NOCOW set on the file. So maybe it's been fixed in systemd since the last time I looked. Yup: http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58 The reason it was changed? To save a syscall per append, not to prevent fragmentation of the file, which was the problem everyone was complaining about...

Thanks for pointing that out. However I am performing my tests on a Fedora 20 with systemd-208, which seems to have this change.

Why? Because btrfs data blocks are 4 KiB. With COW, the effect for either 4 byte or 8 MiB file allocations is going to end up being the same, forcing (repeated until full) rewrite of each 4 KiB block into its own extent.

I am reaching the conclusion that fallocate is not the problem. The fallocate increases the file size by about 8 MB, which is enough for some logging. So it is not called very often. But...

If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with nodatacow), then an fallocate of 8 MiB will increase the file size by 8 MiB and write that out. So far so good, as at that point the 8 MiB should be a single extent. But then, data gets written into 4 KiB blocks of that 8 MiB one at a time, and because btrfs is COW, the new data in each block must be written to a new location. Which effectively means that by the time the 8 MiB is filled, each 4 KiB block has been rewritten to a new location and is now an extent unto itself. So now that 8 MiB is composed of 2048 new extents, each one a single 4 KiB block in size.

Several people pointed to fallocate as the problem. But I don't understand the reason.
1) 8 MB is quite a large value, so fallocate is called (at worst) once during the boot. Often never, because the logs are less than 8 MB. 2) It is true that btrfs rewrites almost every 4 KB page twice with fallocate. But the first time is one big write of 8 MB; the second write would happen in any case. What I mean is that without the fallocate journald would make small writes in any case. To be honest, I struggle to see the gain of having a fallocate on a COW filesystem... maybe I don't understand the fallocate() call very well. =:^(

Tho as I already stated, for file sizes up to 128 MiB or so anyway[2], the btrfs autodefrag mount option should at least catch that and rewrite (again), this time sequentially.

I have to investigate more what happens when the logs are copied from /run to /var/log/journal: this is when journald seems to slow everything down.

That's an interesting point. At least in theory, during normal operation journald will write to /var/log/journal, but there's a point during boot at which it flushes the information accumulated during boot from the volatile /run location to the non-volatile /var/log location. /That/ write, at least, should be sequential, since there will be 4 KiB of journal accumulated that needs to be transferred at once. However, if it's being handled by the forced pre-write fallocate described above, then that's not going to be the case, as it'll then be a rewrite of already fallocated file blocks and thus will get COWed exactly as I described above. =:^(

I have prepared a PC which reboots continuously; I am collecting the time required to finish the boot vs the fragmentation of the system.journal file vs the number of boots. The results are dramatic: after 20 reboots, the boot time increases by 20-30 seconds. Doing a defrag of system.journal reduces the boot time to the original one, but after another 20 reboots, the boot time again takes 20-30 seconds more. It is a slow PC, but I saw the same behavior also on a more modern PC (an i5 with 8 GB).
For both PCs the HD is a mechanical one...

The problem's duplicable. That's the first step toward a fix. =:^)

I hope so.

And that's now a btrfs problem :/

Are you sure?

As they say, Whoosh! [...]

Another alternative is that distros will start setting /var/log/journal NOCOW in their setup scripts by default when it's btrfs, thus avoiding the problem. (Altho if they do automated snapshotting they'll also have to set it as its own subvolume, to avoid the first-write-after-snapshot-is-COW problem.) Well, that, and/or set autodefrag in the default mount options.

Note that this also removes the checksums, which are very useful in a RAID configuration.

Meanwhile, there's some focus on making btrfs behave better with such rewrite-pattern files, but while I think the problem can be made /some/ better, hopefully enough that the defaults bother far fewer people in far
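For anyone wanting to repeat the reboot experiment above, a per-boot data collector could be as simple as the following sketch (run e.g. from a late oneshot unit; the output path, the journal glob, and the availability of filefrag from e2fsprogs are all assumptions on my part, not anything Goffredo posted):

```shell
# Hypothetical per-boot collector: append a timestamp, the boot time,
# and the journal's extent count to a growing log, one record per boot.
out=/tmp/boot-frag.log
{
    date
    systemd-analyze time 2>/dev/null
    filefrag /var/log/journal/*/system.journal 2>/dev/null
} >> "$out"
```

Plotting extent count against boot number from that log is then straightforward.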
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Duncan 1i5t5.dun...@cox.net schrieb:

As they say, Whoosh! At least here, I interpreted that remark as primarily sarcastic commentary on the systemd devs' apparent attitude, which can be (controversially) summarized as: Systemd doesn't have problems because it's perfect. Therefore, any problems you have with systemd must instead be with other components which systemd depends on.

Come on, sorry, but this is FUD. Really... ;-)

IOW, it's a btrfs problem now in practice, not because it is so in a technical sense, but because systemd defines it as such and is unlikely to budge, so the only way to achieve progress is for btrfs to deal with it.

I think that systemd was even one of the early supporters of btrfs, because it will defragment readahead files on boot from btrfs. I'd suggest the problem is to be found in the different semantics of COW filesystems. And if someone loudly complains to the systemd developers how bad they are at doing their stuff - hmm, well, I would be disappointed/offended, too, as a programmer, because much very well done work has been put into systemd, and I'd start ignoring such people. In Germany we have a saying for this: Wie man in den Wald hineinruft, so schallt es heraus. [1] (Roughly: the way you shout into the forest is the way it echoes back.) They are doing many things right that had not been adapted to modern systems in the last twenty years (or so) with the legacy init systems.
So let's start with my journals, on btrfs:

$ sudo filefrag *
system@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found
system@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-0008e4cc-0004f82580cdcb45.journal: 5 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-00097959-0004f89c2e8aff87.journal: 5 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-000a166d-0004f98d7e04157c.journal: 5 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-000aad59-0004fa379b9a1fdf.journal: 5 extents found
system@ec16f60db38f43619f8337153a1cc024-0001-0004fae8e5057259.journal: 5 extents found
system@ec16f60db38f43619f8337153a1cc024-92b1-0004fb59b1d034ad.journal: 5 extents found
system.journal: 9 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-00085b7a-0004f77d25ebba04.journal: 2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-0008e7fb-0004f83c7bf18294.journal: 2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-00097fe4-0004f8ae69c198ca.journal: 2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-000a1a7e-0004f9966e9c69d8.journal: 2 extents found
user-500.journal: 2 extents found

I don't think these are too bad values, eh? Well, how did I accomplish that? First, I've set the journal directories nocow. Of course, systemd should do this by default. I'm not sure if this is a packaging or a systemd code issue, tho. But I think the systemd devs agree that for COW filesystems, the journal directories should be set nocow. After all, the journal is a transactional database - it does not need cow protection at all costs. And I think they have their own checksumming protection. So, why let systemd bother with that? A lot of other software has the same semantic problems with btrfs, too (e.g. MySQL), where nobody shouts at the inabilities of the programmers. So why for systemd?
Just because it's intrusive by its nature, for being a radically and newly designed init system, and thus requires some learning by its users/admins/packagers? Really? Come on... As admin and/or packager you have to stay with current technologies and developments anyway. It's only important to hide the details from the users.

Back to the extent counts: What I did next was implementing a defrag job that regularly defrags the journal (actually, the complete log directory, as other log files suffer the same problem):

$ cat /usr/local/sbin/defrag-logs.rb
#!/bin/sh
exec btrfs filesystem defragment -czlib -r /var/log

It can be easily converted into a timer job with systemd. This is left as an exercise to the reader.

BTW: Actually, that job isn't currently executed on my system, which makes the numbers above pretty impressive... However, autodefrag is turned on, which may play into the mix. I'm not sure. I stopped automatically running those defrag jobs a while ago (I have a few more).

An arguably fairer and more impartial assessment of this particular situation suggests that neither btrfs, which as a COW-based filesystem, like all COW-based filesystems, has the existing-file-rewrite as a major technical challenge that it must deal with /somehow/, nor systemd, which in choosing to use fallocate is specifically putting itself in that existing-file-rewrite class,
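For reference, the "exercise to the reader" might look like the following pair of units. The names and the daily schedule are my own choice here, not anything shipped by systemd - a sketch, assuming the script above is installed at its stated path:

```ini
# /etc/systemd/system/defrag-logs.service (hypothetical)
[Unit]
Description=Defragment /var/log

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/defrag-logs.rb

# /etc/systemd/system/defrag-logs.timer (hypothetical)
[Unit]
Description=Daily defragmentation of /var/log

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

After `systemctl daemon-reload`, enabling defrag-logs.timer (not the service) schedules the job.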
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On 06/14/2014 12:59 PM, Kai Krakow wrote: [...] I think that systemd is even one of the early supporters of btrfs because it will defragment readahead files on boot from btrfs.

I know that systemd does readahead, but it is the first time that I have heard that it does defrag too. Could you elaborate?

I'd suggest the problem is to be found in the different semantics of COW filesystems.

[--cut--cut--cut--] The problem is due to the interaction between BTRFS and systemd. Both behave differently from their predecessors. [--cut--cut--cut--]

So let's start with my journals, on btrfs:

$ sudo filefrag *
system@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found
system@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-0008e4cc-0004f82580cdcb45.journal: 5 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-00097959-0004f89c2e8aff87.journal: 5 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-000a166d-0004f98d7e04157c.journal: 5 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-000aad59-0004fa379b9a1fdf.journal: 5 extents found
system@ec16f60db38f43619f8337153a1cc024-0001-0004fae8e5057259.journal: 5 extents found
system@ec16f60db38f43619f8337153a1cc024-92b1-0004fb59b1d034ad.journal: 5 extents found
system.journal: 9 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-00085b7a-0004f77d25ebba04.journal: 2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-0008e7fb-0004f83c7bf18294.journal: 2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-00097fe4-0004f8ae69c198ca.journal: 2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-000a1a7e-0004f9966e9c69d8.journal: 2 extents found
user-500.journal: 2 extents found

I don't think these are too bad values, eh? Well, how did I accomplish that? First, I've set the journal directories nocow.

If you use nocow, you lose the btrfs ability to rebuild a RAID array discarding the wrong sector.
With the systemd journal checksum, you can tell that data is wrong, but BTRFS with its checksums (when used in RAID mode) is able to *correct* the data too. [...]

Back to the extent counts: What I did next was implementing a defrag job that regularly defrags the journal (actually, the complete log directory, as other log files suffer the same problem):

$ cat /usr/local/sbin/defrag-logs.rb
#!/bin/sh
exec btrfs filesystem defragment -czlib -r /var/log

I think that this is a more viable solution; but what seems strange to me is the fact that the fragmentation level of my rsyslog file is a lot lower. [...]

-- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it) Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Goffredo Baroncelli kreij...@libero.it schrieb:

On 06/14/2014 12:59 PM, Kai Krakow wrote: [...] I think that systemd is even one of the early supporters of btrfs because it will defragment readahead files on boot from btrfs.

I know that systemd does readahead, but it is the first time that I have heard that it does defrag too. Could you elaborate?

Look at src/readahead/readahead-collect.c. It works for btrfs and ext4. With ext4 it will also relocate the files. Not sure if it does for btrfs. If it does, the question is: where to relocate in a multi-device file system? That means you have to enable systemd-readahead-collect, tho.

-- Replies to list only preferred.
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Goffredo Baroncelli kreij...@libero.it schrieb:

First, I've set the journal directories nocow.

If you use nocow, you lose the btrfs ability to rebuild a RAID array discarding the wrong sector. With the systemd journal checksum, you can tell that data is wrong, but BTRFS with its checksums (when used in RAID mode) is able to *correct* the data too.

The decision is up to you. For me it was easy: First: I do not care much if some of my logs become broken. I have backups anyway, with a backlog several weeks long. Second: I don't use data RAID (at least not yet), just data striping. So the btrfs checksum won't help me much - it has about the same meaning as the journal checksums. YMMV... For me, nocow'ing the journal was the clear winner. If you persist in using cow for journal files, I'd stick with the defragger:

Back to the extent counts: What I did next was implementing a defrag job that regularly defrags the journal (actually, the complete log directory, as other log files suffer the same problem):

$ cat /usr/local/sbin/defrag-logs.rb
#!/bin/sh
exec btrfs filesystem defragment -czlib -r /var/log

I think that this is a more viable solution; but what seems strange to me is the fact that the fragmentation level of my rsyslog file is a lot lower.

Are you sure? As Duncan pointed out on many occasions, if your journal file is compressed by btrfs (for whatever reason - it depends on your mount options and defragging habits), then filefrag does not show correct values. It shows the count of compression blocks. Only if you are sure about that is it worth considering further steps. If you are sure about the journals not being compressed (and the rsyslog file, too), you can compare values. I think only then could you say that systemd's allocation strategy for journal files does no good on btrfs. Maybe then it's time to fix that by patching systemd to use another strategy, either by configuration or auto-detection.
Most of those strategies will probably result in coalescing writes to the journal into bigger blocks, which means buffering, which in turn is usually not wanted for log files. That in turn means that cowed btrfs is probably the wrong medium for storing journal files. You may want to play with SyncIntervalSec in journald.conf...

BTW: I'm not sure if autodefrag would kick in on journal files, because those appends are not random writes. I think I remember that autodefrag is about detecting random writes and then rewriting the file as a whole to reduce extent fragmentation. Read: I'm not sure if autodefrag would have an effect on this. A btrfs dev may comment on that.

-- Replies to list only preferred.
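The SyncIntervalSec knob Kai mentions lives in journald.conf; a sketch of the trade-off (5m happens to be the upstream default - a larger value coalesces more appends per sync at the cost of losing more recent entries on a crash):

```ini
# /etc/systemd/journald.conf
[Journal]
# Sync the journal to disk less often so appends coalesce into
# larger writes; raise at the cost of more log loss on a crash.
SyncIntervalSec=5m
```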
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On 06/14/2014 04:03 PM, Kai Krakow wrote: Goffredo Baroncelli kreij...@libero.it schrieb: On 06/14/2014 12:59 PM, Kai Krakow wrote: [...] I think that systemd is even one of the early supporters of btrfs because it will defragment readahead files on boot from btrfs.

I know that systemd does readahead, but it is the first time that I have heard that it does defrag too. Could you elaborate?

Look at src/readahead/readahead-collect.c. It works for btrfs and ext4. With ext4 it will also relocate the files. Not sure if it does for btrfs.

I am looking at the source, and yes, it does. To be honest, it seems to defragment only on btrfs.

If it does, the question is: where to relocate in a multi-device file system?

Systemd uses the defrag capability of btrfs.

That means you have to enable systemd-readahead-collect, tho.

I have to admit that I disabled it. I will make some tests also with readahead enabled.

-- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it) Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Goffredo Baroncelli kreij...@libero.it schrieb:

I am looking at the source, and yes, it does. To be honest, it seems to defragment only on btrfs.

Oh, that seems true... Then defragmenting and relocating is probably a todo item. I'm sure I've read about it somewhere.

If it does, the question is: where to relocate in a multi-device file system?

Systemd uses the defrag capability of btrfs.

man:systemd-readahead says it relocates and defrags if supported. Scanning through the source, it only defrags.

That means you have to enable systemd-readahead-collect, tho.

I have to admit that I disabled it. I will make some tests also with readahead enabled.

Take care to enable all the services needed for defrag... If your services make use of journal file loading, these files should also become part of the process. You can check with /usr/lib/systemd/systemd-readahead analyze. The state is stored in /.readahead.

-- Replies to list only preferred.
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On 06/14/2014 10:13 PM, Kai Krakow wrote: Goffredo Baroncelli kreij...@libero.it schrieb:

I am looking at the source, and yes, it does. To be honest, it seems to defragment only on btrfs.

Oh, that seems true... Then defragmenting and relocating is probably a todo item. I'm sure I've read about it somewhere.

If it does, the question is: where to relocate in a multi-device file system?

Systemd uses the defrag capability of btrfs.

man:systemd-readahead says it relocates and defrags if supported. Scanning through the source, it only defrags.

That means you have to enable systemd-readahead-collect, tho.

I have to admit that I disabled it. I will make some tests also with readahead enabled.

Take care to enable all the services needed for defrag... If your services make use of journal file loading, these files should also become part of the process. You can check with /usr/lib/systemd/systemd-readahead analyze. The state is stored in /.readahead.

I have enabled all the services (collect, replay, done), but I was unable to see any gain. I don't know why, but system.journal is not considered by readahead:

# /usr/lib/systemd/systemd-readahead analyze | grep journal
100% ( 1) 770: /etc/systemd/journald.conf
 50% ( 1) 4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal
 50% ( 1) 4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal

-- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it) Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Goffredo Baroncelli kreij...@libero.it schrieb:

On 06/14/2014 10:13 PM, Kai Krakow wrote:

Take care to enable all the services needed for defrag... If your services make use of journal file loading, these files should also become part of the process. You can check with /usr/lib/systemd/systemd-readahead analyze. The state is stored in /.readahead.

I have enabled all the services (collect, replay, done), but I was unable to see any gain.

I don't think you can actually enable the done service because it is static... But if it placed a symlink in /etc/systemd, that is probably wrong and you should undo it. It is automatically activated a few seconds after booting is done, to store the collected data for replaying on the next reboot. Check with find /etc/systemd -name '*readahead*' -type l...

I don't know why, but system.journal is not considered by readahead:

# /usr/lib/systemd/systemd-readahead analyze | grep journal
100% ( 1) 770: /etc/systemd/journald.conf
 50% ( 1) 4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal
 50% ( 1) 4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal

I tried that on my system and it looks vice versa somehow:

$ sudo /usr/lib/systemd/systemd-readahead analyze | fgrep journal
100% ( 1) 812: /etc/systemd/journald.conf
100% ( 1) 8388608: /var/log/journal/121b87ca633e8ac001665668001b/system.journal

Strange... ;-)

To get back to your performance problem: Did you try systemctl enable bootchart and then (after a reboot) look at the png generated in /run/log (at least on my system it is placed there; look at the configs if in doubt)? It can give some interesting clues about why a service starts late or takes long to execute.

-- Replies to list only preferred.
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
Goffredo Baroncelli posted on Sat, 14 Jun 2014 09:52:39 +0200 as excerpted: On 06/14/2014 04:53 AM, Duncan wrote: Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as excerpted:

Thanks for pointing that out. However I am performing my tests on a Fedora 20 with systemd-208, which seems to have this change.

I am reaching the conclusion that fallocate is not the problem. The fallocate increases the file size by about 8 MB, which is enough for some logging. So it is not called very often.

Right. But...

Exactly, _but_...

[A]n fallocate of 8 MiB will increase the file size by 8 MiB and write that out. So far so good, as at that point the 8 MiB should be a single extent. But then, data gets written into 4 KiB blocks of that 8 MiB one at a time, and because btrfs is COW, the new data in each block must be written to a new location. Which effectively means that by the time the 8 MiB is filled, each 4 KiB block has been rewritten to a new location and is now an extent unto itself. So now that 8 MiB is composed of 2048 new extents, each one a single 4 KiB block in size.

Several people pointed to fallocate as the problem. But I don't understand the reason. 1) 8 MB is quite a large value, so fallocate is called (at worst) once during the boot. Often never, because the logs are less than 8 MB. 2) It is true that btrfs rewrites almost every 4 KB page twice with fallocate. But the first time is one big write of 8 MB; the second write would happen in any case. What I mean is that without the fallocate journald would make small writes in any case. To be honest, I struggle to see the gain of having a fallocate on a COW filesystem... maybe I don't understand the fallocate() call very well.

The base problem isn't fallocate per se, tho it's the trigger in this case. The base problem is that for COW-based filesystems, *ANY* rewriting of existing file content results in fragmentation.
It just so happens that the only reason there's existing file content to be rewritten (as opposed to simply appending) in this case, is because of the fallocate. The rewrite of existing file content is the problem, but the existing file content is only there in this case because of the fallocate. Taking a step back... On a non-COW filesystem, allocating 8 MiB ahead and writing into it rewrites into the already allocated location, thus guaranteeing extents of 8 MiB each, since once the space is allocated it's simply rewritten in- place. Thus, on a non-COW filesystem, pre-allocating in something larger than single filesystem blocks when an app knows the data is eventually going to be written in to fill that space anyway is a GOOD thing, which is why systemd is doing it. But on a COW-based filesystem fallocate is the exact opposite, a BAD thing, because an fallocate forces the file to be written out at that size, effectively filled with nulls/blanks. Then the actual logging comes along and rewrites those nulls/blanks with actual data, but it's now a rewrite, which on a COW, copy-on-write, based filesystem, the rewritten block is copied elsewhere, it does NOT overwrite the existing null/blank block, and elsewhere by definition means detached from the previous blocks, thus in an extent all by itself. Once the full 2048 original blocks composing that 8 MiB are filled in with actual data, because they were rewrites from null/blank blocks that fallocate had already forced to be allocated, that's now 2048 separate extents, 2048 separate file fragments, where without the forced fallocate, the writes would have all been appends, and there would have been at least /some/ chance of some of those 2048 separate blocks being written at close enough to the same time that they would have been written together as a single extent. 
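Duncan's worst case is easy to sanity-check with shell arithmetic:

```shell
# 8 MiB of preallocation divided into 4 KiB btrfs data blocks: if
# every block is COW-rewritten exactly once after the fallocate,
# each rewrite lands elsewhere and becomes its own extent.
echo $(( 8 * 1024 * 1024 / 4096 ))   # prints 2048
```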
So while the 8 MiB might not have been a single extent, as opposed to 2048 separate extents, it might have been perhaps 512 or 1024 extents, instead of the 2048 that it ended up being, because fallocate meant that each block was a rewrite into an existing file, not a new append-write at the end of an existing file. [...]

Another alternative is that distros will start setting /var/log/journal NOCOW in their setup scripts by default when it's btrfs, thus avoiding the problem. (Altho if they do automated snapshotting they'll also have to set it as its own subvolume, to avoid the first-write-after-snapshot-is-COW problem.) Well, that, and/or set autodefrag in the default mount options.

Note that this also removes the checksums, which are very useful in a RAID configuration.

Well, it can be. But this is only log data, not executable or the like data, and (as Kai K points out) journald has its own checksumming method in any case. Besides which, you still haven't explained why you can't either set the autodefrag mount option and be done with it, or run a systemd-timer-triggered or cron-triggered defrag script to defrag them
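If a distro did want to ship the NOCOW-by-default setup Duncan describes, a tmpfiles.d drop-in would be one way to do it. A sketch (the file name is hypothetical; the `h` line type applies file attributes, and +C only takes effect on newly created or empty files, on btrfs only):

```ini
# /etc/tmpfiles.d/journal-nocow.conf (hypothetical drop-in)
# 'h' sets file attributes; +C marks the directory NOCOW so that
# newly created journal files inherit it. It has no effect on
# data already written, and disables btrfs data checksumming.
h /var/log/journal - - - - +C
```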
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On Jun 11, 2014, at 3:28 PM, Goffredo Baroncelli kreij...@libero.it wrote:

If someone is able to suggest to me how to FRAGMENT the log file, I can try to collect more scientific data.

So long as you're not using compression, filefrag will show you the fragments of systemd-journald journals. I can vouch for the behavior you experience without xattr +C or autodefrag, but further, it also causes much slowness when reading journal contents. Like if I want to search all boots for a particular error message to see how far back it started, this takes quite a bit longer than on other file systems. So far I'm not experiencing this problem with autodefrag or any other negative side effects, but my understanding is this code is still in flux. Since the journals have their own checksumming, I'm not overly concerned about setting xattr +C.

Chris Murphy
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On Jun 11, 2014, at 7:21 PM, Dave Chinner da...@fromorbit.com wrote: On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:

Hi all, I would like to share my experience about a slowness of systemd when used on BTRFS. My boot time was very high (about ~50 seconds); most of the time it was due to NetworkManager, which took about 30-40 seconds to start (this data came from systemd-analyze plot). I made several attempts to address this issue. Also I noticed that sometimes this problem disappeared; but I was never able to understand why. However this link https://bugzilla.redhat.com/show_bug.cgi?id=1006386 suggested to me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent.

systemd has a very stupid journal write pattern. It checks if there is space in the file for the write, and if not it fallocates the small amount of space it needs (it does *4 byte* fallocate calls!) and then does the write to it. All this does is fragment the crap out of the log files because the filesystems cannot optimise the allocation patterns.

Yup, it fragments journal files on XFS, too. http://oss.sgi.com/archives/xfs/2014-03/msg00322.html IIRC, the systemd developers consider this a filesystem problem and so refused to change the systemd code to be nice to the filesystem allocators, even though they don't actually need to use fallocate...

Cheers, Dave. -- Dave Chinner da...@fromorbit.com

On Jun 11, 2014, at 7:37 PM, Dave Chinner da...@fromorbit.com wrote:

BTW, the systemd list is subscriber-only, so they aren't going to see anything that we comment on from a cross-post to the btrfs list.

Unless a subscriber finds something really interesting, quotes it, and cross-posts it.

Chris Murphy