On 06/14/2014 04:53 AM, Duncan wrote:
> Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
> excerpted:
>
>> On 06/13/2014 01:24 AM, Dave Chinner wrote:
>>> On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote:
>>>>
>>>> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
>>>> actually pretty much equally bad without NOCOW set on the file.
>>>
>>> So maybe it's been fixed in systemd since the last time I looked.
>>> Yup:
>>>
>>> http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
>>>
>>> The reason it was changed? To "save a syscall per append", not to
>>> prevent fragmentation of the file, which was the problem everyone was
>>> complaining about...
>>
>> thanks for pointing that. However I am performing my tests on a fedora
>> 20 with systemd-208, which seems have this change
>>>
>>>> Why? Because btrfs data blocks are 4 KiB. With COW, the effect for
>>>> either 4 byte or 8 MiB file allocations is going to end up being the
>>>> same, forcing (repeated until full) rewrite of each 4 KiB block into
>>>> its own extent.
>>
>> I am reaching the conclusion that fallocate is not the problem. The
>> fallocate increase the filesize of about 8MB, which is enough for some
>> logging. So it is not called very often.
>
> But...
>
> If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with
> nodatacow), then an fallocate of 8 MiB will increase the file size by 8
> MiB and write that out. So far so good as at that point the 8 MiB should
> be a single extent. But then, data gets written into 4 KiB blocks of
> that 8 MiB one at a time, and because btrfs is COW, the new data in the
> block must be written to a new location.
>
> Which effectively means that by the time the 8 MiB is filled, each 4 KiB
> block has been rewritten to a new location and is now an extent unto
> itself. So now that 8 MiB is composed of 2048 new extents, each one a
> single 4 KiB block in size.
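
For reference, the 2048 figure in the quoted paragraph is just the 8 MiB preallocation divided by the 4 KiB block size. Below is a minimal Python sketch of that arithmetic. The function name is made up for illustration, and the model is the worst case only: it assumes every block is COWed to a new location exactly once and that nothing (autodefrag, adjacent writes landing together) merges the resulting extents.

```python
MIB = 1024 * 1024
KIB = 1024


def worst_case_extents(prealloc_bytes: int, block_bytes: int) -> int:
    """Worst-case extent count for a preallocated region (hypothetical helper).

    Assumes each block of the region is rewritten once on a COW filesystem
    and ends up as its own single-block extent, with no merging afterwards.
    """
    if prealloc_bytes < 0 or block_bytes <= 0:
        raise ValueError("sizes must be non-negative, block size positive")
    # Ceiling division: a partially used block still occupies one extent.
    return -(-prealloc_bytes // block_bytes)


if __name__ == "__main__":
    # Figures taken from the quoted text: 8 MiB fallocate, 4 KiB blocks.
    print(worst_case_extents(8 * MIB, 4 * KIB))  # 2048
```

The same helper shows why a 4 byte preallocation would not behave any better under this model: worst_case_extents(4, 4 * KIB) is 1, i.e. still one full block rewritten per append.
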

Several people have pointed to fallocate as the problem, but I don't
understand why.

1) 8 MB is quite a large value, so fallocate is called (at worst) once
during boot, and often never, because the logs are smaller than 8 MB.

2) It is true that with fallocate btrfs "rewrites" each 4 KB page almost
twice. But the first time is one "big" 8 MB write, and the second write
would happen in any case. What I mean is that even without the fallocate,
journald would still make small writes.

To be honest, I struggle to see the gain of having a fallocate on a COW
filesystem... maybe I don't understand the fallocate() call very well.

>
> =:^(
>
> Tho as I already stated, for file sizes upto 128 MiB or so anyway[2], the
> btrfs autodefrag mount option should at least catch that and rewrite
> (again), this time sequentially.
>
>> I have to investigate more what happens when the log are copied from
>> /run to /var/log/journal: this is when journald seems to slow all.
>
> That's an interesting point.
>
> At least in theory, during normal operation journald will write to
> /var/log/journal, but there's a point during boot at which it flushes the
> information accumulated during boot from the volatile /run location to
> the non-volatile /var/log location. /That/ write, at least, should be
> sequential, since there will be > 4 KiB of journal accumulated that needs
> to be transferred at once. However, if it's being handled by the forced
> pre-write fallocate described above, then that's not going to be the
> case, as it'll then be a rewrite of already fallocated file blocks and
> thus will get COWed exactly as I described above.
>
> =:^(
>
>
>> I am prepared a PC which reboot continuously; I am collecting the time
>> required to finish the boot vs the fragmentation of the system.journal
>> file vs the number of boot. The results are dramatic: after 20 reboot,
>> the boot time increase of 20-30 seconds. Doing a defrag of
>> system.journal reduces the boot time to the original one, but after
>> another 20 reboot, the boot time still requires 20-30 seconds more....
>>
>> It is a slow PC, but I saw the same behavior also on a more modern pc
>> (i5 with 8GB).
>>
>> For both PC the HD is a mechanical one...
>
> The problem's duplicable. That's the first step toward a fix. =:^)

I hope so.

>
>>> And that's now a btrfs problem.... :/
>>
>> Are you sure ?
>
> As they say, "Whoosh!"
> [...]
> Another alternative is that distros will start setting /var/log/journal
> NOCOW in their setup scripts by default when it's btrfs, thus avoiding
> the problem. (Altho if they do automated snapshotting they'll also have
> to set it as its own subvolume, to avoid the
> first-write-after-snapshot-is-COW problem.) Well, that, and/or set
> autodefrag in the default mount options.

Note that setting NOCOW also disables the checksums, which are very useful
in a RAID configuration.

>
> Meanwhile, there's some focus on making btrfs behave better with such
> rewrite-pattern files, but while I think the problem can be made /some/
> better, hopefully enough that the defaults bother far fewer people in far
> fewer cases, I expect it'll always be a bit of a sore spot because that's
> just how the technology works, and as such, setting NOCOW for such files
> and/or using autodefrag will continue to be recommended for an optimized
> setup.
>
> ---
> [1] "Properly" set NOCOW: Btrfs doesn't guarantee the effectiveness of
> setting NOCOW (chattr +C) unless the attribute is set while the file is
> still zero size, effectively, at file creation. The easiest way to do
> that is to set NOCOW on the subdir that will contain the file, such that
> when the file is created it inherits the NOCOW attribute automatically.
>
> [2] File sizes upto 128 MiB ... and possibly upto 1 GiB.
> Under 128 MiB should be fine, over 1 GiB is known to cause issues,
> between the two is a gray area that depends on the speed of the
> hardware and the incoming write-stream.
>

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel