Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov wrote:

> strace -p $(pgrep systemd-journal)
>
> You will not see actual writes as file is memory mapped, but it
> definitely does not do any fsync() every so often.

https://paste.fedoraproject.org/paste/oVT-tsU2sBOdTJaZxGua-15M1UNdIGYhyRLivL9gydE=

That's just a partial paste, but the complete output, captured over a couple of minutes, doesn't contain an fsync.

Then I ran this for 8 minutes:

strace -c -f -p $(pgrep systemd-journal)

https://paste.fedoraproject.org/paste/Uzc2KhkkaqLOU8USLd38B15M1UNdIGYhyRLivL9gydE=

So 6 fsyncs in 8 minutes; more than 1 per 5 minutes, but not nearly as many as I thought. So maybe, as you say, it's just memory-mapped activity I'm seeing with stat and filefrag.

--
Chris Murphy
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
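The comparison between the observed fsync count and the configured sync interval can be worked through in a few lines. This is only a sketch of the arithmetic, assuming the figures quoted in the thread (6 fsyncs in 8 minutes, and the default SyncIntervalSec=5m) and treating the interval as a simple periodic timer, which may not match how journald actually schedules its syncs.

```python
# Figures quoted in the thread: 6 fsync() calls counted by `strace -c` in
# 8 minutes, and the journald.conf default SyncIntervalSec=5m.
observed_fsyncs = 6
window_seconds = 8 * 60
sync_interval_seconds = 5 * 60

mean_gap_seconds = window_seconds / observed_fsyncs

# Assumption: the interval behaves like a periodic timer. Such a timer can
# fire at most ceil(window / interval) times inside the window.
max_timer_fsyncs = -(-window_seconds // sync_interval_seconds)

print(f"mean gap between fsyncs: {mean_gap_seconds:.0f} s")
print(f"fsyncs a 5m timer alone could explain: at most {max_timer_fsyncs}")
print(f"fsyncs left over: at least {observed_fsyncs - max_timer_fsyncs}")
```

Under that assumption, most of the observed fsyncs would have to come from something other than the periodic sync, which fits the "more than 1 per 5 minutes" observation above.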
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov wrote:

>> I have no idea if it's fsync or what. How can I tell?
>>
>
> strace -p $(pgrep systemd-journal)
>
> You will not see actual writes as file is memory mapped, but it
> definitely does not do any fsync() every so often.

I also found this:

https://lwn.net/Articles/306046/

I'm not sure how to enable and use it, though.

--
Chris Murphy
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov wrote:
> 18.04.2017 06:50, Chris Murphy wrote:

>>> What exactly "changes" mean? Write() syscall?
>>
>> filefrag reported entries increase, it's using FIEMAP.
>>
>
> So far it sounds like btrfs allocates new extent on every write to
> journal file. Each journal record itself is relatively small indeed.

Hence it would be better if there were no fsync, so that Btrfs could let these writes accumulate and flush them with its own commit (30s default for Btrfs).

It is likely that the ssd allocation option on these SSDs is a factor in the fragmentation, because it tries to allocate into a unique 2MB section based on the expected erase block size. There's a lot of discussion going on right now on the Btrfs list about whether these assumptions are still true, and in which cases we should maybe be using nossd on higher-end SSDs and NVMe. What's certain, though, is that with any of these allocators nocow is not good for lower-end SSDs like SD cards; all it does is ask to write to the same LBA over and over again, for a journal. And that just increases write amplification unnecessarily.

So I'm beginning to think that on SSDs it would be better if journald set +c rather than +C on journals. But there's still some research to do.

I definitely think /var/log/journal/ should be a subvolume, to avoid its contents being snapshotted. Snapshotting does make the fragmentation problem worse. I also think the defragmentation feature should be disabled, at least on SSDs, or should include zlib compression. The write amplification on an SSD is worse than just leaving the file fragmented.

>> Also with stat I see the times (all three) change on the file. If I go
>> to GNOME Terminal and just sudo some command, that itself causes the
>> current system.journal file to get all three times modified. It
>> happens immediately, there's no delay. So if I'm doing something like
>> drm.debug=0x1e, which is spitting a bunch of stuff to dmesg and thus
>> the journal, it's just constantly writing stuff to the journal. This
>> is without anything running journalctl -f or reading the journal.
>>
>>> #Storage=auto
>>> #Compress=yes
>>> #Seal=yes
>>> #SplitMode=uid
>>> #SyncIntervalSec=5m
>>>
>>> This controls how often systemd calls fsync() on currently active
>>> journal file. Do you see fsync() every 3 seconds?
>>
>> I have no idea if it's fsync or what. How can I tell?
>>
>
> strace -p $(pgrep systemd-journal)
>
> You will not see actual writes as file is memory mapped, but it
> definitely does not do any fsync() every so often.
>
> Is it possible that btrfs behavior you observe is specific to memory
> mapped files handling?

Maybe. But even after a reboot I see the same extent entries in the file. Granted, a good number of these 1-block entries have addresses that follow one after the other, so they often make up larger contiguous extents, but they still have separate entries.

>> Also, I don't think these journal files are being compressed.
>>
>> Using the btrfs-progs/btrfs-debugfs script on a few user journal
>> files, I'm seeing massive compression ratios. Maybe I'll try
>> Compress=No and see if there's a change.
>>
>
> Only actual message payload above some threshold (I think 256 or 512
> bytes, not sure) is compressed; everything else is not. For average
> syslog-type messages payload is far too small. This is really only
> interesting when you store core dump or similar.

Interesting, I see. Thanks. I'll try strace and see what's going on.

--
Chris Murphy
Re: [systemd-devel] journal fragmentation on Btrfs
18.04.2017 06:50, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 9:42 PM, Andrei Borzenkov wrote:
>> 17.04.2017 22:49, Chris Murphy wrote:
>>> On Mon, Apr 17, 2017 at 11:27 AM, Andrei Borzenkov wrote:
>>>> 17.04.2017 19:25, Chris Murphy wrote:
>>>>> This explains one system's fragmented journals; but the other system
>>>>> isn't snapshotting journals and I haven't figured out why they're so
>>>>> fragmented. No snapshots, and they are all +C at create time
>>>>> (systemd-journald default on Btrfs). Is it possible to prevent
>>>>> journald from setting +C on /var/log/journal and
>>>>> /var/log/journal/? If I remove them, at next boot they get
>>>>> reset, so any new journals created inherit that.
>>>>>
>>>>
>>>> Yes, should be possible by creating empty
>>>> /etc/tmpfiles.d/journal-nocow.conf.
>>>
>>> OK super.
>>>
>>> How about inhibiting the defragmentation on rotate? I'm suspicious one
>>> of the things I'm seeing is due to ssd optimization mount options, but
>>> I need to see the predefrag state of the files.
>>>
>>> Why do I see so many changes to the journal file, once every 2-5
>>> seconds? This adds 4096 byte blocks to the file each time, and when
>>> cow, that'd explain why there are so many fragments.
>>>
>>
>> What exactly "changes" mean? Write() syscall?
>
> filefrag reported entries increase, it's using FIEMAP.
>

So far it sounds like btrfs allocates a new extent on every write to the journal file. Each journal record itself is indeed relatively small.

> Also with stat I see the times (all three) change on the file. If I go
> to GNOME Terminal and just sudo some command, that itself causes the
> current system.journal file to get all three times modified. It
> happens immediately, there's no delay. So if I'm doing something like
> drm.debug=0x1e, which is spitting a bunch of stuff to dmesg and thus
> the journal, it's just constantly writing stuff to the journal. This
> is without anything running journalctl -f or reading the journal.
>
>>
>>> #Storage=auto
>>> #Compress=yes
>>> #Seal=yes
>>> #SplitMode=uid
>>> #SyncIntervalSec=5m
>>
>> This controls how often systemd calls fsync() on currently active
>> journal file. Do you see fsync() every 3 seconds?
>
> I have no idea if it's fsync or what. How can I tell?
>

strace -p $(pgrep systemd-journal)

You will not see actual writes as the file is memory mapped, but it definitely does not do any fsync() every so often.

Is it possible that the btrfs behavior you observe is specific to the handling of memory-mapped files?

> Also, I don't think these journal files are being compressed.
>
> Using the btrfs-progs/btrfs-debugfs script on a few user journal
> files, I'm seeing massive compression ratios. Maybe I'll try
> Compress=No and see if there's a change.
>

Only actual message payload above some threshold (I think 256 or 512 bytes, not sure) is compressed; everything else is not. For average syslog-type messages the payload is far too small. This is really only interesting when you store core dumps or similar.

> file:
> user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-00059b73-00054d51b3f442ff.journal
> extents 64 disk size 294912 logical size 8388608 ratio 28.44
> file:
> user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-0002ec5b-00054d4ebb7114e7.journal
> extents 64 disk size 278528 logical size 8388608 ratio 30.12
> file:
> user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-06e5-00054c3c32607483.journal
> extents 320 disk size 5206016 logical size 41943040 ratio 8.06
>
[systemd-devel] feature request: implement macsec interface configuration in systemd-networkd
Are there any plans to implement macsec interface configuration in systemd-networkd?

macsec is already in the kernel as a loadable module, but Fedora is missing a patched iproute2 with macsec support, and it also lacks automatic interface configuration (I don't know whether NetworkManager supports it), preferably from systemd-networkd.
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, Apr 17, 2017 at 9:42 PM, Andrei Borzenkov wrote:
> 17.04.2017 22:49, Chris Murphy wrote:
>> On Mon, Apr 17, 2017 at 11:27 AM, Andrei Borzenkov wrote:
>>> 17.04.2017 19:25, Chris Murphy wrote:
>>>> This explains one system's fragmented journals; but the other system
>>>> isn't snapshotting journals and I haven't figured out why they're so
>>>> fragmented. No snapshots, and they are all +C at create time
>>>> (systemd-journald default on Btrfs). Is it possible to prevent
>>>> journald from setting +C on /var/log/journal and
>>>> /var/log/journal/? If I remove them, at next boot they get
>>>> reset, so any new journals created inherit that.
>>>
>>> Yes, should be possible by creating empty
>>> /etc/tmpfiles.d/journal-nocow.conf.
>>
>> OK super.
>>
>> How about inhibiting the defragmentation on rotate? I'm suspicious one
>> of the things I'm seeing is due to ssd optimization mount options, but
>> I need to see the predefrag state of the files.
>>
>> Why do I see so many changes to the journal file, once every 2-5
>> seconds? This adds 4096 byte blocks to the file each time, and when
>> cow, that'd explain why there are so many fragments.
>>
>
> What exactly "changes" mean? Write() syscall?

The number of entries reported by filefrag increases; it's using FIEMAP.

Also, with stat I see the times (all three) change on the file. If I go to GNOME Terminal and just sudo some command, that by itself causes the current system.journal file to get all three times modified. It happens immediately; there's no delay. So if I'm doing something like drm.debug=0x1e, which is spitting a bunch of stuff to dmesg and thus the journal, it's just constantly writing stuff to the journal. This is without anything running journalctl -f or reading the journal.

>> #Storage=auto
>> #Compress=yes
>> #Seal=yes
>> #SplitMode=uid
>> #SyncIntervalSec=5m
>
> This controls how often systemd calls fsync() on currently active
> journal file. Do you see fsync() every 3 seconds?

I have no idea if it's fsync or what. How can I tell?

Also, I don't think these journal files are being compressed.

Using the btrfs-progs/btrfs-debugfs script on a few user journal files, I'm seeing massive compression ratios. Maybe I'll try Compress=No and see if there's a change.

file: user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-00059b73-00054d51b3f442ff.journal
extents 64 disk size 294912 logical size 8388608 ratio 28.44
file: user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-0002ec5b-00054d4ebb7114e7.journal
extents 64 disk size 278528 logical size 8388608 ratio 30.12
file: user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-06e5-00054c3c32607483.journal
extents 320 disk size 5206016 logical size 41943040 ratio 8.06

--
Chris Murphy
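The summary lines pasted above can be checked mechanically. The sketch below parses them and recomputes the ratio as logical size divided by disk size; that this is how the btrfs-debugfs script derives its "ratio" column is an assumption inferred from the numbers, not something stated in the thread. The regular expression and the helper name `parse_summary` are made up for illustration.

```python
import re

# Summary lines as pasted in the message above (file names omitted).
SUMMARY = """\
extents 64 disk size 294912 logical size 8388608 ratio 28.44
extents 64 disk size 278528 logical size 8388608 ratio 30.12
extents 320 disk size 5206016 logical size 41943040 ratio 8.06
"""

PATTERN = re.compile(
    r"extents (\d+) disk size (\d+) logical size (\d+) ratio ([\d.]+)"
)

def parse_summary(text):
    """Return (extents, disk_bytes, logical_bytes, reported_ratio) tuples."""
    rows = []
    for match in PATTERN.finditer(text):
        extents, disk, logical = (int(g) for g in match.groups()[:3])
        rows.append((extents, disk, logical, float(match.group(4))))
    return rows

rows = parse_summary(SUMMARY)
for extents, disk, logical, reported in rows:
    recomputed = logical / disk  # assumed definition of the ratio column
    print(f"{extents:4d} extents  {disk:>8d} B on disk  "
          f"ratio {recomputed:.2f} (reported {reported})")
```

For all three files the recomputed value agrees with the reported one to two decimal places, so the files occupy far less space on disk than their logical size.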
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, Apr 17, 2017 at 11:27 AM, Andrei Borzenkov wrote:
> 17.04.2017 19:25, Chris Murphy wrote:
>> This explains one system's fragmented journals; but the other system
>> isn't snapshotting journals and I haven't figured out why they're so
>> fragmented. No snapshots, and they are all +C at create time
>> (systemd-journald default on Btrfs). Is it possible to prevent
>> journald from setting +C on /var/log/journal and
>> /var/log/journal/? If I remove them, at next boot they get
>> reset, so any new journals created inherit that.
>>
>
> Yes, should be possible by creating empty
> /etc/tmpfiles.d/journal-nocow.conf.

OK super.

How about inhibiting the defragmentation on rotate? I suspect one of the things I'm seeing is due to the ssd optimization mount options, but I need to see the pre-defrag state of the files.

Why do I see so many changes to the journal file, once every 2-5 seconds? Each one adds 4096-byte blocks to the file, and with cow that would explain why there are so many fragments.

#Storage=auto
#Compress=yes
#Seal=yes
#SplitMode=uid
#SyncIntervalSec=5m
#RateLimitIntervalSec=30s
#RateLimitBurst=1000

A change every 5m is not what I'm seeing with stat. I have no crit, emerg, or alert messages happening, just a bunch of drm debug messages, which are constant. But if the flush should only happen every 5 minutes, I'm confused.

--
Chris Murphy
Re: [systemd-devel] journal fragmentation on Btrfs
Here's an example rotated log (Btrfs, NVMe, no compression, default ssd mount option). As you can see, it takes up more space on disk than it contains data, so there's a lot of slack space for some reason, despite /etc/systemd/journald.conf being unmodified and thus Compress=Yes.

file: system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 41 disk size 143511552 logical size 100663296 ratio 0.70

$ sudo btrfs fi defrag -c system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal

And now:

file: system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 768 disk size 21504000 logical size 100663296 ratio 4.68

That's nearly 1/7th of the previous size on disk (21504000 vs. 143511552 bytes). The existing defrag without compression is probably just increasing write amplification on SSDs. If it's badly fragmented, just leave it alone.

This also works on nocow journals with +C set, although I'm not sure whether this is intended behavior (I thought nocow implies no compression), so I've asked about that on the Btrfs list.

Chris Murphy
Re: [systemd-devel] journal fragmentation on Btrfs
17.04.2017 19:25, Chris Murphy wrote:
> This explains one system's fragmented journals; but the other system
> isn't snapshotting journals and I haven't figured out why they're so
> fragmented. No snapshots, and they are all +C at create time
> (systemd-journald default on Btrfs). Is it possible to prevent
> journald from setting +C on /var/log/journal and
> /var/log/journal/? If I remove them, at next boot they get
> reset, so any new journals created inherit that.
>

Yes, that should be possible by creating an empty /etc/tmpfiles.d/journal-nocow.conf.
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, Apr 17, 2017 at 3:57 AM, Lennart Poettering wrote:

>> I do manual snapshots before software updates, which means new writes
>> to these files are subject to COW, but additional writes to the same
>> extents are overwrites and are not COW because of chattr +C. I've used
>> this same strategy for a long time, since systemd-journald defaults to
>> +C for journal files; but I've not seen them get this fragmented this
>> quickly.
>>
>
> IIRC NOCOW only has an effect if set right after the file is created
> before the first write to it is done. Or in other words, you cannot
> retroactively make a file NOCOW. This means that if you in one way or
> another make a COW copy of a file (through reflinking, implicit or
> not (note that "cp" reflinks by default), or through snapshotting or
> something else) the file is COW and you'll get fragmentation.

Correct. There are three states for files on Btrfs: cow (normal), nocow (+C), and a snapshot of a nocow (+C) file, which is "cowandthennocow" or whatever you want to call it. So yes, a snapshot of a nocow file does fragment a ton, but it then becomes nocow again and won't fragment further.

This explains one system's fragmented journals; but the other system isn't snapshotting journals and I haven't figured out why they're so fragmented. No snapshots, and they are all +C at create time (systemd-journald default on Btrfs). Is it possible to prevent journald from setting +C on /var/log/journal and /var/log/journal/? If I remove them, at next boot they get reset, so any new journals created inherit that.

Anyway, snapshots of journals on Btrfs should be avoided for other reasons. The autocleaning features (SystemMaxUse=, SystemKeepFree=, as well as --vacuum-size=) don't work correctly when there are snapshots of journals. Even when journald deletes journals, their extents are pinned by snapshots, so they still take up the same space. Basically, journald could get into a situation where it deletes all the journals it sees, but no space is freed up because those journals are stuck in a snapshot.

> I am not entirely sure what to recommend you. Ultimately whether btrfs
> fragments or not, is probably something you have to discuss with the
> btrfs folks. We do try to make the best of btrfs, by managing the COW
> flag, but this only helps you to a limited degree as
> snapshots/reflinks will fuck things up anyway...

Definitely. An easy solution would be for journald to create /var/log/journal/ as a subvolume instead of a directory. This would make journals immune to snapshots of the containing subvolume (typically the root fs). Of course, systemd already makes subvolumes behind the scenes for other sane reasons, like /var/lib/machines.

Snapshotting logs strikes me as an invalid use case anyway. Anyone would want logs to be immune to rollback; rolling them back would defeat troubleshooting and auditing. Logs should be linear and continuous, not rolled back. The snapshotting is arguably a mistake, due to a lack of user understanding of the consequences. It is admittedly esoteric.

> We also ask btrfs to defrag the file as soon as we mark it as
> archived... I'd even be willing to extend on that, and defrag the file
> on other events too, for example if it ends up being too heavily
> fragmented. But last time I looked btrfs didn't have any nice API for
> that, that would have a clear focus on a single file only...

The biggest issue with them is that they take up a lot of space and defragment very inconsistently. Depending on the kernel version, they can become magnificently larger.

Speaking of which, even with Compress=Yes (the default), the journal files are highly compressible. By copying some to a Btrfs volume mounted with the compress option (this does not force compression; it gives up easily on already compressed data), I'm finding 4-6x smaller files.

This is the last line for a couple of journals, from btrfs-progs/btrfs-debugfs:

file: system@01b44589014542e3b48df31f152c0916-ca2b-00054546539416e8.journal
extents 384 disk size 9691136 logical size 50331648 ratio 5.19
file: system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 768 disk size 21504000 logical size 100663296 ratio 4.68

If there is a way to apply this kind of compression when rotating logs (read, compress, write), then defragmentation isn't needed on Btrfs, and all file systems gain the benefit of much smaller logs.

--
Chris Murphy
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, 17 Apr 2017 16:01:48 +0200, Kai Krakow wrote:

> > We also ask btrfs to defrag the file as soon as we mark it as
> > archived...
>
> This makes sense. And I've learned that journal on btrfs works much
> better if you use many small files vs. a few big files. I've currently
> set the journal size limit to 8 MB for that reason which gives me very
> good performance.

Hmm, well, I just looked: I eventually stopped doing that, probably when you introduced defragging of the archived journals. But I see no journal file bigger than 128M, which seems to work well.

--
Regards,
Kai

Replies to list-only preferred.
Re: [systemd-devel] journal fragmentation on Btrfs
On Mon, 17 Apr 2017 11:57:21 +0200, Lennart Poettering wrote:

> On Sun, 16.04.17 14:30, Chris Murphy (li...@colorremedies.com) wrote:
>
> > Hi,
> >
> > This is on a Fedora 26 workstation (systemd-233-3.fc26.x86_64)
> > that's maybe a couple weeks old and was clean installed. Drive is
> > NVMe.
> >
> >
> > # filefrag *
> > system.journal: 9283 extents found
> > user-1000.journal: 3437 extents found
> > # lsattr
> > C-- ./system.journal
> > C-- ./user-1000.journal
> >
> > I do manual snapshots before software updates, which means new
> > writes to these files are subject to COW, but additional writes to
> > the same extents are overwrites and are not COW because of chattr
> > +C. I've used this same strategy for a long time, since
> > systemd-journald defaults to +C for journal files; but I've not
> > seen them get this fragmented this quickly.
> >
>
> IIRC NOCOW only has an effect if set right after the file is created
> before the first write to it is done. Or in other words, you cannot
> retroactively make a file NOCOW. This means that if you in one way or
> another make a COW copy of a file (through reflinking, implicit or
> not (note that "cp" reflinks by default), or through snapshotting or
> something else) the file is COW and you'll get fragmentation.

To mark a file nocow, it has to exist with zero bytes and never have been written to. The nocow attribute (chattr +C) is inherited from the directory when a file is created. So the best way to go is to set +C on the directory, and all future journal files will be nocow.

You can still do snapshots; nocow doesn't prohibit that and doesn't make journals cow again. What happens is that btrfs simply unshares extents as soon as you write to the snapshot. The newly created extent itself will behave like nocow again. If the extents are big enough, this shouldn't introduce any serious fragmentation, just wasted space. Btrfs won't split extents upon unsharing them during a write. It may, however, "replace" only part of the unshared extent, thus making three new ones: two sharing the old copy, one holding the new data. But since journals are append-only, that should be no problem. It's just that the data is written so slowly that writes almost never get combined into one single write, resulting in many extents.

> I am not entirely sure what to recommend you. Ultimately whether btrfs
> fragments or not, is probably something you have to discuss with the
> btrfs folks. We do try to make the best of btrfs, by managing the COW
> flag, but this only helps you to a limited degree as
> snapshots/reflinks will fuck things up anyway...

Well, usually you shouldn't have to manage the cow flag at all: just set it once for the newly created journal directory and everything is fine. And even then, people who don't want this could easily unset the flag on the directory and rotate the journal.

> We also ask btrfs to defrag the file as soon as we mark it as
> archived...

This makes sense. And I've learned that the journal on btrfs works much better if you use many small files rather than a few big files. I've currently set the journal size limit to 8 MB for that reason, which gives me very good performance.

> I'd even be willing to extend on that, and defrag the file
> on other events too, for example if it ends up being too heavily
> fragmented.

Since the append behavior of btrfs is so bad with respect to journal files, it should be enough to simply let btrfs defrag the previously written journal block when appending to the file. Lennart, I think you are hinting to the OS that the file is going to grow, and thus truncating it to 8 MB beyond the current end of file to continue writing. That would be a good event at which to let btrfs defrag the old 8 MB block (and just that, not the complete file). If this works well, you could maybe skip defragging the complete file upon rotation, which should improve disk I/O performance during rotation. I think the default extent size hint for defragging with btrfs defrag has been set to 32 MB lately, so it might be enough to do the above step every 32 MB.

> But last time I looked btrfs didn't have any nice API for
> that, that would have a clear focus on a single file only...

A high number of extents may not be an indicator of fragmentation when btrfs compression is used. Compressed data is organized in logical 128k units, which are reported as fragments by filefrag; in reality they are laid out contiguously on disk, so there is no fragmentation. It would be interesting to see the blockmap of this.

--
Regards,
Kai

Replies to list-only preferred.
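The point about 128k units can be checked against the numbers posted elsewhere in this thread. The sketch below divides each reported logical size by 128 KiB and compares the result with the reported extent count. The size/extent pairs are taken from the btrfs-progs/btrfs-debugfs output quoted in other messages; the constant name is made up for illustration, and an exact match is only consistent with the explanation above, not proof that those files were stored compressed.

```python
COMPRESSED_EXTENT_BYTES = 128 * 1024  # the "logical 128k units" mentioned above

# (logical size in bytes, extent count) pairs reported elsewhere in this
# thread by btrfs-progs/btrfs-debugfs.
reported = [
    (8388608, 64),
    (41943040, 320),
    (50331648, 384),
    (100663296, 768),
]

for logical_bytes, extents in reported:
    expected = logical_bytes // COMPRESSED_EXTENT_BYTES
    verdict = "matches" if expected == extents else "differs"
    print(f"{logical_bytes:>10d} B -> {expected:4d} units of 128 KiB, "
          f"reported {extents:4d} extents ({verdict})")
```

Every pair matches exactly, so for these files the extent count alone says little about how scattered the data is on disk.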
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10.04.17 20:20, Chris Murphy (li...@colorremedies.com) wrote:

> 4. Systemd for not enforcing limited kill exemption to those running
> from initramfs, i.e. ignore kill exemption if the program is running
> other than initramfs.

Well, we are not the police, and we do kill everything by default, even though we have this explicit, privileged opt-out. If people misuse it, then I am pretty sure it's on them, not us...

That said, I will subscribe to the request that systemd's shutdown logic should go the safest way possible, and hence I am fine with calling the generic FIFREEZE+FITHAW ioctls one after the other, if that helps, even though I think this is a really broken API.

Lennart

--
Lennart Poettering, Red Hat
Re: [systemd-devel] Unable to mask /proc using currently available options (InaccessiblePaths...)
On Wed, 12.04.17 18:27, Timothée Ravier (sios...@gmail.com) wrote:

> Hi,
>
> I would like to make the /proc directory inaccessible for some services.
> Unfortunately, adding the InaccessiblePaths=/proc option to a service unit
> will not work.

Hmm, what precisely do you intend to make unavailable here? Note that /proc/self/ is kinda normal process API on Linux, as are some other files, and a variety of calls (including ones defined in glibc) assume that /proc is available, at least for read access.

It definitely makes sense to restrict /proc somewhat. ProtectKernelTunables= will make /proc/sys read-only, for example, and there's work in progress to permit the kernel's hidepid procfs mount option to be settable per mount point so that we can expose it per-service in systemd, but I am not sure it is really desirable to completely disable it, at least at a service level. It might make sense to restrict it in even more restricted sandboxes (for example, a web browser might restrict this if it uses per-page renderer process sandboxes).

That all said, even if I don't see the great benefit of blocking the entirety of /proc for a service, I'm still willing to merge changes to make this work, if this helps you.

> With systemd v233, during the filesystem layout setup for the new service, an
> empty directory will be mounted on top of /proc first (in core:namespace.c:
> setup_namespace(): apply_mount()) and then mount points will be turned
> readonly (in core:namespace.c: setup_namespace(): make_read_only()), using
> /proc/mountinfo which is now unavailable. Thus this step will fail.

Maybe we can find a somewhat clean fallback for this, for when /proc is not around? Or maybe we slightly alter the logic here, and open /proc/self/mountinfo before we rearrange the directories, and then always read only from the already opened fd, without referring to the actual file system anymore? I figure that would mean adding a version of bind_remount_recursive() that takes a FILE* or so of /proc/self/mountinfo as an additional parameter, and then seeks to the beginning before reading from it, if you follow what I mean? I think this approach would be the nicest one.

> With systemd v233, it is possible to work around this issue leaving only a
> single /proc/self/mountinfo file available using this hack:
>
> $ umask 0277
> $ mkdir -p /.proc/self
> $ touch /.proc/self/mountinfo
>
> And in the unit:
>
> BindReadOnlyPaths=/.proc:/proc /proc/self/mountinfo:/.proc/self/mountinfo
>
> But this is not really pretty.
>
> I would like your opinion on the following suggestions before writing code:
> * Should I extend the MountVFSAPI option to support the case where the
> RootImage and RootDirectory options are not set?

How precisely would you alter the effect of MountVFSAPI= here?

> * Should I add a special HideProc option to support hiding /proc for
> conventional services?

As above, I'd prefer not to add this. I am not against making what you want to do work, but I am not convinced that adding first-class config options for it would be a good idea. systemd, after all, is a service manager, and hence we should focus on making things easy that match the service use case, but not more. Or in other words: making InaccessiblePaths=/proc work sounds preferable to me.

> As a side note, debug logs in core/namespace.c are non functional. A call to
> log_open() appears to be missing.

Yupp, this is known. But opening fds comes with other issues (in particular because seccomp and other security systems would need preparation to permit that), hence currently we just keep the code in there, and it is normally a NOP, unless you hack around and turn it on manually by adding a log_open() to your local build.

Lennart

--
Lennart Poettering, Red Hat
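The "open once, then rewind and re-read from the same descriptor" idea can be sketched as follows. This is not systemd code and not a proposed patch: `mount_points` is a hypothetical helper, the two sample lines are made up, and an in-memory file stands in for an already opened /proc/self/mountinfo. The proposal also relies on the real procfs file yielding current contents after a rewind, which the sketch does not demonstrate.

```python
import io

def mount_points(mountinfo):
    """Rewind an already opened mountinfo-style file and list its mount points.

    Hypothetical helper: it only touches the file object it is given and
    never reopens a path, so it keeps working once the path itself is masked.
    """
    mountinfo.seek(0)
    points = []
    for line in mountinfo:
        fields = line.split()
        if len(fields) >= 5:
            points.append(fields[4])  # fifth mountinfo field is the mount point
    return points

# Stand-in for a file opened before the mount namespace is rearranged.
# The two lines are made-up sample data in mountinfo layout.
sample = io.StringIO(
    "22 1 8:1 / / rw,relatime - ext4 /dev/sda1 rw\n"
    "23 22 0:5 / /proc rw,nosuid - proc proc rw\n"
)

first = mount_points(sample)
second = mount_points(sample)  # works a second time because of the seek(0)
print(first, second)
```

The design point is that the caller passes the open file in, so the helper never needs a usable /proc path of its own.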
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10.04.17 19:30, Chris Murphy (li...@colorremedies.com) wrote:

> >> Remember, all of this is because there *is* software that does the wrong
> >> thing, and it *is* possible for software to hang and be unkillable. It
> >> would be good for systemd to do the right thing even in the presence of
> >> that kind of software.
> >
> > Yeah, we do what we can.
> >
> > But I seriously doubt FIFREEZE will make things better. It's just
> > going to make shutdowns hang every now and then.
>
> My understanding is freeze isn't ignorable, it's expressly for the use
> case when the disk has active processing writing and the fs must be
> made completely consistent, e.g. prior to taking a snapshot. The thaw
> immediately following freeze would prevent any shutdown hang.
>
> The point of freeze/thaw is it will cause the file system metadata
> that grub depends on to know where the new grub.cfg is located, to get
> committed to disk prior to reboot. If some process is still hanging
> around with an open write, it doesn't really matter.

As mentioned: if you prep a patch that adds FIFREEZE+FITHAW when we remount stuff read-only, then I'd merge it, even though I think the kernel APIs for this are really broken. It would be much preferable to have a proper API for this, either exposed via the well-understood sync() syscall, or through a new ioctl, if they must.

Lennart

--
Lennart Poettering, Red Hat
Re: [systemd-devel] Why journald has NotifyAccess=all set in the unit file?
On Tue, 11.04.17 10:18, Michal Sekletar (msekl...@redhat.com) wrote: > Hi everyone, > > I was asked today about $subject. I quickly skimmed through the > relevant parts of the code and the current default looks like an > oversight. I think there are no processes other than journald involved > in notification handling. I think it would be nice if we dropped the setting > and relied on the default NotifyAccess=main. Good question. It has been that way since time began, and I couldn't extract any useful explanation for that from the git history. Hence, please file a PR that turns this into NotifyAccess=main. Lennart -- Lennart Poettering, Red Hat
Re: [systemd-devel] what is sd_notify() really for ?
On 17.04.2017 at 00:47, Enrico Weigelt, metux IT consult wrote:
> On 17.04.2017 00:04, Lennart Poettering wrote:
>> Please always check the man pages if you have questions regarding a specific systemd interface:
>> https://www.freedesktop.org/software/systemd/man/sd_notify.html
>
> Done so, of course. Unfortunately, it doesn't answer my questions, eg. what the service manager actually does w/ that information.

Really? What exactly do you not understand in the descriptions below?

If there are several services depending on each other, you don't want to start the dependent services while your big database is still initializing and is not ready for connections. For "Restart=always" it may not be enough that your process is just running, hence the watchdog, where the service needs to say "I am still alive".

READY=1 Tells the service manager that service startup is finished. This is only used by systemd if the service definition file has Type=notify set. Since there is little value in signaling non-readiness, the only value services should send is "READY=1" (i.e. "READY=0" is not defined).

Example 2. Extended Start-up Notification. A service could send the following after completing initialization:

    sd_notifyf(0, "READY=1\n" "STATUS=Processing requests…\n" "MAINPID=%lu", (unsigned long) getpid());

RELOADING=1 Tells the service manager that the service is reloading its configuration. This is useful to allow the service manager to track the service's internal state, and present it to the user. Note that a service that sends this notification must also send a "READY=1" notification when it completed reloading its configuration.

STOPPING=1 Tells the service manager that the service is beginning its shutdown. This is useful to allow the service manager to track the service's internal state, and present it to the user.

WATCHDOG=1 Tells the service manager to update the watchdog timestamp. This is the keep-alive ping that services need to issue in regular intervals if WatchdogSec= is enabled for it.
See systemd.service(5) for information how to enable this functionality and sd_watchdog_enabled(3) for the details of how the service can check whether the watchdog is enabled. https://www.freedesktop.org/software/systemd/man/sd_watchdog_enabled.html
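On the wire, the notifications quoted above are just newline-separated `KEY=VALUE` text sent as a single datagram to the AF_UNIX socket named in `$NOTIFY_SOCKET`. A minimal Python sketch of that protocol follows; it is not libsystemd's implementation, and the function name `notify` and its `env` parameter are only for illustration.

```python
import os
import socket


def notify(state, env=os.environ):
    """Send `state` (e.g. "READY=1") to the service manager.

    Returns False if no notification socket is configured, i.e. the
    process was not started by a service manager that wants updates.
    """
    path = env.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading '@' denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True
```

A Type=notify service would call `notify("READY=1")` once initialization is complete, and `notify("WATCHDOG=1")` periodically if WatchdogSec= is set.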
Re: [systemd-devel] systemd-nspawn network-interface
On Thu, 13.04.17 16:08, poma (pomidorabelis...@gmail.com) wrote: > Hello > > Regaining of the network-interface, as is stated in the manual, ain't > happening; > man 1 systemd-nspawn > ... > OPTIONS > ... > --network-interface= > Assign the specified network interface to the container. > This will remove the specified interface from the calling namespace and > place it in the container. > When the container terminates, > it is moved back to the host namespace. [...] > > Given what's actually going on, should be stated; > --network-interface= > Assign the specified network interface to the container. > This will remove the specified interface from the calling namespace and > place it in the container. > When the container terminates, > considering that the specified interface is not moved back to the host > namespace, > specific kernel module need to be reloaded to move it back to the host > namespace. [...] Upgrade your kernel! This all works correctly on current kernels: network interfaces will now safely migrate back to the parent namespace when a network namespace dies. We usually don't document bugs in other software in systemd, but instead ask people to run current systemd only in conjunction with somewhat current kernels. Lennart -- Lennart Poettering, Red Hat
Re: [systemd-devel] Short way to show messages of executable and unit with `journalctl`
On Fri, 14.04.17 20:30, Paul Menzel (paulepan...@users.sourceforge.net) wrote: > Dear systemd folks, > > > Is there a shorter way than below to show all messages of an executable > and a unit? > > ``` > $ journalctl _COMM=sudo + _SYSTEMD_UNIT=NetworkManager.service > ``` > > I would be happy about a command, that involves `-u` so that I don’t > have to type the suffix `.service`. This is currently not available. And I am not sure this is a highly typical usage that warrants an explicit option... Sorry... Lennart -- Lennart Poettering, Red Hat
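Since there is no built-in shorthand, a small local wrapper can build the match expression from the question. The sketch below is a hypothetical helper, not a journalctl feature; its suffix handling is a simplification (it appends `.service` whenever the unit name contains no dot), which is cruder than what systemctl itself does.

```python
def journal_match_args(comm, unit):
    """Build a journalctl command line matching an executable OR a unit."""
    if "." not in unit:
        unit += ".service"  # simplified stand-in for unit name completion
    # '+' is journalctl's logical OR between the two match groups.
    return ["journalctl", f"_COMM={comm}", "+", f"_SYSTEMD_UNIT={unit}"]
```

The resulting list can be passed to `subprocess.run()`, so `journal_match_args("sudo", "NetworkManager")` reproduces the command quoted above without typing the suffix.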
Re: [systemd-devel] journal fragmentation on Btrfs
On Sun, 16.04.17 14:30, Chris Murphy (li...@colorremedies.com) wrote: > Hi, > > This is on a Fedora 26 workstation (systemd-233-3.fc26.x86_64) that's > maybe a couple weeks old and was clean installed. Drive is NVMe. > > > # filefrag * > system.journal: 9283 extents found > user-1000.journal: 3437 extents found > # lsattr > C-- ./system.journal > C-- ./user-1000.journal > > I do manual snapshots before software updates, which means new writes > to these files are subject to COW, but additional writes to the same > extents are overwrites and are not COW because of chattr +C. I've used > this same strategy for a long time, since systemd-journald defaults to > +C for journal files; but I've not seen them get this fragmented this > quickly. > IIRC NOCOW only has an effect if set right after the file is created, before the first write to it is done. Or in other words, you cannot retroactively make a file NOCOW. This means that if you in one way or another make a COW copy of a file (through reflinking, whether implicit or not, and note that "cp" reflinks by default, or through snapshotting or something else) the file is COW and you'll get fragmentation. I am not entirely sure what to recommend to you. Ultimately, whether btrfs fragments or not is probably something you have to discuss with the btrfs folks. We do try to make the best of btrfs, by managing the COW flag, but this only helps you to a limited degree as snapshots/reflinks will fuck things up anyway... We also ask btrfs to defrag the file as soon as we mark it as archived... I'd even be willing to extend on that, and defrag the file on other events too, for example if it ends up being too heavily fragmented. But last time I looked btrfs didn't have any nice API for that, that would have a clear focus on a single file only... Lennart -- Lennart Poettering, Red Hat
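The ordering constraint described here (set NOCOW on the empty file, then write) can be sketched in Python as follows. This is an illustration, not journald's code: the helper names are hypothetical, and the ioctl numbers are hard-coded on the assumption of a 64-bit Linux system with the common ioctl encoding. On file systems that do not support the flag, the attempt simply fails and the file is written normally.

```python
import fcntl
import os
import struct

# Assumed values from <linux/fs.h> on 64-bit x86/ARM Linux.
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL = 0x00800000  # what `chattr +C` sets


def set_nocow(fd):
    """Try to set the NOCOW attribute; return True if the ioctl succeeded."""
    buf = bytearray(struct.pack("i", 0))
    try:
        fcntl.ioctl(fd, FS_IOC_GETFLAGS, buf)  # read current flags
        flags = struct.unpack("i", buf)[0]
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("i", flags | FS_NOCOW_FL))
    except OSError:
        return False  # unsupported file system or not permitted
    return True


def create_nocow_file(path, data):
    """Create `path`, mark it NOCOW while still empty, then write `data`."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640)
    try:
        nocow = set_nocow(fd)  # must happen before the first write
        os.write(fd, data)
    finally:
        os.close(fd)
    return nocow
```

Calling `set_nocow()` on a file that already has data is the "retroactive" case mentioned above, which is not expected to have the desired effect on btrfs.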
Re: [systemd-devel] Early testing for service enablement
On Thu, 13.04.17 11:58, Martin Wilck (mwi...@suse.com) wrote: > On Thu, 2017-04-13 at 08:49 +, Mantas Mikulėnas wrote: > > IIRC, enable/disable/is-enabled are implemented entirely via direct > > filesystem access. Other than that, systemctl uses a private socket > > when running as root – it talks DBus but doesn't require dbus-daemon. > > > > A bigger problem is that initramfs can't know much about the main > > system due to having a separate /etc, unless maybe you run `systemctl > > --root=...` > > This is not a problem for us because in initramfs, we only care whether > the service is enabled in initramfs itself. > > > Could you elaborate on why you find this checking necessary in the > > first place? Do your udev rules run some weird stuff? > > It's about multipath. In the udev rule that checks whether or not a > given device should be treated as a multipath device path, we need to > figure out whether multipathd.service is enabled. We want to do that > without connecting to multipathd.socket at that time in the boot > process, because that would fire up multipathd, and there's strong > evidence that multipath-enabled systems boot more stably if multipathd > is started later (after udev settle). Therefore the idea was to obtain > the information from systemd ("will multipathd.service be started later > in the boot process?"). That appears questionable to me. Synchronously requesting data from other services from inside a udev rule like that appears highly problematic to me, in particular if you sometimes do it and sometimes not, as that makes things nondeterministic. Also: instead of checking whether a service unit is enabled before contacting a specific socket, please make sure that the socket unit is only enabled if the service is enabled too (i.e. via Also= in the [Install] section of the service), so that you can directly talk to the socket, and if the service is not enabled (and hence the socket isn't either) you will just get an ENOENT/ECONNREFUSED back...
Lennart -- Lennart Poettering, Red Hat
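The "just connect and look at the error" approach suggested above could look roughly like this in Python. The function name and the `socket_path` parameter are hypothetical, and the sketch assumes a file-system-backed AF_UNIX stream socket; it does not reflect how multipathd's socket is actually addressed.

```python
import errno
import socket


def service_reachable(socket_path):
    """Return True if something is listening on `socket_path`.

    ENOENT (no socket node) and ECONNREFUSED (nobody listening) are
    treated as "service not enabled"; any other error is re-raised.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        try:
            sock.connect(socket_path)
        except OSError as e:
            if e.errno in (errno.ENOENT, errno.ECONNREFUSED):
                return False
            raise
        return True
```

With the socket unit tied to the service through Also=, no separate is-enabled check is needed: the outcome of the connection attempt carries that information.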
Re: [systemd-devel] Early testing for service enablement
On Thu, 13.04.17 12:05, Martin Wilck (mwi...@suse.com) wrote: > On Thu, 2017-04-13 at 11:45 +0200, Lennart Poettering wrote: > > On Thu, 13.04.17 08:49, Mantas Mikulėnas (graw...@gmail.com) wrote: > > > > > IIRC, enable/disable/is-enabled are implemented entirely via direct > > > filesystem access. Other than that, systemctl uses a private socket > > > when > > > running as root – it talks DBus but doesn't require dbus-daemon. > > > > Correct, enable/disable/is-enabled can operate without PID 1, but > > they > > usually don't unless the tool detects it is being run in a chroot > > environment. > > > > And yes, systemctl can communicate with PID 1 through a private > > communication socket that exists as long as PID 1 exists. dbus-daemon > > is not needed, except when your client is unprivileged. > > If I interpret this answer correctly, you're saying that "systemctl > is-enabled xyz.service" *should* actually work, even if it's called right > after PID 1 is started. I'm pretty certain that that wasn't the case > for me. My client was running from an udev rule and thus not > unprivileged. That should be considered a bug, then? Yes, systemctl is-enabled should always work fine, regardless of whether you run it in early or late boot or even in the initrd. However, it will always just return the state that applies to its current context, i.e. inside the initrd it will tell you whether the unit is enabled in the initrd, and on the host whether it is enabled on the host. Lennart -- Lennart Poettering, Red Hat
Re: [systemd-devel] what is sd_notify() really for ?
On Mon, 17.04.17 00:47, Enrico Weigelt, metux IT consult (enrico.weig...@gr13.net) wrote: > On 17.04.2017 00:04, Lennart Poettering wrote: > > > Please always check the man pages if you have questions regarding a > > specific systemd interface: > > > > https://www.freedesktop.org/software/systemd/man/sd_notify.html > > Done so, of course. Unfortunately, it doesn't answer my questions, > eg. what the service manager actually does w/ that information. Well, it's used for a variety of things. I figure the most relevant usage is for the implementation of Type=notify services, which is referenced from the man page, if you have a look. For details about that option see: https://www.freedesktop.org/software/systemd/man/systemd.service.html#Type= Another major use is for the watchdog logic, i.e. the implementation of the WatchdogSec= setting, also referenced from sd_notify()'s man page. For details about this specific setting see: https://www.freedesktop.org/software/systemd/man/systemd.service.html#WatchdogSec= And there's more. For example, you can use it to store fds in the service manager, so that your service may be restarted (or terminated abnormally) and access to specific sockets, devices, or any other object that may be referenced with a file descriptor isn't lost. If the brief descriptions in the man pages aren't sufficient, I'd recommend having a look at the sources. Lennart -- Lennart Poettering, Red Hat
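On the service side, the watchdog logic mentioned above boils down to reading the timeout the service manager exports and pinging more often than that. A minimal Python sketch, assuming the `WATCHDOG_USEC` and `WATCHDOG_PID` environment variables documented in sd_watchdog_enabled(3); the function name and parameters are hypothetical.

```python
import os


def watchdog_ping_interval(env=os.environ, pid=None):
    """Return the suggested keep-alive interval in seconds, or None.

    None means the watchdog is not enabled for this process. The
    interval is half of the configured timeout, as the man page
    recommends.
    """
    usec = env.get("WATCHDOG_USEC")
    if not usec:
        return None
    expected_pid = env.get("WATCHDOG_PID")
    own_pid = os.getpid() if pid is None else pid
    try:
        if expected_pid and int(expected_pid) != own_pid:
            return None  # the watchdog request is meant for another process
        timeout = int(usec)
    except ValueError:
        return None
    if timeout <= 0:
        return None
    return timeout / 2 / 1_000_000
```

A service with WatchdogSec=30 would get 15.0 here and should send "WATCHDOG=1" at least that often from its main loop.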