Re: [systemd-devel] Antw: [EXT] Re: [systemd-devel] version bump of minimal kernel version supported by systemd?
On Fr, 01.04.22 13:54, Greg Kroah-Hartman (gre...@linuxfoundation.org) wrote:

> > While it is true that the syscall interface is kept reasonably stable,
> > almost everything else gets monkeyed with a lot, because a lot of
> > kernel developers only consider the syscall interface a program
> > interface. This is a problem because a *lot* of things are only
> > accessible through other means (procfs, sysfs, uevents, etc.).
> >
> > Unfortunately, that means that in practice, the kernel interfaces that
> > userspace *must* depend on break far more than anyone likes.
>
> The above example is an interesting case. A new feature was added, was
> around for a while, and a few _years later_ it was found out that some

That's not quite true. The breakages were actually reported pretty quickly to the kernel people who added the offending patches, and they even changed some things around (an incomplete patch for udev was posted, which we merged), but the issue was still not properly addressed. It died down then, nothing much happened, and the udev maintainers didn't bring this up again for a while, as they had other stuff to do.

The issue became more and more visible though as more subsystems in the kernel started generating these uevents, to a point where ignoring the issue wasn't sustainable. At that point the kernel people were pretty dismissive though (not that they were particularly helpful in the beginning either), partly because the change had been in for so long. So we reworked how udev works.

> people had userspace "scripts" that broke because the feature was
> added.

Nah, this broke C code all over the place, too. Not just "scripts".

I am not even disagreeing though that bind/unbind uevents made sense to add. I just want to correct how things happened here. There was a general disinterest from the kernel people who broke things in fixing them, and in particular a major disinterest in understanding how udev actually works and how udev rules are used IRL.
(I mean, that early patch we got and merged literally just changed udev to drop messages with bind/unbind entirely, thus not fixing anything, just hiding the problem with no prospect of actually making it useful for userspace. I figure the kernel devs involved actually cared about Android, and not classic Linux userspace, i.e. udev.)

I know the kernel people like to carry that mantra of not breaking userspace quite like a monstrance, but IRL it's broken all the time. Often for good reasons, quite often also for no reason but lack of testing. Things like that will happen. But I also think that Windows for example is probably better at not breaking their interfaces than Linux is.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] problem starting systemd in a container using parameters --default-standard-output=fd --default-standard-error=fd:stdout
On Mi, 30.03.22 17:25, masber masber (mas...@hotmail.com) wrote:

> Any idea why --default-standard-output=fd
> --default-standard-error=fd:stdout breaks systemd?

They make no sense? "fd" you can only use if you have a .socket unit that passes in an fd to a service. But if you don't have that it just doesn't make any sense...

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] udevadm: Failed to scan devices: Input/output error
On Do, 31.03.22 12:58, Belal, Awais (awais_be...@mentor.com) wrote:

> Hi Lennart,
>
> > No distro from the last 10y should use "udevadm settle" in the clean
> > boot path. Please work with your distro to fix that. It doesn't do
> > what people think it does, and clean-written software really doesn't
> > need that in the boot path. It just slows down boot.
>
> Thanks for pointing that out. I will definitely report this and work
> with the distro folks to see why we're doing this and drop it if we
> can work without it. However, the failure I mentioned is in the
> invocation of udevadm trigger. Here's what strace revealed
>
> faccessat(AT_FDCWD, "/sys/devices/virtual/devlink/platform:firmware:zynqmp-firmware:clock-controller--platform:fd4b.gpu/uevent", F_OK) = 0
> readlinkat(AT_FDCWD, "/sys/devices/virtual/devlink/platform:firmware:zynqmp-firmware:clock-controller--platform:fd4b.gpu/subsystem", "../../../../class/devlink", 4096) = 25
> openat(AT_FDCWD, "/sys/devices/virtual/devlink/platform:firmware:zynqmp-firmware:clock-controller--platform:fd4b.gpu/uevent", O_RDONLY|O_CLOEXEC) = 5
> fstat(5, {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0
> fstat(5, {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0
> read(5, "", 4096) = 0
> close(5) = 0
> openat(AT_FDCWD, "/run/udev/data/+devlink:platform:firmware:zynqmp-firmware:clock-controller--platform:fd4b.gpu", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> getdents64(4, 0x1b02edd0, 32768) = -1 EIO (Input/output error)

Uh? getdents64() is the syscall that reads directory contents. Smells like a kernel problem. If EIO is thrown when reading a directory, then that's almost certainly a fuckup in the kernel, given that this probably refers to sysfs or so.

Would be good to know which fd 4 refers to. Consider rerunning the strace with "-y". With that it will show you which fd this is triggered from.

Lennart

--
Lennart Poettering, Berlin
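[For reference, strace's "-y" switch annotates every file descriptor with the path it refers to, which would show what fd 4 points at. An illustrative invocation (the exact trigger command should match whatever the initramfs runs):

```shell
# -y resolves fds to paths, -f follows children, -e narrows to the
# relevant syscalls; adjust the udevadm invocation as needed.
strace -y -f -e trace=openat,getdents64 udevadm trigger --action=add
```
]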
Re: [systemd-devel] udevadm: Failed to scan devices: Input/output error
On Do, 31.03.22 08:53, Belal, Awais (awais_be...@mentor.com) wrote:

> Hi Lennart,
>
> > Which udevadm command is this from? The udevadm trigger invocation we
> > do during boot?
>
> This is a Yocto based build and the boot flow is using an initramfs. After
> setting up /sys /proc and other related specifics the initramfs calls
>
> $_UDEV_DAEMON --daemon
> udevadm trigger --action=add
> udevadm settle

No distro from the last 10y should use "udevadm settle" in the clean boot path. Please work with your distro to fix that. It doesn't do what people think it does, and clean-written software really doesn't need that in the boot path. It just slows down boot.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] udevadm: Failed to scan devices: Input/output error
On Mi, 30.03.22 19:23, Belal, Awais (awais_be...@mentor.com) wrote:

> Now, the system boots just fine on some attempts while on others it takes
> quite a lot of time to boot with logs such as
>
> Starting version 244
> Failed to scan devices: Input/output error
> WARNING: Device /dev/ram0 not initialized in udev database even after
> waiting 1000 microseconds.
> WARNING: Device /dev/mmcblk0 not initialized in udev database even after
> waiting 1000 microseconds.
> WARNING: Device /dev/ram1 not initialized in udev database even after
> waiting 1000 microseconds.
> WARNING: Device /dev/mmcblk0p1 not initialized in udev database
> even after waiting 1000 microseconds.

Which udevadm command is this from? The udevadm trigger invocation we do during boot? Can you reproduce this if you trigger manually? If you strace, do you see where the EIO comes from?

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] version bump of minimal kernel version supported by systemd?
On Do, 24.03.22 10:28, Luca Boccassi (bl...@debian.org) wrote:

> > What I am trying to say is that it would actually help us a lot if
> > we'd not just be able to take cgroupv2 for granted but to take a
> > reasonably complete cgroupv2 for granted.
>
> Yes, that does sound like worth exploring - our README doesn't document
> it though, do we have a list of required controllers and when they were
> introduced?

So I'd argue cgroupv2 was pretty useless before 4.15, since it lacked the cpu controller, which I'd argue is actually the one that matters most. Hence, before 4.15 cgroupv2 was an experiment, not something you could actually deploy.

Some other interesting milestones:

* kcmp → 3.5
* renameat2 on all relevant file systems → 4.0
* pids controller in cgroupv1 → 4.3
* pids controller in cgroupv2 → 4.5
* cgroup namespaces → 4.6
* statx → 4.11
* pidfd → 5.3

This is just some quick search through man pages. There might be a lot of other stuff that would make sense for us to be able to rely on.

Lennart

--
Lennart Poettering, Berlin
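[Most of these interfaces can be probed at runtime rather than inferred from the kernel version. A minimal sketch in Python — the choice of probes here is mine, not from the mail, and it assumes a Linux host with the unified cgroup hierarchy mounted at /sys/fs/cgroup when present:

```python
import os

def probe_features():
    """Best-effort runtime probe for a few of the interfaces listed above."""
    features = {}

    # pidfd_open() needs kernel >= 5.3 (and Python >= 3.9 for the wrapper):
    if hasattr(os, "pidfd_open"):
        try:
            fd = os.pidfd_open(os.getpid())
            os.close(fd)
            features["pidfd"] = True
        except OSError:  # old kernel or seccomp-filtered
            features["pidfd"] = False
    else:
        features["pidfd"] = False

    # Unified cgroupv2 hierarchy mounted (kernel >= 4.2):
    ctrl_file = "/sys/fs/cgroup/cgroup.controllers"
    features["cgroupv2"] = os.path.exists(ctrl_file)

    # cpu controller available in cgroupv2 (kernel >= 4.15):
    if features["cgroupv2"]:
        with open(ctrl_file) as f:
            features["cgroupv2-cpu"] = "cpu" in f.read().split()
    else:
        features["cgroupv2-cpu"] = False

    return features

print(probe_features())
```

On a system booted with a pure cgroupv2 setup all three entries should come back True; on a hybrid or legacy setup the cgroupv2 entries reveal exactly the "exists vs. works well" distinction discussed in this thread.]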
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Do, 24.03.22 14:32, Benjamin Berg (benja...@sipsolutions.net) wrote:

> Hi,
>
> On Thu, 2022-03-24 at 12:40 +0100, Felip Moll wrote:
> > False, the JobRemoved signal returns the id, job, unit and result. To
> > wait for JobRemoved only needs a matching rule for this signal. The
> > matching rule can just contain the path. In fact, nothing else than
> > strings can be matched in a rule, so I may be only able to use the
> > path.
>
> I think you need to add a wildcard match before the job is created
> (i.e. before StartTransientUnit). Otherwise registering the match rule
> (using the job's object path) will race with systemd signalling that
> the job has completed.

Correct.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Do, 24.03.22 00:45, Felip Moll (fe...@schedmd.com) wrote:

> Hi, some days ago we were talking about this:
>
> > > Problem number two, there's a significant delay since when creating the
> > > scope, until it is ready and the pid attached into it. The only way it
> > > worked was to put a 'sleep' after the dbus call and make my process wait
> > > for the async call to dbus to be materialized. This is really
> > > un-elegant.
> >
> > If you want to synchronize on the cgroup creation completing, just
> > wait for the JobRemoved bus signal for the job returned by
> > StartTransientUnit().
>
> StartTransientUnit returns a string to a job object path. To call
> JobRemoved I need the job id, so the easier way to get it is to strip the
> last part of the returned string from StartTransientUnit job object path.
> Am I right?

JobRemoved is a signal, not a method call. i.e. not something you call, but something you are notified about. And it originates from an object, and objects have object paths in D-Bus.

> Once I have the job id, I can then subscribe to JobRemoved bus signal for
> the recently created job, but what happens if during the time I am
> obtaining the ID or parsing the output, the job is finished? Will I lose
> the signal?

Yes. D-Bus sucks that way. You have to subscribe to all jobs first, and then filter out the ones you don't want.

> What is the correct order of doing a StartTransientUnit and wait for the
> job to be finished (done, failed, whatever)?

First subscribe to JobRemoved, then issue StartTransientUnit, and then wait until you see JobRemoved for the unit you just started.

Lennart

--
Lennart Poettering, Berlin
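[The subscribe-first ordering can be illustrated with a toy stand-in for the bus. Note that `FakeBus` and `fake_start` are hypothetical names invented for this sketch, not the real sd-bus/D-Bus client API; a real client would register an `AddMatch` for `JobRemoved` on `org.freedesktop.systemd1` before calling `StartTransientUnit()`:

```python
class FakeBus:
    """Toy stand-in for a D-Bus connection: signals are delivered only to
    matches registered *before* the signal is emitted -- which is exactly
    why subscribing after StartTransientUnit() races with job completion."""
    def __init__(self):
        self.matches = []

    def add_match(self, member, callback):
        self.matches.append((member, callback))

    def emit(self, member, payload):
        for m, cb in list(self.matches):
            if m == member:
                cb(payload)

def start_and_wait(bus, start_transient_unit):
    """Race-free ordering: subscribe to *all* JobRemoved signals first,
    then create the job, then filter for the job path we got back."""
    seen = []
    bus.add_match("JobRemoved", seen.append)          # 1. subscribe first
    job_path = start_transient_unit(bus)              # 2. then start the unit
    return [s for s in seen if s["job"] == job_path]  # 3. then filter

# Demo: a "job" that completes (emits JobRemoved) immediately on creation,
# the worst case for the racy subscribe-after-start ordering.
def fake_start(b):
    b.emit("JobRemoved", {"job": "/org/freedesktop/systemd1/job/42",
                          "result": "done"})
    return "/org/freedesktop/systemd1/job/42"

print(start_and_wait(FakeBus(), fake_start))  # signal is not lost
```

Had `add_match` been called after `fake_start`, `seen` would stay empty — the lost-signal case Lennart describes.]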
Re: [systemd-devel] version bump of minimal kernel version supported by systemd?
On Do, 24.03.22 14:05, Zbigniew Jędrzejewski-Szmek (zbys...@in.waw.pl) wrote:

> > Yes, that does sound like worth exploring - our README doesn't document
> > it though, do we have a list of required controllers and when they were
> > introduced?
>
> In the README:
> Linux kernel >= 4.2 for unified cgroup hierarchy support
> Linux kernel >= 4.10 for cgroup-bpf egress and ingress hooks
> Linux kernel >= 4.15 for cgroup-bpf device hook
> Linux kernel >= 4.17 for cgroup-bpf socket address hooks
>
> In this light, 4.19 is better than 4.4 or 4.9 ;)

Well, the list is not complete. i.e. the "io" controller came late, iirc. And killing and stuff too. It would take some work to figure out which features of cgroupv2 we actually make use of, and then when they were added.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] /etc/os-release but for images
On Mi, 23.03.22 17:14, Davide Bettio (davide.bet...@secomind.com) wrote:

> I opened this PR: https://github.com/systemd/systemd/pull/22841
>
> This doesn't enable full semver support since that would require allowing
> A-Z range, but I'm not sure if it is something we really want to enable
> (uppercase in semver looks quite uncommon by the way).

How does semver suggest uppercase chars are handled? Is "0.1.1a" newer than "0.1.1A"?

> I would use the UUID as a further metadata in addition to IMAGE_VERSION,
> also for the reasons I describe later here in this mail.

Sounds like something you could just add as suffix to IMAGE_VERSION, no?

> > > Compared to other options an UUID here would be both reliable and easy to
> > > handle and generate.
> >
> > UUIDs are effectively randomly generated. That sucks for build
> > processes I am sure, simply because they hence aren't reproducible.
>
> Using a reliable digest, such as the one that can be generated with `casync
> digest`, was my first option, however if you think about an update system
> such as RAUC and its bundles, you might still have the same exact
> filesystem digest mapping to a number of different bundles, since they can
> bring different hook scripts and whatever.
>
> I'm also aware of scenarios where the same filesystem tree has been used to
> generate different flash images in order to satisfy different flash sizes /
> NAND technologies.
>
> So from a practical point of view having an UUID, and forcing a different
> one in /etc/os-release every time a filesystem image / RAUC bundle is
> created allows us to have a reasonable 1:1 mapping between the update
> artifact and the system image that is on the device.
> Last but not least having it in /etc/os-release makes it pretty convenient
> to read it (and for sure using an UUID is easier than trying to embed the
> digest of the filesystem where /etc/os-release is kept ;) )
>
> Also there is an interesting bonus: UUID is globally unique also in
> scenarios where users try to delete and recreate version tags without
> incrementing the version number (or other messy scenarios).

Shouldn't you use the fs header UUID? Or the GPT partition or overall UUIDs?

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] version bump of minimal kernel version supported by systemd?
On Mi, 23.03.22 11:28, Luca Boccassi (bl...@debian.org) wrote:

> At least according to our documentation it wouldn't save us much
> anyway, as the biggest leap is taking cgroupv2 for granted, which
> requires 4.1, so it's included regardless. Unless there's something
> undocumented that would make a big difference, in practical terms of
> maintainability?

Note that "cgroupv2 exists" and "cgroupv2 works well" are two distinct things. Initially too few controllers supported cgroupv2 for cgroupv2 to be actually useful.

What I am trying to say is that it would actually help us a lot if we'd not just be able to take cgroupv2 for granted but to take a reasonably complete cgroupv2 for granted.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] Antw: [EXT] Re: version bump of minimal kernel version supported by systemd?
On Do, 24.03.22 08:21, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote:

> I wonder:
>
> Why not provide some test suite instead: If the test suite succeeds, systemd
> might work; if it doesn't, manual steps are needed.

One goal here is to reduce our maintenance burden, not increase it. Another is to communicate clearly what we support and what we don't. Any such test suite collides with both these goals.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] /etc/os-release but for images
On Mi, 23.03.22 13:38, Davide Bettio (davide.bet...@secomind.com) wrote:

> > That's the idea: take the packages, build an image, and then append
> > IMAGE_ID/IMAGE_VERSION to it?
>
> Sure, sounds pretty convenient, here my point was about blindly appending
> those additional fields (trusting the distribution didn't already set
> them).

I don't know your distro. But I'd certainly view it as a bug if your distro fills in these two fields but doesn't actually work based on pre-built images, but is solely package-based.

> > > I cook a new image, furthermore I plan to replace the whole operating
> > > system image (that I keep read-only) in order to update it, so BUILD_ID
> > > would change at every update (so it sounds slightly different from the
> > > original described semantic).
> >
> > BUILD_ID is not for that. You are looking for IMAGE_VERSION.
>
> IMAGE_VERSION didn't look to me a good option for identifying nightly
> builds, or any kind of build that wasn't cooked from a tagged image build
> recipe.

I think it should be fine for that.

> Also sadly IMAGE_VERSION doesn't allow + which is used from semver for
> build metadata (such as 1.0.0+21AF26D3 or 1.0.0+20130313144700).

Ah, interesting. This looks like something to fix in our syntax description though. I am pretty sure we should allow all characters that semver requires. Can you file an RFE issue about this on github? Or even better, submit a PR that adds that?

That said, I'd always include some time-based counter in automatic builds, so that the builds can be ordered. Things like sd-boot's boot menu entry ordering rely on that.

> That's pretty useful when storing a relation between the exact update
> bundle that has been used to flash a device, and the system flashed on a
> device. It turns out to be pretty useful when collecting stats about a
> device fleet or when remote managing system versions and their updates.
> So what I would do on os-release would be storing an UUID that is generated
> every time a system image is generated, that UUID can be collected/sent at
> runtime from a device to a remote management service.

Why wouldn't the IMAGE_VERSION suffice for that? Why pick a UUID where a version already works?

> Compared to other options an UUID here would be both reliable and easy to
> handle and generate.

UUIDs are effectively randomly generated. That sucks for build processes, I am sure, simply because they hence aren't reproducible.

BTW, there's now also this: https://systemd.io/BUILDING_IMAGES/#image-metadata

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] /etc/os-release but for images
On Mi, 23.03.22 10:51, Davide Bettio (davide.bet...@secomind.com) wrote:

> Hello,
>
> First of all, thanks for your answers.
>
> It wasn't really clear to me that the /etc/os-release file was editable
> from a 3rd party other than the distribution maintainers, so thanks for the
> clarifications.

Well, it's not precisely supposed to be something users or admins should edit. But image builders may.

> Are the distributions required to leave IMAGE_ID and
> IMAGE_VERSION empty?

Well, if the distribution people build both packages and disk images, they can set IMAGE_ID/IMAGE_VERSION for the latter. But this should always be part of building images, not of building packages.

> Can I safely just append those fields at the end of
> the copy of the /etc/os-release file?

That's the idea: take the packages, build an image, and then append IMAGE_ID/IMAGE_VERSION to it?

> Speaking of BUILD_ID, according to the spec sounds like a field
> reserved to

BUILD_ID? That's a different thing...

https://www.freedesktop.org/software/systemd/man/os-release.html#IMAGE_ID=
vs.
https://www.freedesktop.org/software/systemd/man/os-release.html#BUILD_ID=

> distributions: "BUILD_ID may be used in distributions where the original
> installation image version is important", from my side what I need is to
> identify the git revision + build date of the recipe I'm using to cook the
> image installed on the system, also my plan is to change that ID every time
> I cook a new image, furthermore I plan to replace the whole operating
> system image (that I keep read-only) in order to update it, so BUILD_ID
> would change at every update (so it sounds slightly different from the
> original described semantic).

BUILD_ID is not for that. You are looking for IMAGE_VERSION.

> Last but not least, I was looking for a machine parsable unique id, so I
> plan to use BUILD_UUID if it is not kept reserved for other usages, that
> will be an UUID that is freshly generated every time I cook a new image.

What's this for?
Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] systemd move processes to user.slice cgroup after updating service configuration file
On Mi, 23.03.22 14:25, 吾为男子 (csren...@qq.com) wrote:

> dear all experts,
>
> now we have such a problem:
>
> we need to update our systemd service configuration file,
>
> before updating, our service has already created some processes and
> made them attach to the cgroup
> /system.slice/{our-service-name}.service/{our-service-sub-group},
> this is what we would expect,
>
> but, on some machines, sometimes, after we update our service
> configuration file, these processes as mentioned above
> will be moved to /user.slice, this is what we do NOT
> expect, there is a certain probability that this will happen

Is it possible that said service invokes sudo or su or so, or in some other way opens a PAM session? If so, this will migrate the calling process into a per-session cgroup below user.slice. What's the precise cgroup slice of one such occurrence?

> how to prevent this action from systemd? it will be a great honor
> for me to get your help, thanks.

Don't use sudo/su from scripts. If you need to acquire privileges from a script, use util-linux's setpriv tool. It will change privileges for you but without opening a PAM session, and thus without the cgroup migratory effect.

Lennart

--
Lennart Poettering, Berlin
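[As a sketch of the suggested replacement — the account name "appuser" is made up, and the flags are as documented in util-linux's setpriv(1):

```shell
# instead of:  sudo -u appuser /usr/bin/worker
# do this -- no PAM session is opened, so the process stays in the
# service's cgroup instead of being moved below user.slice:
setpriv --reuid=appuser --regid=appuser --init-groups /usr/bin/worker
```
]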
Re: [systemd-devel] /etc/os-release but for images
On Di, 22.03.22 17:10, Davide Bettio (davide.bet...@secomind.com) wrote:

> Hello,
>
> I would like to figure out if anyone has proposed any kind of standard for
> storing metadata about a system image.

The IMAGE_ID= and IMAGE_VERSION= fields from /etc/os-release are supposed to be used for that.

i.e. the idea was that you can take a generic distro (fedora, debian, …), build an image off it, and it will put its own os info in /usr/lib/os-release, and make /etc/os-release a symlink to it. Your image build would then replace /etc/os-release with a file that incorporates /usr/lib/os-release and adds in IMAGE_ID=/IMAGE_VERSION=. Each time you rebuild the image your image building tool would repeat that step. i.e. it would be the image builder tool's job to extend the generic OS data from /usr/lib/ with info about the image and place the result in /etc/.

Lennart

--
Lennart Poettering, Berlin
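[The image-builder step described here can be sketched as follows. This operates on a throwaway directory tree standing in for the image root; the distro content and the IMAGE_ID/IMAGE_VERSION values are made-up examples:

```python
import os
import tempfile

def extend_os_release(root, image_id, image_version):
    """Image-builder step: replace the /etc/os-release symlink with a copy
    of /usr/lib/os-release extended by the image's own metadata."""
    src = os.path.join(root, "usr/lib/os-release")
    dst = os.path.join(root, "etc/os-release")
    if os.path.islink(dst):
        os.unlink(dst)
    with open(src) as f:
        data = f.read()
    with open(dst, "w") as f:
        f.write(data)
        f.write(f"IMAGE_ID={image_id}\nIMAGE_VERSION={image_version}\n")

# Demo on a temporary tree standing in for the image root:
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "usr/lib"))
os.makedirs(os.path.join(root, "etc"))
with open(os.path.join(root, "usr/lib/os-release"), "w") as f:
    f.write("ID=somedistro\nVERSION_ID=1\n")  # generic distro info
os.symlink("../usr/lib/os-release", os.path.join(root, "etc/os-release"))

extend_os_release(root, "my-appliance", "7")
with open(os.path.join(root, "etc/os-release")) as f:
    print(f.read())
```

The image-building tool would rerun this for every rebuild, as the mail says, so the image metadata always reflects the latest build.]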
Re: [systemd-devel] find_device() and FOREACH_DEVICE_DEVLINK memory leaks on "systemd-249"
On So, 13.03.22 19:14, Tony Rodriguez (unixpro1...@gmail.com) wrote:

> Valgrind is reporting "still reachable" memory leak (2 blocks) when calling
> find_device() and FOREACH_DEVICE_DEVLINK against "systemd-249". In my case,
> they are both called within fstab-generator.c on "systemd-249". Only code
> modifications, on my end, are within fstab-generator.c

The mempool stuff is not really "leaked": it's an allocation cache, i.e. subsequent calls will reuse the already allocated objects. The stuff is hence reachable via the allocation cache.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mi, 16.03.22 17:35, Michal Koutný (mkou...@suse.com) wrote:

> True, in the unified mode it should be safe doing manually.
> I was worried about migrating e.g. MainPID of a service into this scope
> but PID1 should handle that AFAICS. Also since this has to be performed
> by the privileged user (scopes are root's), the manual migration works.

This is actually a common case: for getty style login processes the main process of the getty service will migrate to the new scope.

A service is thus always a cgroup *and* a main pid for us, in case the main pid is outside of the cgroup. And conversely, a process can be associated with multiple units this way: it can be the main pid of one service and be in the cgroup of a scope.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mi, 16.03.22 17:30, Felip Moll (fe...@schedmd.com) wrote:

> > > (The above is slightly misleading) there could be an alternative of
> > > something like RemainAfterExit=yes for scopes, i.e. such scopes would
> > > not be stopped after last process exiting (but systemd would still be in
> > > charge of cleaning the cgroup after explicit stop request and that'd
> > > also mark the scope as truly stopped).
> >
> > Yeah, I'd be fine with adding RemainAfterExit= to scope units
>
> Note that what Michal is saying is "something like RemainAfterExit=yes for
> scopes", which means systemd would NOT clean up the cgroup tree when there
> are no processes inside.
> AFAIK RemainAfterExit for services actually does clean up the cgroup tree if
> there are no more processes in it.

It doesn't do that if delegation is on (iirc; if not, I'd consider that a bug). The same logic should apply here.

> If that behavior of keeping the cgroup tree even if there are no pids is
> what you agree with, then I concede it is a good idea to include this option
> for scopes.

Yes, that is what I was suggesting this would do.

> > > Such a recycled scope would only be useful via
> > > org.freedesktop.systemd1.Manager.AttachProcessesToUnit().
> >
> > Well, if delegation is on, then people don't really have to use our
> > API, they can just do that themselves.
>
> That's not exact. If slurmd (my main process) forks a slurmstepd (child
> process) and I want to move slurmstepd into a delegated subtree from the
> scope I already created, I must use AttachProcessesToUnit(), isn't that
> true?

Depends on your privs. You can just move it yourself if you have enough privs. See the commit msg in 6592b9759cae509b407a3b49603498468bf5d276.

> Or are you saying that I can just migrate processes wildly without
> informing systemd and just doing an 'echo > cgroup.procs' from one
> non-delegated tree to my delegated subtree?

Yeah, you can do that.
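[Such a manual migration is just a write to cgroup.procs in the target subtree. A sketch — the scope name and subtree path are examples, and suitable privileges plus Delegate=yes on the scope are assumed:

```shell
# move $PID into a delegated subtree below the scope's cgroup:
echo "$PID" > /sys/fs/cgroup/system.slice/slurmstepd.scope/step_0/cgroup.procs
```
]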
Note that (independently of systemd) you shouldn't migrate stuff too aggressively, since it fucks up kernel resource accounting. i.e. it is wise to minimize process migration in cgroups, and always migrate just shortly after exec(), or even better do a clone(CLONE_INTO_CGROUP) – though unfortunately the latter cannot work with glibc right now :-(.

i.e. keeping processes that already "have history" around for a long time after migration kinda sucks.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mi, 16.03.22 16:15, Felip Moll (fe...@schedmd.com) wrote:

> On Tue, Mar 15, 2022 at 5:24 PM Michal Koutný wrote:
>
> > On Tue, Mar 15, 2022 at 04:35:12PM +0100, Felip Moll wrote:
> > > Meaning that it would be great to have a delegated cgroup subtree without
> > > the need of a service or scope.
> > > Just an empty subtree.
> >
> > It looks appealing to add Delegate= directive to slice units.
> > Firstly, that'd prevent the use of the slice by anything systemd.
> > Then some notion of owner of that subtree would have to be defined (if
> > only for cleanup).
> > That owner would be a process -- bang, you created a service with
> > delegation or a scope with "keepalive" process.
>
> Correct, this is how the current systemd design works.
> But... what if the concept of owner was irrelevant? What if we could just
> tell systemd, hey, give me /sys/fs/cgroup/mysubdir and never ever touch it
> or do anything to it or pids residing in it.

No, that's not something we will offer. We bind a lot of meaning to the cgroup concept, i.e. we derive unit info from it, and many things are based on that. For example, any client logging to journald will do so from a cgroup, and we pick that up to know which service the logging is from, and store that away and use it for filtering, for picking per-unit log settings and so on.

Moreover, we need to be able to shut down all processes on the system in a systematic way at shutdown, and we do that based on units, and the ordering between them. Having processes and cgroups that live entirely independently makes a total mess of this.

And there's a lot more, like resource mgmt: we want all processes on the system placed in a unit of some form so that we can apply useful resource mgmt to them.

So yes, you can have a delegated subtree if you like, and we'll mostly not interfere with what you do there, but it must be a leaf of our tree, and we'll "macro manage" it for you, i.e. define a lifetime for it, and track processes back to it.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Di, 15.03.22 17:24, Michal Koutný (mkou...@suse.com) wrote:

> On Tue, Mar 15, 2022 at 04:35:12PM +0100, Felip Moll wrote:
> > Meaning that it would be great to have a delegated cgroup subtree without
> > the need of a service or scope.
> > Just an empty subtree.
>
> It looks appealing to add Delegate= directive to slice units.

Hm? Slice units are *inner* nodes of *our* cgroup trees. If we'd allow delegation of those, then we could not put stuff inside them; hence it wouldn't be a slice anymore, because it couldn't contain anything.

> Firstly, that'd prevent the use of the slice by anything systemd.

Yeah, precisely? I don't follow. What would a slice with delegation be that a scope with delegation isn't already?

> Then some notion of owner of that subtree would have to be defined (if
> only for cleanup).

Scopes already have that, so why not use that?

> That owner would be a process -- bang, you created a service with
> delegation or a scope with "keepalive" process.

Can't parse this.

> (The above is slightly misleading) there could be an alternative of
> something like RemainAfterExit=yes for scopes, i.e. such scopes would
> not be stopped after last process exiting (but systemd would still be in
> charge of cleaning the cgroup after explicit stop request and that'd
> also mark the scope as truly stopped).

Yeah, I'd be fine with adding RemainAfterExit= to scope units.

> Such a recycled scope would only be useful via
> org.freedesktop.systemd1.Manager.AttachProcessesToUnit().

Well, if delegation is on, then people don't really have to use our API, they can just do that themselves.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Di, 15.03.22 16:35, Felip Moll (fe...@schedmd.com) wrote:

> > I don't follow. You can enable delegation on the scope. I mean, that's
> > the reason I suggested to use a scope.
>
> Meaning that it would be great to have a delegated cgroup subtree without
> the need of a service or scope.
> Just an empty subtree.

That's what a scope is. I don't follow? What do you think a scope is beyond that? It just encapsulates a cgroup subtree. It auto-cleans it though once it goes empty, and because it does that it also requires you to provide at least one PID to add to the scope when it is created.

For services we have a RemainAfterExit= property btw. There were requests for adding the same for scopes. I'd be fine with adding that, happy to take a patch.

Lennart

--
Lennart Poettering, Berlin
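[For experimentation, a delegated scope of this kind can be created from the shell with systemd-run; the unit name and the sleep stub holding the scope open are examples (flags per systemd-run(1)):

```shell
# create a transient, delegated scope, kept non-empty by a stub process
# so systemd's auto-cleanup does not kick in:
systemd-run --scope --unit=myjobs -p Delegate=yes sleep infinity &
```
]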
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Di, 15.03.22 10:50, Felip Moll (fe...@schedmd.com) wrote:

> Another thing I have found is that if the process which created the scope
> (e.g. my main process, slurmd) terminates, then the scope is stopped even
> if I abandoned it and there's a pid inside.
> So this makes the proposed solution not working. What am I missing?
>
> ● gamba11_slurmstepd.scope
>      Loaded: loaded (/run/systemd/transient/gamba11_slurmstepd.scope; transient)
>   Transient: yes
>      Active: active (abandoned) since Tue 2022-03-15 10:40:34 CET; 4s ago

It's shown as active, so where is the problem?

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mo, 14.03.22 23:12, Felip Moll (fe...@schedmd.com) wrote:

> > But note that you can also run your main service as a service, and
> > then allocate a *single* scope unit for *all* your payloads.
>
> The main issue is the scope needs a pid attached to it. I thought that the
> scope could live without any process inside, but that's not happening.
> So every time a user step/job finishes, my main process must take care of
> it, and launch the scope again on the next coming job.

Leave a stub process around in it, i.e. something similar to "/bin/sleep infinity".

> The forked process just does the dbus call, and when the scope is ready it
> is moved to the corresponding cgroup (PIDFile=).

Hmm? PIDFile= is a property of *services*, not *scopes*. And "scopes" cannot be moved to "cgroups". I cannot parse the above. Did you read up on scopes and services? See https://systemd.io/CGROUP_DELEGATION/, it explains the concept of "scopes". Scopes *have* cgroups, but cannot be moved to "cgroups".

> Problem number one: if other processes are in the scope, the dbus call
> won't work since I am using the same name all the time, e.g.
> slurmstepd.scope.
> So I first need to check if the scope exists and if so put the new
> slurmstepd process inside. But we still have the race condition: if during
> this phase all steps end, systemd will do the cleanup.

Leave a stub process around in it.

> Problem number two, there's a significant delay since when creating the
> scope, until it is ready and the pid attached into it. The only way it
> worked was to put a 'sleep' after the dbus call and make my process wait
> for the async call to dbus to be materialized. This is really
> un-elegant.

If you want to synchronize on the cgroup creation completing, just wait for the JobRemoved bus signal for the job returned by StartTransientUnit().

> If instead I could just ask systemd to delegate a part of the tree for my
> processes, then everything would be solved.

I don't follow.
You can enable delegation on the scope. I mean, that's the reason I suggested to use a scope. > Do you have any other suggestions? Not really, except maybe: please read up on the documentation, it explains a lot of the concepts. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] PrivateNetwork=yes is memory costly
On Do, 10.03.22 11:50, Christopher Wong (christopher.w...@axis.com) wrote: > Hi Lennart, > > > It is definitely a functionality we want to use. However, the memory came as > an unexpected side effect. Since we are not only enabling this for one single > service, instead we are applying it globally for all services. > > Now due to this huge memory consumption we are trying to put > everything into the same namespace using > JoinsNamespaceOf=. It seems to consume less memory. This means they will still be isolated from the network, but no longer from each other. Lennart -- Lennart Poettering, Berlin
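For reference, sharing one network namespace across services via JoinsNamespaceOf= looks roughly like this (unit names are made up for illustration; see systemd.unit(5)/systemd.exec(5) for the directives):

```ini
# a.service carries the namespace; b.service joins it. Both stay
# disconnected from the real network, but can see each other.

# --- a.service ---
[Service]
PrivateNetwork=yes

# --- b.service ---
[Unit]
JoinsNamespaceOf=a.service

[Service]
PrivateNetwork=yes
```

Note that JoinsNamespaceOf= is a [Unit] directive, and the joining service still needs PrivateNetwork=yes set itself for the sharing to take effect.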
Re: [systemd-devel] PrivateNetwork=yes is memory costly
On Mo, 07.03.22 15:10, Christopher Wong (christopher.w...@axis.com) wrote: > Hi, > > > It seems that PrivateNetwork=yes is a memory consuming > directive. The kernel seems to allocate quite an amount of memory > for each service (~50 kB) that has this directive enabled. I wonder > if this is expected and if anyone has had similar experience? PrivateNetwork=yes means that a private network namespace is allocated for the service. If you think network namespaces are too expensive in their current implementation, please bring this up with the kernel people, because they are a kernel concept after all, we just allocate them if told so. Network namespaces are an effective way to disconnect a service from the network, if the service doesn't need it. It's probably one of the most relevant sandboxing options we offer, since disabling the attack surface called "network" for a service is of such major importance. That said, if you disable the network namespace functionality in the kernel systemd will handle this gracefully, and not use it. If the feature is available in the kernel we will however use it. > Are there any ways to reduce the usage? Besides turning it off? Nothing I am aware of. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] making firewalld an early boot service
On Mi, 09.03.22 08:17, Michael Biebl (mbi...@gmail.com) wrote: > > firewalld requires D-Bus so it must be started after D-Bus. You cannot > > start it earlier. > > See above, being Type=dbus, it has an explicit > Requires/After=dbus.socket. It has After=dbus.service, not After=dbus.socket, no? That's a difference during shutdown: if you order against the service this means you can still talk via the broker on shutdown. If you only order against the socket the broker might be dead by the time you shut down. Ideally services would be written in a style that they just exit at shutdown and don't need to do D-Bus anymore just to exit. But of course reality isn't always ideal. Lennart -- Lennart Poettering, Berlin
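The difference described above would look like this in a unit file (a sketch, not firewalld's actual file):

```ini
[Unit]
# Ordering against the service keeps the broker alive until we stop:
# on shutdown, units stop in reverse startup order, so dbus.service
# outlives us and we can still make bus calls while exiting.
After=dbus.service

# Ordering only against the socket gives no such guarantee on shutdown:
# After=dbus.socket
```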
Re: [systemd-devel] making firewalld an early boot service
On Mi, 09.03.22 08:49, Andrei Borzenkov (arvidj...@gmail.com) wrote: > On 09.03.2022 00:59, Michael Biebl wrote: > > Hi, > > > > I need help with firewalld issue, specifically > > https://github.com/firewalld/firewalld/issues/414 > > > > the TLDR: both firewalld.service and cloud-init-local.service hook > > into network-pre.target and have a Before=network-pre.target ordering. > > > > cloud-init-local.service is an early boot service using > > DefaultDependencies=no and before sysinit.target. > > firewalld.service via DefaultDependencies=yes gets an > > After=sysinit.target ordering. > > > > So we have conflicting requirements and a dependency loop that needs > > to be broken by systemd. > > > > Firewalld is a red herring here. cloud-init.service has > > After=networking.service What is this unit? Is this a Debian thing? Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] making firewalld an early boot service
On Di, 08.03.22 22:59, Michael Biebl (mbi...@gmail.com) wrote: > I wonder if firewalld should be turned into an early boot service as > well. I doubt you can do that. Thing is that firewalld uses D-Bus, and services that do D-Bus will have a hard time running during early boot. In systemd we have some services which do D-Bus and run in early boot, specifically networkd, resolved and systemd itself. They do that by simply not doing D-Bus that early, and watching the d-bus socket so that they connect the moment it becomes available. It's ugly as fuck though, and very hard to get right; it took us quite some time to get this reasonably right and race-free. Last time I looked firewalld is a bunch of scripts around iptables/nft shell outs? I have my doubts it's going to be easy to make that work, i.e. add the glue it needs to instantly connect to D-Bus once it becomes available in a race-free fashion. > Currently it looks like this: > > [Unit] > Description=firewalld - dynamic firewall daemon > Before=network-pre.target Network management services such as networkd are early-boot services. A late boot service ordered before network-pre.target, and thus before networkd, is hence already an ordering cycle. > After=dbus.service > After=polkit.service These two are late boot services, hence hard to move to early boot if you keep them. > I wonder if the following would make sense > > > [Unit] > Description=firewalld - dynamic firewall daemon > DefaultDependencies=no > Before=network-pre.target > Wants=network-pre.target > After=local-fs.target > Conflicts=iptables.service ip6tables.service ebtables.service > ipset.service nftables.service > Documentation=man:firewalld(1) > > [Service] > ... > [Install] > WantedBy=sysinit.target It should also have Before=sysinit.target really. 
> Alias=dbus-org.fedoraproject.FirewallD1.service > I dropped the After=dbus.service polkit.service orderings, as they are > either socket or D-Bus activated services, added an explicit > After=local-fs.target ordering just to be sure and hooked it into > sysinit.target. My educated guess is that they want After=dbus.service mostly for shutdown ordering, i.e. so that they can still be talked to while the system goes down or so? The thing though is: I doubt firewalld is able to handle the case where the dbus broker isn't connectible yet. > Would you agree that making a firewall service an early boot service > is a good idea? Well, I am not a fan of the firewalld concept tbh. But yes, if you buy into the idea of firewalld, then you have to make it an early boot service really, if you intend to be compatible with early boot networking. That said, I think NetworkManager is not early-boot either right now, is it? So you have to move that too. But in that case too, not sure if it can deal with D-Bus not being around. Lennart -- Lennart Poettering, Berlin
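Putting Michael's proposal and the Before=sysinit.target remark together, the early-boot variant of the unit would look roughly like this (an untested sketch, [Service] section elided as in the original mail):

```ini
[Unit]
Description=firewalld - dynamic firewall daemon
Documentation=man:firewalld(1)
DefaultDependencies=no
# Be up before the network comes up, and before regular early boot ends.
Before=network-pre.target sysinit.target
Wants=network-pre.target
After=local-fs.target
Conflicts=iptables.service ip6tables.service ebtables.service ipset.service nftables.service

[Install]
WantedBy=sysinit.target
Alias=dbus-org.fedoraproject.FirewallD1.service
```

As noted above, this only works out if firewalld itself can cope with the D-Bus broker not being connectible yet at the time it starts.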
Re: [systemd-devel] Antw: [EXT] Re: timer "OnBootSec=15m" not triggering
On Mo, 07.03.22 12:24, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote: > Thanks for that. The amazing things are that "systemd-analyze verify" finds no > problem and "enable" virtually succeeds, too: Because there is no problem really: systemd allows you to define your targets as you like, and we generally focus on a model where you can extend stuff without requiring it to be installed. i.e. we want to allow loose coupling, where stuff can be ordered against other stuff, or be pulled in by other stuff, without requiring the other stuff to be installed. Thus you can declare that you want to be pulled in by a target that doesn't exist, and that's *not* considered an issue, because it might just mean that you haven't installed the package that defines it. Example: if you install mysql and apache, then there's a good reason you want that mysql runs before apache, so that the web apps you run on apache can access mysql. Still it should be totally OK to install one without the other, and thus it's not a bug if one refers to the other in its unit files, even if the other thing is not installed. Lennart -- Lennart Poettering, Berlin
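The mysql/apache example expressed as unit directives (service names are purely illustrative):

```ini
# Drop-in for the apache service: order it after mysql *if* mysql is
# installed and enabled. After= is ordering only -- it does not pull
# mysql.service in, and it is not an error if mysql.service does not
# exist on this system at all.
[Unit]
After=mysql.service
```

That is the loose coupling in practice: the dependency is honored when both packages are present, and silently irrelevant when one of them is not.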
Re: [systemd-devel] systemd failing to close unwanted file descriptors & FDS spawning and crashing
On Fr, 04.03.22 09:26, Christopher Obbard (chris.obb...@collabora.com) wrote: > Right, so it looks like the call to close_range fails. This is a 5.4 kernel > which doesn't have close_range - so this is understandable. > > For a quick fix, I set have_close_range to false - see the patch attached. > It seemed to work well. > > Since my 5.4 kernel is a heavily modified downstream one - next I will check > if that syscall was implemented by someone else, and also I will check if > vanilla systemd works on vanilla 5.4 (there is no reason why it shouldn't, > right?). Hmm, this is strange. Our code already has fallback paths in place to handle cases where the syscall is not implemented, i.e. where we see ENOSYS when we call it. Our code should handle this perfectly already. Is it possible that your patched kernel added non-standard syscalls under the syscall numbers the official kernel later assigned to close_range()? If so, this would explain that we see EINVAL, as of course we call the syscall assuming it was close_range(), but if it is actually something else it very likely will not be able to make sense of our parameters and thus returns EINVAL. In this case I am not very sympathetic to your case: squatting syscall numbers is just a terrible idea... Lennart -- Lennart Poettering, Berlin
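The fallback pattern being discussed is easy to sketch outside of systemd. Here is a minimal Python illustration (not systemd's actual code) of "try close_range(2), fall back to closing fds one by one if the kernel doesn't have it" -- the function name and return values are made up for the demo:

```python
import ctypes
import errno
import os
import sys

# close_range(2) landed in kernel 5.9; 436 is its number in the
# x86-64 and generic (e.g. arm64) syscall tables.
SYS_close_range = 436

_libc = ctypes.CDLL(None, use_errno=True)

def close_fd_range(first, last):
    """Close all fds in [first, last], preferring the close_range()
    syscall, falling back to a manual loop otherwise -- the same shape
    as systemd's fallback path. Returns which path was taken."""
    res = -1
    if sys.platform == "linux":
        res = _libc.syscall(SYS_close_range, first, last, 0)
    if res == 0:
        return "close_range"
    # ENOSYS means the kernel is too old; anything else (e.g. EINVAL
    # from a kernel that squatted this syscall number) also has to send
    # us down the fallback, which is the point Lennart is making.
    for fd in range(first, last + 1):
        try:
            os.close(fd)
        except OSError:
            pass  # fd wasn't open; ignore and continue
    return "fallback"
```

On a >= 5.9 kernel the syscall path is taken; on an older (or squatting) kernel the loop does the same job, just more slowly.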
Re: [systemd-devel] Antw: [EXT] Re: [systemd‑devel] How to find out the processes systemd‑shutdown is waiting for?
On Fr, 04.03.22 08:20, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote: > >> Something seems to be off with containerd's integration into systemd but > >> I'm not sure what. > > > > Docker traditionally has not followed any of our documented ways to > > You are implying that "our documented ways" is a definitive > standard? Not sure what a "standard" is, but yeah, systemd defines a non-trivial part of the APIs of general purpose Linux distributions. > > interact with cgroups, even though they were made aware of them, not > > sure why, I think some systemd hate plays a role there. I am not sure > > if this has changed, but please contact Docker if you have issues with > > Docker, they have to fix their stuff themselves, we cannot work around > > it. > > The problem with systemd (people) is that they try to establish new standards > outside of systemd. > > "If A does not work with systemd", it's always A that is broken, never systemd > ;-) It's a stack of software. The lower layers dictate how the upper layers interact with the lower layers, not the other way around. Yes, systemd has bugs, but here we are not at fault, we document our interfaces, but Docker knowingly goes its own way, and there's little I can do about it. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] How to find out the processes systemd-shutdown is waiting for?
On Mi, 02.03.22 17:50, Lennart Poettering (lenn...@poettering.net) wrote: > That said, we could certainly show both the comm field and the PID of > the offending processes. I am prepping a patch for that. See: https://github.com/systemd/systemd/pull/22655 Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Do, 03.03.22 18:35, Felip Moll (fe...@schedmd.com) wrote: > I have read and studied all your suggestions and I understand them. > I also did some performance tests in which I fork+executed a systemd-run to > launch a service for every step and I got bad performance overall. > One of our QA tests (test 9.8 of our testsuite) shows a decrease of > performance of 3x. systemd-run is synchronous, and unless you specify "--scope" it will tell systemd to fork things off instead of doing that client-side, which I understand is what you want to do. The fact it's synchronous, i.e. waits for completion of the whole operation (including start-up of dependencies and whatnot), necessarily means it's slow. > > But note that you can also run your main service as a service, and > > then allocate a *single* scope unit for *all* your payloads. That way > > you can restart your main service unit independently of the scope > > unit, but you only have to issue a single request once for allocating > > the scope, and not for each of your payloads. > > > > > My questions are, where would the scope reside? Does it have an associated > cgroup? Yes, I explicitly pointed you to them, it's why I suggested you use them. My recommendation if you hack on stuff like this is reading the docs btw, specifically: https://systemd.io/CGROUP_DELEGATION It pretty explicitly lists your options in the "Three Scenarios" section. It also explains what scope units are and when to use them. > I am also curious of what this sentence does exactly mean: > > "You might break systemd as a whole though (for example, add a process > directly to a slice's cgroup and systemd will be very sad).". If you add a process to a cgroup systemd manages that is supposed to be an inner one in the tree, you will make creation of children fail that way, and thus starting services and other operations will likely start failing all over the place. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mo, 21.02.22 22:16, Felip Moll (lip...@gmail.com) wrote: > Silvio, > > As I commented in my previous post, creating every single job in a separate > slice is an overhead I cannot assume. > An HTC system could run thousands of jobs per second, and doing extra > fork+execs plus waiting for systemd to fill up its internal structures and > manage it all is a no-no. Firing an async D-Bus packet to systemd should be hardly measurable. But note that you can also run your main service as a service, and then allocate a *single* scope unit for *all* your payloads. That way you can restart your main service unit independently of the scope unit, but you only have to issue a single request once for allocating the scope, and not for each of your payloads. But that too means you have to issue a bus call. If you really don't like talking to systemd this is not going to work of course, but quite frankly, that's a problem you are making yourself, and I am not particularly sympathetic to it. > One other option that I am thinking about is extending the parameters of a > unit file, for example adding a DelegateCgroupLeaf=yes option. > > DelegateCgroupLeaf=. If set to yes an extra directory will be > created into the unit cgroup to place the newly spawned service process. > This is useful for services which need to be restarted while its forked > pids remain in the cgroup and the service cgroup is not a leaf > anymore. No. Let's not add that. Lennart -- Lennart Poettering, Berlin
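The single bus call in question is org.freedesktop.systemd1.Manager.StartTransientUnit(name, mode, properties, aux). As a sketch, here is just the assembly of its arguments in Python -- actually firing the call off needs a D-Bus library (sd-bus, dbus-python, ...), and the unit name below is made up; the "PIDs" and "Delegate" property names follow systemd's D-Bus documentation:

```python
import os

def transient_scope_request(unit_name, pids):
    """Assemble the argument tuple for StartTransientUnit() to register
    already-forked processes as a new scope unit. Only the payload
    shape is shown; no bus connection is made here."""
    properties = [
        ("PIDs", pids),       # adopt these processes into the new scope
        ("Delegate", True),   # hand the scope's cgroup subtree to us
    ]
    aux = []                  # no auxiliary units
    # "fail" = error out if a unit by this name already exists
    return (unit_name, "fail", properties, aux)

# e.g. put the calling process itself into a scope
name, mode, props, aux = transient_scope_request("myjobs.scope", [os.getpid()])
```

Sending this asynchronously right after fork() is the "one dbus message" referred to above; if you want a completion barrier, wait for the JobRemoved signal for the returned job.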
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mo, 21.02.22 18:07, Felip Moll (lip...@gmail.com) wrote: > > That's a bad idea typically, and a generally a hack: the unit should > > probably be split up differently, i.e. the processes that shall stick > > around on restart should probably be in their own unit, i.e. another > > service or scope unit. > > So, if I understand it correctly you are suggesting that every forked > process must be started through a new systemd unit? systemd has two different unit types: services and scopes. Both group processes in a cgroup. But only services are where systemd actually forks+execs (i.e. "starts a process"). If you want to fork yourself, that's fine, then a scope unit is your thing. If you use scope units you do everything yourself, but as part of your setup you then tell systemd to move your process into its own scope unit. > If that's the case it seems inconvenient because we're talking about a job > scheduler where sometimes may have thousands of forked processes executed > quickly, and where performance is key. > Having to manage a unit per each process will probably not work in this > situation in terms of performance. You don't really have to "manage" it. You can register a scope unit asynchronously, it's firing off one dbus message basically at the same time you fork things off, telling systemd to put it in a new scope unit. > The other option I can imagine is to start a new unit from my daemon of > Type=forking, which remains forever until I decide to clean it up even if > it doesn't have any process inside. > Then I could put my processes in the associated cgroup instead of inside > the main daemon cgroup. Would that make sense? Migrating processes wildly between cgroups is messy, because it fucks up accounting and is restricted permission-wise. Typically you want to create a cgroup and populate it, and then stick to that. 
> The issue here is that for creating the new unit I'd need my daemon to > depend on systemd libraries, or to do some fork-exec using systemd commands > and parsing output. To allocate a scope unit you'd have to fire off a D-Bus method call. No need for any systemd libraries. > I am trying to keep the dependencies at a minimum and I'd love to have an > alternative. Sorry, but if you want to rearrange processes in cgroups, or want systemd to manage your processes orthogonal to the service concept you have to talk to systemd. > Yeah, I know and understand it is not supported, but I am more interested > in the technical part of how things would break. > I see in systemd/src/core/cgroup.c that it often differentiates a cgroup > with delegation with one without it (!unit_cgroup_delegate(u)), but it's > hard for me to find out how or where this exactly will mess up with any > cgroup created outside of systemd. I'd appreciate it if you can give me > some light on why/when/where things will break in practice, or just an > example? This depends highly on what precisely you do. At best systemd will complain or just override the changes you did outside of the tree you got delegated. You might break systemd as a whole though (for example, add a process directly to a slice's cgroup and systemd will be very sad). Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] How to find out the processes systemd-shutdown is waiting for?
On Mi, 02.03.22 12:29, Manuel Wagesreither (man...@fastmail.fm) wrote: > Hi all, > > My embedded system is shutting down rather slow, and I'd like to find out the > responsible processes. > > [ 7668.571133] systemd-shutdown[1]: Waiting for process: dockerd, python3 > [ 7674.670684] systemd-shutdown[1]: Sending SIGKILL to remaining processes... > > Is there a way for systemd-shutdown to give me the PID of the > processes it waits for? The messages above are from the second phase of shutdown, the "killing spree phase": it kills processes/unmounts file systems/detaches storage/… that is left over from the first phase of shutdown. If stuff is terminated here, and in particular processes are killed, this almost certainly indicates something is seriously broken in the first phase of shutdown. Or in other words: don't try to debug the second phase too much, the first phase is where the brokenness is. Since this is dockerized stuff: look at docker, they are pretty bad at cooperating with systemd in the cgroup hierarchy, and they really need to clean up their own stuff properly when going down. If they don't do that, you need to fix that in Docker. Or in other words: talk to the docker people about all this. That said, we could certainly show both the comm field and the PID of the offending processes. I am prepping a patch for that. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] How to find out the processes systemd-shutdown is waiting for?
On Mi, 02.03.22 13:02, Arian van Putten (arian.vanput...@gmail.com) wrote: > I've seen this a lot with docker/containerd. It seems as if for some reason > systemd doesn't wait for their cgroups to be cleaned up on shutdown. It's > very easy to reproduce. Start a docker container and then power off the > machine. Since the move to cgroups V2 containerd should be using systemd to > manage the cgroup tree so a bit puzzling why it's always happening. > > Something seems to be off with containerd's integration into systemd but > I'm not sure what. Docker traditionally has not followed any of our documented ways to interact with cgroups, even though they were made aware of them, not sure why, I think some systemd hate plays a role there. I am not sure if this has changed, but please contact Docker if you have issues with Docker, they have to fix their stuff themselves, we cannot work around it. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mo, 21.02.22 14:14, Felip Moll (lip...@gmail.com) wrote: > Hello, > > I am creating a software which consists of one daemon which forks several > processes from user requests. > This is basically acting like a job scheduler. > > The daemon is started using a unit file and with Delegate=yes option, > because every process must be constrained differently. I manage my cgroup > hierarchy, create some leaves into the tree and put each pid there. > For example, after starting up the service and receiving 3 user requests, a > tree under /sys/fs/cgroup/system.slice/ could look like: > > sgamba1.service/ > ├── daemon_pid > ├── user1_stuff > ├── user2_stuff > └── user3_stuff > > I create the hierarchy and set cgroup.subtree_control in the root directory > (sgamba1.service in the example) and everything runs smoothly, until when I > decide to restart my service. > > The service then cannot restart: > > feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed to attach to > cgroup /system.slice/sgamba1.service: Device or resource busy > feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed at step > CGROUP spawning /path_to_bin/mydaemond: Device or resource busy > > This is because systemd tries to put the pid of the new daemon in > sgamba1.service/cgroup.procs and this would break the "no internal process > constrain" rule for cgroup v2, since sgamba1.service is not a leaf anymore > because it has subtree_control enabled for the user stuff. > > One hard requirement is that user stuff must live even if the service is > restarted. Hmm? Hard requirement of what? Not following? You are leaving processes around when your service dies/restarts? That's a bad idea typically, and generally a hack: the unit should probably be split up differently, i.e. the processes that shall stick around on restart should probably be in their own unit, i.e. another service or scope unit. > What's the way to achieve that? 
I see one easy way, which is to move user > stuff into its own cgroup and out of sgamba1.service/, but then it will run > outside a Delegate=yes unit. What can happen then? > Will systemd eventually migrate my processes? > How do services workaround that issue? > If I am moving user stuff into the root /sys/fs/cgroup/user_stuff/, will > systemd touch my directories? That's not supported. You may only create your own cgroups where you turned on delegation, otherwise all bets are off. If you put stuff in /sys/fs/cgroup/user_stuff it's as if you placed stuff in systemd's "-.slice" without telling it so, and things will break sooner or later, and often in non-obvious ways. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] [RFC] systemd-resolved: Send d-bus signal after DNS resolution
On Mi, 16.02.22 12:13, Dave Howorth (syst...@howorth.org.uk) wrote: > > This could be used by applications for auditing/logging services > > downstream of the resolver, or to update the firewall on the system. > > Perhaps an example use case would help but I'm not clear how a DNS > resolution would usefully cause a state change in the firewall without > some further external guidance? Yeah, I am not sure I grok the relationship to firewalls here, either. Updating firewalls asynchronously based on DNS lookups sounds wrong to me... Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Restart=on-failure and SuccessAction=reboot-force causing reboots on every exit of Main PID
On Mi, 16.02.22 11:45, Michał Rudowicz (michal.rudow...@fl9.eu) wrote: > Hi, > > I am trying to write a .service file for an application which is supposed to > run indefinitely. The approach I have is: > > - if the application crashes (exits with a non-zero exit code), I want >it to be restarted. This can be achieved easily using the Restart >directive, like Restart=on-failure in the [Service] section. > - if the application exits cleanly (with a zero exit code), I want the >whole device to reboot (eg. because a software update was done). I've >found the SuccessAction directive, which - after being set to reboot or >reboot-force in the [Unit] section - seems to do what I want. > > The issue I have is: when I use both SuccessAction=reboot-force and > Restart=on-failure in one .service file, the system reboots when I kill > the Main PID of the service (causing non-clean exit for testing). Can you provide a minimal .service file that shows the issue? Smells like a bug. SuccessAction= should not be triggered if a service process exits with SIGKILL... Lennart -- Lennart Poettering, Berlin
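A minimal unit reproducing the reported combination would be something like the following (the ExecStart= path is hypothetical):

```ini
[Unit]
Description=runs forever; a clean exit means "please reboot"
SuccessAction=reboot-force

[Service]
ExecStart=/usr/local/bin/myapp
Restart=on-failure
```

Per the reply above, killing the main PID with SIGKILL counts as a failure, so the expectation is that Restart=on-failure kicks in and SuccessAction= stays untriggered; a reboot in that case would indeed smell like a bug.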
Re: [systemd-devel] [RFC] systemd-resolved: Send d-bus signal after DNS resolution
On Di, 15.02.22 22:37, Suraj Krishnan (sura...@microsoft.com) wrote: > Hello, > > I'm reaching out to the community to gather feedback about a feature > to broadcast a d-bus signal notification from systemd-resolved when > a DNS query is completed. The message would contain information > about the query and IP addresses received from the DNS server. Broadcasting this on the system bus sounds a bit too heavy. I am sure there are setups which will resolve a *lot* of names in a very short time, and you'd flood the bus with that. D-Bus is expressly not built for streaming more than control data, but if you have a flood of DNS requests it becomes substantially more than that. Also, given that in 99.9% of all cases the broadcast messages would just be dropped by the broker because nothing is listening, this sounds needlessly expensive. What would make sense is adding a Varlink interface for this however. resolved uses Varlink anyway, it could just build on that. Varlink has the benefit that no broker is involved: if no one is listening we wouldn't do anything and not have to pay for it. Moreover Varlink has no issues with streaming large amounts of data. And it's easy to secure to ensure nobody unprivileged will see this (simply by making the socket have a restrictive access mode). So yes, I think adding the concept makes a ton of sense. But not via D-Bus, but via Varlink. Would love to review/merge a patch that adds that and then exposes this via "resolvectl monitor" or so. Lennart -- Lennart Poettering, Berlin
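Varlink is lightweight enough to demonstrate with plain sockets: each message is a UTF-8 JSON object terminated by a single NUL byte, with no broker in between. A minimal framing sketch (the interface name below is invented for the demo; it is not resolved's actual monitor interface):

```python
import json
import socket

def varlink_send(sock, obj):
    # Varlink frames every message as UTF-8 JSON followed by one NUL byte.
    sock.sendall(json.dumps(obj).encode("utf-8") + b"\x00")

def varlink_recv(sock):
    # Read until the NUL terminator, then parse the JSON payload.
    buf = b""
    while not buf.endswith(b"\x00"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return json.loads(buf[:-1].decode("utf-8"))

# Demo over a socketpair standing in for the unix socket a daemon
# would listen on; "more": True requests a streamed series of replies.
client, server = socket.socketpair()
varlink_send(client, {"method": "io.example.Monitor.Subscribe", "more": True})
msg = varlink_recv(server)
```

The lack of a broker is exactly the cost argument made above: if no one connects to the socket, nothing is serialized and nothing is paid for.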
Re: [systemd-devel] Q: Perform action for reboots happen too frequently?
On Mi, 16.02.22 14:09, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote: > Hi! > > I wonder: Is it possible with systemd to detect multiple reboots > within a specific time interval, and react, like executing some > systemctl command (that is expected to "improve" things)? With > "reboots" I'm mainly thinking of unexpected boots, not so much the > "shutdown -r" commands, but things like external resets, kernel > panics, watchdog timeouts, etc. The information why a system was reset is hard to get. Some watchdog hw will tell you if it was recently triggered, and some expensive hw might log serious errors to the acpi pstore stuff, but it's all icky and lacks integration. A safe approach would probably be to instead track boots where the root fs or so is in dirty state. Pretty much all of today's file systems track that. It sounds like an OK feature to add to the systemd root fs mount logic to check for the dirty flag before doing that, and then using that as a hint to boot into a target other than default.target that could then apply further policy (and maybe then go on to enter default.target). TLDR: nope, this does not exist yet, but parts of it sound like worthwhile feature additions to systemd. Lennart -- Lennart Poettering, Berlin
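The bookkeeping half of what Ulrich asks for (detect N boots within a time window) is straightforward to sketch; the hard part, as noted above, is learning *why* the system reset. A hypothetical per-boot script could track timestamps like this (all names and the policy are made up):

```python
import json
import os
import tempfile
import time

def record_boot_and_check(state_path, now=None, window=3600.0, limit=3):
    """Record one boot timestamp in a small JSON state file and report
    whether `limit` or more boots fell within the last `window` seconds.
    Intended to be run once early in each boot."""
    if now is None:
        now = time.time()
    try:
        with open(state_path) as f:
            boots = json.load(f)
    except (FileNotFoundError, ValueError):
        boots = []
    boots.append(now)
    # keep only the timestamps still inside the window
    recent = [t for t in boots if now - t <= window]
    with open(state_path, "w") as f:
        json.dump(recent, f)
    return len(recent) >= limit
```

A reaction (boot into a recovery target, run some systemctl command) would then hang off the True return; note this still cannot distinguish clean reboots from panics or watchdog resets by itself.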
Re: [systemd-devel] Proposal to extend os-release/machine-info with field PREFER_HARDENED_CONFIG
On Di, 15.02.22 19:05, Stefan Schröder (ste...@tokonoma.de) wrote: > Situation: > > Many packages in a distribution ship with a default configuration > that is not considered 'secure'. Do they? What does "secure" mean? If there's a security vulnerability, maybe talk to the distro about that? They should be interested... > Hardening guidelines are available for all major distributions. Each is a > little different. > Many configuration suggestions are common-sense among security-conscious > administrators, who have to apply more secure configuration using some > automation framework after installation. > > PROPOSAL > > os-release or machine-info should be amended with a field > > PREFER_HARDENED_CONFIG > > If the value is '1' or 'True' or 'yes' a package manager can opt to > configure an alternative, more secure default configuration (if > available). I am not sure what "hardening" means, sounds awfully vague to me. I mean, I'd assume that a well meaning distro would lock everything down as much as they can *by*default*, except if this comes at too high a price on performance or maintenance or so. But how is a single boolean going to address that? If security was as easy as just turning on a boolean, then why would anyone want to set that to false? > According to the 'Securing Debian Manual' [1] the login configuration is > configured as > auth optional pam_faildelay.so delay=300 > whereas > delay=1000 > would provide a more secure default. > > The package 'login' contains the file /etc/pam.d/login. If dpkg (or > apt or rpm or pacman or whatever) detected that os-release asks for > secure defaults, the alternative /etc/pam.d/login.harden could be > made the default. (This file doesn't exist yet, the details are left > to the packaging infrastructure or package maintainer.) If the default settings are too low, why not raise them for everyone? 
I must say, I am very sure that the primary focus should always be on locking things down as well as we can for *everyone* and as *default*. Making security an opt-in sounds like a systematically wrong approach. If specific security tech cannot be enabled by default, then work should be done to make it something that can be enabled by default. And if that's not possible then it apparently comes at some price, but a simple config boolean somewhere can't decide whether that price is worth it... So, quite frankly, I am not convinced this is desirable. That said, you can extend machine-info with anything you like, it's supposed to be extensible. But please make sure you prefix the variables with some prefix that makes collisions unlikely. Lennart -- Lennart Poettering, Berlin
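Extending /etc/machine-info with a vendor-prefixed variable, as suggested, might look like this (the EXAMPLECORP_ prefix is made up; machine-info is simple KEY=VALUE lines per machine-info(5)):

```ini
# /etc/machine-info
PRETTY_HOSTNAME=Web Server 1
# Custom, prefixed key -- unknown keys are preserved by tooling,
# and the prefix keeps future official names from colliding.
EXAMPLECORP_PREFER_HARDENED_CONFIG=yes
```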
Re: [systemd-devel] Passive vs Active targets
On Di, 15.02.22 09:14, Kenneth Porter (sh...@sewingwitch.com) wrote: > Given that interfaces can come and go, does network.target imply that all > possible interfaces are up? No, totally and absolutely not. It's only very vaguely defined what reaching network.target at boot actually means. Usually not more than "some network managing daemon is running now" and nothing else. It explicitly does not say anything about network interfaces being up or down or even existing. Things are more interesting during shutdown here: because in systemd shutdown order is the reverse of the startup order it means that stuff ordered After=network.target will be shut down *before* the network is shut down, so they should still be able to send "goodbye" packets if they want, before the network goes away. Note that there's another target: network-online.target which is an active target that is supposed to encapsulate the point where the machine is actually "online". But given that this can mean a myriad of different things (local interface up vs. ping works vs. DNS reachable, vs. DHCP acquired, …) it's also pretty vaguely defined, and ultimately must be filled with meaning by the admin. Lennart -- Lennart Poettering, Berlin
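A service that genuinely needs the network up before starting would express that with the standard pattern from the systemd documentation:

```ini
[Unit]
# Wants= pulls the target in (and thereby the network manager's
# wait-online service); After= orders us behind whatever "online"
# has been defined to mean on this machine.
Wants=network-online.target
After=network-online.target
```

Plain After=network.target, by contrast, buys ordering against the network stack going away at shutdown, but says nothing about interfaces being up at boot.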
Re: [systemd-devel] Passive vs Active targets
On Di, 15.02.22 17:30, Thomas HUMMEL (thomas.hum...@pasteur.fr) wrote: > > > Also, it seems that there are more than one way to pull in a passive > > > dependency (or maybe several providers which can "publish" it). Like for > > > instance network-pre.target wich is pulled in by both nftables.service > > > and/or rdma-ndd.service. > > > > nftables.service should pull it in and order itself before it, if it > > intends to set up the firewall before the first network interface is > > configured. > > It makes sense but I'm still a bit confused here : I thought that a unit > which pulled a passive target in was conceptually "publishing it" for *other > units* to sync After= or Before= it but not to use it itself. What you're > saying here seems to imply that nftables.services uses itself the passive > target it "publishes". A passive unit is a sync point that should be pulled in by the service that actually needs it to operate correctly. Hence: ask the question whether networkd/NetworkManager will only operate correctly if nftables finished start-up before them. I think the answer is a clear "no". But the opposite holds, i.e. nftables only operates as a safe firewall if it is run *before* networkd/NM start up. Thus it should be nftables that pulls network-pre.target in, not networkd/NM, because it matters to nftables, and it doesn't to networkd/NM. > Or maybe it is the other way around : by pulling it *and* knowing that > network interface is configured After= nftable.service is guaranteed to set > up its firewall before any interface gets configured. So yeah, passive units are mostly about synchronization, i.e. if they are pulled in they should have units on both sides, otherwise they make no sense. > > not sure what rdma-ndd does, can't comment on that. > > My point was more : is it legit for 2 supposedly different units to pull in > the same passive target ? Yes.
If there are multiple services that really want to be started before some other set of services are started, then the passive target is a great way to generically put a synchronization point between them. It can be any set of services before, and any set of services after it. > Anyway both point above seem to confirm that one cannot take for granted > that some passive target will be pulled in, correct ? So before ordering > around it one can make sure some unit pulls the checkpoint ? Yeah, that's the idea: passive units are mostly synchronization points that allow loose coupling for ordering things: for generically ordering stuff before and after them without actually listing the services explicitly on either side. Lennart -- Lennart Poettering, Berlin
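The publish/consume pattern discussed in this thread can be sketched roughly like this (a simplified illustration, not the exact unit files distributions ship):

```ini
# Provider side (e.g. a firewall service): pulls the passive
# target in and orders itself before it.
[Unit]
Wants=network-pre.target
Before=network-pre.target

# Consumer side (e.g. a network management service): merely orders
# itself after the target, without pulling it in:
#
# [Unit]
# After=network-pre.target
```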
Re: [systemd-devel] Passive vs Active targets
On Di, 15.02.22 08:46, Kenneth Porter (sh...@sewingwitch.com) wrote: > --On Tuesday, February 15, 2022 11:52 AM +0100 Lennart Poettering > wrote: > > > Yes, rsyslog.service should definitely not pull in network.target. (I > > am not sure why a syslog implementation would bother with > > network.target at all, neither Wants= nor After= really makes any > > sense. i.e. if people want rsyslog style logging but it would be > > delayed until after network is up then this would mean they could > > never debug DHCP leases or so with it, which sounds totally backwards > > to me. But then again, I am not an rsyslog person, so maybe I > > misunderstand what it is supposed to do.) > > Presumably it's to run the log server for other hosts. (I use it to log for > my routers and IoT devices.) Well, I presumed people actually ran it because they want traditional /var/log/messages files or so. But if that's not the case... > On RHEL/CentOS, rsyslog uses imjournal to read the systemd journal so > presumably DHCP debugging messages would be retrieved from that. Sure, but that means if people use /var/log/messages to look for DHCP issues it won't see anything until DHCP was actually acquired, which makes it useless for debugging DHCP... Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Passive vs Active targets
On Mo, 31.01.22 20:13, Thomas HUMMEL (thomas.hum...@pasteur.fr) wrote: > Hello, > > I'm successully using systemd with some non trivial (for me!) unit > dependencies including some performing: > > custom local disk formatting and mounting at boot > additionnal nics configuration by running postscripts fetched from the > network > Infiniband initialisation > NFS remote mounts > Infiniband remote mounts > HPC scheduler and its side services activation > > and I've read > https://www.freedesktop.org/software/systemd/man/systemd.special.html > > Still I do not fully (or at all ?) understand the concept of passive vs > active targets and some related points: > > The link above states : > > "Note specifically that these passive target units are generally not pulled > in by the consumer of a service, but by the provider of the service. This > means: a consuming service should order itself after these targets (as > appropriate), but not pull it in. A providing service should order itself > before these targets (as appropriate) and pull it in (via a Wants= type > dependency)." > > And also : > > "Note that these passive units cannot be started manually, i.e. "systemctl > start time-sync.target" will fail with an error. They can only be pulled in > by dependency." > > Since my first look at a passive dependency was network.target which I > indeed saw was pulled in by NetworkManager.service which ordered itself > Before it and which I compared with the active network-online.target which > pulls in the NetworkManager-wait-online.service I first deduced the > following: > > a) a passive target "does" nothing and serves only as an ordering checkpoint > b) an active target "does" actually something Yes, you could see it that way. 
> I thought that a passive target could be seen as "published" by the > corresponding provider > But this does not seems as simple as that: > > For one I see on my system that rsyslog.service also pulls in network.target > (but orders itself After it and thus does not seeems to be the actual > "publisher" of it as opposed the NetworkManager.service) There may very well be users that get this the wrong way around. We do not systematically review other people's unit files, and hence there might be a lot of issues lurking. Yes, rsyslog.service should definitely not pull in network.target. (I am not sure why a syslog implementation would bother with network.target at all, neither Wants= nor After= really makes any sense. i.e. if people want rsyslog style logging but it would be delayed until after network is up then this would mean they could never debug DHCP leases or so with it, which sounds totally backwards to me. But then again, I am not an rsyslog person, so maybe I misunderstand what it is supposed to do.) Anyway, it would be excellent to file a bug against the rsyslog package and ask it to drop the deps. > Then rpcbind.target seems to auto pull itself so without the Before ordering > we see in the NetworkManager.service pulling network.target example Can't parse this. > Also, it seems that there are more than one way to pull in a passive > dependency (or maybe several providers which can "publish" it). Like for > instance network-pre.target wich is pulled in by both nftables.service > and/or rdma-ndd.service. nftables.service should pull it in and order itself before it, if it intends to set up the firewall before the first network interface is configured. Not sure what rdma-ndd does, can't comment on that. > Finally, my understanding is some passive targets are not to be taken for > granted, i.e. they may not be pulled in at all and it is to the user to > check it if actually is the case if he want to order a unit againt it.
I'm > not talking here about obvious targets we don't have because out of our > scope (like not having remote mounts related targets if system is purely > local) but some we could think we have but maybe not. For instance on my > system I see remote-fs-pre.target pulled in by nfs-client.target but would > be remote-fs-pre-target be pulled in (by who?) if I had only Infiniband > remote mounts ? remote-fs-pre.target should be pulled in by whoever wants to run *before* any remote mounts (i.e. do Wants= + Before= on it). The remote mounts should only order themselves *after* it, but not pull it in. > So my question would revolve around the above points > > Can you help me figuring out the correct way to see those concepts ? I think you mostly got things right but the services you listed are simply buggy. Lennart -- Lennart Poettering, Berlin
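As a sketch of the rule just stated, a hypothetical service that must run before any remote mounts would carry (unit name invented for illustration):

```ini
# setup-fabric.service — hypothetical. Wants= + Before= pulls
# remote-fs-pre.target in and orders this service before it.
[Unit]
Wants=remote-fs-pre.target
Before=remote-fs-pre.target

# A remote mount unit then only orders itself *after* the target,
# without pulling it in (network mount units generated from
# /etc/fstab get this ordering automatically):
#
# [Unit]
# After=remote-fs-pre.target
```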
Re: [systemd-devel] mdmon@md127 is stopped early
On Fr, 11.02.22 13:50, Mariusz Tkaczyk (mariusz.tkac...@linux.intel.com) wrote: > > Otherwise there might simply be another program that explicitly tells > > systemd to shut this stuff down, i.e. some script or so. Turn on debug > > logging (systemd-analyze log-level debug) before shutting down, the > > logs should tell you a thing or two about why the service is stopped. > > > > That is ridiculous when I enabled debug logging by command provided, it > is not killed: A heisenbug. Usually some race then. i.e. the extra debug logging probably slows down relevant processes long enough so that others can catch up that previously couldn't. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Failed to add PIDs to scope's control group: No such process
On Do, 03.02.22 18:39, Gena Makhomed (g...@csdoc.com) wrote: > Hello, All! > > Periodically I see in /var/log/messages error message about > > Failed to add PIDs to scope's control group: No such process > Failed with result 'resources'. > > How I can resolve or workaround this error? It usually just means that the process that makes up a user session dies more quickly than we have time to actually set up the session for it. It's not really a problem, mostly just noise. If this is reproducible on current upstream systemd versions, please file a bug upstream and we can look into fixing this. But fixing it would mostly entail downgrading the log level in this case, i.e. just cosmetically suppressing the noisy logging. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Need a systemd unit example that checks /etc/fstab for modification and sends a text message
On Di, 08.02.22 17:27, Tony Rodriguez (unixpro1...@gmail.com) wrote: > From my understanding, it is frowned by systemd developers to > "automatically" reload systemd via "systemctl daemon-reload" when /etc/fstab > is modified. Guessing this is why such functionality hasn't already been > incorporated into systemd. However, I would like to send a simple text > message. Instructing users to manually invoke "systemctl deamon-reload" > after modifying /etc/fstab via dmesg command or inside /var/log/messages. > > Unsure how to do so inside a systemd UNIT. Will someone please provide an > example how to do so? At least Fedora puts a comment about this in /etc/fstab, explaining the situation. That sounds a lot more appropriate to me than making this appear in the logs... You can use a PathModified= .path unit for this if you like. Lennart -- Lennart Poettering, Berlin
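A minimal sketch of the PathModified= approach, with invented unit names; the logged message is just an example:

```ini
# fstab-watch.path
[Unit]
Description=Watch /etc/fstab for modifications

[Path]
PathModified=/etc/fstab

[Install]
WantedBy=multi-user.target

# fstab-watch.service — activated whenever the path unit triggers:
#
# [Service]
# Type=oneshot
# ExecStart=/usr/bin/logger "fstab changed, run: systemctl daemon-reload"
```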
Re: [systemd-devel] systemd-journald namespace persistence
On Mi, 09.02.22 10:18, Roger James (ro...@beardandsandals.co.uk) wrote: > How do I create a persistent systemd-journald namespace? > > I have a backup service that is run by a systemd timer. I would like that to > use it's own namespace. I can create the namespace manually using systemctl > start systemd-journald@mynamespace.service. However I cannot find a way to > do that successfully at boot time. I have tried a RequiredBy and a Requires > in the timer unit but neither seem to work. Not sure I follow? The journald instance sockets should get auto-activated as a dependency of your backup service, implicitly, as a LogNamespace= side effect. There should be no need to run it all the time. The socket units come with StopWhenUnneeded=yes set, so they automatically go away if no service needs them. Why would you want to run those services continuously? Lennart -- Lennart Poettering, Berlin
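In other words, something like the following should suffice, assuming a namespace called "backup" (a sketch, not the poster's actual unit):

```ini
# backup.service — the matching systemd-journald@backup.service
# and its sockets are pulled in automatically as a side effect of
# LogNamespace=; no manual start at boot is needed.
[Service]
Type=oneshot
ExecStart=/usr/local/bin/run-backup
LogNamespace=backup
```

The namespace's logs can then be read with "journalctl --namespace=backup".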
Re: [systemd-devel] Strange behavior of socket activation units
On Do, 10.02.22 10:09, Tuukka Pasanen (pasanen.tuu...@gmail.com) wrote: > Hello, > > Thank you for your sharp answer. Is there any way to debug what launches > these sockets and makes socket activation? These log messages look like they are generated client-side. Hence, figure out where they come from. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] mdmon@md127 is stopped early
On Mi, 09.02.22 17:16, Mariusz Tkaczyk (mariusz.tkac...@linux.intel.com) wrote: > It is probably wrong, but it worked this way for many years: > "Again: if your code is being run from the root file system, then this > logic suggested above is NOT for you. Sorry. Talk to us, we can > probably help you to find a different solution to your problem."[3] > > How can I block the service from being stopped? In initramfs there is a > mdmon restart procedure, for example in dracut[4]. I need to save > mdmon process from being stopped. > > I will try to adapt our implementation to your[3] suggestions but it is > longer topic, I want to workaround the issue first. > > [1]https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git > [2]https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/systemd/mdmon@.service So with that unit systemd shouldn't stop the service at all, given that you set DefaultDependencies=no. It would be good to figure out why it is stopped anyway, i.e. check with "systemctl show" on the unit what kind of requirement/conflicts deps there are which might explain it. Otherwise there might simply be another program that explicitly tells systemd to shut this stuff down, i.e. some script or so. Turn on debug logging (systemd-analyze log-level debug) before shutting down, the logs should tell you a thing or two about why the service is stopped. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Run "ipmitool power cycle" after lib/systemd/system-shutdown scripts
On Mi, 09.02.22 22:05, Etienne Champetier (champetier.etie...@gmail.com) wrote: > Hello systemd hackers, > > After flashing the firmware of some pcie card I need to power cycle > the server to finish the flashing process. > For now I have a simple script in lib/systemd/system-shutdown/ running > "ipmitool power cycle" but I would like to make sure it runs after > other scripts like fwupd.shutdown or mdadm.shutdown > > Is there any way to have systemd cleanly power cycle my server instead > of rebooting it ? What does "power cycle" entail that "reboot" doesn't? i.e. why doesn't "systemctl reboot" suffice? /usr/lib/systemd/system-shutdown/ drop-ins are executed before the OS transitions back into the initrd — the initrd will then detach the root fs (i.e. undo what it attached at boot) and actually reboot. This means if your command turns off the power source you should stick it in the initrd's shutdown logic, and not into /usr/lib/systemd/system-shutdown/. If you are using RHEL this means into dracut. But adding it there is something to better discuss with the dracut community than here. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] systemd.sockets vs xinetd
On Do, 10.02.22 08:41, Yolo von BNANA (y...@bnana.de) wrote: > Hello, > > i read the following in an LPIC 1 Book: > > " > If you’ve done any investigation into systemd.sockets, you may believe that > it makes super servers like xinetd obsolete. At this point in time, that is > not true. The xinetd super server offers more functionality than > systemd.sockets can currently deliver. > " > > I thought, that this information could be deprecated. > > Is systemd.sockets at this point in time a good replacement for xined? xinetd supports various things systemd does not: - tcp_wrappers support - implementation of various internal mini-servers, such as RFC868 time server and so on - SUN RPC support - configurable access times - precise configuration of generated log message contents - stream redirection and a couple of other minor things. The first 3 of these are outright obsolete, I am sure. We don't implement them for that reason. Instead of configurable access times we allow you to start/stop the socket units individually any time, and you could bind that to a clock or anything else really, it's up to you. I think systemd's logic is vastly more powerful there. For stream redirection we have systemd-socket-proxyd, which should be at least as good, but is not implemented in the socket unit logic itself, but as an auxiliary service. So yes, it does some stuff we don't. Are there some people who want those things? I guess there are. But I am also sure that they are either obsolete if you look at the bigger picture, or there are better ways to do them, which we do support. Or to say this differently: it has been years since anyone filed an RFE bug on systemd github asking for a feature from xinetd that we lack. Lennart -- Lennart Poettering, Berlin
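For completeness, the stream-redirection replacement mentioned above looks roughly like this (an adapted sketch; the backend address is a placeholder and the exact binary path varies by distribution):

```ini
# proxy.socket — accept connections on port 80
[Socket]
ListenStream=80

[Install]
WantedBy=sockets.target

# proxy.service — forward each accepted connection to a backend:
#
# [Unit]
# Requires=proxy.socket
# After=proxy.socket
#
# [Service]
# ExecStart=/usr/lib/systemd/systemd-socket-proxyd 192.0.2.10:80
```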
Re: [systemd-devel] Strange behavior of socket activation units
On Mo, 07.02.22 09:01, Tuukka Pasanen (pasanen.tuu...@gmail.com) wrote: > Hello, > I have encountered this kind of problem. This particularly is MariaDB 10.6 > installation to Debian. > Nearly every application that is restarted/started dies on: > > > Failed to get properties: Unit name mariadb-extra@.socket is missing the > instance name. > Failed to get properties: Unit name mariadb@.socket is missing the instance > name. > > > This is kind of strange as those are for multi target version and should not > be launched if I don't have mariadb@something.service added or make > specified systemctl enable mariadb@something.socker which I haven't. > They don't show up in dependencies of mariadb.service or other service file? My educated guess is that some script specific to mariadb in debian assumes that you only run templated instances of mariadb, and if you don't things break. Please work with the mariadb people at Debian to figure this out, there's nothing much we can do from systemd upstream about that. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] OnCalendar weekday range syntax
On Fr, 04.02.22 06:23, Kenneth Porter (sh...@sewingwitch.com) wrote: > <https://www.freedesktop.org/software/systemd/man/systemd.time.html> > > Shows a range of weekdays separated by two dots: > > Mon..Fri > > When I use this on CentOS 7.9.2009, systemd-219-78.el7_9.5.x86_64, I get > this error from systemd-analyze verify: Probably not a good idea to use the documentation of a current systemd version from 2022 with a systemd version from 2015. 7 (!) years changes an awful lot. ".." is the official way to denote weekday ranges, "-" is accepted for compat, but not documented. Lennart -- Lennart Poettering, Berlin
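On a current systemd, both spellings parse to the same calendar expression; a quick sketch:

```ini
# On a recent systemd you can verify an expression with:
#   systemd-analyze calendar 'Mon..Fri 18:00'
[Timer]
OnCalendar=Mon..Fri 18:00
# Undocumented compat spelling, also accepted:
# OnCalendar=Mon-Fri 18:00
```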
Re: [systemd-devel] Udevd and dev file creation
On Di, 01.02.22 16:04, Nishant Nayan (nayan.nishant2...@gmail.com) wrote: > One thought > Is it advisable to turn off systemd-udevd if I am sure that I won't be > adding /removing any devices to my server. Note that there are various synthetic/virtual devices that are created on app request, i.e. lvm, loopback, network devices and such. We live in a dynamic and virtual world, where devices come and go all the time. Moreover there are various devices that send out uevents for change notification of various forms. If you turn off udev, apps won't get those either. i.e. udev is about more than just plug + unplug. If you stop udev, apps waiting for their devices to show up will never get the ready notifications and thus will stop working. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] systemd killing processes on monitor wakeup?
On Mo, 31.01.22 09:47, Raman Gupta (rocketra...@gmail.com) wrote: > > Honestly this just sounds like systemd killing "leftover" processes within > > the plasma-plasmashell cgroup, after the "main" process of that service has > > exited. That's not a bug; that's standard behavior for systemd services. > > > > What determines whether a process becomes part of the plasma-plasmashell > cgroup or not? When I run plasmashell independently of systemd, processes > do indeed start as child processes of plasmashell. I'm guessing this > implies that when plasmashell is run under systemd, all these processes > become part of the cgroup, and this is why systemd "cleans up" all these > child processes after a plasmashell crash? I don't know plasma, but generally: whatever is forked off from a process is part of the same cgroup as the process — unless it is explicitly moved away. Moving things away explicitly is typically done by PID 1 or the per-user instance of systemd, on explicit API request. So, I don't know if plasma calls into systemd at all. If it doesn't, all its children will be part of the same cgroup as plasma. If it otoh does IPC calls to systemd in some form and tells it to fork/take over the processes then they might end up in a different cgroup however. > > It's also interesting to me that many applications *do not* exit in this > scenario -- Slack Desktop exits about 50% of the time, and IDEA exits > pretty consistently. Most other apps remain running. Not sure why that > would be -- if systemd is cleaning up, shouldn't all apps exit? "systemd-cgls" should give you a hint which cgroups exist and which processes remain children of plasma inside its cgroup, and which ones got their own cgroup. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Launching script that needs network before suspend
On So, 23.01.22 22:13, Tomáš Hnyk (tomash...@gmail.com) wrote: > Hello, > I have my computer hooked up to an AVR that runs my home cinema and ideally > I would like the computer to turn off the AVR when I turn it off or suspend > it. The only way to do this is over network and I wrote a simple script that > does just that. Hooking it to shutdown was quite easy using network.target > that is defined when shutting down. > > > I am struggling to make it work with suspend though. When I look at the > logs, terminating network seems to be the first thing that happens when > suspend is invoked. That shouldn't happen. Normally networking shouldn't be shut down during suspend. If your network management solution does this explicitly, I am puzzled, why it would do that. > I tried putting the script to > /usr/lib/systemd/system-sleep/ and it runs, but only after network si down, > so it fails. Running the script with systemd-inhibit > (ExecStart=/usr/bin/systemd-inhibit --what=sleep my_script) tells me that > "Failed to inhibit: The operation inhibition has been requested for is > already running". Inhibitors are designed to be taken by long running apps, and not by the stuff that is run when you are already suspending. i.e. it's too late then: if you want to temporarily stall suspending to do your stuff, it's already too late once these scripts are invoked, because they are invoked once it was decided to actually go into suspend now. > Is there a way to make this work with service files by specifying that the > script needs to be run before network is shut down or would I need to run a > daemon listening for PrepareForSleep as here: > https://github.com/davidn/av/blob/master/av ? Usually that's what you do, yes: you take an inhibitor lock while you are running, and wait until you are informed about system suspend, then you do your thing, and release the lock once you are done at which point the suspend continues. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Translating --machine parameter to a service file
On Di, 25.01.22 13:04, Tomáš Hnyk (tomash...@gmail.com) wrote: > Hello, > I want to run a script invoked by udev to run a pactl script. I am now using > a udev rule SUBSYSTEM=="drm", ACTION=="change", > RUN+="/usr/local/bin/my_script" > > which calls (drew is my username): > systemctl --machine=drew@.host --user --now my.service > > > which has: > [Service] > Type=oneshot > ExecStart=/usr/local/bin/my_script.py > > and in the my_script.py, I do what I need. I cannot call my_script.py > directly > from the udev rule because, if I understood it correctly, scripts triggered > by udev run in a limited environment and pactl runs as user. > > Needless to say, this feels rather hackish. Ideally I would use something > like TAG+="systemd", ENV{SYSTEMD_WANTS}="my.service" > > I can specify "User=" in the service file but I could not figure out > to translate the --machine=drew@.host parameter to it. This is not supported. Containers run in their own little world, and generally get their own devices (i.e. just virtual devices such as /dev/null and similar), hence we do not have infra to propagate events to containers. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Udevd and dev file creation
On So, 30.01.22 17:14, Nishant Nayan (nayan.nishant2...@gmail.com) wrote: > I have started reading about udevd. > I was trying to find out if there is a way to play with udev without > plugging in/out any devices. > Is there a way to trigger a uevent without plugging in devices? use "udevadm trigger" to fire uevents for existing devices. Or create new, synthetic virtual devices during runtime, for example via "losetup". Lennart -- Lennart Poettering, Berlin
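Both approaches can be sketched as commands (run as root on a test machine; file and device names are illustrative):

```shell
# Replay "change" uevents for all existing block devices:
udevadm trigger --action=change --subsystem-match=block

# Watch the resulting events in a second terminal:
#   udevadm monitor --kernel --udev

# Or create a synthetic device at runtime, e.g. a loop device
# backed by a plain file:
truncate -s 100M /tmp/disk.img
losetup --find --show /tmp/disk.img
```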
Re: [systemd-devel] sd_bus_process() + sd_bus_wait() is it not suitable for application?
On Sa, 22.01.22 14:08, www (ouyangxua...@163.com) wrote: > Dear all, > > > When using sd_bus_process() + sd_bus_wait() to implement the > application(Service), call the methods function on the service can obtain the > correct information. Run a certain number of times will lead to insufficient > memory and memleak does occur. > > > It should not be a problem with the DBUS method, because a single call does > not increase memory, it needs to call the method 65 ~ 70 times, and you will > see the memory increase. After stopping the call, the memory will not > decrease. It seems that it has nothing to do with the time interval when the > method is called. > > > code implementation: > int main() > { > .. > r = sd_bus_open_system(); > ... > r = sd_bus_add_object_vtable(bus, ..); > .. > r= sd_bus_request_name(bus, "xxx.xx.xx.xxx"); > .. > > > for( ; ; ) > { > r = sd_bus_process(bus, NULL); > ... > r = sd_bus_wait(bus, -1); > .. > } > sd_bus_slot_unref(slot); > sd_bus_unref(bus); > } Maybe the callback handlers you added in the vtable keep some objects pinned? Also note that unreffing the bus in the end is typically not enough, if it still has messages queued. Use sd_bus_flush() + sd_bus_close() first (or combine them in one sd_bus_flush_close_unref()). Otherwise it might happen that messages still not flushed out at the end remain pinned. Lennart -- Lennart Poettering, Berlin
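The teardown order suggested here can be sketched as follows (requires libsystemd, link with -lsystemd; error handling omitted for brevity):

```c
#include <systemd/sd-bus.h>

int main(void) {
        sd_bus *bus = NULL;
        sd_bus_slot *slot = NULL;

        sd_bus_open_system(&bus);
        /* ... sd_bus_add_object_vtable(), sd_bus_request_name(),
         * then the sd_bus_process()/sd_bus_wait() loop ... */

        sd_bus_slot_unref(slot);

        /* Flush queued messages and close the connection before
         * dropping the last reference; a plain sd_bus_unref() can
         * leave unflushed messages pinned in memory. */
        sd_bus_flush_close_unref(bus);
        return 0;
}
```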
Re: [systemd-devel] Initial system date and time set by systemd
On Mo, 03.01.22 13:13, Sergei Poselenov (sposele...@emcraft.com) wrote: > Where systemd takes this date "Tue 2020-07-14 19:19:34 UTC"? It's configurable via the "time-epoch" meson variable at build time. If you don't specify it, the SOURCE_DATE_EPOCH env var is used (defined by the reproducible build folks). If that's not set, then it's the date of the latest git tag in the history of your git checkout, and if the sources didn't come via git but as tarball or so, it's the timestamp of the NEWS file. Or in other words: it's typically the build time of the package, or the time the release of systemd was done. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Fr, 10.12.21 12:25, Chris Murphy (li...@colorremedies.com) wrote: > On Thu, Nov 11, 2021 at 12:28 PM Lennart Poettering > wrote: > > > That said: naked squashfs sucks. Always wrap your squashfs in a GPT > > wrapper to make things self-descriptive. > > Do you mean the image file contains a GPT, and the squashfs is a > partition within the image? Does this recommendation apply to any > image? Let's say it's a Btrfs image. And in the context of this > thread, the GPT partition type GUID would be the "super-root" GUID? Yes, I'd always add a GPT wrapper around disk images. It's simple, extensible and first and foremost self-descriptive: you know what you are looking at, safely, before parsing the fs. It opens the door for adding verity data in a very natural way, and more. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Q: When will WorkingDirectory be checked?
On Fr, 17.12.21 08:11, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote: > Hi! > > I have a simple question: When will WorkingDirectory be checked? > Specifically: Will it be checked before ExecStartPre? I could not > get it form the manual page. What do you mean by "checked"? Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Predictable Network Interface Name Bug?
On Mi, 15.12.21 21:37, Tim Safe (timsafeem...@gmail.com) wrote: > Hello- > > I have an Ubuntu Server 20.04 (systemd 245 (245.4-4ubuntu3.13)) box that I > recently installed a Intel quad-port Gigabit ethernet adapter (E1G44ETBLK). > > It appears that the predictable interface naming is only renaming the first > two interfaces (ens8f0, ens8f1) and the second two fail to be renamed > (eth2, eth3). Consider updating your systemd version to a newer version (or ask your distro to backport the relevant patches). The predictable network interface naming received a number of tweaks since 245, and it's pretty likely this has since been fixed. Specifically, the NAMING_SLOT_FUNCTION_ID feature flag introduced with v249 will likely fix your case. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] [RFC] Switching to OpenSSL 3?
On Di, 23.11.21 11:53, Dimitri John Ledkov (dimitri.led...@canonical.com) wrote: > Just an update from Ubuntu - for the upcoming release of Jammy (22.04 > LTS targeting release in April 2022) we have started transition to > OpenSSL 3 and currently upgrading to systemd v249. Did Ubuntu adopt Debian's stance of accepting OpenSSL as system component? i.e. is OpenSSL 3 compatible with both (L)GPL 2.x code *and* GPL3 code in Ubuntu's eyes? Or only the latter? Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] give unprivileged nspawn container write access to host wayland socket
On Mo, 22.11.21 16:02, Nozz (n...@protonmail.com) wrote: > I recently moved to pure wayland, I want to run a graphical > application in a unprivileged container(user namespace isolation) > . The application needs write access to wayland socket on the host > side. What's the best way to achieve this? I've been able to do > this if I map the host UID/GID range using --private-users=0:65536 > but then there is no namespace isolation. Also I would have to map > the same range to every container and documentation states it's bad > security wise to have it overlap. Well, if you run n containers and all n have the same UID/GID mapping then of course they can access/change each other's resources should they be able to see them. That might or might not be OK. In the upcoming 250 release nspawn bind mounts are changed (if a kernel with uidmap support in the fs layer is available, that is) so that bind mounts placed in the kernel are optionally idmapped, i.e. host UID 0 is mapped to container UID 0 for such bind mounts, instead of "nobody". That should make what you are trying to do pretty easy, as you can mount individual inodes and make them appear under their original ownership. We might want to extend this later on: when bind mounting non-directory inodes (such as sockets) we could even allow fixing ownership to any uid of your choice, to give you full freedom there. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Networking in a systemd-nspawn container
On Fr, 22.10.21 19:54, Tobias Hunger (tobias.hun...@gmail.com) wrote: > Hello Systemd Mailing List! > > I have a laptop and run a couple of systemd-nspawn containers on that > machine. This works great, except that name resolution insode the > containers fails whenever the network on the outside changes. > > This is not too surprising: At setup time the resolver information is > copied into the containers and never updated. That is sup-optimal for > my laptop that I keep moving between networks. > > I have been wondering: Would it be possible to forward the containers > resolver to the host machine resolver somehow? > > Could e.g. systemd-nspawn optionally make the hosts resolver available > in the containers network namespace? Maybe by setting up some port > forwarding or by putting a socket into the container somewhere? > > Any ideas? I can do some of the work with a bit of guidance. You could use DNSStubListenerExtra= in resolved.conf to make the stub listen on some additional container-facing IP address, or multiple of them. But it's not pretty, as it requires manual configuration. Two ideas I recently thought about: 1. Maybe resolved's "stub" logic should support listening on yet another local IP address: 127.0.0.54 or so, where the same stub listens as on 127.0.0.53, but where we unconditionally enable "bypass" mode. This mode in resolved means that we'll not process the messages ourselves very much, but just look at the domains mentioned in it for routing info and then pass the lookup upstream and its answer back almost unmodified. (we'd also not consider the lookups for mdns/llmnr and such). Right now we only enable that mode if we encounter traffic we otherwise don't understand. Thus, if you use that other IP address you can use resolved basically as a proxy towards whatever the current DNS server is, nothing else. (though we'd still translate classic UDP and TCP DNS to DoT if configured) 2. 
Then, teach nspawn to optionally set up nftables/iptables NAT so that port 53 of some veth tunnel IP of the host is automatically NAT'ed to 127.0.0.54:53. That way you then get what you are looking for, as you could then advertise the host-side IP address of your veth tunnel as DNS server unconditionally, and the right thing would happen. (I figure wifi tethering applications could make use of this too?) Lennart -- Lennart Poettering, Berlin
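For the manual DNSStubListenerExtra= route mentioned at the start of this reply, the drop-in would look roughly like this (the address is a made-up host-side veth IP, used only as an example):

```ini
# /etc/systemd/resolved.conf.d/container-stub.conf — example only
[Resolve]
# Make the resolved DNS stub additionally listen on the host side of the
# container's veth link, so containers can use it as their DNS server.
DNSStubListenerExtra=192.168.42.1
```

After a `systemctl restart systemd-resolved`, containers would point their resolv.conf at that address.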
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Do, 18.11.21 15:01, Chris Murphy (li...@colorremedies.com) wrote: > On Thu, Nov 18, 2021 at 2:51 PM Chris Murphy wrote: > > > > How to do swapfiles? > > > > Currently I'm creating a "swap" subvolume in the top-level of the file > > system and /etc/fstab looks like this > > > > UUID=$FSUUID /var/swap btrfs noatime,subvol=swap 0 0 > > /var/swap/swapfile1 none swap defaults 0 0 > > > > This seems to work reliably after hundreds of boots. > > > > a. Is this naming convention for the subvolume adequate? Seems like it > > can just be "swap" because the GPT method is just a single partition > > type GUID that's shared by multiboot Linux setups, i.e. not arch or > > distro specific > > b. Is the mount point, /var/swap, OK? > > c. What should the additional naming convention be for the swapfile > > itself so swapon happens automatically? > > Actually I'm thinking of something different suddenly... because > without user ownership of swapfiles, and instead systemd having domain > over this, it's perhaps more like: > > /x-systemd.auto/swap -> /run/systemd/swap I'd be conservative with mounting disk stuff to /run/. We do this for removable disks because the mount points are kinda dynamic, hence it makes sense, but for this case it sounds unnecessary, /var/swap sounds fine to me, in particular as the /var/ partition actually sounds like the right place for it if /var/swap/ is not a mount point in itself but just a plain subdir. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote: > How to do swapfiles? Is this really a concept that deserves too much attention? I mean, I have the suspicion that half the benefit of swap space is that it can act as backing store for hibernation. But swap files are icky for that since that means the resume code has to mount the fs first, but given the fs is dirty during the hibernation state this is highly problematic. Hence, I have the suspicion that if you do swap you should probably do swap partitions, not swap files, because it can cover all use cases: paging *and* hibernation. > Currently I'm creating a "swap" subvolume in the top-level of the file > system and /etc/fstab looks like this > > UUID=$FSUUID /var/swap btrfs noatime,subvol=swap 0 0 > /var/swap/swapfile1 none swap defaults 0 0 > > This seems to work reliably after hundreds of boots. > > a. Is this naming convention for the subvolume adequate? Seems like it > can just be "swap" because the GPT method is just a single partition > type GUID that's shared by multiboot Linux setups, i.e. not arch or > distro specific I'd still put it one level down, and mark it with some non-typical character so that it is less likely to clash with anything else. > b. Is the mount point, /var/swap, OK? I see no reason why not. > c. What should the additional naming convention be for the swapfile > itself so swapon happens automatically? To me it appears these things should be distinct: if automatic activation of swap files is desirable, then there should probably be a systemd generator that finds all suitable files in /var/swap/ and generates .swap units for them. This would then work with any kind of setup, i.e. independently of the btrfs auto-discovery stuff. The other thing would be the btrfs auto-discovery to then actually mount something there automatically. > Also, instead of /@auto/ I'm wondering if we could have > /x-systemd.auto/ ?
This makes it more clearly systemd's namespace, and > while I'm a big fan of the @ symbol for typographic history reasons, > it's being used in the subvolume/snapshot regimes rather haphazardly > for different purposes which might be confusing? e.g. Timeshift > expects subvolumes it manages to be prefixed with @. Meanwhile SUSE > uses @ for its (visible) root subvolume in which everything else goes. > And still ZFS uses @ for their (read-only) snapshots. I try to keep the "systemd" name out of entirely generic specs, since there are some people who have an issue with that. i.e. this way we tricked even Devuan to adopt /etc/os-release and the /run/ hierarchy, since they probably aren't even aware that these are systemd things. Other chars could be used too: /+auto/ sounds OK to me too. or /_auto/, or /=auto/ or so. Lennart -- Lennart Poettering, Berlin
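The generator idea sketched in the previous message (scan /var/swap/ and emit .swap units) might look roughly like this toy shell version — `generate_swap_units` is a hypothetical helper name, and the path escaping is a crude stand-in for the real `systemd-escape` tool:

```shell
# Toy sketch of a generator that turns files in a swap directory into
# .swap units. Not the real systemd implementation; real code would use
# systemd-escape for correct unit-name escaping.
generate_swap_units() {
    gendir=$1    # generator output directory (argv[1] in the protocol)
    swapdir=$2   # e.g. /var/swap
    for f in "$swapdir"/swapfile*; do
        [ -f "$f" ] || continue
        # crude unit-name escaping: strip the leading /, map / to -
        unit=$(printf '%s' "${f#/}" | tr '/' '-').swap
        printf '[Swap]\nWhat=%s\n' "$f" > "$gendir/$unit"
    done
}

# A real generator would be invoked by PID 1 with three output
# directories and runs very early in boot:
# generate_swap_units /run/systemd/generator.early /var/swap
```

Note that actual generators run before /var is guaranteed to be mounted, which is exactly why Lennart suggests pairing this with the auto-discovery that mounts the subvolume first.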
Re: [systemd-devel] hardware conditional OS boot/load
On Do, 18.11.21 19:16, lejeczek (pelj...@yahoo.co.uk) wrote: > Hi guys. > > I hope an expert(or two) could shed some light - I ain't a kernel nor > hardware expert so go easy on me please - on whether it is possible to boot > system only under certain conditions, meaning: as early as possible (grub?) > and similarly securely, Linux checks for certain hardware, eg. CPU serial > no. and continue to load only when such conditions are met? > > I realize that perhaps kernel devel be the place for such questions but seen > I'm subscriber here, knowing people here are experts of same caliber, I > decided to ask. You can certainly hack something up like this, but to my knowledge none of the boot loaders currently implement something like this. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] How to get array[struct type] using sd_bus_message_* API's
On Fr, 19.11.21 12:31, Manojkiran Eda (manojkiran@gmail.com) wrote: > In the `busctl monitor` i could confirm that i am getting a message of > signature a{sas} from the dbus call, and here is the logic that I could > come up with to read the data. > > r = sd_bus_message_enter_container(reply, SD_BUS_TYPE_ARRAY, "{sas}"); > if (r < 0) > goto exit; > while ((r = sd_bus_message_enter_container(reply, > SD_BUS_TYPE_DICT_ENTRY, > "sas")) > 0) > { > char* service = NULL; > r = sd_bus_message_read(reply, "s", &service); > if (r < 0) > goto exit; > printf("service = %s\n", service); > r = sd_bus_message_enter_container(reply, 'a', "s"); > if (r < 0) > goto exit; > for (;;) > { > const char* s; > r = sd_bus_message_read(reply, "s", &s); > if (r < 0) > goto exit; > if (r == 0) > break; > printf("%s\n", s); > } > } > > Output: > service = xyz.openbmc_project.EntityManager > org.freedesktop.DBus.Introspectable > org.freedesktop.DBus.Peer > org.freedesktop.DBus.Properties > > But, I was only able to get the data from the first dictionary, can anyone > help me to solve this issue? what am I missing? You always need to leave each container again once you read its contents. i.e. each sd_bus_message_enter_container(…) must be paired with sd_bus_message_leave_container(…) Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] How to build a unified kernel for aarch64?
On Fr, 12.11.21 00:00, Zameer Manji (zma...@gmail.com) wrote: > I have noticed there exists a systemd-stub for aarch64, the file > is located at `/usr/lib/systemd/boot/efi/linuxaa64.efi.stub`. > > The manpage instructs users to use `objcopy` to build the unified > kernel from the stub. However when I use objcopy I get the following > error: > > ``` > objcopy: /usr/lib/systemd/boot/efi/linuxaa64.efi.stub: file format not > recognized > ``` > I also noticed the GNU binutils bug tracker indicates that PE executables on > ARM are not yet supported for objcopy [0]. > > How do systemd developers build the unified kernel on aarch64? Is there > an alternative toolchain used? > > [0]: https://sourceware.org/bugzilla/show_bug.cgi?id=26206 I personally never played around with this for anything non-x86-64. But I wonder, maybe llvm-objcopy supports this for aarch64? Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Do, 11.11.21 18:27, Lennart Poettering (mzerq...@0pointer.de) wrote: > A patch for that should be pretty easy to do, and be very generically > useful. I kinda like it. What do you think? For now I added TODO list items for these ideas: https://github.com/systemd/systemd/commit/af11e0ef843c19cbf8ccaefb93a44dbe4602f7a8#diff-337e547a950fc8a98592f10d964c1e79a304961790a8da0ce449a1f000cefabb Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Mi, 10.11.21 10:34, Topi Miettinen (toiwo...@gmail.com) wrote: > > Doing this RootDirectory= would make a ton of sense too I guess, but > > it's not as obvious there: we'd need to extend the setting a bit I > > think to explicitly enable this logic. As opposed to the RootImage= > > case (where the logic should be default on) I think any such logic for > > RootDirectory= should be opt-in for security reasons because we cannot > > safely detect environments where this logic is desirable and discern > > them from those where it isn't. In RootImage= we can bind this to the > > right GPT partition type being used to mark root file systems that are > > arranged for this kind of setup. But in RootDirectory= we have no > > concept like that and the stuff inside the image is (unlike a GPT > > partition table) clearly untrusted territory, if you follow what I am > > babbling. > > My images don't have GPT partition tables, they are just raw squashfs file > systems. So I'd prefer a way to identify the version either by contents of > the image (/@auto/ directory), or something external, like name of the image > (/path/to/image/foo.version-X.Y). Either option would be easy to implement > when generating the image or directory. Hmm, so thinking about this again, I think we might get away with a check "/@auto/ exists and /usr/ does not". i.e. the second part of the check removes any ambiguity: since we unified the OS in /usr it's an excellent way to check if something is or could be an OS tree. That said: naked squashfs sucks. Always wrap your squashfs in a GPT wrapper to make things self-descriptive. > But if you have several RootDirectories or RootImages available for a > service, what would be the way to tell which ones should be tried if there's > no GPT? They can't all have the same name. I think using a specifier (like > %q) would solve this issue nicely and there wouldn't be a need for /@auto/ > in that case. A specifier is resolved at unit file load time only. 
It wouldn't be the right fit here, since we don't want to require that the paths specified in RootDirectory=/RootImage= are already accessible at the time PID 1 reads/parses the unit file. What about this: we could entirely independently of the proposal originally discussed here teach RootDirectory= + RootImage= one magic trick: if the path specified ends in ".auto.d/" (or so) then we'll not actually use the dir/image as-is but assume the path refers to a directory, and we'd pick the newest entry inside it as decided by strverscmp(). Or in other words, we'd establish the general rule that dirs ending in ".auto.d/" contain versioned resources inside, that we could apply here and everywhere else where it fits, too. Of course introducing this rule would be kind of a compat breakage because if anyone happened to have named their dirs like that already we'd suddenly do weird stuff with it that the user might not expect. But I think I could live with that. A patch for that should be pretty easy to do, and be very generically useful. I kinda like it. What do you think? Lennart -- Lennart Poettering, Berlin
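The "pick the newest entry as decided by strverscmp()" rule described above can be approximated in shell with `sort -V` (GNU version sort), using made-up entry names for illustration:

```shell
# Hypothetical contents of a versioned directory; select the newest
# entry the way the proposal describes. sort -V approximates glibc's
# strverscmp() ordering (digit runs compared numerically).
printf '%s\n' fedora_36.0 fedora_36.1 fedora_37.1 | sort -V | tail -n1
# → fedora_37.1
```

This also shows why unversioned names keep working: with a single entry there is simply nothing to compare.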
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Di, 09.11.21 19:48, Topi Miettinen (toiwo...@gmail.com) wrote: > > i.e. we'd drop the counting suffix. > > Could we have this automatic versioning scheme extended also to service > RootImages & RootDirectories as well? If the automatic versioning was also > extended to services, we could have A/B testing also for RootImages with > automatic fallback to last known good working version. At least in the case of RootImage= this was my implied assumption: we'd implement the same there, since that uses the exact same code as systemd-nspawn's image dissection and we definitely want it there. Doing this RootDirectory= would make a ton of sense too I guess, but it's not as obvious there: we'd need to extend the setting a bit I think to explicitly enable this logic. As opposed to the RootImage= case (where the logic should be default on) I think any such logic for RootDirectory= should be opt-in for security reasons because we cannot safely detect environments where this logic is desirable and discern them from those where it isn't. In RootImage= we can bind this to the right GPT partition type being used to mark root file systems that are arranged for this kind of setup. But in RootDirectory= we have no concept like that and the stuff inside the image is (unlike a GPT partition table) clearly untrusted territory, if you follow what I am babbling. Or in other words: to enable this for RootDirectory= we probably need a new option RootDirectoryVersioned= or so that takes a boolean. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Di, 09.11.21 14:48, Ludwig Nussel (ludwig.nus...@suse.de) wrote: > > and so on. Until boot succeeds in which case we'd rename it: > > > >/@auto/root-x86-64:fedora_36.0 > > > > i.e. we'd drop the counting suffix. > > Thanks for the explanation and pointer! > > Need to think aloud a bit :-) > > That method basically works for systems with read-only root. Ie where > the next OS to boot is in a separate snapshot, eg MicroOS. > A traditional system with rw / on btrfs would stay on the same subvolume > though. Ie the "root-x86-64:fedora_36.0" volume in the example. In > openSUSE package installation automatically leads to ro snapshot > creation. In order to fit in I suppose those could then be named eg. > "root-x86-64:fedora_36.N+0" with increasing N. Due to the +0 the > subvolume would never be booted. > > Anyway, let's assume the ro case and both efi partition and btrfs volume > use this scheme. That means each time some packages are updated we get a > new subvolume. After reboot the initrd in the efi partition would try to > boot that new subvolume. If it reaches systemd-bless-boot.service the > new subvolume becomes the default for the future. > > So far so good. What if I discover later that something went wrong > though? Some convenience tooling to mark the current version bad again > would be needed. In the sd-boot/kernel case any time you like you can rename an entry to "…+0" to mark it as "bad", you could drop the suffix to mark it as "good" or you could mark it as "+3" to mark it as "dont-know/try-again". Now, at least in theory we could declare the same for this new directory auto-discovery scheme. But I am not entirely sure this will work out trivially IRL because I have the suspicion one cannot rename subvolumes which are the source of a bind mount (i.e. once you boot into one root subtree, then it might be impossible to rename that top-level inode without rebooting first). Would be something to try out. 
If it doesn't work it might suffice to move things one level down, i.e. that the dir that actually becomes root is /@auto/root-x86-64:fedora_36.0/payload/ or so, instead of just /@auto/root-x86-64:fedora_36.0/. I think that that would work, and might be desirable anyway so that the enumeration of entries doesn't already leak fs attributes/ownership/access modes/… of the actual root fs. > But then having Tumbleweed in mind it needs some capability to boot any > old snapshot anyway. I guess the solution here would be to just always > generate a bootloader entry, independent of whether a kernel was > included in an update. Each entry would then have to specify kernel, > initrd and the root subvolume to use. > This approach would work with a separate usr volume also. In that case > kernel, initrd, root and usr volume need to be linked by means of a > bootloader entry. For the GPT case if you want to bind a kernel together with a specific root fs, you'd do this by specifying 'root=PARTLABEL=fooos_0.3' on the kernel cmdline. I'd take inspiration from that and maybe introduce 'rootentry=fedora_36.2' or so which would then be honoured by the logic we are discussing here, and would hard override which subdir to use, regardless of versioning preference, assessment counting and so on. (Yeah, the subvol= mount option for btrfs would work too, but as mentioned I'd keep this reasonably independent of btrfs where it's easy, plain dirs otherwise are fine too after all. Which reminds me, recent util-linux implements the X-mount.subdir= mount option, which means one could also use 'rootflags=X-mount.subdir=@auto/fedora_36.2' as a non-btrfs-specific way to express the btrfs-specific 'rootflags=subvol=@auto/fedora_36.2') Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Mo, 08.11.21 14:24, Ludwig Nussel (ludwig.nus...@suse.de) wrote: > Lennart Poettering wrote: > > [...] > > 3. Inside the "@auto" dir of the "super-root" fs, have dirs named > ><type>[:<name>]. The type should have a similar vocabulary > >as the GPT spec type UUIDs, but probably use textual identifiers > >rather than UUIDs, simply because naming dirs by uuids is > >weird. Examples: > > > >/@auto/root-x86-64:fedora_36.0/ > >/@auto/root-x86-64:fedora_36.1/ > >/@auto/root-x86-64:fedora_37.1/ > >/@auto/home/ > >/@auto/srv/ > >/@auto/tmp/ > > > >Which would be assembled by the initrd into the following via bind > >mounts: > > > >/ → /@auto/root-x86-64:fedora_37.1/ > >/home/ → /@auto/home/ > >/srv/ → /@auto/srv/ > >/var/tmp/ → /@auto/tmp/ > > > > If we do this, then we should also leave the door open so that maybe > > ostree can be hooked up with this, i.e. if we allow the dirs in > > /@auto/ to actually be symlinks, then they could put their ostree > > checkouts wherever they want and then create a symlink > > /@auto/root-x86-64:myostreeos pointing to it, and their image would be > > spec conformant: we'd boot into that automatically, and so would > > nspawn and similar things. Thus they could switch their default OS to > > boot into without patching kernel cmdlines or such, simply by updating > > that symlink, and vanilla systemd would know how to rearrange things. > > MicroOS has a similar situation. It edits /etc/fstab. MicroOS is a SUSE thing? > Anyway in the above example I guess if you install some updates you'd > get eg root-x86-64:fedora_37.2, .3, .4 etc? Well, the spec wouldn't mandate that. But yeah, the idea is that you could do it like that if you want. What's important is to define the vocabulary to make this easy and possible, but of course, whether people follow such an update scheme is up to them.
I mean, it's the same as with the GPT auto discovery logic: it already implements such a versioning scheme because it's easy to implement, but if you don't want to take benefit of the versioning, then don't, it's fine regardless. The logic we'd define here is about *consuming* available OS root filesystems, not about *installing* them, after all. The GPT auto-discovery thing basically does an strverscmp() on the full GPT partition label string, i.e. it does not attempt to split a name from a version, but assumes strverscmp() will handle a common prefix nicely anyway. I'd do it the exact same way here: if there are multiple options, then pick the newest as per strverscmp(), but that also means it's totally fine to not version your stuff: instead of calling it "root-x86-64:fedora_37.3" you could also just name it "root-x86-64:fedora" if you like, and then not have any versioning. > I suppose the autodetection is meant to boot the one sorted last. What > if that one turns out to be bad though? How to express rollback in that > model? Besides the GPT auto-discovery where versioning is implemented the way I mentioned, there's also the sd-boot boot loader which does roughly the same kind of OS versioning with the boot entries it discovers. So right now, you can already choose whether: 1. you want to do OS versioning on the boot loader entry level: name your EFI binary fooos-0.1.efi (or fooos-0.1.conf, as defined by the boot loader spec) and similar and the boot loader automatically picks it up, makes sense of it and boots the newest version installed. 2. you want to do OS versioning on the GPT partition table level: name your partitions "fooos-0.1" and similar, with the right GPT type, and tools such as systemd-nspawn, systemd-dissect, portable services, RootImage= in service unit files all will be able to automatically pick the newest version of the OS among the ones in the image. and now: 3.
If we implement what I propose above then you could do OS versioning on the file system level too. (Or you could do a combination of the above, if you want — which is highly desirable I think in case you want a universal image that can boot on bare metal and in nspawn in a nice versioned way.) Now, in sd-boot's versioning logic we implement an automatic boot assessment logic on top of the OS versioning: if you add a "+x-y" string into the boot entry name we use it as x=tries-left and y=tries-done counters. i.e. fooos-0.1+3-0.efi is semantically the same as fooos-0.1.efi, except that there are 3 attempts left and 0 done yet. On each boot attempt the boot loader decreases x and increases y. i.e. fooos-0.1+3-0.efi → fooos-0.1+2-1.efi → fooos-0.1+1-2.efi → fooos-0.1+0-3.efi. If a boot succeeds the two counters are dropped from the filename, i.e. fooos-0.1+0-3.efi is renamed back to fooos-0.1.efi.
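The counter handling described above can be sketched as a small shell function — `tick` is a hypothetical name for illustration; sd-boot performs this rename internally:

```shell
# One boot attempt: decrement tries-left, increment tries-done in a
# boot entry name of the form <entry>+<left>-<done>.efi.
tick() {
    case $1 in
        *+*-*.efi)
            base=${1%+*}                 # e.g. fooos-0.1
            counters=${1##*+}            # e.g. 3-0.efi
            left=${counters%%-*}         # tries left, e.g. 3
            rest=${counters#*-}          # e.g. 0.efi
            tries_done=${rest%.efi}      # tries done, e.g. 0
            if [ "$left" -gt 0 ]; then
                printf '%s+%s-%s.efi\n' "$base" $((left - 1)) $((tries_done + 1))
            else
                printf '%s\n' "$1"       # out of tries: entry stays "bad"
            fi
            ;;
        *)  printf '%s\n' "$1" ;;        # no counters: entry counts as "good"
    esac
}

tick fooos-0.1+3-0.efi   # → fooos-0.1+2-1.efi
```

Marking an entry "good" after a successful boot then amounts to renaming it without the "+x-y" suffix, which is what systemd-bless-boot does.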
Re: [systemd-devel] the need for a discoverable sub-volumes specification
e could do its thing in some other subdir of the root fs if it wants to) 3. Inside the "@auto" dir of the "super-root" fs, have dirs named <type>[:<name>]. The type should have a similar vocabulary as the GPT spec type UUIDs, but probably use textual identifiers rather than UUIDs, simply because naming dirs by uuids is weird. Examples: /@auto/root-x86-64:fedora_36.0/ /@auto/root-x86-64:fedora_36.1/ /@auto/root-x86-64:fedora_37.1/ /@auto/home/ /@auto/srv/ /@auto/tmp/ Which would be assembled by the initrd into the following via bind mounts: / → /@auto/root-x86-64:fedora_37.1/ /home/ → /@auto/home/ /srv/ → /@auto/srv/ /var/tmp/ → /@auto/tmp/ If we do this, then we should also leave the door open so that maybe ostree can be hooked up with this, i.e. if we allow the dirs in /@auto/ to actually be symlinks, then they could put their ostree checkouts wherever they want and then create a symlink /@auto/root-x86-64:myostreeos pointing to it, and their image would be spec conformant: we'd boot into that automatically, and so would nspawn and similar things. Thus they could switch their default OS to boot into without patching kernel cmdlines or such, simply by updating that symlink, and vanilla systemd would know how to rearrange things. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] [EXT] Question about timestamps in the USER_RECORD spec
On Do, 28.10.21 11:46, Arian van Putten (arian.vanput...@gmail.com) wrote: > Indeed it mentions it; but after careful reading there is no normative > suggestion to actually adhere to it. (no SHOULD and definitely not a MUST, > not even a RECOMMENDED). > > They just say that to increase interoperability no more than 53 bits of > integer precision should be assumed without making a clear normative > decision about it. The only normative part in that section is that > numbers consist of an integer part and a fractional part. > > They also say that implementations are allowed to set any limits on the > range and precision of numbers accepted. > > So yeah Lennart seems to be technically correct. Even when reading the RFC > by the letter. BTW: https://github.com/systemd/systemd/pull/21168 Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Question about timestamps in the USER_RECORD spec
On Di, 26.10.21 10:41, Arian van Putten (arian.vanput...@gmail.com) wrote: > Hey list, > > I'm reading the https://systemd.io/USER_RECORD/ spec and I have a question > > There are some fields in the USER_RECORD spec which are described as > "unsigned 64 bit integer values". Specifically the fields describing > time. > > However JSON lacks integers and only has doubles [0]; which would mean 53 > bit integer precision is about the maximum we can reach. The spec itself doesn't really mandate that this be implemented with doubles, the spec just says "sticking to doubles would be nice". Actual implementations implement this differently IRL. Python-based implementations have arbitrary precision for this. sd-bus uses uint64_t, or int64_t or long double. It prefers the integer types if the value fits, and uses the floating point type otherwise. json-glibc uses int64_t or double. There are plenty of specs that rely on 64bit integers working with full range (OCI for example, for much of its resource management stuff). > It's unclear to me from the spec whether I should use doubles to > encode these fields or use strings. Would it be possible to further > clarify it? If it is indeed a number literal; this means the > maximum date we can encode is 9007199254740991 which corresponds to > Tuesday, June 5, 2255 . This honestly is too soon in the future for > my comfort. You appear to plan for quite a long life ;-) Frankly, this is not a problem specific to user records. A multitude of JSON formats tend to store dates this way. The overflow is still 200y out. I think that leaves plenty of time to teach implementations full 64bit support, and I am pretty sure that the ones that will deal with user records will catch up sooner or later. > I suggest encoding 64 > bit integers as string literals instead to avoid the truncation > problem. I am sorry, but I am not convinced this is a pressing issue.
I value cleanliness and obviousness a lot more than theoretical issues that might happen 200 years from now. In particular as they are issues that can be dealt with in offending JSON implementations, and limitations in the parser implementations shouldn't really leak into the spec I think. I mean, at this point it isn't even clear humanity will survive that long, and I seriously doubt that systemd is the one project that survives humanity. What might make sense is to add a comment about the whole situation to the spec and be done with it: "Please note that this specification assumes that JSON numbers may cover the full integer range of -2^63 … 2^64-1 without loss of accuracy (i.e. INT64_MIN … UINT64_MAX). Please read, write and process user records following this specification only with JSON implementations that guarantee this range." Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] A questions about modules-load service in systemd
On Sa, 23.10.21 02:27, Joakim Zhang (qiangqing.zh...@nxp.com) wrote: > > It doesn't do that actually. But udev when it loads kernel modules does > > things > > from a bunch of worker processes all in parallel. > > Ok, is there a way to disable this parallel tasks in systemd-udev > service? udev.children_max=1 on the kernel command line. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] loose thoughts around portable services
On Mi, 20.10.21 16:01, Umut Tezduyar Lindskog (u...@tezduyar.com) wrote: > > That said: systemd's nss-systemd NSS module can nowadays (v249) read > > user definitions from drop-in JSON fragments in > > /run/host/userdb/. This is used by nspawn's --bind-user= feature to > > make a host user easily available in a container, with group info, > > password and so on. My plan was to also make use of this in the unit > > executor, i.e. so that whenever RootDirectory=/RootImage= are used the > > service manager places such minimal user info for the selected user > > there, so that the user is perfectly resolvable inside the service > > too. This is particularly relevant for DynamicUser=1 services. I > > haven't gotten around to actually implementing that though. Given > > nss-systemd is enabled in most bigger distros' nsswitch.conf files > > these days I think this is a really nice approach to propagate user > > databases like that. > > > > Why don't we also make the varlink user API available to most of the > profiles? This way sandboxed service doesn't need any of the nss conf and > libraries if they don't want to. Most profiles allow dbus communication. I > guess in a similar thought, most system services should be able to do a > user lookup in a modern way. I sympathize with the idea, but I am not entirely sure it is desirable to do this 1:1, as this means we'd leak a ton of stuff that might only make sense on the host into something that is supposed to be an isolated container, i.e. home dir info and things like that, shell paths and so on. Maybe we can find a middle ground on this though, i.e. we could make systemd-userdb.service listen on a new varlink service socket that provides the host's database to sandboxed environments in a restricted form, i.e. with basically all records dumbed down to just contain uid/gid/name info and nothing else.
We'd then update the portabled profiles that do not use PrivateUsers= to bind mount that one socket, so that they get the full db, dynamically. I kinda like the idea. > We could implement our own profiles without needing nesting but we believe > it is beneficial to collaborate on profiles upstream and have common > additions to upstream profiles with nesting other profiles. If we get to it > before other people, we would really like to contribute and send a patch on > this. A patch adding .d/ style drop-ins for profiles would make a ton of sense. Happy to take that. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] A questions about modules-load service in systemd
On Fr, 22.10.21 10:31, Joakim Zhang (qiangqing.zh...@nxp.com) wrote: > > Hi systemd experts, > > I saw you guys made many contributions to the modules-load part recently, I have a > question; any insight you could provide would be appreciated, thanks in advance. > > Do you know how to load all modules in a single task? In other > words, load all modules within a single task as I want them processed > sequentially. Are you sure you mean "systemd-modules-load"? Most module loading happens via udev, not systemd-modules-load. That service is only required for a few select modules that do not support auto-loading. udev loads all modules as the hw they are for shows up. And no, there's no way to make that sequential. Why do you need this? For debugging purposes? To work around a broken driver? > If I understand correctly, systemd-modules-load service now will > fork many tasks to process different kernel modules in parallel. It doesn't do that actually. But udev when it loads kernel modules does things from a bunch of worker processes all in parallel. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] loose thoughts around portable services
On Mi, 13.10.21 13:38, Umut Tezduyar Lindskog (umut.tezdu...@axis.com) wrote: > Hi, we have been playing around more with the portable services and > lots of loose thoughts came up. Hopefully we can initiate > discussions. > > The PrivateUsers and DynamicUsers are turned off for the trusted > profile in portable services but none of the passwd/group and nss > files are mapped to the sandbox by default, essentially preventing > the sandbox from doing a user look-up. Is this a use case that should be > offered by the "trusted" profile or should this be handled by the > services that would like to do a look-up? The "trusted" profile basically means you dealt with that synchronization yourself in some way. That said: systemd's nss-systemd NSS module can nowadays (v249) read user definitions from drop-in JSON fragments in /run/host/userdb/. This is used by nspawn's --bind-user= feature to make a host user easily available in a container, with group info, password and so on. My plan was to also make use of this in the unit executor, i.e. so that whenever RootDirectory=/RootImage= are used the service manager places such minimal user info for the selected user there, so that the user is perfectly resolvable inside the service too. This is particularly relevant for DynamicUser=1 services. I haven't gotten around to actually implementing that though. Given nss-systemd is enabled in most bigger distros' nsswitch.conf files these days I think this is a really nice approach to propagate user databases like that. > Is there a way to have PrivateUsers=yes and map more host users to > the sandbox? We have dynamic, uid based authorization on dbus > methods. Upon receiving a method, the server checks the sender uid > against a set of rule files. I guess we could add BindUser= or so, which could control the /run/host/userdb/ propagation I proposed above. > Would it benefit others if the "profile" support was moved out of > the portable services and made part of the unit files?
> For example, part of the [Install] section. Right now profiles are a concept of portabled, not of the service manager. There's a github issue somewhere where people asked us to make this generically usable from services too, so I guess you are not the only one who'd like something like that. > Has there been any thought about nesting profiles? Example, one > profile can include other profiles in it. File an RFE issue. I guess we could support that: for any profile x we'd implicitly also pull in x.d/*.conf, or so. > Systemd analyze security is great! We believe it would be easier to > audit if we had a way to compare a service file’s sandboxing > directives against a profile and find the delta. Then score the > service file against the delta. Interesting idea. Current git has all kinds of JSON hookup for systemd-analyze security btw, so tools could do that externally too. But you are right, doing this implicitly might indeed make sense. Please file an RFE issue on github. Lennart -- Lennart Poettering, Berlin
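The drop-in fragments in /run/host/userdb/ mentioned in the first answer use systemd's JSON user-record format. A minimal sketch of composing such a record follows; the field names are taken from the JSON User Records specification, but the example user and the `<name>.user` file-naming convention shown in the comment are illustrative assumptions, not guaranteed details:

```python
import json

def make_user_record(name: str, uid: int, gid: int, home: str,
                     shell: str = "/bin/sh") -> str:
    """Compose a minimal JSON user record. Field names follow
    systemd's JSON User Records spec; many more fields are optional."""
    return json.dumps({
        "userName": name,
        "uid": uid,
        "gid": gid,
        "homeDirectory": home,
        "shell": shell,
    }, indent=2)

# A host user propagated into a container would be dropped in as
# something like /run/host/userdb/lennart.user (naming illustrative):
record = make_user_record("lennart", 1000, 1000, "/home/lennart")
print(record)
```

nss-systemd then resolves such records like any other NSS source, which is what makes the --bind-user= propagation work without touching /etc/passwd in the container.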
Re: [systemd-devel] Removing bold fonts from boot messages
On Mi, 13.10.21 18:29, Frank Steiner (fsteiner-ma...@bio.ifi.lmu.de) wrote: > Ulrich Windl wrote: > > > Stupid question: If you see bold face at the end of the serial line, > > wouldn't > > changing the terminal type ($TERM) do? > > Maybe construct your own terminal capabilities. > > I'd need a TERM that has colors but disallows bold fonts. For some > reason I wasn't even able to construct a terminfo that would disallow > colors when using that $TERM inside xterm (and starting a new bash). > It seems that xterm always has certain capabilities, i.e. "ls --color" > is always showing colors in xterm, also with TERM=xterm-mono and > everything else I tried. > > Anyway, setting a translation to bind "allow-bold-fonts(toggle)" > to a key in xterm resources allows blocking bold fonts whenever > watching systemd boot messages via ipmi or AMT in a xterm... Note that systemd doesn't care about terminfo/termcap or anything like that. We only support exactly three types of terminals: 1. TERM=dumb → you get no ANSI sequences, no fancy emojis or other non-ASCII unicode chars, no clickable links. 2. TERM=linux → you do get ANSI sequences and some simpler known-safe unicode chars, but no fancy emojis and no clickable links (TERM=linux is the Linux console/VT subsystem). 3. everything else → you get ANSI sequences, fancy emojis, fancy unicode chars, clickable links. And that's really it. It's 2021 and so far this has been unproblematic. The ANSI sequences we use aren't crazy exotic stuff but pretty much baseline and virtually any terminal from the last 25 years probably supports them. You can turn these features off individually, too. SYSTEMD_COLORS=0 → no ANSI color sequences (alternatively: "NO_COLOR=1" as per https://no-color.org/) SYSTEMD_EMOJI=0 → no unicode emojis LC_CTYPE=ANSI_X3.4-1968 → no non-ASCII chars (which also means no emojis) SYSTEMD_URLIFY=0 → no clickable links Lennart -- Lennart Poettering, Berlin
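The three-tier behavior described above can be sketched as a small decision function. This is purely illustrative, not systemd's actual code; the real logic lives in systemd's terminal utilities and additionally honors the environment variables listed above (SYSTEMD_COLORS, NO_COLOR, SYSTEMD_EMOJI, LC_CTYPE, SYSTEMD_URLIFY):

```python
def terminal_features(term: str) -> dict:
    """Map $TERM to the three feature tiers systemd distinguishes
    (simplified sketch; env-var overrides are omitted here)."""
    if term == "dumb":
        # Tier 1: no ANSI sequences, ASCII only, no clickable links.
        return {"ansi": False, "emoji": False, "links": False}
    if term == "linux":
        # Tier 2: Linux console/VT -- ANSI yes, only simple
        # known-safe unicode, no emojis, no clickable links.
        return {"ansi": True, "emoji": False, "links": False}
    # Tier 3: everything else gets the full treatment.
    return {"ansi": True, "emoji": True, "links": True}

print(terminal_features("dumb"))
print(terminal_features("xterm-256color"))
```

The point of the scheme is that terminfo/termcap never enters the picture: the decision keys off the literal $TERM value only.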
Re: [systemd-devel] troubleshooting Clevis
On Di, 12.10.21 16:17, lejeczek (pelj...@yahoo.co.uk) wrote: > > > I have 'clevis' set to get luks pin from 'tang' but unlock does not happen > > > at/during boot time and I wonder if someone can share thoughts on how to > > > investigate that? > > > I cannot see anything obvious fail during boot, moreover, manual > > > 'clevis-luks-unlock' works with no problems. > > This is the systemd mailing list, not the clevis/tang mailing > > list. Please contact the clevis/tang community instead. > > May I ask if there are any plans where systemd would, somehow similarly to > 'tpm', utilize 'tang' (or a similar) technique to unlock luks encrypted > devices? You mean that networked unlock feature? I mean, it's not always clear what belongs in systemd and what does not. But outside of data centers I am not sure tang/clevis really has much use, and that's quite a limited userbase, so I'd say: no, this should be done outside of systemd. Maybe a plugin for libcryptsetup's "token" feature. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Removing bold fonts from boot messages
On Di, 12.10.21 12:09, Frank Steiner (fsteiner-ma...@bio.ifi.lmu.de) wrote: > Hi, > > after upgrading from SLES 15 SP2 (systemd 2.34) to SP3 (systemd 2.46) > the boot messages are not only colored (which I like for seeing failures > in red) but partially printed in bold face. This makes messages indeed > harder to read on serial console (with amt or ipmi), so I wonder if > there is a place where the ascii sequences for colors and font faces > are defined and can be adjusted? Sounds like a graphics issue in your terminal emulator, no? > Or is there some option to remove the bold face only, but not the colors? > systemd.log_color=0 removes all formatting, but I'd like to keep the > colors... No, this is not configurable. We are not a themeable desktop, sorry. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Tampering with the Logging Data when Knowing the Verification Key / Time Synchronization
On Mo, 11.10.21 17:08, Andreas Krueger (andreas.krue...@fmc-ag.com) wrote: > Hi Folks, > > > I am currently working in an embedded project that uses Journal for logging. > The logging data shall be protected by the Journal's sealing mechanism FSS > and for various reasons the verification key is located unprotected in memory. > > Regarding this constellation, my first question is this: > > If an attacker knows the verification key, is he able to modify the > logging data in such a way that its tampering remains undetected, > even if this has happened e.g. one day ago (which means that several > new sealing keys have been generated in the meantime)? Yes, the verification key should be kept secret. (The text output when it is generated should make this very clear, actually.) If you don't keep it secret, then all bets are off, the construction of the underlying cryptography does not work then. > Since sealing is always done for a time interval, my second question is this: > > What will happen to the logging data and sealing mechanism when the > system clock is suddenly modified? This can e.g. happen when the > board starts first with a default time value and is then synchronized > after a while by a time daemon. The sealing key is "evolved" based on time (which means a new key is generated from the old and the old one is securely deleted). When time jumps forward, this scheme automatically keeps up, and if needed will evolve a number of steps at once, as necessary. If time jumps backwards things are more problematic though: the key appropriate for the old time has likely already been generated, and while a newer key can be derived from the old one, an older one cannot be derived from the new (this fact is after all the whole point of the exercise). For cases like this it might make sense to ensure that flushing of the journal to disk (i.e. systemd-journald-flush.service) is scheduled after correct time has been acquired (i.e. time-sync.target). 
Lennart -- Lennart Poettering, Berlin
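The one-way "evolution" property described above is the crux of the time-jump discussion. FSS itself uses a dedicated seekable forward-secure construction, not the plain hash ratchet below; this sketch only illustrates why a newer key can be derived from an older one but never the reverse:

```python
import hashlib

def evolve(key: bytes, steps: int = 1) -> bytes:
    """Derive the sealing key `steps` epochs later. Because SHA-256
    is one-way, someone holding only a newer key cannot recover any
    earlier one. (Illustrative hash ratchet, NOT the actual FSS
    construction used by journald.)"""
    for _ in range(steps):
        key = hashlib.sha256(key).digest()
    return key

k0 = b"\x00" * 32      # initial sealing key (example value only)

# Clock jumps forward: catch up several epochs at once, as journald does.
k3 = evolve(k0, 3)

# Evolving step by step reaches the same key as one big jump.
assert evolve(evolve(k0, 2), 1) == k3
```

A backwards clock jump has no such remedy: the epoch key the "old" time calls for was already derived and securely deleted, which is why scheduling the journal flush after time-sync.target helps.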
Re: [systemd-devel] dm-integrity volume with TPM key?
On Fr, 08.10.21 21:15, Sebastian Wiesner (sebast...@swsnr.de) wrote: > On Monday, 04.10.2021 at 14:49 +0200, Lennart Poettering wrote: > > On Do, 30.09.21 21:20, Sebastian Wiesner (sebast...@swsnr.de) wrote: > > > > > Hello, > > > > > > thanks for the quick reply, I guess this explains the lack of > > > instructions > > > > btw, coincidentally this was posted on github on the day you posted > > this: > > > > https://github.com/systemd/systemd/pull/20902 > > > > so hopefully we'll have the missing tools in place soon too. > > Great, so it looks as if everything's in place with systemd 250 > perhaps? Dunno, we'll see; possibly, if the submitter rolls another revision, but it all depends on that. Would love to see this happen, but right now the ball is in the court of the submitter of that PR. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] [systemd]: How to set systemd not to generate loop0.device and mtdblockx.device?
On Sa, 09.10.21 11:27, www (ouyangxua...@163.com) wrote: > systemd version: V242 > > In our system, the whole machine starts too slowly. We want to do > some optimization. I found that two services (loop0.device and > mtdblock5.device) started slowly. I want to remove them (I > personally think our system does not need them). I want to ask you > how to avoid generating these two device files and not start them? /dev/loop0 is a loopback block device. It's probably needed by some tool you are using. /dev/mtdblock5 is some physical hw you have. And it's probably mounted by something you are using. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Antw: [EXT] Re: [systemd‑devel] Q: write error, watchdog, journald core dump, ordering of entries
On Mo, 11.10.21 10:57, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote: > > Now when journald hangs due to some underlying IO issue, then it might > > miss the watchdog deadline, and PID 1 might then kill it to get it > > back up. It will log about this to the journal, but given that the > > journal is hanging/being killed it's not going to write the messages > > to disk, the messages will remain queued in the logging socket for a > > bit. Until eventually journald starts up again, and resumes processing > > log messages. It will then process the messages already queued in the > > sockets from when it was hanging, and thus the order might be > > surprising. > > Hi! > > Thanks for explaining. > Don't you have some OOB-logging, that is: Log a message before processing the > queued logs. The "Journal started" message is inserted into the log stream by journald itself before processing the already queued messages. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Q: write error, watchdog, journald core dump, ordering of entries
On Mi, 06.10.21 10:29, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote: > Hi! > > We had a stuck network card on a server that seems to have caused the RAID > controller with two SSDs to be stuck on write as well. > Anyway, journald dumped core with this stack: > Oct 05 20:13:25 h19 systemd-coredump[26759]: Process 3321 (systemd-journal) > of user 0 dumped core. > Oct 05 20:13:25 h19 systemd-coredump[26759]: Coredump diverted to > /var/lib/systemd/coredump/core.systemd-journal.0.a4eb19afcc314d99936cbdd5542e4fed.3321.163345758500.lz4 > Oct 05 20:13:25 h19 systemd-coredump[26759]: Stack trace of thread 3321: > Oct 05 20:13:25 h19 systemd-coredump[26759]: #0 0x7f913492d0c2 > journal_file_append_object (libsystemd-shared-234.so) > Oct 05 20:13:25 h19 systemd-coredump[26759]: #1 0x7f913492dba3 n/a > (libsystemd-shared-234.so) > Oct 05 20:13:25 h19 systemd-coredump[26759]: #2 0x7f913492fc79 > journal_file_append_entry (libsystemd-shared-234.so) > Oct 05 20:13:25 h19 systemd-coredump[26759]: #3 0x557fe532908d n/a > (systemd-journald) > Oct 05 20:13:25 h19 systemd-coredump[26759]: #4 0x557fe532b15f n/a > (systemd-journald) > Oct 05 20:13:25 h19 systemd-coredump[26759]: #5 0x557fe5324664 n/a > (systemd-journald) > Oct 05 20:13:25 h19 systemd-coredump[26759]: #6 0x557fe5326a80 n/a > (systemd-journald) > Oct 05 20:13:25 h19 kernel: printk: systemd-coredum: 6 output lines > suppressed due to ratelimiting > > (systemd-234-24.90.1.x86_64 of SLES15 SP2 on x86_64) > > journald seems to have restarted later, but I wonder about the ordering of > the entries following: > Oct 05 20:13:25 h19 systemd-journald[26760]: Journal started > Oct 05 20:13:25 h19 systemd-journald[26760]: System journal > (/var/log/journal/8695c89eb080463dad2ca9f9aaedf162) is 928.0M, max 4.0G, 3.0G > free. > > Oct 05 20:12:52 h19 systemd[1]: systemd-journald.service: Watchdog timeout > (limit 3min)! 
> Oct 05 20:12:52 h19 systemd[1]: systemd-journald.service: Killing process > 3321 (systemd-journal) with signal SIGABRT. > Oct 05 20:13:25 h19 systemd[1]: Starting Flush Journal to Persistent > Storage... > Oct 05 20:13:25 h19 systemd[1]: Started Flush Journal to Persistent Storage. > > I don't understand why the core dump is logged before the signal > being sent and the watchdog timeout. PID 1 logs to journald. PID 1 also runs and supervises journald. That's quite a special relationship: PID 1 is both a client to journald and its manager. Now when journald hangs due to some underlying IO issue, then it might miss the watchdog deadline, and PID 1 might then kill it to get it back up. It will log about this to the journal, but given that the journal is hanging/being killed it's not going to write the messages to disk, the messages will remain queued in the logging socket for a bit. Until eventually journald starts up again, and resumes processing log messages. It will then process the messages already queued in the sockets from when it was hanging, and thus the order might be surprising. -- Lennart Poettering, Berlin
Re: [systemd-devel] dm-integrity volume with TPM key?
On Do, 30.09.21 21:20, Sebastian Wiesner (sebast...@swsnr.de) wrote: > Hello, > > thanks for the quick reply, I guess this explains the lack of > instructions btw, coincidentally this was posted on github on the day you posted this: https://github.com/systemd/systemd/pull/20902 so hopefully we'll have the missing tools in place soon too. > As a workaround you'd use a regular file key for dm-integrity and put > that on a TPM-protected partition, if I understand you correctly? Yes. > I.e. you'd > > 1. enable secureboot (custom keys or shim), > 2. bundle kernel & initrd into signed UEFI image for systemd-boot, > 3. make / a LUKS-encrypted partition with systemd-cryptenroll, bound to > the TPM (perhaps PCR 0 and 7) and unlocked automatically at boot, Only PCR 7, for the reasons explained in the blog story. > 4. make /home a dm-integrity partition, with a regular keyfile from > e.g. /etc/integrity.key (which is on the encrypted partition), and actually, after thinking a bit more about this I figure the ultimate path for this would be /etc/integritysetup-keys.d/home.key – because we already implemented in systemd-cryptsetup a scheme where we search for the encryption key for volume xyz in /etc/cryptsetup-keys.d/xyz.key, and we should probably do it similarly for verity keys, too. > 5. use homed for LUKS-encrypted home areas on /home? > > Does this sound reasonable? Yes! Lennart -- Lennart Poettering, Berlin
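The per-volume key-file scheme described above (/etc/cryptsetup-keys.d/xyz.key, and analogously /etc/integritysetup-keys.d/home.key) boils down to a random blob that only root can read. A minimal sketch of generating one; it writes to a temp directory so it runs unprivileged, and the 32-byte size and 0600 mode are sensible conventions here, not requirements of the tools:

```python
import os
import tempfile

def write_keyfile(path: str, size: int = 32) -> None:
    """Generate a random binary key file, created exclusively and
    readable by its owner only (mode 0600)."""
    key = os.urandom(size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, key)
    finally:
        os.close(fd)

# A real deployment would target e.g. /etc/integritysetup-keys.d/home.key;
# a temp dir keeps this sketch runnable without privileges.
with tempfile.TemporaryDirectory() as d:
    keypath = os.path.join(d, "home.key")
    write_keyfile(keypath)
    st = os.stat(keypath)
    print(oct(st.st_mode & 0o777), st.st_size)
```

The point of the naming scheme is that no key path needs to be configured at all: the tooling derives it from the volume name.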
Re: [systemd-devel] Authenticated Boot and Disk Encryption on Linux
On Do, 30.09.21 18:54, Łukasz Stelmach (stl...@poczta.fm) wrote: > > I have been working on code in homed to "balance" free space between > > active home dirs in regular intervals (shorter intervals when disk > > space is low, higher intervals when there's plenty). Also, right now > > we already run FITRIM on home dirs on logout, to make sure all air is > > removed then. I intend to also add logic to shrink to minimal size > > then (and conversely grow on login again). > > > > This will only really work in case btrfs is used inside the homedir > > images, as only then we can both shrink and grow the fs whenever we > > want to. > > Interesting. Apparently[1] the loopback driver punches holes in the image > files and makes them sparse. We currently issue FITRIM on logout (thus making the file sparse), and on login we issue fallocate() to remove the holes again, so that we are able to give disk space guarantees and disable overcommit during runtime. > > [Encryption] isn't typically needed for /usr/ given that it generally > > contains no secret data > > This isn't IMHO precisely true. Especially not for laptops. And I don't > mean the presence of "hacking tools" you mentioned below. Even when all > the binaries in /usr come from the Internet there are many > different versions available. Knowledge of which versions are running on a > device may be quite valuable for an attacker to mount a remote on-line > attack and extract data with malware. Well, that's security through obscurity to some degree. I know some people are concerned about this, and they can encrypt that if they really think they must. But I doubt that this makes sense for the cases where your OS payload comes in flatpaks, containers, sysexts, portable services, …, i.e. is not written to /usr. Lennart -- Lennart Poettering, Berlin
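The FITRIM-on-logout / fallocate-on-login dance discussed above can be illustrated with plain file APIs. This sketch substitutes truncate-created holes for FITRIM (which operates on a mounted filesystem and needs privileges) and uses posix_fallocate() to reserve the blocks again, which is the property that makes disk-space guarantees possible:

```python
import os
import tempfile

SIZE = 1 << 20  # 1 MiB image (illustrative size)

with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name

# "Logout": a sparse file -- full logical size, few or no blocks on disk,
# so the space can be used by others (analogous to the FITRIM effect).
os.truncate(path, SIZE)
sparse_blocks = os.stat(path).st_blocks

# "Login": fallocate the full range so the space is actually reserved
# and cannot be overcommitted away while the home dir is in active use.
fd = os.open(path, os.O_RDWR)
os.posix_fallocate(fd, 0, SIZE)
os.close(fd)
allocated_blocks = os.stat(path).st_blocks

print(sparse_blocks, allocated_blocks)
os.unlink(path)
```

The asymmetry is the point: holes give the space back opportunistically, while fallocate turns it into a hard guarantee for the logged-in user.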
Re: [systemd-devel] Prefix for direct logging
On Mi, 29.09.21 20:21, Arjun D R (drarju...@gmail.com) wrote: > Hi Lennart, > > Please help me understand how the journald is figuring out the PID of the > log line. Google SCM_CREDENTIALS. Lennart -- Lennart Poettering, Berlin
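The SCM_CREDENTIALS pointer above can be made concrete: when SO_PASSCRED is enabled on an AF_UNIX socket, the kernel attaches a struct ucred (pid, uid, gid) to every received message, filled in by the kernel rather than the sender, which is how journald learns the PID of each log line without trusting the client. A minimal Linux-only sketch using a socketpair in one process:

```python
import socket
import struct

# Kernel delivers a struct ucred {pid_t, uid_t, gid_t} per message
# once SO_PASSCRED is enabled on the receiving AF_UNIX socket.
recv_sock, send_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
recv_sock.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)

send_sock.send(b"MESSAGE=hello")

msg, ancdata, flags, addr = recv_sock.recvmsg(
    1024, socket.CMSG_SPACE(struct.calcsize("3i")))
for level, ctype, data in ancdata:
    if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
        # pid_t, uid_t, gid_t are each 32-bit on Linux.
        pid, uid, gid = struct.unpack("3i", data)
        print(f"sender pid={pid} uid={uid} gid={gid}")

recv_sock.close()
send_sock.close()
```

Because the kernel fills in the ucred at send time, a client cannot forge another process's PID here (short of privileged SO_PEERCRED/SCM tricks), which is exactly the property journald relies on.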
Re: [systemd-devel] Authenticated Boot and Disk Encryption on Linux
On Mi, 29.09.21 21:09, Łukasz Stelmach (stl...@poczta.fm) wrote: > Hi, Lennart. > > I read your blog post and there is little I can add regarding > encryption/authentication*. However, distributions need to address one > more detail, I think. You've mentioned recovery scenarios, but even with > an additional set of keys stored securely, there are enough moving parts > in FDE that something may go wrong beyond what recovery keys could > fix. To help users minimise the risk of data loss distributions should > provide backup tools and help configure them securely. > > This is of course outside of the scope of your original post, but IMHO > it is a good moment to mention this. > > * Well there is one tiny detail. > > You noted double encryption needs to be avoided in case of home > directory images by storing them on a separate partition. Separating > /home may be considered a slight inefficiency in storage usage, but > using LVM to distribute storage space between the root(+/usr) and /home > might help. However, to the best of my knowledge (which I will be glad to > update) there is no tool to dynamically and automatically manage storage > space used by home images. In theory the code is there, but the UX of > resize2fs(8) and dd(1) is far from satisfying and I am not entirely sure > what happens if one truncates (after resize2fs, which will work) > a file containing a mounted image. > > The first solution that comes to my mind is to make systemd-homed resize > home filesystem images according to some policy upon locking and > unlocking. But it's not perfect as users would need to log out(?) to > trigger allocation of more storage should they fill their home > directory. I have been working on code in homed to "balance" free space between active home dirs in regular intervals (shorter intervals when disk space is low, higher intervals when there's plenty). Also, right now we already run FITRIM on home dirs on logout, to make sure all air is removed then. 
I intend to also add logic to shrink to minimal size then (and conversely grow on login again). This will only really work in case btrfs is used inside the homedir images, as only then we can both shrink and grow the fs whenever we want to. Lennart -- Lennart Poettering, Berlin