[systemd-devel] Antw: [EXT] Re: [systemd-devel] How to find out the processes systemd-shutdown is waiting for?
>>> Lennart Poettering wrote on 02.03.2022 at 17:22:
> On Mi, 02.03.22 13:02, Arian van Putten (arian.vanput...@gmail.com) wrote:
>
>> I've seen this a lot with docker/containerd. It seems as if for some reason
>> systemd doesn't wait for their cgroups to be cleaned up on shutdown. It's
>> very easy to reproduce: start a docker container and then power off the
>> machine. Since the move to cgroups v2, containerd should be using systemd to
>> manage the cgroup tree, so it's a bit puzzling why this keeps happening.
>>
>> Something seems to be off with containerd's integration into systemd, but
>> I'm not sure what.
>
> Docker traditionally has not followed any of our documented ways to

Are you implying that "our documented ways" is a definitive standard?

> interact with cgroups, even though they were made aware of them, not
> sure why, I think some systemd hate plays a role there. I am not sure
> if this has changed, but please contact Docker if you have issues with
> Docker, they have to fix their stuff themselves, we cannot work around
> it.

The problem with systemd (people) is that they try to establish new
standards outside of systemd. "If A does not work with systemd", it's
always A that is broken, never systemd ;-)

Regards,
Ulrich

> Lennart
>
> --
> Lennart Poettering, Berlin
[systemd-devel] Q: journalctl -g
Hi!

In SLES15 SP3 (systemd-246.16-7.33.1.x86_64) I see this effect and wonder
whether it is a bug or a feature: when using "journalctl -b -g raid", _some_
matches are highlighted in red, but others aren't. For example (the part
between asterisks is what journalctl highlighted):

Mar 01 01:37:09 h16 kernel: mega*raid*_sas :c1:00.0: BAR:0x1 BAR's base_addr(phys):0xa550 mapped virt_addr:0xae628322
Mar 01 01:37:09 h16 kernel: megaraid_sas :c1:00.0: FW now in Ready state
...

That is, "raid" in the line that follows is not highlighted, even though it
obviously matched. Likewise, any further "megaraid_sas" occurrences aren't
highlighted. These are not highlighted either:

Mar 01 01:37:20 h16 kernel: raid6: avx2x4 gen() 16182 MB/s
Mar 01 01:37:47 h16 kernel: md/raid1:md127: active with 2 out of 2 mirrors
Mar 01 01:37:48 h16 smartd[5871]: Device: /dev/bus/0 [megaraid_disk_00], type changed from 'megaraid,0' to 'sat+megaraid,0'

But here it is highlighted again:

Mar 01 01:38:55 h16 pacemaker-controld[7236]: notice: Requesting local execution of probe operation for prm_lockspace_*raid*_md10 on h16
Mar 01 01:38:55 h16 pacemaker-controld[7236]: notice: Result of probe operation for prm_lockspace_*raid*_md10 on h16: not running

And in these, too:

Mar 01 07:58:44 h16 kernel: mega*raid*_sas :c1:00.0: Firmware crash dump is not available
Mar 01 08:00:47 h16 supportconfig[671]: Software *Raid*...

And not in this one:

Mar 02 03:07:48 h16 smartd[5871]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], starting scheduled Short Self-Test.

Regards,
Ulrich
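For what it's worth, the matching itself appears consistent; only the highlighting differs. A quick sketch (using Python's re module as a stand-in for the PCRE2 engine journalctl uses, which is close enough for a literal pattern; the sample lines are abbreviated from the log above) shows every such line matches the lowercase pattern case-insensitively:

```python
import re

# journalctl -g compiles the pattern with PCRE2 and matches
# case-insensitively when the pattern is all lowercase. Re-running the
# same literal pattern here shows all the sample lines match, so the
# inconsistency must be in the highlighting, not the matching.
lines = [
    "kernel: megaraid_sas :c1:00.0: FW now in Ready state",
    "kernel: raid6: avx2x4 gen() 16182 MB/s",
    "kernel: md/raid1:md127: active with 2 out of 2 mirrors",
    "smartd[5871]: Device: /dev/bus/0 [megaraid_disk_00] [SAT]",
    "supportconfig[671]: Software Raid...",
]
hits = [bool(re.search("raid", line, re.IGNORECASE)) for line in lines]
print(hits)  # every sample line matches
```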
Re: [systemd-devel] systemd failing to close unwanted file descriptors & FDS spawning and crashing
Ah, right, I forgot – since this is done in the service child (right before
exec) and not in the main process, you probably need to add the -f option
to make strace follow forks...

On Thu, Mar 3, 2022, 22:08 Christopher Obbard wrote:
> Hi Mantas,
>
> On 03/03/2022 19:18, Mantas Mikulėnas wrote:
> > On Thu, Mar 3, 2022 at 9:09 PM Christopher Obbard
> > (chris.obb...@collabora.com) wrote:
> >
> >> Hi systemd experts!
> >>
> >> I am using systemd-247 and systemd-250 on a Debian system, which is
> >> running a minimal downstream 5.4 kernel for a Qualcomm board.
> >>
> >> systemd 241 in Debian buster works fine, but systemd 247 (Debian
> >> bullseye) and systemd 250 (Debian unstable) seem to get upset about
> >> file descriptors on services. These errors are consistent, and the
> >> board boots just fine with init=/bin/sh.
> >>
> >> I've got the required kernel config from the README in my kernel. I am
> >> using a heavily patched downstream kernel, but from the following log
> >> can you suggest anything I can do to debug this (other than throwing
> >> the board out of the window)?
> >
> > From the message, it looks like the error is returned by
> > close_all_fds() in src/basic/fd-util.c, where the only major change is
> > that it has been ported to call close_range() if that's available...
> >
> > I would boot with init=/bin/sh, then run `exec strace -D -o
> > /var/log/systemd.trace /lib/systemd/systemd` to get a trace, and see
> > if the EINVAL actually comes from calling close_range() or from
> > something else.
> >
> > --
> > Mantas Mikulėnas
>
> Thanks for your reply. It reproduced nicely with the command you gave.
>
> Seems like nothing related to close_range returned EINVAL; only the
> following calls did:
>
> cat systemd-failing.trace | grep EINVAL
> prctl(PR_CAPBSET_READ, 0x30 /* CAP_??? */) = -1 EINVAL (Invalid argument)
> prctl(PR_CAPBSET_READ, CAP_CHECKPOINT_RESTORE) = -1 EINVAL (Invalid argument)
> prctl(PR_CAPBSET_READ, CAP_PERFMON) = -1 EINVAL (Invalid argument)
> read(4, 0x55675a75b0, 4095) = -1 EINVAL (Invalid argument)
> mount("cgroup2", "/proc/self/fd/4", "cgroup2",
> MS_NOSUID|MS_NODEV|MS_NOEXEC, "nsdelegate,memory_recursiveprot") = -1
> EINVAL (Invalid argument)
>
> I have attached the full strace output, in case that would be useful?
>
> Thanks
> Chris
Re: [systemd-devel] systemd failing to close unwanted file descriptors & FDS spawning and crashing
On Thu, Mar 3, 2022 at 9:09 PM Christopher Obbard
(chris.obb...@collabora.com) wrote:

> Hi systemd experts!
>
> I am using systemd-247 and systemd-250 on a Debian system, which is
> running a minimal downstream 5.4 kernel for a Qualcomm board.
>
> systemd 241 in Debian buster works fine, but systemd 247 (Debian
> bullseye) and systemd 250 (Debian unstable) seem to get upset about file
> descriptors on services. These errors are consistent, and the board
> boots just fine with init=/bin/sh.
>
> I've got the required kernel config from the README in my kernel. I am
> using a heavily patched downstream kernel, but from the following log
> can you suggest anything I can do to debug this (other than throwing the
> board out of the window)?

From the message, it looks like the error is returned by close_all_fds()
in src/basic/fd-util.c, where the only major change is that it has been
ported to call close_range() if that's available...

I would boot with init=/bin/sh, then run `exec strace -D -o
/var/log/systemd.trace /lib/systemd/systemd` to get a trace, and see if
the EINVAL actually comes from calling close_range() or from something
else.

--
Mantas Mikulėnas
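For context, the fallback logic in question can be sketched like this. This is a Python approximation of the shape of close_all_fds() (syscall first, per-fd fallback), not the actual C code; the function name and structure here are illustrative:

```python
import os

def close_fd_range(first, last):
    """Close every fd in [first, last], preferring the close_range()
    syscall (Linux 5.9+, exposed as os.close_range() on Python 3.10+)
    and falling back to closing each fd individually, roughly the shape
    of systemd's close_all_fds() in src/basic/fd-util.c."""
    try:
        os.close_range(first, last)
        return "close_range"
    except (AttributeError, OSError):
        # Older Python or a kernel without the syscall: emulate it.
        pass
    for fd in range(first, last + 1):
        try:
            os.close(fd)
        except OSError:
            pass  # fd was not open; close_range() ignores these too
    return "per-fd fallback"

# Demo: close the fds of a throwaway pipe one at a time.
r, w = os.pipe()
close_fd_range(r, r)
close_fd_range(w, w)
```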
[systemd-devel] systemd failing to close unwanted file descriptors & FDS spawning and crashing
Hi systemd experts!

I am using systemd-247 and systemd-250 on a Debian system, which is running
a minimal downstream 5.4 kernel for a Qualcomm board.

systemd 241 in Debian buster works fine, but systemd 247 (Debian bullseye)
and systemd 250 (Debian unstable) seem to get upset about file descriptors
on services. These errors are consistent, and the board boots just fine
with init=/bin/sh.

I've got the required kernel config from the README in my kernel. I am
using a heavily patched downstream kernel, but from the following log can
you suggest anything I can do to debug this (other than throwing the board
out of the window)?

Thanks in advance!

[ 12.704592] systemd[1]: systemd 247.3-6 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
[ 12.737160] systemd[1]: Detected architecture arm64.

Welcome to Debian GNU/Linux 11 (bullseye)!

[ 12.761573] systemd[1]: Set hostname to test.
[ 12.855973] systemd-bless-b (344) used greatest stack depth: 12352 bytes left
[ 12.893857] systemd-sysv-ge (354) used greatest stack depth: 11808 bytes left
[ 14.991262] systemd-gpt-aut (349) used greatest stack depth: 11712 bytes left
[ 15.266999] systemd[1]: Queued start job for default target Graphical Interface.
[ 15.281405] systemd[1]: Created slice system-getty.slice.
[  OK  ] Created slice system-getty.slice.
[ 15.302385] systemd[1]: Created slice system-modprobe.slice.
[  OK  ] Created slice system-modprobe.slice.
[ 15.329318] systemd[1]: Created slice system-serial\x2dgetty.slice.
[  OK  ] Created slice system-serial\x2dgetty.slice.
[ 15.349720] systemd[1]: Started Dispatch Password Requests to Console Directory Watch.
[  OK  ] Started Dispatch Password …ts to Console Directory Watch.
[ 15.373314] systemd[1]: Started Forward Password Requests to Wall Directory Watch.
[  OK  ] Started Forward Password R…uests to Wall Directory Watch.
[ 15.405305] systemd[1]: Set up automount Arbitrary Executable File Formats File System Automount Point.
[  OK  ] Set up automount Arbitrary…s File System Automount Point.
[ 15.429187] systemd[1]: Reached target Local Encrypted Volumes.
[  OK  ] Reached target Local Encrypted Volumes.
[ 15.449096] systemd[1]: Reached target Paths.
[  OK  ] Reached target Paths.
[ 15.464745] systemd[1]: Reached target Remote File Systems.
[  OK  ] Reached target Remote File Systems.
[ 15.484707] systemd[1]: Reached target Slices.
[  OK  ] Reached target Slices.
[ 15.500719] systemd[1]: Reached target Swap.
[  OK  ] Reached target Swap.
[ 15.520772] systemd[1]: Listening on Syslog Socket.
[  OK  ] Listening on Syslog Socket.
[ 15.538503] systemd[1]: Listening on initctl Compatibility Named Pipe.
[  OK  ] Listening on initctl Compatibility Named Pipe.
[ 15.562231] systemd[1]: Condition check resulted in Journal Audit Socket being skipped.
[ 15.574187] systemd[1]: Listening on Journal Socket (/dev/log).
[  OK  ] Listening on Journal Socket (/dev/log).
[ 15.594213] systemd[1]: Listening on Journal Socket.
[  OK  ] Listening on Journal Socket.
[ 15.619267] systemd[1]: Listening on udev Control Socket.
[  OK  ] Listening on udev Control Socket.
[ 15.639342] systemd[1]: Listening on udev Kernel Socket.
[  OK  ] Listening on udev Kernel Socket.
[ 15.656768] systemd[1]: Reached target Sockets.
[  OK  ] Reached target Sockets.
[ 15.674793] systemd[1]: Condition check resulted in Huge Pages File System being skipped.
[ 15.685036] systemd[1]: Condition check resulted in POSIX Message Queue File System being skipped.
[ 15.709769] systemd[1]: Mounting Kernel Debug File System...
[ 15.710514] systemd[356]: sys-kernel-debug.mount: Failed to close unwanted file descriptors: Invalid argument
         Mounting Kernel Debug File System...
[ 15.742045] systemd[1]: Mounting Kernel Trace File System...
[ 15.742446] systemd[357]: sys-kernel-tracing.mount: Failed to close unwanted file descriptors: Invalid argument
         Mounting Kernel Trace File System...
[ 15.770279] systemd[1]: Condition check resulted in Create list of static device nodes for the current kernel being skipped.
[ 15.789036] systemd[1]: Starting Load Kernel Module configfs...
[ 15.789195] systemd[358]: modprobe@configfs.service: Failed to close unwanted file descriptors: Invalid argument
         Starting Load Kernel Module configfs...
[ 15.821560] systemd[1]: Starting Load Kernel Module fuse...
[ 15.821679] systemd[359]: modprobe@fuse.service: Failed to close unwanted file descriptors: Invalid argument
         Starting Load Kernel Module fuse...
[ 15.850714] systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
[ 15.874113] systemd[360]: systemd-journald.service: Failed to close unwanted file descriptors: Invalid argument
[ 15.874167] systemd[1]:
Re: [systemd-devel] How to find out the processes systemd-shutdown is waiting for?
On Mi, 02.03.22 17:50, Lennart Poettering (lenn...@poettering.net) wrote:

> That said, we could certainly show both the comm field and the PID of
> the offending processes.

I am prepping a patch for that. See:

https://github.com/systemd/systemd/pull/22655

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Do, 03.03.22 18:35, Felip Moll (fe...@schedmd.com) wrote:

> I have read and studied all your suggestions and I understand them.
> I also did some performance tests in which I fork+executed a systemd-run
> to launch a service for every step, and I got bad performance overall.
> One of our QA tests (test 9.8 of our testsuite) shows a 3x decrease in
> performance.

systemd-run is synchronous, and unless you specify "--scope" it will tell
systemd to fork things off instead of doing that client-side, which I
understand is what you want to do. The fact that it's synchronous, i.e.
waits for completion of the whole operation (including start-up of
dependencies and whatnot), necessarily means it's slow.

> > But note that you can also run your main service as a service, and
> > then allocate a *single* scope unit for *all* your payloads. That way
> > you can restart your main service unit independently of the scope
> > unit, but you only have to issue a single request once for allocating
> > the scope, and not for each of your payloads.
>
> My questions are, where would the scope reside? Does it have an
> associated cgroup?

Yes, I explicitly pointed you to them, it's why I suggested you use them.

My recommendation if you hack on stuff like this is to read the docs, btw,
specifically:

https://systemd.io/CGROUP_DELEGATION

It pretty explicitly lists your options in the "Three Scenarios" section.
It also explains what scope units are and when to use them.

> I am also curious what this sentence means exactly:
>
> "You might break systemd as a whole though (for example, add a process
> directly to a slice's cgroup and systemd will be very sad)."

If you add a process to a cgroup systemd manages that is supposed to be an
inner one in the tree, you will make creation of children fail that way,
and thus starting services and other operations will likely start failing
all over the place.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
Hi folks,

I wanted to keep the case as generic as possible, but I think it is
important at this point to explain what we're actually talking about, so
let me clarify the case I am dealing with at the moment.

At SchedMD, we want Slurm to support cgroup v2. As you may know, Slurm is
an HPC resource manager, and for the moment we're limited to cgroup v1. We
currently use the freezer, memory, cpuset, cpuacct and devices controllers
in v1. We think it is a good time to add a plugin to our software to make
it capable of running on unified systems, and since systemd is widely used
we want to do this integration as well as we can, to coexist with systemd
and not get our pids moved or make systemd mad.

We have a 'slurmd' daemon running on every compute node, waiting for
communications from the controller. The controller submits different kinds
of RPCs to slurmd, and at some point an RPC can instruct slurmd to start a
new job step for a specific uid. Slurmd then forks twice; the original
slurmd just returns and goes back to other work. The first fork (child)
sets up a bunch of pipes and prepares initialization data, then forks
again, generating a grandchild. The grandchild finally exec's the
slurmstepd daemon, which receives the initialization data, prepares the
cgroups, and finally fork+execs the user software.

This can happen many times a second, because a user can submit a "job
array" which with one single RPC call can launch thousands of steps, while
thousands of other steps can be finishing at the same time, so the work
that systemd would need to do starting up new scopes/services and/or
stopping them, plus monitoring all this, could be considerable.

After this introduction I have to say that we successfully managed to work
following systemd rules by just starting a unit file for slurmd with
Delegate=yes and creating our own hierarchy inside.
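The fork-twice-then-exec dance described above looks roughly like this (a minimal Python sketch of the flow, not Slurm's actual code; the pipe and initialization-data plumbing is elided):

```python
import os

def spawn_stepd(argv):
    """Fork twice so the grandchild can exec the step daemon while the
    calling daemon returns to its event loop immediately. The
    intermediate child exits right away, so the grandchild is reparented
    (to init/systemd) and never becomes a zombie of the daemon."""
    pid = os.fork()
    if pid > 0:
        os.waitpid(pid, 0)  # reap the short-lived intermediate child
        return              # the daemon goes back to other work
    # First child: this is where pipes and init data would be set up.
    if os.fork() > 0:
        os._exit(0)         # intermediate child is done
    # Grandchild: becomes the long-lived step daemon.
    try:
        os.execvp(argv[0], argv)
    finally:
        os._exit(127)       # only reached if exec failed
```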
Every slurmstepd would be forked and started in the delegated cgroup and
would create its own directory and move itself where it belongs (always
within the delegated cgroup), according to our needs.

Everything ran smoothly until I restarted slurmd while slurmstepds were
still running in the cgroup: systemd was unable to start slurmd again
because the cgroup had not been deleted, since it was still busy with
directories and slurmstepds; that is the main reason for this bug report.
Note that one feature of Slurm is that one can upgrade/restart slurmd
without affecting running jobs (slurmstepds) on the compute node.

I have read and studied all your suggestions and I understand them. I also
did some performance tests in which I fork+executed a systemd-run to
launch a service for every step, and I got bad performance overall. One of
our QA tests (test 9.8 of our testsuite) shows a 3x decrease in
performance.

On the positive side, we did a test where we manually fork+exec'ed one new
delegated separate service when starting up slurmd, and we moved newly
forked slurmstepd pids *manually* into the new cgroup associated with that
service. This service contains a 'sleep infinity' as the main pid, to keep
the cgroup from disappearing even if no slurmstepds are running. As I say,
this is a dirty test, but it works.

After reading your last two emails, I think the most efficient way for us
to go is this one:

> Firing an async D-Bus packet to systemd should be hardly measurable.
>
> But note that you can also run your main service as a service, and
> then allocate a *single* scope unit for *all* your payloads. That way
> you can restart your main service unit independently of the scope
> unit, but you only have to issue a single request once for allocating
> the scope, and not for each of your payloads.

My questions are: where would the scope reside? Does it have an associated
cgroup? If I am a new slurmstepd, can I attach myself to this scope, or
must I be attached by slurmd before being executed?
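As an aside, the attach step itself (whether into a delegated subtree or into a scope's cgroup, permissions allowing) is ultimately just a write of the pid to that cgroup's cgroup.procs file. A small sketch, with the path purely illustrative:

```python
import os

def attach_pid(cgroup_dir, pid):
    """Move a process into a cgroup v2 cgroup by writing its pid to the
    cgroup.procs file in that directory. Only safe inside a subtree
    delegated to us (Delegate=yes); writing into cgroups that systemd
    manages itself will confuse it."""
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))

# Hypothetical usage, assuming a delegated step directory we created:
# attach_pid("/sys/fs/cgroup/system.slice/slurmd.service/step_42",
#            os.getpid())
```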
> But that too means you have to issue a bus call. If you really don't
> like talking to systemd this is not going to work of course, but quite
> frankly, that's a problem you are making yourself, and I am not
> particularly sympathetic to it.

I can study this option. It is not that I like or don't like talking to
systemd; the idea is that Slurm must work on other OSes, possibly without
systemd but still with cgroup v2, and remain compatible with cgroup v1 and
with no cgroups at all. It's thinking about the future: the less
complexity and the fewer special cases it has, the more maintainable and
flexible the software is. I think this is understandable, but if it is not
possible at all we will have to adapt.

> > DelegateCgroupLeaf=. If set to yes an extra directory will be
> > created into the unit cgroup to place the newly spawned service
> > process. This is useful for services which need to be restarted
> > while their forked pids remain in the cgroup and the service cgroup
> > is no longer a leaf.
>
> No. Let's not add that.

I could foresee the benefits of such an option, but I can
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mo, 21.02.22 22:16, Felip Moll (lip...@gmail.com) wrote:

> Silvio,
>
> As I commented in my previous post, creating every single job in a
> separate slice is an overhead I cannot afford.
> An HTC system could run thousands of jobs per second, and doing extra
> fork+execs plus waiting for systemd to fill up its internal structures
> and manage it all is a no-no.

Firing an async D-Bus packet at systemd should be hardly measurable.

But note that you can also run your main service as a service, and then
allocate a *single* scope unit for *all* your payloads. That way you can
restart your main service unit independently of the scope unit, but you
only have to issue a single request once for allocating the scope, and not
one for each of your payloads.

But that too means you have to issue a bus call. If you really don't like
talking to systemd this is not going to work of course, but quite frankly,
that's a problem you are making yourself, and I am not particularly
sympathetic to it.

> One other option that I am thinking about is extending the parameters of
> a unit file, for example adding a DelegateCgroupLeaf=yes option.
>
> DelegateCgroupLeaf=. If set to yes an extra directory will be
> created into the unit cgroup to place the newly spawned service process.
> This is useful for services which need to be restarted while their
> forked pids remain in the cgroup and the service cgroup is no longer a
> leaf.

No. Let's not add that.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mo, 21.02.22 18:07, Felip Moll (lip...@gmail.com) wrote:

> > That's a bad idea typically, and generally a hack: the unit should
> > probably be split up differently, i.e. the processes that shall stick
> > around on restart should probably be in their own unit, i.e. another
> > service or scope unit.
>
> So, if I understand correctly, you are suggesting that every forked
> process must be started through a new systemd unit?

systemd has two different unit types for this: services and scopes. Both
group processes in a cgroup, but only for services does systemd itself
fork+exec (i.e. "start a process"). If you want to fork yourself, that's
fine; then a scope unit is your thing. If you use scope units you do
everything yourself, but as part of your setup you tell systemd to move
your process into its own scope unit.

> If that's the case it seems inconvenient, because we're talking about a
> job scheduler which may sometimes have thousands of forked processes
> executed quickly, and where performance is key.
> Having to manage a unit for each process will probably not work in this
> situation in terms of performance.

You don't really have to "manage" it. You can register a scope unit
asynchronously; it's firing off one D-Bus message, basically at the same
time you fork things off, telling systemd to put the process in a new
scope unit.

> The other option I can imagine is to start a new unit from my daemon of
> Type=forking, which remains forever until I decide to clean it up, even
> if it doesn't have any process inside.
> Then I could put my processes in the associated cgroup instead of inside
> the main daemon cgroup. Would that make sense?

Migrating processes wildly between cgroups is messy, because it fucks up
accounting and is restricted permission-wise. Typically you want to create
a cgroup, populate it, and then stick to that.
> The issue here is that to create the new unit I'd need my daemon to
> depend on systemd libraries, or to do some fork+exec using systemd
> commands and parse the output.

To allocate a scope unit you'd have to fire off a D-Bus method call. No
need for any systemd libraries.

> I am trying to keep the dependencies at a minimum and I'd love to have
> an alternative.

Sorry, but if you want to rearrange processes in cgroups, or want systemd
to manage your processes orthogonally to the service concept, you have to
talk to systemd.

> Yeah, I know and understand it is not supported, but I am more
> interested in the technical part of how things would break.
> I see in systemd/src/core/cgroup.c that it often differentiates a cgroup
> with delegation from one without it (!unit_cgroup_delegate(u)), but it's
> hard for me to find out how or where exactly this will mess with any
> cgroup created outside of systemd. I'd appreciate it if you could shed
> some light on why/when/where things will break in practice, or give an
> example.

This depends highly on what precisely you do. At best, systemd will
complain or just override the changes you made outside of the tree that
was delegated to you. You might break systemd as a whole, though (for
example, add a process directly to a slice's cgroup and systemd will be
very sad).

Lennart

--
Lennart Poettering, Berlin
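To make that D-Bus call concrete, here is one hedged sketch of what it can look like from a shell-out, expressed as the argv for busctl(1) calling StartTransientUnit. The unit and slice names are invented, and the function only builds the command rather than running it, since actually issuing it needs a live systemd:

```python
def transient_scope_cmd(unit, pids, slice_unit="system.slice"):
    """Build a busctl invocation of
    org.freedesktop.systemd1.Manager.StartTransientUnit, which puts
    already-running pids into a new transient scope unit. The method
    signature ssa(sv)a(sa(sv)) is: unit name, job mode, properties,
    auxiliary units."""
    argv = [
        "busctl", "call",
        "org.freedesktop.systemd1", "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager", "StartTransientUnit",
        "ssa(sv)a(sa(sv))",
        unit, "fail",
        "2",                                  # two properties follow
        "PIDs", "au", str(len(pids)), *map(str, pids),
        "Slice", "s", slice_unit,
        "0",                                  # no auxiliary units
    ]
    return argv

cmd = transient_scope_cmd("payloads.scope", [4711])
print(" ".join(cmd))
```

Sending the same message directly over the bus (instead of shelling out to busctl) avoids even the fork+exec, which is the "async D-Bus packet" approach suggested above.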