[systemd-devel] Re: [EXT] Re: [systemd-devel] How to find out the processes systemd-shutdown is waiting for?

2022-03-03 Thread Ulrich Windl
>>> Lennart Poettering  wrote on 02.03.2022 at 17:22 in
message :
> On Mi, 02.03.22 13:02, Arian van Putten (arian.vanput...@gmail.com) wrote:
> 
>> I've seen this a lot with docker/containerd. It seems as if for some
>> reason systemd doesn't wait for their cgroups to be cleaned up on
>> shutdown. It's very easy to reproduce. Start a docker container and then
>> power off the machine. Since the move to cgroups v2, containerd should be
>> using systemd to manage the cgroup tree, so it's a bit puzzling why it's
>> always happening.
>>
>> Something seems to be off with containerd's integration into systemd but
>> I'm not sure what.
> 
> Docker traditionally has not followed any of our documented ways to

Are you implying that "our documented ways" are a definitive standard?

> interact with cgroups, even though they were made aware of them, not
> sure why, I think some systemd hate plays a role there. I am not sure
> if this has changed, but please contact Docker if you have issues with
> Docker, they have to fix their stuff themselves, we cannot work around
> it.

The problem with systemd (people) is that they try to establish new standards
outside of systemd.

"If A does not work with systemd", it's always A that is broken, never systemd
;-)

Regards,
Ulrich

> 
> Lennart
> 
> --
> Lennart Poettering, Berlin





[systemd-devel] Q: journalctl -g

2022-03-03 Thread Ulrich Windl
Hi!

In SLES15 SP3 (systemd-246.16-7.33.1.x86_64) I see the following effect and wonder
whether it is a bug or a feature:
When using "journalctl -b -g raid" I see that _some_ matches are highlighted in
red, but others aren't. For example:
Mar 01 01:37:09 h16 kernel: mega*raid*_sas :c1:00.0: BAR:0x1  BAR's 
base_addr(phys):0xa550  mapped virt_addr:0xae628322
Mar 01 01:37:09 h16 kernel: megaraid_sas :c1:00.0: FW now in Ready state
...

That means that in the following line "raid" is not highlighted, even though it
obviously matched.
Likewise, further occurrences of "megaraid_sas" aren't highlighted.

But these are not highlighted either:
Mar 01 01:37:20 h16 kernel: raid6: avx2x4   gen() 16182 MB/s

Mar 01 01:37:47 h16 kernel: md/raid1:md127: active with 2 out of 2 mirrors

Mar 01 01:37:48 h16 smartd[5871]: Device: /dev/bus/0 [megaraid_disk_00], type 
changed from 'megaraid,0' to 'sat+megaraid,0'

But here it is highlighted again:
Mar 01 01:38:55 h16 pacemaker-controld[7236]:  notice: Requesting local 
execution of probe operation for prm_lockspace_*raid*_md10 on h16
Mar 01 01:38:55 h16 pacemaker-controld[7236]:  notice: Result of probe 
operation for prm_lockspace_*raid*_md10 on h16: not running

And this, too:
Mar 01 07:58:44 h16 kernel: mega*raid*_sas :c1:00.0: Firmware crash dump is 
not available
Mar 01 08:00:47 h16 supportconfig[671]: Software *Raid*...

And this one is not:
Mar 02 03:07:48 h16 smartd[5871]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], 
starting scheduled Short Self-Test.
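
For reference, a rough way to quantify this (a sketch; it assumes that
SYSTEMD_COLORS=1 forces colored output even when piped, and the escape-sequence
count is only approximate, since journalctl may also colorize lines for other
reasons such as priority):

    # total lines journalctl considers matching
    journalctl -b -g raid --no-pager | wc -l
    # lines that actually carry an ANSI escape, i.e. some highlighting
    SYSTEMD_COLORS=1 journalctl -b -g raid --no-pager | grep -c "$(printf '\033')"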

Regards,
Ulrich





Re: [systemd-devel] systemd failing to close unwanted file descriptors & FDS spawning and crashing

2022-03-03 Thread Mantas Mikulėnas
Ah, right, I forgot – since this is done in the service child (right before
exec) and not in the main process, you probably need to add the -f option
to make strace follow forks...

On Thu, Mar 3, 2022, 22:08 Christopher Obbard wrote:

> Hi Mantas,
>
> On 03/03/2022 19:18, Mantas Mikulėnas wrote:
> > On Thu, Mar 3, 2022 at 9:09 PM Christopher Obbard
> > (chris.obb...@collabora.com) wrote:
> >
> > Hi systemd experts!
> >
> > I am using systemd-247 and systemd-250 on a debian system, which is
> > running a minimal downstream 5.4 kernel for a Qualcomm board.
> >
> > systemd 241 in debian buster works fine, but systemd 247 (debian
> > bullseye) and systemd 250 (debian unstable) seem to get upset about
> > file descriptors on services. These errors are consistent, and the
> > board boots just fine with init=/bin/sh.
> >
> > I've got the required kernel config options from the README in my
> > kernel. I am using a heavily patched downstream kernel, but from the
> > following log can you suggest anything I can do to debug this (other
> > than throwing the board out of the window)?
> >
> >
> >  From the message, it looks like the error is returned by
> > close_all_fds() in src/basic/fd-util.c, where the only major change is
> > that it has been ported to call close_range() if that's available...
> >
> > I would boot with init=/bin/sh, then run `exec strace -D -o
> > /var/log/systemd.trace /lib/systemd/systemd` to get a trace, and see if
> > the EINVAL actually comes from calling close_range() or from something
> else.
> >
> > --
> > Mantas Mikulėnas
>
> Thanks for your reply. It reproduced nicely with the command you gave.
>
> Seems like it's nothing related to close_range returning EINVAL; only the
> following calls returned EINVAL:
>
> cat systemd-failing.trace | grep EINVAL
> prctl(PR_CAPBSET_READ, 0x30 /* CAP_??? */) = -1 EINVAL (Invalid argument)
> prctl(PR_CAPBSET_READ, CAP_CHECKPOINT_RESTORE) = -1 EINVAL (Invalid
> argument)
> prctl(PR_CAPBSET_READ, CAP_PERFMON) = -1 EINVAL (Invalid argument)
> read(4, 0x55675a75b0, 4095) = -1 EINVAL (Invalid argument)
> mount("cgroup2", "/proc/self/fd/4", "cgroup2",
> MS_NOSUID|MS_NODEV|MS_NOEXEC, "nsdelegate,memory_recursiveprot") = -1
> EINVAL (Invalid argument)
>
>
> I have attached the full strace output, in case that would be useful?
>
> Thanks
> Chris


Re: [systemd-devel] systemd failing to close unwanted file descriptors & FDS spawning and crashing

2022-03-03 Thread Mantas Mikulėnas
On Thu, Mar 3, 2022 at 9:09 PM Christopher Obbard <
chris.obb...@collabora.com> wrote:

> Hi systemd experts!
>
> I am using systemd-247 and systemd-250 on a debian system, which is
> running a minimal downstream 5.4 kernel for a Qualcomm board.
>
> systemd 241 in debian buster works fine, but systemd 247 (debian
> bullseye) and systemd 250 (debian unstable) seem to get upset about file
> descriptors on services. These errors are consistent, and the board boots
> just fine with init=/bin/sh.
>
> I've got the required kernel config options from the README in my kernel.
> I am using a heavily patched downstream kernel, but from the following log
> can you suggest anything I can do to debug this (other than throwing the
> board out of the window)?
>

From the message, it looks like the error is returned by close_all_fds() in
src/basic/fd-util.c, where the only major change is that it has been ported
to call close_range() if that's available...

I would boot with init=/bin/sh, then run `exec strace -D -o
/var/log/systemd.trace /lib/systemd/systemd` to get a trace, and see if the
EINVAL actually comes from calling close_range() or from something else.

-- 
Mantas Mikulėnas


[systemd-devel] systemd failing to close unwanted file descriptors & FDS spawning and crashing

2022-03-03 Thread Christopher Obbard

Hi systemd experts!

I am using systemd-247 and systemd-250 on a debian system, which is
running a minimal downstream 5.4 kernel for a Qualcomm board.


systemd 241 in debian buster works fine, but systemd 247 (debian 
bullseye) and systemd 250 (debian unstable) seem to get upset about file 
descriptors on services. These errors are consistent, and the board boots
just fine with init=/bin/sh.


I've got the required kernel config options from the README in my kernel.
I am using a heavily patched downstream kernel, but from the following log
can you suggest anything I can do to debug this (other than throwing the
board out of the window)?


Thanks in advance!

[   12.704592] systemd[1]: systemd 247.3-6 running in system mode. (+PAM
+AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP
+GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD
+IDN2 -IDN +PCRE2 default-hierarchy=unified)
[   12.737160] systemd[1]: Detected architecture arm64.

Welcome to Debian GNU/Linux 11 (bullseye)!

[   12.761573] systemd[1]: Set hostname to test.
[   12.855973] systemd-bless-b (344) used greatest stack depth: 12352 
bytes left
[   12.893857] systemd-sysv-ge (354) used greatest stack depth: 11808 
bytes left
[   14.991262] systemd-gpt-aut (349) used greatest stack depth: 11712 
bytes left
[   15.266999] systemd[1]: Queued start job for default target Graphical 
Interface.

[   15.281405] systemd[1]: Created slice system-getty.slice.
[  OK  ] Created slice system-getty.slice.
[   15.302385] systemd[1]: Created slice system-modprobe.slice.
[  OK  ] Created slice system-modprobe.slice.
[   15.329318] systemd[1]: Created slice system-serial\x2dgetty.slice.
[  OK  ] Created slice system-serial\x2dgetty.slice.
[   15.349720] systemd[1]: Started Dispatch Password Requests to Console 
Directory Watch.

[  OK  ] Started Dispatch Password …ts to Console Directory Watch.
[   15.373314] systemd[1]: Started Forward Password Requests to Wall 
Directory Watch.

[  OK  ] Started Forward Password R…uests to Wall Directory Watch.
[   15.405305] systemd[1]: Set up automount Arbitrary Executable File 
Formats File System Automount Point.

[  OK  ] Set up automount Arbitrary…s File System Automount Point.
[   15.429187] systemd[1]: Reached target Local Encrypted Volumes.
[  OK  ] Reached target Local Encrypted Volumes.
[   15.449096] systemd[1]: Reached target Paths.
[  OK  ] Reached target Paths.
[   15.464745] systemd[1]: Reached target Remote File Systems.
[  OK  ] Reached target Remote File Systems.
[   15.484707] systemd[1]: Reached target Slices.
[  OK  ] Reached target Slices.
[   15.500719] systemd[1]: Reached target Swap.
[  OK  ] Reached target Swap.
[   15.520772] systemd[1]: Listening on Syslog Socket.
[  OK  ] Listening on Syslog Socket.
[   15.538503] systemd[1]: Listening on initctl Compatibility Named Pipe.
[  OK  ] Listening on initctl Compatibility Named Pipe.
[   15.562231] systemd[1]: Condition check resulted in Journal Audit 
Socket being skipped.

[   15.574187] systemd[1]: Listening on Journal Socket (/dev/log).
[  OK  ] Listening on Journal Socket (/dev/log).
[   15.594213] systemd[1]: Listening on Journal Socket.
[  OK  ] Listening on Journal Socket.
[   15.619267] systemd[1]: Listening on udev Control Socket.
[  OK  ] Listening on udev Control Socket.
[   15.639342] systemd[1]: Listening on udev Kernel Socket.
[  OK  ] Listening on udev Kernel Socket.
[   15.656768] systemd[1]: Reached target Sockets.
[  OK  ] Reached target Sockets.
[   15.674793] systemd[1]: Condition check resulted in Huge Pages File 
System being skipped.
[   15.685036] systemd[1]: Condition check resulted in POSIX Message 
Queue File System being skipped.

[   15.709769] systemd[1]: Mounting Kernel Debug File System...
[   15.710514] systemd[356]: sys-kernel-debug.mount: Failed to close 
unwanted file descriptors: Invalid argument

 Mounting Kernel Debug File System...
[   15.742045] systemd[1]: Mounting Kernel Trace File System...
[   15.742446] systemd[357]: sys-kernel-tracing.mount: Failed to close 
unwanted file descriptors: Invalid argument

 Mounting Kernel Trace File System...
[   15.770279] systemd[1]: Condition check resulted in Create list of 
static device nodes for the current kernel being skipped.

[   15.789036] systemd[1]: Starting Load Kernel Module configfs...
[   15.789195] systemd[358]: modprobe@configfs.service: Failed to close 
unwanted file descriptors: Invalid argument

 Starting Load Kernel Module configfs...
[   15.821560] systemd[1]: Starting Load Kernel Module fuse...
[   15.821679] systemd[359]: modprobe@fuse.service: Failed to close 
unwanted file descriptors: Invalid argument

 Starting Load Kernel Module fuse...
[   15.850714] systemd[1]: Condition check resulted in Set Up Additional 
Binary Formats being skipped.
[   15.874113] systemd[360]: systemd-journald.service: Failed to close 
unwanted file descriptors: Invalid argument

[   15.874167] systemd[1]: 

Re: [systemd-devel] How to find out the processes systemd-shutdown is waiting for?

2022-03-03 Thread Lennart Poettering
On Mi, 02.03.22 17:50, Lennart Poettering (lenn...@poettering.net) wrote:

> That said, we could certainly show both the comm field and the PID of
> the offending processes. I am prepping a patch for that.

See: https://github.com/systemd/systemd/pull/22655

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-03 Thread Lennart Poettering
On Do, 03.03.22 18:35, Felip Moll (fe...@schedmd.com) wrote:

> I have read and studied all your suggestions and I understand them.
> I also did some performance tests in which I fork+executed a systemd-run to
> launch a service for every step and I got bad performance overall.
> One of our QA tests (test 9.8 of our testsuite) shows a 3x decrease in
> performance.

systemd-run is synchronous, and unless you specify "--scope" it will
tell systemd to fork things off instead of doing that client-side,
which I understand is what you want to do. The fact that it's synchronous,
i.e. that it waits for completion of the whole operation (including start-up
of dependencies and whatnot), necessarily means it's slow.
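
For illustration, a rough sketch of the two modes (the unit name and payload
path are placeholders, not anything Slurm-specific):

    # default: asks systemd (PID 1) to fork+exec the payload as a transient service
    systemd-run --unit=step-42 -- /usr/bin/payload
    # --scope: the payload is executed client-side by systemd-run itself and only
    # registered with systemd as a transient scope unit
    systemd-run --scope --unit=step-42 -- /usr/bin/payload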

> > But note that you can also run your main service as a service, and
> > then allocate a *single* scope unit for *all* your payloads. That way
> > you can restart your main service unit independently of the scope
> > unit, but you only have to issue a single request once for allocating
> > the scope, and not for each of your payloads.
> >
> >
> My questions are, where would the scope reside? Does it have an associated
> cgroup?

Yes, I explicitly pointed you to them, it's why I suggested you use
them.

My recommendation if you hack on stuff like this is reading the docs
btw, specifically:

 https://systemd.io/CGROUP_DELEGATION

It pretty explicitly lists your options in the "Three Scenarios"
section.

It also explains what scope units are and when to use them.

> I am also curious of what this sentence does exactly mean:
>
> "You might break systemd as a whole though (for example, add a process
> directly to a slice's cgroup and systemd will be very sad).".

If you add a process to a cgroup that systemd manages and that is supposed
to be an inner node of the tree, you will make the creation of children fail,
and thus starting services and other operations will likely start failing
all over the place.
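
As a hedged illustration of that failure mode on a unified (cgroup v2) host
(the path is an example; the exact errno can vary with how controllers are
enabled):

    # system.slice is an inner node that systemd manages; when controllers are
    # enabled for its children, the kernel's "no internal processes" rule makes
    # this write fail (typically EBUSY) -- and even where it succeeds, systemd's
    # own management of that subtree will likely misbehave afterwards
    echo $$ > /sys/fs/cgroup/system.slice/cgroup.procs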

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-03 Thread Felip Moll
Hi folks, I wanted to keep the case as generic as possible, but I think it
is important at this point to explain what we're actually talking about, so
let me clarify a bit the case I am dealing with at the moment.

At SchedMD, we want Slurm to support cgroup v2. As you may know, Slurm is
an HPC resource manager, and for the moment we're limited to cgroup v1. We
currently use the freezer, memory, cpuset, cpuacct and devices controllers
in v1. We think it is a good time to add a plugin to our software to make it
capable of running on unified (cgroup v2) systems, and since systemd is
widely used we want to do this integration as well as we can, to coexist
with systemd and not get our pids moved or make systemd mad.

We have a 'slurmd' daemon running on every compute node, waiting for
communications from the controller. The controller submits different kinds
of RPCs to slurmd, and at some point an RPC can instruct slurmd to start a
new job step for a specific uid. Slurmd then forks twice; the original
slurmd is done at that point and goes back to other work. The first fork
(child) sets up a bunch of pipes and prepares initialization data, then
forks again, producing a grandchild. The grandchild finally exec's the
slurmstepd daemon, which receives the initialization data, prepares the
cgroups, and finally fork+execs the user software. This can happen many
times per second, because a user can submit a "job array" that with one
single RPC call can create thousands of steps, while thousands of other
steps may be finishing at the same time, so the work that systemd would
need to do starting up new scopes/services, stopping them, and monitoring
all this could be considerable.

After this introduction I have to say that we successfully managed to work
within systemd's rules by simply starting slurmd from a unit file with
Delegate=yes and creating our own hierarchy inside. Every slurmstepd would
be forked and started in the delegated cgroup and would create its own
directory and move itself where it belongs (always within the delegated
cgroup), according to our needs. Everything ran smoothly until I restarted
slurmd while slurmstepds were still running in the cgroup: systemd was
unable to start slurmd again because the cgroup had not been deleted, since
it was still busy with directories and slurmstepds; that is the main reason
for this bug report.

Note that one feature of Slurm is that one can upgrade/restart slurmd
without affecting running jobs (slurmstepds) on the compute node.
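
For context, a minimal sketch of the kind of unit described above
(illustrative only, not Slurm's actual unit file; the options shown here
are assumptions):

    [Unit]
    Description=Slurm node daemon (sketch)

    [Service]
    ExecStart=/usr/sbin/slurmd -D
    # hand the unit's cgroup subtree over to slurmd so it can build its own hierarchy
    Delegate=yes
    # on stop/restart only kill the main slurmd process, leaving slurmstepds alive
    KillMode=process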

I have read and studied all your suggestions and I understand them.
I also did some performance tests in which I fork+executed a systemd-run to
launch a service for every step and I got bad performance overall.
One of our QA tests (test 9.8 of our testsuite) shows a 3x decrease in
performance.

But on the positive side, we did a test where we manually fork+exec'd one
new, separate delegated service when starting up slurmd, and we moved newly
forked slurmstepd pids *manually* into the cgroup associated with that new
service. This service runs 'sleep infinity' as its main pid so that the
cgroup does not disappear even if no slurmstepds are running. As I say,
this is a dirty test, but it works.
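
A hedged sketch of how such a keeper service could be created from slurmd
(the unit name is a placeholder):

    # transient service whose only purpose is to keep a delegated cgroup alive;
    # slurmd then writes slurmstepd pids into this unit's cgroup (the dirty part)
    systemd-run --unit=slurmstepd-keeper.service --property=Delegate=yes \
        /bin/sleep infinity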

After reading your last two emails, I think the most efficient way for us
to go is this one:
> Firing an async D-Bus packet to systemd should be hardly measurable.
>
> But note that you can also run your main service as a service, and
> then allocate a *single* scope unit for *all* your payloads. That way
> you can restart your main service unit independently of the scope
> unit, but you only have to issue a single request once for allocating
> the scope, and not for each of your payloads.
>
>
My questions are, where would the scope reside? Does it have an associated
cgroup?
If I am a new slurmstepd, can I attach myself to this scope or must I be
attached by slurmd before being executed?


> But that too means you have to issue a bus call. If you really don't
> like talking to systemd this is not going to work of course, but quite
> frankly, that's a problem you are making yourself, and I am not
> particularly sympathetic to it.
>

I can study this option. It is not that I like or dislike talking to
systemd, but the idea is that Slurm must also work on other OSes, possibly
without systemd but still with cgroup v2, and still be compatible with
cgroup v1 and with no cgroups at all. It's about thinking of the future:
the less complexity and the fewer particularities it has, the more
maintainable and flexible the software is. I think this is understandable,
but if this is not possible at all we will have to adapt.


> > DelegateCgroupLeaf=. If set to yes an extra directory will be
> > created into the unit cgroup to place the newly spawned service process.
> > This is useful for services which need to be restarted while their forked
> > pids remain in the cgroup and the service cgroup is not a leaf
> > anymore.
>
> No. Let's not add that.
>

I could foresee the benefits of such an option, but I can 

Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-03 Thread Lennart Poettering
On Mo, 21.02.22 22:16, Felip Moll (lip...@gmail.com) wrote:

> Silvio,
>
> As I commented in my previous post, creating every single job in a separate
> slice is an overhead I cannot assume.
> An HTC system could run thousands of jobs per second, and doing extra
> fork+execs plus waiting for systemd to fill up its internal structures and
> manage it all is a no-no.

Firing an async D-Bus packet to systemd should be hardly measurable.

But note that you can also run your main service as a service, and
then allocate a *single* scope unit for *all* your payloads. That way
you can restart your main service unit independently of the scope
unit, but you only have to issue a single request once for allocating
the scope, and not for each of your payloads.

But that too means you have to issue a bus call. If you really don't
like talking to systemd this is not going to work of course, but quite
frankly, that's a problem you are making yourself, and I am not
particularly sympathetic to it.

> One other option that I am thinking about is extending the parameters of a
> unit file, for example adding a DelegateCgroupLeaf=yes option.
>
> DelegateCgroupLeaf=. If set to yes an extra directory will be
> created into the unit cgroup to place the newly spawned service process.
> This is useful for services which need to be restarted while their forked
> pids remain in the cgroup and the service cgroup is not a leaf
> anymore.

No. Let's not add that.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-03 Thread Lennart Poettering
On Mo, 21.02.22 18:07, Felip Moll (lip...@gmail.com) wrote:

> > That's a bad idea typically, and generally a hack: the unit should
> > probably be split up differently, i.e. the processes that shall stick
> > around on restart should probably be in their own unit, i.e. another
> > service or scope unit.
>
> So, if I understand it correctly you are suggesting that every forked
> process must be started through a new systemd unit?

systemd has two different unit types: services and scopes. Both group
processes in a cgroup. But only services are where systemd actually
forks+execs (i.e. "starts a process"). If you want to fork yourself, that's
fine, then a scope unit is your thing. If you use scope units you do
everything yourself, but as part of your setup you then tell systemd
to move your process into its own scope unit.

> If that's the case it seems inconvenient, because we're talking about a job
> scheduler which sometimes may have thousands of forked processes executed
> quickly, and where performance is key.
> Having to manage a unit per each process will probably not work in this
> situation in terms of performance.

You don't really have to "manage" it. You can register a scope unit
asynchronously: it's firing off one D-Bus message, basically at the same
time you fork things off, telling systemd to put the forked process in a
new scope unit.

> The other option I can imagine is to start a new unit from my daemon of
> Type=forking, which remains forever until I decide to clean it up even if
> it doesn't have any process inside.
> Then I could put my processes in the associated cgroup instead of inside
> the main daemon cgroup. Would that make sense?

Migrating processes wildly between cgroups is messy, because it fucks
up accounting and is restricted permission-wise. Typically you want to
create a cgroup and populate it, and then stick to that.

> The issue here is that for creating the new unit I'd need my daemon to
> depend on systemd libraries, or to do some fork-exec using systemd commands
> and parsing output.

To allocate a scope unit you'd have to fire off a D-Bus method
call. No need for any systemd libraries.
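
For illustration, a hedged sketch of such a call using busctl (the unit name,
pid and description are placeholders); the method is StartTransientUnit on
org.freedesktop.systemd1.Manager, and in practice you'd make the equivalent
call from your own D-Bus client code instead:

    # wrap the already-forked pid 1234 into a new transient scope unit
    busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
        org.freedesktop.systemd1.Manager StartTransientUnit 'ssa(sv)a(sa(sv))' \
        step-1234.scope fail 2 \
        PIDs au 1 1234 \
        Description s "example job step" \
        0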

> I am trying to keep the dependencies at a minimum and I'd love to have an
> alternative.

Sorry, but if you want to rearrange processes in cgroups, or want
systemd to manage your processes orthogonal to the service concept you
have to talk to systemd.

> Yeah, I know and understand it is not supported, but I am more interested
> in the technical part of how things would break.
> I see in systemd/src/core/cgroup.c that it often differentiates a cgroup
> with delegation from one without it (!unit_cgroup_delegate(u)), but it's
> hard for me to find out how or where exactly this will mess up any
> cgroup created outside of systemd. I'd appreciate it if you could shed
> some light on why/when/where things will break in practice, or just give
> an example?

This depends highly on what precisely you do. At best systemd will
complain or just override the changes you made outside of the tree
delegated to you. You might break systemd as a whole though (for example,
add a process directly to a slice's cgroup and systemd will be very
sad).

Lennart

--
Lennart Poettering, Berlin