Re: [systemd-devel] Antw: [EXT] Re: [systemd‑devel] version bump of minimal kernel version supported by systemd?

2022-04-01 Thread Lennart Poettering
On Fr, 01.04.22 13:54, Greg Kroah-Hartman (gre...@linuxfoundation.org) wrote:

> > While it is true that the syscall interface is kept reasonably stable,
> > almost everything else gets monkeyed with a lot, because a lot of
> > kernel developers only consider the syscall interface a program
> > interface. This is a problem because a *lot* of things are only
> > accessible through other means (procfs, sysfs, uevents, etc.).
> >
> > Unfortunately, that means that in practice, the kernel interfaces that
> > userspace *must* depend on break far more than anyone likes.
>
> The above example is an interesting case.  A new feature was added, was
> around for a while, and a few _years later_ it was found out that some

That's not quite true. The breakages were actually reported pretty
quickly to the kernel people who added the offending patches, and they
even changed some things around (an incomplete patch for udev was
posted, which we merged), but the issue was still not properly
addressed. It died down then, nothing much happened, and udev
maintainers didn't bring this up again for a while, as they had other
stuff to do. The issue became more and more visible though as more
subsystems in the kernel started generating these uevents, to a point
where ignoring the issue wasn't sustainable. At that point kernel
people were pretty dismissive though (not that they were particularly
helpful in the beginning either), partly because the change was in now
for so long. So we reworked how udev worked.

> people had userspace "scripts" that broke because the feature was
> added.

nah, this broke C code all over the place, too. Not just "scripts".

I am not even disagreeing though that bind/unbind uevents made
sense to add. I just want to correct how things happened here. There
was a general disinterest from the kernel people who broke things to
fix things, and in particular major disinterest in understanding how
udev actually works and how udev rules are used IRL. (I mean, that
early patch we got and merged literally just changed udev to drop
messages with bind/unbind entirely, thus not fixing anything, just
hiding the problem with no prospect of actually making it useful for
userspace. I figure the kernel devs involved actually cared about
Android, and not classic Linux userspace, i.e. udev.)

I know the kernel people like to carry that mantra of not breaking
userspace quite like a monstrance, but IRL it's broken all the
time. Often for good reasons, quite often also for no reason but lack
of testing. Things like that will happen. But I also think that
Windows for example is probably better at not breaking their
interfaces than Linux is.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] problem starting systemd in a container using parameters --default-standard-output=fd --default-standard-error=fd:stdout

2022-03-31 Thread Lennart Poettering
On Mi, 30.03.22 17:25, masber masber (mas...@hotmail.com) wrote:

> Any idea why --default-standard-output=fd
> --default-standard-error=fd:stdout breaks systemd?

They make no sense? "fd" you can only use if you have a .socket unit
that passes in an fd to a service. But if you don#t have that it just
doesn't make any sense...

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] udevadm: Failed to scan devices: Input/output error

2022-03-31 Thread Lennart Poettering
On Do, 31.03.22 12:58, Belal, Awais (awais_be...@mentor.com) wrote:

> Hi Lennart,
>
> > No distro from the last 10y should use "udevadm settle" in the clean
> > boot path. Please work with your distro to fix that. It doesn't do
> > what people think it does, and clean-written software really doesn't
> > need that in the boot path. It just slows down boot.
>
> Thanks for pointing that out. I will definitely report this and work
> with the distro folks to see why we're doing this and drop it if we
> can work without it. However, the failure I mentioned is in the
> invocation of udevadm triigger. Here's what strace revealed
>
> faccessat(AT_FDCWD, 
> "/sys/devices/virtual/devlink/platform:firmware:zynqmp-firmware:clock-controller--platform:fd4b.gpu/uevent",
>  F_OK) = 0
> readlinkat(AT_FDCWD, 
> "/sys/devices/virtual/devlink/platform:firmware:zynqmp-firmware:clock-controller--platform:fd4b.gpu/subsystem",
>  "../../../../class/devlink", 4096) = 25
> openat(AT_FDCWD, 
> "/sys/devices/virtual/devlink/platform:firmware:zynqmp-firmware:clock-controller--platform:fd4b.gpu/uevent",
>  O_RDONLY|O_CLOEXEC) = 5
> fstat(5, {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0
> fstat(5, {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0
> read(5, "", 4096)   = 0
> close(5)= 0
> openat(AT_FDCWD, 
> "/run/udev/data/+devlink:platform:firmware:zynqmp-firmware:clock-controller--platform:fd4b.gpu",
>  O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> getdents64(4, 0x1b02edd0, 32768)= -1 EIO (Input/output
> error)

uh? getdents64() is the syscall that reads directory contents. Smells
like a kernel problem. If EIO is thrown when reading a directory, then
that's almost certainly a fuckup in the kernel, given that this
probably refers to sysfs or so.

Would be good to know which fd 4 refers to. Consider reruning the
strace with "-y". With that it will show you which fd this is
triggered from.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] udevadm: Failed to scan devices: Input/output error

2022-03-31 Thread Lennart Poettering
On Do, 31.03.22 08:53, Belal, Awais (awais_be...@mentor.com) wrote:

> Hi Lennart,
>
> > Which udevadm command is this from? The udevadm trigger invocation we
> > do during boot?
>
> This is a Yocto based build and the boot flow is using an initramfs. After 
> setting up /sys /proc and other related specifics the initramfs calls
>
> $_UDEV_DAEMON --daemon
> udevadm trigger --action=add
> udevadm settle

No distro from the last 10y should use "udevadm settle" in the clean
boot path. Please work with your distro to fix that. It doesn't do
what people think it does, and clean-written software really doesn't
need that in the boot path. It just slows down boot.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] udevadm: Failed to scan devices: Input/output error

2022-03-31 Thread Lennart Poettering
On Mi, 30.03.22 19:23, Belal, Awais (awais_be...@mentor.com) wrote:

> Now, the system boots just fine on some attempts while on others it takes 
> quite a lot of time to boot with logs such as
>
>
> Starting version 244
> Failed to scan devices: Input/output error
>   WARNING: Device /dev/ram0 not initialized in udev database even after 
> waiting 1000 microseconds.
>   WARNING: Device /dev/mmcblk0 not initialized in udev database even after 
> waiting 1000 microseconds.
>   WARNING: Device /dev/ram1 not initialized in udev database even after 
> waiting 1000 microseconds.
>   WARNING: Device /dev/mmcblk0p1 not initialized in udev database
> even after waiting 1000 microseconds.

Which udevadm command is this from? The udevadm trigger invocation we
do during boot?

can you reproduce this if you trigger manually? If you strace, do you
see where the EIO comes from?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] version bump of minimal kernel version supported by systemd?

2022-03-28 Thread Lennart Poettering
On Do, 24.03.22 10:28, Luca Boccassi (bl...@debian.org) wrote:

> > What I am trying to say is that it would actually help us a lot if
> > we'd not just be able to take croupv2 for granted but to take a
> > reasonably complete cgroupv2 for granted.
> >
> > Lennart
> >
> > --
> > Lennart Poettering, Berlin
>
> Yes, that does sound like worth exploring - our README doesn't document
> it though, do we have a list of required controllers and when they were
> introduced?

So I'd argue cgroupsv2 was pretty useless before 4.15, since it lacked
the cpu controller, which I'd argue is actually the one that matters
most. hence, before 4.15 cgroupsv2 was an experiment, not something
you could actually deploy.

some other interesting milestones:

* kcmp → 3.5
* renameat2 on all relevant file systems → 4.0
* pids controller in cgroupv1 → 4.3
* pids controller in cgroupv2 → 4.5
* cgroup namespaces → 4.6
* statx → 4.11
* pidfd → 5.3

This is just some quick search through man pages. There might be a lot
of other stuff that would make sense for us to be able to rely on.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-28 Thread Lennart Poettering
On Do, 24.03.22 14:32, Benjamin Berg (benja...@sipsolutions.net) wrote:

> HI,
>
> On Thu, 2022-03-24 at 12:40 +0100, Felip Moll wrote:
> > False, the JobRemoved signal returns the id, job, unit and result. To
> > wait for JobRemoved only needs a matching rule for this signal. The
> > matching rule can just contain the path. In fact, nothing else than
> > strings can be matched in a rule, so I may be only able to use the
> > path.
>
> I think you need to add a wildcard match before the job is created
> (i.e. before StartTransientUnit). Otherwise registering the match rule
> (using the job's object path) will race with systemd signalling that
> the job has completed.

Correct.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-28 Thread Lennart Poettering
On Do, 24.03.22 00:45, Felip Moll (fe...@schedmd.com) wrote:

> Hi, some days ago we were talking about this:
>
>
> > > Problem number two, there's a significant delay since when creating the
> > > scope, until it is ready and the pid attached into it. The only way it
> > > worked was to put a 'sleep' after the dbus call and make my process wait
> > > for the async call to dbus to be materialized. This is really
> > > un-elegant.
> >
> > If you want to synchronize in the cgroup creation to complete just
> > wait for the JobRemoved bus signal for the job returned by
> > StartTransientUnit().
> >
> >
> StartTransientUnit returns a string to a job object path. To call
> JobRemoved I need the job id, so the easier way to get it is to strip the
> last part of the returned string from StartTransientUnit job object path.
> Am I right?

JobRemoved is a signal, not a method call. i.e. not something you
call, but you are notified about. And it originates from an object and
objects have object paths in D-Bus.

> Once I have the job id, I can then subscribe to JobRemoved bus signal for
> the recently created job, but what happens if during the time I am
> obtaining the ID or parsing the output, the job is finished? Will I lose
> the signal?

Yes. D-Bus sucks that way. You ave to subscribe to all jobs first, and
the filte rout the ones you don#t want.

> What is the correct order of doing a StartTransientUnit and wait for the
> job to be finished (done, failed, whatever) ?

first subscribe to JobRemoved, then issue StartTransientUnit, and then
wait until you see JobRemoved for the unit you just started.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] version bump of minimal kernel version supported by systemd?

2022-03-24 Thread Lennart Poettering
On Do, 24.03.22 14:05, Zbigniew Jędrzejewski-Szmek (zbys...@in.waw.pl) wrote:

> > Yes, that does sound like worth exploring - our README doesn't document
> > it though, do we have a list of required controllers and when they were
> > introduced?
>
> In the README:
>   Linux kernel >= 4.2 for unified cgroup hierarchy support
>   Linux kernel >= 4.10 for cgroup-bpf egress and ingress hooks
>   Linux kernel >= 4.15 for cgroup-bpf device hook
>   Linux kernel >= 4.17 for cgroup-bpf socket address hooks
>
> In this light, 4.19 is better than 4.4 or 4.9 ;)

Well, the list is not complete. i.e. the "io" controller came late
iirc. And killing and stuff too. would take some work to figure out
which features of cgroupv2 we actually make us of, and then when they
were added.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] /etc/os-release but for images

2022-03-24 Thread Lennart Poettering
On Mi, 23.03.22 17:14, Davide Bettio (davide.bet...@secomind.com) wrote:

> I opened this PR: https://github.com/systemd/systemd/pull/22841
>
> This doesn't enable full semver support since that would require allowing
> A-Z range, but I'm not sure if it is something we really want to enable
> (uppercase in semver looks quite uncommon by the way).

how does semver suggest uppercase chars are handled? is "0.1.1a" newer
than "0.1.1A"?

> I would use the UUID as a further metadata in addition to IMAGE_VERSION,
> also for the reasons I describe later here in this mail.

Sounds like something you could just add as suffix to IMAGE_VERSION, no?

> > > Compared to other options an UUID here would be both reliable and easy to
> > > handle and generate.
> >
> > UUID is are effectively randomly generated. That sucks for build
> > processes I am sure, simply because they hence aren't reproducible.
> >
>
> Using a reliable digest, such as the one that can be generated with `casync
> digest`, was my first option, however if you think about an update system
> such such as RAUC and its bundles, you might still have the same exact
> filesystem digest mapping to a number of different bundles, since they can
> bring different hook scripts and whatever.
> I'm also aware of scenarios where the same filesystem tree has been used to
> generate different flash images in order to satisfy different flash sizes /
> NAND technologies.
> So from a practical point of view having an UUID, and forcing a different
> one in /etc/os-release every time a filesystem image / RAUC bundle is
> created allows us to have a reasonable 1:1 mapping between the update
> artifact and the system image that is on the device.
> Last but not least having it in /etc/os-release makes it pretty convenient
> to read it (and for sure using an UUID is easier than trying to embed the
> digest of the filesystem where  /etc/os-release is kept ;) )
> Also there is an interesting bonus: UUID is globally unique also in
> scenarios where users try to delete and recreate version tags without
> incrementing the version number (or other messy scenarios).

Shouldn't you use the fs header uuid? or the GPT partition or overall
uuids?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] version bump of minimal kernel version supported by systemd?

2022-03-24 Thread Lennart Poettering
On Mi, 23.03.22 11:28, Luca Boccassi (bl...@debian.org) wrote:

> At least according to our documentation it wouldn't save us much
> anyway, as the biggest leap is taking cgroupv2 for granted, which
> requires 4.1, so it's included regardless. Unless there's something
> undocumented that would make a big difference, in practical terms of
> maintainability?

Note that "cgroupv2 exists" and "cgroupv2 works well" are two distinct
things. Initially too few controllers supported cgroupv2 for cgroupv2
to be actually useful.

What I am trying to say is that it would actually help us a lot if
we'd not just be able to take croupv2 for granted but to take a
reasonably complete cgroupv2 for granted.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Antw: [EXT] Re: version bump of minimal kernel version supported by systemd?

2022-03-24 Thread Lennart Poettering
On Do, 24.03.22 08:21, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote:

> I wonder:
>
> Why not providing some test suite instead: If the test suite succeeds, systemd
> might work; if it doesn't, manual steps are needed.

One goal here is to reduce our maintainance burden, not increase
it. Another is to communicate clearly what we support and what we
don't. Any such test suite collides with both these goals.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] /etc/os-release but for images

2022-03-23 Thread Lennart Poettering
On Mi, 23.03.22 13:38, Davide Bettio (davide.bet...@secomind.com) wrote:

> > That's the idea: take the packages, build an image, and then append
> > IMAGE_ID/IMAGE_VERSION to it?
>
> Sure, sounds pretty convenient, here my point was about blindly appending
> those additional fields (trusting the distribution didn't already set
> them).

I don't know your distro. But I'd certainly view it as a bug if your
distro fills in these two fields but doesn't actually work based on
pre-built images, but is solely package-based.

> > > I cook a new image, furthermore I plan to replace the whole operating
> > > system image (that I keep read-only) in order to update it, so BUILD_ID
> > > would change at every update (so it sounds slightly different from the
> > > original described semantic).
> >
> > BUILD_ID is not for that. You are looking for IMAGE_VERSION.
>
> IMAGE_VERSION didn't look to me a good option for identifying nightly
> builds, or any kind of build that wasn't cooked from a tagged image build
> recipe.

I think it should be fine for that.

> Also sadly IMAGE_VERSION doesn't allow + which is used from semver for
> build metadata (such as 1.0.0+21AF26D3 or 1.0.0+20130313144700).

Ah, interesting. This looks like something to fix in our syntax
descriptio though. I am pretty sure we should allow all characters
that semver requires.

Can you file an RFE issue about this on github? Or even better, submit
a PR that adds that?

That said, I'd always include some time-based counter in automatic
builds, so that the builds can be ordered. Things like sd-boot's boot
menu entry ordering relies on that.

> That's pretty useful when storing a relation between the exact update
> bundle that has been used to flash a device, and the system flashed on a
> device. It turns out to be pretty useful when collecting stats about a
> device fleet or when remote managing system versions and their updates.
> So what I would do on os-release would be storing an UUID that is generated
> every time a system image is generated, that UUID can be collected/sent at
> runtime from a device to a remote management service.

Why wouldn't the IMAGE_VERSION suffice for that? Why pick a UUID where
a version already works?

> Compared to other options an UUID here would be both reliable and easy to
> handle and generate.

UUID is are effectively randomly generated. That sucks for build
processes I am sure, simply because they hence aren't reproducible.

BTW, there's now also this:

https://systemd.io/BUILDING_IMAGES/#image-metadata

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] /etc/os-release but for images

2022-03-23 Thread Lennart Poettering
On Mi, 23.03.22 10:51, Davide Bettio (davide.bet...@secomind.com) wrote:

> Hello,
>
> First of all, thanks for your answers.
>
> It wasn't really clear to me that the /etc/os-release file was editable
> from a 3rd party other than the distribution maintainers, so thanks for the
> clarifications.

Well, it's not precisely supposed to be something users or admins
should edit. But image builders may.

> Are the distributions required to leave IMAGE_ID and
> IMAGE_VERSION empty?

Well, if the distribution people build both packages and disk images,
they can set IMAGE_ID/IMAGE_VERSION for the latter. But this should
always be part of building images, not of building packages.

> Can I safely just append those fields at the end of
> the copy of the /etc/os-release file?

That's the idea: take the packages, build an image, and then append
IMAGE_ID/IMAGE_VERSION to it?

> Speaking of BUILD_ID, according to the spec sounds like a field
> reserved to

BUILD_ID? That's a different thing...

https://www.freedesktop.org/software/systemd/man/os-release.html#IMAGE_ID=
vs.
https://www.freedesktop.org/software/systemd/man/os-release.html#BUILD_ID=

> distributions: "BUILD_ID may be used in distributions where the original
> installation image version is important", from my side what I need is to
> identify the git revision + build date of the recipe I'm using to cook the
> image installed on the system, also my plan is to change that ID every time
> I cook a new image, furthermore I plan to replace the whole operating
> system image (that I keep read-only) in order to update it, so BUILD_ID
> would change at every update (so it sounds slightly different from the
> original described semantic).

BUILD_ID is not for that. You are looking for IMAGE_VERSION.

> Last but not least, I was looking for a machine parsable unique id, so I
> plan to use BUILD_UUID if it is not kept reserved for other usages, that
> will be an UUID that is freshly generated every time I cook a new image.

What's this for?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd move processes to user.slice cgroup after updating service configuration file

2022-03-23 Thread Lennart Poettering
On Mi, 23.03.22 14:25, 吾为男子 (csren...@qq.com) wrote:

> dear all experts,
>
> now we have such a problem:
>
> we need to update our systemd service configuration file,
>
> before updating, our service has already created some processes and
> make them attach to cgroup
> /system.slice/{our-service-name}.service/{our-service-sub-group},
> this is what we would expect,
>
> but, on some machine, sometimes, after we updating our service
> configuration file, these processesas mentioned above,
> will be moved to /user.slice, this is what we do NOT
> expect,there is a certain probability that this will happen

Is it possible that said service invokes sudo or su or so, or in some
other way opens a PAM session? If so, this will migrate the calling
process into a per-session cgroup below user.slice.

What's the precise cgroup slice of one such occurance?

> how to prevent this action from systemd? it will be a great honor
> for me to get your help, thanks.

Don't use sudo/su from scripts. If you need to acquire privileges from
a script, use util-linux' setpriv tool. It will change privileges for
you but without opening a PAM session, and thus without cgroup
migratory effect.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] /etc/os-release but for images

2022-03-22 Thread Lennart Poettering
On Di, 22.03.22 17:10, Davide Bettio (davide.bet...@secomind.com) wrote:

> Hello,
>
> I would like to figure out if anyone has proposed any kind of standard for
> storing metadata about a system image.

The IMAGE_ID= and IMAGE_VERSION= fields from /etc/os-release are
supposed to be used for that.

i.e. the idea was that you can take a generic distro (fedora, debian,
…) build an image off it, and it will put its own os info in
/usr/lib/os-release, and make /etc/os-release a symlink to it. Your
image build would then replace /etc/os-release with a file, that
incorporates /usr/lib/os-release and adds in IMAGE_ID=/IMAGE_VERSION=.

Each time you rebuild the image your image building tool would repeat
that step. i.e. it would be the image builder tool's job to extend the
generic OS data from /usr/lib/ with info about the image and place the
result in /etc/.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] find_device() and FOREACH_DEVICE_DEVLINK memory leaks on "systemd-249"

2022-03-21 Thread Lennart Poettering
On So, 13.03.22 19:14, Tony Rodriguez (unixpro1...@gmail.com) wrote:

> Valgrind is reporting "still reachable" memory leak (2 blocks) when calling
> find_device() and FOREACH_DEVICE_DEVLINK against "systemd-249". In my case,
> they are both called within fstab-generator.c on "systemd-249". Only code
> modifications, on my end, are within fstab-generator.c

The mempool stuff is not really "leaked": it's an allocation cache,
i.e. subsequent calls will reuse the already allocated objects. The
stuff is hence reachable via the allocation cache.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-16 Thread Lennart Poettering
On Mi, 16.03.22 17:35, Michal Koutný (mkou...@suse.com) wrote:

> True, in the unified mode it should be safe doing manually.
> I was worried about migrating e.g. MainPID of a service into this scope
> but PID1 should handle that AFAICS. Also since this has to be performed
> by the privileged user (scopes are root's), the manual migration works.

This is actually a common case: for getty style login process the main
process of the getty service will migrate to the new scope. A service
is thus always a cgroup *and* a main pid for us, in case the main pid
is outside of the cgroup. And conversely, a process can be associated
to multiple units this way. It can be main pid of one service and be
in a cgroup of a scope.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-16 Thread Lennart Poettering
On Mi, 16.03.22 17:30, Felip Moll (fe...@schedmd.com) wrote:

> > > (The above is slightly misleading) there could be an alternative of
> > > something like RemainAfterExit=yes for scopes, i.e. such scopes would
> > > not be stopped after last process exiting (but systemd would still be in
> > > charge of cleaning the cgroup after explicit stop request and that'd
> > > also mark the scope as truly stopped).
> >
> > Yeah, I'd be fine with adding RemainAfterExit= to scope units
> >
> >
> Note that what Michal is saying is "something like RemainAfterExit=yes for
> scopes", which means systemd would NOT clean up the cgroup tree when there
> are no processes inside.
> AFAIK RemainAfterExit for services actually does cleanup the cgroup tree if
> there are no more processes in it.

It doesn't do that if delegation is on (iirc, if not I'd consider that
a bug). Same logic should apply here.

> If that behavior of keeping the cgroup tree even if there are no pids is
> what you agree with, then I coincide is a good idea to include this option
> to scopes.

Yes, that is what I was suggesting this would do.

> > > Such a recycled scope would only be useful via
> > > org.freedesktop.systemd1.Manager.AttachProcessesToUnit().
> >
> > Well, if delegation is on, then people don#t really have to use our
> > API, they can just do that themselves.
>
> That's not exact. If slurmd (my main process) forks a slurmstepd (child
> process) and I want to move slurmstepd into a delegated subtree from the
> scope I already created, I must use AttachProcessesToUnit(), isn't that
> true?

depends on your privs. You can just move it yourself if you have
enough privs.

See commit msg in 6592b9759cae509b407a3b49603498468bf5d276

> Or are you saying that I can just migrate processes wildly without
> informing systemd and just doing an 'echo > cgroup.procs' from one
> non-delegated tree to my delegated subtree?

yeah, you can do that.

Note that (independently of systemd) you shouldn't migrate stuff to
aggressively, since it fucks up kernel resource accounting. i.e. it is
wise to minimize process migration in cgroups and always migrate plus
shortly after exec(), or even better do a clone(CLONE_INTO_CGROUP) –
though unfortunately the latter cannot work with glibc right now :-(.

i.e. keeping processes that already "have history" around for a long
time after migration kinda sucks.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-16 Thread Lennart Poettering
On Mi, 16.03.22 16:15, Felip Moll (fe...@schedmd.com) wrote:

> On Tue, Mar 15, 2022 at 5:24 PM Michal Koutný  wrote:
>
> > On Tue, Mar 15, 2022 at 04:35:12PM +0100, Felip Moll 
> > wrote:
> > > Meaning that it would be great to have a delegated cgroup subtree without
> > > the need of a service or scope.
> > > Just an empty subtree.
> >
> > It looks appealing to add Delegate= directive to slice units.
> > Firstly, that'd prevent the use of the slice by anything systemd.
> > Then some notion of owner of that subtree would have to be defined (if
> > only for cleanup).
> > That owner would be a process -- bang, you created a service with
> > delegation or a scope with "keepalive" process.
> >
> >
> Correct, this is how the current systemd design works.
> But... what if the concept of owner was irrelevant? What if we could just
> tell systemd, hey, give me /sys/fs/cgroup/mysubdir and never ever touch it
> or do anything to it or pids residing into it.

No, that's not something we will offer. We bind a lot of meaning to
the cgroup concept. i.e. we derive unit info from it, and many things
are based on that. For example any client logging to journald will do
so from a cgroup and we pick that up to know which service logging is
from, and store that away and use it for filtering, for picking
per-unit log settings and so on.

Moreover we need to be able to shutdown all processes on the system in
a systematic way for shutdown, and we do that based on units, and the
ordering between them. Having processes and cgroups that live entirely
independent makes a total mess from this.

And there's a lot more, like resource mgmt: we want that all processes
on the system are placed in a unit of some form so that we can apply
useful resource mgmt to it.

So yes you can have a delegated subtree, if you like and we'll not
interfere with what you do there mostly, but it must be a leaf of our
tree, and we'll "macro manage" it for you, i.e. define a lifetime for
it, and track processes back to it.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-16 Thread Lennart Poettering
On Di, 15.03.22 17:24, Michal Koutný (mkou...@suse.com) wrote:

> On Tue, Mar 15, 2022 at 04:35:12PM +0100, Felip Moll  
> wrote:
> > Meaning that it would be great to have a delegated cgroup subtree without
> > the need of a service or scope.
> > Just an empty subtree.
>
> It looks appealing to add Delegate= directive to slice units.

Hm? Slice units are *inner* node of *our* cgroup trees. if we'd allow
delegation of that, then we'd could not put stuff inside it, hence it
wouldn't be a slice because it couldn#t contain anything anymore.

> Firstly, that'd prevent the use of the slice by anything systemd.

yeah, precisely? i don't follow. What would a slice with delegation be
that a scope with delegation isn't already?

> Then some notion of owner of that subtree would have to be defined (if
> only for cleanup).

scopes already have that, so why not use that?

> That owner would be a process -- bang, you created a service with
> delegation or a scope with "keepalive" process.

can't parse this.

> (The above is slightly misleading) there could be an alternative of
> something like RemainAfterExit=yes for scopes, i.e. such scopes would
> not be stopped after last process exiting (but systemd would still be in
> charge of cleaning the cgroup after explicit stop request and that'd
> also mark the scope as truly stopped).

Yeah, I'd be fine with adding RemainAfterExit= to scope units

> Such a recycled scope would only be useful via
> org.freedesktop.systemd1.Manager.AttachProcessesToUnit().

Well, if delegation is on, then people don#t really have to use our
API, they can just do that themselves.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-16 Thread Lennart Poettering
On Di, 15.03.22 16:35, Felip Moll (fe...@schedmd.com) wrote:

> > I don't follow. You can enable delegation on the scope. I mean, that's
> > the reason I suggested to use a scope.
> >
> >
> Meaning that it would be great to have a delegated cgroup subtree without
> the need of a service or scope.
> Just an empty subtree.

That's what a scope is. I don't follow?

What do you think a scope is beyond that? It just encapsulates a
cgroup subtree. It auto-cleans it though once it goes empty, and
because it does that it also requires you to provide at least one PID
to add to the scope when it is created.

For services we have a RemainAfterExit= property btw. There were
requests for adding the same for scopes. I'd be fine with adding that,
happy to take a patch.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-15 Thread Lennart Poettering
On Di, 15.03.22 10:50, Felip Moll (fe...@schedmd.com) wrote:

> Another thing I have found is that if the process which created the scope
> (e.g. my main process, slurmd) terminates, then the scope is stopped even
> if I abandoned it and there's a pid inside.
> So this makes the proposed solution not working. What am I missing?
>
> ● gamba11_slurmstepd.scope
>  Loaded: loaded (/run/systemd/transient/gamba11_slurmstepd.scope;
> transient)
>  Transient: yes
>  Active: active (abandoned) since Tue 2022-03-15 10:40:34 CET; 4s ago

It's shown as active, so where is the problem?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-15 Thread Lennart Poettering
On Mo, 14.03.22 23:12, Felip Moll (fe...@schedmd.com) wrote:

> > But note that you can also run your main service as a service, and
> > then allocate a *single* scope unit for *all* your payloads.
>
> The main issue is the scope needs a pid attached to it. I thought that the
> scope could live without any process inside, but that's not happening.
> So every time a user step/job finishes, my main process must take care of
> it, and launch the scope again on the next coming job.

Leave a stub process around in it. i.e something similar to
"/bin/sleep infinity".

> The forked process just does the dbus call, and when the scope is ready it
> is moved to the corresponding cgroup (PIDFile=).

Hmm? PIDFile= is a property of *services*, not *scopes*.

And "scopes" cannot be moved to "cgroups". I cannot parse the above.

Did you read up on scopes and services?

See https://systemd.io/CGROUP_DELEGATION/, it explains the concept of
"scopes". Scopes *have* cgroups, but cannot be moved to "cgroups".

> Problem number one: if other processes are in the scope, the dbus call
> won't work since I am using the same name all the time, e.g.
> slurmstepd.scope.
> So I first need to check if the scope exists and if so put the new
> slurmstepd process inside. But we still have the race condition, if during
> this phase all steps ends, systemd will do the cleanup.

Leave a stub process around in it.

> Problem number two, there's a significant delay since when creating the
> scope, until it is ready and the pid attached into it. The only way it
> worked was to put a 'sleep' after the dbus call and make my process wait
> for the async call to dbus to be materialized. This is really
> un-elegant.

If you want to synchronize in the cgroup creation to complete just
wait for the JobRemoved bus signal for the job returned by
StartTransientUnit().

> If instead I could just ask systemd to delegate a part of the tree for my
> processes, then everything would be solved.

I don't follow. You can enable delegation on the scope. I mean, that's
the reason I suggested to use a scope.

> Do you have any other suggestions?

Not really, except maybe: please read up on the documentation, it
explains a lot of the concepts.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] PrivateNetwork=yes is memory costly

2022-03-10 Thread Lennart Poettering
On Do, 10.03.22 11:50, Christopher Wong (christopher.w...@axis.com) wrote:

> Hi Lennart,
>
>
> It is definitely a functionality we want to use. However, the memory came as 
> an unexpected side effect. Since we are not only enabling this for one single 
> service, instead we are applying it globally for all services.
>
> Now due to this huge memory consumption we are trying to put
> everything into the same namespace using
> JoinsNamespaceOf=. It seems to consume less memory.

This means they will still be isolated from the network, but no longer
from each other.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] PrivateNetwork=yes is memory costly

2022-03-09 Thread Lennart Poettering
On Mo, 07.03.22 15:10, Christopher Wong (christopher.w...@axis.com) wrote:

> Hi,
>
>
> It seems that PrivateNetwork=yes is a memory consuming
> directive. The kernel seems to allocate quite an amount of memory
> for each service (~50 kB) that has this directive enabled. I wonder
> if this is expected and if anyone has had similar experience?

PrivateNetwork=yes means that a private network namespace is allocated
for the service. If you think network namespaces are too expensive in
their current implementation, please bring this up with the kernel
people, because they are a kernel concept after all, we just allocate
them if told so.

network namespaces are an effective way to disconnect a service from
the network, if the service doesn't need it. It's probably one of the
most relevant sandboxing options we offer, since disabling the attack
surface called "network" for a service is of such major
importance. That said, if you disable the network namespace
functionality in the kernel systemd will handle this gracefully, and
not use it. If the feature is available in the kernel we will however
use it.

> Is there any ways to reduce the usage?

Besides turning it off? Nothing I was aware of.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] making firewalld an early boot service

2022-03-09 Thread Lennart Poettering
On Mi, 09.03.22 08:17, Michael Biebl (mbi...@gmail.com) wrote:

> > firewalld requires D-Bus so it must be started after D-Bus. You cannot
> > start it earlier.
>
> See above, being Type=dbus, it has an explicit
> Requires/After=dbus.socket.

It has After=dbus.service, not After=dbus.socket, no?

That's a difference during shutdown: if you order against the service
this means you can still talk via the broker on shutdown. if you only
order against the socket the broker might be dead by the time you
shutdown.

Ideally services would be written in a style that they just exit at
shutdown and don't need to tdo D-Bus anymore just to exit. But of
course reality isn't always ideal.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] making firewalld an early boot service

2022-03-09 Thread Lennart Poettering
On Mi, 09.03.22 08:49, Andrei Borzenkov (arvidj...@gmail.com) wrote:

> On 09.03.2022 00:59, Michael Biebl wrote:
> > Hi,
> >
> > I need help with firewalld issue, specifically
> > https://github.com/firewalld/firewalld/issues/414
> >
> > the TLDR: both firewalld.service and cloud-init-local.service hook
> > into network-pre.target and have a Before=network-pre.target ordering.
> >
> > cloud-init-local.service is an early boot service using
> > DefaultDependencies=no and before sysinit.target.
> > firewalld.service via DefaultDependencies=yes get's an
> > After=sysinit.target ordering.
> >
> > So we have conflicting requirements and a dependency loop that needs
> > to be broken by systemd.
> >
>
> Firewalld is red herring here. cloud-init.service has
>
> After=networking.service

What is this unit? Is this a Debian thing?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] making firewalld an early boot service

2022-03-09 Thread Lennart Poettering
e65;6602;1cOn Di, 08.03.22 22:59, Michael Biebl (mbi...@gmail.com) wrote:

> I wonder if firewald should be turned into an early boot service as
> well.

I doubt you can do that. Thing is that firewalld uses D-Bus, and
services that do D-Bus will have a hard time to run during early boot.

In systemd we have some services which do D-Bus and run in early boot,
specifically networkd, resolved and systemd itself. They do that by
simply not doing D-Bus that early, and watching the d-bus socket so
that they connect the moment it becomes available. It's ugly as fuck,
though and very hard to get right, it took us quite some time to get
this reasonably right and race-free.

Last time I looked firewalld is a bunch of scripts around iptables/nft
shell outs? I have my doubts it's going to be easy to make that work,
i.e. add the glue it needs to instantly connect to D-Bus once it
becomes available in a race-free fashion-

> Currently it looks like this:
>
> [Unit]
> Description=firewalld - dynamic firewall daemon
> Before=network-pre.target

Network management services such as networkd are early-boot
services. A late boot service ordered before network-pre.target and
thus networkd is hence already an ordering cycle.

> After=dbus.service
> After=polkit.service

These two are late boot service, hence hard to move to early boot if
you keep them.

> I wonder if the following would make sense
>
>
> [Unit]
> Description=firewalld - dynamic firewall daemon
> DefaultDependencies=no
> Before=network-pre.target
> Wants=network-pre.target
> After=local-fs.target
> Conflicts=iptables.service ip6tables.service ebtables.service
> ipset.service nftables.service
> Documentation=man:firewalld(1)
>
> [Service]
> ...
> [Install]
> WantedBy=sysinit.target

It should also have Before=sysinit.target really.

> Alias=dbus-org.fedoraproject.FirewallD1.service

> I dropped the After=dbus.service polkit.service orderings, as they are
> either socket or D-Bus activated services, added an explicit
> After=local-fs.target ordering just to be sure and hooked it into
> sysinit.target.

My educated guess is that they want After=dbus.service mostly for
shutdown ordering, i.e. so that they can still be talked to while the
system goes down or so?

The thing though is: i doubt firewalld is able to handle the case
where the dbus broker isn't connectible yet.

> Would you agree that making a firewall service an early boot service
> is a good idea?

Well, I am not a fan of the firewalld concept tbh. But yes, if you buy
into the idea of firewalld, then you have to make it an early boot
service really, if you intend to be compatible with early boot
networking. That said, I think NetworkManager is not early-boot either
right now, is it? So you have to move that too. But in that case too,
not sure if it can deal with D-Bus not being around.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Antw: [EXT] Re: timer "OnBootSec=15m" not triggering

2022-03-07 Thread Lennart Poettering
On Mo, 07.03.22 12:24, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote:

> Thanks for that. The amazing things are that "systemd.analyze verify" finds no
> problem and "enable" virtually succeeds, too:

Because there is no problem really: systemd allows you to define your
targets as you like, and we generally focus on a model where you can
extend stuff without requiring it to be installed. i.e. we want to
allow lose coupling, where stuff can be ordered against other stuff,
or be pulled in by other stuff without requiring that the other stuff
to be hard installed. Thus you can declare that you want to be pulled
in by a target that doesn't exist, and that's *not* considered an
issue, because it might just mean that you haven't installed the
package that defines it.

Example: if you install mysql and apache, then there's a good reason
you want that mysql runs before apache, so that the web apps you run
on apache can access mysql. Still it should be totally OK to install
one without the other, and it's not a bug thus if one refers to the
other in its unit files, even if the other thing is not installed.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd failing to close unwanted file descriptors & FDS spawning and crashing

2022-03-04 Thread Lennart Poettering
On Fr, 04.03.22 09:26, Christopher Obbard (chris.obb...@collabora.com) wrote:

> Right, so it looks like the call to close_range fails. This is a 5.4 kernel
> which doesn;t have close_range - so this is understandable.
>
> For a quick fix, I set have_close_range to false - see the patch attached.
> It seemed to work well.
>
> Since my 5.4 kernel is a heavily modified downstream one - next I will check
> if that syscall was implemented by someone else, and also I will check if
> vanilla systemd works on vanilla 5.4 (there is no reason why it shouldn't,
> right?).

Hmm, this is strange. Our code already has fallback paths in place to
handle cases where the syscall is not implemented, i.e. where we see
ENOSYS when we call it. Our code should handle this perfectly already.

Is it possible that your patched kernel added non-standard syscalls
under the syscall numbers the official kernel later assigned to
close_range()? If so, this would explain that we see EINVAL, as of
course we call the syscall assuming it was close_range(), but if it is
actually something else mit very likely might not be able to make
sense of our parameters and thus return EINVAL.

In this case I am not very sympathetic to your case: squatting syscall
numbers is just a terrible idea...

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Antw: [EXT] Re: [systemd‑devel] How to find out the processes systemd‑shutdown is waiting for?

2022-03-04 Thread Lennart Poettering
On Fr, 04.03.22 08:20, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote:

> >> Something seems to be off with containerd's integration into systemd but
> >> I'm not sure what.
> >
> > Docker traditionally has not followed any of our documented ways to
>
> You are implying that "our documented ways" is a definitive
> standard?

Not sure what a "standard" is, but yeah, systemd defines a non-trivial
part of the APIs of general purpose Linux distributions.

> > interact with cgroups, even though they were made aware of them, not
> > sure why, I think some systemd hate plays a role there. I am not sure
> > if this has changed, but please contact Docker if you have issues with
> > Docker, they have to fix their stuff themselves, we cannot work around
> > it.
>
> The problem with systemnd (people) is that they try to establish new standards
> outside of systemd.
>
> "If A does not work with systemd", it's always A that is broken, never systemd
> ;-)

It's a stack of software. The lower layers dictate how the upper layers
interact with the lower layers, not the other way around. Yes, systemd
has bugs, but here we are not at fault, we document our interfaces,
but Docker knowingly goes its own way, and there's little I can do
about it.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] How to find out the processes systemd-shutdown is waiting for?

2022-03-03 Thread Lennart Poettering
On Mi, 02.03.22 17:50, Lennart Poettering (lenn...@poettering.net) wrote:

> That said, we could certainly show both the comm field and the PID of
> the offending processes. I am prepping a patch for that.

See: https://github.com/systemd/systemd/pull/22655

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-03 Thread Lennart Poettering
On Do, 03.03.22 18:35, Felip Moll (fe...@schedmd.com) wrote:

> I have read and studied all your suggestions and I understand them.
> I also did some performance tests in which I fork+executed a systemd-run to
> launch a service for every step and I got bad performance overall.
> One of our QA tests (test 9.8 of our testsuite) shows a decrease of
> performance of 3x.

systemd-run is synchronous, and unless you specify "--scope" it will
tell systemd to fork things off instead of doing that client-side,
which I understand is what you want to do. The fact it's synchronous,
i.e. waits for completion of the whole operation (including start-up
of dependencies and whatnot) necessarily means it's slow.

> > But note that you can also run your main service as a service, and
> > then allocate a *single* scope unit for *all* your payloads. That way
> > you can restart your main service unit independently of the scope
> > unit, but you only have to issue a single request once for allocating
> > the scope, and not for each of your payloads.
> >
> >
> My questions are, where would the scope reside? Does it have an associated
> cgroup?

Yes, I explicitly pointed you to them, it's why I suggested you use
them.

My recommendation if you hack on stuff like this is reading the docs
btw, specifically:

 https://systemd.io/CGROUP_DELEGATION

It pretty explicitly lists your options in the "Three Scenarios"
section.

It also explains what scope units are and when to use htme.

> I am also curious of what this sentence does exactly mean:
>
> "You might break systemd as a whole though (for example, add a process
> directly to a slice's cgroup and systemd will be very sad).".

if you add a process to a cgroup systemd manages that is supposed to
be an inner one in the tree, you will make creation of children fail
that way, and thus starting services and other operations will likely
start failing all over the place.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-03 Thread Lennart Poettering
On Mo, 21.02.22 22:16, Felip Moll (lip...@gmail.com) wrote:

> Silvio,
>
> As I commented in my previous post, creating every single job in a separate
> slice is an overhead I cannot assume.
> An HTC system could run thousands of jobs per second, and doing extra
> fork+execs plus waiting for systemd to fill up its internal structures and
> manage it all is a no-no.

Firing an async D-Bus packet to systemd should be hardly measurable.

But note that you can also run your main service as a service, and
then allocate a *single* scope unit for *all* your payloads. That way
you can restart your main service unit independently of the scope
unit, but you only have to issue a single request once for allocating
the scope, and not for each of your payloads.

But that too means you have to issue a bus call. If you really don't
like talking to systemd this is not going to work of course, but quite
frankly, that's a problem you are making yourself, and I am not
particularly sympathetic to it.

> One other option that I am thinking about is extending the parameters of a
> unit file, for example adding a DelegateCgroupLeaf=yes option.
>
> DelegateCgroupLeaf=. If set to yes an extra directory will be
> created into the unit cgroup to place the newly spawned service process.
> This is useful for services which need to be restarted while its forked
> pids remain in the cgroup and the service cgroup is not a leaf
> anymore.

No. Let's not add that.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-03 Thread Lennart Poettering
On Mo, 21.02.22 18:07, Felip Moll (lip...@gmail.com) wrote:

> > That's a bad idea typically, and a generally a hack: the unit should
> > probably be split up differently, i.e. the processes that shall stick
> > around on restart should probably be in their own unit, i.e. another
> > service or scope unit.
>
> So, if I understand it correctly you are suggesting that every forked
> process must be started through a new systemd unit?

systemd has two different unit types: services and scopes. Both group
processes in a cgroup. But only services are where systemd actually
forks+execs (i.e. "starts a process"). If you want to fork yourself, that's
fine, then a scope unit is your thing. If you use scope units you do
everything yourself, but as part of your setup you then tell systemd
to move your process into its own scope unit.

> If that's the case it seems inconvenient because we're talking about a job
> scheduler where sometimes may have thousands of forked processes executed
> quickly, and where performance is key.
> Having to manage a unit per each process will probably not work in this
> situation in terms of performance.

You don't really have to "manage" it. You can register a scope unit
asynchronously, it's firing off one dbus message basically at the same
time you fork things off, telling systemd to put it in a new scope unit.

> The other option I can imagine is to start a new unit from my daemon of
> Type=forking, which remains forever until I decide to clean it up even if
> it doesn't have any process inside.
> Then I could put my processes in the associated cgroup instead of inside
> the main daemon cgroup. Would that make sense?

Migrating processes wildly between cgroups is messy, because it fucks
up accounting and is restricted permission-wise. Typically you want to
create a cgroup and populate it, and then stick to that.

> The issue here is that for creating the new unit I'd need my daemon to
> depend on systemd libraries, or to do some fork-exec using systemd commands
> and parsing output.

To allocate a scope unit you'd have to fire off a D-Bus method
call. No need for any systemd libraries.

> I am trying to keep the dependencies at a minimum and I'd love to have an
> alternative.

Sorry, but if you want to rearrange processes in cgroups, or want
systemd to manage your processes orthogonal to the service concept you
have to talk to systemd.

> Yeah, I know and understand it is not supported, but I am more interested
> in the technical part of how things would break.
> I see in systemd/src/core/cgroup.c that it often differentiates a cgroup
> with delegation with one without it (!unit_cgroup_delegate(u)), but it's
> hard for me to find out how or where this exactly will mess up with any
> cgroup created outside of systemd. I'd appreciate it if you can give me
> some light on why/when/where things will break in practice, or just an
> example?

THis depends highly on what precisely you do. At best systemd will
complain or just override the changes you did outside of the tree you
got delegated. You might break systemd as a whole though (for example,
add a process directly to a slice's cgroup and systemd will be very
sad).

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] How to find out the processes systemd-shutdown is waiting for?

2022-03-02 Thread Lennart Poettering
On Mi, 02.03.22 12:29, Manuel Wagesreither (man...@fastmail.fm) wrote:

> Hi all,
>
> My embedded system is shutting down rather slow, and I'd like to find out the 
> responsible processes.
>
> [ 7668.571133] systemd-shutdown[1]: Waiting for process: dockerd, python3
> [ 7674.670684] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
>
> Is there a way for systemd-shutdown to give me the PID of the
> processes it waits for?

The messages above are from the second phase of shutdown, the "killing
spree phase": it kills processes/unmounts file systems/detaches
storage/… that is left over from the first phase of shutdown. If stuff
is terminated here, and in particular processes are killed this almost
certainly indicates something is seriously broken in the first phase
of shutdown. Or in other words: don#t try to debug the second phase
too much, the first phase is where the brokeness is.

Since this is dockerized stuff: look at docker, they are pretty bad at
cooperating with systemd in the cgroup hierarchy, and they really need
to clean up their own stuff properly when going down. If they don't do
that, you need to fix that in Docker. Or in other words: talk to the
docker people aout all this.

That said, we could certainly show both the comm field and the PID of
the offending processes. I am prepping a patch for that.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] How to find out the processes systemd-shutdown is waiting for?

2022-03-02 Thread Lennart Poettering
On Mi, 02.03.22 13:02, Arian van Putten (arian.vanput...@gmail.com) wrote:

> I've seen this a lot with docker/containerd. It seems as if for some reason
> systemd doesn't wait for their  cgroups to cleaned up on shutdown. It's
> very easy to reproduce. Start a docker container and then power off the
> machine. Since the move to cgroups V2 containerd should be using systemd to
> manage the cgroup tree so a bit puzzling why it's always happening.
>
> Something seems to be off with containerd's integration into systemd but
> I'm not sure what.

Docker traditionally has not followed any of our documented ways to
interact with cgroups, even though they were made aware of them, not
sure why, I think some systemd hate plays a role there. I am not sure
if this has changed, but please contact Docker if you have issues with
Docker, they have to fix their stuff themselves, we cannot work around
it.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-02-21 Thread Lennart Poettering
On Mo, 21.02.22 14:14, Felip Moll (lip...@gmail.com) wrote:

> Hello,
>
> I am creating a software which consists of one daemon which forks several
> processes from user requests.
> This is basically acting like a job scheduler.
>
> The daemon is started using a unit file and with Delegate=yes option,
> because every process must be constrained differently. I manage my cgroup
> hierarchy, create some leaves into the tree and put each pid there.
> For example, after starting up the service and receiving 3 user requests, a
> tree under /sys/fs/cgroup/system.slice/ could look like:
>
> sgamba1.service/
> ├── daemon_pid
> ├── user1_stuff
> ├── user2_stuff
> └── user3_stuff
>
> I create the hierarchy and set cgroup.subtree_control in the root directory
> (sgamba1.service in the example) and everything runs smoothly, until when I
> decide to restart my service.
>
> The service then cannot restart:
>
> feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed to attach to
> cgroup /system.slice/sgamba1.service: Device or resource busy
> feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed at step
> CGROUP spawning /path_to_bin/mydaemond: Device or resource busy
>
> This is because systemd tries to put the pid of the new daemon in
> sgamba1.service/cgroup.procs and this would break the "no internal process
> constrain" rule for cgroup v2, since sgamba1.service is not a leaf anymore
> because it has subtree_control enabled for the user stuff.
>
> One hard requirement is that user stuff must live even if the service is
> restarted.

Hmm? Hard requirement of what? Not following?

You are leaving processes around when your service dies/restarts?
That's typically a bad idea, and generally a hack: the unit should
probably be split up differently, i.e. the processes that shall stick
around on restart should probably be in their own unit, i.e. another
service or scope unit.
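
A rough sketch of that split, with hypothetical names: spawn the
per-user payload as its own transient unit instead of forking it off
the daemon, e.g.

    systemd-run --unit=user1-stuff -p Delegate=yes /usr/bin/user1-payload

(or the equivalent StartTransientUnit() D-Bus call issued by the daemon
itself), so that it survives a restart of the daemon's service.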

> What's the way to achieve that? I see one easy way, which is to move user
> stuff into its own cgroup and out of sgamba1.service/, but then it will run
> outside a Delegate=yes unit. What can happen then?
> Will systemd eventually migrate my processes?
> How do services workaround that issue?
> If I am moving user stuff into the root /sys/fs/cgroup/user_stuff/, will
> systemd touch my directories?

That's not supported. You may only create your own cgroups where you
turned on delegation, otherwise all bets are off. If you put stuff in
/sys/fs/cgroup/user_stuff it's as if you placed stuff in systemd's
"-.slice" without telling it so, and things will break sooner or
later, and often in non-obvious ways.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [RFC] systemd-resolved: Send d-bus signal after DNS resolution

2022-02-16 Thread Lennart Poettering
On Mi, 16.02.22 12:13, Dave Howorth (syst...@howorth.org.uk) wrote:

> > This could be used by applications for auditing/logging services
> > downstream of the resolver, or to update the firewall on the system.
>
> Perhaps an example use case would help but I'm not clear how a DNS
> resolution would usefully cause a state change in the firewall without
> some further external guidance?

Yeah, I am not sure I grok the relationship to firewalls here,
either. Updating firewalls asynchronously based on DNS lookups sounds
wrong to me...

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Restart=on-failure and SuccessAction=reboot-force causing reboots on every exit of Main PID

2022-02-16 Thread Lennart Poettering
On Mi, 16.02.22 11:45, Michał Rudowicz (michal.rudow...@fl9.eu) wrote:

> Hi,
>
> I am trying to write a .service file for an application which is supposed to
> run indefinitely. The approach I have is:
>
>  - if the application crashes (exits with a non-zero exit code), I want
>it to be restarted. This can be achieved easily using the Restart
>directive, like Restart=on-failure in the [Service] section.

>  - if the application exits cleanly (with a zero exit code), I want the
>whole device to reboot (eg. because a software update was done). I've
>found the SuccessAction directive, which - after being set to reboot or
>reboot-force in the [Unit] section - seems to do what I want.
>
> The issue I have is: when I use both SuccessAction=reboot-force and
> Restart=on-failure in one .service file, the system reboots when I kill
> the Main PID of the service (causing non-clean exit for testing).

Can you provide a minimal .service file that shows the issue? Smells
like a bug. SuccessAction= should not be triggered if a service process
exits with SIGKILL...

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [RFC] systemd-resolved: Send d-bus signal after DNS resolution

2022-02-16 Thread Lennart Poettering
On Di, 15.02.22 22:37, Suraj Krishnan (sura...@microsoft.com) wrote:

> Hello,
>
> I'm reaching out to the community to gather feedback about a feature
> to broadcast a d-bus signal notification from systemd-resolved when
> a DNS query is completed. The message would contain information
> about the query and IP addresses received from the DNS server.

Broadcasting this on the system bus sounds like a bit too heavy. I am
sure there are setups which will resolve a *lot* of names in a very
short time, and you'd flood the bus with that. D-Bus is expressly not
built for streaming more than control data, but if you have a flood of
DNS requests it becomes substantially more than that.

Also, given that in 99.9% of all cases the broadcast messages would
just be dropped by the broker because nothing is listening, this sounds
needlessly expensive.

What would make sense is adding a Varlink interface for this,
however. resolved uses Varlink anyway, so it could just build on
that. Varlink has the benefit that no broker is involved: if no one is
listening we wouldn't do anything and not have to pay for it. Moreover
Varlink has no issues with streaming large amounts of data. And it's
easy to secure to ensure nobody unprivileged will see this (simply by
making the socket have a restrictive access mode).

So yes, I think adding the concept makes a ton of sense. But not via
D-Bus, but via Varlink. Would love to review/merge a patch that adds
that and then exposes this via "resolvectl monitor" or so.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Q: Perform action for reboots happen too frequently?

2022-02-16 Thread Lennart Poettering
On Mi, 16.02.22 14:09, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote:

> Hi!
>
> I wonder: Is it possible with systemd to detect multiple reboots
> within a specific time interval, and react, like executing some
> systemctl command (that is expected to "improve" things)?  With
> "reboots" I'm mainly thinking of unexpected boots, not so much the
> "shutdown -r" commands, but things like external resets, kernel
> panics, watchdog timeouts, etc.

The information why a system was reset is hard to get. Some watchdog
hw will tell you if it was recently triggered, and some expensive hw
might log serious errors to the acpi pstore stuff, but it's all icky
and lacks integration.

A safe approach would probably be to instead track boots where the root
fs or so is in a dirty state. Pretty much all of today's file systems
track that.
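
For example (device name hypothetical), ext4 records exactly this in
its superblock, to be inspected before the fs is mounted:

    # tune2fs -l /dev/sda2 | grep -i 'filesystem state'
    Filesystem state:         clean

i.e. anything other than "clean" here suggests the previous boot did
not shut down properly.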

It sounds like an OK feature to add to the systemd root fs mount logic:
check for the dirty flag before mounting, and then use that as a
hint to boot into a target other than default.target that could then
apply further policy (and maybe then go on to enter default.target).

TLDR: nope, this does not exist yet, but parts of it sound like
worthwhile feature additions to systemd.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Proposal to extend os-release/machine-info with field PREFER_HARDENED_CONFIG

2022-02-16 Thread Lennart Poettering
On Di, 15.02.22 19:05, Stefan Schröder (ste...@tokonoma.de) wrote:

> Situation:
>
> Many packages in a distribution ship with a default configuration
> that is not considered 'secure'.

Do they? What does "secure" mean? If there's a security vulnerability,
maybe talk to the distro about that? They should be interested...

> Hardening guidelines are available for all major distributions. Each is a 
> little different.
> Many configuration suggestions are common-sense among security-conscious 
> administrators, who have to apply more secure configuration using some 
> automation framework after installation.
>
> PROPOSAL
>
> os-release or machine-info should be amended with a field
>
>   PREFER_HARDENED_CONFIG
>
> If the value is '1' or 'True' or 'yes' a package manager can opt to
> configure an alternative, more secure default configuration (if
> avaialble).

I am not sure what "hardening" means, sounds awfully vague to me. I
mean, i'd assume that a well meaning distro would lock everything down
as much as they can *by*default*, except if this comes at too high a
price on performance or maintainance or so. But how is a single
boolean going to address that?

If security was as easy as just turning on a boolean, then why would
anyone want to set that to false?

> According to the 'Securing Debian Manual' [1] the login configuration is 
> configured as
> auth   optional   pam_faildelay.so  delay=300
> whereas
> delay=1000
> would provide a more secure default.
>
> The package 'login' contains the file /etc/pam.d/login. If dpkg (or
> apt or rpm or pacman or whatever) detected that os-release asks for
> secure defaults, the alternative /etc/pam.d/login.harden could be
> made the default. (This file doesn't exist yet, the details are left
> to the packaging infrastructure or package maintainer.)

If the default settings are too low, why not raise them for everyone?

I must say, I am very sure that the primary focus should always be on
locking things down as well as we can for *everyone* and by
*default*. Making security an opt-in sounds like a systematically
wrong approach. If specific security tech cannot be enabled by
default, then work should be done to make it something that can be
enabled by default. And if that's not possible then it apparently
comes at some price, but a simple config boolean somewhere can't
decide whether that price is worth it...

So, quite frankly, I am not convinced this is desirable.

That said, you can extend machine-info with anything you like, it's
supposed to be extensible. But please make sure you prefix the
variables with some prefix that makes collisions unlikely.
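
E.g., something like this in /etc/machine-info (the variable name and
prefix are hypothetical, the prefix being the point):

    PRETTY_HOSTNAME=server1
    FOOCORP_PREFER_HARDENED_CONFIG=yes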

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Passive vs Active targets

2022-02-15 Thread Lennart Poettering
On Di, 15.02.22 09:14, Kenneth Porter (sh...@sewingwitch.com) wrote:

> Given that interfaces can come and go, does network.target imply that all
> possible interfaces are up?

No, totally and absolutely not. It's only very vaguely defined what
reaching network.target at boot actually means. Usually not more than
"some network managing daemon is running now" and nothing else. It
explicitly does not say anything about network interfaces being up or
down or even existing.

Things are more interesting during shutdown here: because in systemd
shutdown order is the reverse of the startup order it means that stuff
ordered After=network.target will be shut down *before* the network is
shut down, so they should still be able to send "goodbye" packets if
they want, before the network goes away.

Note that there's another target: network-online.target which is an
active target that is supposed to encapsulate the point where the
machine is actually "online". But given that this can mean a myriad of
different things (local interface up vs. ping works vs. DNS reachable,
vs. DHCP acquired, …) it's also pretty vaguely defined, and ultimately
must be filled with meaning by the admin.
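
For a service that genuinely needs the network to be up, the documented
pattern is hence:

    [Unit]
    Wants=network-online.target
    After=network-online.target

at the price of a potentially slower boot.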

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Passive vs Active targets

2022-02-15 Thread Lennart Poettering
On Di, 15.02.22 17:30, Thomas HUMMEL (thomas.hum...@pasteur.fr) wrote:

> > > Also, it seems that there is more than one way to pull in a passive
> > > dependency (or maybe several providers which can "publish" it). Like for
> > > instance network-pre.target which is pulled in by both nftables.service
> > > and/or rdma-ndd.service.
> >
> > nftables.service should pull it in and order itself before it, if it
> > intends to set up the firewall before the first network iterface is
> > configured.
>
> It makes sense but I'm still a bit confused here : I thought that a unit
> which pulled a passive target in was conceptually "publishing it" for *other
> units* to sync After= or Before= it but not to use it itself. What you're
> saying here seems to imply that nftables.services uses itself the passive
> target it "publishes".

A passive unit is a sync point that should be pulled in by the service
that actually needs it to operate correctly. Hence: ask the question whether
networkd/NetworkManager will only operate correctly if nftables
finished start-up before it. I think the answer is a clear "no". But
the opposite holds, i.e. nftables only operates as a safe firewall if
it is run *before* networkd/NM start up. Thus it should be nftables
that pulls network-pre.target in, not networkd/NM, because it matters
to nftables, and it doesn't to networkd/NM.
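
Sketched as unit directives, nftables.service would thus carry:

    [Unit]
    Wants=network-pre.target
    Before=network-pre.target

while networkd/NM just order themselves After=network-pre.target
without pulling it in.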

> Or maybe it is the other way around : by pulling it *and* knowing that
> network interface is configured After= nftable.service is guaranteed to set
> up its firewall before any interface gets configured.

So yeah, passive units are mostly about synchronization, i.e. if they
are pulled in they should have units on both sides, otherwise they
make no sense.

> > not sure what rdma-ndd does, can't comment on that.
>
> My point was more : is it legit for 2 supposedly different units to pull in
> the same passive target ?

Yes. If there are multiple services that really want to be started before
some other set of services are started, then the passive target is a
great way to generically put a synchronization point between them. It
can be any set of services before, and any set of services after it.

> Anyway both point above seem to confirm that one cannot take for granted
> that some passive target will be pulled in, correct ? So before ordering
> around it one can make sure some unit pulls the checkpoint ?

Yeah, that's the idea: passive units are mostly synchronization
points that allow loose coupling for ordering things: for generically
ordering stuff before and after them without actually listing the
services explicitly on either side.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Passive vs Active targets

2022-02-15 Thread Lennart Poettering
On Di, 15.02.22 08:46, Kenneth Porter (sh...@sewingwitch.com) wrote:

> --On Tuesday, February 15, 2022 11:52 AM +0100 Lennart Poettering
>  wrote:
>
> > Yes, rsyslog.service should definitely not pull in network.target. (I
> > am not sure why a syslog implementation would bother with
> > network.target at all, neither Wants= nor After= really makes any
> > sense. i.e. if people want rsyslog style logging but it would be
> > delayed until after network is up then this would mean they could
> > never debug DHCP leases or so with it, which sounds totally backwards
> > to me. But then again, I am not an rsyslog person, so maybe I
> > misunderstand what it is supposed to do.)
>
> Presumably it's to run the log server for other hosts. (I use it to log for
> my routers and IoT devices.)

Well, I presumed people actually ran it because they want traditional
/var/log/messages files or so. But if that's not the case...

> On RHEL/CentOS, rsyslog uses imjournal to read the systemd journal so
> presumably DHCP debugging messages would be retrieved from that.

Sure, but that means if people use /var/log/messages to look for DHCP
issues they won't see anything until DHCP was actually acquired, which
makes it useless for debugging DHCP...

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Passive vs Active targets

2022-02-15 Thread Lennart Poettering
On Mo, 31.01.22 20:13, Thomas HUMMEL (thomas.hum...@pasteur.fr) wrote:

> Hello,
>
> I'm successully using systemd with some non trivial (for me!) unit
> dependencies including some performing:
>
>   custom local disk formatting and mounting at boot
>   additionnal nics configuration by running postscripts fetched from the
> network
>   Infiniband initialisation
>   NFS remote mounts
>   Infiniband remote mounts
>   HPC scheduler and its side services activation
>
> and I've read
> https://www.freedesktop.org/software/systemd/man/systemd.special.html
>
> Still I do not fully (or at all ?) understand the concept of passive vs
> active targets and some related points:
>
> The link above states :
>
> "Note specifically that these passive target units are generally not pulled
> in by the consumer of a service, but by the provider of the service. This
> means: a consuming service should order itself after these targets (as
> appropriate), but not pull it in. A providing service should order itself
> before these targets (as appropriate) and pull it in (via a Wants= type
> dependency)."
>
> And also :
>
> "Note that these passive units cannot be started manually, i.e. "systemctl
> start time-sync.target" will fail with an error. They can only be pulled in
> by dependency."
>
> Since my first look at a passive dependency was network.target which I
> indeed saw was pulled in by NetworkManager.service which ordered itself
> Before it and which I compared with the active network-online.target which
> pulls in the NetworkManager-wait-online.service I first deduced the
> following:
>
> a) a passive target "does" nothing and serves only as an ordering checkpoint
> b) an active target "does" actually something

Yes, you could see it that way.

> I thought that a passive target could be seen as "published" by the
> corresponding provider
> But this does not seems as simple as that:
>
> For one I see on my system that rsyslog.service also pulls in network.target
> (but orders itself After it and thus does not seem to be the actual
> "publisher" of it, as opposed to NetworkManager.service)

There might very well be wrong users of this, that use this the wrong
way around. We do not systematically review other people's unit files,
and hence there might be a lot of issues lurking.

Yes, rsyslog.service should definitely not pull in network.target. (I
am not sure why a syslog implementation would bother with
network.target at all, neither Wants= nor After= really makes any
sense. i.e. if people want rsyslog style logging but it would be
delayed until after network is up then this would mean they could
never debug DHCP leases or so with it, which sounds totally backwards
to me. But then again, I am not an rsyslog person, so maybe I
misunderstand what it is supposed to do.)

Anyway, it would be excellent to file a bug against the rsyslog
package and ask it to drop the deps.

> Then rpcbind.target seems to auto pull itself so without the Before ordering
> we see in the NetworkManager.service pulling network.target example

Can't parse this.

> Also, it seems that there is more than one way to pull in a passive
> dependency (or maybe several providers which can "publish" it). Like for
> instance network-pre.target which is pulled in by both nftables.service
> and/or rdma-ndd.service.

nftables.service should pull it in and order itself before it, if it
intends to set up the firewall before the first network interface is
configured.

not sure what rdma-ndd does, can't comment on that.

> Finally, my understanding is some passive targets are not to be taken for
> granted, i.e. they may not be pulled in at all and it is to the user to
> check whether that actually is the case if he wants to order a unit against it. I'm
> not talking here about obvious targets we don't have because out of our
> scope (like not having remote mounts related targets if system is purely
> local) but some we could think we have but maybe not. For instance on my
> system I see remote-fs-pre.target pulled in by nfs-client.target, but would
> remote-fs-pre.target be pulled in (by whom?) if I had only Infiniband
> remote mounts?

remote-fs-pre.target should be pulled in by whoever wants to run
*before* any remote mounts (i.e. do Wants= + Before= on it). The remote
mounts should only order themselves *after* it, but not pull it in.

> So my question would revolve around the above points
>
> Can you help me figuring out the correct way to see those concepts ?

I think you mostly got things right but the services you listed are
simply buggy.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] mdmon@md127 is stopped early

2022-02-14 Thread Lennart Poettering
On Fr, 11.02.22 13:50, Mariusz Tkaczyk (mariusz.tkac...@linux.intel.com) wrote:

> > Otherwise there might simply be another program that explicitly tells
> > systemd to shut this stuff down, i.e. some script or so. Turn on debug
> > logging (systemd-analyze log-level debug) before shutting down, the
> > logs should tell you a thing or two about why the service is stopped.
> >
>
> That is ridiculous when I enabled debug logging by command provided, it
> is not killed:

A heisenbug. Usually some race then. i.e. the extra debug logging
probably slows down relevant processes long enough so that others can
catch up that previously couldn't.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Failed to add PIDs to scope's control group: No such process

2022-02-10 Thread Lennart Poettering
On Do, 03.02.22 18:39, Gena Makhomed (g...@csdoc.com) wrote:

> Hello, All!
>
> Periodically I see in /var/log/messages error messages about
>
> Failed to add PIDs to scope's control group: No such process
> Failed with result 'resources'.
>
> How can I resolve or work around this error?

It usually just means that the process that makes up a user session
dies more quickly than we have time to actually set up the session for
it. It's not really a problem, mostly just noise.

If this is reproducible on current upstream systemd versions, please
file a bug upstream and we can look into fixing this. But fixing would
mostly entail just downgrading the log level in this case, i.e. just
cosmetically suppressing the noisy logging about this case.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Need a systemd unit example that checks /etc/fstab for modification and sends a text message

2022-02-10 Thread Lennart Poettering
On Di, 08.02.22 17:27, Tony Rodriguez (unixpro1...@gmail.com) wrote:

> From my understanding, it is frowned upon by systemd developers to
> "automatically" reload systemd via "systemctl daemon-reload" when /etc/fstab
> is modified.  Guessing this is why such functionality hasn't already been
> incorporated into systemd.  However, I would like to send a simple text
> message instructing users to manually invoke "systemctl daemon-reload"
> after modifying /etc/fstab, via the dmesg command or inside /var/log/messages.
>
> Unsure how to do so inside a systemd UNIT.  Will someone please provide an
> example how to do so?

At least Fedora puts a comment about this in /etc/fstab, explaining
the situation. That sounds a lot more appropriate to me than
making this appear in the logs...

You can use a PathModified= .path unit for this if you like.
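
A minimal sketch (unit names hypothetical; a .path unit activates the
service of the same name by default):

    # fstab-watch.path
    [Path]
    PathModified=/etc/fstab

    [Install]
    WantedBy=multi-user.target

    # fstab-watch.service
    [Service]
    Type=oneshot
    ExecStart=/usr/bin/logger -p user.warning "/etc/fstab changed, remember to run: systemctl daemon-reload"

logger(1) writes to syslog, so the note ends up in the journal and in
/var/log/messages.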

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-journald namespace persistence

2022-02-10 Thread Lennart Poettering
On Mi, 09.02.22 10:18, Roger James (ro...@beardandsandals.co.uk) wrote:

> How do I create a persistent systemd-journald namespace?
>
> I have a backup service that is run by a systemd timer. I would like that to
> use its own namespace. I can create the namespace manually using systemctl
> start systemd-journald@mynamespace.service. However I cannot find a way to
> do that successfully at boot time. I have tried a RequiredBy and a Requires
> in the timer unit but neither seem to work.

Not sure I follow? The journald instance sockets should get
auto-activated as a dependency of your backup service, implicitly, as a
LogNamespace= side effect. There should be no need to run it all the
time.
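
I.e. the backup service just declares the namespace (names
hypothetical) and the rest is automatic:

    # backup.service
    [Service]
    LogNamespace=mynamespace
    ExecStart=/usr/local/bin/backup

    # read it back later with: journalctl --namespace=mynamespace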

The socket units come with StopWhenUnneeded=yes set, so they
automatically go away if no service needs them.

Why would you want to run those services continuously?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Strange behavior of socket activation units

2022-02-10 Thread Lennart Poettering
On Do, 10.02.22 10:09, Tuukka Pasanen (pasanen.tuu...@gmail.com) wrote:

> Hello,
>
> Thank you for your sharp answer. Is there any way to debug what launches
> these sockets and makes socket activation?

These log messages look like they are generated client-side. Hence,
figure out where they come from.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] mdmon@md127 is stopped early

2022-02-10 Thread Lennart Poettering
On Mi, 09.02.22 17:16, Mariusz Tkaczyk (mariusz.tkac...@linux.intel.com) wrote:

> It is probably wrong, but it worked this way for many years:
> "Again: if your code is being run from the root file system, then this
> logic suggested above is NOT for you. Sorry. Talk to us, we can
> probably help you to find a different solution to your problem."[3]
>
> How can I block the service from being stopped? In initramfs there is a
> mdmon restart procedure, for example in dracut[4]. I need to save
> mdmon process from being stopped.
>
> I will try to adapt our implementation to your[3] suggestions but it is
> longer topic, I want to workaround the issue first.
>
> [1]https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
> [2]https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/systemd/mdmon@.service

So with that unit systemd shouldn't stop the service at all, given
that you set DefaultDependencies=no.

It would be good to figure out why it is stopped anyway, i.e. check
with "systemctl show" on the unit what kind of requirement/conflict
deps there are which might explain it.

Otherwise there might simply be another program that explicitly tells
systemd to shut this stuff down, i.e. some script or so. Turn on debug
logging (systemd-analyze log-level debug) before shutting down, the
logs should tell you a thing or two about why the service is stopped.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Run "ipmitool power cycle" after lib/systemd/system-shutdown scripts

2022-02-10 Thread Lennart Poettering
On Mi, 09.02.22 22:05, Etienne Champetier (champetier.etie...@gmail.com) wrote:

> Hello systemd hackers,
>
> After flashing the firmware of some pcie card I need to power cycle
> the server to finish the flashing process.
> For now I have a simple script in lib/systemd/system-shutdown/ running
> "ipmitool power cycle" but I would like to make sure it runs after
> other scripts like fwupd.shutdown or mdadm.shutdown
>
> Is there any way to have systemd cleanly power cycle my server instead
> of rebooting it ?

What does "power cycle" entail that "reboot" doesnt? i.e. why doesn't
"systemctl reboot" suffice?

/usr/lib/systemd/system-shutdown/ drop-ins are executed before the OS
transitions back into the initrd — the initrd will then detach the
root fs (i.e. undo what it attached at boot) and actually reboot. This
means if your command turns off the power source you should stick it
in the initrd's shutdown logic, and not into
/usr/lib/systemd/system-shutdown/. If you are using RHEL this means
into dracut. But adding it there is something to better discuss with
the dracut community than here.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd.sockets vs xinetd

2022-02-10 Thread Lennart Poettering
On Do, 10.02.22 08:41, Yolo von BNANA (y...@bnana.de) wrote:

> Hello,
>
> i read the following in an LPIC 1 Book:
>
> "
> If you’ve done any investigation into systemd.sockets, you may believe that 
> it makes super servers like xinetd obsolete. At this point in time, that is 
> not true. The xinetd super server offers more functionality than 
> systemd.sockets can currently deliver.
> "
>
> I thought, that this information could be deprecated.
>
> Is systemd.sockets at this point in time a good replacement for xinetd?

xinetd supports various things systemd does not:

- tcp_wrappers support
- implementation of various internal mini-servers, such as RFC868 time
  server and so on
- SUN RPC support
- configurable access times
- precise configuration of generated log message contents
- stream redirection

and a couple of other minor things. The first 3 of these are outright
obsolete I am sure. We don't implement them for that reason.

Instead of configurable access times we allow you to start/stop the
socket units individually any time, and you could bind that to a clock
or anything else really, it's up to you. I think systemd's logic is
vastly more powerful there. For stream redirection we have
systemd-socket-proxyd, which should be at least as good, but is not
implemented in the socket unit logic itself, but as an auxiliary
service.
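
For reference, the xinetd-style per-connection mode sketched as units
(names and port hypothetical):

    # mini.socket
    [Socket]
    ListenStream=7777
    Accept=yes

    [Install]
    WantedBy=sockets.target

    # mini@.service - one instance per connection, stdio wired to the socket
    [Service]
    ExecStart=/usr/local/bin/handle-connection
    StandardInput=socket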

So yes, it does some stuff we don't. Are there some people who want
those things? I guess there are. But I am also sure that they are
either obsolete if you look at the bigger picture, or that there are
better ways to do them, which we do support.

Or to say this differently: it has been years since anyone filed an RFE
bug on systemd GitHub asking for a feature from xinetd that we lack.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Strange behavior of socket activation units

2022-02-07 Thread Lennart Poettering
On Mo, 07.02.22 09:01, Tuukka Pasanen (pasanen.tuu...@gmail.com) wrote:

> Hello,
> I have encountered this kind of problem. This particularly is MariaDB 10.6
> installation to Debian.
> Nearly every application that is restarted/started dies on:
>
>
> Failed to get properties: Unit name mariadb-extra@.socket is missing the
> instance name.
> Failed to get properties: Unit name mariadb@.socket is missing the instance
> name.
>
>
> This is kind of strange as those are for multi target version and should not
> be launched if I don't have mariadb@something.service added or make
> specified systemctl enable mariadb@something.socket which I haven't.
> They don't show up in dependencies of mariadb.service or other service file?

My educated guess is that some script specific to mariadb in Debian
assumes that you only run templated instances of mariadb, and if you
don't, things break. Please work with the mariadb people at Debian to
figure this out, there's nothing much we can do from systemd upstream
about that.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] OnCalendar weekday range syntax

2022-02-04 Thread Lennart Poettering
On Fr, 04.02.22 06:23, Kenneth Porter (sh...@sewingwitch.com) wrote:

> <https://www.freedesktop.org/software/systemd/man/systemd.time.html>
>
> Shows a range of weekdays separated by two dots:
>
> Mon..Fri
>
> When I use this on CentOS 7.9.2009, systemd-219-78.el7_9.5.x86_64, I get
> this error from systemd-analyze verify:

Probably not a good idea to use the documentation of a current systemd
version from 2022 with a systemd version from 2015. A lot changes in
7 (!) years.

".." is the official way to denote weekday ranges; "-" is accepted for
compat, but not documented.
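
I.e. on a current version both of these parse, but only the first form
is documented:

    OnCalendar=Mon..Fri 18:00:00
    OnCalendar=Mon-Fri 18:00:00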

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Udevd and dev file creation

2022-02-01 Thread Lennart Poettering
On Di, 01.02.22 16:04, Nishant Nayan (nayan.nishant2...@gmail.com) wrote:

> One thought
> Is it advisable to turn off systemd-udevd if I am sure that I won't be
> adding /removing any devices to my server.

Note that there are various synthetic/virtual devices that are created
on app request, i.e. lvm, loopback, network devices and such. We live
in a dynamic and virtual world, where devices come and go all the
time.

Moreover there are various devices that send out uevents for
change notifications of various forms. If you turn off udev, apps won't
get those either. I.e. udev is about more than just plug + unplug.

If you stop udev, apps waiting for their devices to show up won't
ever be able to get the ready notifications for them and thus will stop
working.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd killing processes on monitor wakeup?

2022-01-31 Thread Lennart Poettering
On Mo, 31.01.22 09:47, Raman Gupta (rocketra...@gmail.com) wrote:

> > Honestly this just sounds like systemd killing "leftover" processes within
> > the plasma-plasmashell cgroup, after the "main" process of that service has
> > exited. That's not a bug; that's standard behavior for systemd services.
> >
>
> What determines whether a process becomes part of the plasma-plasmashell
> cgroup or not? When I run plasmashell independently of systemd, processes
> do indeed start as child processes of plasmashell. I'm guessing this
> implies that when plasmashell is run under systemd, all these processes
> become part of the cgroup, and this is why systemd "cleans up" all these
> child processes after a plasmashell crash?

I don't know plasma, but generally: whatever is forked off from a
process is part of the same cgroup as the process — unless it is
explicitly moved away. Moving things away explicitly is typically done
by PID 1 or the per-user instance of systemd, on explicit API
request.

So, I don't know if plasma calls into systemd at all. If it doesn't, all
its children will be part of the same cgroup as plasma. If it otoh
does IPC calls to systemd in some form and tells it to fork/take over
the processes, then they might end up in a different cgroup however.

>
> It's also interesting to me that many applications *do not* exit in this
> scenario -- Slack Desktop exits about 50% of the time, and IDEA exits
> pretty consistently. Most other apps remain running. Not sure why that
> would be -- if systemd is cleaning up, shouldn't all apps exit?

"systemd-cgls" should give you a hint which cgroups exists and which
processes remain children of plasma inside its cgroup, and which ones
got their own cgroup.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Launching script that needs network before suspend

2022-01-31 Thread Lennart Poettering
On So, 23.01.22 22:13, Tomáš Hnyk (tomash...@gmail.com) wrote:

> Hello,
> I have my computer hooked up to an AVR that runs my home cinema and ideally
> I would like the computer to turn off the AVR when I turn it off or suspend
> it. The only way to do this is over network and I wrote a simple script that
> does just that. Hooking it to shutdown was quite easy using network.target
> that is defined when shutting down.
>
>
> I am struggling to make it work with suspend though. When I look at the
> logs, terminating network seems to be the first thing that happens when
> suspend is invoked.

That shouldn't happen. Normally networking shouldn't be shut down
during suspend. If your network management solution does this
explicitly, I am puzzled, why it would do that.

> I tried putting the script in
> /usr/lib/systemd/system-sleep/ and it runs, but only after network is down,
> so it fails. Running the script with systemd-inhibit
> (ExecStart=/usr/bin/systemd-inhibit --what=sleep my_script) tells me that
> "Failed to inhibit: The operation inhibition has been requested for is
> already running".

Inhibitors are designed to be taken by long-running apps, and not by
the stuff that is run when you are already suspending. I.e. it's too
late then: if you want to temporarily stall suspending to do your
stuff, these scripts are invoked only after it was already decided to
actually go into suspend now.

> Is there a way to make this work with service files by specifying that the
> script needs to be run before network is shut down or would I need to run a
> daemon listening for PrepareForSleep as here:
> https://github.com/davidn/av/blob/master/av ?

Usually that's what you do, yes: you take an inhibitor lock while you
are running, and wait until you are informed about system suspend,
then you do your thing, and release the lock once you are done at
which point the suspend continues.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Translating --machine parameter to a service file

2022-01-31 Thread Lennart Poettering
On Di, 25.01.22 13:04, Tomáš Hnyk (tomash...@gmail.com) wrote:

> Hello,
> I want to run a script invoked by udev to run a pactl script. I am now using
> a udev rule SUBSYSTEM=="drm", ACTION=="change",
> RUN+="/usr/local/bin/my_script"
>
> which calls (drew is my username):
> systemctl --machine=drew@.host --user --now my.service
>
>
> which has:
> [Service]
> Type=oneshot
> ExecStart=/usr/local/bin/my_script.py
>
> and in the my_script.py, I do what I need. I cannot call my_script.py
> directly
> from the udev rule because, if I understood it correctly, scripts triggered
> by udev run in a limited environment and pactl runs as user.
>
> Needless to say, this feels rather hackish. Ideally I would use something
> like TAG+="systemd",  ENV{SYSTEMD_WANTS}="my.service"
>
> I can specify "User=" in the service file but I could not figure out
> to translate the --machine=drew@.host parameter to it.

This is not supported. Containers run in their own little world, and
generally get their own devices (i.e. just virtual devices such as
/dev/null and similar), hence we do not have infra to propagate events
to containers.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Udevd and dev file creation

2022-01-31 Thread Lennart Poettering
On So, 30.01.22 17:14, Nishant Nayan (nayan.nishant2...@gmail.com) wrote:

> I have started reading about udevd.
> I was trying to find out if there is a way to play with udev without
> plugging in/out any devices.
> Is there a way to trigger a uevent without plugging in devices?

use "udevadm trigger" to fire uevents for existing devices.

Or create new, synthetic virtual devices during runtime, for example
via "losetup".

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] sd_bus_process() + sd_bus_wait() is it not suitable for application?

2022-01-28 Thread Lennart Poettering
On Sa, 22.01.22 14:08, www (ouyangxua...@163.com) wrote:

> Dear all,
>
>
> When using sd_bus_process() + sd_bus_wait() to implement the
> application (service), calling the method functions on the service obtains the
> correct information. But running it a certain number of times leads to
> insufficient memory; a memleak does occur.
>
>
> It should not be a problem with the D-Bus method itself, because a single call does
> not increase memory; the method needs to be called 65 ~ 70 times before you
> see the memory increase. After stopping the calls, the memory does not
> decrease. It seems that it has nothing to do with the time interval at which the
> method is called.
>
>
> code implementation:
> int main()
> {
> ..
> r = sd_bus_open_system(&bus);
> ...
> r = sd_bus_add_object_vtable(bus, ..);
> ..
> r = sd_bus_request_name(bus, "xxx.xx.xx.xxx");
> ..
>
>
> for( ; ; )
> {
> r = sd_bus_process(bus, NULL);
> ...
> r = sd_bus_wait(bus, -1);
> ..
> }
> sd_bus_slot_unref(slot);
> sd_bus_unref(bus);
> }

Maybe the callback handlers you added in the vtable keep some objects
pinned?

Also note that unreffing the bus in the end is typically not enough,
if it still has messages queued. Use sd_bus_flush() + sd_bus_close()
first (or combine them in one sd_bus_flush_close_unref()).

Otherwise it might happen that messages still not flushed out at the
end remain pinned.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Initial system date and time set by systemd

2022-01-03 Thread Lennart Poettering
On Mo, 03.01.22 13:13, Sergei Poselenov (sposele...@emcraft.com) wrote:

> Where systemd takes this date "Tue 2020-07-14 19:19:34 UTC"?

It's configurable via the "time-epoch" meson variable at build
time. If you don't specify it, the SOURCE_DATE_EPOCH env var is used
(defined by the reproducible build folks). If that's not set, then
it's the date of the latest git tag in the history of your git
checkout, and if the sources didn't come via git but as tarball or so,
it's the timestamp of the NEWS file.

Or in other words: it's typically the build time of the package, or
the time the release of systemd was done.
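
A hedged sketch of pinning it explicitly at build time:

    meson setup build -Dtime-epoch=1640995200
    # or, following the reproducible-builds convention:
    SOURCE_DATE_EPOCH=1640995200 meson setup build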

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-12-20 Thread Lennart Poettering
On Fr, 10.12.21 12:25, Chris Murphy (li...@colorremedies.com) wrote:

> On Thu, Nov 11, 2021 at 12:28 PM Lennart Poettering
>  wrote:
>
> > That said: naked squashfs sucks. Always wrap your squashfs in a GPT
> > wrapper to make things self-descriptive.
>
> Do you mean the image file contains a GPT, and the squashfs is a
> partition within the image? Does this recommendation apply to any
> image? Let's say it's a Btrfs image. And in the context of this
> thread, the GPT partition type GUID would be the "super-root" GUID?

Yes, I'd always add a GPT wrapper around disk images. It's simple,
extensible and first and foremost self-descriptive: you know what you
are looking at, safely, before parsing the fs. It opens the door for
adding verity data in a very natural way, and more.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Q: When will WorkingDirectory be checked?

2021-12-20 Thread Lennart Poettering
On Fr, 17.12.21 08:11, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote:

> Hi!
>
> I have a simple question: When will WorkingDirectory be checked?
> Specifically: Will it be checked before ExecStartPre? I could not
> get it from the manual page.

What do you mean by "checked"?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Predictable Network Interface Name Bug?

2021-12-16 Thread Lennart Poettering
On Mi, 15.12.21 21:37, Tim Safe (timsafeem...@gmail.com) wrote:

> Hello-
>
> I have an Ubuntu Server 20.04 (systemd 245 (245.4-4ubuntu3.13)) box that I
> recently installed a Intel quad-port Gigabit ethernet adapter (E1G44ETBLK).
>
> It appears that the predictable interface naming is only renaming the first
> two interfaces (ens8f0, ens8f1) and the second two fail to be renamed
> (eth2, eth3).

Consider updating your systemd version to a newer version (or ask your
distro to backport the relevant patches). The predictable network
interface naming received a number of tweaks since 245, and it's
pretty likely this has since been fixed. Specifically, the
NAMING_SLOT_FUNCTION_ID feature flag introduced with v249 will likely
fix your case.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [RFC] Switching to OpenSSL 3?

2021-11-23 Thread Lennart Poettering
On Di, 23.11.21 11:53, Dimitri John Ledkov (dimitri.led...@canonical.com) wrote:

> Just an update from Ubuntu - for the upcoming release of Jammy (22.04
> LTS targeting release in April 2022) we have started transition to
> OpenSSL 3 and currently upgrading to systemd v249.

Did Ubuntu adopt Debian's stance of accepting OpenSSL as system
component? i.e. is OpenSSL 3 compatible with both (L)GPL 2.x code
*and* GPL3 code in Ubuntu's eyes? Or only the latter?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] give unprivileged nspawn container write access to host wayland socket

2021-11-23 Thread Lennart Poettering
On Mo, 22.11.21 16:02, Nozz (n...@protonmail.com) wrote:

> I recently moved to pure wayland, I want to run a graphical
> application in an unprivileged container (user namespace isolation).
> The application needs write access to the wayland socket on the host
> side. What's the best way to achieve this?  I've been able to do
> this if I map the host UID/GID range using --private-users=0:65536
> but then there is no namespace isolation. Also I would have to map
> the same range to every container and documentation states it's bad
> security wise to have it overlap.

Well, if you run n containers and all n have the same UID/GID mapping,
then of course they can access/change each other's resources, should they
be able to see them. That might or might not be OK.

In the upcoming 250 release nspawn bind mounts are changed (if a
kernel with uidmap support in the fs layer is available that is) so
that bind mounts placed in the kernel are optionally idmapped,
i.e. that host UID 0 is mapped to container UID 0 for such bind
mounts, instead of "nobody". That should make what you are trying to
do pretty easy, as you can mount individual inodes and make them appear
under their original ownership.
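
With that in place, something like this should work (paths
hypothetical, using the v250 --bind= option syntax):

    systemd-nspawn -D /var/lib/machines/foo --private-users=pick \
        --bind=/run/user/1000:/run/host-wayland:idmap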

We might want to extend this later on: when bind mounting
non-directory inodes (such as sockets) we could even allow fixing
ownership to any uid of your choice, to give you full freedom there.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Networking in a systemd-nspawn container

2021-11-19 Thread Lennart Poettering
On Fr, 22.10.21 19:54, Tobias Hunger (tobias.hun...@gmail.com) wrote:

> Hello Systemd Mailing List!
>
> I have a laptop and run a couple of systemd-nspawn containers on that
> machine. This works great, except that name resolution inside the
> containers fails whenever the network on the outside changes.
>
> This is not too surprising: At setup time the resolver information is
> copied into the containers and never updated. That is sub-optimal for
> my laptop that I keep moving between networks.
>
> I have been wondering: Would it be possible to forward the containers'
> resolver to the host machine resolver somehow?
>
> Could e.g. systemd-nspawn optionally make the hosts resolver available
> in the containers network namespace? Maybe by setting up some port
> forwarding or by putting a socket into the container somewhere?
>
> Any ideas? I can do some of the work with a bit of guidance.

You could use DNSStubListenerExtra= in resolved.conf to make the stub
listen on some additional container-facing IP address, or multiple of
them. But it's not pretty, as it requires manual configuration.
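
I.e. something like this on the host (the address is hypothetical, the
host side of the veth link), with the containers' /etc/resolv.conf then
pointing at it:

    # /etc/systemd/resolved.conf.d/container-dns.conf
    [Resolve]
    DNSStubListenerExtra=192.168.77.1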

Two ideas I recently thought about:

1. Maybe resolved's "stub" logic should support listening on yet
   another local IP address: 127.0.0.54 or so, where the same stub
   listens as on 127.0.0.53, but where we unconditionally enable
   "bypass" mode. This mode in resolved means that we'll not process
   the messages ourselves very much, but just look at the domains
   mentioned in it for routing info and then pass the lookup upstream
   and its answer back almost unmodified. (we'd also not consider the
   lookups for mdns/llmnr and such). Right now we only enable that mode
   if we encounter traffic we otherwise don't understand. Thus, if you
   use that other IP address you can use resolved basically as a proxy
   towards whatever the current DNS server is, nothing else. (though
   we'd still translate classic UDP and TCP DNS to DoT if configured)

2. Then, teach nspawn to optionally set up nftables/iptables NAT so
   that port 53 of some veth tunnel IP of the host is automatically
   NAT'ed to 127.0.0.54:53.

That way you then get what you are looking for, as you could then
advertise the host-side IP address of your veth tunnel as DNS server
unconditionally, and the right thing would happen.

(I figure wifi tethering applications could make use of this too?)

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-19 Thread Lennart Poettering
On Do, 18.11.21 15:01, Chris Murphy (li...@colorremedies.com) wrote:

> On Thu, Nov 18, 2021 at 2:51 PM Chris Murphy  wrote:
> >
> > How to do swapfiles?
> >
> > Currently I'm creating a "swap" subvolume in the top-level of the file
> > system and /etc/fstab looks like this
> >
> > UUID=$FSUUID/var/swap   btrfs   noatime,subvol=swap 0 0
> > /var/swap/swapfile1 none swap defaults 0 0
> >
> > This seems to work reliably after hundreds of boots.
> >
> > a. Is this naming convention for the subvolume adequate? Seems like it
> > can just be "swap" because the GPT method is just a single partition
> > type GUID that's shared by multiboot Linux setups, i.e. not arch or
> > distro specific
> > b. Is the mount point, /var/swap, OK?
> > c. What should the additional naming convention be for the swapfile
> > itself so swapon happens automatically?
>
> Actually I'm thinking of something different suddenly... because
> without user ownership of swapfiles, and instead systemd having domain
> over this, it's perhaps more like:
>
> /x-systemd.auto/swap -> /run/systemd/swap

I'd be conservative with mounting disk stuff to /run/. We do this for
removable disks because the mount points are kinda dynamic, hence it
makes sense, but for this case it sounds unnecessary, /var/swap sounds
fine to me, in particular as the /var/ partition actually sounds like
the right place to it if /var/swap/ is not a mount point in itself but
just a plain subdir.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-19 Thread Lennart Poettering
On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote:

> How to do swapfiles?

Is this really a concept that deserves too much attention? I mean, I
have the suspicion that half the benefit of swap space is that it can
act as backing store for hibernation. But swap files are icky for that
since that means the resume code has to mount the fs first, but given
the fs is dirty during the hibernation state this is highly problematic.

Hence, I have the suspicion that if you do swap you should probably do
swap partitions, not swap files, because they can cover all use cases:
paging *and* hibernation.

> Currently I'm creating a "swap" subvolume in the top-level of the file
> system and /etc/fstab looks like this
>
> UUID=$FSUUID/var/swap   btrfs   noatime,subvol=swap 0 0
> /var/swap/swapfile1 none swap defaults 0 0
>
> This seems to work reliably after hundreds of boots.
>
> a. Is this naming convention for the subvolume adequate? Seems like it
> can just be "swap" because the GPT method is just a single partition
> type GUID that's shared by multiboot Linux setups, i.e. not arch or
> distro specific

I'd still put it one level down, and mark it with some non-typical
character so that it is less likely to clash with anything else.

> b. Is the mount point, /var/swap, OK?

I see no reason why not.

> c. What should the additional naming convention be for the swapfile
> itself so swapon happens automatically?

To me it appears these things should be distinct: if automatic
activation of swap files is desirable, then there should probably be a
systemd generator that finds all suitable files in /var/swap/ and
generates .swap units for them. This would then work with any kind of
setup, i.e. independently of the btrfs auto-discovery stuff. The other
thing would be the btrfs auto-discovery to then actually mount
something there automatically.

> Also, instead of /@auto/ I'm wondering if we could have
> /x-systemd.auto/ ? This makes it more clearly systemd's namespace, and
> while I'm a big fan of the @ symbol for typographic history reasons,
> it's being used in the subvolume/snapshot regimes rather haphazardly
> for different purposes which might be confusing? e.g. Timeshift
> expects subvolumes it manages to be prefixed with @. Meanwhile SUSE
> uses @ for its (visible) root subvolume in which everything else goes.
> And still ZFS uses @ for their (read-only) snapshots.

I try to keep the "systemd" name out of entirely generic specs, since
there are some people who have an issue with that. i.e. this way we
tricked even Devuan to adopt /etc/os-release and the /run/ hierarchy,
since they probably aren't even aware that these are systemd things.

Other chars could be used too: /+auto/ sounds OK to me, or
/_auto/, or /=auto/ or so.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] hardware conditional OS boot/load

2021-11-19 Thread Lennart Poettering
On Do, 18.11.21 19:16, lejeczek (pelj...@yahoo.co.uk) wrote:

> Hi guys.
>
> I hope an expert(or two) could shed some light - I ain't a kernel nor
> hardware expert so go easy on me please - on whether it is possible to boot a
> system only under certain conditions, meaning: as early as possible (grub?)
> and similarly securely, Linux checks for certain hardware, eg. CPU serial
> no. and continue to load only when such conditions are met?
>
> I realize that perhaps kernel devel be the place for such questions but seen
> I'm subscriber here, knowing people here are experts of same caliber, I
> decided to ask.

You can certainly hack something up like this, but to my knowledge
none of the boot loaders currently implement something like this.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] How to get array[struct type] using sd_bus_message_* API's

2021-11-19 Thread Lennart Poettering
On Fr, 19.11.21 12:31, Manojkiran Eda (manojkiran@gmail.com) wrote:

> In the `busctl monitor` i could confirm that i am getting a message of
> signature a{sas} from the dbus call, and here is the logic that I could
> come up with to read the data.
>
> r = sd_bus_message_enter_container(reply, SD_BUS_TYPE_ARRAY, "{sas}");
> if (r < 0)
> goto exit;
> while ((r = sd_bus_message_enter_container(reply,
> SD_BUS_TYPE_DICT_ENTRY,
>"sas")) > 0)
> {
> char* service = NULL;
> r = sd_bus_message_read(reply, "s", );
> if (r < 0)
> goto exit;
> printf("service = %s\n", service);
> r = sd_bus_message_enter_container(reply, 'a', "s");
> if (r < 0)
> goto exit;
> for (;;)
> {
> const char* s;
> r = sd_bus_message_read(reply, "s", );
> if (r < 0)
> goto exit;
> if (r == 0)
> break;
> printf("%s\n", s);
> }
> }
>
> Output:
> service = xyz.openbmc_project.EntityManager
> org.freedesktop.DBus.Introspectable
> org.freedesktop.DBus.Peer
> org.freedesktop.DBus.Properties
>
> But, I was only able to get the data from the first dictionary entry. Can anyone
> help me solve this issue? What am I missing?

You always need to leave each container again once you read its
contents. I.e. each sd_bus_message_enter_container(…) must be paired
with sd_bus_message_exit_container(…).
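
Sketched onto your loop (error handling elided):

    while ((r = sd_bus_message_enter_container(reply, SD_BUS_TYPE_DICT_ENTRY, "sas")) > 0) {
            /* ... read the "s" ... */
            /* ... enter and drain the "as" array ... */
            r = sd_bus_message_exit_container(reply);  /* leave the "as" array */
            r = sd_bus_message_exit_container(reply);  /* leave the dict entry */
    }
    r = sd_bus_message_exit_container(reply);          /* leave the outer array */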

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] How to build a unified kernel for aarch64?

2021-11-12 Thread Lennart Poettering
On Fr, 12.11.21 00:00, Zameer Manji (zma...@gmail.com) wrote:

> I have noticed there exists a systemd-stub for aarch64, the file
> is located at `/usr/lib/systemd/boot/efi/linuxaa64.efi.stub`.
>
> The manpage instructs users to use `objcopy` to build the unified
> kernel from the stub. However when I use objcopy I get the following
> error:
>
> ```
> objcopy: /usr/lib/systemd/boot/efi/linuxaa64.efi.stub: file format not
> recognized
> ```
> I also noticed the GNU binutils bug tracker indicates that PE executables on
> ARM are not yet supported for objcopy [0].
>
> How do systemd developers build the unified kernel on aarch64? Is there
> an alternative toolchain used?
>
> [0]: https://sourceware.org/bugzilla/show_bug.cgi?id=26206

I personally never played around with this for anything
non-x86-64. But I wonder, maybe llvm-objcopy supports this for aarch64?
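
Untested sketch, transplanting the objcopy invocation from the
systemd-stub documentation (the section VMAs are the x86-64 ones and
may need adjusting for aarch64; assumes llvm-objcopy accepts the same
flags as GNU objcopy):

    llvm-objcopy \
        --add-section .osrel=os-release   --change-section-vma .osrel=0x20000 \
        --add-section .cmdline=cmdline    --change-section-vma .cmdline=0x30000 \
        --add-section .linux=vmlinuz      --change-section-vma .linux=0x2000000 \
        --add-section .initrd=initrd.img  --change-section-vma .initrd=0x3000000 \
        linuxaa64.efi.stub linux-unified.efi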

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-11 Thread Lennart Poettering
On Do, 11.11.21 18:27, Lennart Poettering (mzerq...@0pointer.de) wrote:

> A patch for that should be pretty easy to do, and be very generically
> useful. I kinda like it. What do you think?

For now I added TODO list items for these ideas:

https://github.com/systemd/systemd/commit/af11e0ef843c19cbf8ccaefb93a44dbe4602f7a8#diff-337e547a950fc8a98592f10d964c1e79a304961790a8da0ce449a1f000cefabb

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-11 Thread Lennart Poettering
On Mi, 10.11.21 10:34, Topi Miettinen (toiwo...@gmail.com) wrote:

> > Doing this RootDirectory= would make a ton of sense too I guess, but
> > it's not as obvious there: we'd need to extend the setting a bit I
> > think to explicitly enable this logic. As opposed to the RootImage=
> > case (where the logic should be default on) I think any such logic for
> > RootDirectory= should be opt-in for security reasons because we cannot
> > safely detect environments where this logic is desirable and discern
> > them from those where it isn't. In RootImage= we can bind this to the
> > right GPT partition type being used to mark root file systems that are
> > arranged for this kind of setup. But in RootDirectory= we have no
> > concept like that and the stuff inside the image is (unlike a GPT
> > partition table) clearly untrusted territory, if you follow what I am
> > babbling.
>
> My images don't have GPT partition tables, they are just raw squashfs file
> systems. So I'd prefer a way to identify the version either by contents of
> the image (/@auto/ directory), or something external, like name of the image
> (/path/to/image/foo.version-X.Y). Either option would be easy to implement
> when generating the image or directory.

Hmm, so thinking about this again, I think we might get away with a
check "/@auto/ exists and /usr/ does not". i.e. the second part of the
check removes any ambiguity: since we unified the OS in /usr it's an
excellent way to check if something is or could be an OS tree.

That said: naked squashfs sucks. Always wrap your squashfs in a GPT
wrapper to make things self-descriptive.

> But if you have several RootDirectories or RootImages available for a
> service, what would be the way to tell which ones should be tried if there's
> no GPT? They can't all have the same name. I think using a specifier (like
> %q) would solve this issue nicely and there wouldn't be a need for /@auto/
> in that case.

A specifier is resolved at unit file load time only. It wouldn#t be the
right fit here, since we don#t want to require that the paths
specified in RootDirectory=/RootImage= are already accessible at the
time PID 1 reads/parses the unit file.

What about this: we could entirely independently of the proposal
originally discussed here teach RootDirectory= + RootImage= one magic
trick: if the path specified ends in ".auto.d/" (or so) then we'll not
actually use the dir/image as-is but assume the path refers to a
directory, and we'd pick the newest entry inside it as decided by
strverscmp().

Or in other words, we'd establish the general rule that dirs ending in
".auto.d/" contains versioned resources inside, that we could apply
here and everywhere else where it fits, too.

of course intrdocuing this rule would be kind of a compat breakage
because if anyone happened to have named their dirs like that already
we'd suddenly do weird stuff with it the user might not expect. But I
think I could live with that.

A patch for that should be pretty easy to do, and be very generically
useful. I kinda like it. What do you think?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-09 Thread Lennart Poettering
On Di, 09.11.21 19:48, Topi Miettinen (toiwo...@gmail.com) wrote:

> > i.e. we'd drop the counting suffix.
>
> Could we have this automatic versioning scheme extended also to service
> RootImages & RootDirectories as well? If the automatic versioning was also
> extended to services, we could have A/B testing also for RootImages with
> automatic fallback to last known good working version.

At least in the case of RootImage= this was my implied assumption:
we'd implement the same there, since that uses the exact same code as
systemd-nspawn's image dissection and we definitely want it there.

Doing this RootDirectory= would make a ton of sense too I guess, but
it's not as obvious there: we'd need to extend the setting a bit I
think to explicitly enable this logic. As opposed to the RootImage=
case (where the logic should be default on) I think any such logic for
RootDirectory= should be opt-in for security reasons because we cannot
safely detect environments where this logic is desirable and discern
them from those where it isn't. In RootImage= we can bind this to the
right GPT partition type being used to mark root file systems that are
arranged for this kind of setup. But in RootDirectory= we have no
concept like that and the stuff inside the image is (unlike a GPT
partition table) clearly untrusted territory, if you follow what I am
babbling.

Or in other words: to enable this for RootDirectory= we probably need
a new option RootDirectoryVersioned= or so that takes a boolean.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-09 Thread Lennart Poettering
On Di, 09.11.21 14:48, Ludwig Nussel (ludwig.nus...@suse.de) wrote:

> > and so on. Until boot succeeds in which case we'd rename it:
> >
> >/@auto/root-x86-64:fedora_36.0
> >
> > i.e. we'd drop the counting suffix.
>
> Thanks for the explanation and pointer!
>
> Need to think aloud a bit :-)
>
> That method basically works for systems with read-only root. Ie where
> the next OS to boot is in a separate snapshot, eg MicroOS.
> A traditional system with rw / on btrfs would stay on the same subvolume
> though. Ie the "root-x86-64:fedora_36.0" volume in the example. In
> openSUSE package installation automatically leads to ro snapshot
> creation. In order to fit in I suppose those could then be named eg.
> "root-x86-64:fedora_36.N+0" with increasing N. Due to the +0 the
> subvolume would never be booted.
>
> Anyway, let's assume the ro case and both efi partition and btrfs volume
> use this scheme. That means each time some packages are updated we get a
> new subvolume. After reboot the initrd in the efi partition would try to
> boot that new subvolume. If it reaches systemd-bless-boot.service the
> new subvolume becomes the default for the future.
>
> So far so good. What if I discover later that something went wrong
> though? Some convenience tooling to mark the current version bad again
> would be needed.

In the sd-boot/kernel case any time you like you can rename an entry
to "…+0" to mark it as "bad", you could drop the suffix to mark it as
"good" or you could mark it as "+3" to mark it as
"dont-know/try-again".

Now, at least in theory we could declare the same for this new
directory auto-discovery scheme. But I am not entirely sure this will
work out trivially IRL because I have the suspicion one cannot rename
subvolumes which are the source of a bind mount (i.e. once you boot
into one root subtree, then it might be impossible to rename that
top-level inode without rebooting first). Would be something to try
out. If it doesn't work it might suffice to move things one level
down, i.e. that the dir that actually becomes root is
/@auto/root-x86-64:fedora_36.0/payload/ or so, instead of just
/@auto/root-x86-64:fedora_36.0/. I think that that would work, and
might be desirable anyway so that the enumeration of entries doesn't
already leak fs attributes/ownership/access modes/…  of actual root
fs.

> But then having Tumbleweed in mind it needs some capability to boot any
> old snapshot anyway. I guess the solution here would be to just always
> generate a bootloader entry, independent of whether a kernel was
> included in an update. Each entry would then have to specify kernel,
> initrd and the root subvolume to use.
> This approach would work with a separate usr volume also. In that case
> kernel, initrd, root and usr volume need to be linked by means of a
> bootloader entry.

For the GPT case if you want to bind a kernel together with a specific
root fs, you'd do this by specifying 'root=PARTLABEL=fooos_0.3' on the
kernel cmdline. I'd take inspiration from that and maybe introduce
'rootentry=fedora_36.2' or so which would then be honoured by the logic
we are discussing here, and would hard override which subdir to use,
regardless of versioning preference, assesment counting and so on.

(Yeah, the subvol= mount option for btrfs would work too, but as
mentioned I'd keep this reasonably independent of btrfs where its
easy, plain dirs otherwise are fine too after all. Which reminds me,
recent util-linux implements the X-mount.subdir= mount option, which
means one could also use 'rootflags=X-mount.subdir=@auto/fedora_36.2'
as non-btrfs-specific way to express the btrfs-specific
'rootflags=subvol=@auto/fedora_36.2')

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-08 Thread Lennart Poettering
On Mo, 08.11.21 14:24, Ludwig Nussel (ludwig.nus...@suse.de) wrote:

> Lennart Poettering wrote:
> > [...]
> > 3. Inside the "@auto" dir of the "super-root" fs, have dirs named
> >[:]. The type should have a similar vocubulary
> >as the GPT spec type UUIDs, but probably use textual identifiers
> >rater than UUIDs, simply because naming dirs by uuids is
> >weird. Examples:
> >
> >/@auto/root-x86-64:fedora_36.0/
> >/@auto/root-x86-64:fedora_36.1/
> >/@auto/root-x86-64:fedora_37.1/
> >/@auto/home/
> >/@auto/srv/
> >/@auto/tmp/
> >
> >Which would be assembled by the initrd into the following via bind
> >mounts:
> >
> >/ → /@auto/root-x86-64:fedora_37.1/
> >/home/→ /@auto/home/
> >/srv/ → /@auto/srv/
> >/var/tmp/ → /@auto/tmp/
> >
> > If we do this, then we should also leave the door open so that maybe
> > ostree can be hooked up with this, i.e. if we allow the dirs in
> > /@auto/ to actually be symlinks, then they could put their ostree
> > checkotus wherever they want and then create a symlink
> > /@auto/root-x86-64:myostreeos pointing to it, and their image would be
> > spec conformant: we'd boot into that automatically, and so would
> > nspawn and similar things. Thus they could switch their default OS to
> > boot into without patching kernel cmdlines or such, simply by updating
> > that symlink, and vanille systemd would know how to rearrange things.
>
> MicroOS has a similar situation. It edits /etc/fstab.

microoos is a suse thing?

> Anyway in the above example I guess if you install some updates you'd
> get eg root-x86-64:fedora_37.2, .3, .4 etc?

Well, the spec wouldn't mandate that. But yeah, the idea is that you
could do it like that if you want. What's important is to define the
vocabulary to make this easy and possible, but of course, whether
people follow such an update scheme is up to them. I mean, it's the
same as with the GPT auto discovery logic: it already implements such
a versioning scheme because its easy to implement, but if you don't
want to take benefit of the versioning, then don't, it's fine
regardless. the logic we'd define here is about *consuming* available
OS root filesystems, not about *installing* them, after all.

The GPT auto-discovery thing basically does an strverscmp() on the
full GPT partition label string, i.e. it does not attempt to split a
name from a version, but assumes strverscmp() will handle a common
prefix nicely anyway. I'd do it the exact same way here: if there are
multiple options, then pick the newest as per strverscmp(), but that
also means it's totally fine to not version your stuff and instead of
calling it "root-x86-64:fedora_37.3" could could also just name it
"root-x86-64:fedora" if you like, and then not have any versioning.

> I suppose the autodetection is meant to boot the one sorted last. What
> if that one turns out to be bad though? How to express rollback in that
> model?

Besides the GPT auto-discovery where versioning is implemented the way
I mentioned, there's also the sd-boot boot loader which does roughly
the same kind of OS versioning with the boot entries it discovers. So
right now, you can already chose whether:

1. you want to do OS versioning on the boot loader entry level: name
   your EFI binary fooos-0.1.efi (or fooos-0.1.conf, as defined by the
   boot loader spec) and similar and the boot loader automatically
   picks it up, makes sense of it and boots the newest version
   installed.

2. you want to do OS versioning on the GPT partition table level: name
   your partitions "fooos-0.1" and similar, with the right GPT type,
   and tools such as systemd-nspawn, systemd-dissect, portable
   services, RootImage= in service unit files all will be able to
   automatically pick the newest version of the OS among the ones in
   the image.

and now:

3. If we implement what I proprose above then you could do OS version
   on the file system level too.

(Or you could do a combination of the above, if you want — which is
highly desirable I think in case you want a universal image that can
boot on bare metal and in nspawn in a nice versioned way.)

Now, in sd-boot's versioning logic we implement an automatic boot
assesment logic on top of the OS versioning: if you add a "+x-y"
string into the boot entry name we use it as x=tries-left and
y=tries-done counters. i.e. fooos-0.1+3-0.efi is semantically the same
as fooos-0.1.efi, except that there are 3 attempts left and 0 done
yet. On each boot attempt the boot loader decreases x and increases
y. i.e. fooos-0.1+3-0.efi → fooos-0.1+2-1.efi → fooos-0.1+1-2.efi →
fooos-0.1+0-3.efi. If a boot succeeds the two counters are dropped
from the filename, i

Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-04 Thread Lennart Poettering
e could do
   its thing in some other subdir of the root fs if it wants to)

3. Inside the "@auto" dir of the "super-root" fs, have dirs named
   [:]. The type should have a similar vocubulary
   as the GPT spec type UUIDs, but probably use textual identifiers
   rater than UUIDs, simply because naming dirs by uuids is
   weird. Examples:

   /@auto/root-x86-64:fedora_36.0/
   /@auto/root-x86-64:fedora_36.1/
   /@auto/root-x86-64:fedora_37.1/
   /@auto/home/
   /@auto/srv/
   /@auto/tmp/

   Which would be assembled by the initrd into the following via bind
   mounts:

   / → /@auto/root-x86-64:fedora_37.1/
   /home/→ /@auto/home/
   /srv/ → /@auto/srv/
   /var/tmp/ → /@auto/tmp/

If we do this, then we should also leave the door open so that maybe
ostree can be hooked up with this, i.e. if we allow the dirs in
/@auto/ to actually be symlinks, then they could put their ostree
checkotus wherever they want and then create a symlink
/@auto/root-x86-64:myostreeos pointing to it, and their image would be
spec conformant: we'd boot into that automatically, and so would
nspawn and similar things. Thus they could switch their default OS to
boot into without patching kernel cmdlines or such, simply by updating
that symlink, and vanille systemd would know how to rearrange things.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [EXT] Question about timestamps in the USER_RECORD spec

2021-10-28 Thread Lennart Poettering
On Do, 28.10.21 11:46, Arian van Putten (arian.vanput...@gmail.com) wrote:

> Indeed it mentions it; but after careful reading there is no normative
> suggestion to actually adhere to it. (no SHOULD and definitely not a MUST,
> not even a RECOMMENDED).
>
> They just say that to increase interoperability no more than 53 bits of
> integer precision should be assumed without making a clear normative
> decision about it.   The only normative part in that section is that
> numbers consist of an integer part and a fractional part.
>
> They also say that implementations are allowed to set any limits on the
> range and precision of numbers accepted.
>
> So yeah Lennart seems to be technically correct. Even when reading the RFC
> by the letter.

BTW:

https://github.com/systemd/systemd/pull/21168

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Question about timestamps in the USER_RECORD spec

2021-10-26 Thread Lennart Poettering
On Di, 26.10.21 10:41, Arian van Putten (arian.vanput...@gmail.com) wrote:

> Hey list,
>
> I'm reading the https://systemd.io/USER_RECORD/ spec and I have a question
>
> There are some fields in the USER_RECORD spec which are described as
> "unsigned 64 bit integer values".   Specifically the fields describing
> time.
>
> However JSON lacks integers and only has doubles [0]; which would mean 53
> bit integer precision is about the maximum we can reach.

The spec itself doesn't really mandate this is implemented in
double, the spec just says "sticking to doubles would be nice".

Actual implementations implement this differently IRL. Python based
implementations have arbitrary precision for this. sd-bus uses
uint64_t, or int64_t or long double. It prefers the integer types if
the value fits, and uses the floating point type otherwise. json-glibc
uses int64_t or double.

There are plenty of specs that rely that 64bit integers work with full
range (OCI for example, for much of its resource management stuff).

> It's unclear to me from the spec whether I should use doubles to
> encode these fields or use strings.  Would it be possible to further
> clarify it?  If it is indeed a number literal; this means the
> maximum date we can encode is 9007199254740991 which corresponds to
> Tuesday, June 5, 2255 . This honestly is too soon in the future for
> my comfort.

You appear to plan for quite a long life ;-)

Frankly, this is not a problem specific to user records. A multitude
of JSON formats tend to store dates this way. The overflow is still
200y out. I think that leaves plenty time to teach implementations
full 64bit support, and I am pretty sure that the ones that will deal
with user records will catch up sooner or later.

> I suggest encoding 64
> bit integers as string literals instead to avoid the truncation
> problem.

I am sorry, but I am not convinced this is a pressing issue. I value
cleanliness and obviousness a lot more than theoretic issues that
might happen 200 years from now. In particular as they are issues that
can be dealt with in offending JSON implementations, and limitations
in the parser implementations shouldn't really leak in to the spec I think.

I mean, at this point it isn't even clear humanity will survive that
long, and I seriously doubt that systemd is the one project that
survives humanity.

What might make sense is to add a comment about the whole situation to
the spec and be done with it:

 "Please not that this specification assumes that JSON numbers may
 cover the full integer range of -2^63 … 2^64-1 without loss of
 accuracy (i.e. INT64_MIN … UINT64_MAX). Please read, write and
 process user records following this specification only with JSON
 implementations that guarantee this range."

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] A questions about modules-load service in systemd

2021-10-25 Thread Lennart Poettering
On Sa, 23.10.21 02:27, Joakim Zhang (qiangqing.zh...@nxp.com) wrote:

> > It doesn't do that actually. But udev when it loads kernel modules does 
> > things
> > from a bunch of worker processes all in parallel.
>
> Ok, is there a way to disable this parallel tasks in systemd-udev
> service?

udev.children_max=1 on the kernel command line.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] loose thoughts around portable services

2021-10-22 Thread Lennart Poettering
On Mi, 20.10.21 16:01, Umut Tezduyar Lindskog (u...@tezduyar.com) wrote:

> > That said: systemd's nss-systemd NSS module can nowadays (v249) read
> > user definitions from drop-in JSON fragments in
> > /run/host/userdb/. This is is used by nspawn's --bind-user= feature to
> > make a host user easily available in a container, with group info,
> > password and so on. My plan was to also make use of this in the unit
> > executor, i.e. so that whenever RootDirectory=/RootImage= are used the
> > service manager places such minimal user info for the selected user
> > there, so that the user is perfectly resolvable inside the service
> > too. This is particularly relevant for DynamicUser=1 services. I
> > haven't come around actually implementing that though. Given
> > nss-systemd is enabled in most bigger distro's nssswitch.conf file
> > these days I think this is a really nice approach to propagate user
> > databases like that.
> >
>
> Why don't we also make the varlink user API available to most of the
> profiles? This way sandboxed service doesn't need any of the nss conf and
> libraries if they don't want to. Most profiles allow dbus communication. I
> guess in a similar thought, most system services should be able to do a
> user lookup in a modern way.

I sympathize with the idea, but I am not entirely sure this is
desirable to do this 1:1, as this means we'd leak a ton of stuff that
might only make sense on the host into something that is supposed to
be an isolated container. i.e. home dir info and things like
that. shell paths and so on.

Maybe we can find a middle ground on this though. i.e. we could make
systemd-userdb.service listen on a new varlink service socket that
provides the host's database to sandboxed environments in a restricted
form, i.e. with basically all records dumbed down to just contain
uid/gid/name info and nothing else.

We'd then update the portabled profiles that do not use PrivateUsers=
to bind mount that one socket, so that they get the full db,
dynamically.

I kinda like the idea.

> We could implement our own profiles without needing nesting but we believe
> it is beneficial to collaborate on profiles upstream and have common
> additions to upstream profiles with nesting other profiles. If we get to it
> before other people, we would really like to contribute and send a patch on
> this.

A patch adding .d/ style drop-ins for profiles would make a ton of
sense. Happy to take that.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] A questions about modules-load service in systemd

2021-10-22 Thread Lennart Poettering
On Fr, 22.10.21 10:31, Joakim Zhang (qiangqing.zh...@nxp.com) wrote:

>
> Hi systemd experts,
>
> I saw you guys did much contributions in modules-load part recently, I have a 
> questions, some insight you input would be appreciated, thanks in advance.
>
> Do you know how to load all modules in a single task? In other
> words, load all modules within a single task as I want they process
> sequentially.

Are you sure you mean "systemd-modules-load"? Most module loading
happens via udev, not systemd-modules-load. That service is only
required for a few select modules that do not support auto-loading.

udev loads all modules as the hw they are for shows up. And no there's
no way to make that sequential.

Why do you need this? For debugging purposes? To work around a broken driver?

> If I understand correctly, systemd-modules-load service now will
> fork many tasks to process different kernel modules parallelly.

It doesn't do that actually. But udev when it loads kernel modules
does things from a bunch of worker processes all in parallel.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] loose thoughts around portable services

2021-10-18 Thread Lennart Poettering
On Mi, 13.10.21 13:38, Umut Tezduyar Lindskog (umut.tezdu...@axis.com) wrote:

> Hi, we have been playing around more with the portable services and
> lots of loose thoughts came up. Hopefully we can initiate
> discussions.
>
> The PrivateUsers and DynamicUsers are turned off for the trusted
> profile in portable services but none of the passwd/group and nss
> files are mapped to the sandbox by default essentially preventing
> the sandbox to do a user look up. Is this a use case that should be
> offered by the “trusted” profile or should this be handled by the
> services that would like to do a look-up?

The "trusted" profile basically means you dealt with that
synchronization yourself in some way.

That said: systemd's nss-systemd NSS module can nowadays (v249) read
user definitions from drop-in JSON fragments in
/run/host/userdb/. This is is used by nspawn's --bind-user= feature to
make a host user easily available in a container, with group info,
password and so on. My plan was to also make use of this in the unit
executor, i.e. so that whenever RootDirectory=/RootImage= are used the
service manager places such minimal user info for the selected user
there, so that the user is perfectly resolvable inside the service
too. This is particularly relevant for DynamicUser=1 services. I
haven't come around actually implementing that though. Given
nss-systemd is enabled in most bigger distro's nssswitch.conf file
these days I think this is a really nice approach to propagate user
databases like that.

> Is there a way to have PrivateUsers=yes and map more host users to
> the sandbox? We have dynamic, uid based authorization on dbus
> methods. Up on receiving a method, the server checks the sender uid
> against a set of rule files.

I guess we could add BindUser= or so, which could control the
/run/host/userdb/ propagation I proposed above.

> Would it benefit others if the “profile” support was moved out of
> the portable services and be part of the unit files? For example
> part of the [Install] section.

Right now profiles are a concept of portabled, not of the service
manager. There's a github issue somewhere where people asked us to
make this generically usable from services too, so I guess you are not
the only one who'd like someting like that.

> Has there been any thought about nesting profiles? Example, one
> profile can include other profiles in it.

File an RFE issue. I guess we could support that for any profile x
we'd implicitly also pull in x.d/*.conf, or so.

> Systemd analyze security is great! We believe it would be easier to
> audit if we had a way to compare a service file’s sandboxing
> directives against a profile and find the delta. Then score the
> service file against delta.

Interesting idea.

Current git has all kinds of JSON hookup for systemd-analyze security
btw, so tools could do that externally too. But you are right, doing
this implicitly might indeed make sense. Please file an RFE issue on
github.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [systemd‑devel] Removing bold fonts from boot messages

2021-10-14 Thread Lennart Poettering
On Mi, 13.10.21 18:29, Frank Steiner (fsteiner-ma...@bio.ifi.lmu.de) wrote:

> Ulrich Windl wrote:
>
> > Stupid question: If you see bold face at the end of the serial line, 
> > wouldn't
> > changing the terminal type ($TERM) do?
> > Maybe construct your own terminal capabilities.
>
> I'd need a TERM that has colors but disallows bold fonts. For some
> reason I wasn't even able to construct a terminfo that would disallow
> colors when using that $TERM inside xterm (and starting a new bash).
> It seems that xterm always has certain capabilities, i.e. "ls --color"
> is always showing colors in xterm, also with TERM=xterm-mono and
> everything else I tried.
>
> Anway, settings a translation to bind "allow-bold-fonts(toggle)"
> to a key in xterm resources allows to block bold fonts whenever
> watching systemd boot messages via ipmi or AMT in a xterm...

Note that systemd doesn't care about terminfo/termcap or anything like
that. We only support exactly three types of terminals:

1. TERM=dumb → you get no ANSI sequences, no fancy emojis or other
   non-ASCII unicode chars, no clickable links.

2. TERM=linux → you do get ANSI sequences, but no fancy emojis, but
   some simpler known-safe unicode chars (TERM=linux is the Linux
   console/VT subsystem), no clickable links.

3. everything else → you get ANSI sequences, fancy emojis, fancy
   unicode chars, clickable links.

And that's really it. It's 2021 and so far this was unproblematic. The
ANSI sequences we use aren't crazy exotic stuff but pretty much
baseline and virtually any terminal from the last 25 years probably
supports them.

You can turn these features off individually, too.

SYSTEMD_COLORS=0 → no ANSI colors sequences (alternatively: "NO_COLOR=1" as 
per https://no-color.org/)

SYSTEMD_EMOJI=0 → no unicode emojis

LC_CTYPE=ANSI_X3.4-1968 → no non-ASCII chars (which also means no emojis)

SYSTEMD_URLIFY=0 → no clickable links

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] troubleshooting Clevis

2021-10-12 Thread Lennart Poettering
On Di, 12.10.21 16:17, lejeczek (pelj...@yahoo.co.uk) wrote:

> > > I have 'clevis' set to get luks pin from 'tang' but unlock does not happen
> > > at/during boot time and I wonder if someone can share thoughts on how to
> > > investigate that?
> > > I cannot see anything obvious fail during boot, moreover, manual
> > > 'clevis-luks-unlock' works no problems.
> > This is the systemd mailing list, not the clevis/tang mailing
> > list. Please contact the clevis/tang community instead.
>
> May ask of any possible plans where systemd would, somehow similarly to
> 'tpm', utilize 'tang'(or similar) technique to unlock luks encrypted
> devices?

You mean that networked unlock feature? I mean, it's not always clear
what belongs and systemd and what does not. But outside of data
centers I am not sure tang/clevis really has much use, and that's
quite a limited userbase, so I'd say: no this should be done outside
of systemd. Maybe a plugin for libcryptsetup's "token" feature.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Removing bold fonts from boot messages

2021-10-12 Thread Lennart Poettering
On Di, 12.10.21 12:09, Frank Steiner (fsteiner-ma...@bio.ifi.lmu.de) wrote:

> Hi,
>
> after upgrading from SLES 15 SP2 (systemd 2.34) to SP3 (systemd 2.46)
> the boot messages are not only colored (which I like for seeing failures
> in red) but partially printed in bold face. This makes messages indeed
> harder to read on serial console (with amt or ipmi), so I wonder if
> there is a place where the ascii sequences for colors and font faces
> are defined and can be adjusted?

Sounds like in an graphics issue in your terminal emulator, no?

> Or is there some option to remove the bold face only, but not the colors?
> systemd.log_color=0 removes all formatting, but I'd like to keep the
> colors...

No, this is not configurable. We are not a themeable desktop, sorry.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Tempering the Logging Data when Knowing the Verification Key / Time Synchronization

2021-10-11 Thread Lennart Poettering
On Mo, 11.10.21 17:08, Andreas Krueger (andreas.krue...@fmc-ag.com) wrote:

> Hi Folks,
>
>
> I am currently working in an embedded project that uses Journal for logging. 
> The logging data shall be protected by the Journal's sealing mechanism FSS 
> and for various reasons the verification key is located unprotected in memory.
>
> Regarding this constellation, my first question is that:
>
> If an attacker knows the verification key, is he able to modify the
> logging data in such a way that its tempering remains undetected,
> even if this has happened e.g. one day ago (which means that several
> new sealing keys has been generated in the meantime) ?

Yes, the verification key should be kept secret. (The text output when
it is generated should make this very clear, actually.)

if you don't keep it secret, then all bets are off, the construction
of the underlying cryptography does not work then.

> Since sealing is always done for a time interval, my second question is that:
>
> What will happen to the logging data and sealing mechanism when the
> system clock is suddenly modified? This can e.g. happen, when the
> board starts first with a default time value and then synchronized
> after a while by a time daemon.

the sealing key is "evolved" based on time (which means a new key is
generated from the old and the old one is securely deleted). When time
jumps forward, then this scheme automatically keeps up, and if needed
will evolve a number of steps at once, as necessary.

If time jumps backwards things are more problematic though: the key
appropriate for the old time has already been generated likely, and
while a newer key can be derived from the old an older cannot be
derived from the new (this fact is after all the whole point of the
excercise).

For cases like this it might make sense to ensure that flushing of the
journal to disk (i.e. systemd-journald-flush.service) is scheduled
after correct time has been acquired (i.e. time-sync.target).

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] dm-integrity volume with TPM key?

2021-10-11 Thread Lennart Poettering
On Fr, 08.10.21 21:15, Sebastian Wiesner (sebast...@swsnr.de) wrote:

> Am Montag, dem 04.10.2021 um 14:49 +0200 schrieb Lennart Poettering:
> > On Do, 30.09.21 21:20, Sebastian Wiesner (sebast...@swsnr.de) wrote:
> >
> > > Hello,
> > >
> > > thanks for quick reply, I guess this explains the lack of
> > > instructions
> >
> > btw, coincidentally this was posted on github on the day you posted
> > this:
> >
> > https://github.com/systemd/systemd/pull/20902
> >
> > so hopefully we'll have te missing tools in place soon too.
>
> Great, so it looks as if everything's in place with systemd 250
> perhaps?

Dunno, we'll see, if the submitter rolls another revision possibly,
but it all depends on that. Would love to see this happen, but right
now the ball is in the field of the submitter of that PR.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [systemd]: How to set systemd not to generate loop0.device and mtdblockx.device?

2021-10-11 Thread Lennart Poettering
On Sa, 09.10.21 11:27, www (ouyangxua...@163.com) wrote:

> systemd version: V242
>
> In our system, the whole machine starts too slowly. We want to do
> some optimization. I found that two services( loop0.device and
> mtdblock5.device) started slowly. I want to remove them (I
> personally think our system are not need them). I want to ask you
> how to avoid generating these two device files and not start them?

/dev/loop0 is a loopback block device. It's probably some tool that
needs them you are using.

/dev/mtdblock5 is some physical hw you have. And it's probably mounted
by something you are using.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Antw: [EXT] Re: [systemd‑devel] Q: write error, watchdog, journald core dump, ordering of entries

2021-10-11 Thread Lennart Poettering
On Mo, 11.10.21 10:57, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote:

> > Now when journald hangs due to some underlying IO issue, then it might
> > miss the watchdog deadline, and PID 1 might then kill it to get it
> > back up. It will log about this to the journal, but given tha tthe
> > journal is hanging/being killed it's not going to write the messages
> > to disk, the mesages will remain queued in the logging socket for a
> > bit. Until eventually journald starts up again, and resumes processing
> > log messages. it will then process the messages already queued in the
> > sockets from when it was hanging, and thus the order might be
> > surprising.
>
> Hi!
>
> Thanks for explaining.
> Don't you have some OOB-logging, that is: Log a message before processing the
> queue logs.

The "Journal started" message is inserted into the log stream by
journald itself before processing the already queued messages.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Q: write error, watchdog, journald core dump, ordering of entries

2021-10-11 Thread Lennart Poettering
On Mi, 06.10.21 10:29, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote:

> Hi!
>
> We had a stuck networkc card on a server that seems to have caused the RAID 
> controller with two SSDs to be stuck on write as well.
> Anyway journald dumped core with this stack:
> Oct 05 20:13:25 h19 systemd-coredump[26759]: Process 3321 (systemd-journal) 
> of user 0 dumped core.
> Oct 05 20:13:25 h19 systemd-coredump[26759]: Coredump diverted to 
> /var/lib/systemd/coredump/core.systemd-journal.0.a4eb19afcc314d99936cbdd5542e4fed.3321.163345758500.lz4
> Oct 05 20:13:25 h19 systemd-coredump[26759]: Stack trace of thread 3321:
> Oct 05 20:13:25 h19 systemd-coredump[26759]: #0  0x7f913492d0c2 
> journal_file_append_object (libsystemd-shared-234.so)
> Oct 05 20:13:25 h19 systemd-coredump[26759]: #1  0x7f913492dba3 n/a 
> (libsystemd-shared-234.so)
> Oct 05 20:13:25 h19 systemd-coredump[26759]: #2  0x7f913492fc79 
> journal_file_append_entry (libsystemd-shared-234.so)
> Oct 05 20:13:25 h19 systemd-coredump[26759]: #3  0x557fe532908d n/a 
> (systemd-journald)
> Oct 05 20:13:25 h19 systemd-coredump[26759]: #4  0x557fe532b15f n/a 
> (systemd-journald)
> Oct 05 20:13:25 h19 systemd-coredump[26759]: #5  0x557fe5324664 n/a 
> (systemd-journald)
> Oct 05 20:13:25 h19 systemd-coredump[26759]: #6  0x557fe5326a80 n/a 
> (systemd-journald)
> Oct 05 20:13:25 h19 kernel: printk: systemd-coredum: 6 output lines 
> suppressed due to ratelimiting
>
> (systemd-234-24.90.1.x86_64 of SLES15 SP2 on x86_64)
>
> journald seems to have restarted later, but I wonder about the ordering of 
> the entries following:
> Oct 05 20:13:25 h19 systemd-journald[26760]: Journal started
> Oct 05 20:13:25 h19 systemd-journald[26760]: System journal 
> (/var/log/journal/8695c89eb080463dad2ca9f9aaedf162) is 928.0M, max 4.0G, 3.0G 
> free.
>
> Oct 05 20:12:52 h19 systemd[1]: systemd-journald.service: Watchdog timeout 
> (limit 3min)!
> Oct 05 20:12:52 h19 systemd[1]: systemd-journald.service: Killing process 
> 3321 (systemd-journal) with signal SIGABRT.
> Oct 05 20:13:25 h19 systemd[1]: Starting Flush Journal to Persistent 
> Storage...
> Oct 05 20:13:25 h19 systemd[1]: Started Flush Journal to Persistent Storage.
>
> I don't understand why the core dump is logged before the signal
> being sent and the watchdog timeout.

PID 1 logs to journald. PID 1 also runs and supervises
journald. That's quite a special relationship: PID1 both is client to
journald and manages it.

Now when journald hangs due to some underlying IO issue, then it might
miss the watchdog deadline, and PID 1 might then kill it to get it
back up. It will log about this to the journal, but given tha tthe
journal is hanging/being killed it's not going to write the messages
to disk, the mesages will remain queued in the logging socket for a
bit. Until eventually journald starts up again, and resumes processing
log messages. it will then process the messages already queued in the
sockets from when it was hanging, and thus the order might be
surprising.

--
Lennart Poettering, Berlin


Re: [systemd-devel] dm-integrity volume with TPM key?

2021-10-04 Thread Lennart Poettering
On Do, 30.09.21 21:20, Sebastian Wiesner (sebast...@swsnr.de) wrote:

> Hello,
>
> thanks for quick reply, I guess this explains the lack of
> instructions

btw, coincidentally this was posted on github on the day you posted
this:

https://github.com/systemd/systemd/pull/20902

so hopefully we'll have te missing tools in place soon too.

> As a workaround you'd use a regular file key for dm-integrity and put
> that on a TPM-protected partition, if I understand you correctly?

yes.

> I.e. you'd
>
> 1. enable secureboot (custom keys or shim),
> 2. bundle kernel & initrd into signed UEFI image for systemd-boot,
> 3. make / a LUKS-encrypted parition with systemd-cryptenroll, bound to
> the TPM (perhaps PCR 0 and 7) aund unlocked automatically at boot,

only pcr 7, for the reasons explained in the blog story.

> 4. make /home a dm-integrity partition, with a regular keyfile from
> e.g. /etc/integrity.key (which is on the encrypted partition), and

actually, after thinking a bit more about this I figure the ultimate
path for this would be /etc/integritysetup-keys.d/home.key – because
we already implemented in systemd-cryptsetup a scheme where we search
for the encryption key for volume xyz in
/etc/cryptsetup-keys.d/xyz.key, and we should probably do it similar
for verity keys, too.

> 5. use homed for LUKS-encrypted home areas on /home?
>
> Does this sound reasonable?  

Yes!

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Authenticated Boot and Disk Encryption on Linux

2021-10-04 Thread Lennart Poettering
On Do, 30.09.21 18:54, Łukasz Stelmach (stl...@poczta.fm) wrote:

> > I have been working on code in homed to "balance" free space between
> > active home dirs in regular intervals (shorter intervals when disk
> > space is low, higher intervals when there's plenty). Also, right now
> > we already run FITRIM on home dirs on logout, to make sure all air is
> > removed then. I intend to also add logic to shrink to minimal size
> > then (and conversely grow on login again).
> >
> > This will only really work in case btrfs is used inside the homedir
> > images, as only then we can both shrink and grow the fs whenever we
> > want to.
>
> Interesting. Apparently[1] loopback driver punches holes in the image
> files and makes them sparse.

We currently issue FITRIM on logout (thus making the file sparse), and
on login we issue fallocate() to remove the holes again, and being
able to give disk space guarantees and disable overcommit during runtime.

> > [Encryption] isn't typically needed for /usr/ given that it generally
> > contains no secret data
>
> This isn't IMHO precisely true. Especially not for laptops. And I don't
> mean the presence of "hacking tools" you mentioned below. Even when all
> the binaries in the /usr all come from the Internet there are many
> different versions available. Knowledge which versions are running on a
> device may be quite valuable for an attacker to mount an remote on-line
> attack and extract data with malware.

Well, that's security through obscurity to some level. I know some
people are concerned about this, and they can encrypt that if they
really thinkg they must. But I doubt that this makes sense for the
cases where your OS payload comes in flatpaks, containers, sysexts,
portable services, …, i.e. is not written to /usr.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Prefix for direct logging

2021-10-04 Thread Lennart Poettering
On Mi, 29.09.21 20:21, Arjun D R (drarju...@gmail.com) wrote:

> Hi Lennart,
>
> Please help me understand how the journald is figuring out the PID of the
> log line.

Google SCM_CREDENTIALS.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Authenticated Boot and Disk Encryption on Linux

2021-09-30 Thread Lennart Poettering
On Mi, 29.09.21 21:09, Łukasz Stelmach (stl...@poczta.fm) wrote:

> Hi, Lennart.
>
> I read your blog post and there is little I can add regarding
> encryption/authentication*. However, distributions need to address one
> more detail, I think. You've mentioned recovery scenarios, but even with
> an additional set of keys stored securely, there are enough moving parts
> in FDE that something may go wrong beyond what recovery keys could
> fix. To help users minimise the risk of data loss distributions should
> provide backup tools and help configure them securely.
>
> This is of course outside of the scope of your original post, but IMHO
> it is a good moment to mention this.
>
> * Well there is one tiny detail.
>
> You noted double encryption needs to be avoided in case of home
> directory images by storing them on a separate partition. Separating
> /home may be considered a slight inefficiency in storage usage, but
> using LVM to distribute storage space between the root(+/usr) and /home
> might help. However, to best of my knowledge (which I will be glad to
> update) there is no tool to dynamically and automatically manage storage
> space used by home images. In theory the code is there, but UX of
> resize2fs(8) and dd(1) is far from satisfying and I am not entirely sure
> what happens if one truncates (after resize2fs, which will work)
> a file containing a mounted image.
>
> The first solution that comes to my mind is to make systemd-homed resize
> home filesystem images according to some policy upon locking and
> unlocking. But it's not perfect as users would need to log out(?) to
> trigger allocation of more storage should they fill their home
> directory.

I have been working on code in homed to "balance" free space between
active home dirs in regular intervals (shorter intervals when disk
space is low, higher intervals when there's plenty). Also, right now
we already run FITRIM on home dirs on logout, to make sure all air is
removed then. I intend to also add logic to shrink to minimal size
then (and conversely grow on login again).

This will only really work in case btrfs is used inside the homedir
images, as only then we can both shrink and grow the fs whenever we
want to.

Lennart

--
Lennart Poettering, Berlin


<    1   2   3   4   5   6   7   8   9   10   >