Re: [systemd-devel] RFC: Passing on initial client user in systemd-userdbd

2022-11-29 Thread Lennart Poettering
On Di, 29.11.22 11:50, Dominik George (n...@naturalnet.de) wrote:

> Hi,
>
> > in theory, I have implemented that now […]
>
> In practice now, as well:
>
>   https://github.com/systemd/systemd/pull/25556
>
> However, something kicked back here a bit… systemd-userdbd drops all
> capabilities, and sending SO_PASSCRED requires CAP_SYS_ADMIN…
>
> What do we do about that?

Just add the capability to the service unit file.
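A drop-in along these lines would do it — a sketch only: whether CAP_SYS_ADMIN is really the capability required for SO_PASSCRED here, and whether AmbientCapabilities= is needed, depends on the user and sandboxing settings of the actual systemd-userdbd.service unit:

```ini
# /etc/systemd/system/systemd-userdbd.service.d/50-passcred.conf
# Hypothetical drop-in: re-grant the one capability the thread says
# setting SO_PASSCRED needs, while everything else stays dropped.
[Service]
CapabilityBoundingSet=CAP_SYS_ADMIN
AmbientCapabilities=CAP_SYS_ADMIN
```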

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] RFC: Passing on initial client user in systemd-userdbd

2022-11-28 Thread Lennart Poettering
On Mo, 28.11.22 16:27, Dominik George (n...@naturalnet.de) wrote:

> Hi,
>
> > You don't have to send that really, the kernel will implicitly attach it
> > automatically whenever the sender's credentials change. Thus, a
> > receiver can safely assume that the ucred remains the same as the
> > SO_PEERCRED data until it receives a new SCM_CREDENTIALS that says
> > otherwise.
> >
> > You want to send SCM_CREDENTIALS explicitly only when you actively try
> > to impersonate someone else.
>
> I'm not convinced of that. Of course, sending the creds if it does not
> differ from the process running would be sufficient, but doing it in
> all cases removes a lot of complexity.

From the receiving side there's very little difference in behaviour:
the kernel will automatically send this stuff *anyway* if needed.

Hence on the receiving side, in the Varlink object, just add a new
"struct ucred" field that stores the last SCM_CREDENTIALS ucred that
was received. Update it whenever a new SCM_CREDENTIALS is received. It
will look exactly the same from the receiving side whether the kernel
sent it automatically because the sender's uid changed, or the sender
appended it explicitly because it felt like it, for example because it
wants to impersonate someone.

> Sending SCM_CREDENTIALS selectively would mean we would have to
> introduce a distinction between systemd-userdbd acting as multiplexer
> and not doing so, which would require moving quite a bit of code
> around that is now neatly generic.

userdbd should always impersonate the client it received the request
on. What I am saying is that regular varlink client code (i.e. not
userdbd, not an impersonator) should not bother with this at all,
since the kernel will attach this info anyway if needed. Only
impersonators need to attach SCM_CREDENTIALS explicitly, and userdbd
should be one of these impersonators.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] RFC: Passing on initial client user in systemd-userdbd

2022-11-28 Thread Lennart Poettering
On Mo, 28.11.22 00:14, Dominik George (n...@naturalnet.de) wrote:
> Hi,
>
> > The approach brings me a bit farther away from being able to implement it 
> > myself, but not too far I guess ;).
>
> I've spent some time reading the userdb code now, and it actually
> seems pretty easy to do.
>
> Here's my rough plan:
>
>  1. In src/userdb/userdbd-manager.c manager_startup(), set the
> SO_PASSCRED socket option
>  2. In src/shared/varlink.c, change the behaviour in two places:
>  - In varlink_read, use recvmsg to read the SCM_CREDENTIALS
>message and, if we get one and its uid is valid, store the
>ucred in the varlink struct and set its ucred_acquired to true
>  - In varlink_write, always send an SCM_CREDENTIALS message —
>if ucred_acquired is true on the varlink object, send this
>ucred struct; if it is false, send an empty message to use
>our real credentials

You don't have to send that really, the kernel will implicitly attach it
automatically whenever the sender's credentials change. Thus, a
receiver can safely assume that the ucred remains the same as the
SO_PEERCRED data until it receives a new SCM_CREDENTIALS that says
otherwise.

You want to send SCM_CREDENTIALS explicitly only when you actively try
to impersonate someone else.
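The receiving half of the mechanics being discussed (SO_PASSCRED plus recvmsg() with an SCM_CREDENTIALS ancillary message) can be sketched in Python — this is an illustration of the kernel API, not the actual varlink.c code:

```python
import os
import socket
import struct

UCRED_FMT = "3i"  # struct ucred: pid, uid, gid as native ints

def recv_with_ucred(sock, bufsize=4096):
    """Receive data plus the kernel-attached SCM_CREDENTIALS ucred, if any.

    SO_PASSCRED must already be enabled on `sock`; the kernel then
    attaches the sender's (pid, uid, gid) to incoming messages.
    """
    data, ancdata, flags, addr = sock.recvmsg(
        bufsize, socket.CMSG_SPACE(struct.calcsize(UCRED_FMT)))
    ucred = None
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            ucred = struct.unpack(UCRED_FMT, cdata[:struct.calcsize(UCRED_FMT)])
    return data, ucred

if __name__ == "__main__":
    server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    # Opt in to credential passing on the receiving side.
    server.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
    client.sendall(b"hello")   # no explicit cmsg: the kernel attaches creds
    data, ucred = recv_with_ucred(server)
    print(data, ucred)         # ucred is (pid, uid, gid) of the sender
```

Note that the sender did nothing special — with SO_PASSCRED enabled on the receiver, the kernel supplies the credentials on its own, which is exactly why a plain client needn't bother.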

> Given that all userdbd services in systemd, including the multiplexer,
> use the same code, this should be all there is to it to enable the
> discussed behaviour in systemd, and downstream service implementations
> could start using it.
>
> If there is nothing fundamentally wrong with my assessment, I'll give
> the implementation a shot.

Sounds great! Happy to review a PR for that.

In the varlink API please report the SCM_CREDENTIALS ucred separately
from the SO_PEERCRED one though (i.e. from the current ucreds we already
store). For various purposes it is interesting to know the identity of
the process initiating the connection, if it's different from the
process actually sending messages over it.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] RFC: Passing on initial client user in systemd-userdbd

2022-11-25 Thread Lennart Poettering
On Fr, 25.11.22 15:19, Dominik George (n...@naturalnet.de) wrote:

> Hi,
>
> I would like to extend the methods of the User/Group Lookup API[1]
> with an optional argument "onBehalfOf" that carries the authenticated
> user who made the initial method call.
>
> The argument must only be set by a privileged client.
>
> When a client makes a lookup request to the multiplexer, the
> multiplexer authenticates the client using SO_PEERCRED. In each
> subsequent call to other services, it sets the authenticated user in
> the onBehalfOf argument to the method call.
>
> Services must only honour the argument if the connecting client was
> identified as a privileged client, i.e. it would receive the
> "privileged" section of the User or Group Record. In all other cases,
> they must ignore the argument and use SO_PEERCRED themselves to
> determine the client user.
>
> The concrete use case for this is to allow a service to take more
> fine-grained control of the data it returns, e.g. it strips location
> or realName from the record if an unprivileged user make a query, or
> chooses a user-bound OAuth token to make calls to a Web API in
> response to the request.
>
> What do others think of this?

Sounds superficially OK to do. I presume you intend to pass the numeric
UID there?

Usually passing around numeric UIDs is a bit problematic, due to
user namespaces and so on. I.e. the two ends of an AF_UNIX stream might
live in different userns and thus have a different idea of what UID 4711 means.

This hence raises the question whether we can find a better way. Right now,
systemd's varlink implementation exclusively uses SO_PEERCRED to
identify the peer, i.e. a UID pinned at connection time.

But there's also SCM_CREDENTIALS, which allows receiving and sending
UIDs at arbitrary times. When sending them, clients can only send
their own uids (euid or uid), or, if they are privileged, any. Hence, we
could just build on that: when reading messages off the varlink socket,
do so with recvmsg() so that we get this info. Then when doing lookups
we'd use the SCM_CREDENTIALS info when available, and SO_PEERCRED
otherwise. A client could then transmit its varlink messages with an
SCM_CREDENTIALS metadata field to execute stuff on behalf of some other
client.

The big benefits of this approach would be: automatic translation of
UIDs by the kernel in regards to userns, and the kernel will
implicitly validate for us whether the on-behalf-of impersonation
shall be allowed or not.
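The sending/impersonation side can be sketched the same way — a hypothetical helper, again illustrating the kernel mechanics rather than systemd's code. An unprivileged process can only pass its own pid/uid/gid; passing someone else's (as the multiplexer would, on behalf of its client) requires privileges, and the kernel enforces that and performs any userns translation:

```python
import os
import socket
import struct

UCRED_FMT = "3i"  # struct ucred: pid, uid, gid as native ints

def send_with_ucred(sock, data, pid, uid, gid):
    """Send data with an explicit SCM_CREDENTIALS ancillary message.

    The kernel validates the credentials: unprivileged senders may
    only pass their own pid/uid/gid, impersonating others needs
    capabilities -- exactly the property the approach relies on.
    """
    ucred = struct.pack(UCRED_FMT, pid, uid, gid)
    sock.sendmsg([data],
                 [(socket.SOL_SOCKET, socket.SCM_CREDENTIALS, ucred)])

if __name__ == "__main__":
    server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
    # Explicitly attach our own credentials (always permitted).
    send_with_ucred(client, b"req", os.getpid(), os.getuid(), os.getgid())
    print(server.recvmsg(64, socket.CMSG_SPACE(struct.calcsize(UCRED_FMT))))
```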

Does that make sense?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Some questions on userdbd and providing a compatible service

2022-11-24 Thread Lennart Poettering
On Do, 24.11.22 14:29, Dominik George (n...@naturalnet.de) wrote:

> Hi Lennart,
>
> > (BTW; I kinda hope that one day systemd-homed could directly
> > authenticate home directories via OIDC too. In fact, I want it so that
> > you can just type in any OpenID identity on a login prompt, and this
> > would authenticate a user and create a local homedir on the fly if
> > needed.)
>
> One more question on this: Does homed only handle user sessions for
> users created with homed? Or can any userdb backend provide a user
> record exposing a "storage" section, and systemd-homed will handle
> this user as well?

The former. But you can register users with homed easily, i.e. just
"upload" a JSON user record to it, and it will manage it. But this
step is necessary.

> In other words: Can I use data from my own userdb backend to make
> homed start managing the home directory for this user?

Nope, currently not. homed is a *provider* of user records, not a
consumer.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Some questions on userdbd and providing a compatible service

2022-11-24 Thread Lennart Poettering
On Do, 24.11.22 13:36, Dominik George (n...@naturalnet.de) wrote:

> Hi,
>
> > (BTW; I kinda hope that one day systemd-homed could directly
> > authenticate home directories via OIDC too. In fact, I want it so that
> > you can just type in any OpenID identity on a login prompt, and this
> > would authenticate a user and create a local homedir on the fly if
> > needed.)
>
> that's basically what I am building.

How do you intend to support getty logins, i.e. non-graphical,
text-based-only logins, where you cannot just open a web browser? OIDC
device flow?

(I mean, from an environment like gdm it might actually make a ton of
sense to just open a web browser dialog, but for the getty crap? or sudo?)

> I guess my approach will be coming up with a custom Varlink interface
> for PAM authentication and experiment with it.

That's tough. PAM has a lot of implicit and explicit state attached to
the PAM handle... And you can have PAM conversations and so on
(i.e. prompting arbitrary questions) which makes PAM compat really
really messy...

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Some questions on userdbd and providing a compatible service

2022-11-24 Thread Lennart Poettering
On Do, 24.11.22 12:46, Dominik George (n...@naturalnet.de) wrote:

> Ah, so what would happen here is that even if the Multiplexer, which
> is privileged, talks to my IPC service and receives the "privileged"
> part, the Multiplexer will strip it off for me unless a privileged
> user is talking to it.

correct.

> > Yeah, you have to deal with PAM yourself (unless you add classic
> > hashed UNIX passwords in the "privileged" section of your user records
> > – in that case pam_unix will just use that).
>
> That won't work. Actually, the final goal is to authenticate without
> ever handling the user password, e.g. using the OIDC Authorization
> Code Grant Flow or Device Code Grant Flow.

Yeah, I figured.

(BTW; I kinda hope that one day systemd-homed could directly
authenticate home directories via OIDC too. In fact, I want it so that
you can just type in any OpenID identity on a login prompt, and this
would authenticate a user and create a local homedir on the fly if
needed.)

> But generally, are the fields in the User Record objects fixed, or can
> I add my own fields? If I do, will they be ignored and passed on
> verbatim, or stripped, or cause an error preventing the User Record
> from being handled at all?

It's supposed to be extensible.

→ https://systemd.io/USER_RECORD/#extending-these-records

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Some questions on userdbd and providing a compatible service

2022-11-24 Thread Lennart Poettering
On Do, 24.11.22 00:58, Dominik George (n...@naturalnet.de) wrote:

> Hi,
>
> for some time now, I have been investigating how to best make a
> desktop system talk to a web API (HTTP, REST) for user management, so
> NSS and PAM make HTTP requests to an API to verify authentication
> (using OIDC) and to retrieve NIS information (using REST endpoints).
>
> One of the approaches I am evaluating involves systemd-userdbd,
> because it seems to be designed with extensibility with modular
> service implementations in mind.

That would make a lot of sense to me.

> Right now, I have a few questions concerning its architecture and use:
>
>  * Why was Varlink chosen over D-Bus, given that most other parts of
>systemd seem to talk D-Bus?

Three reasons:

1. We want this to work in early boot, i.e. before dbus-daemon is up

2. dbus-daemon policies involve user names, thus dbus-daemon must be
   able to resolve them, and this deadlocks if dbus-daemon is used as
   transport for the resolution requests dbus-daemon itself makes.

3. D-Bus is suitable for transmitting control data only. It puts
   limits on queued messages, and thus is not suitable to transfer
   bulk data. This means it's kinda unsuitable for doing things like a
   dump of a user database, which might have millions of entries. If
   we'd ignore that fact, then whenever the user would try to dump the
   user database, userdbd would be forcibly kicked off the bus, under
   the assumption it tried to flood the bus.

Every single one of the three is a killer, really. Using varlink avoids
that mess: it requires no broker, is efficient for transferring bulk
data, and works from any context.

(There are other reasons: for example we want user records to be json
objects, and it's just a lot nicer if json is passed around over a
json-based IPC than nesting it inside of dbus marshalling. But these
are "softer" reasons, so I'll spare you the rest.)

>  * How does protection of privileged fields work? In a different
>approach (using my own gRPC-based protocol), I used peer
>credentials on the UNIX socket for authorisation, but it seems this
>should break with userdbd when going through the
>multiplexer. However, I see "Warning: lacking rights to acquire
>privileged fields of user record of 'testnik', output incomplete."
>when I try to inspect another user as an unprivileged user. How
>does userdbd determine that?

See:

https://systemd.io/USER_RECORD/

Basically, a user record consists of multiple sections (i.e. json
fields containing subobjects). One is called "privileged": this contains
everything traditionally found in /etc/shadow, basically, and
pretty much everything else that should only be visible to privileged
users and the user itself.

Any IPC service is supposed to strip "privileged" from user records it
sends or passes on, and report that in the "incomplete" boolean return
parameter – unless it can know for sure (via SO_PEERCRED or so) that
whoever it is talking to is the user itself or is privileged.
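That stripping rule can be sketched as follows — a hypothetical helper (the function and parameter names are made up; `peer_uid` stands in for what SO_PEERCRED would deliver, and `incomplete` mirrors the boolean return parameter of the USER_GROUP_API):

```python
def sanitize_user_record(record, peer_uid, record_uid):
    """Strip the "privileged" section from a JSON user record unless the
    peer is root or the very user the record describes.

    Returns (record, incomplete), where `incomplete` signals that
    fields were withheld from the caller.
    """
    # Root and the user themselves may see everything.
    if peer_uid == 0 or peer_uid == record_uid:
        return record, False
    if "privileged" not in record:
        return record, False
    # Anyone else gets the record without the shadow-like section.
    stripped = {k: v for k, v in record.items() if k != "privileged"}
    return stripped, True

if __name__ == "__main__":
    rec = {"userName": "testnik", "uid": 1001,
           "privileged": {"hashedPassword": ["$6$..."]}}
    print(sanitize_user_record(rec, peer_uid=1000, record_uid=1001))
```

This matches the "output incomplete" warning quoted above: userdbctl run as an unprivileged user receives a record with the section removed and the incomplete flag set.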

See this for details on the IPC API: https://systemd.io/USER_GROUP_API/

>  * userdbd only helps for user information, i.e. for providing data to
>NSS through a decoupled interface. I would need to do the same for
>PAM, but until now, I could not find an existing standard for
>verifying credentials. Was that just not done yet, or is there a
>design decision that userdbd should not offer methods for
>authentication? I see that systemd-homed implements its own API
>through D-Bus…

Yeah, you have to deal with PAM yourself (unless you add classic
hashed UNIX passwords in the "privileged" section of your user records
– in that case pam_unix will just use that).

One of these days people should revisit PAM and redesign it around
IPC, but today is not that day I fear.

>  * Ultimately, I would like to retrieve and store an OAuth token on
>user login. It would somehow be a good fit for the "secret" section
>of the User Record, but the fields allowed in it seem to be
>static. Are there any ideas around here where such a token could be
>stored during the user session?

Kernel keyring for the user? It's where kerberos stuff is stored, and
is probably the best place. The API is a bit convoluted, but this has
been done before.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Prevent firmware from falling back to next EFI boot option on secure boot failure?

2022-11-23 Thread Lennart Poettering
On Mi, 23.11.22 17:56, Lennart Poettering (lenn...@poettering.net) wrote:

> > If this is a bug, I'd be willing to attempt a pull request submission
> > if a suggested fix is given.  Overall we like the functionality
> > sd-boot provides and the integration with systemd, but this is likely
> > a hard requirement for our use case.
>
> Yes please file an issue on github first, and this does sound a lot
> like something we should fix, hence a PR that addresses this would be
> more than welcome, too.

BTW, I think we should treat an EFI binary that fails to load like a
system we can't boot, as per the boot assessment logic. I.e. whenever
we fail to invoke a binary (regardless of whether the reason is the
security check or something else), we should count down its counters,
and then stop using it once it hits zero.

I.e. I think this should hook into the logic described in
https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Prevent firmware from falling back to next EFI boot option on secure boot failure?

2022-11-23 Thread Lennart Poettering
On Mi, 23.11.22 11:44, Daniel Harms (jdha...@gmail.com) wrote:

> Lennart,
>
> That is how we're hoping it should work, so it's good to hear.  I
> suppose I'm not sure that it's the firmware driving this process--I
> just assumed because I know that the UEFI spec has verbiage requiring
> EFI boot managers to try next options in case of certain failure
> cases.  I think you're probably right in that sd-boot *should* be able
> to continue onwards down the list.
>
> We're seeing the following error message in red text:
>
> 
>
> Error loading \EFI\Linux\linux-5.15.0-unsigned.efi: Security Policy Violation
>
> Failed to execute [entry config name]
> (\EFI\Linux\linux-5.15.0-unsigned.efi): Security Policy Violation
>
> 
>
> What I believe is happening based on these messages is that
> image_start() is returning an error here:
> https://github.com/systemd/systemd/blob/v252/src/boot/efi/boot.c#L2747
> and the `goto out;` is being executed, ending/preventing any looping
> over boot options.
>
> If this is a bug, I'd be willing to attempt a pull request submission
> if a suggested fix is given.  Overall we like the functionality
> sd-boot provides and the integration with systemd, but this is likely
> a hard requirement for our use case.

Yes please file an issue on github first, and this does sound a lot
like something we should fix, hence a PR that addresses this would be
more than welcome, too.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Prevent firmware from falling back to next EFI boot option on secure boot failure?

2022-11-23 Thread Lennart Poettering
On Mi, 23.11.22 10:22, Daniel Harms (jdha...@gmail.com) wrote:

> Hello,
>
> We are doing some experiments with booting self-signed Unified Kernel
> Images (UKIs) using systemd-boot.  Our eventual use-case is edge/IoT
> devices, so no interactive user will be present for most OS upgrade
> flows.
>
> In doing some testing on the boot option fallback features (in a
> vmware vm) we’ve run into a snag—when we set up an unsigned UKI as the
> first option and a properly signed UKI as the second option,
> systemd-boot appears to attempt to boot the unsigned one (as
> expected), the system reports a security violation, but then the
> firmware kicks us to the next boot option.

Hmm, are you sure this is the firmware? Normally a security violation
should just be returned as an error to sd-boot, and sd-boot should be
able to pick the next option then. Not entirely sure this works
correctly though. There might be a bug lurking somewhere.

It's simply not a case we regularly test for. But it should be a case
that just works.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] dependent services status

2022-11-21 Thread Lennart Poettering
On Do, 17.11.22 08:52, Ted Toth (txt...@gmail.com) wrote:

> I have a set of services that depend on each other however when
> services are started and considered 'active' that does not necessarily
> mean they are in a state that a dependent service requires them to be
> in to operate properly (for example an inotify watch has been
> established).

This is a bug in those services. They should not report startup
completion when they haven't completed startup. Everything like
"establish sockets", "establish inotify watches" and so on is
considered part of startup...

> systemd services, I think,  have a substate, is there a
> way I can set that to a custom value to indicate the services idea of
> its own state?

For Type=notify services use sd_notify("READY=1") to communicate
startup completion.

For Type=forking services, exit in the parent process when the main
service process has finished startup.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Preventing automatic driver loading on live boot disk

2022-11-18 Thread Lennart Poettering
On Do, 17.11.22 21:41, Andrei Borzenkov (arvidj...@gmail.com) wrote:

> On 17.11.2022 20:48, Lennart Poettering wrote:
> > On Do, 17.11.22 18:17, Vadim Lebedev (vadiml1...@gmail.com) wrote:
> >
> > > Awesome, thanks, it is EXTREMELY useful
> > >   | Find the right one and denylist it.
> > > One more question:  how do I  'denylist'  the offending alias?
> >
> > Via the "blacklist" stanza in the modprobe configuration files, like
> > you already are using.
> >
>
> Care to provide example how to use "blacklist" stanza to denylist only
> specific PCI ID? Because kmod documentation does not explain it.

Oh, right "blacklist" actually works differently than I was
remembering. But this should work:

alias pci:v8086dA0EDsv17AAsd22D5bc0Csc03i30 letsmaskthisone

If you do that, then the kernel requesting that modalias will tell
modprobe to load letsmaskthisone.ko, which does not exist, so modprobe
will fail and not load anything.
Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Preventing automatic driver loading on live boot disk

2022-11-17 Thread Lennart Poettering
On Do, 17.11.22 18:17, Vadim Lebedev (vadiml1...@gmail.com) wrote:

> Awesome, thanks, it is EXTREMELY useful
>  | Find the right one and denylist it.
> One more question:  how do I  'denylist'  the offending alias?

Via the "blacklist" stanza in the modprobe configuration files, like
you already are using.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Preventing automatic driver loading on live boot disk

2022-11-17 Thread Lennart Poettering
On Mi, 16.11.22 10:24, Vadim Lebedev (vadiml1...@gmail.com) wrote:

> I'm preparing ubuntu-based live boot disk. It works fine mostly, but on
> some machines equipped with Nvidia Quadro cards the default nouveau driver
> causes problems (temporary freezes). I've determined that buy blacklisting
> nouveau driver (in /etc/modprobe.d/blacklist.conf) I can fix the problem.
> However this approach inhibits nouveau driver for every nvidia equipped
> machine which is an overkill. Of course, i can detect the presence of the
> Quadro card after the boot, blacklist it, do update-initramfs -u and reboot
> but this approach modifies live boot disk and I would like to avoid that. I
> wonder if there is a way to detect the presence of nvidia Quadro somewhere
> very early in the boot sequence and prevent loading of the offending driver
> and fall back to standard VESA driver.

PCI drivers are loaded via a "modalias" string, which is synthesized
from the PCI and USB vendor and product IDs (and other PCI
info). Drivers declare in their kmod metadata which of these modalias
strings they want to be responsible for.

Do "modinfo nouveau" for example, which will show you this
information:


…
alias:  pci:v12D2d*sv*sd*bc03sc*i*
alias:  pci:v10DEd*sv*sd*bc03sc*i*
…


The "*" are wildcard expressions.
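Matching a concrete modalias string against these wildcard aliases is plain glob-style matching, which can be illustrated with Python's fnmatch. The alias patterns below are the ones from the modinfo output above; the concrete device strings are made up for illustration:

```python
from fnmatch import fnmatchcase

# Wildcard aliases nouveau declares in its kmod metadata (see modinfo
# output above).
NOUVEAU_ALIASES = [
    "pci:v12D2d*sv*sd*bc03sc*i*",
    "pci:v10DEd*sv*sd*bc03sc*i*",
]

def driver_matches(modalias):
    """True if any of nouveau's declared aliases covers `modalias`."""
    return any(fnmatchcase(modalias, pattern) for pattern in NOUVEAU_ALIASES)

if __name__ == "__main__":
    # Hypothetical NVIDIA display-class device: vendor 10DE, class bc03.
    print(driver_matches("pci:v10DEd1C82sv1043sd8613bc03sc00i00"))   # → True
    # A non-NVIDIA device does not match any alias.
    print(driver_matches("pci:v8086dA0EDsv17AAsd22D5bc0Csc03i30"))   # → False
```

The kernel/kmod side works the same way: a device's modalias string is matched against every declared alias pattern, and the first module whose pattern matches gets loaded — which is why denylisting the concrete string (rather than the module name) disables autoloading for just that one device.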

Now, the kernel will never ask userspace for the "nouveau" driver but
only for a driver for such a modalias string.

You can denylist that string for your hw and thus disable the
autoloading.

Use "grep . /sys/bus/*/*/*/modalias" to get a list of the actual
modalias strings requested on your system. The one nouveau.ko matched
against will be among them. Find the right one and denylist it.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-repart with multiple block devices

2022-11-17 Thread Lennart Poettering
On Mi, 16.11.22 17:00, Mehmet Akbulut (mehmet.akbu...@motional.com) wrote:

> This email contains information belonging to Motional AD LLC or its
> affiliates and may contain confidential, proprietary, copyrighted and/or
> privileged information. Any unauthorized review, use, reliance, disclosure,
> distribution or copying is prohibited. If you are not the intended
> recipient, immediately destroy all copies of the original email and any
> attachments and contact the sender by reply email.

Sorry, but this is not OK to send to a public mailing list and expect
people to respect that or even respond to you then.

Public mailing lists have public archives, they are not confidential,
hence do not send an email to it you expect to remain confidential.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Antw: [EXT] [systemd-devel] starting networking from within single user mode?

2022-11-14 Thread Lennart Poettering
On Mo, 14.11.22 15:06, Michael Biebl (mbi...@gmail.com) wrote:

> Yeah, can we please block this Ulrich Windl guy.
> He's been more of a nuisance than a benefit to this community.

I have put him on moderation now.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] starting networking from within single user mode?

2022-11-11 Thread Lennart Poettering
On Fr, 11.11.22 09:05, Brian Reichert (reich...@numachi.com) wrote:

> > On Fri, Nov 11, 2022 at 08:08:58AM +0200, Mantas Mikulėnas wrote:
> > Boot with either "s" (aka "single" aka "rescue") or "-b" (aka "emergency")
> > for two variants of single-user mode with init. The former starts some
> > basic stuff (it's the real single-user mode) including udev so that modules
> > for your network interfaces still get loaded automatically, while the
> > latter doesn't start anything except init and a shell (emergency mode is
> > *almost* like init=/bin/sh but in theory might at least let you `systemctl
> > start` something).
>
> I was able to get into the emergency target, using these notes:
>
>   https://suay.site/?p=1681=noscript
>
> The speed bump this article helped me with was to overcome systemd's
> misconception that the root account was locked.

systemd doesn't manage your root user. That's between you and
"shadow-utils" really.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] starting networking from within single user mode?

2022-11-11 Thread Lennart Poettering
On Do, 10.11.22 17:04, Brian Reichert (reich...@numachi.com) wrote:

> I've managed to hose a SLES12 SP5 host; it starts to boot, then hangs.
>
> If I get it into single-user mode (getting into the grub menu, and adding
> init=/bin/bash) I can at least review the file system.

That's not single-user mode. That's not running an init system at all.

To boot into single-user mode specify "1" or "single" on the kernel
cmdline. Has been that way since sysvinit times.

> What I want to do is get networking running, so that I can at least gather
> logs, etc.
>
> When I try to start networking with 'systemctl', I see this error:
>
> systemd "failed to connect to bus; No such file or directory"
>
> What can I do to minimally bring up the networking service? I don't even
> have any network devices at this point...

You can't have systemd services without systemd. Sorry.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Warning "Supervising process..." due to SIGCHLD from grand-parent

2022-10-31 Thread Lennart Poettering
On Mo, 31.10.22 11:40, Lennart Poettering (lenn...@poettering.net) wrote:

> This is almost certainly a bug in chrony. If you use Type=forking,
> then the process that systemd forks off (let's call it "P") should
> wait until all of the below holds:
>
> 1. The middle child P' has exited
> 2. The grandchild (and main daemon process) P'' is running
> 3. The PID file has been successfully written to contain the PID of P''.

BTW, let me add an explanation *why* this is needed: if they leave
P'' running for a bit longer, then there's a race: if for some reason
the daemon ends up failing shortly after starting up, it's a race
whether P' or P'' dies first. If P'' dies first, then the service manager
will never see its SIGCHLD and cannot determine there was a
failure. If P' dies first then all is good, as the P'' SIGCHLD will be
properly collected by the service manager.

But anyway, it's 2022, chrony being stuck in sysv semantics is
sad. Use sd_notify().

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Warning "Supervising process..." due to SIGCHLD from grand-parent

2022-10-31 Thread Lennart Poettering
On Mo, 31.10.22 08:04, Christopher Wong (christopher.w...@axis.com) wrote:

> Hi,
>
>
> We have during boot received the "Supervising process..." warning
> from systemd related to chronyd.service. This is not always
> happening, but when it happens systemd receives SIGCHLD from
> grand-parent (22955) before the parent (22956). See logs below:

Grand-parent? Do you mean grand-child? I am a bit confused about what
you are trying to say.

> Oct 25 10:34:55.104980 axis-accc8ed1c728 systemd[22955]: chronyd.service: 
> Executing: /usr/sbin/chronyd -u chronyd -f /run/chrony/chronyd.conf
> Oct 25 10:34:55.117999 axis-accc8ed1c728 chronyd[22957]: chronyd version 4.2 
> starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP -SCFILTER -SIGND +ASYNCDNS 
> +NTS -SECHASH +IPV6 +DEBUG)
> Oct 25 10:34:55.120781 axis-accc8ed1c728 chronyd[22957]: Frequency -8.172 +/- 
> 1.366 ppm read from /var/lib/chrony/drift
> Oct 25 10:34:55.124304 axis-accc8ed1c728 systemd[1]: 
> systemd-journald.service: Received EPOLLHUP on stored fd 82 (stored), closing.
> Oct 25 10:34:55.126460 axis-accc8ed1c728 systemd[1]: Received SIGCHLD from 
> PID 22955 (chronyd).
> Oct 25 10:34:55.126708 axis-accc8ed1c728 systemd[1]: Child 22955 (chronyd) 
> died (code=exited, status=0/SUCCESS)
> Oct 25 10:34:55.126920 axis-accc8ed1c728 systemd[1]: chronyd.service: Child 
> 22955 belongs to chronyd.service.
> Oct 25 10:34:55.127000 axis-accc8ed1c728 systemd[1]: chronyd.service: Control 
> process exited, code=exited, status=0/SUCCESS (success)
> Oct 25 10:34:55.127027 axis-accc8ed1c728 systemd[1]: chronyd.service: Got 
> final SIGCHLD for state start.
> Oct 25 10:34:55.127160 axis-accc8ed1c728 systemd[1]: chronyd.service: 
> Potentially unsafe symlink chain, will now retry with relaxed checks: 
> /run/chrony/chronyd.pid
> Oct 25 10:34:55.127571 axis-accc8ed1c728 systemd[1]: chronyd.service: New 
> main PID 22957 belongs to service, we are happy.
> Oct 25 10:34:55.127598 axis-accc8ed1c728 systemd[1]: chronyd.service: Main 
> PID loaded: 22957
> Oct 25 10:34:55.127759 axis-accc8ed1c728 systemd[1]: Custom log in 
> process-util.c fnc pid_is_my_child(): pid: 22957, ppid: 22956, cached_pid: 1.
> Oct 25 10:34:55.127785 axis-accc8ed1c728 systemd[1]: chronyd.service: 
> Supervising process 22957 which is not our child. We'll most likely not 
> notice when it exits.
> Oct 25 10:34:55.127964 axis-accc8ed1c728 systemd[1]: chronyd.service: Changed 
> start -> running
> Oct 25 10:34:55.128006 axis-accc8ed1c728 systemd[1]: chronyd.service: Job 
> 117032 chronyd.service/start finished, result=done
> Oct 25 10:34:55.128053 axis-accc8ed1c728 systemd[1]: Started NTP 
> client/server.
> ...
> Oct 25 10:34:55.158173 axis-accc8ed1c728 systemd[1]: Received SIGCHLD from 
> PID 22956 (chronyd).
> Oct 25 10:34:55.158436 axis-accc8ed1c728 systemd[1]: Child 22956 (chronyd) 
> died (code=exited, status=0/SUCCESS)
> Oct 25 10:34:55.158679 axis-accc8ed1c728 systemd[1]: chronyd.service: Child 
> 22956 belongs to chronyd.service.
>
>
> The chronyd does two forks. In the normal case the parent will die
> first and then the grand-parent will die. This behavior is according
> to the SysV Daemons implementation
> https://www.freedesktop.org/software/systemd/man/daemon.html
> However, it seems scheduling for parent and grand-parent can vary
> and result in a different behavior.

This is almost certainly a bug in chrony. If you use Type=forking,
then the process that systemd forks off (let's call it "P") should
wait until all of the below holds:

1. The middle child P' has exited
2. The grandchild (and main daemon process) P'' is running
3. The PID file has been successfully written to contain the PID of P''.
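
The handshake above can be sketched as follows. This is not chrony's actual implementation, just a minimal illustration of the ordering contract for Type=forking (the process P that systemd spawned must not return/exit before the grandchild P'' has written the pidfile):

```python
import os

def fork_daemon(pidfile, run):
    """Sketch of the Type=forking contract: P exits only after the
    grandchild P'' is running and its PID is in the pidfile."""
    r, w = os.pipe()
    pid = os.fork()
    if pid > 0:
        # P: block until P'' confirms it wrote the pidfile, reap P', return.
        # A real daemon launcher would sys.exit(0) here instead of returning.
        os.close(w)
        os.read(r, 1)       # blocks until the grandchild signals readiness
        os.close(r)
        os.waitpid(pid, 0)  # reap the middle child P'
        return
    # P' (middle child): detach into a new session, fork P'', exit at once
    os.close(r)
    os.setsid()
    if os.fork() > 0:
        os._exit(0)
    # P'' (grandchild, the actual daemon): write the pidfile first,
    # then tell P it may exit, then enter the main loop
    with open(pidfile, "w") as f:
        f.write(str(os.getpid()))
    os.write(w, b"1")
    os.close(w)
    try:
        run()
    finally:
        os._exit(0)
```

If the pidfile were written after P exits (or by P' instead of P''), systemd could race exactly as in the log above.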

That all said, it's 2022, maybe chrony should just use Type=notify and
sd_notify() like any modern code?
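
For comparison, the Type=notify readiness handshake is just a single datagram. Here is a minimal sketch of the sd_notify(3) wire protocol; real daemons should simply call sd_notify() from libsystemd instead:

```python
import os
import socket

def notify(state: str = "READY=1") -> bool:
    """Send one state datagram to the AF_UNIX socket named in
    $NOTIFY_SOCKET, as the sd_notify(3) protocol specifies."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False  # not started with Type=notify; silently no-op
    if addr.startswith("@"):
        # abstract-namespace socket: leading NUL byte instead of "@"
        dest = b"\0" + addr[1:].encode()
    else:
        dest = addr
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
        s.sendto(state.encode(), dest)
    return True
```

No forking, no pidfile, no race: the daemon announces readiness itself when it actually is ready.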

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Antw: Re: Antw: [EXT] Re: SOLVED: daemon-reload does not pick up changes to /etc/systemd/system during boot

2022-10-24 Thread Lennart Poettering
On Mo, 24.10.22 12:24, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote:

> >>> Andrei Borzenkov  schrieb am 24.10.2022 um 10:26 in
> Nachricht
> :
> > On Mon, Oct 24, 2022 at 9:48 AM Ulrich Windl
> >  wrote:
> >>
> >> >>> Alex Aminoff  schrieb am 21.10.2022 um 18:11 in 
> >> >>> Nachricht
> >> :
> >>
> >> ...
> >> > Just to close out this thread, I am happy to report that
> >> >
> >> > ExecStart=systemctl start --no-block multi-user.target
> >> >
> >> > worked great.
> >>
> >> Makes me wonder: How does systemd handle indirect recursive starts (like 
> >> the
> > one shown)?
> >>
> >
> > What do you call a "recursive start"? "systemctl start" simply tells
>
> starting multi-user.target via ExecStart=systemctl start starts all depending 
> units, and probably one of those starts the multi-user.target again.
> That's what I call recursive.

If you enqueue a unit for starting while it is already enqueued for
starting this has no effect.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-container: Trying to use a bookworm chroot with a buster host fails / Failed to create /init.scope control group

2022-10-20 Thread Lennart Poettering
On Mo, 17.10.22 01:38, Michael Biebl (mbi...@gmail.com) wrote:

> What are you Missing?

Come on, you have the original email:

https://lists.freedesktop.org/archives/systemd-devel/2022-October/048453.html

"What is mounted to /sys/fs/cgroup and below?"

"if you force container into cgroupsv1 mode as the host (by adding
systemd.unified_cgroup_hierarchy=no to the nspawn cmdline), does that
work?"

Also, please provide the relevant output from "strace -f -s 500 -y -o
/tmp/log.strace" (put on some pastebin)

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd.mount - How to disable the auto-creation of the directory (directories)

2022-10-20 Thread Lennart Poettering
On Do, 20.10.22 09:18, kAja Ziegler (ziegl...@gmail.com) wrote:

> Hello,
>
> Is there any way to turn off the automatic directory (directories) creation
> during mount unit start/run? To make the [auto-generated] mount unit behave
> the same as the mount command - to end with an error?

Add a .mount drop-in for your unit that sets AssertPathExists= to your
path in the [Unit] section.

i.e. create /etc/systemd/system/mnt-x.mount.d/50-myassert.conf, and
add:

[Unit]
AssertPathExists=/mnt/x

into it.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Antw: [EXT] Finding network interface name in different distro

2022-10-19 Thread Lennart Poettering
On Di, 18.10.22 16:03, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote:

> > When changing distro or distro major versions, network interfaces'
> > names sometimes change.
> > For example on some Dell server running CentOS 7 the interface is
> > named em1 and running Alma 8 it's eno1.
>
> Wasn't the idea of "BIOS device name" that the interface's name
> matches the label printed on the chassis?

Yes, but not all devices have the necessary firmware
metadata.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Finding network interface name in different distro

2022-10-19 Thread Lennart Poettering
On Di, 18.10.22 09:10, Greg Oliver (oliver.g...@gmail.com) wrote:

> On Fri, Oct 14, 2022 at 7:42 PM Etienne Champetier <
> champetier.etie...@gmail.com> wrote:
>
> > Hi All,
> >
> > When changing distro or distro major versions, network interfaces'
> > names sometimes change.
> > For example on some Dell server running CentOS 7 the interface is
> > named em1 and running Alma 8 it's eno1.
> >
> > I'm looking for a way to find the new interface name in advance
> > without booting the new OS.
> > One way I found is to unpack the initramfs, mount bind /sys, chroot,
> > and then run
> > udevadm test-builtin net_id /sys/class/net/INTF
> > Problem is that it doesn't give me right away the name according to
> > the NamePolicy in 99-default.link
> >
> > Is there a command to get the future name right away ?
> >
>
> I do not like the biosdevname introduced stuff for machines with 4 or less
> interfaces, so another option is to disable the auto-naming:
>
> biosdevname=0 net.ifnames=0

biosdevname is pretty much obsoleted by systemd's own network naming.

Usually, if you have more than a single interface you want the
systemd naming though, because otherwise probing order is
undefined and thus your "eth0" might sometimes be "eth1" and vice
versa...

> on the kernel cmdline will do it.  Also, the biosdevname package needs to
> be installed.  This will yield the traditional ethX, wlanX, etc interface
> names that are ordered by default the way they used to be.  Of course, this
> does not scale well when you have hotplug devices with many pci ports and
> ethernet cards if you ever need to replace one card.  Just my $.02

Uninstall biosdevname. It's 2022.

It's a bit contradictory to install it explicitly and then turn it off
via biosdevname=0...

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Finding network interface name in different distro

2022-10-19 Thread Lennart Poettering
On Di, 18.10.22 11:10, Etienne Champetier (champetier.etie...@gmail.com) wrote:

> > > I think I found what I need:
> > > bash-4.4# udevadm test /sys/class/net/em1 2>/dev/null | awk  -F=
> > > '/ID_NET_NAME=/ {print $2}'
> > > eno1
> >
> > The name depends on local and distro policy, systemd version,
> > kernel version and selected network naming scheme level (see
> > systemd.net-naming-scheme man page)
>
> When running in a chroot of the new system, only the kernel varies,
> we have the right policy, naming scheme level and systemd version.
> For "classic" amd64 servers does the kernel really have an impact on
> naming ?

As kernels are improved and developed they tend to expose more sysfs
attributes on devices, which the udev interface naming logic might
then pick up.

> > Use "udevadm info /sys/class/net/" to query the udev db for
> > automatically generated names.
> >
> > Relevant udev props to look out for are:
> >
> > ID_NET_NAME_FROM_DATABASE
> > ID_NET_NAME_ONBOARD
> > ID_NET_NAME_SLOT
> > ID_NET_NAME_PATH
> > ID_NET_NAME_MAC
> >
> > These using hwdb info, firmware info, slot info, device path info or
> > MAC addresss for naming.
>
> What I'm looking for is I think ID_NET_NAME,
> ie I don't want to read the policy myself and then go find the right
> ID_NET_NAME_*
> sadly ID_NET_NAME is not always present, so I don't have a good
> solution for now.
> (I'm assuming policy kernel can be ignored on amd64 servers, maybe
> I'm wrong)

udev will rename interfaces it finds based on the data in
ID_NET_NAME. If the ID_NET_NAME prop is never set, then udev won't
rename the interface.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] limiting NFS activity

2022-10-18 Thread Lennart Poettering
On Mo, 17.10.22 18:52, Weatherby,Gerard (gweathe...@uchc.edu) wrote:

> We have a requirement to limit / throttle the IO activity to an NFS mount for 
> a particular system slice. I’m trying to use cgroups v2
>
> Does IODeviceLatencyTargetSec work for NFS mounts?

No. That only works for block devices. NFS does not involve block devices.

> Does cgroups v2 support net_prio? Can I set it in a 
> /etc/systemd/system/*slice.d/*conf file?

No. cgroupsv2 does not support that.

I am not sure if NFS supports what you are trying to do at all, as
traffic generated by NFS is probably not attributed back to a process
and hence a cgroup. You might want to ask the NFS community about
that.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-container: Trying to use a bookworm chroot with a buster host fails / Failed to create /init.scope control group

2022-10-16 Thread Lennart Poettering
On So, 16.10.22 21:02, Michael Biebl (mbi...@gmail.com) wrote:

> Am So., 16. Okt. 2022 um 16:23 Uhr schrieb Lennart Poettering
> :
> >
> > On Fr, 14.10.22 22:57, Michael Biebl (mbi...@gmail.com) wrote:
> >
> > > Hi,
> > >
> > > since the issue came up on the Debian bug tracker at
> > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1019147 , I figured
> > > I ask here:
> >
> > Do you have any MACs in effect?
>
> No SELinux or Apparmor active
>
> > Does the host use cgroupsv1 or cgroupsv2 or hybrid? What is mounted to
> > /sys/fs/cgroup and below?
>
> The host system uses systemd v241, compiled with default-hierarchy=hybrid
>
>
> > Was the container configured to use either?
>
> The container uses systemd v251 with default-hierarchy=unified
>
> Trying to boot this container v251 container via systemd-nspawn leads to
>
> Welcome to Debian GNU/Linux bookworm/sid!
>
> Hostname set to .
> Failed to create /init.scope control group: Operation not permitted
> Failed to allocate manager object: Operation not permitted
> [!!] Failed to allocate manager object.
>     Exiting PID 1...
> Container test-bookworm failed with error code 255.

Please answer the questions I asked, otherwise not actionable...

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] user unit with delayed users homes mount - ?

2022-10-16 Thread Lennart Poettering
On Fr, 14.10.22 10:59, lejeczek (pelj...@yahoo.co.uk) wrote:

> Hi guys.
>
> I'm on Centos 8 S with systemd 239.
> Users homes are mounted at later (latest?) stage off NFS so when such a user
> logs in then:
>
> -> $ systemctl --user status -l xyz.service
> Unit xyz.service could not be found.
> -> $ systemctl --user daemon-reload
> -> $ systemctl --user status -l xyz.service
> ● xyz.service - Podman container-xyz.service
>    Loaded: loaded (/apps/appownia/.config/systemd/user/xyz.service; enabled;
> vendor preset: enabled)
>    Active: inactive (dead)
>  Docs: man:podman-generate-systemd(1)
>
> Is it possible and if so then how, to make "systemd" account for such a
> "simple" case - where home dir is net mounted very late?

I don't get this scenario. You talk to the systemd --user instance,
which is the per-user instance, so $HOME of that user should be
mounted at that time. But then you issue a reload and new stuff
appears and you appear to suggest that now the user's $HOME was
mounted?

So what now? Usually, the assumption is that first the user logs in,
which is the point where $HOME must be mounted at the latest, and then
systemd --user gets started off it and the user's login session is
allowed to begin.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-container: Trying to use a bookworm chroot with a buster host fails / Failed to create /init.scope control group

2022-10-16 Thread Lennart Poettering
On Fr, 14.10.22 22:57, Michael Biebl (mbi...@gmail.com) wrote:

> Hi,
>
> since the issue came up on the Debian bug tracker at
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1019147 , I figured
> I ask here:

Do you have any MACs in effect?

Does the host use cgroupsv1 or cgroupsv2 or hybrid? What is mounted to
/sys/fs/cgroup and below?

Was the container configured to use either?

This is new payload on old host?

if you force the container into cgroupsv1 mode as the host (by adding
systemd.unified_cgroup_hierarchy=no to the nspawn cmdline), does that
work?

Generally, systemd should discover everything on its own and just work
when run in an older container manager/cgroup environment. But it's
not something we would regularly test.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Finding network interface name in different distro

2022-10-16 Thread Lennart Poettering
On Fr, 14.10.22 22:24, Etienne Champetier (champetier.etie...@gmail.com) wrote:

> Le ven. 14 oct. 2022 à 20:41, Etienne Champetier
>  a écrit :
> >
> > Hi All,
> >
> > When changing distro or distro major versions, network interfaces'
> > names sometimes change.
> > For example on some Dell server running CentOS 7 the interface is
> > named em1 and running Alma 8 it's eno1.
> >
> > I'm looking for a way to find the new interface name in advance
> > without booting the new OS.
> > One way I found is to unpack the initramfs, mount bind /sys, chroot,
> > and then run
> > udevadm test-builtin net_id /sys/class/net/INTF
> > Problem is that it doesn't give me right away the name according to
> > the NamePolicy in 99-default.link
> >
> > Is there a command to get the future name right away ?
>
> I think I found what I need:
> bash-4.4# udevadm test /sys/class/net/em1 2>/dev/null | awk  -F=
> '/ID_NET_NAME=/ {print $2}'
> eno1

The name depends on local and distro policy, systemd version,
kernel version and selected network naming scheme level (see
systemd.net-naming-scheme man page)

Use "udevadm info /sys/class/net/" to query the udev db for
automatically generated names.

Relevant udev props to look out for are:

ID_NET_NAME_FROM_DATABASE
ID_NET_NAME_ONBOARD
ID_NET_NAME_SLOT
ID_NET_NAME_PATH
ID_NET_NAME_MAC

These use hwdb info, firmware info, slot info, device path info or
MAC address for naming.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] daemon-reload does not pick up changes to /etc/systemd/system during boot

2022-10-13 Thread Lennart Poettering
On Mi, 12.10.22 16:54, Alex Aminoff (amin...@nber.org) wrote:

> As soon as the system is up I can ssh in and run systemctl start autofs and
> it works just fine. In journalctl -b I can see my rc.initdiskless running
> followed by the daemon-reload. But no autofs and no evidence that systemd
> tried to start autofs.
>
> My only guess is that somehow daemon-reload is not enough because as far as
> systemd is concerned we already queued up for starting all the services
> needed by multi-user.target back when we switched root from the
> initrd.

daemon-reload just tells PID 1 to reload units, it has no direct effect on
the job queue, it won't enqueue any deps that might have been added.

You can issue "systemctl start --no-block multi-user.target" to
reenqueue multi-user.target again which will then also reenqueue all
its deps again, taking the new deps into consideration.

An alternative is to add an Upholds= type dep from
multi-user.target to your service. That (somewhat recently added) dep
type has a "continuous" effect: whenever a unit that has one or more
deps of this kind is up, the listed deps will be started if they are
not running. It means "systemctl stop" of a dependent service will be
immediately undone though, i.e. it has quite different semantics from
the usual Wants=.
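
Expressed as a drop-in, such an Upholds= dep could look roughly like this (a hedged sketch; the service name is a placeholder):

```ini
# /etc/systemd/system/multi-user.target.d/uphold-myservice.conf
# "myservice.service" is a placeholder for your actual unit
[Unit]
Upholds=myservice.service
```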

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-udevd -any way to list triggered rules with their files etc ?

2022-10-10 Thread Lennart Poettering
On Mo, 03.10.22 02:17, Branko (bran...@avtomatika.com) wrote:

> cat /etc/udev/rules.d/99-zz-network.rules:
> ACTION=="add", DRIVERS=="?*", ATTR{address}=="11:22:33:44:55:66",
> NAME="wlan17", OWNER="chosen_user", GROUP="chosen_group",
> MODE="0666"

ACTION=="add" is almost always the wrong expression, because devices
will be triggered via "change" and similar, and thus your props will
be dropped again once that happens.

You want ACTION!="remove" instead, i.e. match all "positive" events,
i.e. where the device is still there afterwards (which is
systematically different from "remove" where it isn't).

> I know it does get triggered, since after replugging the WIFi stick I
> do get "wlan17" interface.But resulting created device in
> /dev/bus/usb/00x/00y gets created with MODE=0640 and root:usb

As mentioned elsewhere, that's a usbfs file, not a netif. Network
interfaces have no ownership concept.

> I'm at a loss here. How is one supposed to get more detailed info on
> what's and WHY is going on with systemd-udevd tree processing ?

if you boot up with "debug" you should get tons of debug output to
wade through.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Attaching virtual session (e.g. SSH) to seat

2022-10-10 Thread Lennart Poettering
On Sa, 01.10.22 15:46, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> I am logging in on a PC using SSH and need to access some peripherals
> which are attached to seat0.
> loginctl shows that my session is not attached to any seat:
>
> SESSION  UID USER  SEAT TTY
>  50 1000 septatrix  pts/0
>
> The devices are added to the seat using udev rules
> and I explicitly want to avoid making the device world read-/writeable
> or adding it to a group.
> Reading through the man pages for systemd-logind, pam_systemd etc
> did not lead me anywhere helpful but only confirmed the fact
> that virtual sessions are not assigned any seat by default.
> However I was unable to find information on how it is determined
> if a session is "virtual" or whether it can be configured for 
> pam/logind/udev...

So in logind a "seat" is a way to group hw, and hw-bound sessions
hence associate with one of these "seats". Non-hw-bound sessions
don't. It's how this was designed.

There's simply no way to say "hey, I am non-hw-bound session with a
seat", and I am not convinced the usecase is convincing enough to add
such a concept.

So, you might get quite far via setting the XDG_SEAT env var in the PAM
session, but it's really a mess; I am pretty sure this will not work
properly, because it's not designed like that. I.e. multiple sessions
on the same seat are supposed to be session switchable, i.e. one in
the fg and all others in the bg, but any of them could be put in the
fg any time. That simply makes no conceptual sense if an SSH
session is in the mix.

Sorry if that's disappointing.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-repart help requested please

2022-10-10 Thread Lennart Poettering
On Mo, 03.10.22 22:04, bl...@baaa.sh (bl...@baaa.sh) wrote:

> Greetings to you all,
>
> I read through this
> https://0pointer.net/blog/fitting-everything-together.html, several
> times and was inspired to try build this
> https://0pointer.net/blog/images/partitions.svg, as an exercise to
> help me learn.
>
> I've got this https://gitlab.com/baaash/aio/-/blob/main/aio.org so
> far.  (probably worthy of a chuckle for some, but we all start
> somewhere right?)
>
> anyways, when I do a:
>
> sudo mkosi
>
> the image builds fine.  cool.
>
> when i boot into with systemd-nspawn I see no such growing of
> partitions, and furthermore am prompted to enter a new root
> password.

systemd-nspawn will boot the image for you. It will mount the file
systems in the image first, and it will issue the fsgrow ioctls on the
mounted file systems if that's requested in the GPT partition
flags. But it will *not* grow the partitions, that's what systemd-repart
can do for you. You may use the "--image=" switch to invoke it
directly on the disk image. You can also specify "--size=" to grow the
image file on disk first. (this will work only if you have some
suitable /usr/lib/repart.d/ drop-ins in place that tell repart what to
actually grow)
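
A minimal repart.d drop-in of the kind referred to might look like this (path and values are illustrative, not a definitive setup):

```ini
# /usr/lib/repart.d/50-root.conf — hypothetical example: let repart
# grow the root partition into the remaining disk space and mark the
# file system for growing on first mount
[Partition]
Type=root
GrowFileSystem=yes
```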

So, if you build an image with mkosi, you could then grow/complete it
with systemd-repart, and then boot it up with nspawn, and things
should just work.

> The password thing I assume is because I need to remove the
> reference in mkosi, and pass this to systemd-nspawn, as described in
> the systemd.firstboot man page...[edit:confirmed], But the repart
> thing has me stumped.

you can either provision a root pw:

1. in mkosi via the mkosi.rootpw file
2. at first boot by passing in a credential via nspawn's new
   --set-credential=passwd.hashed-password.root:… switch
3. at first boot interactively via systemd-firstboot.

the systemd-firstboot stuff is done only on first boot, and only if no
root pw has been configured yet. First boot is defined by whether
/etc/machine-id is initialized or not. Recent mkosi versions will
ensure that file is reset properly so that this works. (in fact, for
now I'd recommend working with git versions of mkosi)

> Asking in IRC it was pointed out systemd-repart should just work
> automatically provided the partition info was sitting in
> /usr/lib/repart.d directory, but that it needs no MachineID set in
> order to qualify as "first.boot".

Correct.

> I don't have one set but one is being created in the process.  I'm
> missing a piece to this puzzle.
>
> my eyes burn, my head hurts, and i'm no closer to understanding
> this, so i wondered if anyone on the list can succinctly explain
> this to me or perhaps provide a link to a basic working example i
> can try get my head around; provided of course, someone has already
> undertaken this exercise on their own, and wouldn't mind sharing.

Happy to help!

We should probably open a group chat somewhere for people who want to
build images like that. Since I am usually at home in Signal for
things like that, maybe we should open a chat room there for that?

(nah, not an IRC fan, not gonna return there, sorry)

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] prevent systemd-journald rotating message

2022-10-10 Thread Lennart Poettering
On Do, 06.10.22 10:10, d tbsky (tbs...@gmail.com) wrote:

> Hi:
> when I type "dmesg" I saw it is filled with systemd-journald
> rotating messages like below. is there a parameter to prevent the
> rotation warnings?
>
> [708993.589762] systemd-journald[515]: Data hash table of
> /run/log/journal/93f434f608654cf990b2e70c656dfacd/system.journal has a
> fill level at 75.1 (3283 of 4373 items, 2519040 file size, 767 bytes
> per hash table item), suggesting rotation.
> [708993.590947] systemd-journald[515]:
> /run/log/journal/93f434f608654cf990b2e70c656dfacd/system.journal:
> Journal header limits reached or header out-of-date, rotating.

No, we have no concept of turning off individual log messages.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Is it possible to let systemd create a listening socket and yet be able to have that socket activate nothing, at least temporarily?

2022-10-10 Thread Lennart Poettering
On Fr, 07.10.22 07:24, Klaus Ebbe Grue (g...@di.ku.dk) wrote:

> Hi systemd-devel,
>
> I have a user question which I take the liberty to send here since
> "about systemd-devel" says "... it's also OK to direct user
> questions to this mailing list ...".
>
> I have a daemon, /usr/bin/mydaemon, which listens on one and only
> one TCP port, say , and which does no more than communicating
> over  and creating, reading, writing and deleting files in
> /home/me/mydaemon/.
>
> Mydaemon leaves it to systemd to create a socket which listens at
> .
>
> It is unimportant whether or not mydaemon is started at boot and it
> is also unimportant whether or not mydaemon is socket activated. As
> long as it is at least one of the two.
>
> Now I want to upgrade mydaemon to a new version using a script,
> without race conditions and without closing the listening socket. I
> want the listening socket to stay open since otherwise there can be
> a one minute interval during which it is impossible to reopen .
>
> If it is just a clean upgrade, the script could replace
> /usr/bin/mydaemon, then stop mydaemon. If the daemon is socket
> activated there is no more to do. If the daemon is activated only on
> boot then the script must end up restarting mydaemon.
>
> But now I want to do some more while mydaemon is not running. It
> could be that my script should take a backup of /home/me/mydaemon/
> in case things go wrong. It could be the script should translate
> some file in /home/me/mydaemon/ to some new format required by the
> new mydaemon or whatever.
>
> So I need to stop mydaemon in such a way that mydaemon cannot wake
> up while my script fiddles with /home/me/mydaemon/.
>
> According to https://0pointer.de/blog/projects/three-levels-of-off
> it seems that that was possible in 2011: just do "systemctl disable
> mydaemon.service". But when I try that, mydaemon still wakes up if I
> connect to  using eg netcat.

Well, that's a misunderstanding...

> I have also tried to mask mydaemon. But if I then connect to 
> using netcat, then netcat gets kicked off. And if I try again then
>  is no longer listening.
>
> QUESTION: Is it possible to let systemd create a listening socket
> and yet be able to have that socket activate nothing, at least
> temporarily?

Can't you run your upgrade script in an idempotent way as a helper
service that is pulled in by your main daemon and ordered before it,
but conditions itself out if it already did its job? That's usually
the most robust way, since then it's sufficient to just restart your
daemon or reboot, and everything will always catch up correctly.

i.e. if you have foo-daemon.socket + foo-daemon.service then define
foo-upgrade.service that is pulled in from foo-daemon.service via
`Wants=foo-upgrade.service` + `After=foo-upgrade.service`. And then
add `ConditionFileExists=!/some/touch/file` to `foo-upgrade.service` to
make it a NOP if things have already been updated, using a touch
file. (some better, smarter condition check might work as well, see
man pages of things systemd can check for you).
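
Sketched out, the two units could look roughly like this (a hedged illustration; all names and paths are placeholders):

```ini
# foo-upgrade.service — runs the migration once, before the daemon
[Unit]
Description=One-time data migration for foo-daemon
ConditionFileExists=!/var/lib/foo/.upgraded
Before=foo-daemon.service

[Service]
Type=oneshot
ExecStart=/usr/libexec/foo-upgrade
ExecStartPost=/usr/bin/touch /var/lib/foo/.upgraded
```

```ini
# foo-daemon.service.d/upgrade.conf — drop-in pulling the helper in
[Unit]
Wants=foo-upgrade.service
After=foo-upgrade.service
```

With this in place, foo-daemon.socket keeps listening throughout, and any (re)start of foo-daemon.service transparently runs the migration first if it hasn't happened yet.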

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Connect /usr/bin/init to docker container's STDOUT/STDIN

2022-09-30 Thread Lennart Poettering
On Do, 29.09.22 19:42, Nicola Mori (nicolam...@aol.com) wrote:

> So I believe this problem might have been introduced by a systemd version
> subsequent to 219 and that hopefully it might be fixed somehow by means of
> e.g. proper configuration of the container/environment, but I need some
> advice about what to do since I'm clueless.

Docker is explicitly anti-systemd, you'll always have a hard time
making this work.

Note that for a while now we close /dev/console in PID 1
whenever we can, and only open it immediately before printing stuff to
the console, for compatibility with the kernel's SAK feature, which
would otherwise kill PID 1 if SAK is hit.

Thus you really need to pass a proper pty into the container as
/dev/console, if you want systemd to run inside it.

We documented our expectations clearly here:

https://systemd.io/CONTAINER_INTERFACE

Pretty much all container managers implement this more or less. Just
Docker does not...

You might be able to replace docker with podman, where supposedly all
this just works out of the box.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] networkd D-Bus API for link up/down?

2022-09-22 Thread Lennart Poettering
On Mi, 21.09.22 06:48, Kevin P. Fleming (ke...@km6g.us) wrote:

> When the D-Bus API for systemd-networkd was added there were
> indications that it could be used for bringing links up and down.
> However, when I review the API docs at:
>
> https://www.freedesktop.org/software/systemd/man/org.freedesktop.network1.html#
>
> I don't see any methods for doing those operations. networkctl uses
> netlink messages for these operations as well.
>
> I want to create a cluster resource agent for Pacemaker which can
> manage networkd links, and using D-Bus would be easier than using
> netlink since there is already D-Bus support in the resource agent for
> systemd units.

This is currently not available. But do note that you can use
ActivationPolicy= in a .network file and then simply toggle the IFF_UP flag
on the net device, and networkd is happy.

If you don't want to bother with rtnetlink for that you could even use
the old BSD ioctls, i.e. SIOCSIFFLAGS.
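
A sketch of toggling IFF_UP via those old BSD-style ioctls. The ioctl numbers are the Linux values from asm/sockios.h (an assumption of this sketch: Linux only), and actually setting the flags needs CAP_NET_ADMIN:

```python
import fcntl
import socket
import struct

SIOCGIFFLAGS = 0x8913  # get interface flags (Linux, <asm/sockios.h>)
SIOCSIFFLAGS = 0x8914  # set interface flags
IFF_UP = 0x1           # interface is administratively up (<net/if.h>)

def get_flags(ifname: str) -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # struct ifreq: 16-byte interface name followed by a short of flags
        ifr = struct.pack("16sh", ifname.encode(), 0)
        res = fcntl.ioctl(s.fileno(), SIOCGIFFLAGS, ifr)
        return struct.unpack("16sh", res)[1]

def set_link_up(ifname: str, up: bool = True) -> None:
    """Toggle IFF_UP, i.e. the equivalent of `ip link set <if> up/down`.
    Requires CAP_NET_ADMIN; with ActivationPolicy= configured in the
    .network file, networkd will leave the resulting state alone."""
    flags = get_flags(ifname)
    flags = (flags | IFF_UP) if up else (flags & ~IFF_UP)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        ifr = struct.pack("16sh", ifname.encode(), flags)
        fcntl.ioctl(s.fileno(), SIOCSIFFLAGS, ifr)
```

Reading flags (e.g. `get_flags("lo")`) works unprivileged; only `set_link_up()` needs the capability.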

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] path service ExecStart arguments

2022-09-22 Thread Lennart Poettering
On Mi, 21.09.22 08:54, Ted Toth (txt...@gmail.com) wrote:

> Is info about what changed (i.e. the name of the file created in the
> directory) available to a path service ExecStart process? If so, how
> does a service access the info?

This is generally not available on released versions of
systemd. Current git main added some limited support for passing this
in via env var, but this is useful for debugging only really, since
multiple events can result in a single service invocation, and thus
you lose events.

Usually if you want this information for anything more than debugging,
then things should be implemented differently, i.e. you get called and
then scan yourself what is in the directory you watch. That makes
things robust towards lost events.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Compile Standalone binaries

2022-09-22 Thread Lennart Poettering
On Mi, 21.09.22 20:27, Caleb M. Hurley (hurleymca...@protonmail.com) wrote:

> Having trouble compiling the standalone binaries for systemd; setup
> a question at
> https://unix.stackexchange.com/questions/718163/trouble-compiling-systemd-standalone-binaries

Binaries of what precisely?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] boot-complete.target dependencies issue

2022-09-17 Thread Lennart Poettering
On Fr, 16.09.22 10:10, Antonio Murdaca (run...@redhat.com) wrote:

> Hi, following
> https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT/#how-to-adapt-this-scheme-to-other-setups
> I've been experimenting on a fedora system
> with systemd-boot-check-no-failures.service and the ability to have
> services run "after" boot-complete.target. The basic use case would just to
> have something that checks services are up and running, reach boot-complete
> if they are, and start other services afterwards.
> I've taken from that blog this piece specifically:
> ```
> To support additional components that shall only run on boot success,
> simply wrap them in a unit and order them after boot-complete.target,
> pulling it in.
> ```
> So I've done the following with an example service and by enabling :
>
> # cat /etc/systemd/system/test.service
> [Unit]
> Description="Order after boot-complete.target, pulling it in"
> After=boot-complete.target
> Requires=boot-complete.target
>
> [Service]
> Type=oneshot
> ExecStart=/usr/bin/echo "Additional component that shall only run on boot
> success"
> RemainAfterExit=yes
>
> [Install]
> WantedBy=default.target
>
> # systemctl enable test.service  systemd-boot-check-no-failures.service
> Created symlink /etc/systemd/system/default.target.wants/test.service →
> /etc/systemd/system/test.service.
> Created symlink
> /etc/systemd/system/boot-complete.target.requires/systemd-boot-check-no-failures.service
> → /usr/lib/systemd/system/systemd-boot-check-no-failures.service.
>
> # systemctl reboot
>
> Unfortunately, the above results in:
>
> systemd[1]: multi-user.target: Found ordering cycle on test.service/start
> systemd[1]: multi-user.target: Found dependency on
> boot-complete.target/start
> systemd[1]: multi-user.target: Found dependency on
> systemd-boot-check-no-failures.service/start
> systemd[1]: multi-user.target: Found dependency on multi-user.target/start
> systemd[1]: multi-user.target: Job test.service/start deleted to break
> ordering cycle starting with multi-user.target/start
>
> so what's the correct way to perform the mentioned "order [units] after
> boot-complete.target", if they cannot be pulled in through the usual
> default/multi-user targets? If I add DefaultDependencies=no to test.service
> it now appears to work w/o the dependency cycle.

It should suffice adding After=multi-user.target to your service.

The thing is that systemd-boot-check-no-failures.service runs late,
after the startup transaction is done, to check if everything
succeeded. But now you want to run something more, so by default
s-b-c-n-f.s would also want to run after that, to know if it
succeeded. But that of course makes little sense: the output of
your service cannot be part of the input of s-b-c-n-f.s if your
service should run after s-b-c-n-f.s!

So, my recommended fix: add After=multi-user.target to your
service. Note that systemd handling of .wants/ works like this:

1. add Wants= type dep
2. if no After=/Before= dep is set, then also add Before=

This means, that just adding an explicit After=multi-user.target to
your service means rule #2 won't take effect anymore.

With that in place things should just work (untested, but afaics), as
it means s-b-c-n-f.s can run after multi-user.target, and then
boot-complete.target after that, and then finally your service.
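Concretely, the suggested fix could look like this (untested sketch, reusing the unit from the example earlier in the thread; adding the explicit After= means rule #2 no longer adds an implicit Before=default.target):

```ini
# test.service — hypothetical unit from the example above, with
# multi-user.target added to After= so the .wants/ rule #2 no longer
# inserts an implicit Before= toward the pulling target
[Unit]
After=boot-complete.target multi-user.target
Requires=boot-complete.target

[Service]
Type=oneshot
ExecStart=/usr/bin/echo "Additional component that shall only run on boot success"
RemainAfterExit=yes

[Install]
WantedBy=default.target
```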

Does that make sense?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Real-time scheduling doesn't work with StartupCPUWeight/CPUWeight

2022-09-17 Thread Lennart Poettering
On Mi, 14.09.22 15:03, Robert Tiemann (r...@gmx.de) wrote:

> Hi!
>
> I have optimized boot times for an embedded system by setting
> StartupCPUWeight= and CPUWeight= for a few services. The startup
> values are set to various values. All unit files I have touched also
> contain the line "CPUWeight=100" so that the system is running with
> defaults after startup. Some unit files contain Nice= assignments
> (placed there before my optimizations, so I kept them in place).
>
> Now, the problem is with one process in the system which requires
> real-time priorities. It calls pthread_setschedparam() to configure
> two of its threads for SCHED_RR policy at priority 99. This used to
> work before my optimizations, but now pthread_setschedparam() fails
> with EPERM error. I have added LimitRTPRIO=infinity, but it still
> doesn't work. The threads are created and configured after the startup
> phase has finished.

Please consult README, look for comment on CONFIG_RT_GROUP_SCHED=n.
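The README comment in question concerns kernels built with CONFIG_RT_GROUP_SCHED, under which realtime scheduling fails with EPERM for processes in cgroups without an explicit RT budget. A hedged way to check how the running kernel was built (neither config location is guaranteed to exist on every distro):

```shell
# Print the kernel's CONFIG_RT_GROUP_SCHED setting, or "unknown" if no
# build config is exposed; /boot/config-$(uname -r) and /proc/config.gz
# are the two usual locations, neither is guaranteed to be present.
cfg="CONFIG_RT_GROUP_SCHED: unknown"
f="/boot/config-$(uname -r)"
if [ -r "$f" ]; then
    cfg=$(grep -E '^CONFIG_RT_GROUP_SCHED=' "$f" || echo "CONFIG_RT_GROUP_SCHED is not set")
elif [ -r /proc/config.gz ]; then
    cfg=$(zcat /proc/config.gz | grep -E '^CONFIG_RT_GROUP_SCHED=' || echo "CONFIG_RT_GROUP_SCHED is not set")
fi
echo "$cfg"
```

If this prints `=y`, the README's advice applies and the EPERM from pthread_setschedparam() is expected unless an RT runtime is allocated to the service's cgroup.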

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] /run/systemd/propagate/example.service deletion

2022-09-15 Thread Lennart Poettering
On Mo, 12.09.22 08:13, Stefan Catargiu (stefan.catar...@gmx.de) wrote:

> Hello all,
>
> I have noticed that when using certain sandboxing features for units, e.g. 
> ProtectHome,
> a directory will get created in /run/systemd/propagate with the name of the 
> service,
> e.g. /run/systemd/propagate/example.service, which systemd is then using for 
> certain bind mounts.
>
> Now, the thing is, that directory is never going to be deleted after the 
> service stops,
> which is all good, after all /run is a tmpfs, but this is becoming slightly 
> problematic
> when using instantiated services, you can end up with large numbers of 
> directories
> under /run/systemd/propagate.
>
> I have seen some extreme cases where /run runs out of inodes because of this.
> One extreme example : way too many directories are created under 
> /run/systemd/propagate when a lot
> of coredumps are generated on a system which uses systemd-coredump.
> You will have one instantiated unit per coredump, hence a directory like
> /run/systemd/propagate/systemd-coredump@1-1234-0.service is going to be 
> created and so on.
>
> All things considered, shouldn’t these directories be deleted after a service 
> stops?

This is probably a bug. Can you please file an issue on the systemd
GitHub about this?

https://github.com/systemd/systemd/issues/new?assignees=&labels=bug+%F0%9F%90%9B&template=bug_report.yml

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] socket activation socket context when using SELinuxContextFromNet

2022-09-14 Thread Lennart Poettering
On Mo, 12.09.22 13:26, Ted Toth (txt...@gmail.com) wrote:

> I've been looking at the issue of systemd setting the socket
> activation socket context to init_t when using SELinuxContextFromNet.
> My initial thought was to use the port context set by running semanage
> and compute the socket context using a type transition for the port
> type to a socket type. However after consulting the selinux community
> the consensus is not to do this but rather to simply use the target
> executables context. Currently systemd does compute the executables
> context when SELinuxContextFromNet is not used. Can anyone explain why
> the computed executables context is not used when
> SELinuxContextFromNet is set?

The SELinux hookup originally came from SELinux people. These are
questions only SELinux people really can answer.

If you think the SELinux code in systemd should work differently,
please file a PR changing it, and get a review/blessing from the
SELinux people and we'll basically merge anything that codewise looks
OK.

Don't assume we as systemd people would also be SELinux people with a
deep understanding how SELinux should operate. We are generally not.

Sorry, if that's disappointing.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-network and loopback

2022-09-09 Thread Lennart Poettering
On Fr, 09.09.22 14:45, Andrea Pappacoda (and...@pappacoda.it) wrote:

> Hi all,
>
> yesterday I was playing a bit with systemd-network, and I noticed that it is
> possible for it to manage the loopback interface. Is it useful in any way?
> Should the loopback interface be managed in systems where sd-network is the
> only program managing interfaces (like my desktop pc)?
>
> I tried looking at systemd.network(5), but that didn't seem to answer my
> question.

People sometimes route stuff onto the loopback device in addition to
the usual 127.0.0.0/8 traffic so that it ends up on local sockets.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] socket activation selinux context on create

2022-09-02 Thread Lennart Poettering
On Fr, 02.09.22 09:04, Ted Toth (txt...@gmail.com) wrote:

> I have set the type for the port in question using the 'semanage port'
> command so the loaded policy has a type which systemd should use when
> calling setsockcreatecon. It is my opinion that
> socket_determine_selinux_label function should query policy for the
> port type and if it has been set use it and if not fallback to its
> current behavior.

Sure, patch very welcome.

SELinux code really requires external contributions, none of the core
developers know SELinux too well to do feel confident to implement
that.

(consider filing an RFE issue on github, so that this is tracked)

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] socket activation selinux context on create

2022-08-26 Thread Lennart Poettering
On Do, 25.08.22 14:46, Ted Toth (txt...@gmail.com) wrote:

> I've tested setting the type of the port using semanage port -a
> however when I start the service netstat still shows the type as
> init_t. I don't know of any other way to get a type transition of a
> socket to happen, do you?. I've also posted to the selinux list but
> haven't gotten any responses yet.

Uh, that's a question for the selinux people. I only have a limited
insight into selinux, and wouldn't know how to do such things.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Ordering units and targets with devices

2022-08-25 Thread Lennart Poettering
On Do, 25.08.22 10:50, Michael Cassaniti (mich...@cassaniti.id.au) wrote:

> It seems to be somewhat more complicated than that, and perhaps it has more
> to do with my setup. Here's my /etc/crypttab which just might explain a bit:
>
>     # Mount root and swap
>     # These will initially have an empty password
>     root /dev/disk/by-partlabel/root - 
> fido2-device=/dev/yubico-fido2,token-timeout=0,try-empty-password=true,x-initrd.attach
>     swap /dev/disk/by-partlabel/swap - 
> fido2-device=/dev/yubico-fido2,token-timeout=0,try-empty-password=true,x-initrd.attach
>
> I think the fact that both of these get setup at boot and will concurrently
> try to access the FIDO2 token is causing issues. That crypttab is included
> in the initrd.

There was an issue with concurrent access to FIDO2 devices conflicting
with each other. This was addressed in libfido2 though, it will now
take a BSD lock on the device while talking to it, thus synchronizing
access properly.

See this bug:

https://github.com/systemd/systemd/issues/23889

Maybe it's sufficient to update libfido2 on your system?


Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Ordering units and targets with devices

2022-08-25 Thread Lennart Poettering
On Mi, 17.08.22 13:23, Michael Cassaniti (mich...@cassaniti.id.au) wrote:

> Hi,
>
> I'm trying to order my units and targets during early boot so that:
> 1. A symlink to the specific FIDO2 token I'm using gets created. I already
> have a udev rule in place for this and it successfully creates the symlink
> under /dev. Because I have two tokens I need to specify which one to use.
> 2. The unit for systemd-cryptsetup@root.service has to wait for this unit.
> The unit gets generated from systemd-cryptsetup-generator so I can't just
> add Requires= stanzas to the unit. I do have a /etc/crypttab file.

systemd-cryptsetup can wait on its own for a FIDO2 token, no need to
do that with unit deps?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Are logs at /run/log/journal automerged?

2022-08-25 Thread Lennart Poettering
On Mo, 22.08.22 13:02, Yuri Kanivetsky (yuri.kanivet...@gmail.com) wrote:

> Hi,
>
> I'm experiencing this on Digital Ocean. The machine id there changes
> (which I think shouldn't happen) on the first boot (supposedly by
> cloud-init).

The machine ID may change during the initrd to host-fs
transition. Otherwise that's not OK though.

When the logs from /run/ are flushed to /var/ they are all merged
together into one.

By default journalctl will show logs associated with the current
machine ID and those associated with the current boot ID. The latter
should usually ensure that logs from the initrd phase are shown as
well if it has a different machine ID.

> In Ubuntu 22.04 droplets, where logs are stored at
> /var/log/journal, that leads to journalctl outputting no records
> (because the log for the new machine-id has not been created), unless
> I pass --file or --merge. Also, the records continue to be added to
> the old log (for the old machine id).
>
> In CentOS 9 droplets, where logs are stored at /run/log/journal,
> journalctl outputs records from all 3 files:
>
> cb754b7b85bb42d1af6b48e7ca843674/system.journal
> 61238251e3db916639eaa8cd54998712/system@6600bdad291b419c8a0b1fea2564c472-0001-0005e6d123825866.journal
> 61238251e3db916639eaa8cd54998712/system.journal
>
> In this case records also are being added to the old log. But the new
> log somehow contains the beginning of the log (starting with boot).
>
> Is my guess correct? Logs at /run/log/journal are automerged, logs at
> /var/run/journal aren't.

As mentioned above, when the logs are flushed from /run/ to /var/ in
systemd-journal-flush.service they are merged into one new journal
file, which is located in the machine ID subdir of the actual machine
ID of the system.
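As an illustration of that flush-and-merge layout, here is a toy sketch: plain text files stand in for real journal files, using the two machine IDs from the listing above; on a real system this work is done by systemd-journal-flush.service (or journalctl --flush), not by cat.

```shell
# Toy model of the /run/log/journal → /var/log/journal flush: entries
# from all machine-ID subdirs under /run end up merged into one file
# under the *current* machine ID in /var.
run=$(mktemp -d); var=$(mktemp -d)
initrd_id=cb754b7b85bb42d1af6b48e7ca843674   # machine ID seen in the initrd
host_id=61238251e3db916639eaa8cd54998712     # actual machine ID of the host
mkdir -p "$run/$initrd_id" "$run/$host_id" "$var/$host_id"
echo "entry from initrd" > "$run/$initrd_id/system.journal"
echo "entry from host"   > "$run/$host_id/system.journal"
# the flush merges both into one journal under the current machine ID:
cat "$run/$initrd_id/system.journal" "$run/$host_id/system.journal" \
    >> "$var/$host_id/system.journal"
cat "$var/$host_id/system.journal"
```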

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] socket activation selinux context on create

2022-08-25 Thread Lennart Poettering
On Mi, 24.08.22 11:50, Ted Toth (txt...@gmail.com) wrote:

> I don't see a way to set the context of the socket that systemd
> listens on. If there is a way to do this please tell me otherwise I'd
> like to see an option (SELinuxCreateContext?) added to be able to set
> the context (setsockcreatecon) to be used by systemd when creating the
> socket. Currently as an extra layer of security I add code called in
> the socket activation ExecStartPre process to check that the source
> context (peercon) can connect to the target context (getcon). If a
> sockets context was set by systemd I would have to perform this
> additional check as my SELinux policy would do it for me.

This was proposed before, but SELinux maintainers really want that the
loaded selinux policy picks the label, and not unit files.

i.e. as I understand their philosophy: how labels are assigned should
be encoded in the database and in the policy but not elsewhere,
i.e. in unit files. I think that philosophy does make sense.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] nfs-convert.service

2022-08-22 Thread Lennart Poettering
On Fr, 19.08.22 11:21, Steve Dickson (ste...@redhat.com) wrote:

> Hello,
>
> I'm trying to remove nfsconvert from Fedora but I'm
> getting the following systemd error after I removed
> the command and the service file.
>
> # systemctl restart nfs-server
> Failed to restart nfs-server.service: Unit nfs-convert.service not
> found

This is expected if you remove the file first?

> There is nothing in the nfs-utils files that
> has that service in it... and when I do a
>
> systemctl list-dependencies --all | grep -1 nfs-convert
>
> I see every nfs related service dependent on nfs-convert.service

Did you issue "systemctl daemon-reload"?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] What is the shutdown sequence with systemd and dracut?

2022-08-15 Thread Lennart Poettering
On Sa, 13.08.22 20:46, Patrick Schleizer (patrick-mailingli...@whonix.org) 
wrote:

> 2. /lib/systemd/system-shutdown (shutdown.c) runs

Still, this binary is called systemd-shutdown, i.e. with one more 'd'.

> 4. /lib/systemd/system-shutdown performs further cleanup (similar to
> dracut, probably some functionality duplicated with dracut, includes
> kill all remaining processes, unmount remaining file systems)

I am not sure dracut has another killing spree.

> 6. /run/initramfs/shutdown (which is at time of writing only implemented
> in dracut) attempts to kill all remaining processes, unmount remaining
> file systems and calls kernel.

I think the arch initrd also implements this scheme. And IIRC they use
a neat trick, and chainload systemd-shutdown from the initrd, so that
it runs again, and does the actual final shutdown, but that time
without transitioning back into an initrd env. Hence for them PID 1 during
shutdown first transitions from the service manager into
systemd-shutdown, and then from there into the initrd script, and
then back into systemd-shutdown. I like their approach.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Antw: [systemd‑devel] Antw: [EXT] What is the shutdown sequence with systemd and dracut?

2022-08-15 Thread Lennart Poettering
On Mo, 08.08.22 14:54, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote:

> >> 1. systemd runs systemd units for systemd shutdown.target
> >>
> >> 2. /lib/systemd/system‑shutdown (shutdown.c) runs
> >>
> >> 3. /lib/systemd/system‑shutdown executes /run/initramfs/shutdown (which
> >> is dracut)
> >>
> >> 4. dracut shutdown.sh performs various cleanup tasks (such as kill all
> >> remaining processes and unmount root disk)
> >
> > If dracut unmounts the root disk, the following /usr and /lib mist the in
> > initrd, right?
>
> Sorry: s/mist the in/must be in the"

systemd-shutdown actually pivots the rootdir into the /run/initramfs
subdir, when invoking the initrd shutdown script. Thus at that point
all fs paths refer to subdirs below /run/initramfs.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] What is the shutdown sequence with systemd and dracut?

2022-08-15 Thread Lennart Poettering
On Mo, 08.08.22 12:24, Patrick Schleizer (patrick-mailingli...@whonix.org) 
wrote:

> Hi!
>
> This is what I think but please correct me if I am wrong.
>
> 1. systemd runs systemd units for systemd shutdown.target
>
> 2. /lib/systemd/system-shutdown (shutdown.c) runs

I presume you mean /usr/lib/systemd/systemd-shutdown? (i.e. there's a
*d* in the file name; and the path outside of /usr/ is only done by
legacy distros, who still stick to split /usr/ setups, which we do not
support anymore)

> 3. /lib/systemd/system-shutdown executes /run/initramfs/shutdown (which
> is dracut)

systemd-shutdown runs as PID 1 at this time, and it then chain loads
/run/initramfs/shutdown also as PID1 – if it exists. Thus, at that
moment no systemd code runs anymore, dracut is the only userspace code
remaining.

> 4. dracut shutdown.sh performs various cleanup tasks (such as kill all
> remaining processes and unmount root disk)

It is systemd-shutdown, between steps 2 and 3 above, that
should have already killed everything. But yeah, dracut is supposed to
detach the root fs.

> 5. /lib/systemd/system-shutdown runs scripts in the
> /usr/lib/systemd/system-shutdown/ folder

This is actually done before step 3 above.

> 6. /lib/systemd/system-shutdown performs further cleanup (similar to
> dracut, probably some functionality duplicated with dracut, includes
> kill all remaining processes, unmount the root disk) and eventually
> halt/reboot/poweroff/kexec.

Nah, the killing of processes it already did between steps 2 and
3. Also, as mentioned systemd-shutdown doesn't run at this time anymore.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-nspawn container not starting on RHEL9.0

2022-08-10 Thread Lennart Poettering
On Mi, 10.08.22 10:13, Thomas Archambault (t...@tparchambault.com) wrote:

> Thank you again Lennart, and thx Kevin.
>
> That makes total sense, and accounts for the application's high level
> start-up delay which appears to be what we are stuck with if we are over
> xfs. Unfortunately, it's difficult to dictate to the client to change their
> fs type, consequently we can't develop / ship a tool with that baseline
> latency on our primary target platform (RHEL xx.)
>
> So the next obvious question would be, is XFS reflink support on the
> systemd-nspawn roadmap or actually, (and even better) has support been
> incorporated already in the latest and greatest src and I'm just behind the
> curve working with the older version of nspawn as shipped in RHEL90?
>
> I'm asking because according to the RHEL 9 docs 
> (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/managing_file_systems/index#the-xfs-file-system_assembly_overview-of-available-file-systems)
> it's the current default fs and is configured for "Reflink-based file
> copies."

We issue copy_file_range() syscall, which should do reflinks on xfs,
if it supports that. Question is if your kernel supports that too. I
have no experience with xfs though, no idea how xfs hooked up reflink
initially. And we never tested that really. I don't think outside RHEL
many people use xfs.

If you provide a more complete strace output, you should see the
copy_file_range() stuff there.
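A cheap way to probe whether reflinks work on a given filesystem is GNU cp: with --reflink=always it only succeeds if the filesystem can clone extents, while --reflink=auto silently falls back to a plain data copy. Note this probes the clone path cp itself uses, not nspawn's copy_file_range() call, so it is only an approximate indicator:

```shell
# Probe reflink support on the filesystem backing $TMPDIR:
# cp --reflink=always fails unless extents can actually be cloned
# (btrfs, xfs formatted with reflink=1, ...).
src=$(mktemp); dst=$(mktemp)
head -c 4096 /dev/urandom > "$src"
if cp --reflink=always "$src" "$dst" 2>/dev/null; then
    echo "reflink copy worked"
else
    cp --reflink=auto "$src" "$dst"   # fallback: ordinary full copy
    echo "no reflink support here, fell back to a full copy"
fi
cmp -s "$src" "$dst" && echo "contents identical"
```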

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-nspawn container not starting on RHEL9.0

2022-08-10 Thread Lennart Poettering
On Di, 09.08.22 12:40, Thomas Archambault (t...@tparchambault.com) wrote:

> Thank you Lennart for the follow-up.
>
> There does appear to be mostly filesystem operations prior to my manually
> killing nspawn as you suggested. I only let it run about 3 minutes prior to
> sending a signal given that the strace output = ~25M.
>
> One obvious issue is the non-zero return from an ioctl call with the
> BTRFS_IOC_SUBVOL_CREATE arg at line 410, in the snippet below from my
> RHEL9.0 strace capture; this is occurring right after the initial blast of
> debug log messages. I'm trying to get a stack trace for that error
> currently.
>
>
> 410-2064 ioctl(5, BTRFS_IOC_SUBVOL_CREATE, {fd=0,
> name=".#machine.c8578d59f810b73d"}) = -1 ENOTTY (Inappropriate ioctl for
> device)

That's the btrfs subvolume ioctl. It's expected to fail on non-btrfs
with ENOTTY, and given you have xfs this is behaving as it should.

It then starts copying things manually, which is slow. i.e. it's then
basically doing what "cp -a" does.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-nspawn container not starting on RHEL9.0

2022-08-04 Thread Lennart Poettering
On Do, 04.08.22 13:30, Thomas Archambault (t...@tparchambault.com) wrote:

> Following up on xfs and reflinks, it appears they are enabled on my
> out-of-box RHEL9.0. Fwiw, this is a VBox VM however so if the FC34 system
> which works correctly, but is using btrfs.
>
> As always, appreciate any help/references.

Try straceing nspawn, to see what it does.

strace -f -y -s 500 -o /tmp/nspawnstrace.log systemd-nspawn …

Then look at the generated log and see what it is busy doing... If
unsure, paste things somewhere.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-nspawn container not starting on RHEL9.0

2022-08-04 Thread Lennart Poettering
On Mi, 03.08.22 15:40, Thomas Archambault (t...@tparchambault.com) wrote:

> Good day everyone on the dev list,
> We are adding an analysis tool to our application that uses the host's
> rootfs as one of its inputs.
>
> As a proof of concept, we used systemd-nspawn on Fedora 34 to create an
> isolated container environment using the host's rootfs as the container's
> rootfs and things worked correctly and as expected. The host's rootfs is
> analyzed with tmp and results files generated within the container without
> persistent modifications affecting the host's rootfs. Since RHEL is our
> ultimate target platform, I've been trying to duplicate our work over
> RHEL9.0 without success with the container not being instantiated.
>
> I've tried to boil down the duplication code to the simplest example, which
> is also an example in the man page $ sudo systemd-nspawn -xbD/. As with my
> prototyping, the container does not seem to be instantiated.
> Any help with troubleshooting, or specific known issues, or requests for
> more data would be appreciated.

"-x" is ephemeral mode. This means nspawn will make a copy of the OS
tree before booting into it, and remove it afterwards.

"-x" on btrfs is very fast and space efficient, because btrfs supports
both snapshots and reflinks. nspawn will make a subvol snapshot if the
root you specify is a subvol. It will make reflink-based file copies
otherwise.

Other file systems have a more 1990's feature set, i.e. no reflinks
nor snapshots. (Modern xfs on very new kernels can support reflinks if
this is opted in to.) In that case we have to copy the data files
with their contents, and that's slow.

Hence, what backing fs do you use?

if you use non-btrfs it might hence simply be that we are busy
individually copying all files...

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] How can we debug systemd-gpt-auto-generator failures?

2022-07-28 Thread Lennart Poettering
On Do, 28.07.22 07:40, Kevin P. Fleming (ke...@km6g.us) wrote:

> Thanks for that, it did indeed produce some output, but unfortunately
> it doesn't seem to lead anywhere specific :-)
>
> root@edge21-a:~# SYSTEMD_LOG_LEVEL=debug SYSTEMD_LOG_TARGET=console
> LIBBLKID_DEBUG=all
> /usr/lib/systemd/system-generators/systemd-gpt-auto-generator
> Found container virtualization none.
> Disabling root partition auto-detection, root= is defined.
> Disabling root partition auto-detection, root= is defined.
> Failed to open device: No such device
>
> Adding strace to the command provides something more useful:
>
> openat(AT_FDCWD, "/", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = 3
> openat(3, "sys", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 4
> fstat(4, {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
> close(3)= 0
> openat(4, "dev", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 3
> fstat(3, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
> close(4)= 0
> openat(3, "block", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 4
> fstat(4, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
> close(3)= 0
> openat(4, "0:0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = -1 ENOENT (No
> such file or directory)
> close(4)
>
> So it's trying to open() /sys/dev/block/0:0, but my system does not
> have that device file. The only files in /sys/dev/block are 8:0
> through 8:3.

→ https://github.com/systemd/systemd/issues/22504

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Antw: [EXT] Re: Feedback sought: can we drop cgroupv1 support soon?

2022-07-28 Thread Lennart Poettering
On Do, 28.07.22 09:48, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote:

> Hi!
>
> What about making cgroup1 support _configurable_ as a first step?
> So maybe people could try how well things work when there is no cgroups v1
> support in systemd.

It's already runtime configurable. Kernel command line option 
systemd.unified_cgroup_hierarchy=yes|no
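To check which mode a booted system actually ended up in, the filesystem type mounted on /sys/fs/cgroup is the usual tell: cgroup2fs means the unified (v2) hierarchy, tmpfs means a legacy or hybrid (v1) setup. A small sketch:

```shell
# Report which cgroup hierarchy the host is using, based on what is
# mounted on /sys/fs/cgroup; prints "unknown" if nothing is mounted
# there (e.g. in a minimal chroot).
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo none)
case "$fstype" in
    cgroup2fs) mode="unified (cgroup v2)" ;;
    tmpfs)     mode="legacy or hybrid (cgroup v1)" ;;
    *)         mode="unknown" ;;
esac
echo "$mode"
```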

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] sd_bus_process semantics

2022-07-25 Thread Lennart Poettering
On Mo, 25.07.22 12:21, Mathis MARION (mamari...@silabs.com) wrote:

> I looked a bit into the source code:
>
> This part is responsible for storing the message processed in *ret:
>
> static int process_running(sd_bus *bus, sd_bus_message **ret) {
> [...]
> r = process_message(bus, m);
> if (r != 0)
> goto null_message;
>
> if (ret) {
> r = sd_bus_message_rewind(m, true);
> if (r < 0)
> return r;
>
> *ret = TAKE_PTR(m);
> return 1;
> }
> [...]
> null_message:
> if (r >= 0 && ret)
> *ret = NULL;
>
> return r;
> }
>
> static int process_message(sd_bus *bus, sd_bus_message *m) {
> [...]
> r = process_hello(bus, m);
> if (r != 0)
> goto finish;
> [...]
> r = process_builtin(bus, m);
> if (r != 0)
> goto finish;
>
> r = bus_process_object(bus, m);
>
> finish:
> bus->current_message = NULL;
> return r;
> }
>
> My analysis might be flawed since I am still new to sd-bus, but to me it
> seems like 'process_message' should return 0 on success, but since
> 'bus_process_object' returns 0 on failure it does not quite work as
> intended.

So, the idea is that sd_bus_process() only returns a message that
otherwise nothing was interested in processing. i.e. if you add a
filter or object handler or so, and it decided to process a message
(and thus returned 1 in its handler) then the message is considered
processed and not processed further, and thus not propagated back to
the caller. Only messages that no registered handler has indicated
"ownership" of will be returned to the caller.

I guess we should document that. Added to TODO list.

The idea is basically that you have two choices for processing
messages: install a filter/handler, or process the messages that
sd_bus_process() returns. Pick one.
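The ownership rule described above — a handler that claims a message consumes it, anything unclaimed is handed back to the caller — can be modeled with a tiny toy sketch (nothing here is sd-bus API, just an analogy for the dispatch logic):

```shell
# Toy dispatch loop: "handler" claims messages it recognizes (returns
# success, like an sd-bus filter/object handler returning 1); anything
# it does not claim is returned to the caller, as sd_bus_process() does.
handler() { case "$1" in org.example.*) true ;; *) false ;; esac; }
process() {
    if handler "$1"; then
        echo "consumed by handler: $1"
    else
        echo "returned to caller: $1"
    fi
}
process org.example.Ping
process org.freedesktop.Other
```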

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Feedback sought: can we drop cgroupv1 support soon?

2022-07-22 Thread Lennart Poettering
On Fr, 22.07.22 12:15, Lennart Poettering (mzerq...@0pointer.de) wrote:

> > I guess that would mean holding on to cgroup1 support until EOY 2023
> > or thereabout?
>
> That does sound OK to me. We can mark it deprecated before though,
> i.e. generate warnings, and remove it from docs, as long as the actual
> code stays around until then.

So I prepped a PR now that documents the EOY 2023 date:

https://github.com/systemd/systemd/pull/24086

That way we shouldn't forget about this, and it reminds us that we
still actually need to do it then.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Feedback sought: can we drop cgroupv1 support soon?

2022-07-22 Thread Lennart Poettering
On Fr, 22.07.22 12:37, Wols Lists (antli...@youngman.org.uk) wrote:

> On 22/07/2022 11:15, Lennart Poettering wrote:
> > > I guess that would mean holding on to cgroup1 support until EOY 2023
> > > or thereabout?
>
> > That does sound OK to me. We can mark it deprecated before though,
> > i.e. generate warnings, and remove it from docs, as long as the actual
> > code stays around until then.
>
> You've probably thought of this sort of thing already, but can you wrap all
> v1-specific code in #ifdefs? Especially if it's inside an if block, the
> compiler can then optimise the test away if you compile with that set to
> false.
>
> Upstream can then set the default to false, while continuing to support it,
> but it will then become more and more a conscious effort on the part of
> downstream to keep it working.
>
> Once it's visibly bit-rotting you can dump it :-)

The goal really is to reduce code size, not to increase it further by
having to maintain a ton of ifdeffery all over the place.

we generally frown on ifdeffery in "main" code already, i.e. we try to
isolate ifdeffery into "library" calls that hide it internally, and then
return EOPNOTSUPP if something is compiled out. That way the "main"
code can then treat compiled out stuff via usual error handling,
greatly simplifying conditionalizations and the combinatorial
explosion from having many optional deps.

ifdeffery comes at a price, and is very hard to test for (because CIs
do not test in all combinations of present and absent optional deps),
hence the goal should be to minimize and isolate it, not emphasize it and
sprinkle it over the whole codebase as if it was candy.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Feedback sought: can we drop cgroupv1 support soon?

2022-07-22 Thread Lennart Poettering
On Do, 21.07.22 16:24, Stéphane Graber (stgra...@ubuntu.com) wrote:

> Hey there,
>
> I believe Christian may have relayed some of this already but on my
> side, as much as I can sympathize with the annoyance of having to
> support both cgroup1 and cgroup2 side by side, I feel that we're sadly
> nowhere near the cut off point.
>
> >From what I can gather from various stats we have, over 90% of LXD
> users are still on distributions relying on CGroup1.
> That's because most of them are using LTS releases of server
> distributions and those only somewhat recently made the jump to
> cgroup2:
>  - RHEL 9 in May 2022
>  - Ubuntu 22.04 LTS in April 2022
>  - Debian 11 in August 2021
>
> OpenSUSE is still on cgroup1 by default in 15.4 for some reason.
> All this is also excluding our two largest users, Chromebooks and QNAP
> NASes, neither of them made the switch yet.

At some point I feel no sympathy there. If google/qnap/suse still are
stuck in cgroupv1 land, then that's on them, we shouldn't allow
ourselves to be held hostage by that.

I mean, that Google isn't forward looking in these things is well
known, but I am a bit surprised SUSE is still so far back.

> I honestly wouldn't be holding deprecating cgroup1 on waiting for
> those few to wake up and transition.
> Both ChromeOS and QNAP can very quickly roll it out to all their users
> should they want to.
> It's a bit trickier for OpenSUSE as it's used as the basis for SLES
> and so those enterprise users are unlikely to see cgroup2 any time
> soon.
>
> Now all of this is a problem because:
>  - Our users are slow to upgrade. It's common for them to skip an
> entire LTS release and those that upgrade every time will usually wait
> 6 months to a year prior to upgrading to a new release.
>  - This deprecation would prevent users of anything but the most
> recent release from running any newer containers. As it's common to
> switch to newer containers before upgrading the host, this would cause
> some issues.
>  - Unfortunately the reverse is a problem too. RHEL 7 and derivatives
> are still very common as a container workload, as is Ubuntu 16.04 LTS.
> Unfortunately those releases ship with a systemd version that does not
> boot under cgroup2.

Hmm, cgroupv1 named hierarchies should still be available even on
cgroupv2 hosts. I am pretty sure nspawn at least should have no
problem with running old cgroupv1 payloads on a cgroupv2 host.

Isn't this issue just an artifact of the fact that LXD doesn't
pre-mount cgroupfs? Or does it do so these days? because systemd's
PID1 since time began would just use the cgroup setup it finds itself
in if it's already mounted/set up. And only mount and make a choice
between cgroup1 or cgroupv2 if there's really nothing set up so far.

Because of that I see no reason why old systemd cgroupv1 payloads
shouldn't just work on cgroupv2 hosts: as long as you give them a
pre-set-up cgroupv1 environment, and nothing stops you from doing
that. In fact, this is something we even documented somewhere: what to
do if the host only does a subset of the cgroup stuff you want, and
what you have to do to set up the other stuff (i.e. if host doesn't
manage your hierarchy of choice, but only others, just follow the same
structure in the other hierarchy, and clean up after yourself). This
is what nspawn does: if host is cgroupv2 only it will set up
name=systemd hierarchy in cgroupv1 itself, and pass that to the
container.

(I mean, we might have regressed on this, since I guess this kind of
setup is not as well tested with nspawn, but I distinctly remember
that I wrote that stuff once upon a time, and it worked fine then.)

> That last issue has been biting us a bit recently but it's something
> that one can currently workaround by forcing systemd back into hybrid
> mode on the host.

This should not be necessary, if LXD would do minimal cgroup setup on
its own.

> With the deprecation of cgroup1, this won't be possible anymore. You
> simply won't be able to have both CentOS7 and Fedora XYZ running in
> containers on the same system as one will only work on cgroup1 and the
> other only on cgroup2.

I am pretty sure this works fine with nspawn...

> I guess that would mean holding on to cgroup1 support until EOY 2023
> or thereabout?

That does sound OK to me. We can mark it deprecated before though,
i.e. generate warnings, and remove it from docs, as long as the actual
code stays around until then.

Thank you for the input,

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Feedback sought: can we drop cgroupv1 support soon?

2022-07-22 Thread Lennart Poettering
On Do, 21.07.22 11:55, Christian Brauner (brau...@kernel.org) wrote:

> In general, I wouldn't mind dropping cgroup1 support in the future.
>
> The only thing I immediately kept thinking about is what happens to
> workloads that have a v1 cgroup layout on the host possibly with an
> older systemd running container workloads using a newer distro with a
> systemd version without cgroup1 support.
>
> Think Ubuntu 18.04 host running a really new Ubuntu LTS that has a
> version of systemd with cgroup1 support already dropped. People do
> actually do stuff like that. Stéphane and Serge might know more about
> actual use-cases in that area.

The question is, though, how much we can get away with on that
front, i.e. I think we can all agree that attempting to run an
extremely new container on an extremely old host is something we
really don't have to support, once the age difference is beyond some
boundary. The question is where that boundary lies.

Much the same way as we have a baseline on kernel versions systemd
supports (currently 3.15, soon 4.5), we probably should start to
define a baseline of what to expect from a container manager.

Lennart

--
Lennart Poettering, Berlin


[systemd-devel] Feedback sought: can we drop cgroupv1 support soon?

2022-07-21 Thread Lennart Poettering
Heya!

It's currently a terrible mess having to support both cgroupsv1 and
cgroupsv2 in our codebase.

cgroupsv2 first entered the kernel in 2014, i.e. *eight* years ago
(kernel 3.16). We soon intend to raise the baseline for systemd to
kernel 4.3 (because we want to be able to rely on the existence of
ambient capabilities), but that also means that all kernels we intend
to support have a well-enough working cgroupv2 implementation.

Hence, I'd love to drop the cgroupv1 support from our tree entirely,
and simplify and modernize our codebase to go cgroupv2-only. Before we
do that I'd like to seek feedback on this though, given this is not
purely a thing between the kernel and systemd — this does leak into
some userspace that operates on cgroups directly.

Specifically, legacy container infra (i.e. docker/moby) for the
longest time was cgroupsv1-only. But as I understand it, it has since
been updated to support cgroupsv2 too.

Hence my question: is there a strong community of people who insist on
using newest systemd while using legacy container infra? Anyone else
has a good reason to stick with cgroupsv1 but really wants newest
systemd?

The time when we'll drop cgroupv1 support *will* come eventually
either way, but what's still up for discussion is determining
precisely when. Hence, please let us know!

Thanks,

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Issues with /usr GPT auto-mount

2022-07-14 Thread Lennart Poettering
On Do, 14.07.22 12:40, Michael Cassaniti (mich...@cassaniti.id.au) wrote:

> Should I at least raise a feature request in GitHub?

Please do!

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Issues with /usr GPT auto-mount

2022-07-14 Thread Lennart Poettering
On Do, 14.07.22 12:08, Michael Cassaniti (mich...@cassaniti.id.au) wrote:

> Hi,
>
> I've read the below two articles and potentially made wrong assumptions
> about how automatic discovery of GPT partitions works for EFI boot rather
> than from systemd-nspawn.
>
> https://0pointer.net/blog/the-wondrous-world-of-discoverable-gpt-disk-images.html
> (particularly the Versioning + Multi-Arch section)
> https://0pointer.net/blog/fitting-everything-together.html
>
> After reading this on top of the gpt-auto-generator.c code (read doesn't
> mean perfectly understood) I believe that only the root file system gets
> mounted under an EFI boot. I was hoping that both root and /usr get mounted
> as appropriate. I can confirm when using 'systemd-dissect /dev/sda' that
> there is a designated /usr partition with a label that isn't '_empty' and
> for the correct architecture.
>
> Either I've done something wrong and need help or systemd-gpt-auto-generator
> is working correctly and I'm wrong. All feedback is appreciated.

This functionality is still missing in systemd-gpt-auto-generator
currently. I would love to review/merge a patch that fills in the gap.

(In my own usecase I always used usrhash= on the kernel cmdline, to
pin a specific /usr/ fs to a specific kernel, thus /usr/ auto
discovery was never needed, but we should definitely support that too)

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Antw: [EXT] Re: [systemd‑devel] Running actual systemd‑based distribution image in systemd‑nspawn

2022-07-11 Thread Lennart Poettering
On Mo, 11.07.22 13:57, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote:

> > That said: I strongly recommend that distros ship empty /etc/fstab by
> > default, and rely on GPT partition auto discovery
> > (i.e. systemd‑gpt‑auto‑generator) to mount everything, and only depart
> > from that if there's a strong reason to, i.e. default mount options
> > don't work, or external block device referenced or so.
>
> What if you have multiple operating systems in various partitions on one
> disk?
> /etc/fstab absolutely makes sense

See the boot loader spec about that. Basically, the assumption is
that for things like swap, /var/tmp or /home it's OK or even a good
thing if they are shared between OSes. The major exception is /var/
itself, which is per OS installation and should not be shared between
multiple installations. The spec hence by default will auto-mount
/var/ partitions only if the GPT partition UUID is hashed from the
machine ID.

But there are two distinct concepts here:

1. Tag your partitions properly by type UUID. This is always a good
idea, and makes nspawn just work, as well as all the other tools that
recognize partition type UUIDs, i.e. all the --image= switches systemd
tools have, and so on.

2. Actually ship an empty /etc/fstab and rely solely on GPT auto
discovery. I'd always do that wherever possible (i.e. any OS where
multi-boot doesn't matter, i.e. appliances, images for VMs/nspawn,
cloud stuff, servers).

I.e. concept 1 should always be done. Whether you then also adopt
concept 2 is up to you. You can, but you don't have to.
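Concept 1 can be sketched with an sfdisk(8) input script. The layout below is purely hypothetical; the type UUIDs are the well-known ones from the Discoverable Partitions Specification for x86-64:

```
# Hypothetical sfdisk input script; apply with: sfdisk /dev/sdX < layout.sfdisk
label: gpt
# EFI System Partition
size=512MiB, type=c12a7328-f81f-11d2-ba4b-00a0c93ec93b
# /home (may be shared between installations per the spec)
size=32GiB, type=933ac7e1-2eb4-4f13-b844-0e14e2aef915
# root for x86-64, rest of the disk
type=4f68bce3-e8cd-4db1-96e7-fbcaf984b709
```

With partitions typed like this, nspawn's --image= and systemd-gpt-auto-generator can find the root and /home without any /etc/fstab entries.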

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Service output missing from journal?

2022-07-04 Thread Lennart Poettering
On Mo, 04.07.22 23:15, Michael Biebl (mbi...@gmail.com) wrote:

> Am Mo., 4. Juli 2022 um 19:36 Uhr schrieb Lennart Poettering
> :
> >
> > > On So, 03.07.22 19:29, Uwe Geuder (systemd-devel-ugeu...@snkmail.com) 
> > wrote:
> >
> > > Hi!
> > >
> > > When I run the command given below on a current Fedora CoreOS system
> > > (systemd 250 (v250.6-1.fc36)) I get a result I absolutely cannot understand.
> > > Can anybody help me with what is wrong there?
> > >
> > > $ systemd-run --user sh -c 'while true; do echo foo; df -h 
> > > /var/log/journal/; echo $?; sleep 3; done'
> > > Running as unit: run-r9a155474889b4d40a1ac119823bdc2bf.service
> > > $ journalctl --user -f -u run-r9a155474889b4d40a1ac119823bdc2bf
> > > [ ... similar lines redacted ... ]
> > > Jul 03 15:25:08 ip-172-31-8-116 sh[366900]: 0
> > > Jul 03 15:25:11 ip-172-31-8-116 sh[366900]: foo
> > > Jul 03 15:25:11 ip-172-31-8-116 sh[366900]: 0
> > > Jul 03 15:25:14 ip-172-31-8-116 sh[366900]: foo
> > > Jul 03 15:25:14 ip-172-31-8-116 sh[366900]: 0
> > > Jul 03 15:25:17 ip-172-31-8-116 sh[366900]: foo
> > > Jul 03 15:25:17 ip-172-31-8-116 sh[366900]: 0
> > > Jul 03 15:25:20 ip-172-31-8-116 sh[366900]: foo
> > > Jul 03 15:25:20 ip-172-31-8-116 sh[366941]: Filesystem  Size  Used 
> > > Avail Use% Mounted on
> > > Jul 03 15:25:20 ip-172-31-8-116 sh[366941]: /dev/nvme0n1p4  9.5G  8.1G  
> > > 1.5G  86% /var
> > > Jul 03 15:25:20 ip-172-31-8-116 sh[366900]: 0
> > > Jul 03 15:25:23 ip-172-31-8-116 sh[366900]: foo
> > > Jul 03 15:25:23 ip-172-31-8-116 sh[366900]: 0
> > > Jul 03 15:25:26 ip-172-31-8-116 sh[366900]: foo
> > > Jul 03 15:25:26 ip-172-31-8-116 sh[366900]: 0
> > > Jul 03 15:25:29 ip-172-31-8-116 sh[366900]: foo
> > > Jul 03 15:25:29 ip-172-31-8-116 sh[366900]: 0
> > > [...]
> > >
> > > So the output from the df command appears in the journal pretty rarely,
> > > at seemingly random intervals. When I run the same loop on the
> > > command line the output occurs every time.
> > >
> > > The problem was originally noted in a somewhat loaded system. However,
> > > above reproducer (including the 2 echo commands and a shorter sleep)
> > > shows the same problem even on an idling machine.
> >
> > https://github.com/systemd/systemd/issues/2913
>
> I thought about this as well, but in this case the service is still
> running. So I'm not sure if #2913 applies here.

The service is, but the "df" process exits extremely quickly, before
we can figure out what it belongs to. See the PIDs where it works,
they are different from your shell script's PID, because they are
short-lived child processes.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Service output missing from journal?

2022-07-04 Thread Lennart Poettering
On So, 03.07.22 19:29, Uwe Geuder (systemd-devel-ugeu...@snkmail.com) wrote:

> Hi!
>
> When I run the command given below on a current Fedora CoreOS system
> (systemd 250 (v250.6-1.fc36)) I get a result I absolutely cannot understand.
> Can anybody help me with what is wrong there?
>
> $ systemd-run --user sh -c 'while true; do echo foo; df -h /var/log/journal/; 
> echo $?; sleep 3; done'
> Running as unit: run-r9a155474889b4d40a1ac119823bdc2bf.service
> $ journalctl --user -f -u run-r9a155474889b4d40a1ac119823bdc2bf
> [ ... similar lines redacted ... ]
> Jul 03 15:25:08 ip-172-31-8-116 sh[366900]: 0
> Jul 03 15:25:11 ip-172-31-8-116 sh[366900]: foo
> Jul 03 15:25:11 ip-172-31-8-116 sh[366900]: 0
> Jul 03 15:25:14 ip-172-31-8-116 sh[366900]: foo
> Jul 03 15:25:14 ip-172-31-8-116 sh[366900]: 0
> Jul 03 15:25:17 ip-172-31-8-116 sh[366900]: foo
> Jul 03 15:25:17 ip-172-31-8-116 sh[366900]: 0
> Jul 03 15:25:20 ip-172-31-8-116 sh[366900]: foo
> Jul 03 15:25:20 ip-172-31-8-116 sh[366941]: Filesystem  Size  Used Avail 
> Use% Mounted on
> Jul 03 15:25:20 ip-172-31-8-116 sh[366941]: /dev/nvme0n1p4  9.5G  8.1G  1.5G  
> 86% /var
> Jul 03 15:25:20 ip-172-31-8-116 sh[366900]: 0
> Jul 03 15:25:23 ip-172-31-8-116 sh[366900]: foo
> Jul 03 15:25:23 ip-172-31-8-116 sh[366900]: 0
> Jul 03 15:25:26 ip-172-31-8-116 sh[366900]: foo
> Jul 03 15:25:26 ip-172-31-8-116 sh[366900]: 0
> Jul 03 15:25:29 ip-172-31-8-116 sh[366900]: foo
> Jul 03 15:25:29 ip-172-31-8-116 sh[366900]: 0
> [...]
>
> So the output from the df command appears in the journal pretty rarely,
> at seemingly random intervals. When I run the same loop on the
> command line the output occurs every time.
>
> The problem was originally noted in a somewhat loaded system. However,
> above reproducer (including the 2 echo commands and a shorter sleep)
> shows the same problem even on an idling machine.

https://github.com/systemd/systemd/issues/2913

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] show container limits?

2022-07-04 Thread Lennart Poettering
On Mo, 04.07.22 12:37, Harald Dunkel (harald.dun...@aixigo.com) wrote:

> Hi folks,
>
> systemctl status does a nice job showing LXC containers and their
> process trees, but I wonder if it could show memory and cpu limits,
> memory utilization, swap, etc as well, even if the LXC or docker or
> whatever container wasn't started by systemd? cgroup1 and unified,
> if possible.

systemctl status shows memory/CPU limits of the cgroups/units it
manages. If docker/lxc create per-container units through systemd,
then this should just work, but it really depends on how they
implemented stuff.

To my knowledge docker does not implement the delegation model of
cgroups at all, and just fucks around in the tree directly at random
places, hence systemd won't know about it at all... i.e. they refuse
to acknowledge the existence of this, because they think systemd is
stupid, or something like that:

https://systemd.io/CGROUP_DELEGATION

LXC is better and follows these docs, to my knowledge, but I am not
sure which model they actually followed — i.e. the model of "one unit
per container" or the model of "a single unit for all containers". If
the latter, you cannot use systemd tools to inspect or manage
resources.

You can use "systemd-cgtop" to show current resource usage of any
cgroup (regardless of whether it is managed by systemd or not), but it
doesn't show the limits being enforced; that would probably make sense
to add...

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Unable to check 'effective' cgroup limits

2022-07-04 Thread Lennart Poettering
On Do, 09.06.22 11:40, Lewis Gaul (lewis.g...@gmail.com) wrote:

> Hi everyone,
>
> [Disclaimer: cross posting from
> https://github.com/containers/podman/discussions/14538]
>
> Apologies that this is more of a Linux cgroup question than specific to
> systemd, but I was wondering if someone here might be able to enlighten
> me...
>
> Two questions:
>
>- Why on cgroups v1 do the cpuset controller's
>cpuset.effective_{cpus,mems} seem to simply not work?

systemd doesn't support cpuset on cgroupsv1. It's too broken.

systemd supports cpuset only on cgroupsv2.
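On cgroupsv2, cpuset configuration is exposed through unit properties rather than raw cgroupfs writes. A hypothetical drop-in might look like this (see systemd.resource-control(5) for AllowedCPUs= and AllowedMemoryNodes=; unit name and values are illustrative):

```ini
# /etc/systemd/system/example.service.d/50-cpuset.conf (hypothetical)
[Service]
AllowedCPUs=0-1
AllowedMemoryNodes=0
```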

>- Is there any way to check effective cgroup memory or hugetlb limits?
>(cgroups v1 or v2)

We do not support hugetlb at all.

We currently do not have an API for querying effective cgroup limits
(but it sounds OK to add; file an RFE). For now, you can of course go
to cgroupfs and read what's set there.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] homed: Purpose of assert(!h->current_operation)

2022-07-04 Thread Lennart Poettering
On So, 26.06.22 02:57, Léo (leeo97...@gmail.com) wrote:

> Hello,
>
> I would like to know the purpose of this assert operation:
> https://github.com/systemd/systemd/blob/v251/src/home/homed-home.c#L2697
>
> What does it mean if it fails?

It just encodes that this function expects to be called with no
operation currently being executed.

Like all assert() calls it just encodes assumptions made by the
programmer, i.e. stuff that should be guaranteed at this point; if
they are not met, that indicates a programming error (as opposed to a
runtime error).

Specifically, for each home dir, we allow exactly one operation to be
executed at once, and all other ones are queued. Thus, when we start
to execute one operation we check that there is none already being
executed, because if it was, then there's a bug somewhere.

Why do you ask? Did you actually see the assertion being hit?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Waiting for network routes to appear

2022-07-04 Thread Lennart Poettering
On Mi, 15.06.22 07:31, Kevin P. Fleming (ke...@km6g.us) wrote:

> I've got a number of systems that use BIRD to learn the routes
> available on their networks, and as a result some services on those
> systems attempt to start up before the routes have been learned. If
> those services attempt to make network connections (even just DNS
> queries), they will fail, and that's unpleasant.
>
> I can't use existing systemd facilities to make these services wait,
> because there's no mechanism available for BIRD to indicate that any
> specific route has been learned, or a way to configure a service to
> wait for a specific route.
>
> I'm considering just writing a smallish Python program which will
> accept (via configuration) a list of routes, and will listen to
> netlink to wait for all of those routes to appear. I'd then make my
> services dependent on this service reporting success. However, since
> networkd already listens to netlink, it would certainly be possible
> for it to provide this facility in some way.
>
> If you'll pardon the analogy, I'm thinking of something like
> RequiresMountsFor=, which makes service startup wait until mount units
> have succeeded. Of course following this analogy we'd end up creating
> the concept of a 'route unit', and I'm not sure that's really the
> right thing to do here.
>
> Is it worth trying to design some way for networkd to provide this
> facility? if not, I'll just continue down the road of doing this
> outside of systemd proper.

networkd manages an interface in full or not at all. I am pretty sure
we shouldn't make it something that watches asynchronously what other
software does, and then acts on it. That's racy and fragile.

It appears to me you should ask the "bird" project for this
functionality instead.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] mkosi inside a toolbox container

2022-07-04 Thread Lennart Poettering
On Di, 28.06.22 22:01, Ananth Bhaskararaman (ant...@gmail.com) wrote:

> Has anyone had success using mkosi to generate images inside a
> toolbox container? I'm running Fedora 36 Silverblue.

mkosi needs loopback block devices. They are not virtualized for
containers on Linux. That's a kernel issue.

> I keep getting errors related to systemd not booting up as PID 1,
> misc. systemd-networkd errors, and something about a btrfs device scan
> lacking permissions.
>
> I'd love pointers on how to get this working, or hear from people
> who've tried anything similar.

You would have to fix the kernel to properly virtualize block devices
for containers. Good luck!

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] capabilities for systemd --user

2022-07-04 Thread Lennart Poettering
On Mo, 27.06.22 23:36, Lukasz Stelmach (l.stelm...@samsung.com) wrote:

> Hi,
>
> I need an apparently exotic configuration and I don't know how to
> approach the problem. Here are the requirements:
>
> - user@1234.service (systemd --user)
>   + runs with Priv SMACK label (SmackProcessLabel in user@.service)
>   + has cap_mac_admin (and a few other capabilities) to assign SMACK
> labels to its children (AmbientCapabilities in user@.service)
>
> - children (session services) run with Reg SMACK label (I added
>   support for DefaultSmackProcessLabel to user.conf, to avoid
>   modifications of all unit files)

Sounds upstreamable.

>
> - children DO NOT inherit capabilites from systemd --user (they do now)
>
> This last is a problem because I'd like to avoid modifications of all
> service files. I tried to drop inheritable caps before execve() (in
> exec_child()) but as described in capabilities(7) this results in
> dropping caps from the ambient set too, which means systemd --user
> doens't get what it needs.
>
> Is there anything I am missing? Is there any way to start a service with
> UID!=0, some capabilities set but not implicitly inheritable by
> processes spawned by the service?

Quite frankly that should probably be the default behaviour.

I'd probably merge a patch that unconditionally resets all caps
passed to children of the --user manager even if the manager itself
got some ambient caps passed. It might be a slight compat breakage,
but I think it would be safer that way, as the service execution
environment becomes more uniform then.

Security credentials should be passed down to user services opt-in,
not opt-out after all.

Can you prep a patch for that and submit via github?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] https://github.com/QubesOS/qubes-issues/issues/7335

2022-07-04 Thread Lennart Poettering
On Mo, 30.05.22 08:13, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de) wrote:

> Hi!
>
> Just in case: Does anybody have any idea what might be causing this
> effect (https://github.com/QubesOS/qubes-issues/issues/7335)?

LVM issues you have to ask the LVM people about really.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [libudev] is there a function to filter message from kernel with property and value

2022-07-04 Thread Lennart Poettering
On Di, 31.05.22 02:48, Wang, Yuan1 (yuan1.w...@intel.com) wrote:

> Hi
>
> Need your kind help for one question!
>
> Do libudev have a function that could be used to filter the message with 
> property from kernel socket?

No, because that is not optimizable, i.e. we have no way to filter
these messages in the kernel. We could filter them in userspace in the
library once we have them, but that's not really too useful, since you
might as well do that yourself; the library wouldn't be any more
efficient.

Usually, what you want to do instead is mark devices to filter for
with a "tag". Tags are short strings that devices can be labelled
with; each device can have zero, one or more tags, and you set them
via udev rules. These tags can then be filtered for efficiently via
the library. Internally this is implemented via a Bloom filter that is
tested by a BPF socket filter, which means the kernel already filters
the messages and userspace is never bothered (well, Bloom filters are
probabilistic, so userspace still has to check for false positives).

Anyway, long story short: filtering by properties is not supported
because you should not do that, and should use tags instead.

Also, libudev is obsolete and does not receive new additions. Use the
sd-device API instead.
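To sketch the tag approach: a hypothetical udev rule tags the devices of interest, and a monitor then filters by that tag, either with `udevadm monitor --tag-match=myapp` on the command line or with sd_device_monitor_filter_add_match_tag() in the sd-device API. The rule file name, tag name, and match keys below are illustrative:

```
# /etc/udev/rules.d/70-myapp.rules (hypothetical)
# Tag ext4 block devices so monitors can filter for them in the kernel:
SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="ext4", TAG+="myapp"
```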

--
Lennart Poettering, Berlin


Re: [systemd-devel] Unit shutdown order not always respected

2022-07-01 Thread Lennart Poettering
On Do, 30.06.22 09:46, David Gubler (david.gub...@vshn.net) wrote:

> Hi list,
>
> I have a situation where I need to run a command and wait for its completion
> before unmounting a file system (/enc in my case). My problem is that
> systemd sometimes waits for the completion of the command, and sometimes
> doesn't.
>
> So the setup is:
> * /enc is a mounted encrypted (luks) volume
> * /var/lib/mysql is a bind mount to /enc/mysql
> * MariaDB is using /var/lib/mysql
> * We've set up a "requires" and "after" dependency chain from MariaDB all
> the way to the luks volume. This works 100% reliably during startup, even if
> something takes too long, fails, or if we have to manually fiddle with
> stuff.
> * Ubuntu 20.04 with systemd 245.4
>
>
> The unit of the command that needs to run before unmounting /enc looks like
> this:
>
>
> [Unit]
> Description=server-secrets-prepare-reboot-enc service
> After=network-online.target enc.mount
> Requires=network-online.target
> BindsTo=enc.mount
>
> [Service]
> ExecStart=/bin/true

You can drop this line.

> ExecStop=/usr/local/sbin/server-secrets reboot "/enc"
> Restart=no

This is the implied default.

> StandardOutput=syslog
> StandardError=syslog

These lines are obsolete. Drop them.

> SyslogIdentifier=server-secrets-prepare-reboot-enc
> User=root

This is the implied default.

> Type=oneshot
> RemainAfterExit=yes
>
> [Install]
> WantedBy=enc.mount
>
>
>
>
> I've added a 10s sleep to the "server-secrets" command in order to eliminate
> "works by chance" situations.
>
> In normal circumstances everything works perfectly. I can reboot the server
> and systemd waits >10s with unmounting /enc until server-secrets is done. I
> can stop mariadb, umount /var/lib/mysql and /enc, and systemd runs
> server-secrets just fine. I can re-mount /enc, unmount it again, everything
> works.
>
> Where things start to fall apart is when I reboot the server ("reboot" or
> "shutdown -r now") after manually stopping MariaDB. In this case systemd
> starts the server-secrets command before unmounting /enc, but does not wait
> for its completion, and immediately unmounts /enc, causing the command to
> fail.
>
> Note that manually stopping MariaDB does not unmount anything, it just...
> stops MariaDB. But somehow it causes systemd to change its ordering behavior
> during shutdown.
>
> Does this ring any bells?

You are starting this service at boot, right? Or are you starting it
only during shutdown? Generally, if code shall run during shutdown the
right approach is to make it a service that starts at boot-up but has
an empty ExecStart= and an ExecStop= set to a valid command.
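Applied to the unit above, that pattern could look like the sketch below (based on the poster's unit, names unchanged). With Type=oneshot and RemainAfterExit=yes, ExecStart= may be omitted entirely; the unit is "started" at boot and the ExecStop= command runs at shutdown, ordered so that it completes before enc.mount is unmounted:

```ini
[Unit]
Description=Run server-secrets before /enc is unmounted
After=network-online.target enc.mount
Requires=network-online.target
BindsTo=enc.mount

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStop=/usr/local/sbin/server-secrets reboot "/enc"

[Install]
WantedBy=enc.mount
```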

Either way, consider having a look at the debug logs to see what
happens. Usually you have some ordering cycle between units, which
systemd will try to break for you, but which of course means the
ordering is not going to be honored in full.

See:

https://freedesktop.org/wiki/Software/systemd/Debugging/#diagnosingshutdownproblems

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Running actual systemd-based distribution image in systemd-nspawn

2022-06-30 Thread Lennart Poettering
On Sa, 18.06.22 07:45, Andrei Borzenkov (arvidj...@gmail.com) wrote:

> On 16.06.2022 11:27, Colin Guthrie wrote:
> > Andrei Borzenkov wrote on 15/06/2022 16:56:
> >> I tried it (loop mounting qemu image):
> >>
> >> systemd-nspawn -D ./hd0 -b
> >>
> >> and it failed miserably with "Timeout waiting for device
> >> dev-disk-by...". Which is not surprising as there are no device units
> >> inside of container (it stops in single user allowing me to use sysctl
> >> -t device).
> >>
> >> Is it supposed to work at all? Even if I bind mount /dev/disk it does
> >> not help as systemd does not care whether device is actually present or 
> >> not.
> >
> > I've not tried "booting" a real install inside nspawn before (just
> > images installed by mkosi mostly), but could this just be a by-product
> > of it trying to do what /etc/fstab (or other mount units) say to do?
> >
> > Can you try something like:
> >
> > touch blank
> > systemd-nspawn --bind-ro=./blank:/etc/fstab -D ./hd0 -b
> >
>
> Yes, --bind=/dev/null:/etc/fstab
>
> allows boot to complete. Of course next it refuses root login because
> pts/0 is not secure :)

pam_securetty is archaic cruft, and a broken idea. Please work with
your distribution to remove it. It might have made some vague sense in
1980s fixed-line terminal environments, but it is security theatre and
nothing more than a nuisance in today's world.

Modern distributions do not enable it anymore.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Running actual systemd-based distribution image in systemd-nspawn

2022-06-30 Thread Lennart Poettering
On Do, 16.06.22 09:27, Colin Guthrie (gm...@colin.guthr.ie) wrote:

> Andrei Borzenkov wrote on 15/06/2022 16:56:
> > I tried it (loop mounting qemu image):
> >
> > systemd-nspawn -D ./hd0 -b
> >
> > and it failed miserably with "Timeout waiting for device
> > dev-disk-by...". Which is not surprising as there are no device units
> > inside of container (it stops in single user allowing me to use sysctl
> > -t device).
> >
> > Is it supposed to work at all? Even if I bind mount /dev/disk it does
> > not help as systemd does not care whether device is actually present or not.
>
> I've not tried "booting" a real install inside nspawn before (just images
> installed by mkosi mostly), but could this just be a by-product of it trying
> to do what /etc/fstab (or other mount units) say to do?
>
> Can you try something like:
>
> touch blank
> systemd-nspawn --bind-ro=./blank:/etc/fstab -D ./hd0 -b

This should not be necessary, as systemd-fstab-generator actually
ignores all /etc/fstab entries referencing block devices. See this:

https://github.com/systemd/systemd/blob/main/src/fstab-generator/fstab-generator.c#L602

(i.e. container managers such as nspawn should mount /sys/ read-only,
which is an indication to container payloads that device management
should not be done by them but is done somewhere else. This is used as
a check for whether to ignore the fstab entries that look like device
paths, i.e. start with /dev.)

What precisely does the offending fstab line look like for you?
Normally it should be ignored just like that. If it is not ignored,
this looks like a bug.

> to override the /etc/fstab (there may be other more elegant ways to disable
> fstab processing!) and see if that helps?

No need. Should happen automatically.

That said: I strongly recommend that distros ship empty /etc/fstab by
default, and rely on GPT partition auto discovery
(i.e. systemd-gpt-auto-generator) to mount everything, and only depart
from that if there's a strong reason to, i.e. default mount options
don't work, or external block device referenced or so.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Running actual systemd-based distribution image in systemd-nspawn

2022-06-30 Thread Lennart Poettering
On Mi, 15.06.22 18:56, Andrei Borzenkov (arvidj...@gmail.com) wrote:

> I tried it (loop mounting qemu image):
>
> systemd-nspawn -D ./hd0 -b
>
> and it failed miserably with "Timeout waiting for device
> dev-disk-by...". Which is not surprising as there are no device units
> inside of container (it stops in single user allowing me to use sysctl
> -t device).
>
> Is it supposed to work at all? Even if I bind mount /dev/disk it does
> not help as systemd does not care whether device is actually present or not.

Yes, this should just work. I use it daily for work.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Questions around cgroups, systemd, containers

2022-05-21 Thread Lennart Poettering
On Fr, 20.05.22 17:12, Lewis Gaul (lewis.g...@gmail.com) wrote:

> To summarize the questions (taken from the second post linked above):
> - Why are private cgroups mounted read-only in non-privileged
> containers?

"private cgroups"? What do you mean by that? The controllers?

Controller delegation on cgroupsv1 is simply not safe, that's all. You
can provide invalid configuration to the kernel, and DoS the machine
through it. cgroups are simply not a suitable privilege boundary on
cgroupsv1.

If you want safe delegation, use cgroupsv2, where delegation is safe.

> - Is it sound to override Docker’s mounting of the private container
> cgroups under v1?

I don't know what Docker does these days, but they used to be
entirely ignorant of safe cooperation in the cgroup tree, i.e. they
ignored https://systemd.io/CGROUP_DELEGATION in its entirety, as they
never really accepted systemd's existence.

Today most distros I think switched over to other ways to run
containers, i.e. podman and so on, which have a more professional
approach to all this, and can safely cooperate in a cgroup tree.

>   - What are the concerns around the approach of passing '-v
> /sys/fs/cgroup:/sys/fs/cgroup' in terms of the container’s view of its
> cgroups?

I don't know what this does. Is this a Docker thing?

>   - Is modifying/replacing the cgroup mounts set up by the container engine
> a reasonable workaround, or could this be fragile?

I am not sure I follow? A workaround for what? One shouldn't assume
one even has the privs to modify cgroup mounts.

But why would one even?

> - When is it valid to manually manipulate container cgroups?

When you asked for your own delegated subtree first, see docs:

https://systemd.io/CGROUP_DELEGATION
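Requesting a delegated subtree from systemd is a single unit setting; a minimal hypothetical service (see systemd.resource-control(5) for Delegate=):

```ini
# Hypothetical unit that gets its own delegated cgroup subtree, which
# the service may then partition and manage itself without conflicting
# with systemd:
[Service]
Delegate=yes
ExecStart=/usr/local/bin/my-container-manager
```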

>   - Do container managers such as Docker and Podman correctly delegate
> cgroups on hosts running Systemd?

podman probably does this correctly. docker didn't do, not sure if
that changed.

>   - Are these container managers happy for the container to take ownership
> of the container’s cgroup?

I am not sure I grok this question, but a correctly implemented
container manager should be able to safely run cgroups-using payloads
inside the container. In that model, a host systemd manages the root
of the tree, the container manager a cgroup further down, and the
payload of the container (for example another systemd run inside the
container) the stuff below.

> - Why are the container’s cgroup limits not set on a parent cgroup under
> Docker/Podman?

I don't grok the question?

>   - Why doesn’t Docker use another layer of indirection in the cgroup
> hierarchy such that the limit is applied in the parent cgroup to the
> container?

I don't understand the question. And I can't answer docker questions.

> - What happens if you have two of the same cgroup mount?

what do you mean by a "cgroup mount"? A cgroupfs controller mount? If
they are within the same cgroup namespace they will be effectively
bind mounts of each other, i.e. show the exact same contents.

>   - Are there any gotchas/concerns around manipulating cgroups via multiple
> mount points?

Why would you do that though?

> - What’s the correct way to check which controllers are enabled?

Enabled *in* *what*? In the kernel? /proc/cgroups. Mounted? "mount",
maybe? In your container mgr? Depends on that.

>   - What is it that determines which controllers are enabled? Is it kernel
> configuration applied at boot?

Enabled where?

>   - Is it possible to have some controllers enabled for v1 at the same time
> as others are enabled for v2?

Yes.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-cryptsetup@.service crash during boot with fido2-device=auto

2022-05-18 Thread Lennart Poettering
On Di, 17.05.22 23:03, Anton Hvornum (an...@hvornum.se) wrote:

> Hi.
>
> I've been asking around everywhere for some assistance.
> The full issue can be found here:
> https://www.reddit.com/r/archlinux/comments/urnj8x/help_getting_fido2_and_systemdcryptenroll_working/
>
> The short version is, I got `systemd-cryptenroll --fido2-device=auto
> /dev/sda2` to work.
> Unlocking it works with a password, but it's not trying to use the
> fido2-device as expected.
>
> Whenever I add `/etc/crypttab` to the initramfs
> `systemd-cryptsetup@luksdev.service` crashes.

Crashes? What does that mean? As in segfault?

If so, please provide a stacktrace, otherwise this is not actionable
to us.

> And I'm wondering, is it required for the USB device to come alive
> before this service tries to execute?

Some initrds don't pick up the relevant fido2 udev
rules. i.e. 60-fido-id.rules and such. Contact your distro's initrd
maintainers for help on that.

>
> As far as I can tell, it executed:
> /lib/systemd/systemd-cryptsetup attach 'luksdev' '/dev/sda2' 'none'
> 'luks,fido2-device=auto'
>
> And by default if executed on a live medium that will hang waiting for
> the HSM to be inserted and will work. But I can't figure out why the
> service would break if that is all it does.
>
> As soon as I create a /etc/crypttab or omit tpm2-device=auto from the
> kernel command-line, the boot process breaks. Buf it I don't use
> /etc/crypttab or I have tpm2-device=auto the service succeeds - but
> won't use the fido device.. And that's probably obvious for everyone
> here but I'm stumped.

hmm, fido? or tpm?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Should `MACAddressPolicy=persistent` for bridges/bonds/all-software-devices be reconsidered?

2022-05-12 Thread Lennart Poettering
On Mo, 09.05.22 22:37, Dusty Mabe (du...@dustymabe.com) wrote:

> > This is true. But one can just as well argument that with
> > MACAddressPolicy=persistent the address is even more predictable. If
> > you know the machine-id and device name, you can calculate the address
> > in advance, even before deciding if the device will e.g. have this or
> > that card attached.
>
> Regarding machine-id, isn't that unique and set on first boot?

Not necessarily. We will initialize it from the ID passed in through
DMI if we detect execution in a VM and the ID is not set yet. This
means cloud providers can control the machine ID a system will use
ahead of time.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Should `MACAddressPolicy=persistent` for bridges/bonds/all-software-devices be reconsidered?

2022-05-12 Thread Lennart Poettering
On Mo, 09.05.22 19:27, Zbigniew Jędrzejewski-Szmek (zbys...@in.waw.pl) wrote:

> FWIW, I still think it's a better _default_. The patch that finally
> introduced this was my patch [1], so I'm obviously biased… Some more
> considerations:

I agree with this.

Finding good defaults is always difficult, but I must say stability
and predictability are properties I value above a lot of other stuff. I
understand that in plenty of environments it's important not to add new
MAC addresses to the mix, but it's impossible to know which
environment we are in.

So, either way, some people will always be unhappy with the defaults
we pick. Changing defaults makes sense if it's highly likely that the
vast majority of users would benefit from the new default. But I don't
see that here... And changing defaults comes at a price, because it
will break people's setups. We made a change once here, but I wouldn't
use that as an excuse to change it again...

So, I am not convinced.

What makes me wonder about all of this: we are talking about synthetic
devices, which means some tool is used to create them. If those tools
prefer a different MAC policy, why don't they drop in a .link file
that matches against the devices that specific software creates?

i.e. let's say NM prefers to use a different MAC policy: they could
drop in a udev rules file that adds some udev property onto the
network devices they manage (i.e. invoke a callout binary, or do a
TEST check or so which checks the iface against some NM state, and
then set ID_NM_OWNED_AND_OPERATED=1 or so). Then, ship a .link file
(or multiple) with a Property= match in the [Match] section, that sets
the desired policy.

With such a logic, NM could make its own choices on MAC policy, but
the default systemd policy wouldn't have to change.
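A rough sketch of that scheme (the property name comes from the example above; the file names and callout binary are made up):

```ini
# /usr/lib/udev/rules.d/85-nm-owned.rules (hypothetical)
# Tag interfaces the tool owns, e.g. via a callout that checks NM state
# and exits 0 for managed interfaces:
SUBSYSTEM=="net", ACTION=="add", PROGRAM=="/usr/libexec/nm-owns-iface %k", ENV{ID_NM_OWNED_AND_OPERATED}="1"

# /usr/lib/systemd/network/85-nm-owned.link (hypothetical)
[Match]
Property=ID_NM_OWNED_AND_OPERATED=1

[Link]
# Whatever policy the tool prefers instead of the systemd default:
MACAddressPolicy=none
```

Because .link files are applied first-match in lexical order, a file sorting before 99-default.link wins for the tagged interfaces, and the shipped default stays untouched for everything else.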

Also, afaik OSes that run in clouds all have some tool like cloud-init
or ignition or so, which generate .network files in /run with the right
configuration. Why not generate .link files in /run the same way with
a MAC policy appropriate for the cloud provider?
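Such a generated file could be as small as this (a sketch; the file name is made up):

```ini
# /run/systemd/network/10-cloud.link — written at boot by the
# provisioning tool (cloud-init, ignition, ...), hypothetical example
[Match]
OriginalName=*

[Link]
# Keep whatever MAC address the hypervisor assigned:
MACAddressPolicy=none
```

Since it sorts before 99-default.link, it takes precedence without patching the shipped default.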

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [SPAM] Re: Custom options and passing options via command line.

2022-05-10 Thread Lennart Poettering
On Di, 10.05.22 18:29, Kamil Jońca (kjo...@op.pl) wrote:

> Lennart Poettering  writes:
>
> > On Di, 10.05.22 17:59, Kamil Jońca (kjo...@op.pl) wrote:
> >
> >> Maybe I was not clear.
> >> I have ("internal") interfaces qemu1 and qemu2. and interface eth 
> >> ("external")
> >> I wat to nat traffic from interface qemu1 via eth , but I do not want
> >> nat traffic from interface qemu2 via eth2/
> >>
> >> How to achieve this?
> >
> > hmm, eth? eth2? is the latter a typo?
> >
> > Assuming it is a typo: set IPMasquerade=yes only in the .network file
> > that matches qemu1, not the one matching qemu2.
> Wait.
> eth = interface which got (statically or by dhcp) address 192.168.1.1
> qemu1 = bridge interface with bunch of VMs, address 192.168.2.1 subnet /24
> qemu2 = bridge interface with bunch of VMs, address 192.168.3.1 subnet /24
>
> I want that outgoing via eth traffic from qemu1 was masquaraded to
> 192.168.1.1
> and also want that outgoing via eth traffic from qemu2 was not touched
> (ie. has have source addresses 192.168.3.0/24)

Yes. So for the two bridge interfaces, define two distinct .network
files, and set IPMasquerade=yes in one and leave it off in the other.
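Concretely, that could look like this (file names made up; addresses taken from the question above):

```ini
# /etc/systemd/network/20-qemu1.network — NAT this bridge's traffic
[Match]
Name=qemu1

[Network]
Address=192.168.2.1/24
IPMasquerade=yes

# /etc/systemd/network/20-qemu2.network — leave this one alone
[Match]
Name=qemu2

[Network]
Address=192.168.3.1/24
# No IPMasquerade= here, so outgoing packets keep their
# 192.168.3.0/24 source addresses.
```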

> >> Of course. Like most nontrivial things I want to do.
> >> That was my point.
> >
> > But why involve a callout at all if it's not dynamic?
> Why do you think it is not "dynamic"?
> Subnet for which I want to mask is given via ipsec (and I understand
> that this should be handled by ipsec scripts)  or DHCP (how?)

Ah, well, OK, so the stuff is dynamic, but based on something other
than a network interface? Then networkd is not the right place to
configure that.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [SPAM] Re: Custom options and passing options via command line.

2022-05-10 Thread Lennart Poettering
On Di, 10.05.22 17:59, Kamil Jońca (kjo...@op.pl) wrote:

> Maybe I was not clear.
> I have ("internal") interfaces qemu1 and qemu2. and interface eth ("external")
> I wat to nat traffic from interface qemu1 via eth , but I do not want
> nat traffic from interface qemu2 via eth2/
>
> How to achieve this?

hmm, eth? eth2? is the latter a typo?

Assuming it is a typo: set IPMasquerade=yes only in the .network file
that matches qemu1, not the one matching qemu2.

> > If this does not deal in interfaces, but in IP addresses instead, no
> > need to involve networkd. Just define the firewall outside of
> > networkd?
> Of course. Like most nontrivial things I want to do.
> That was my point.

But why involve a callout at all if it's not dynamic?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [SPAM] Re: Custom options and passing options via command line.

2022-05-10 Thread Lennart Poettering
On Di, 10.05.22 17:46, Kamil Jońca (kjo...@op.pl) wrote:

> Lennart Poettering  writes:
>
> > On Di, 10.05.22 12:00, Kamil Jońca (kjo...@op.pl) wrote:
> >
> >> > The engine is decided at build time, i.e. can be either iptables or
> >> > nftables.
> >>
> >> But there are two kind of "nat' in *tables suites: 1.masquerade or 2.snat.
> >
> > It uses DNAT or MASQUERADE.
> >
> >> Especially what wyould be equivalent of:
> >>
> >> --8<---cut here---start->8---
> >> iface qemu inet static
> >> address 192.168.11.1
> >> netmask 255.255.255.0
> >> bridge_ports none
> >> --8<---cut here---end--->8---
> >> This creates "bridge" with assigned IP, without any ports (and with
> >> scripts it can create/drop some nftables rules ...)
> >
> > A .netdev file with Kind=bridge to create the bridge + a .network file
> > that assigns an IP address to it?
>
> No. Does not work. interface is in "no-carrier" "configuring" state.

ConfigureWithoutCarrier=
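Put together, a port-less bridge with a static address could look roughly like this (a sketch; file names made up):

```ini
# /etc/systemd/network/25-qemu.netdev
[NetDev]
Name=qemu
Kind=bridge

# /etc/systemd/network/25-qemu.network
[Match]
Name=qemu

[Network]
Address=192.168.11.1/24
# A bridge with no ports never gains carrier, so tell networkd to
# configure it anyway:
ConfigureWithoutCarrier=yes
```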

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd tries to terminate a process that seems to have exited

2022-05-10 Thread Lennart Poettering
On Di, 10.05.22 08:44, Yuri Kanivetsky (yuri.kanivet...@gmail.com) wrote:

> The one that produces the messages is 249.11 (that is running in a
> docker container):
>
> https://packages.ubuntu.com/jammy/systemd
>
> The one running on the host is 215-17 (Debian 8).

That's ancient... I figure this then also means you are stuck with
cgroupv1, which means cgroup empty notifications in containers
typically don't work.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd tries to terminate a process that seems to have exited

2022-05-10 Thread Lennart Poettering
On Mo, 09.05.22 23:43, Yuri Kanivetsky (yuri.kanivet...@gmail.com) wrote:

> Hi Andrei,
>
> Thanks for the suggestion. It becomes more verbose, but it still seems
> like `systemd` fails to notice that `gnome-keyring` exited:
>
> May 09 17:52:47 cb6d1c84f84e systemd[106]: gnome-keyring.service:
> Passing 0 fds to service
> May 09 17:52:47 cb6d1c84f84e systemd[106]: gnome-keyring.service:
> About to execute /usr/local/bin/gnome-keyring-daemon --start
> --components pkcs11,secrets
> May 09 17:52:47 cb6d1c84f84e systemd[106]: gnome-keyring.service:
> Forked /usr/local/bin/gnome-keyring-daemon as 310
> May 09 17:52:47 cb6d1c84f84e systemd[106]: gnome-keyring.service:
> Changed dead -> start
> May 09 17:52:47 cb6d1c84f84e systemd[106]: Starting Start
> gnome-keyring for the Secrets Service, and PKCS #11...
> May 09 17:52:47 cb6d1c84f84e systemd[310]: Skipping PR_SET_MM, as
> we don't have privileges.
> May 09 17:52:47 cb6d1c84f84e systemd[310]: gnome-keyring.service:
> Executing: /usr/local/bin/gnome-keyring-daemon --start --components
> pkcs11,secrets

My educated guess: you are running in cgroupsv1 mode. cgroup empty
notifications do not work reliably in containers on cgroupsv1.

Use cgroupsv2.

(but i think docker doesn't support that)

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd tries to terminate a process that seems to have exited

2022-05-10 Thread Lennart Poettering
On Do, 05.05.22 04:41, Yuri Kanivetsky (yuri.kanivet...@gmail.com) wrote:

> Hi,
>
> This might be not a systemd issue. But the behavior is weird, and I'm not 
> sure.
>
> I'm trying to run GNOME in a docker container. And gnome-keyring
> fails to start:

To my knowledge Docker is not capable of running a proper
systemd-based userspace as a container. I.e. it does not implement
this:

https://systemd.io/CONTAINER_INTERFACE

As I understand it, they are not interested in this, and think it is
out of focus. Which is certainly their right. But if you want to run
systemd as container payload, then better use a different container
manager, like podman, lxc or systemd-nspawn. They are all a lot more
open to supporting systemd as payload in a way that just works.

Docker is particularly borked when it comes to the way cgroups are set
up. And given that they are stuck on cgroupsv1 (or did that change?) I
see no perspective there.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [SPAM] Re: Custom options and passing options via command line.

2022-05-10 Thread Lennart Poettering
On Mo, 09.05.22 20:00, Kamil Jońca (kjo...@op.pl) wrote:

> Kamil Jońca  writes:
>
>
> > Let's see.
> > from SYSTEMD.NETWORK(5)
> > ...
> > IPMasquerade=
> >Configures IP masquerading for the network interface. If
> >enabled, packets forwarded from the network interface will be
> >appear as coming from the local host.
> > 
> >
> >
> > I still do not know what mean "local host" here. I guess that this will
> > be interface address.  :)
> >
> > I still do not know if this is rather "snat" or rather "masquerade". How
> > can I decide which to use. And what engine is used here.
> >
>
> Another question:
> 1. "partial nat"
>3 interfaces  qemu1 , qemu2, and eth
>I want to nat treffic from qemu1 via eth but not qemu2
>(NB this is the place, where I use mu custom option in
>/etc/network/interfaces which means "NAT this traffic" )

This sounds as if you just want to set IPMasquerade=yes in the
.network file that matches qemu1's interface, and that's it.

> 2. nat based on destination network.
>
> I want to nat only traffic to say, 192.168.10.0/24, leaving rest
> untouched. (This is case when I have ipsec tunnel and I want to nat only
> traffic to other endpoint)

If this does not deal in interfaces, but in IP addresses instead, no
need to involve networkd. Just define the firewall outside of
networkd?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [SPAM] Re: Custom options and passing options via command line.

2022-05-10 Thread Lennart Poettering
On Di, 10.05.22 12:00, Kamil Jońca (kjo...@op.pl) wrote:

> > The engine is decided at build time, i.e. can be either iptables or
> > nftables.
>
> But there are two kind of "nat' in *tables suites: 1.masquerade or 2.snat.

It uses DNAT or MASQUERADE.

> Especially what wyould be equivalent of:
>
> --8<---cut here---start->8---
> iface qemu inet static
> address 192.168.11.1
> netmask 255.255.255.0
> bridge_ports none
> --8<---cut here---end--->8---
> This creates "bridge" with assigned IP, without any ports (and with
> scripts it can create/drop some nftables rules ...)

A .netdev file with Kind=bridge to create the bridge + a .network file
that assigns an IP address to it?

> >> > Afaics RouteMetric= [DHCPv4] section already does all you need. just
> >> > give the iface whose default route you want to take precedence a lower
> >> > metric and you are done.
> >>
> >> How? By editing files? And what with other examples?
> >
> > I am not sure I follow? when do you intend to change the preference?
>
> When I manually up interface
> (ie. when, for example, issue comand networkctl up "interface name")

We don't support any explicit logic for that. But you can add a
drop-in for the .network file to /run/ and then reload before upping
the iface.

networkd always wants a complete, declarative idea of what it is
supposed to configure, so that it can adjust things to that. by doing
callouts that modify state you lose that ability, since networkd never
has a complete idea of what is supposed to be in effect, and once you
reload config things will be very confusing.
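The drop-in approach could look roughly like this (a sketch; the .network file name and metric value are made up):

```ini
# /run/systemd/network/10-wan.network.d/50-route-metric.conf
# Drop-in overriding a hypothetical 10-wan.network before upping it:
[DHCPv4]
# Lower metric = this interface's default route wins:
RouteMetric=10

# Then, to apply and bring the interface up:
#   networkctl reload
#   networkctl up wan
```

This keeps networkd's declarative model intact: the full intended state is always on disk, and a later reload reproduces it instead of fighting it.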

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Relationship between cgroup hierarchy and slice names

2022-05-10 Thread Lennart Poettering
On Do, 05.05.22 19:12, Yeongjin Kwon (yeongjink...@gmail.com) wrote:

> On Thu, May 5, 2022 at 11:17 AM Lennart Poettering
>  wrote:
> >
> > On Do, 05.05.22 10:44, Yeongjin Kwon (yeongjink...@gmail.com) wrote:
> >
> > > On Wed, May 4, 2022 at 4:03 AM Lennart Poettering
> > >  wrote:
> > > >
> > > > The slice names match 1:1 to the position in the cgroup tree, that's
> > > > where they were designed.
> > > >
> > > > Basically our rule is: if the object unit types encapsulates
> > > > already have a file system path as name then we don't allow you to
> > > > make up a new name, but insist that the unit name is derived from that
> > > > pre-existing file system path.
> > >
> > > I see, thank you for responding.
> > >
> > > Then would it be possible to somehow override the "Slice" property for
> > > all units underneath the slice so that it points to another, custom
> > > created slice? Maybe using some conditional overriding mechanism?
> >
> > You can override the slice for any non-slice unit that is backed by a
> > cgroup via a dropin that overrides Slice=.
> >
> > But maybe I don't grok your question.
>
> Sorry if I was being unclear.
>
> I want to relocate all non-slice units that are in a slice to a
> different slice. I know I can do this by overriding Slice=
> individually for each unit with a dropin, but that would be tedious
> and prone to flaws. For example, if a package put a unit under the
> slice that I wanted to move all units from, then I would have to
> override that new unit as well, which I might forget to do. I could
> use a post-install script, but that may not be the best solution
> either. And so I was wondering if there was a way to automatically
> override all units that are under the slice, without having to
> override Slice= for each of them individually. Since I haven't seen
> anything like this so far, I guess this is also a sort of feature
> request.

There's no such feature.

But how would it even work? If you say "everything in foo.slice should
now be in bar.slice", then once you apply it, these units are in
bar.slice, so the rule doesn't apply anymore... It's messy to use a
property as a matching parameter that is also the parameter you want to
change, because then things cannot possibly be declarative/idempotent
anymore...

Note that systemd units allow hierarchal drop-ins. i.e. the drop-in
directory "foo-.service.d/" is read for all units matching the glob
pattern "foo-*.service". Thus, if you want to migrate a bunch of units
at once, you could do so — as long as you enforce a naming scheme that
makes all units that "belong together" use the same name prefix.
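For example, with a prefix-based naming scheme, a single hierarchical drop-in covers the whole group (names made up):

```ini
# /etc/systemd/system/foo-.service.d/10-slice.conf
# Applies to every unit matching the glob "foo-*.service",
# e.g. a hypothetical foo-web.service and foo-db.service alike.
[Service]
Slice=bar.slice
```

Any new `foo-*.service` a package drops in later picks up the override automatically, with no per-unit drop-ins needed.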

In the past we had some discussion about a global setting
DefaultSlice=, which would allow changing which slice units default to
when they don't specify one (i.e. instead of system.slice). But I am
not convinced this really makes too much sense...

Anyway, I don't really grok the usecase...

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [SPAM] Re: Custom options and passing options via command line.

2022-05-10 Thread Lennart Poettering
On Mo, 09.05.22 19:13, Kamil Jońca (kjo...@fastmail.com) wrote:

> >> 3. decide where to resolve names based on domain and existence of ipsec
> >> or openvpn tunnel.
> >
> > Sounds like a job for the resolved domain routing logic, which already
> > exists?
>
> Not quite. When I asked previously  I got response, that resolved is
> based on interfaces. But ipsec tunnel does not need dedicated
> interface.

But the networkd-dispatcher stuff is also interface-based, no? So it
wouldn't solve your problem either?

> I still do not know what mean "local host" here. I guess that this will
> be interface address.  :)

Yes.

> I still do not know if this is rather "snat" or rather "masquerade". How
> can I decide which to use. And what engine is used here.

The engine is decided at build time, i.e. can be either iptables or nftables.

> I know that networkd cannot handle bridge without ports (quite
> convenient when you use it as dummy interface with qemu machines)

It cannot?

> > Afaics RouteMetric= [DHCPv4] section already does all you need. just
> > give the iface whose default route you want to take precedence a lower
> > metric and you are done.
>
> How? By editing files? And what with other examples?

I am not sure I follow? when do you intend to change the preference?

> > Note anyway that networkd assumes it manages an interface in its
> > entirety: if you muck with what it sets up it likely will override
> > your changes sooner or later, when some event happens... you have a
>
> I do not want interfere with interfaces "per se" I simply want to get
> some info from systemd and pass it to dnsmasq (for DNS) or nftables (for
> filtering) . That's it.

You started out asking about default routes?

Lennart

--
Lennart Poettering, Berlin

