Re: [systemd-devel] Delaying VM startup until block devices are available
On 27.01.2024 00:40, Orion Poplawski wrote:
> On 1/26/24 01:21, Lennart Poettering wrote:
>> On Do, 25.01.24 16:28, Orion Poplawski (or...@nwra.com) wrote:
>>
>>> We have various VMs that are backed by LUKS-encrypted LVs. At boot the
>>> volumes are decrypted by clevis. The problem we are seeing at the moment
>>> is that the VMs are started before the block devices are decrypted. Our
>>> current solution is:
>>
>> We generally wait for all devices listed in /etc/crypttab, unless you
>> set noauto or nofail.
>
> We are setting 'nofail', because I don't think I want to fail the boot in
> general. They are not required for the system itself to function, just
> certain VMs, e.g.:
>
> luks-backup /dev/vg_root/backup-raw none discard,_netdev,nofail
>
> See below for more though.
>
>>> # cat /etc/systemd/system/virtqemud.service.d/override.conf
>>> [Unit]
>>> After=blockdev@dev-mapper-luks\x2dbackup.target blockdev@dev-mapper-luks\x2dvm\x2d01\x2ddisk0.target
>>>
>>> Where we list each of the volumes to be decrypted as blocking the
>>> virtqemud service.
>>>
>>> Does anyone have any better alternatives? My main issue is that it feels
>>> somewhere in between fine-grained and coarse-grained control.
>>>
>>> Ideally I think one would be able to have each individual VM startup
>>> automatically delayed until the devices each used became available, but I
>>> don't see how to do this.
>>
>> I am not sure how libvirt works, but if it runs every VM in a systemd
>> unit, then you could just order the device before that unit, or the
>> unit after the device.
>>
>> Really depends on how libvirt splits things up.
>
> I'm honestly not sure how libvirt works here either. But there seems to be
> this:
>
> # rpm -qf /usr/lib/systemd/system/virtqemud.service
> libvirt-daemon-driver-qemu-9.5.0-7.el9_3.alma.2.x86_64
>
> which gets started:
>
> Jan 25 14:42:58 systemd[1]: Starting Virtualization qemu daemon...
> Jan 25 14:42:58 systemd[1]: Started Virtualization qemu daemon.
>
> Then the qemu-kvm processes end up in their own scope:
>
> ● machine-qemu\x2d1\x2dsrv\x2dmry01.scope - Virtual Machine qemu-1-srv-mry01
>      Loaded: loaded (/run/systemd/transient/machine-qemu\x2d1\x2dsrv\x2dmry01.scope; transient)
>   Transient: yes
>      Active: active (running) since Thu 2024-01-25 14:42:58 PST; 22h ago
>       Tasks: 6 (limit: 16384)
>      Memory: 15.6G
>         CPU: 1h 15min 44.863s
>      CGroup: /machine.slice/machine-qemu\x2d1\x2dsrv\x2dmry01.scope
>              └─libvirt
>                └─9086 /usr/libexec/qemu-kvm -name guest=...
>
>>> Alternatively it seems like one should be able to delay all VM startup
>>> until all volumes in /etc/crypttab were unlocked, rather than having to
>>> specify each one. But I don't see a target for that.
>>
>> This is default behaviour. Anything listed in /etc/crypttab is ordered
>> before cryptsetup.target, which is ordered before sysinit.target,
>> which is ordered before basic.target, which is ordered before regular
>> services.
>
> We are specifying _netdev because they require the network to unlock. This
> I think puts them under remote-cryptsetup.target, and I used to depend on
> that. But with EL9 I'm seeing:
>
> # j -b -u remote-cryptsetup.target -u 'blockdev@dev-mapper-luks\x2dbackup.target' -u clevis-luks-askpass.service --no-hostname
> Jan 25 14:42:12 systemd[1]: Reached target Remote Encrypted Volumes.
> Jan 25 14:42:12 systemd[1]: Started Forward Password Requests to Clevis.
> Jan 25 14:42:48 clevis-luks-askpass[1706]: Unlocked /dev/vg_root/backup-raw (UUID=d6d25a85-2d43-4780-a312-e0e9b2383807) successfully
> Jan 25 14:42:54 systemd[1]: Reached target Block Device Preparation for /dev/mapper/luks-backup.
> Jan 25 14:42:59 systemd[1]: clevis-luks-askpass.service: Deactivated successfully.
>
> # systemctl list-dependencies remote-cryptsetup.target
> remote-cryptsetup.target
> ● ├─systemd-cryptsetup@luks\x2dbackup.service
>
> # j --no-hostname -b -u 'systemd-cryptsetup@luks\x2dbackup.service'
> Jan 25 14:42:12 systemd[1]: Starting Cryptography Setup for luks-backup...
> Jan 25 14:42:42 systemd-cryptsetup[1697]: Set cipher aes, mode xts-plain64, key size 512 bits for device /dev/vg_root/backup-raw.
> Jan 25 14:42:47 systemd-cryptsetup[1697]: Failed to activate with specified passphrase. (Passphrase incorrect?)
> Jan 25 14:42:48 systemd-cryptsetup[1697]: Set cipher aes, mode xts-plain64, key size 512 bits for device /dev/vg_root/backup-raw.
> Jan 25 14:42:54 systemd[1]: Finished Cryptography Setup for luks-backup.
>
> # systemctl show 'systemd-cryptsetup@luks\x2dbackup.service' | grep Type
> Type=oneshot
>
> So, if I'm following things correctly, this doesn't seem right.
> remote-cryptsetup.target depends on systemd-cryptsetup@luks\x2dbackup.service.
> This is a oneshot that is considered started after the main process exits,
> and above is shown as 14:42:54. But we are seeing 'Reached target Remote
> Encrypted Volumes' at 14:42:12.
>
> What am I missing?
>
> systemd-252-18.el9.x86_64

"nofail" encrypted devices are not ordered before (remote-)cryptsetup.target,
so as not to delay startup. The reasoning is, if you do not care whether this
device exists or not, there is no reason to globall
Re: [systemd-devel] Bump: Testing LogFilterPatterns= on user-level services
> > Interpreting arbitrary regexes configured by unpriv code in priv code
> > comes at some risk, because afair constructing them can come at O(2^n)
> > time, i.e. a rogue regex could make us consume unbounded time on
> > processing journal messages.
>
> Which regex engine is used? glibc’s engine is not safe for use with
> untrusted input, but Rust’s is, so that might be an option in the
> future. It isn’t OOM-safe, though.

Rust isn't used. To my knowledge libpcre2 is used (at least it was at the
time the feature landed). That library does not seem to allow setting any
restrictions which would make it possible to parse untrusted input.

For how exactly this works for Rust, see the documentation of the crate:
https://docs.rs/regex/latest/regex/index.html#untrusted-input

So in theory it is certainly possible to allow a regex subset, though I am
not aware of any C library which does this. A simple workaround we have used
in a project I work on is to restrict the set of allowed characters. Doing it
that way, however, rejects more patterns than is strictly necessary.

For the foreseeable future I agree with Lennart that documenting this quirk
should be the most important thing. Afterwards this could be made
configurable somehow, along with showing a message or exiting with a
non-zero code to indicate that this is not allowed.
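The character-restriction workaround described above could look roughly like the following. This is only a sketch of the idea, not what any existing project or systemd does: the allowlist contents, the length cap and the name `is_pattern_allowed` are all assumptions made for illustration. Note that it narrows what a pattern can express rather than proving anything about matching time.

```python
import string

# Hypothetical allowlist: letters, digits and a few simple metacharacters.
# Parentheses, braces, backslashes, '|' and '?' are deliberately absent, so
# groups, counted repetition, backreferences and look-arounds cannot be
# written at all.
ALLOWED_CHARS = frozenset(string.ascii_letters + string.digits + " _-.:/^$*+[]")


def is_pattern_allowed(pattern: str, max_len: int = 256) -> bool:
    """Return True if the pattern is non-empty, short enough, and built
    only from allowlisted characters."""
    if not pattern or len(pattern) > max_len:
        return False
    return all(ch in ALLOWED_CHARS for ch in pattern)


if __name__ == "__main__":
    for candidate in ["^error: .*timeout$", "(a+)+$", r"(\w+)\1"]:
        print(candidate, is_pattern_allowed(candidate))
```

As noted in the mail, this is stricter than theoretically necessary: harmless patterns such as `(?:foo|bar)` are rejected as well.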
Re: [systemd-devel] Delaying VM startup until block devices are available
On 1/26/24 01:21, Lennart Poettering wrote:
> On Do, 25.01.24 16:28, Orion Poplawski (or...@nwra.com) wrote:
>
>> We have various VMs that are backed by LUKS-encrypted LVs. At boot the
>> volumes are decrypted by clevis. The problem we are seeing at the moment
>> is that the VMs are started before the block devices are decrypted. Our
>> current solution is:
>
> We generally wait for all devices listed in /etc/crypttab, unless you
> set noauto or nofail.

We are setting 'nofail', because I don't think I want to fail the boot in
general. They are not required for the system itself to function, just
certain VMs, e.g.:

luks-backup /dev/vg_root/backup-raw none discard,_netdev,nofail

See below for more though.

>> # cat /etc/systemd/system/virtqemud.service.d/override.conf
>> [Unit]
>> After=blockdev@dev-mapper-luks\x2dbackup.target blockdev@dev-mapper-luks\x2dvm\x2d01\x2ddisk0.target
>>
>> Where we list each of the volumes to be decrypted as blocking the
>> virtqemud service.
>>
>> Does anyone have any better alternatives? My main issue is that it feels
>> somewhere in between fine-grained and coarse-grained control.
>>
>> Ideally I think one would be able to have each individual VM startup
>> automatically delayed until the devices each used became available, but I
>> don't see how to do this.
>
> I am not sure how libvirt works, but if it runs every VM in a systemd
> unit, then you could just order the device before that unit, or the
> unit after the device.
>
> Really depends on how libvirt splits things up.

I'm honestly not sure how libvirt works here either. But there seems to be
this:

# rpm -qf /usr/lib/systemd/system/virtqemud.service
libvirt-daemon-driver-qemu-9.5.0-7.el9_3.alma.2.x86_64

which gets started:

Jan 25 14:42:58 systemd[1]: Starting Virtualization qemu daemon...
Jan 25 14:42:58 systemd[1]: Started Virtualization qemu daemon.

Then the qemu-kvm processes end up in their own scope:

● machine-qemu\x2d1\x2dsrv\x2dmry01.scope - Virtual Machine qemu-1-srv-mry01
     Loaded: loaded (/run/systemd/transient/machine-qemu\x2d1\x2dsrv\x2dmry01.scope; transient)
  Transient: yes
     Active: active (running) since Thu 2024-01-25 14:42:58 PST; 22h ago
      Tasks: 6 (limit: 16384)
     Memory: 15.6G
        CPU: 1h 15min 44.863s
     CGroup: /machine.slice/machine-qemu\x2d1\x2dsrv\x2dmry01.scope
             └─libvirt
               └─9086 /usr/libexec/qemu-kvm -name guest=...

>> Alternatively it seems like one should be able to delay all VM startup
>> until all volumes in /etc/crypttab were unlocked, rather than having to
>> specify each one. But I don't see a target for that.
>
> This is default behaviour. Anything listed in /etc/crypttab is ordered
> before cryptsetup.target, which is ordered before sysinit.target,
> which is ordered before basic.target, which is ordered before regular
> services.

We are specifying _netdev because they require the network to unlock. This
I think puts them under remote-cryptsetup.target, and I used to depend on
that. But with EL9 I'm seeing:

# j -b -u remote-cryptsetup.target -u 'blockdev@dev-mapper-luks\x2dbackup.target' -u clevis-luks-askpass.service --no-hostname
Jan 25 14:42:12 systemd[1]: Reached target Remote Encrypted Volumes.
Jan 25 14:42:12 systemd[1]: Started Forward Password Requests to Clevis.
Jan 25 14:42:48 clevis-luks-askpass[1706]: Unlocked /dev/vg_root/backup-raw (UUID=d6d25a85-2d43-4780-a312-e0e9b2383807) successfully
Jan 25 14:42:54 systemd[1]: Reached target Block Device Preparation for /dev/mapper/luks-backup.
Jan 25 14:42:59 systemd[1]: clevis-luks-askpass.service: Deactivated successfully.

# systemctl list-dependencies remote-cryptsetup.target
remote-cryptsetup.target
● ├─systemd-cryptsetup@luks\x2dbackup.service

# j --no-hostname -b -u 'systemd-cryptsetup@luks\x2dbackup.service'
Jan 25 14:42:12 systemd[1]: Starting Cryptography Setup for luks-backup...
Jan 25 14:42:42 systemd-cryptsetup[1697]: Set cipher aes, mode xts-plain64, key size 512 bits for device /dev/vg_root/backup-raw.
Jan 25 14:42:47 systemd-cryptsetup[1697]: Failed to activate with specified passphrase. (Passphrase incorrect?)
Jan 25 14:42:48 systemd-cryptsetup[1697]: Set cipher aes, mode xts-plain64, key size 512 bits for device /dev/vg_root/backup-raw.
Jan 25 14:42:54 systemd[1]: Finished Cryptography Setup for luks-backup.

# systemctl show 'systemd-cryptsetup@luks\x2dbackup.service' | grep Type
Type=oneshot

So, if I'm following things correctly, this doesn't seem right.
remote-cryptsetup.target depends on systemd-cryptsetup@luks\x2dbackup.service.
This is a oneshot that is considered started after the main process exits,
and above is shown as 14:42:54. But we are seeing 'Reached target Remote
Encrypted Volumes' at 14:42:12.

What am I missing?

systemd-252-18.el9.x86_64

--
Orion Poplawski
he/him/his - surely the least important thing about me
Manager of IT Systems 720-772-5637
NWRA, Boulder/CoRA Office FAX
Re: [systemd-devel] Bump: Testing LogFilterPatterns= on user-level services
On Fri, Jan 26, 2024 at 09:11:24AM +0100, Lennart Poettering wrote:
> On Do, 25.01.24 22:29, Farblos (akfkqu.9df...@vodafonemail.de) wrote:
>
> > Hi.
> >
> > I sent below mail some week ago, Barry's reply left me unsure as to
> > whether this would be a bug or not. I still tend to assume that I'm
> > "doing something wrong".
>
> This is currently not supported. The filters are communicated by the
> service manager to journald via xattrs on the cgroups, and journald
> will only consider those for cgroups owned by root, i.e. not on
> cgroups delegated to unpriv users, as is done for systemd --user
> instances.
>
> Interpreting arbitrary regexes configured by unpriv code in priv code
> comes at some risk, because afair constructing them can come at O(2^n)
> time, i.e. a rogue regex could make us consume unbounded time on
> processing journal messages.

Which regex engine is used? glibc’s engine is not safe for use with
untrusted input, but Rust’s is, so that might be an option in the
future. It isn’t OOM-safe, though.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
[systemd-devel] umount fails on system with huge (2TiB) buff/cache
Hello,

with a huge memory buff/cache, a umount can take a very long time:

time umount /mnt/u2
real 10m43.026s
user 0m0.000s
sys 10m27.985s

This then causes the system not to umount the device properly when
rebooting. Watching the serial console:

[ OK ] Deactivated swap /dev/sda4.
[ OK ] Unmounted /boot.
[ OK ] Deactivated swap /dev/disk…0-c79c-4437-922e-21bb8c926cf8.
[ OK ] Stopped File System Check …9-80c6-4dc9-86db-f3a9ec179b3c.
[244820.237512] audit: type=1131 audit(1706268318.490:23423): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-fsck@dev-disk-by\x2duuid-d079cce9\x2d80c6\x2d4dc9\x2d86db\x2df3a9ec179b3c comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ OK ] Stopped File System Check on /dev/mapper/datao1db.
[244820.273498] audit: type=1131 audit(1706268318.526:23424): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-fsck@dev-mapper-datao1db comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ OK ] Unmounted /mnt/hdd1.
[ OK ] Stopped File System Check on /dev/mapper/datao1da.
[244820.409479] audit: type=1131 audit(1706268318.662:23425): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-fsck@dev-mapper-datao1da comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ ***] A stop job is running for /mnt/u2 (2min 55s / no limit)

Note that it states 'no limit', and after some minutes it reports that it
has umounted /mnt/u2:

[ OK ] Unmounted /mnt/u2.
[ OK ] Reached target Unmount All Filesystems.
[ OK ] Stopped File System Check …f-5fca-4f3b-b51d-9bd61b3afed2.
[245090.42] audit: type=1131 audit(1706268588.681:23426): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-fsck@dev-disk-by\x2duuid-37da47ff\x2d5fca\x2d4f3b\x2db51d\x2d9bd61b3afed2 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ OK ] Removed slice Slice /system/systemd-fsck.
[ OK ] Stopped target Preparation for Local File Systems.
         Stopping Device-Mapper Multipath Device Controller...
[ OK ] Stopped Remount Root and Kernel File Systems.
[245090.487727] audit: type=1131 audit(1706268588.741:23427): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-remount-fs comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ OK ] Stopped File System Check on Root Device.
[245090.516713] audit: type=1131 audit(1706268588.770:23428): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-fsck-root comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ OK ] Stopped Create Static Device Nodes in /dev.
[245090.544717] audit: type=1131 audit(1706268588.798:23429): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-tmpfiles-setup-dev comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ OK ] Stopped Device-Mapper Multipath Device Controller.
[245090.575750] audit: type=1131 audit(1706268588.829:23430): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=multipathd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ OK ] Reached target System Shutdown.
[ OK ] Reached target Late Shutdown Services.
[ OK ] Finished System Reboot.
[245090.619726] audit: type=1130 audit(1706268588.873:23431): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-reboot comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[245090.639307] audit: type=1131 audit(1706268588.873:23432): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-reboot comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ OK ] Reached target System Reboot.
[245090.672034] audit: type=1334 audit(1706268588.925:23433): prog-id=60 op=UNLOAD
[245090.679574] audit: type=1334 audit(1706268588.925:23434): prog-id=59 op=UNLOAD
[245090.687105] audit: type=1334 audit(1706268588.927:23435): prog-id=63 op=UNLOAD
[245120.718949] systemd-shutdown[1]: Syncing filesystems and block devices - timed out, issuing SIGKILL to PID 3377584.
[245120.730995] systemd-journald[2863]: Failed to send WATCHDOG=1 notification message: Connection refused
[245120.774387] systemd-journald[2863]: Received SIGTERM from PID 1 (systemd-shutdow).
[245130.775151] systemd-shutdown[1]: Waiting for process: 3377515 (umount), 3377584 ((sd-sync))
[245210.836634] systemd-shutdown[1]: Sending SIGKILL to PID 3377515 (umount).
[245210.857197] systemd-shutdown[1]: Sending SIGKILL to PID 3377584 ((sd-sync)).
[245220.880927] systemd-shutdown[1]: Waiting for process: 3377515 (umount), 3377584 ((sd-sync))
[245280.135213] INFO: task (sd-sync):3377584 blocked for more than 122 seconds.
[245280.155688]
Re: [systemd-devel] Bump: Testing LogFilterPatterns= on user-level services
> Interpreting arbitrary regexes configured by unpriv code in priv code
> comes at some risk, because afair constructing them can come at O(2^n)
> time, i.e. a rogue regex could make us consume unbounded time on
> processing journal messages.
>
> Hence, I wouldn't hold your breath. Unless someone figures out a smart
> way to deal with this it's unlikely to be supported.

I am not sure about construction, but checking for matches with arbitrary
regexes can definitely result in DoS. Restricting the allowed features,
however, alleviates this problem. E.g. the Rust regex crate can check in
O(m*n), with m = regex size and n = input size. It does this by not
allowing (amongst other things) look-arounds or backrefs.

I am not sure how configurable pcre2pattern is, but maybe the supported
features could be restricted for regexes from users.

Nils
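To make the "restrict the allowed features" idea a bit more concrete, here is a rough Python sketch that refuses patterns containing look-arounds or backreferences before they ever reach a regex engine. It is a purely textual and deliberately conservative check (it can reject harmless patterns, e.g. ones with an escaped backslash followed by a digit), the name `uses_unsupported_features` is made up for this example, and it is not how the Rust crate or PCRE2 decide what to accept.

```python
import re

# Textual signatures of the constructs named above. This errs on the side
# of rejecting too much; it does not parse the pattern properly.
_UNSUPPORTED = re.compile(
    r"""
      \(\?<?[=!]        # look-ahead and look-behind openers
    | \\[1-9]           # numbered backreference such as \1
    | \\[gk][{<'0-9]    # \g and \k style backreferences
    | \(\?P=            # Python-style named backreference
    """,
    re.VERBOSE,
)


def uses_unsupported_features(pattern: str) -> bool:
    """Return True if the pattern appears to use look-arounds or backrefs."""
    return _UNSUPPORTED.search(pattern) is not None


if __name__ == "__main__":
    for candidate in [r"^error: \d+ retries$", r"foo(?=bar)", r"(\w+)\s+\1"]:
        print(candidate, uses_unsupported_features(candidate))
```

Rejecting these constructs alone does not make a backtracking engine linear-time; the O(m*n) bound mentioned above comes from the matching algorithm the Rust crate uses, which is only possible because such features are absent.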
Re: [systemd-devel] Delaying VM startup until block devices are available
On Do, 25.01.24 16:28, Orion Poplawski (or...@nwra.com) wrote:

> We have various VMs that are backed by LUKS-encrypted LVs. At boot the
> volumes are decrypted by clevis. The problem we are seeing at the moment
> is that the VMs are started before the block devices are decrypted. Our
> current solution is:

We generally wait for all devices listed in /etc/crypttab, unless you
set noauto or nofail.

> # cat /etc/systemd/system/virtqemud.service.d/override.conf
> [Unit]
> After=blockdev@dev-mapper-luks\x2dbackup.target blockdev@dev-mapper-luks\x2dvm\x2d01\x2ddisk0.target
>
> Where we list each of the volumes to be decrypted as blocking the
> virtqemud service.
>
> Does anyone have any better alternatives? My main issue is that it feels
> somewhere in between fine-grained and coarse-grained control.
>
> Ideally I think one would be able to have each individual VM startup
> automatically delayed until the devices each used became available, but I
> don't see how to do this.

I am not sure how libvirt works, but if it runs every VM in a systemd
unit, then you could just order the device before that unit, or the
unit after the device.

Really depends on how libvirt splits things up.

> Alternatively it seems like one should be able to delay all VM startup
> until all volumes in /etc/crypttab were unlocked, rather than having to
> specify each one. But I don't see a target for that.

This is default behaviour. Anything listed in /etc/crypttab is ordered
before cryptsetup.target, which is ordered before sysinit.target,
which is ordered before basic.target, which is ordered before regular
services.

Lennart

--
Lennart Poettering, Berlin
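For reference, the blockdev@….target names in the drop-in quoted above are the device paths run through systemd's path escaping ("/" becomes "-", most other non-alphanumeric bytes become \xXX). Below is a minimal Python sketch of that escaping for building such an After= line. It assumes plain absolute paths without "." or ".." components, and `systemd_escape_path`, `blockdev_target` and `after_line` are hypothetical helper names; on a real system `systemd-escape --path` is the tool to use.

```python
def systemd_escape_path(path: str) -> str:
    """Approximate systemd's path escaping for simple absolute paths."""
    parts = [p for p in path.split("/") if p]  # drop empty components
    if not parts:
        return "-"  # the root directory escapes to a single dash
    data = "/".join(parts).encode("utf-8")
    out = []
    for i, byte in enumerate(data):
        ch = chr(byte)
        if ch == "/":
            out.append("-")
        elif (ch.isascii() and ch.isalnum()) or ch in ":_" or (ch == "." and i > 0):
            out.append(ch)
        else:
            # Everything else, including '-' itself, becomes \xXX.
            out.append(f"\\x{byte:02x}")
    return "".join(out)


def blockdev_target(device: str) -> str:
    """Name of the blockdev@.target instance for a device node path."""
    return f"blockdev@{systemd_escape_path(device)}.target"


def after_line(devices: list[str]) -> str:
    """Build an After= line listing one blockdev target per device."""
    return "After=" + " ".join(blockdev_target(d) for d in devices)


if __name__ == "__main__":
    print(after_line(["/dev/mapper/luks-backup", "/dev/mapper/luks-vm-01-disk0"]))
```

This only generates the ordering line for a drop-in like the one quoted; it does not change the fact that every volume has to be listed by hand.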
Re: [systemd-devel] Bump: Testing LogFilterPatterns= on user-level services
On Do, 25.01.24 22:29, Farblos (akfkqu.9df...@vodafonemail.de) wrote:

> Hi.
>
> I sent below mail some week ago, Barry's reply left me unsure as to
> whether this would be a bug or not. I still tend to assume that I'm
> "doing something wrong".

This is currently not supported. The filters are communicated by the
service manager to journald via xattrs on the cgroups, and journald
will only consider those for cgroups owned by root, i.e. not on
cgroups delegated to unpriv users, as is done for systemd --user
instances.

Interpreting arbitrary regexes configured by unpriv code in priv code
comes at some risk, because afair constructing them can come at O(2^n)
time, i.e. a rogue regex could make us consume unbounded time on
processing journal messages.

Hence, I wouldn't hold your breath. Unless someone figures out a smart
way to deal with this it's unlikely to be supported.

We should document this, however, I guess. Hence if you file an issue
that would be more than welcome, so that we can keep track of this.

Lennart

--
Lennart Poettering, Berlin