Re: [systemd-devel] How to properly wait for udev?
On Wed, Nov 29, 2023 at 9:48 AM Lennart Poettering wrote:
> > Why doesn't udev flock() every device it is probing?
> > Or asked differently, why is this feature opt-in instead of opt-out?
>
> Some software really doesn't like it if we take BSD locks on their
> devices, hence we don't take it blanket everywhere. And what's more
> important even: for various devices it simply isn't safe to just
> willy-nilly even open them (tape drives and things, which might start
> to pull in a tape if we do). For others we might not be able to even
> open them at all with the wrong flags (for example, because they are
> output only).
>
> Block devices have relatively well defined semantics, there it's
> generally safe to do this, hence we do.

I see.

> Hence, it might be safe for UBI, but for the general case it might
> not be.
>
> That said, would BSD locking even address your issue? If your devices
> are exclusive access things and we first open() them and then flock()
> them, then that's not atomic. So if your test cases open the devices,
> then flock() them, you might still get into conflict with udev because
> it just open()ed the device, but didn't get to call flock() yet.
>
> Doesn't UBI have something like O_EXCL behaviour that grants true
> exclusive access?

It has in-kernel support for that, but not from userspace. I'm currently
evaluating whether it's worth exposing exclusive access to userspace.
Stay tuned. :-)

--
Thanks,
//richard
Re: [systemd-devel] How to properly wait for udev?
On Mon, Nov 27, 2023 at 9:29 AM Lennart Poettering wrote:
> If they conceptually should be considered block device equivalents, we
> might want to extend the udev logic to such UBI devices too. Patches
> welcome.

Why doesn't udev flock() every device it is probing?
Or asked differently, why is this feature opt-in instead of opt-out?

--
Thanks,
//richard
Re: [systemd-devel] How to properly wait for udev?
On Mon, Nov 27, 2023 at 9:29 AM Lennart Poettering wrote:
> On So, 26.11.23 00:39, Richard Weinberger (richard.weinber...@gmail.com) wrote:
>> Hello!
>>
>> After upgrading my main test worker to a recent distribution, the UBI
>> test suite [0] fails at various places with -EBUSY.
>> The reason is that these tests create and remove UBI volumes rapidly.
>> A typical test sequence is as follows:
>> 1. creation of /dev/ubi0_0
>> 2. some exclusive operation, such as atomic update or volume resize, on /dev/ubi0_0
>> 3. removal of /dev/ubi0_0
>>
>> Both steps 2 and 3 can fail with -EBUSY because the udev worker still
>> holds a file descriptor to /dev/ubi0_0.
>
> Hmm, I have no experience with UBI, but are you sure we open that? Why
> would we? Are such devices analyzed by blkid? We generally don't open
> device nodes unless we have a reason to, such as doing blkid on it or
> so.

I think it came via commit dbbf424c8b77 ("rules: ubi mtd - add link to
named partitions (#6750)").

Here is the bpftrace output of a failed mkvol_basic run. The test
created a new volume and tried to delete it via ioctl(). Right after
creating the volume, udev started inspecting it and mkvol_basic was
unable to delete it because the delete operation needs exclusive
ownership.

mkvol_basic(530): open() = /dev/ubi0
mkvol_basic(530): ioctl(cmd: 1074032385)
(udev-worker)(531): open UBI volume 0 = 0x96644533ac80
mkvol_basic(530): open UBI volume 0 = 0xfff0
mkvol_basic(530): failed ioctl() = -16
(udev-worker)(531): closing UBI volume 0x96644533ac80

> What precisely fails for you? The open()? Or some operation on the
> opened fd?

All of that. It depends on the test. Basically every test assumes that
it has full ownership of the volume it has created.

>> FWIW, the problem can also get triggered using UBI's shell utilities
>> if the system is fast enough, e.g.:
>>
>> # ubimkvol -N testv -S 50 -n 0 /dev/ubi0 && ubirmvol -n 0 /dev/ubi0
>> Volume ID 0, size 50 LEBs (793600 bytes, 775.0 KiB), LEB size 15872
>> bytes (15.5 KiB), dynamic, name "testv", alignment 1
>> ubirmvol: error!: cannot UBI remove volume
>> error 16 (Device or resource busy)
>>
>> Instead of adding a retry loop around -EBUSY, I believe the best
>> solution is to add code to wait for udev.
>> For example, having a udev barrier in ubi_mkvol() and ubi_rmvol() [1]
>> seems like a good idea to me.
>
> For block devices we implement this:
> https://systemd.io/BLOCK_DEVICE_LOCKING
>
> I understand UBI aren't block devices though?

Exactly, UBI volumes are character devices, just like MTDs.

> If they conceptually should be considered block device equivalents, we
> might want to extend the udev logic to such UBI devices too. Patches
> welcome.
>
> We provide "udevadm lock" to lock a block device according to this
> scheme from shell scripts.
>
>> What function from libsystemd do you suggest for waiting until udev
>> is done with rule processing?
>> My naive approach, using udev_queue_is_empty() and
>> sd_device_get_is_initialized(), does not resolve all failures so far.
>> Firstly, udev_queue_is_empty() doesn't seem to be exported by
>> libsystemd. I have open-coded it as:
>>
>> static int udev_queue_is_empty(void) {
>>         return access("/run/udev/queue", F_OK) < 0 ?
>>                (errno == ENOENT ? true : -errno) : false;
>> }
>
> This doesn't really work. udev might still process the device in the
> background.

I see.

--
Thanks,
//richard
Re: [systemd-devel] How to properly wait for udev?
On Sun, Nov 26, 2023 at 10:36 PM Mantas Mikulėnas wrote:
> If I remember correctly, udev (recent versions) takes a BSD lock using
> flock(2) while processing the device, and tools are supposed to do the
> same. The flock() call can be set to wait until the lock can be taken.

Hmm, indeed. But it seems to do so only for block devices.
This also explains why none of my syscall tracing showed flock() calls
so far. UBI volumes are character devices.

--
Thanks,
//richard
[systemd-devel] How to properly wait for udev?
Hello!

After upgrading my main test worker to a recent distribution, the UBI
test suite [0] fails at various places with -EBUSY.
The reason is that these tests create and remove UBI volumes rapidly.
A typical test sequence is as follows:
1. creation of /dev/ubi0_0
2. some exclusive operation, such as atomic update or volume resize, on /dev/ubi0_0
3. removal of /dev/ubi0_0

Both steps 2 and 3 can fail with -EBUSY because the udev worker still
holds a file descriptor to /dev/ubi0_0.

FWIW, the problem can also get triggered using UBI's shell utilities if
the system is fast enough, e.g.:

# ubimkvol -N testv -S 50 -n 0 /dev/ubi0 && ubirmvol -n 0 /dev/ubi0
Volume ID 0, size 50 LEBs (793600 bytes, 775.0 KiB), LEB size 15872
bytes (15.5 KiB), dynamic, name "testv", alignment 1
ubirmvol: error!: cannot UBI remove volume
error 16 (Device or resource busy)

Instead of adding a retry loop around -EBUSY, I believe the best
solution is to add code that waits for udev.
For example, having a udev barrier in ubi_mkvol() and ubi_rmvol() [1]
seems like a good idea to me.

What function from libsystemd do you suggest for waiting until udev is
done with rule processing?
My naive approach, using udev_queue_is_empty() and
sd_device_get_is_initialized(), does not resolve all failures so far.
Firstly, udev_queue_is_empty() doesn't seem to be exported by
libsystemd. I have open-coded it as:

static int udev_queue_is_empty(void) {
        return access("/run/udev/queue", F_OK) < 0 ?
               (errno == ENOENT ? true : -errno) : false;
}

Additionally, sd_device_get_is_initialized() sometimes seems to return
true even if the udev worker still has the volume open.

In short, which API do you recommend to ensure that the device my
thread has created is actually usable?

[0]: http://git.infradead.org/mtd-utils.git/tree/HEAD:/tests/ubi-tests
[1]: http://git.infradead.org/mtd-utils.git/blob/HEAD:/lib/libubi.c#l994

--
Thanks,
//richard
Re: [systemd-devel] kdbus refactoring?
On Mon, Nov 9, 2015 at 12:30 AM, Greg KH <gre...@linuxfoundation.org> wrote:
> On Sun, Nov 08, 2015 at 10:39:43PM +0100, Richard Weinberger wrote:
>> On Sun, Nov 8, 2015 at 10:35 PM, Greg KH <gre...@linuxfoundation.org> wrote:
>>> On Sun, Nov 08, 2015 at 10:06:31PM +0100, Richard Weinberger wrote:
>>>> Hi all,
>>>>
>>>> after reading about the removal of kdbus from Rawhide [1] I've
>>>> searched the mailing list archives for more details but didn't find
>>>> anything. So, what are your plans?
>>>>
>>>> [1] https://lists.fedoraproject.org/pipermail/kernel/2015-October/006011.html
>>>
>>> As that link said, based on the result of the code being in Rawhide,
>>> it is now being reworked / redesigned. The result will be posted for
>>> review "when it's ready".
>>
>> If you rework/redesign something you have to know what you want to
>> change. That's why I was asking for the plan...
>
> Since when do people post "plans" or "design documents" on lkml without
> real code? Again, code will be posted when it's ready, like any other
> kernel submission.

Nobody asked for a design document. And yes, people often say what they
do or want to do.

Anyway, heise.de has details from the systemd.conf:
http://www.heise.de/open/meldung/Linux-Kdbus-soll-universeller-werden-Videos-von-Systemd-Konferenz-2910910.html

According to them the plan is making kdbus more universal and less
dbus-specific. Which sounds great.

--
Thanks,
//richard

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] modules in container
On Sun, Nov 8, 2015 at 1:17 PM, arnaud gaboury wrote:
> I am trying to understand how kernel modules are "passed" to an nspawn
> container.

A container must not load any modules, as the kernel is a shared
resource.

--
Thanks,
//richard
Re: [systemd-devel] kdbus refactoring?
On Sun, Nov 8, 2015 at 10:35 PM, Greg KH <gre...@linuxfoundation.org> wrote:
> On Sun, Nov 08, 2015 at 10:06:31PM +0100, Richard Weinberger wrote:
>> Hi all,
>>
>> after reading about the removal of kdbus from Rawhide [1] I've
>> searched the mailing list archives for more details but didn't find
>> anything. So, what are your plans?
>>
>> [1] https://lists.fedoraproject.org/pipermail/kernel/2015-October/006011.html
>
> As that link said, based on the result of the code being in Rawhide, it
> is now being reworked / redesigned. The result will be posted for
> review "when it's ready".

If you rework/redesign something you have to know what you want to
change. That's why I was asking for the plan...

--
Thanks,
//richard
[systemd-devel] kdbus refactoring?
Hi all,

after reading about the removal of kdbus from Rawhide [1] I've searched
the mailing list archives for more details but didn't find anything.
So, what are your plans?

[1] https://lists.fedoraproject.org/pipermail/kernel/2015-October/006011.html

--
Thanks,
//richard
Re: [systemd-devel] nspawn dependencies
Lennart,

On 11.06.2015 at 12:08, Lennart Poettering wrote:
> On Thu, 11.06.15 09:40, Richard Weinberger (richard.weinber...@gmail.com) wrote:
>> Hi!
>>
>> Recent systemd-nspawn seems to support unprivileged containers (user
>> namespaces). That's awesome, thank you guys for working on that!
>
> Well, the name "unprivileged containers" is usually used for the
> concept where you don't need any privs to start and run a container.
> We don't support that, and that's turned off in the kernel of Fedora
> at least, for good reasons.

Depends. Container stuff is so hyped these days that namings change all
the time. :-)
I understand unprivileged containers as containers which do not run as
root, while I don't care whether you have to be root to spawn them.

> We do support user namespaces now, but we require privs on the host to
> set them up. I doubt though that UID namespacing as it is now is
> really that useful: you have to prep your images first, apply a uid
> shift to all file ownership and ACLs of your tree, and this needs to
> be done manually. This makes it pretty hard to deploy, since you
> cannot boot unmodified container images downloaded from the internet
> this way. Also, there is no sane, established scheme for allocating
> UID ranges for the containers automatically. So far uid namespaces
> hence appear mostly like a useless exercise, far from being deployable
> in real life.

What I care about is that root within the container is not the real
root. Hence, what user namespaces do.

>> Maybe you can help me to sort this out: can I run any systemd-enabled
>> distribution using the most current systemd-nspawn? Say, my host is
>> FC22 using systemd-nspawn from git, can it spawn an openSUSE 13.2
>> container which has only systemd v210? Or does the systemd version on
>> the container side have to match the systemd version on the host side?
>
> It generally does not have to match. We try to maintain compatibility
> there (though we make no guarantees -- the stuff is too new).
> That said, newer systemd versions work much better in nspawn than
> older ones, and v210 is pretty old already.

Okay. Thanks for the clarification.

From reading the source it seems like you mount the whole cgroup
hierarchy into the container's mount namespace, rebind
/sys/fs/cgroup/systemd/yadda/.../yadda/ to /sys/fs/cgroup/systemd and
remount some parts read-only.
Does this play well with the cgroup release_agent/notify_on_release
mechanism? Some time ago I played with that and found that always only
systemd on the host side receives the notify. Mostly due to the broken
design of cgroups. ;-\

One more question: how does systemd-nspawn depend on the host systemd?
This machine runs openSUSE with systemd v210. I built current
systemd-nspawn and gave it a try with no luck:

rw@sandpuppy:~/work/systemd (master) sudo ./systemd-nspawn -bD /fc22
Spawning container fc22 on /fc22.
Press ^] three times within 1s to kill container.
Failed to register machine: Unknown method 'CreateMachineWithNetwork' or interface 'org.freedesktop.machine1.Manager'

I suspect I was too naive to think it would work out. :-)

Thanks,
//richard
Re: [systemd-devel] [PATCH] x86: defconfig: Enable CONFIG_FHANDLE
On 01.12.2014 at 01:18, Dave Chinner wrote:
> On Sun, Nov 30, 2014 at 10:08:01PM +0100, Richard Weinberger wrote:
>> On 30.11.2014 at 21:54, Dave Chinner wrote:
>>> On Wed, Nov 26, 2014 at 12:36:52AM +0100, Richard Weinberger wrote:
>>>> systemd has a hard dependency on CONFIG_FHANDLE. If you run systemd
>>>> with CONFIG_FHANDLE=n it will somehow boot but fail to spawn a getty
>>>> or other basic services. As systemd is now used by most x86
>>>> distributions it makes sense to enable this by default and save
>>>> kernel hackers a lot of valuable debugging time.
>>>
>>> The bigger question to me is this: why does systemd need to
>>> store/open by handle rather than just opening paths directly when
>>> needed? This interface is intended for stable, pathless access to
>>> inodes across unmount/mount contexts (e.g. userspace NFS servers,
>>> filesystem backup programs, etc) so I'm curious as to the problem
>>> systemd is solving using this interface. I just can't see the problem
>>> being solved here, and why path based security checks on every open()
>>> aren't necessary...
>>
>> Digging into the systemd source shows that they are using
>> name_to_handle_at() to get the mount id of a given path.
>> From the name_to_handle_at() man page:
>>
>>     The mount_id argument returns an identifier for the filesystem
>>     mount that corresponds to pathname. This corresponds to the first
>>     field in one of the records in /proc/self/mountinfo. Opening the
>>     pathname in the fifth field of that record yields a file
>>     descriptor for the mount point; that file descriptor can be used
>>     in a subsequent call to open_by_handle_at().
>
> So why do they need CONFIG_FHANDLE to get the mount id in userspace?
> Indeed, what do they even need the mount id for? The actual struct
> file_handle result is always ignored. That sounds like a classic case
> of interface abuse, i.e. using an interface for something it was not
> designed or intended for.

CC'ing systemd folks.

Lennart, can you please explain why you need CONFIG_FHANDLE for
systemd? Maybe I'm reading the source horribly wrong.

Thanks,
//richard
Re: [systemd-devel] systemd-cgroups-agent not working in containers
On 28.11.2014 at 06:33, Martin Pitt wrote:
> Hello all,
>
> Cameron Norman [2014-11-27 12:26 -0800]:
>> On Wed, Nov 26, 2014 at 1:29 PM, Richard Weinberger <rich...@nod.at> wrote:
>>> Hi!
>>>
>>> I run a Linux container setup with openSUSE 13.1/2 as guest distro.
>>> After some time containers slow down. An investigation showed that
>>> the containers slow down because a lot of stale user sessions slow
>>> down almost all systemd tools, mostly systemctl.
>>> loginctl reports many thousand sessions, all in state "closing".
>>
>> This sounds similar to an issue that systemd-shim in Debian had.
>> Martin Pitt (helps to maintain systemd in Debian) fixed that issue;
>> he may have some ideas here. I CC'd him.
>
> The problem with systemd-shim under sysvinit or upstart was that shim
> didn't set a cgroup release agent like systemd itself does. Thus the
> cgroups were never cleaned up after all the session processes died.
> (See 1.4 on https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
> for details)
>
> I don't think that SUSE uses systemd-shim; I take it in that setup you
> are running systemd proper on both the host and the guest? Then I
> suggest checking the cgroups that correspond to the closing sessions
> in the container, i.e. /sys/fs/cgroup/systemd/.../session-XX.scope/tasks.
> If there are still processes in it, logind is merely waiting for them
> to exit (or set KillUserProcesses in logind.conf). If they are empty,
> check that /sys/fs/cgroup/systemd/.../session-XX.scope/notify_on_release
> is 1 and that /sys/fs/cgroup/systemd/release_agent is set?

The problem is that within the container the release agent is not
executed. It is executed on the host side.

Lennart, how is this supposed to work? Is the theory of operation that
the host systemd sends "org.freedesktop.systemd1.Agent" "Released" via
dbus into the guest? The guest's systemd definitely does not receive
such a signal.

Thanks,
//richard
Re: [systemd-devel] systemd-cgroups-agent not working in containers
On 26.11.2014 at 22:29, Richard Weinberger wrote:
> Hi!
>
> I run a Linux container setup with openSUSE 13.1/2 as guest distro.
> After some time containers slow down. An investigation showed that the
> containers slow down because a lot of stale user sessions slow down
> almost all systemd tools, mostly systemctl.
> loginctl reports many thousand sessions, all in state "closing".
> The vast majority of these sessions are from crond and ssh logins.
>
> It turned out that sessions are never closed and stay around. The
> control group of such a session contains zero tasks. So I started to
> explore why systemd keeps it. After another few hours of debugging I
> realized that systemd never issues the release signal from cgroups.
> Also calling the release agent by hand did not help, i.e.:
> /usr/lib/systemd/systemd-cgroups-agent /user.slice/user-0.slice/session-c324.scope
>
> Therefore systemd never recognizes that a service/session has no more
> tasks and will close it.
>
> First I thought it was an issue in libvirt combined with user
> namespaces, but I can trigger this also without user namespaces and
> also with systemd-nspawn.
> Tested with systemd 208 and 210 from openSUSE; their packages have all
> known bugfixes.
>
> Any idea where to look further? How do you run the most current
> systemd on your distro?

Btw: I face exactly the same issue on fc21 (guest is fc20).

Thanks,
//richard
[systemd-devel] systemd-cgroups-agent not working in containers
Hi!

I run a Linux container setup with openSUSE 13.1/2 as guest distro.
After some time containers slow down. An investigation showed that the
containers slow down because a lot of stale user sessions slow down
almost all systemd tools, mostly systemctl.
loginctl reports many thousand sessions, all in state "closing".
The vast majority of these sessions are from crond and ssh logins.

It turned out that sessions are never closed and stay around. The
control group of such a session contains zero tasks. So I started to
explore why systemd keeps it. After another few hours of debugging I
realized that systemd never issues the release signal from cgroups.
Also calling the release agent by hand did not help, i.e.:
/usr/lib/systemd/systemd-cgroups-agent /user.slice/user-0.slice/session-c324.scope

Therefore systemd never recognizes that a service/session has no more
tasks and will close it.

First I thought it was an issue in libvirt combined with user
namespaces, but I can trigger this also without user namespaces and
also with systemd-nspawn.
Tested with systemd 208 and 210 from openSUSE; their packages have all
known bugfixes.

Any idea where to look further? How do you run the most current systemd
on your distro?

Thanks,
//richard
Re: [systemd-devel] How to use cgroups within containers?
On 20.10.2014 at 19:27, Lennart Poettering wrote:
> On Mon, 20.10.14 19:16, Richard Weinberger (rich...@nod.at) wrote:
>>> Have you read the link I posted?
>>
>> Sure, I've also been in the room in Düsseldorf while you read it in
>> front of us.
>
> Not that I changed it since then... ;-)
>
>>> Yes, I test systemd inside containers. Daily. Actually it's my
>>> primary way of testing systemd, since it is extremely quick and
>>> allows me to attach from the host with debugging tools...
>>>
>>> As long as you follow the suggestions in the document I linked,
>>> systemd will work without modifications in container managers. At
>>> least libvirt-lxc and nspawn follow these suggestions, not sure
>>> about the other container managers.
>>
>> If I read the source of nspawn correctly, it does not use user
>> namespaces.
>
> Ah, this is about user namespaces? No, I have not played around with
> them so far. Sorry.

Yep. Please have a look at them. There are some pitfalls.

>> libvirt-lxc is currently not sure how to support systemd. So far it
>> bind mounts only the machine-specific part of cgroups into the
>> container. Which is not really nice, but better than exposing the
>> whole hierarchy into the container.
>
> It really should also bind mount the upper parts, but possibly mark
> them read-only (which nspawn currently doesn't do).

Okay. Or maybe cgroup namespaces will help. Let's find out. :)

Thanks,
//richard
Re: [systemd-devel] How to use cgroups within containers?
On 20.10.2014 at 18:51, Lennart Poettering wrote:
> On Mon, 20.10.14 18:49, Richard Weinberger (rich...@nod.at) wrote:
>> On 20.10.2014 at 18:24, Lennart Poettering wrote:
>>> On Fri, 17.10.14 23:35, Richard Weinberger (richard.weinber...@gmail.com) wrote:
>>>> Dear systemd and container folks,
>>>>
>>>> at Plumbers the question was raised of how to provide cgroups to a
>>>> systemd that lives in a container (with user namespaces).
>>>> Due to the GDL train strikes I had to leave very soon and had no
>>>> chance to talk to you in person.
>>>> Was a solution proposed?
>>>> All I want to know is how to provide cgroups in a sane and secure
>>>> way to systemd. :-)
>>>
>>> The cgroups setup systemd requires to be able to run cleanly without
>>> changes in a container is documented here:
>>> http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/
>>>
>>> You have to mount the full cgroupfs hierarchies into the containers,
>>> so that /proc/$PID/cgroup makes sense inside the containers (that
>>> file lists absolute paths...). They can be mounted read-only up to
>>> the container's root, but further down they need to be writable to
>>> the container, so that systemd inside the container can do its job.
>>
>> And what solution do you propose?
>
> Solution? For what problem precisely?

Running systemd inside a Linux container (including user namespaces).
:-)

>> Will cgroup namespaces make systemd finally happy?
>
> I have no idea about cgroup namespaces and what they entail. systemd
> is quite happy already, if you follow the guidelines for container
> managers we put together...

Have you ever used systemd inside a container? Say, LXC or
libvirt-lxc...

Thanks,
//richard
Re: [systemd-devel] How to use cgroups within containers?
On 20.10.2014 at 19:04, Lennart Poettering wrote:
> On Mon, 20.10.14 18:55, Richard Weinberger (rich...@nod.at) wrote:
>> On 20.10.2014 at 18:51, Lennart Poettering wrote:
>>> On Mon, 20.10.14 18:49, Richard Weinberger (rich...@nod.at) wrote:
>>>> On 20.10.2014 at 18:24, Lennart Poettering wrote:
>>>>> On Fri, 17.10.14 23:35, Richard Weinberger (richard.weinber...@gmail.com) wrote:
>>>>>> Dear systemd and container folks,
>>>>>>
>>>>>> at Plumbers the question was raised of how to provide cgroups to
>>>>>> a systemd that lives in a container (with user namespaces).
>>>>>> Due to the GDL train strikes I had to leave very soon and had no
>>>>>> chance to talk to you in person.
>>>>>> Was a solution proposed?
>>>>>> All I want to know is how to provide cgroups in a sane and secure
>>>>>> way to systemd. :-)
>>>>>
>>>>> The cgroups setup systemd requires to be able to run cleanly
>>>>> without changes in a container is documented here:
>>>>> http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/
>>>>>
>>>>> You have to mount the full cgroupfs hierarchies into the
>>>>> containers, so that /proc/$PID/cgroup makes sense inside the
>>>>> containers (that file lists absolute paths...). They can be
>>>>> mounted read-only up to the container's root, but further down
>>>>> they need to be writable to the container, so that systemd inside
>>>>> the container can do its job.
>>>>
>>>> And what solution do you propose?
>>>
>>> Solution? For what problem precisely?
>>
>> Running systemd inside a Linux container (including user
>> namespaces). :-)
>>
>>>> Will cgroup namespaces make systemd finally happy?
>>>
>>> I have no idea about cgroup namespaces and what they entail. systemd
>>> is quite happy already, if you follow the guidelines for container
>>> managers we put together...
>>
>> Have you ever used systemd inside a container? Say, LXC or
>> libvirt-lxc...
>
> Have you read the link I posted?

Sure, I've also been in the room in Düsseldorf while you read it in
front of us.

> Yes, I test systemd inside containers. Daily. Actually it's my primary
> way of testing systemd, since it is extremely quick and allows me to
> attach from the host with debugging tools...
>
> As long as you follow the suggestions in the document I linked,
> systemd will work without modifications in container managers. At
> least libvirt-lxc and nspawn follow these suggestions, not sure about
> the other container managers.

If I read the source of nspawn correctly, it does not use user
namespaces.

libvirt-lxc is currently not sure how to support systemd. So far it
bind mounts only the machine-specific part of cgroups into the
container. Which is not really nice, but better than exposing the whole
hierarchy into the container.
This is why I was asking for cgroup namespaces...

> Also read:
> http://www.freedesktop.org/wiki/Software/systemd/writing-vm-managers/
>
> We have documented this all so nicely, I can only recommend to
> actually take the time to read this. Thanks!

Thanks a lot!
//richard
[systemd-devel] How to use cgroups within containers?
Dear systemd and container folks,

at Plumbers the question was raised of how to provide cgroups to a
systemd that lives in a container (with user namespaces).
Due to the GDL train strikes I had to leave very soon and had no chance
to talk to you in person.
Was a solution proposed?
All I want to know is how to provide cgroups in a sane and secure way
to systemd. :-)

--
Thanks,
//richard
Re: [systemd-devel] How to use cgroups within containers?
...fixing LXC devel mailing list address... :-\

On Fri, Oct 17, 2014 at 11:35 PM, Richard Weinberger
<richard.weinber...@gmail.com> wrote:
> Dear systemd and container folks,
>
> at Plumbers the question was raised of how to provide cgroups to a
> systemd that lives in a container (with user namespaces).
> Due to the GDL train strikes I had to leave very soon and had no
> chance to talk to you in person.
> Was a solution proposed?
> All I want to know is how to provide cgroups in a sane and secure way
> to systemd. :-)
>
> --
> Thanks,
> //richard

--
Thanks,
//richard
Re: [systemd-devel] Should user mode linux register with machined?
Lennart,

On 10.10.2014 at 18:44, Lennart Poettering wrote:
> It's a bit more complex. While UML, qemu, kvm currently don't, LXC,
> systemd-nspawn and libvirt-lxc all do talk directly to machined. (Note
> that LXC and libvirt-lxc are separate codebases; the latter is *not* a
> wrapper around the former.)
>
> So, dunno, it really is up to how you intend UML to be used. If UML
> shall be nice and useful without libvirt, then it's worth doing the
> registration natively, but it's also OK to just leave this to libvirt,
> if that's your primary envisioned usecase...

What is the benefit of this registration? I boot UML and qemu-kvm VMs
all day long without registering them with systemd, so I don't really
know what I'm missing. :-)
But if there is a nice use case I'll happily add the registration to
UML.

Thanks,
//richard
Re: [systemd-devel] Timed out waiting for device dev-disk-by...
On Mon, Sep 29, 2014 at 8:29 PM, Thomas Meyer <tho...@m3y3r.de> wrote:
> Hi,
>
> I get a timeout in the Fedora 21 alpha:
> [ TIME ] Timed out waiting for device
> dev-disk-by\x2duuid-008af19d\x2d2562\x2d49bd\x2d8907\x2d721ea08f3e14.device.
>
> But all devices are available from early kernel start:
> # ls -l /dev/disk/by-uuid/
> total 0
> lrwxrwxrwx 1 root root 11 Sep 29 20:17 008af19d-2562-49bd-8907-721ea08f3e14 -> ../../ubda1
> lrwxrwxrwx 1 root root 11 Sep 29 20:17 e2bffa45-d84f-47bc-81ba-e7a395751fa6 -> ../../ubda3
> lrwxrwxrwx 1 root root 11 Sep 29 20:17 f452f020-a446-41ed-93c0-ee5ce56d6ea4 -> ../../ubda2
>
> It feels like some event notification is lost in the boot process or
> something like this?! What exactly makes the device unit go into the
> state active/plugged?
>
> This is a boot of the Fedora 21 alpha under user mode linux.
> Any ideas what could be wrong here?

Please always CC me and/or the UML mailing list in case of UML-related
issues. I'm very interested in having UML work with systemd.

--
Thanks,
//richard
Re: [systemd-devel] [RESEND][PATCH] systemd-tmpfiles: Fix IGNORE_DIRECTORY_PATH age handling
On 09.09.2014 at 11:09, Richard Weinberger wrote:
> If one has a config like:
>
> d /tmp 1777 root root -
> X /tmp/important_mount
>
> All files below /tmp/important_mount will be deleted, as the
> /tmp/important_mount item will spuriously inherit a max age of 0 from
> /tmp. /tmp has a max age of 0, but age_set is (of course) false.
>
> This also affects the PrivateTmp feature of systemd. All tmp files of
> such services will be deleted unconditionally and can cause service
> failures and data loss.
>
> Fix this by checking ->age_set in the IGNORE_DIRECTORY_PATH logic.
> ---
>  src/tmpfiles/tmpfiles.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/tmpfiles/tmpfiles.c b/src/tmpfiles/tmpfiles.c
> index 79fd0b7..c8d4abb 100644
> --- a/src/tmpfiles/tmpfiles.c
> +++ b/src/tmpfiles/tmpfiles.c
> @@ -1572,7 +1572,7 @@ static int read_config_file(const char *fn, bool ignore_enoent) {
>                  candidate_item = j;
>          }
>
> -        if (candidate_item) {
> +        if (candidate_item && candidate_item->age_set) {
>                  i->age = candidate_item->age;
>                  i->age_set = true;
>          }

Ping?

Is there something horribly wrong with this patch or the submission
itself? Please tell me. :)

Thanks,
//richard
Re: [systemd-devel] Should user mode linux register with machined?
On Tue, Sep 16, 2014 at 5:31 PM, Thomas Meyer tho...@m3y3r.de wrote:
> Hi,
>
> I wrote a small patch for user-mode linux to register with machined by calling CreateMachine. Is this a good idea to do so?
>
> I think machined gives you a nice overview of all running UML instances; you also get the scope unit and the control groups with the above registration to machined. Anything else on the plus side?
>
> The user-mode mailing list did ask why exactly my patch is needed.

The user-mode mailing list is also reading this list BTW. :)

--
Thanks,
//richard
[systemd-devel] [RESEND][PATCH] systemd-tmpfiles: Fix IGNORE_DIRECTORY_PATH age handling
If one has a config like:

d /tmp 1777 root root -
X /tmp/important_mount

All files below /tmp/important_mount will be deleted, as the /tmp/important_mount item will spuriously inherit a max age of 0 from /tmp. /tmp has a max age of 0, but age_set is (of course) false.

This also affects the PrivateTmp feature of systemd: all tmp files of such services will be deleted unconditionally, which can cause service failures and data loss.

Fix this by checking ->age_set in the IGNORE_DIRECTORY_PATH logic.
---
 src/tmpfiles/tmpfiles.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/tmpfiles/tmpfiles.c b/src/tmpfiles/tmpfiles.c
index 79fd0b7..c8d4abb 100644
--- a/src/tmpfiles/tmpfiles.c
+++ b/src/tmpfiles/tmpfiles.c
@@ -1572,7 +1572,7 @@ static int read_config_file(const char *fn, bool ignore_enoent) {
                 candidate_item = j;
         }

-        if (candidate_item) {
+        if (candidate_item && candidate_item->age_set) {
                 i->age = candidate_item->age;
                 i->age_set = true;
         }
--
2.0.1
Re: [systemd-devel] [PATCH] systemd-tmpfiles: Fix IGNORE_DIRECTORY_PATH age handling
On 27.08.2014 14:55, Richard Weinberger wrote:
> If one has a config like:
>
> d /tmp 1777 root root -
> X /tmp/important_mount
>
> All files below /tmp/important_mount will be deleted, as the /tmp/important_mount item will spuriously inherit a max age of 0 from /tmp. /tmp has a max age of 0, but age_set is (of course) false.
>
> Fix this by checking ->age_set in the IGNORE_DIRECTORY_PATH logic.
>
> Signed-off-by: Richard Weinberger rich...@nod.at
> ---
>  src/tmpfiles/tmpfiles.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/tmpfiles/tmpfiles.c b/src/tmpfiles/tmpfiles.c
> index 79fd0b7..c8d4abb 100644
> --- a/src/tmpfiles/tmpfiles.c
> +++ b/src/tmpfiles/tmpfiles.c
> @@ -1572,7 +1572,7 @@ static int read_config_file(const char *fn, bool ignore_enoent) {
>                  candidate_item = j;
>          }
>
> -        if (candidate_item) {
> +        if (candidate_item && candidate_item->age_set) {
>                  i->age = candidate_item->age;
>                  i->age_set = true;
>          }

Ping? Would be nice to see this merged; it fixes a nasty issue with PrivateTmp=yes.

Without this patch, all files in a private /tmp and /var/tmp will get deleted unconditionally by systemd-tmpfiles if you configure it *not* to delete anything in /tmp and /var/tmp, i.e.:

d /tmp 1777 root root -
d /var/tmp 1777 root root -

This is the default on openSUSE.

Thanks,
//richard
[systemd-devel] [PATCH] systemd-tmpfiles: Fix IGNORE_DIRECTORY_PATH age handling
If one has a config like:

d /tmp 1777 root root -
X /tmp/important_mount

All files below /tmp/important_mount will be deleted, as the /tmp/important_mount item will spuriously inherit a max age of 0 from /tmp. /tmp has a max age of 0, but age_set is (of course) false.

Fix this by checking ->age_set in the IGNORE_DIRECTORY_PATH logic.

Signed-off-by: Richard Weinberger rich...@nod.at
---
 src/tmpfiles/tmpfiles.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/tmpfiles/tmpfiles.c b/src/tmpfiles/tmpfiles.c
index 79fd0b7..c8d4abb 100644
--- a/src/tmpfiles/tmpfiles.c
+++ b/src/tmpfiles/tmpfiles.c
@@ -1572,7 +1572,7 @@ static int read_config_file(const char *fn, bool ignore_enoent) {
                 candidate_item = j;
         }

-        if (candidate_item) {
+        if (candidate_item && candidate_item->age_set) {
                 i->age = candidate_item->age;
                 i->age_set = true;
         }
--
2.0.1
[systemd-devel] Device units and LXC
Hi!

As I understand systemd, device units depend hard on udev. Units like network@.service contain lines like:

BindsTo=sys-subsystem-net-devices-%i.device

Within a Linux container this is a problem because there is no udev. There, systemd never receives an event for the device, and the device unit never shows up.

I'm wondering what we can do to improve the situation. At least for Ethernet devices, systemd could just use sysfs to find out whether the device is present or not.

What do you think?

Thanks,
//richard
Re: [systemd-devel] [PATCH] [RFC] Ignore OOMScoreAdjust in Linux containers
On 09.04.2014 19:19, Tom Gundersen wrote:
> On Mon, Apr 7, 2014 at 9:47 PM, Richard Weinberger rich...@nod.at wrote:
>> At least LXC does not allow the container root to change the OOM score adjust value.
>>
>> Signed-off-by: Richard Weinberger rich...@nod.at
>> ---
>> Hi!
>>
>> Within Linux containers we cannot use OOMScoreAdjust nor CapabilityBoundingSet (and maybe more related settings). This patch tells systemd to ignore OOMScoreAdjust if it detects a container. Are you fine with such a change? Otherwise regular distros need a lot of changes in their .service files to make them work within LXC.
>>
>> As detect_virtualization() detects more than LXC, we have to find out whether OOMScoreAdjust cannot be used on OpenVZ and other containers as well. I'd volunteer to identify all settings and send patches...
>
> Hm, is there a fundamental reason why this is not possible in containers in general, or is it simply an LXC restriction?
>
> Regardless, would it not be best to simply degrade gracefully and ignore the setting with a warning if it fails? See the comment Lennart just posted on the recent PrivateNetwork= patch. This sounds like a very similar situation.

Writing to oom_score_adj is disallowed by design within user namespaces. Please see:
https://lkml.org/lkml/2013/4/25/596

I'm also fine with ignoring OOMScoreAdjust if it fails. All I want is a painless Linux userspace on top of systemd within my containers. :-)

Thanks,
//richard
Re: [systemd-devel] [PATCH] [RFC] Ignore OOMScoreAdjust in Linux containers
On 09.04.2014 20:28, Tom Gundersen wrote:
> On Wed, Apr 9, 2014 at 7:39 PM, Richard Weinberger rich...@nod.at wrote:
>> On 09.04.2014 19:19, Tom Gundersen wrote:
>>> On Mon, Apr 7, 2014 at 9:47 PM, Richard Weinberger rich...@nod.at wrote:
>>>> At least LXC does not allow the container root to change the OOM score adjust value.
>>>>
>>>> Within Linux containers we cannot use OOMScoreAdjust nor CapabilityBoundingSet (and maybe more related settings). This patch tells systemd to ignore OOMScoreAdjust if it detects a container.
>>>
>>> Hm, is there a fundamental reason why this is not possible in containers in general, or is it simply an LXC restriction? Regardless, would it not be best to simply degrade gracefully and ignore the setting with a warning if it fails? See the comment Lennart just posted on the recent PrivateNetwork= patch. This sounds like a very similar situation.
>>
>> Writing to oom_score_adj is disallowed by design within user namespaces. Please see:
>> https://lkml.org/lkml/2013/4/25/596
>
> But I guess we still want to use this in containers that don't use user namespaces.

Containers without user namespaces and a uid 0 user are horribly broken and insecure. They will hopefully die soon.

>> I'm also fine with ignoring OOMScoreAdjust if it fails.
>
> Sounds like the right way (might be other things like this too I suppose).

Okay, I'll send patches for OOMScoreAdjust and other settings to ignore failures. This way systemd can also support containers without user namespaces, no matter how useful these are. (hello docker.io folks! ;))

Thanks,
//richard
[systemd-devel] [PATCH] [RFC] Ignore OOMScoreAdjust in Linux containers
At least LXC does not allow the container root to change the OOM score adjust value.

Signed-off-by: Richard Weinberger rich...@nod.at
---
Hi!

Within Linux containers we cannot use OOMScoreAdjust nor CapabilityBoundingSet (and maybe more related settings). This patch tells systemd to ignore OOMScoreAdjust if it detects a container. Are you fine with such a change? Otherwise regular distros need a lot of changes in their .service files to make them work within LXC.

As detect_virtualization() detects more than LXC, we have to find out whether OOMScoreAdjust cannot be used on OpenVZ and other containers as well. I'd volunteer to identify all settings and send patches...

Thanks,
//richard
---
 src/core/load-fragment.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/core/load-fragment.c b/src/core/load-fragment.c
index c604f90..13f6107 100644
--- a/src/core/load-fragment.c
+++ b/src/core/load-fragment.c
@@ -59,6 +59,7 @@
 #include "bus-error.h"
 #include "errno-list.h"
 #include "af-list.h"
+#include "virt.h"

 #ifdef HAVE_SECCOMP
 #include "seccomp-util.h"
@@ -423,6 +424,12 @@ int config_parse_exec_oom_score_adjust(const char* unit,
         assert(rvalue);
         assert(data);

+        if (detect_virtualization(NULL) == VIRTUALIZATION_CONTAINER) {
+                log_syntax(unit, LOG_ERR, filename, line, EPERM,
+                           "Setting the OOM score adjust value is not allowed within containers");
+                return 0;
+        }
+
         r = safe_atoi(rvalue, &oa);
         if (r < 0) {
                 log_syntax(unit, LOG_ERR, filename, line, -r,
--
1.8.4.2
[systemd-devel] Howto run systemd within a linux container
Hi!

We're heavily using Linux containers in our production environment. As modern Linux distributions move to systemd, we have to make sure that systemd works within our containers. Sadly, we're facing issues with cgroups.

Our testbed consists of openSUSE 13.1 with Linux 3.13.1 and libvirt 1.2.1.

In a plain setup systemd stops immediately because it is unable to create the cgroup hierarchy, mostly because the container uid 0 is in a user namespace and has no rights to do that.

Bootlog:
---cut---
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX -IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'lxc-libvirt'.
Welcome to openSUSE 13.1 (Bottle) (x86_64)!
Set hostname to test1.
Failed to install release agent, ignoring: No such file or directory
Failed to create root cgroup hierarchy: Permission denied
Failed to allocate manager object: Permission denied
---cut---

Next try: trigger the Ingo Molnar branch by mounting a tmpfs to /sys/fs/cgroup/, upon which systemd segfaults. Bug filed at https://bugs.freedesktop.org/show_bug.cgi?id=74589

Bootlog:
---cut---
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX -IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'lxc-libvirt'.
Welcome to openSUSE 13.1 (Bottle) (x86_64)!
Set hostname to test1.
No control group support available, not creating root group.
Cannot add dependency job for unit getty@console.service, ignoring: Unit getty@console.service failed to load: Invalid argument.
Cannot add dependency job for unit display-manager.service, ignoring: Unit display-manager.service failed to load: No such file or directory.
[ OK ] Listening on Syslog Socket.
[ OK ] Reached target Remote File Systems (Pre).
[ OK ] Reached target Remote File Systems.
[ OK ] Listening on Delayed Shutdown Socket.
[ OK ] Listening on /dev/initctl Compatibility Named Pipe.
[ OK ] Reached target Encrypted Volumes.
[ OK ] Listening on Journal Socket.
Starting Create dynamic rule for /dev/root link...
Caught SEGV, dumped core as pid 11.
Freezing execution.
---cut---

Next try: fool systemd by mounting a tmpfs to /sys/fs/cgroup/systemd/. This seems to work. openSUSE boots, and I can start/stop services... Shutdown hangs forever; I had no time to investigate so far.

But is this tmpfs hack the correct way to run systemd in a container? I really don't think so. Can someone please explain to me how to achieve this in a sane and unhacky way?

--
Thanks,
//richard
Re: [systemd-devel] Howto run systemd within a linux container
On Thu, Feb 6, 2014 at 1:08 AM, Kay Sievers k...@vrfy.org wrote:
> On Thu, Feb 6, 2014 at 12:56 AM, Lennart Poettering lenn...@poettering.net wrote:
>> On Wed, 05.02.14 23:44, Richard Weinberger (richard.weinber...@gmail.com) wrote:
>>> We're heavily using Linux containers in our production environment. As modern Linux distributions move to systemd, we have to make sure that systemd works within our containers. Sadly we're facing issues with cgroups. Our testbed consists of openSUSE 13.1 with Linux 3.13.1 and libvirt 1.2.1. In a plain setup systemd stops immediately because it is unable to create the cgroup hierarchy, mostly because the container uid 0 is in a user namespace and has no rights to do that.
>>
>> Make sure to either make the name=systemd cgroup hierarchy available in the container, or to grant it CAP_SYS_MOUNT so that it can do it on its own. Make sure that your container manager sets up things like described here:
>> http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/
>>
>>> Next try: trigger the Ingo Molnar branch by mounting a tmpfs to /sys/fs/cgroup/, upon which systemd segfaults. Bug filed at https://bugs.freedesktop.org/show_bug.cgi?id=74589
>>
>> Yeah, this is never tested, and likely to break all the time. We probably should remove this feature, since we cannot guarantee it works, and apparently nobody has noticed it to be broken for a while.
>
> Yeah, we should remove it now. We will never really be able to support that; init=/bin/sh is probably the better option than a systemd going crazy or crashing.
>
>>> Starting Create dynamic rule for /dev/root link...
>>
>> This is so bogus that it hurts ^^^
>
> Seems some distros cannot let bad ideas die. :)
>
>>> But is this tmpfs hack the correct way to run systemd in a container? I really don't think so.
>
> Nope. Please mount tmpfs to /sys/fs/cgroup, and then the name=systemd cgroup hierarchy to /sys/fs/cgroup/systemd, see above.

User namespaces are involved and uid 0 is mapped to an ordinary user. Never tried, but it might be that the subtree in the container needs to be chown()ed to the mapped user. As discussed on IRC, I'll try that tomorrow. :-)

--
Thanks,
//richard