Re: [systemd-devel] How to properly wait for udev?

2023-11-30 Thread Richard Weinberger
On Wed, Nov 29, 2023 at 9:48 AM Lennart Poettering
 wrote:
> > Why doesn't udev flock() every device it is probing?
> > Or asked differently, why is this feature opt-in instead of opt-out?
>
> Some software really doesn't like it if we take BSD locks on their
> devices, hence we don't take it blanket everywhere. And what's more
> important even: for various devices it simply isn't safe to just
> willy-nilly even open them (tape drivers and things, which might start
> to pull in a tape if we do). For others we might not be able to even
> open thing at all with the wrong flags (for example, because they are
> output only).
>
> Block devices have relatively well defined semantics, there it's
> generally safe to do this, hence we do.

I see.

> Hence, it might be safe for UBI, but for the general case it might
> not be.
>
> That said, would BSD locking even address your issue? If your devices
> are exclusive-access things and we first open() them and then flock()
> them, then that's not atomic. So if your test cases open the devices,
> then flock() them, you might still get into a conflict with udev because it
> just open()ed the device, but didn't get to call flock() yet.
>
> Doesn't UBI have something like O_EXCL-behaviour that grants true
> exclusive access?

It has in-kernel support for that, but not from userspace. I'm currently
evaluating whether it's worth exposing exclusive access to userspace.

Stay tuned. :-)

-- 
Thanks,
//richard


Re: [systemd-devel] How to properly wait for udev?

2023-11-27 Thread Richard Weinberger
On Mon, Nov 27, 2023 at 9:29 AM Lennart Poettering
 wrote:
> If they conceptually should be considered block device equivalents, we
> might want to extend the udev logic to such UBI devices too.  Patches
> welcome.

Why doesn't udev flock() every device it is probing?
Or asked differently, why is this feature opt-in instead of opt-out?

-- 
Thanks,
//richard


Re: [systemd-devel] How to properly wait for udev?

2023-11-27 Thread Richard Weinberger
On Mon, Nov 27, 2023 at 9:29 AM Lennart Poettering
 wrote:
> On Sun, 26.11.23 00:39, Richard Weinberger (richard.weinber...@gmail.com)
> wrote:
>
> > Hello!
> >
> > After upgrading my main test worker to a recent distribution, the UBI
> > test suite [0] fails at various places with -EBUSY.
> > The reason is that these tests create and remove UBI volumes rapidly.
> > A typical test sequence is as follows:
> > 1. creation of /dev/ubi0_0
> > 2. some exclusive operation, such as atomic update or volume resize on
> > /dev/ubi0_0
> > 3. removal of /dev/ubi0_0
> >
> > Both steps 2 and 3 can fail with -EBUSY because the udev worker still
> > holds a file descriptor to /dev/ubi0_0.
>
> Hmm, I have no experience with UBI, but are you sure we open that? why
> would we? are such devices analyzed by blkid? We generally don't open
> device nodes unless we have a reason to, such as doing blkid on it or
> so.

I think it came via commit:
dbbf424c8b77 ("rules: ubi mtd - add link to named partitions (#6750)")

Here is the bpftrace output of a failed mkvol_basic run.
The test created a new volume and tried to delete it via ioctl().
Right after creating the volume, udev started inspecting it and mkvol_basic
was unable to delete it because the delete operation needs exclusive ownership.

mkvol_basic(530):   open() = /dev/ubi0
mkvol_basic(530):   ioctl(cmd: 1074032385)
(udev-worker)(531): open UBI volume 0 = 0x96644533ac80
mkvol_basic(530):   open UBI volume 0 = 0xfff0
mkvol_basic(530):   failed ioctl() = -16
(udev-worker)(531): closing UBI volume 0x96644533ac80

> What precisely fails for you? the open()? or some operation on the
> opened fd?

All of that. It depends on the test.
Basically every test assumes that it has full ownership of the
volume it has created.

> >
> > FWIW, the problem can also get triggered using UBI's shell utilities
> > if the system is fast enough, e.g.
> > # ubimkvol -N testv -S 50 -n 0 /dev/ubi0 && ubirmvol -n 0 /dev/ubi0
> > Volume ID 0, size 50 LEBs (793600 bytes, 775.0 KiB), LEB size 15872
> > bytes (15.5 KiB), dynamic, name "testv", alignment 1
> > ubirmvol: error!: cannot UBI remove volume
> >  error 16 (Device or resource busy)
> >
> > Instead of adding a retry loop around -EBUSY, I believe the best
> > solution is to add code to wait for udev.
> > For example, having a udev barrier in ubi_mkvol() and ubi_rmvol() [1]
> > seems like a good idea to me.
>
> For block devices we implement this:
>
> https://systemd.io/BLOCK_DEVICE_LOCKING
>
> I understand UBI aren't block devices though?

Exactly, UBI volumes are character devices, just like MTDs.

> If they conceptually should be considered block device equivalents, we
> might want to extend the udev logic to such UBI devices too.  Patches
> welcome.
>
> We provide "udevadm lock" to lock a block device according to this
> scheme from shell scripts.
>
> > What function from libsystemd do you suggest for waiting until udev is
> > done with rule processing?
> > My naive approach, using udev_queue_is_empty() and
> > sd_device_get_is_initialized(), does not resolve all failures so far.
> > Firstly, udev_queue_is_empty() doesn't seem to be exported by
> > libsystemd. I have open-coded it as:
> > static int udev_queue_is_empty(void) {
> >return access("/run/udev/queue", F_OK) < 0 ?
> >(errno == ENOENT ? true : -errno) : false;
> > }
>
> This doesn't really work. udev might still process the device in the
> background.

I see.

-- 
Thanks,
//richard


Re: [systemd-devel] How to properly wait for udev?

2023-11-26 Thread Richard Weinberger
On Sun, Nov 26, 2023 at 10:36 PM Mantas Mikulėnas  wrote:
>
> If I remember correctly, udev (recent versions) takes a BSD lock using 
> flock(2) while processing the device, and tools are supposed to do the same. 
> The flock() call can be set to wait until the lock can be taken.

Hmm, indeed. But it seems to do so only for block devices.
This also explains why none of my syscall tracing showed flock() calls so far.
UBI volumes are character devices.

--
Thanks,
//richard


[systemd-devel] How to properly wait for udev?

2023-11-25 Thread Richard Weinberger
Hello!

After upgrading my main test worker to a recent distribution, the UBI
test suite [0] fails at various places with -EBUSY.
The reason is that these tests create and remove UBI volumes rapidly.
A typical test sequence is as follows:
1. creation of /dev/ubi0_0
2. some exclusive operation, such as atomic update or volume resize on
/dev/ubi0_0
3. removal of /dev/ubi0_0

Both steps 2 and 3 can fail with -EBUSY because the udev worker still
holds a file descriptor to /dev/ubi0_0.

FWIW, the problem can also get triggered using UBI's shell utilities
if the system is fast enough, e.g.
# ubimkvol -N testv -S 50 -n 0 /dev/ubi0 && ubirmvol -n 0 /dev/ubi0
Volume ID 0, size 50 LEBs (793600 bytes, 775.0 KiB), LEB size 15872
bytes (15.5 KiB), dynamic, name "testv", alignment 1
ubirmvol: error!: cannot UBI remove volume
 error 16 (Device or resource busy)

Instead of adding a retry loop around -EBUSY, I believe the best
solution is to add code to wait for udev.
For example, having a udev barrier in ubi_mkvol() and ubi_rmvol() [1]
seems like a good idea to me.

What function from libsystemd do you suggest for waiting until udev is
done with rule processing?
My naive approach, using udev_queue_is_empty() and
sd_device_get_is_initialized(), does not resolve all failures so far.
Firstly, udev_queue_is_empty() doesn't seem to be exported by
libsystemd. I have open-coded it as:
static int udev_queue_is_empty(void) {
   return access("/run/udev/queue", F_OK) < 0 ?
   (errno == ENOENT ? true : -errno) : false;
}
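To make the open-coded check above concrete, here is a self-contained variant with a timeout-bounded poll loop. The helper names and the polling interval are ours, and, as Lennart points out in his reply, an empty queue still does not guarantee that a worker has closed its fd to the device:

```c
#include <errno.h>
#include <unistd.h>

/* Returns 1 if the udev queue file is absent (queue empty),
 * 0 if it exists, or a negative errno on other errors. */
static int queue_is_empty(const char *queue_path) {
    if (access(queue_path, F_OK) < 0)
        return errno == ENOENT ? 1 : -errno;
    return 0;
}

/* Poll until the queue drains or timeout_ms elapses. Sketch only:
 * this does not synchronize against a worker that has already
 * dequeued the event but still holds the device open. */
static int wait_for_udev_queue(const char *queue_path, unsigned timeout_ms) {
    for (unsigned waited = 0;; waited += 10) {
        int r = queue_is_empty(queue_path);
        if (r != 0)
            return r;               /* 1 = empty, <0 = error */
        if (waited >= timeout_ms)
            return -ETIMEDOUT;
        usleep(10 * 1000);          /* 10 ms between probes */
    }
}
```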

Additionally, sd_device_get_is_initialized() sometimes seems to return
true even if the udev worker still has the volume open.
In short, which API do you recommend to ensure that the device my
thread has created is actually usable?

[0]: http://git.infradead.org/mtd-utils.git/tree/HEAD:/tests/ubi-tests
[1]: http://git.infradead.org/mtd-utils.git/blob/HEAD:/lib/libubi.c#l994

-- 
Thanks,
//richard


Re: [systemd-devel] kdbus refactoring?

2015-11-09 Thread Richard Weinberger
On Mon, Nov 9, 2015 at 12:30 AM, Greg KH <gre...@linuxfoundation.org> wrote:
> On Sun, Nov 08, 2015 at 10:39:43PM +0100, Richard Weinberger wrote:
>> On Sun, Nov 8, 2015 at 10:35 PM, Greg KH <gre...@linuxfoundation.org> wrote:
>> > On Sun, Nov 08, 2015 at 10:06:31PM +0100, Richard Weinberger wrote:
>> >> Hi all,
>> >>
>> >> after reading about the removal of kdbus from Rawhide [1], I searched
>> >> the mailing list archives for more details but didn't find anything.
>> >> So, what are your plans?
>> >>
>> >> [1] 
>> >> https://lists.fedoraproject.org/pipermail/kernel/2015-October/006011.html
>> >
>> > As that link said, based on the result of the code being in Rawhide, it
>> > is now being reworked / redesigned.  The result will be posted for
>> > review "when it's ready".
>>
>> If you rework/redesign something you have to know what you want to change.
>> That's why I was asking for the plan...
>
> Since when do people post "plans" or "design documents" on lkml without
> real code?  Again, code will be posted when it's ready, like any other
> kernel submission.

Nobody asked for a design document.
And yes, people often say what they do or want to do.
Anyway, heise.de has details from the systemd.conf:
http://www.heise.de/open/meldung/Linux-Kdbus-soll-universeller-werden-Videos-von-Systemd-Konferenz-2910910.html

According to them, the plan is to make kdbus more universal and less
dbus-specific.
Which sounds great.

-- 
Thanks,
//richard
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] modules in container

2015-11-08 Thread Richard Weinberger
On Sun, Nov 8, 2015 at 1:17 PM, arnaud gaboury  wrote:
> I am trying to understand how kernel modules are "passed" to nspawn container.

A container must not load any module as the kernel is a shared resource.

-- 
Thanks,
//richard


Re: [systemd-devel] kdbus refactoring?

2015-11-08 Thread Richard Weinberger
On Sun, Nov 8, 2015 at 10:35 PM, Greg KH <gre...@linuxfoundation.org> wrote:
> On Sun, Nov 08, 2015 at 10:06:31PM +0100, Richard Weinberger wrote:
>> Hi all,
>>
>> after reading about the removal of kdbus from Rawhide [1], I searched
>> the mailing list archives for more details but didn't find anything.
>> So, what are your plans?
>>
>> [1] https://lists.fedoraproject.org/pipermail/kernel/2015-October/006011.html
>
> As that link said, based on the result of the code being in Rawhide, it
> is now being reworked / redesigned.  The result will be posted for
> review "when it's ready".

If you rework/redesign something you have to know what you want to change.
That's why I was asking for the plan...

-- 
Thanks,
//richard


[systemd-devel] kdbus refactoring?

2015-11-08 Thread Richard Weinberger
Hi all,

after reading about the removal of kdbus from Rawhide [1], I searched
the mailing list archives for more details but didn't find anything.
So, what are your plans?

[1] https://lists.fedoraproject.org/pipermail/kernel/2015-October/006011.html

-- 
Thanks,
//richard


Re: [systemd-devel] nspawn dependencies

2015-06-11 Thread Richard Weinberger
Lennart,

On 11.06.2015 at 12:08, Lennart Poettering wrote:
 On Thu, 11.06.15 09:40, Richard Weinberger (richard.weinber...@gmail.com) 
 wrote:
 
 Hi!

 Recent systemd-nspawn seems to support unprivileged containers (user
 namespaces). That's awesome, thank you guys for working on that!
 
 Well, the name "unprivileged containers" is usually used for the
 concept where you don't need any privs to start and run a
 container. We don't support that, and it's turned off in the kernel
 of Fedora at least, for good reasons.

Depends. Container tech is so hyped these days that the naming changes
all the time. :-)
I understand unprivileged containers as containers which do not run as root,
and I don't care whether you have to be root to spawn them.

 We do support user namespaces now, but we require privs on the host to
 set them up. I doubt that UID namespacing as it is now is
 really that useful though: you have to prep your images first, apply a
 uid shift to all file ownership and ACLs of your tree, and this needs
 to be done manually. This makes it pretty hard to deploy, since you
 cannot boot unmodified container images you download from the
 internet this way. Also, there is no sane, established scheme for
 allocating UID ranges for the containers automatically. So far, uid
 namespaces hence appear mostly like a useless exercise, far from
 being deployable in real life.

What I care about is that root within the container is not the real root.
Hence, what user namespaces do.

 Maybe you can help me so sort this out, can I run any systemd enabled
 distribution
 using the most current systemd-nspawn?
 Say, my host is FC22 using systemd-nspawn from git, can it spawn an
 openSUSE 13.2 container which has only systemd v210?

 Or has the systemd version on the container side to match the systemd
 version on the host side?
 
 It generally does not have to match. We try to maintain compatibility
 there (though we make no guarantees -- the stuff is too new). That
 said, newer systemd versions work much better in nspawn than older
 ones, and v210 is pretty old already.

Okay. Thanks for the clarification.

From reading the source it seems like you mount the whole cgroup hierarchy
into the container's mount namespace, rebind
/sys/fs/cgroup/systemd/yadda/.../yadda/ to /sys/fs/cgroup/systemd,
and remount some parts read-only.
Does this play well with the cgroup release_agent/notify_on_release mechanism?
Some time ago I played with that and found that only systemd on the
host side ever receives the notification.
Mostly due to the broken design of cgroups. ;-\

One more question: how does systemd-nspawn depend on the host systemd?
This machine runs openSUSE with systemd v210. I built current systemd-nspawn
and gave it a try with no luck.

rw@sandpuppy:~/work/systemd (master) sudo ./systemd-nspawn -bD /fc22
Spawning container fc22 on /fc22.
Press ^] three times within 1s to kill container.
Failed to register machine: Unknown method 'CreateMachineWithNetwork' or 
interface 'org.freedesktop.machine1.Manager'

I suspect I was too naive to think it would work out. :-)

Thanks,
//richard


Re: [systemd-devel] [PATCH] x86: defconfig: Enable CONFIG_FHANDLE

2014-11-30 Thread Richard Weinberger
On 01.12.2014 at 01:18, Dave Chinner wrote:
 On Sun, Nov 30, 2014 at 10:08:01PM +0100, Richard Weinberger wrote:
 On 30.11.2014 at 21:54, Dave Chinner wrote:
 On Wed, Nov 26, 2014 at 12:36:52AM +0100, Richard Weinberger wrote:
 systemd has a hard dependency on CONFIG_FHANDLE.
 If you run systemd with CONFIG_FHANDLE=n it will somehow
 boot but fail to spawn a getty or other basic services.
 As systemd is now used by most x86 distributions it
 makes sense to enabled this by default and save kernel
 hackers a lot of value debugging time.

 The bigger question to me is this: why does systemd need to
 store/open by handle rather than just opening paths directly when
 needed? This interface is intended for stable, pathless access to
 inodes across unmount/mount contexts (e.g. userspace NFS servers,
 filesystem backup programs, etc) so I'm curious as to the problem
 systemd is solving using this interface. I just can't see the
 problem being solved here, and why path based security checks on
 every open() aren't necessary...

 Digging into the systemd source shows that they are using name_to_handle_at()
 to get the mount id of a given path.
 
 From the name_to_handle_at() man page:
 
 The mount_id argument returns an identifier for the filesystem
 mount that corresponds to pathname.  This corresponds to the
 first  field in  one  of  the records in /proc/self/mountinfo.
 Opening the pathname in the fifth field of that record yields a
 file descriptor for the mount point; that file descriptor can be
 used in a subsequent call to open_by_handle_at().
 
 So why do they need CONFIG_FHANDLE to get the mount id in userspace?
 Indeed, what do they even need the mount id for?
 
 The actual struct file_handle result is always ignored.
 
 That sounds like a classic case of interface abuse. i.e. using an
 interface for something it was not designed or intended for

CC'ing systemd folks.

Lennart, can you please explain why you need CONFIG_FHANDLE for systemd?
Maybe I'm reading the source horribly wrong.
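For reference, the trick being debated can be sketched as follows: name_to_handle_at(2) is called with a zero-sized handle buffer, so it fails with EOVERFLOW, yet (on reasonably recent kernels) still fills in the mount ID. The helper name is ours:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/* Fetch only the mount ID of `path`. handle_bytes == 0 makes the
 * call fail with EOVERFLOW, but mount_id is still filled in --
 * the struct file_handle result itself is never looked at, which
 * is exactly the "interface abuse" discussed above. */
int get_mount_id(const char *path, int *mount_id) {
    struct file_handle fh = { .handle_bytes = 0 };
    if (name_to_handle_at(AT_FDCWD, path, &fh, mount_id, 0) >= 0)
        return 0;                  /* cannot happen with 0 bytes */
    return errno == EOVERFLOW ? 0 : -errno;
}
```

As an aside, newer kernels expose the mount ID more directly via statx(2) and STATX_MNT_ID, which avoids the CONFIG_FHANDLE dependency entirely.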

Thanks,
//richard


Re: [systemd-devel] systemd-cgroups-agent not working in containers

2014-11-28 Thread Richard Weinberger
On 28.11.2014 at 06:33, Martin Pitt wrote:
 Hello all,
 
 Cameron Norman [2014-11-27 12:26 -0800]:
 On Wed, Nov 26, 2014 at 1:29 PM, Richard Weinberger rich...@nod.at wrote:
 Hi!

 I run a Linux container setup with openSUSE 13.1/2 as guest distro.
 After some time containers slow down.
 An investigation showed that the containers slow down because a lot of stale
 user sessions slow down almost all systemd tools, mostly systemctl.
 loginctl reports many thousand sessions.
 All in state closing.

 This sounds similar to an issue that systemd-shim in Debian had.
 Martin Pitt (helps to maintain systemd in Debian) fixed that issue; he
 may have some ideas here. I CC'd him.
 
 The problem with systemd-shim under sysvinit or upstart was that shim
 didn't set a cgroup release agent like systemd itself does. Thus the
 cgroups were never cleaned up after all the session processes died.
 (See 1.4 on https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
 for details)
 
 I don't think that SUSE uses systemd-shim, I take it in that setup you
 are running systemd proper on both the host and the guest? Then I
 suggest checking the cgroups that correspond to the closing sessions
 in the container, i. e. /sys/fs/cgroup/systemd/.../session-XX.scope/tasks.
 If there are still processes in it, logind is merely waiting for them
 to exit (or set KillUserProcesses in logind.conf). If they are empty,
 check that /sys/fs/cgroup/systemd/.../session-XX.scope/notify_on_release is 1
 and that /sys/fs/cgroup/systemd/release_agent is set?

The problem is that within the container the release agent is not executed.
It is executed on the host side.

Lennart, how is this supposed to work?
Is the theory of operation that the host systemd sends the
org.freedesktop.systemd1.Agent "Released" signal via D-Bus into the guest?
The guest's systemd definitely does not receive such a signal.
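For context, the cgroup-v1 machinery Martin refers to (section 1.4 of cgroups.txt) boils down to two files at the root of the hierarchy; once the last task of a group with notify_on_release=1 exits, the kernel executes the agent with the emptied group's path as argv[1]. A minimal sketch, with the hierarchy root as a parameter and helper names of our own:

```c
#include <stdio.h>

/* Write a single value into a cgroup control file. */
static int write_file(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int r = (fputs(value, f) < 0) ? -1 : 0;
    if (fclose(f) != 0)
        r = -1;
    return r;
}

/* Register `agent` as the release agent of the hierarchy rooted at
 * `base` (e.g. /sys/fs/cgroup/systemd) and enable notifications for
 * the root group. New child groups inherit the parent's
 * notify_on_release value at creation time. */
int setup_release_agent(const char *base, const char *agent) {
    char p[4096];
    snprintf(p, sizeof(p), "%s/release_agent", base);
    if (write_file(p, agent) < 0)
        return -1;
    snprintf(p, sizeof(p), "%s/notify_on_release", base);
    return write_file(p, "1");
}
```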

Thanks,
//richard


Re: [systemd-devel] systemd-cgroups-agent not working in containers

2014-11-27 Thread Richard Weinberger
On 26.11.2014 at 22:29, Richard Weinberger wrote:
 Hi!
 
 I run a Linux container setup with openSUSE 13.1/2 as guest distro.
 After some time containers slow down.
 An investigation showed that the containers slow down because a lot of stale
 user sessions slow down almost all systemd tools, mostly systemctl.
 loginctl reports many thousand sessions.
 All in state closing.
 
 The vast majority of these sessions are from crond and ssh logins.
 It turned out that sessions are never closed and stay around.
 The control group of such a session contains zero tasks.
 So I started to explore why systemd keeps it.
 After another few hours of debugging I realized that systemd never
 issues the release signal from cgroups.
 Also calling the release agent by hand did not help. i.e.
 /usr/lib/systemd/systemd-cgroups-agent 
 /user.slice/user-0.slice/session-c324.scope
 
 Therefore systemd never recognizes that a service/session has no more tasks
 and will close it.
 First I thought it is an issue in libvirt combined with user namespaces.
 But I can trigger this also without user namespaces and also with 
 systemd-nspawn.
 Tested with systemd 208 and 210 from openSUSE, their packages have all known 
 bugfixes.
 
 Any idea where to look further?
 How do you run the most current systemd on your distro?

Btw: I face exactly the same issue on fc21 (guest is fc20).

Thanks,
//richard


[systemd-devel] systemd-cgroups-agent not working in containers

2014-11-26 Thread Richard Weinberger
Hi!

I run a Linux container setup with openSUSE 13.1/2 as guest distro.
After some time containers slow down.
An investigation showed that the containers slow down because a lot of stale
user sessions slow down almost all systemd tools, mostly systemctl.
loginctl reports many thousands of sessions.
All in state "closing".

The vast majority of these sessions are from crond and ssh logins.
It turned out that sessions are never closed and stay around.
The control group of such a session contains zero tasks.
So I started to explore why systemd keeps it.
After another few hours of debugging I realized that systemd never
issues the release signal from cgroups.
Also calling the release agent by hand did not help. i.e.
/usr/lib/systemd/systemd-cgroups-agent 
/user.slice/user-0.slice/session-c324.scope

Therefore systemd never recognizes that a service/session has no more tasks
and will close it.
First I thought it is an issue in libvirt combined with user namespaces.
But I can trigger this also without user namespaces and also with 
systemd-nspawn.
Tested with systemd 208 and 210 from openSUSE, their packages have all known 
bugfixes.

Any idea where to look further?
How do you run the most current systemd on your distro?

Thanks,
//richard


Re: [systemd-devel] How to use cgroups within containers?

2014-10-24 Thread Richard Weinberger
On 20.10.2014 at 19:27, Lennart Poettering wrote:
 On Mon, 20.10.14 19:16, Richard Weinberger (rich...@nod.at) wrote:
 
 Have you read the link I posted?

 Sure, I've also been in the room in Düsseldorf while you read it
 in front of us.
 
 Not that I changed it since then... ;-)
 
 Yes, I test systemd inside containers. Daily. Actually it's my primary
 way of testing systemd, since it is extremely quick and allows me to
 attach from the host with debugging tools...

 As long as you follow the suggestions in the document I linked systemd
 will work without modifications in container managers. At least
 libvirt-lxc and nspawn follows these suggestions, not sure about the
 other container managers.

 If I read the source of nspawn correctly, it does not use user
 namespaces.
 
 Ah, this is about user namespaces? No I have not played around with
 them so far. Sorry.

Yep. Please have a look at them. There are some pitfalls.

 The libvirt-lxc developers are currently not sure how to support systemd. So far it
 bind mounts only the machine-specific part of the cgroup hierarchy into the container,
 which is not really nice, but better than exposing the whole hierarchy into
 the container.
 
 It really should also bind mount the upper parts, but possibly mark
 them read-only (which nspawn currently doesn't do).

Okay. Or maybe cgroup namespaces will help.
Let's find out. :)

Thanks,
//richard


Re: [systemd-devel] How to use cgroups within containers?

2014-10-20 Thread Richard Weinberger
On 20.10.2014 at 18:51, Lennart Poettering wrote:
 On Mon, 20.10.14 18:49, Richard Weinberger (rich...@nod.at) wrote:
 
 On 20.10.2014 at 18:24, Lennart Poettering wrote:
 On Fri, 17.10.14 23:35, Richard Weinberger (richard.weinber...@gmail.com) 
 wrote:

 Dear systemd and container folks,

 at Plumbers the question was raised how to provide cgroups to a systemd that
 lives
 in a container (with user namespaces).
 Due to the GDL train strikes I had to leave very soon and had no chance to
 talk to you in person.

 Was a solution proposed?
 All I want to know is how to provide cgroups in a sane and secure way
 to systemd. :-)

 The cgroups setup systemd requires to be able to run cleanly without
 changes in a container is documented here:

 http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/

 You have to mount the full cgroupfs hierarchies into the containers,
 so that /proc/$PID/cgroup makes sense inside the containers (that file
 lists absolute paths...). They can be mounted read-only up to the
 container's root, but further down they need to be writable to the
 container, so that systemd inside the container can do its job.

 And what solution do you propose?
 
 Solution? For what problem precisely?

Running systemd inside a Linux container (including user namespaces). :-)

 Will cgroup namespaces make systemd finally happy?
 
 I have no idea about cgroup namespaces and what they entail.
 
 systemd is quite happy already, if you follow the guidelines for
 container managers we put together...

Have you ever used systemd inside a container?
Say, LXC or libvirt-lxc...

Thanks,
//richard


Re: [systemd-devel] How to use cgroups within containers?

2014-10-20 Thread Richard Weinberger
On 20.10.2014 at 19:04, Lennart Poettering wrote:
 On Mon, 20.10.14 18:55, Richard Weinberger (rich...@nod.at) wrote:
 
 On 20.10.2014 at 18:51, Lennart Poettering wrote:
 On Mon, 20.10.14 18:49, Richard Weinberger (rich...@nod.at) wrote:

 On 20.10.2014 at 18:24, Lennart Poettering wrote:
 On Fri, 17.10.14 23:35, Richard Weinberger (richard.weinber...@gmail.com) 
 wrote:

 Dear systemd and container folks,

 at Plumbers the question was raised how to provide cgroups to a systemd that
 lives
 in a container (with user namespaces).
 Due to the GDL train strikes I had to leave very soon and had no chance 
 to
 talk to you in person.

 Was a solution proposed?
 All I want to know is how to provide cgroups in a sane and secure way
 to systemd. :-)

 The cgroups setup systemd requires to be able to run cleanly without
 changes in a container is documented here:

 http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/

 You have to mount the full cgroupfs hierarchies into the containers,
 so that /proc/$PID/cgroup makes sense inside the containers (that file
 lists absolute paths...). They can be mounted read-only up to the
 container's root, but further down they need to be writable to the
 container, so that systemd inside the container can do its job.

 And what solution do you propose?

 Solution? For what problem precisely?

 Running systemd inside a Linux container (including user namespaces). :-)

 Will cgroup namespaces make systemd finally happy?

 I have no idea about cgroup namespaces and what they entail.

 systemd is quite happy already, if you follow the guidelines for
 container managers we put together...

 Have you ever used systemd inside a container?
 Say, LXC or libvirt-lxc...
 
 Have you read the link I posted?

Sure, I've also been in the room in Düsseldorf while you read it in front of
us.

 Yes, I test systemd inside containers. Daily. Actually it's my primary
 way of testing systemd, since it is extremely quick and allows me to
 attach from the host with debugging tools...
 
 As long as you follow the suggestions in the document I linked systemd
 will work without modifications in container managers. At least
 libvirt-lxc and nspawn follows these suggestions, not sure about the
 other container managers.

If I read the source of nspawn correctly, it does not use user namespaces.
The libvirt-lxc developers are currently not sure how to support systemd. So far it
bind mounts only the machine-specific part of the cgroup hierarchy into the container,
which is not really nice, but better than exposing the whole hierarchy into
the container.
This is why I was asking for cgroup namespaces...

 Also read:
 
 http://www.freedesktop.org/wiki/Software/systemd/writing-vm-managers/
 
 We have documented this all so nicely, I can only recommend to
 actually take the time to read this. Thanks!

Thanks a lot!
//richard


[systemd-devel] How to use cgroups within containers?

2014-10-17 Thread Richard Weinberger
Dear systemd and container folks,

at Plumbers the question was raised how to provide cgroups to a systemd that lives
in a container (with user namespaces).
Due to the GDL train strikes I had to leave very soon and had no chance to
talk to you in person.

Was a solution proposed?
All I want to know is how to provide cgroups in a sane and secure way
to systemd. :-)

-- 
Thanks,
//richard


Re: [systemd-devel] How to use cgroups within containers?

2014-10-17 Thread Richard Weinberger
...fixing LXC devel mailinglist... :-\

On Fri, Oct 17, 2014 at 11:35 PM, Richard Weinberger
richard.weinber...@gmail.com wrote:
 Dear systemd and container folks,

 at Plumbers the question was raised how to provide cgroups to a systemd that lives
 in a container (with user namespaces).
 Due to the GDL train strikes I had to leave very soon and had no chance to
 talk to you in person.

 Was a solution proposed?
 All I want to know is how to provide cgroups in a sane and secure way
 to systemd. :-)

 --
 Thanks,
 //richard



-- 
Thanks,
//richard


Re: [systemd-devel] Should user mode linux register with machined?

2014-10-10 Thread Richard Weinberger
Lennart,

On 10.10.2014 at 18:44, Lennart Poettering wrote:
 It's a bit more complex. While UML, qemu, kvm, currently don't, LXC,
 systemd-nspawn and libvirt-lxc all do talk directly to machined. (Note
 that LXC and libvirt-lxc are separate codebases, the latter is *not* a
 wrapper around the former).
 
 So, dunno, it really is up to how you intend UML to be used. If UML
 shall be nice and useful without libvirt, then it's worth doing the
 registration natively, but it's also OK to just leave this to libvirt,
 if that's your primary envisioned usecase...

What is the benefit of this registration?
I boot UML and qemu-kvm VMs all day long without registering them with systemd,
so I don't really know what I'm missing. :-)
But if there is a nice use case, I'll happily add the registration to UML.

Thanks,
//richard


Re: [systemd-devel] Timed out waiting for device dev-disk-by...

2014-09-29 Thread Richard Weinberger
On Mon, Sep 29, 2014 at 8:29 PM, Thomas Meyer tho...@m3y3r.de wrote:
 Hi,

 I get a timeout in the Fedora 21 alpha:

 [ TIME ] Timed out waiting for device 
 dev-disk-by\x2duuid-008af19d\x2d2562\x2d49bd\x2d8907\x2d721ea08f3e14.device.

 But all devices are available from early kernel start:
 # ls -l /dev/disk/by-uuid/
 total 0
 lrwxrwxrwx 1 root root 11 Sep 29 20:17 008af19d-2562-49bd-8907-721ea08f3e14 -> ../../ubda1
 lrwxrwxrwx 1 root root 11 Sep 29 20:17 e2bffa45-d84f-47bc-81ba-e7a395751fa6 -> ../../ubda3
 lrwxrwxrwx 1 root root 11 Sep 29 20:17 f452f020-a446-41ed-93c0-ee5ce56d6ea4 -> ../../ubda2

 It feels like some event notification is lost in the boot process or 
 something like this?!

 What exactly makes the device unit go into the state active/plugged?

 This is a boot of the Fedora 21 alpha under user mode linux.

 Any ideas what could be wrong here?

Please always CC me and/or the UML mailing list in case of UML-related issues.
I'm very interested in having UML work with systemd.

-- 
Thanks,
//richard


Re: [systemd-devel] [RESEND][PATCH] systemd-tmpfiles: Fix IGNORE_DIRECTORY_PATH age handling

2014-09-26 Thread Richard Weinberger
On 09.09.2014 11:09, Richard Weinberger wrote:
 If one has a config like:
 d /tmp 1777 root root -
 X /tmp/important_mount
 
 All files below /tmp/important_mount will be deleted as the
 /tmp/important_mount item will spuriously inherit a max age of 0
 from /tmp.
 /tmp has a max age of 0 but age_set is (of course) false.
 
 This also affects the PrivateTmp feature of systemd.
 All tmp files of such services will be deleted unconditionally
 and can cause service failures and data loss.
 
 Fix this by checking ->age_set in the IGNORE_DIRECTORY_PATH logic.
 ---
  src/tmpfiles/tmpfiles.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/src/tmpfiles/tmpfiles.c b/src/tmpfiles/tmpfiles.c
 index 79fd0b7..c8d4abb 100644
 --- a/src/tmpfiles/tmpfiles.c
 +++ b/src/tmpfiles/tmpfiles.c
 @@ -1572,7 +1572,7 @@ static int read_config_file(const char *fn, bool ignore_enoent) {
  candidate_item = j;
  }
  
 -if (candidate_item) {
 +if (candidate_item && candidate_item->age_set) {
 i->age = candidate_item->age;
 i->age_set = true;
  }
 

ping?

Is there something horribly wrong with this patch or the submission itself?
Please tell me. :)

Thanks,
//richard


Re: [systemd-devel] Should user mode linux register with machined?

2014-09-16 Thread Richard Weinberger
On Tue, Sep 16, 2014 at 5:31 PM, Thomas Meyer tho...@m3y3r.de wrote:
 Hi,

 I wrote a small patch for user-mode linux to register with machined by
 calling CreateMachine. Is this a good idea to do so?

 I think machined gives you a nice overview over all running UML
 instances, also you get the scope unit and the control groups with above
 registration to machined. anything else on the plus side?
 The user-mode mailing list did ask why exactly my patch is needed.

The user-mode mailing list is also reading this list BTW. :)

-- 
Thanks,
//richard


[systemd-devel] [RESEND][PATCH] systemd-tmpfiles: Fix IGNORE_DIRECTORY_PATH age handling

2014-09-09 Thread Richard Weinberger
If one has a config like:
d /tmp 1777 root root -
X /tmp/important_mount

All files below /tmp/important_mount will be deleted as the
/tmp/important_mount item will spuriously inherit a max age of 0
from /tmp.
/tmp has a max age of 0 but age_set is (of course) false.

This also affects the PrivateTmp feature of systemd.
All tmp files of such services will be deleted unconditionally
and can cause service failures and data loss.

Fix this by checking ->age_set in the IGNORE_DIRECTORY_PATH logic.
---
 src/tmpfiles/tmpfiles.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/tmpfiles/tmpfiles.c b/src/tmpfiles/tmpfiles.c
index 79fd0b7..c8d4abb 100644
--- a/src/tmpfiles/tmpfiles.c
+++ b/src/tmpfiles/tmpfiles.c
@@ -1572,7 +1572,7 @@ static int read_config_file(const char *fn, bool ignore_enoent) {
 candidate_item = j;
 }
 
-if (candidate_item) {
+if (candidate_item && candidate_item->age_set) {
i->age = candidate_item->age;
i->age_set = true;
 }
-- 
2.0.1



Re: [systemd-devel] [PATCH] systemd-tmpfiles: Fix IGNORE_DIRECTORY_PATH age handling

2014-09-02 Thread Richard Weinberger
On 27.08.2014 14:55, Richard Weinberger wrote:
 If one has a config like:
 d /tmp 1777 root root -
 X /tmp/important_mount
 
 All files below /tmp/important_mount will be deleted as the
 /tmp/important_mount item will spuriously inherit a max age of 0
 from /tmp.
 /tmp has a max age of 0 but age_set is (of course) false.
 
 Fix this by checking ->age_set in the IGNORE_DIRECTORY_PATH logic.
 
 Signed-off-by: Richard Weinberger rich...@nod.at
 ---
  src/tmpfiles/tmpfiles.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/src/tmpfiles/tmpfiles.c b/src/tmpfiles/tmpfiles.c
 index 79fd0b7..c8d4abb 100644
 --- a/src/tmpfiles/tmpfiles.c
 +++ b/src/tmpfiles/tmpfiles.c
 @@ -1572,7 +1572,7 @@ static int read_config_file(const char *fn, bool ignore_enoent) {
  candidate_item = j;
  }
  
 -if (candidate_item) {
 +if (candidate_item && candidate_item->age_set) {
 i->age = candidate_item->age;
 i->age_set = true;
  }
 

Ping?

Would be nice to see this merged, it fixes a nasty issue with PrivateTmp=yes.
Without that patch all files in private /tmp and /var/tmp will get deleted
unconditionally by systemd-tmpfiles if you configure it *not* to delete
anything in /tmp and /var/tmp, i.e.:
d /tmp 1777 root root -
d /var/tmp 1777 root root -

This is the default on openSUSE.
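
To make the failure mode concrete, here is a hypothetical tmpfiles.d fragment (file name and mount point are made up) matching the description above:

```
# /etc/tmpfiles.d/example.conf (hypothetical)
# "d" creates the directory; an Age field of "-" means age 0 with
# age_set == false internally.
d /tmp 1777 root root -
d /var/tmp 1777 root root -
# "X" should exclude this subtree from cleanup, but without the fix
# the item spuriously inherits the bogus age 0 from /tmp and
# everything below the mount gets deleted on the next clean-up run.
X /tmp/important_mount
```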

Thanks,
//richard


[systemd-devel] [PATCH] systemd-tmpfiles: Fix IGNORE_DIRECTORY_PATH age handling

2014-08-27 Thread Richard Weinberger
If one has a config like:
d /tmp 1777 root root -
X /tmp/important_mount

All files below /tmp/important_mount will be deleted as the
/tmp/important_mount item will spuriously inherit a max age of 0
from /tmp.
/tmp has a max age of 0 but age_set is (of course) false.

Fix this by checking ->age_set in the IGNORE_DIRECTORY_PATH logic.

Signed-off-by: Richard Weinberger rich...@nod.at
---
 src/tmpfiles/tmpfiles.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/tmpfiles/tmpfiles.c b/src/tmpfiles/tmpfiles.c
index 79fd0b7..c8d4abb 100644
--- a/src/tmpfiles/tmpfiles.c
+++ b/src/tmpfiles/tmpfiles.c
@@ -1572,7 +1572,7 @@ static int read_config_file(const char *fn, bool ignore_enoent) {
 candidate_item = j;
 }
 
-if (candidate_item) {
+if (candidate_item && candidate_item->age_set) {
i->age = candidate_item->age;
i->age_set = true;
 }
-- 
2.0.1



[systemd-devel] Device units and LXC

2014-05-24 Thread Richard Weinberger
Hi!

As of my understanding of systemd, device units depend hard on udev.
Units like network@.service contain lines like
BindsTo=sys-subsystem-net-devices-%i.device
Within a Linux container this is a problem because there is no udev.
Hence systemd never receives an event for such a device and the device unit
never shows up.

I'm wondering what we can do to improve the situation.
At least for Ethernet devices systemd could just use sysfs to find out
whether the device is present or not.

What do you think?

Thanks,
//richard


Re: [systemd-devel] [PATCH] [RFC] Ignore OOMScoreAdjust in Linux containers

2014-04-09 Thread Richard Weinberger
On 09.04.2014 19:19, Tom Gundersen wrote:
 On Mon, Apr 7, 2014 at 9:47 PM, Richard Weinberger rich...@nod.at wrote:
 At least LXC does not allow the container root to change
 the OOM Score adjust value.

 Signed-off-by: Richard Weinberger rich...@nod.at
 ---
 Hi!

 Within Linux containers we cannot use OOMScoreAdjust nor 
 CapabilityBoundingSet (and maybe
 more related settings).
 This patch tells systemd to ignore OOMScoreAdjust if it detects
 a container.

 Are you fine with such a change?
 Otherwise regular distros need a lot of changes in their .service files
 to make them work within LXC.

 As detect_virtualization() detects more than LXC we have to find out
 whether OOMScoreAdjust cannot be used on OpenVZ and other container as well.

 I'd volunteer to identify all settings and send patches...
 
 Hm, is there a fundamental reason why this is not possible in
 containers in general, or is it simply an LXC restriction? Regardless,
 would it not be best to simply degrade gracefully and ignore the
 setting with a warning if it fails? See the comment Lennart just
 posted on the recent PrivateNetwork= patch. This sounds like a very
 similar situation.

Writing to oom_score_adj is disallowed by design within user namespaces.
Please see: https://lkml.org/lkml/2013/4/25/596

I'm also fine with ignoring OOMScoreAdjust if it fails.
All I want is a painless Linux userspace on top of systemd within
my Containers. :-)

Thanks,
//richard


Re: [systemd-devel] [PATCH] [RFC] Ignore OOMScoreAdjust in Linux containers

2014-04-09 Thread Richard Weinberger
On 09.04.2014 20:28, Tom Gundersen wrote:
 On Wed, Apr 9, 2014 at 7:39 PM, Richard Weinberger rich...@nod.at wrote:
 On 09.04.2014 19:19, Tom Gundersen wrote:
 On Mon, Apr 7, 2014 at 9:47 PM, Richard Weinberger rich...@nod.at wrote:
 At least LXC does not allow the container root to change
 the OOM Score adjust value.

 Signed-off-by: Richard Weinberger rich...@nod.at
 ---
 Hi!

 Within Linux containers we cannot use OOMScoreAdjust nor 
 CapabilityBoundingSet (and maybe
 more related settings).
 This patch tells systemd to ignore OOMScoreAdjust if it detects
 a container.

 Are you fine with such a change?
 Otherwise regular distros need a lot of changes in their .service files
 to make them work within LXC.

 As detect_virtualization() detects more than LXC we have to find out
 whether OOMScoreAdjust cannot be used on OpenVZ and other container as 
 well.

 I'd volunteer to identify all settings and send patches...

 Hm, is there a fundamental reason why this is not possible in
 containers in general, or is it simply an LXC restriction? Regardless,
 would it not be best to simply degrade gracefully and ignore the
 setting with a warning if it fails? See the comment Lennart just
 posted on the recent PrivateNetwork= patch. This sounds like a very
 similar situation.

 Writing to oom_score_adj is disallowed by design within user namespaces.
 Please see: https://lkml.org/lkml/2013/4/25/596
 
 But I guess we still want to use this in containers that don't use
 user namespaces.

Containers without user namespaces and a uid 0 user are horribly broken
and insecure.
They will hopefully die soon.

 I'm also fine with ignoring OOMScoreAdjust if it fails.
 
 Sounds like the right way (might be other things like this too I suppose).

Okay, I'll send patches for OOMScoreAdjust and other settings to ignore 
failures.
This way systemd can also support containers without user namespaces.
No matter how useful these are. (hello docker.io folks! ;))

Thanks,
//richard


[systemd-devel] [PATCH] [RFC] Ignore OOMScoreAdjust in Linux containers

2014-04-07 Thread Richard Weinberger
At least LXC does not allow the container root to change
the OOM Score adjust value.

Signed-off-by: Richard Weinberger rich...@nod.at
---
Hi!

Within Linux containers we cannot use OOMScoreAdjust nor CapabilityBoundingSet 
(and maybe
more related settings).
This patch tells systemd to ignore OOMScoreAdjust if it detects
a container.

Are you fine with such a change?
Otherwise regular distros need a lot of changes in their .service files
to make them work within LXC.

As detect_virtualization() detects more than LXC we have to find out
whether OOMScoreAdjust cannot be used on OpenVZ and other container as well.

I'd volunteer to identify all settings and send patches...

Thanks,
//richard

---
 src/core/load-fragment.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/src/core/load-fragment.c b/src/core/load-fragment.c
index c604f90..13f6107 100644
--- a/src/core/load-fragment.c
+++ b/src/core/load-fragment.c
@@ -59,6 +59,7 @@
#include "bus-error.h"
#include "errno-list.h"
#include "af-list.h"
+#include "virt.h"
 
 #ifdef HAVE_SECCOMP
#include "seccomp-util.h"
@@ -423,6 +424,12 @@ int config_parse_exec_oom_score_adjust(const char* unit,
 assert(rvalue);
 assert(data);
 
+if (detect_virtualization(NULL) == VIRTUALIZATION_CONTAINER) {
+log_syntax(unit, LOG_ERR, filename, line, EPERM,
+   "Setting the OOM score adjust value is not allowed within containers");
+return 0;
+}
+
r = safe_atoi(rvalue, &oa);
if (r < 0) {
 log_syntax(unit, LOG_ERR, filename, line, -r,
-- 
1.8.4.2



[systemd-devel] Howto run systemd within a linux container

2014-02-05 Thread Richard Weinberger
Hi!

We're heavily using Linux containers in our production environment.
As modern Linux distributions move forward to systemd, we have to make sure that
systemd works within our containers.

Sadly we're facing issues with cgroups.
Our testbed consists of openSUSE 13.1 with Linux 3.13.1 and libvirt 1.2.1.

In a plain setup systemd stops immediately because it is unable to
create the cgroup hierarchy.
Mostly because the container uid 0 is in a user namespace and has no
rights to do that.

Bootlog:
---cut---
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX
-IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'lxc-libvirt'.

Welcome to openSUSE 13.1 (Bottle) (x86_64)!

Set hostname to test1.
Failed to install release agent, ignoring: No such file or directory
Failed to create root cgroup hierarchy: Permission denied
Failed to allocate manager object: Permission denied
---cut---

Next try: trigger the Ingo Molnar-branch by mounting a tmpfs to
/sys/fs/cgroup/; systemd segfaults.
Bug filed to https://bugs.freedesktop.org/show_bug.cgi?id=74589

Bootlog:
---cut---

systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX
-IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'lxc-libvirt'.

Welcome to openSUSE 13.1 (Bottle) (x86_64)!

Set hostname to test1.
No control group support available, not creating root group.
Cannot add dependency job for unit getty@console.service, ignoring:
Unit getty@console.service failed to load: Invalid argument.
Cannot add dependency job for unit display-manager.service, ignoring:
Unit display-manager.service failed to load: No such file or
directory.
[  OK  ] Listening on Syslog Socket.
[  OK  ] Reached target Remote File Systems (Pre).
[  OK  ] Reached target Remote File Systems.
[  OK  ] Listening on Delayed Shutdown Socket.
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
[  OK  ] Reached target Encrypted Volumes.
[  OK  ] Listening on Journal Socket.
 Starting Create dynamic rule for /dev/root link...
Caught SEGV, dumped core as pid 11.
Freezing execution.
---cut---

Next try, fool systemd by mounting a tmpfs to /sys/fs/cgroup/systemd/.
This seems to work. openSUSE boots, I can start/stop services...
Shutdown hangs forever, had no time to investigate so far.

But is this tmpfs hack the correct way to run systemd in a container?
I really don't think so.

Can someone please explain to me how to achieve this in a sane and unhacky way?

-- 
Thanks,
//richard


Re: [systemd-devel] Howto run systemd within a linux container

2014-02-05 Thread Richard Weinberger
On Thu, Feb 6, 2014 at 1:08 AM, Kay Sievers k...@vrfy.org wrote:
 On Thu, Feb 6, 2014 at 12:56 AM, Lennart Poettering
 lenn...@poettering.net wrote:
 On Wed, 05.02.14 23:44, Richard Weinberger (richard.weinber...@gmail.com) 
 wrote:

 We're heavily using Linux containers in our production environment.
 As modern Linux distributions move forward to systemd have to make sure that
 systemd works within our containers.

 Sadly we're facing issues with cgroups.
 Our testbed consists of openSUSE 13.1 with Linux 3.13.1 and libvirt 1.2.1.

 In a plain setup systemd stops immediately because it is unable to
 create the cgroup hierarchy.
 Mostly because the container uid 0 is in a user namespace and has no
 rights to do that.

 Make sure to either make the name=systemd cgroups hierarchy available in
 the container, or to grant it CAP_SYS_MOUNT so that it can do it on its
 own.

 Make sure that your container manager sets up thigns like described here:

 http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/

 Next try, trigger the Ingo Molnar-branch by mounting a tmpfs to
 /sys/fs/cgroup/, systemd segfaults.
 Bug filed to https://bugs.freedesktop.org/show_bug.cgi?id=74589

 Yeah, this is never tested, and likely to break all the time. We
 probably should remove this feature, since we cannot guarantee it work,
 and apparently nobody has noticed it to be broken since a while.

 Yeah, we should remove it now. We will never really be able to support
 that, init=/bin/sh is probably the better option than a systemd going
 crazy or crashing.

  Starting Create dynamic rule for /dev/root link...

This is so bogus that it hurts ^^^

 Seems some distros cannot let bad ideas die. :)

 But is this tmpfs hack the correct way to run systemd in a container?
 I really don't think so.

 Nope. Please mount tmpfs to /sys/fs/cgroup as tmpfs, and then the
 name=systemd cgroup hierarchy to /sys/fs/cgroup/systemd, see above.

 User namespaces are involved and uid 0 is mapped to an ordinary user.
 Never tried, but it might be needed that the subtree in the container
 is chown()ed to the mapped user.

As discussed on IRC, I'll try that tomorrow. :-)
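
To summarize the suggested setup (my reading of the advice above and of the ContainerInterface page; pseudo-commands, not tested):

```
# Done by the container manager, inside the container's mount
# namespace, before systemd is executed:
mount -t tmpfs tmpfs /sys/fs/cgroup
mkdir /sys/fs/cgroup/systemd
mount -t cgroup -o none,name=systemd cgroup /sys/fs/cgroup/systemd
# With user namespaces, additionally chown the hierarchy (from the
# host side) to whatever uid the container's uid 0 is mapped to.
```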

-- 
Thanks,
//richard