Re: [systemd-devel] Questions around cgroups, systemd, containers

2022-05-21 Thread Lewis Gaul
Hi Lennart,

Thanks for responding to the questions. I realise some of them may have
been a little unclear in isolation - my intention was for the two posts I
linked to provide the full context, but I understand they contain a lot of
text that it's unreasonable to expect people to have time to read! I'll try
to clarify for each question below.

> > - Why are private cgroups mounted read-only in non-privileged
> > containers?
>
> "private cgroups"? What do you mean by that? The controllers?
>
> Controller delegation on cgroupsv1 is simply not safe, that's all. You
> can provide invalid configuration to the kernel, and DoS the machine
> through it. cgroups are simply not a suitable privilege boundary on
> cgroupsv1.
>
> If you want safe delegation, use cgroupsv2, where delegation is safe.

I was referring to the behaviour of '--cgroupns=private' (to 'docker run'
or 'podman run') where a cgroup namespace is created for the container.
This flag exists under v1 and v2 cgroups. For example, on v2 cgroups the
host cgroup path '/sys/fs/cgroup/docker//' would correspond to
'/sys/fs/cgroup/' inside the container. Discussed more at
https://www.lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/#cgroup-namespace-options
.
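
To illustrate the path mapping described above, here is a minimal Python
sketch. It is not taken from docker or podman; the function name and the
container ID 'abc123' are hypothetical, and it assumes the namespace is
rooted at the container's own cgroup.

```python
from pathlib import PurePosixPath

# Where the cgroupfs is conventionally mounted, on the host and in the container.
CGROUPFS_ROOT = PurePosixPath("/sys/fs/cgroup")


def container_view(host_path: str, ns_root: str) -> str:
    """Translate a host cgroupfs path into the path seen inside a cgroup namespace.

    ns_root is the host path of the cgroup the namespace is rooted at.
    Raises ValueError if host_path lies outside that cgroup.
    """
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(ns_root))
    return str(CGROUPFS_ROOT / relative)


if __name__ == "__main__":
    # 'abc123' stands in for a real container ID (hypothetical).
    root = "/sys/fs/cgroup/docker/abc123"
    print(container_view(root + "/init.scope", root))
```

A path outside the namespace root raises ValueError, which mirrors the fact
that such cgroups cannot be named from inside the namespace.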

The question was "why is the cgroupfs mounted read-only inside a
non-privileged container?" - when there's a cgroup namespace it seems it
should be safe [under v2 cgroups] for the container to have write access to
its cgroupfs?

The reason for caring about this of course is that it's a requirement for
running systemd inside the container. Currently workarounds are required,
such as '-v /sys/fs/cgroup:/sys/fs/cgroup' (which cannot be expected to
work with a cgroup namespace!) or podman's default '--systemd=true'
behaviour of detecting whether systemd is the entrypoint when deciding
whether to make the cgroupfs writable. However, I'm trying to understand if
there's any good reason for docker/podman not making the container's
cgroupfs read-write by default.
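
For reference, one way to see what the container manager actually set up is
to read /proc/self/mountinfo from inside the container. A minimal Python
sketch, assuming the mountinfo layout documented in proc(5); the function
name is my own and nothing here is specific to docker or podman:

```python
from typing import List, Tuple


def cgroup_mounts(mountinfo_text: str) -> List[Tuple[str, str, bool]]:
    """Return (mount_point, fstype, is_read_only) for each cgroup/cgroup2 mount.

    Each mountinfo line has the form:
      ID PARENT MAJ:MIN ROOT MOUNT_POINT OPTIONS [optional...] - FSTYPE SOURCE SUPER_OPTS
    """
    mounts = []
    for line in mountinfo_text.splitlines():
        if " - " not in line:
            continue
        left, right = line.split(" - ", 1)
        left_fields = left.split()
        right_fields = right.split()
        if len(left_fields) < 6 or not right_fields:
            continue
        fstype = right_fields[0]
        if fstype not in ("cgroup", "cgroup2"):
            continue
        # Field 6 holds the per-mount options, which is where 'ro' appears
        # for a mount that was made read-only.
        options = left_fields[5].split(",")
        mounts.append((left_fields[4], fstype, "ro" in options))
    return mounts


if __name__ == "__main__":
    with open("/proc/self/mountinfo") as f:
        for mount_point, fstype, read_only in cgroup_mounts(f.read()):
            print(mount_point, fstype, "ro" if read_only else "rw")
```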

> > - Is it sound to override Docker’s mounting of the private container
> > cgroups under v1?
>
> I don't know what Docker does these days, but they used to be entirely
> ignorant towards safe cooperation in the cgroup tree. i.e. they ignored
> https://systemd.io/CGROUP_DELEGATION in its entirety, as they never really
> accepted systemd's existence.
>
> Today most distros I think switched over to other ways to run containers,
> i.e. podman and so on, which have a more professional approach to all this,
> and can safely cooperate in a cgroup tree.

This question does actually apply to podman too. It might be more
appropriately aimed at docker/podman rather than systemd; I was just
wondering if anyone had thoughts.

To rephrase/provide some more context - I have a use-case where a custom
bash script is our container entrypoint, where the purpose of the script is
to check a few things while still being able to exit the container, and at
the end of the script systemd is started (with 'exec /sbin/init'). Since
systemd requires write access to the cgroupfs, I was wondering if we could
just unmount and recreate the cgroup mount(s) as read-write in this
entrypoint script (requiring CAP_SYS_ADMIN to do so of course), overriding
the container manager's setup of making the mounts read-only.

> >   - What are the concerns around the approach of passing '-v
> > /sys/fs/cgroup:/sys/fs/cgroup' in terms of the container’s view of its
> > cgroups?
>
> I don't know what this does. Is this a Docker thing?

It's a workaround suggested for getting systemd running inside a docker
container, overriding docker's behaviour of making the cgroupfs mounts
read-only to make them available read-write. There are some references at
https://www.lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/#systemd-inside-docker-containers
.

This workaround seems quite undesirable to me considering it gives full
access to the host's cgroupfs and breaks '--cgroupns=private'. This is not
needed with podman since '--systemd=always' can be used. But the motivation
of the point/question above was to remove the requirement for this docker
workaround.

> >   - Is modifying/replacing the cgroup mounts set up by the container
> > engine a reasonable workaround, or could this be fragile?
>
> I am not sure I follow? A workaround for what? One shouldn't assume one
> even has the privs to modify cgroup mounts.
>
> But why would one even?

Hopefully my explanation above makes this clearer. Replacing the cgroup
mounts set up by the container manager before exec-ing systemd is one
possible workaround for the fact docker creates the cgroup mounts
read-only. As I understand it, systemd requires CAP_SYS_ADMIN anyway, and
this gives us the privileges required to modify (or unmount and recreate)
the cgroup mounts.

> > - When is it valid to manually manipulate container cgroups?
>
> When you asked for your own delegated subtree first, see docs:
> https://systemd.io/CGROUP_DELEGATION

Yep, I have read that multiple times, the following questi

Re: [systemd-devel] Questions around cgroups, systemd, containers

2022-05-21 Thread Lennart Poettering
On Fr, 20.05.22 17:12, Lewis Gaul (lewis.g...@gmail.com) wrote:

> To summarize the questions (taken from the second post linked above):
> - Why are private cgroups mounted read-only in non-privileged
> containers?

"private cgroups"? What do you mean by that? The controllers?

Controller delegation on cgroupsv1 is simply not safe, that's all. You
can provide invalid configuration to the kernel, and DoS the machine
through it. cgroups are simply not a suitable privilege boundary on
cgroupsv1.

If you want safe delegation, use cgroupsv2, where delegation is safe.

> - Is it sound to override Docker’s mounting of the private container
> cgroups under v1?

I don't know what Docker does these days, but they used to be entirely
ignorant towards safe cooperation in the cgroup tree. i.e. they
ignored https://systemd.io/CGROUP_DELEGATION in its entirety, as they
never really accepted systemd's existence.

Today most distros I think switched over to other ways to run
containers, i.e. podman and so on, which have a more professional
approach to all this, and can safely cooperate in a cgroup tree.

>   - What are the concerns around the approach of passing '-v
> /sys/fs/cgroup:/sys/fs/cgroup' in terms of the container’s view of its
> cgroups?

I don't know what this does. Is this a Docker thing?

>   - Is modifying/replacing the cgroup mounts set up by the container engine
> a reasonable workaround, or could this be fragile?

I am not sure I follow? A workaround for what? One shouldn't assume
one even has the privs to modify cgroup mounts.

But why would one even?

> - When is it valid to manually manipulate container cgroups?

When you asked for your own delegated subtree first, see docs:

https://systemd.io/CGROUP_DELEGATION

>   - Do container managers such as Docker and Podman correctly delegate
> cgroups on hosts running Systemd?

podman probably does this correctly. docker didn't, and I'm not sure
whether that has changed.

>   - Are these container managers happy for the container to take ownership
> of the container’s cgroup?

I am not sure I grok this question, but a correctly implemented
container manager should be able to safely run cgroups-using payloads
inside the container. In that model, a host systemd manages the root
of the tree, the container manager a cgroup further down, and the
payload of the container (for example another systemd run inside the
container) the stuff below.

> - Why are the container’s cgroup limits not set on a parent cgroup under
> Docker/Podman?

I don't grok the question?

>   - Why doesn’t Docker use another layer of indirection in the cgroup
> hierarchy such that the limit is applied in the parent cgroup to the
> container?

I don't understand the question. And I can't answer docker questions.

> - What happens if you have two of the same cgroup mount?

what do you mean by a "cgroup mount"? A cgroupfs controller mount? If
they are within the same cgroup namespace they will be effectively
bind mounts of each other, i.e. show the exact same contents.

>   - Are there any gotchas/concerns around manipulating cgroups via multiple
> mount points?

Why would you do that though?

> - What’s the correct way to check which controllers are enabled?

Enabled *in* *what*? In the kernel? /proc/cgroups. Mounted? "mount"
maybe? In your container mgr? Depends on that.
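
As a rough illustration of the kernel-level check, here is a minimal Python
sketch that reads the 'enabled' column of /proc/cgroups, plus a helper for
the space-separated cgroup.controllers file found in each cgroup directory
on v2. The function names are illustrative only, and the /proc/cgroups
layout is assumed to be the four-column one described in cgroups(7).

```python
from typing import Dict, List


def kernel_controllers(proc_cgroups_text: str) -> Dict[str, bool]:
    """Map controller name -> enabled flag from the contents of /proc/cgroups.

    Expected columns: subsys_name, hierarchy, num_cgroups, enabled.
    """
    result = {}
    for line in proc_cgroups_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split()
        if len(fields) != 4:
            continue
        name, _hierarchy, _num_cgroups, enabled = fields
        result[name] = enabled == "1"
    return result


def v2_available_controllers(cgroup_controllers_text: str) -> List[str]:
    """Parse a v2 'cgroup.controllers' file (space-separated names)."""
    return cgroup_controllers_text.split()


if __name__ == "__main__":
    with open("/proc/cgroups") as f:
        print(kernel_controllers(f.read()))
```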

>   - What is it that determines which controllers are enabled? Is it kernel
> configuration applied at boot?

Enabled where?

>   - Is it possible to have some controllers enabled for v1 at the same time
> as others are enabled for v2?

Yes.
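
For a given process, /proc/self/cgroup shows such a mix directly: v1
hierarchies appear as "ID:controllers:path" entries, and the v2 hierarchy
as a single "0::path" entry. A minimal Python sketch of splitting the two,
assuming the format described in cgroups(7); the function name and the
sample paths in the usage below are hypothetical:

```python
from typing import Dict, Optional, Tuple


def split_by_version(proc_self_cgroup: str) -> Tuple[Dict[str, str], Optional[str]]:
    """Return ({v1_controller: cgroup_path}, v2_cgroup_path or None)."""
    v1 = {}
    v2_path = None
    for line in proc_self_cgroup.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue
        hierarchy_id, controllers, path = parts
        if hierarchy_id == "0" and controllers == "":
            v2_path = path  # the unified (v2) hierarchy entry
            continue
        for controller in controllers.split(","):
            # Named v1 hierarchies such as 'name=systemd' carry no controller.
            if controller and not controller.startswith("name="):
                v1[controller] = path
    return v1, v2_path


if __name__ == "__main__":
    sample = "5:cpu,cpuacct:/docker/abc123\n1:name=systemd:/docker/abc123\n0::/docker/abc123\n"
    print(split_by_version(sample))
```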

Lennart

--
Lennart Poettering, Berlin


[systemd-devel] Questions around cgroups, systemd, containers

2022-05-20 Thread Lewis Gaul
Hi all,

I've been trying to get a deeper understanding of Linux cgroups and their
use with containers/systemd over the last few months. I have a few
questions, but given the amount of context around the questions I've
written up my understanding in a blog post at
https://www.lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/ and the
questions in another blog post at
https://www.lewisgaul.co.uk/blog/coding/rough/2022/05/20/cgroups-questions/.

If anyone has any thoughts/input/answers that would be much appreciated!
I'm planning on cross-posting in a few places such as podman/docker/kernel
mailing lists/communities, but in particular any input specific to the
systemd oriented questions would be great.

To summarize the questions (taken from the second post linked above):
- Why are private cgroups mounted read-only in non-privileged containers?
- Is it sound to override Docker’s mounting of the private container
  cgroups under v1?
  - What are the concerns around the approach of passing '-v
    /sys/fs/cgroup:/sys/fs/cgroup' in terms of the container’s view of its
    cgroups?
  - Is modifying/replacing the cgroup mounts set up by the container engine
    a reasonable workaround, or could this be fragile?
- When is it valid to manually manipulate container cgroups?
  - Do container managers such as Docker and Podman correctly delegate
    cgroups on hosts running Systemd?
  - Are these container managers happy for the container to take ownership
    of the container’s cgroup?
- Why are the container’s cgroup limits not set on a parent cgroup under
  Docker/Podman?
  - Why doesn’t Docker use another layer of indirection in the cgroup
    hierarchy such that the limit is applied in the parent cgroup to the
    container?
- What happens if you have two of the same cgroup mount?
  - Are there any gotchas/concerns around manipulating cgroups via multiple
    mount points?
- What’s the correct way to check which controllers are enabled?
  - What is it that determines which controllers are enabled? Is it kernel
    configuration applied at boot?
  - Is it possible to have some controllers enabled for v1 at the same time
    as others are enabled for v2?

Thanks in advance,
Lewis