Re: [systemd-devel] Delegate= on slice before v237

2019-02-15 Thread Lennart Poettering
On Mi, 13.02.19 11:52, Filipe Brandenburger (filbran...@gmail.com) wrote:

> > The fact it wasn't refused outright was an
> > accident, and because it was one I am not entirely sure what the
> > precise effect of allowing it was. However, I am pretty sure it at
> > least had two effects:
> >
> > 1. it would turn on all controllers for the cgroup
> >
>
> I don't *think* this is why libcontainer was trying to enable it, since a
> few lines down it's explicitly enabling all the controllers by
> setting MemoryAccounting, CPUAccounting and BlockIOAccounting during
> transient unit creation:
> https://github.com/opencontainers/runc/blob/v1.0.0-rc6/libcontainer/cgroups/systemd/apply_systemd.go#L275

Note that on current systemd+kernel CPUAccounting=yes has no effect on cgroup
controller enablement since the kernel now provides the accounting
data always, independently of the CPU controller.

> > 2. it would stop systemd to ever migrating foreign processes below
> >that slice, which is primarily relevant only when changing cgroup
> >related props on the slice dynamically I guess.
>
> I'm not sure I follow... Do you mean if libcontainer would write
> to memory.limit_in_bytes (or one of the other properties of the memory or
> other controller managed by systemd, such as cpu), then systemd would not
> end up overwriting this as it does some other operation on the
> cgroup?

No, not that. (Note btw, that the cgroup resource attrs on the
delegated cgroup itself remain exclusively owned by systemd. Only the
cgroup attrs of cgroups created below the delegated are owned by the
service payload. i.e. delegation is a bit weird in that regard:
"cgroup.groups" is delegated to the payload while "cpu.weight" or
"pids.max" is not.)

> I'm not completely sure I understand what "migrate foreign processes"
> means, given slices don't really hold any pids directly... Do you mean to
> scope and service units below that slice?

If systemd makes changes to the cgroup tree then it will consider the
"name=systemd" hierarchy authoritative, and the process memberships
are the ones systemd primarily manages, and that are the "primary
sources of truth". The other controller hierarchies are secondary to
that, and are only managed in a way that they are made "reduced
copies" of the "name=systemd" hierarchy. Meaning: when systemd wants
to know in which cgroup PID X is it checks the name=systemd hierarchy,
and when it wants to migrate X elsewhere it does so first in the
name=systemd, and then syncs the other hierarchies to the
"name=systemd" to the depth needed. Now, "to the depth needed" is more
complex than it might sound: due to "controllers are enabled in
parent" logic cgroup controller memberships propagate "side-ways" in
the cgroup tree. i.e. if the service foobar.service in slice
waldo.slice gets IOAccounting=yes enabled, then this means "blkio"
will be enabled also for any sibling quux.service or miau.service if
those are also in the same slice waldo.slice. If that happens during
runtime (and it often does, since starting quux.service first, and
then also starting foobar.service means "blkio" needs to be turned
on for the already running quux.service), then this means the blkio cgroup
for quux.service gets realized and then all processes in the
name=systemd cgroup of quux.service propagated into it, i.e. the blkio
hierarchy gets synced to the name=systemd hierarchy one level further
down, if you follow what I am blabbering here.

Now, if services have Delegate=yes turned on for themselves, then
systemd will never do such synchronization of cgroup trees ever: the
cgroup controller memberships are set in stone, and systemd will never
migrate anything below the delegated hierachy, since the payload of
course has freedom to distribute the pids among its heirarchies any
way it wants, and systemd should never interfere with it and rearrange
things moving PIDs to the top of the delegate cgroup subtree just
because a sibling was modified.

> In any case, for now I'll probably leave that alone... Though as I revamp
> libcontainer support for unified hierarchy, I'll try to skip that check on
> that case, that might make this a legacy-only setting, so not that
> important to fully get rid of it for a while...

Doing Delegate=yes on a slice is really broken. In systemd we lacked a
safety check to disallow this and the fact that runc used that is
exploiting a bug, nothing else.

Lennart

[6~--
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel

Re: [systemd-devel] Delegate= on slice before v237

2019-02-13 Thread Filipe Brandenburger via systemd-devel
Hey Lennart,

Thanks for the clarification.

On Tue, Feb 12, 2019 at 2:17 AM Lennart Poettering 
wrote:

> On Mo, 11.02.19 16:39, Filipe Brandenburger (filbran...@google.com) wrote:
> > Before systemd v237 (when Delegate= was no longer allowed on slice
> > units)... Did setting Delegate=yes on a slice have *any* effect at all?
> >
> > Or did it just do nothing (and a slice with Delegate=no or no setting
> > behave just the same)?
> >
> > Reason I ask is: I want to scrap this code
> > <
> https://github.com/opencontainers/runc/blob/v1.0.0-rc6/libcontainer/cgroups/systemd/apply_systemd.go#L195
> >
> > in libcontainer that tries to detect whether Delegate= is accepted in a
> > slice unit. (I'll just default it to false, never try it.)
> >
> > I'd like to be able to say that Delegate=yes never really did anything at
> > all on slice units... So I'm trying to confirm that is really the case
> > before stating it.
>
> So, it wasn't supposed to do anything, and what it does differs on
> cgroupsv2 and cgroupsv1.


libcontainer is pretty much cgroupv1 only, so that's what I'm concerned
about.


> The fact it wasn't refused outright was an
> accident, and because it was one I am not entirely sure what the
> precise effect of allowing it was. However, I am pretty sure it at
> least had two effects:
>
> 1. it would turn on all controllers for the cgroup
>

I don't *think* this is why libcontainer was trying to enable it, since a
few lines down it's explicitly enabling all the controllers by
setting MemoryAccounting, CPUAccounting and BlockIOAccounting during
transient unit creation:
https://github.com/opencontainers/runc/blob/v1.0.0-rc6/libcontainer/cgroups/systemd/apply_systemd.go#L275


> 2. it would stop systemd to ever migrating foreign processes below
>that slice, which is primarily relevant only when changing cgroup
>related props on the slice dynamically I guess.
>

I'm not sure I follow... Do you mean if libcontainer would write
to memory.limit_in_bytes (or one of the other properties of the memory or
other controller managed by systemd, such as cpu), then systemd would not
end up overwriting this as it does some other operation on the cgroup?

I'm not completely sure I understand what "migrate foreign processes"
means, given slices don't really hold any pids directly... Do you mean to
scope and service units below that slice?

In any case, for now I'll probably leave that alone... Though as I revamp
libcontainer support for unified hierarchy, I'll try to skip that check on
that case, that might make this a legacy-only setting, so not that
important to fully get rid of it for a while...

Cheers!
Filipe
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel

Re: [systemd-devel] Delegate= on slice before v237

2019-02-12 Thread Lennart Poettering
On Mo, 11.02.19 16:39, Filipe Brandenburger (filbran...@google.com) wrote:

> Hi Lennart,

heya!

> Before systemd v237 (when Delegate= was no longer allowed on slice
> units)... Did setting Delegate=yes on a slice have *any* effect at all?
>
> Or did it just do nothing (and a slice with Delegate=no or no setting
> behave just the same)?
>
> Reason I ask is: I want to scrap this code
> 
> in libcontainer that tries to detect whether Delegate= is accepted in a
> slice unit. (I'll just default it to false, never try it.)
>
> I'd like to be able to say that Delegate=yes never really did anything at
> all on slice units... So I'm trying to confirm that is really the case
> before stating it.

So, it wasn't supposed to do anything, and what it does differs on
cgroupsv2 and cgroupsv1. The fact it wasn't refused outright was an
accident, and because it was one I am not entirely sure what the
precise effect of allowing it was. However, I am pretty sure it at
least had two effects:

1. it would turn on all controllers for the cgroup

2. it would stop systemd to ever migrating foreign processes below
   that slice, which is primarily relevant only when changing cgroup
   related props on the slice dynamically I guess.

Lennart

--
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel