[systemd-devel] [PATCHSET RE-RESEND] update unified hierarchy support

2016-03-25 Thread Tejun Heo
(sorry, of course forgot to attach the patches)
(bounced for not being subscribed, resending...)

Hello,

Unified hierarchy is available on the 4.5 kernel but there have been
several updates.

1. The __DEVEL__sane_behavior flag is gone.  Unified hierarchy is now
   available as "cgroup2" filesystem type with its own super magic
   number.

2. "cgroup.populated" file is replaced with "populated" field of
   "cgroup.events" file.

3. A zombie task remains associated with the cgroup it was associated
   with at the time of death instead of being moved immediately to
   root.  This means that pid to unit lookup may return a slice if the
   session or service unit the pid belonged to is already gone.

Three patches are attached addressing each of the above.

Thanks!

-- 
tejun
From 278a39f0a8fa34cd899c6a08e76626c987a4713e Mon Sep 17 00:00:00 2001
From: Tejun Heo <hte...@fb.com>
Date: Fri, 25 Mar 2016 11:38:50 -0400
Subject: [PATCH 1/3] core: update unified hierarchy support

Unified hierarchy is official as of Linux v4.5 and now available through a new
filesystem type, cgroup2, with its own super magic.  Update mount logic
accordingly.

Signed-off-by: Tejun Heo <hte...@fb.com>
---
 src/basic/cgroup-util.c | 2 +-
 src/basic/missing.h     | 4 ++++
 src/core/mount-setup.c  | 2 +-
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/basic/cgroup-util.c b/src/basic/cgroup-util.c
index 56c1fca..5124b5b 100644
--- a/src/basic/cgroup-util.c
+++ b/src/basic/cgroup-util.c
@@ -2129,7 +2129,7 @@ int cg_unified(void) {
 if (statfs("/sys/fs/cgroup/", &fs) < 0)
 return -errno;
 
-if (F_TYPE_EQUAL(fs.f_type, CGROUP_SUPER_MAGIC))
+if (F_TYPE_EQUAL(fs.f_type, CGROUP2_SUPER_MAGIC))
 unified_cache = true;
 else if (F_TYPE_EQUAL(fs.f_type, TMPFS_MAGIC))
 unified_cache = false;
diff --git a/src/basic/missing.h b/src/basic/missing.h
index 034e334..66cd592 100644
--- a/src/basic/missing.h
+++ b/src/basic/missing.h
@@ -437,6 +437,10 @@ struct btrfs_ioctl_quota_ctl_args {
 #define CGROUP_SUPER_MAGIC 0x27e0eb
 #endif
 
+#ifndef CGROUP2_SUPER_MAGIC
+#define CGROUP2_SUPER_MAGIC 0x63677270
+#endif
+
 #ifndef TMPFS_MAGIC
 #define TMPFS_MAGIC 0x01021994
 #endif
diff --git a/src/core/mount-setup.c b/src/core/mount-setup.c
index de1a361..32fe51c 100644
--- a/src/core/mount-setup.c
+++ b/src/core/mount-setup.c
@@ -94,7 +94,7 @@ static const MountPoint mount_table[] = {
 #endif
 { "tmpfs",       "/run",            "tmpfs",   "mode=755",                MS_NOSUID|MS_NODEV|MS_STRICTATIME,
   NULL,                 MNT_FATAL|MNT_IN_CONTAINER },
-{ "cgroup",      "/sys/fs/cgroup",  "cgroup",  "__DEVEL__sane_behavior",  MS_NOSUID|MS_NOEXEC|MS_NODEV,
+{ "cgroup",      "/sys/fs/cgroup",  "cgroup2", NULL,                      MS_NOSUID|MS_NOEXEC|MS_NODEV,
   cg_is_unified_wanted, MNT_FATAL|MNT_IN_CONTAINER },
 { "tmpfs",       "/sys/fs/cgroup",  "tmpfs",   "mode=755",                MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_STRICTATIME,
   cg_is_legacy_wanted,  MNT_FATAL|MNT_IN_CONTAINER },
-- 
2.5.5

From 0fed0c3cdebe72557db528572ed2c531e32e7d5a Mon Sep 17 00:00:00 2001
From: Tejun Heo <hte...@fb.com>
Date: Fri, 25 Mar 2016 11:38:50 -0400
Subject: [PATCH 2/3] core: update populated event handling in unified
 hierarchy

Earlier during the development of unified hierarchy, the populated event was
reported through the dedicated "cgroup.populated" file; however, the
interface was updated so that it's reported through the "populated" field of
the "cgroup.events" file.  Update populated event handling logic accordingly.

Signed-off-by: Tejun Heo <hte...@fb.com>
---
 src/basic/cgroup-util.c    | 45 ++++++++++++++++++++++++++++++++++++---------
 src/basic/cgroup-util.h    |  2 ++
 src/core/cgroup.c          |  6 +++---
 src/nspawn/nspawn-cgroup.c |  3 +--
 4 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/src/basic/cgroup-util.c b/src/basic/cgroup-util.c
index 5124b5b..5043180 100644
--- a/src/basic/cgroup-util.c
+++ b/src/basic/cgroup-util.c
@@ -101,6 +101,39 @@ int cg_read_pid(FILE *f, pid_t *_pid) {
 return 1;
 }
 
+int cg_read_event(const char *controller, const char *path, const char *event,
+  char **val)
+{
+_cleanup_free_ char *events = NULL, *content = NULL;
+char *p, *line;
+int r;
+
+r = cg_get_path(controller, path, "cgroup.events", &events);
+if (r < 0)
+return r;
+
+r = read_full_file(events, &content, NULL);
+if (r < 0)
+return r;
+
+p = content;
+while ((line = 



Re: [systemd-devel] [PATCH 1/4] cgroups: support for MemoryAndSwapLimit= setting

2013-10-10 Thread Tejun Heo
Hello,

On Thu, Oct 10, 2013 at 04:03:20PM +0200, Lennart Poettering wrote:
 For example MemorySoftLimit is something we supported previously, but
 which I recently removed because Tejun Heo (the kernel cgroup
 maintainer, added to CC) suggested that the attribute wouldn't continue
 to exist on the kernel side or at least not in this form.

The problem with the current softlimit is that we currently aren't
sure what it means.  Its semantics are defined only by its
implementation details, with all their quirks, and different parties
interpret and use it differently.  memcg people are trying to clear
that up, so I think it'd be worthwhile to wait and see what happens
there.

 Tejun, Mika sent patches to wrap memory.memsw.limit_in_bytes,
 memory.kmem.limit_in_bytes, memory.soft_limit_in_bytes,
 memory.kmem.tcp.limit_in_bytes in high-level systemd attributes. Could
 you comment on the future of these attributes in the kernel? Should we
 expose them in systemd?
 
 At the systemd hack fest in New Orleans we already discussed
 memory.soft_limit_in_bytes and memory.memsw.limit_in_bytes and you
 suggested not to expose them. What about the other two?

Except for soft_limit_in_bytes, at least the meanings of the knobs are
well-defined and stable, so I think it should be at least safe to
expose those.

 (I have the suspicion though that if we want to expose something we
 probably want to expose a single knob that puts a limit on all kinds of
 memory, regardless of RAM, swap, kernel or tcp...)

Yeah, the different knobs grew organically to cover more stuff which
wasn't covered before, so, yeah, when viewed together, they don't
really make cohesive sense.  Another problem is that enabling kmem
knobs would involve a noticeable amount of extra overhead.  kmem also
has restrictions on when it can be enabled - it can't be enabled on a
populated cgroup.

Maybe an approach that makes sense is one where you set the amount of
memory which can be used and toggle which types of memory should be
included in the accounting.  Setting the kmem limit equal to
limit_in_bytes makes limit_in_bytes apply to both kernel and user
memory.  I'll ask the memcg people and find out how viable such an
approach is.

Thanks!

-- 
tejun
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] [PATCH 1/4] cgroups: support for MemoryAndSwapLimit= setting

2013-10-10 Thread Tejun Heo
(cc'ing Johannes and quoting the whole body for context)

Hey, guys.

On Thu, Oct 10, 2013 at 10:28:16AM -0400, Tejun Heo wrote:
 Hello,
 
 On Thu, Oct 10, 2013 at 04:03:20PM +0200, Lennart Poettering wrote:
  For example MemorySoftLimit is something we supported previously, but
  which I recently removed because Tejun Heo (the kernel cgroup
  maintainer, added to CC) suggested that the attribute wouldn't continue
  to exist on the kernel side or at least not in this form.
 
 The problem with the current softlimit is that we currently aren't
 sure what it means.  Its semantics is defined only by its
 implementation details with all its quirks and different parties
 interpret and use it differently.  memcg people are trying to clear
 that up so I think it'd be worthwhile to wait to see what happens
 there.
 
  Tejun, Mika sent patches to wrap memory.memsw.limit_in_bytes,
  memory.kmem.limit_in_bytes, memory.soft_limit_in_bytes,
  memory.kmem.tcp.limit_in_bytes in high-level systemd attributes. Could
  you comment on the future of these attributes in the kernel? Should we
  expose them in systemd?
  
  At the systemd hack fest in New Orleans we already discussed
  memory.soft_limit_in_bytes and memory.memsw.limit_in_bytes and you
  suggested not to expose them. What about the other two?
 
 Except for soft_limit_in_bytes, at least the meanings of the knobs are
 well-defined and stable, so I think it should be at least safe to
 expose those.
 
  (I have the suspicion though that if we want to expose something we
  probably want to expose a single knob that puts a limit on all kinds of
  memory, regardless of RAM, swap, kernel or tcp...)
 
 Yeah, the different knobs grew organically to cover more stuff which
 wasn't covered before, so, yeah, when viewed together, they don't
 really make a cohesive sense.  Another problem is that, enabling kmem
 knobs would involve noticeable amount of extra overhead.  kmem also
 has restrictions on when it can be enabled - it can't be enabled on a
 populated cgroup.
 
 Maybe an approach which makes sense is where one sets the amount of
 memory which can be used and toggle which types of memory should be
 included in the accounting.  Setting kmem limit equal to that of
 limit_in_bytes makes limit_in_bytes applied to both kernel and user
 memories.  I'll ask memcg people and find out how viable such approach
 is.

I talked with Johannes about the knobs and think something like the
following could be useful.

* A swap knob, which, when set, configures memsw.limit_in_bytes to
  memory.limit_in_bytes + the set value.

* A switch to enable kmem.  When enabled, kmem.limit_in_bytes tracks
  memory.limit_in_bytes.  ie. kmem is accounted and both kernel and
  user memory live under the same memory limit.

* A kmem knob which can be optionally configured to a lower value than
  memory.limit_in_bytes.  This is useful for overcommit scenarios as
  explained in Documentation/cgroups/memory.txt::2.7.3.

* tcp knobs are currently completely separate from other memory
  limits.  This should probably be included in memory.limit_in_bytes.
  I think it probably is a better idea to hold off on this one.

* What softlimit means is still very unclear.  We might end up with
  explicit guarantee knob and keep softlimit as it is, whatever it
  currently means.

Caveats

* This setup doesn't allow setting (memory + swap) limit without
  setting memory limit.

* The overcommit scenario described in memory.txt::2.7.3 is somewhat
  bogus because not all userland memory is reclaimable and not all
  kernel memory is unreclaimable.  Oh well...

Thanks.

-- 
tejun


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Tejun Heo
Hello,

On Mon, Jun 24, 2013 at 02:39:53PM +0100, Daniel P. Berrange wrote:
 On Mon, Jun 24, 2013 at 03:27:15PM +0200, Lennart Poettering wrote:
  On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:
  
   1. I put the entire world into a separate, highly constrained
   cgroup.  My real-time code runs outside that cgroup.  This seems to be
   exactly what slices are for, but I need kernel threads to go into
   the constrained cgroup.  Will systemd support this?
  
  I am not sure whether the ability to move kernel threads into cgroups
  will stay around at all, from the kernel side. Tejun, can you comment
  on this?
 
 KVM uses the vhost_net device for accelerating guest network I/O
 paths. This device creates a new kernel thread on each open(),
 and that kernel thread is attached to the cgroup associated
 with the process that open()d the device.
 
 If systemd allows for a process to be moved between cgroups, then
 it must also be capable of moving any associated kernel threads to
 the new cgroup at the same time. This co-placement of vhost-net
 threads with the KVM process, is very critical for I/O performance
 of KVM networking.

Yeah, the way virt drivers use cgroups right now is pretty hacky.  I
was thinking about adding a per-process workqueue which follows the
cgroup association of the process after the unified hierarchy lands,
and then converting virt to use that.

At any rate, those kthreads can be moved via cgroup.procs, so unified
hierarchy wouldn't break it from kernel side.  Not sure how the
interface would look from systemd side tho.

Thanks.

-- 
tejun


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Tejun Heo
Hello,

On Mon, Jun 24, 2013 at 03:27:15PM +0200, Lennart Poettering wrote:
 On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:
 
  1. I put the entire world into a separate, highly constrained
  cgroup.  My real-time code runs outside that cgroup.  This seems to be
  exactly what slices are for, but I need kernel threads to go into
  the constrained cgroup.  Will systemd support this?
 
 I am not sure whether the ability to move kernel threads into cgroups
 will stay around at all, from the kernel side. Tejun, can you comment on this?

Any kernel threads with PF_NO_SETAFFINITY set already can't be removed
from the root cgroup.  In general, I don't think moving kernel threads
into !root cgroups is a good idea.  They're in most cases shared
resources and userland doesn't really have much idea what they're
actually doing, which is the fundamental issue.

Which kthreads are running and what they're doing is strictly an
implementation detail of the kernel.  There's no effort on the kernel
side to keep them stable, and userland is likely
to get things completely wrong - e.g. many kernel threads named after
workqueues in any recent kernels don't actually do anything until the
system is under heavy memory pressure.  Userland can't tell and has no
control over what's being executed where at all and that's the way it
should be.

That said, there are cases where certain async executions are
concretely bound to userland processes - say, (planned) aio updates,
virt drivers and so on.  Right now, virt implements something pretty
hacky but I think they'll have to be tied closer to the usual process
mechanism - ie. they should be saying that these kthreads are serving
this process and should be treated as such in terms of resource
control, rather than the current "move this kthread to this set of
cgroups, don't ask why" thing.  Another not-well-thought-out aspect of
the current cgroup.  :(

I have an idea where it should be headed in the long term but am not
sure about a short-term solution.  Given that the only sort-of widespread
use case is virt kthreads, maybe it just needs to be special cased for
now.  Not sure.

  2. I manage services and tasks outside systemd (for one thing, I
  currently use Ubuntu, but even if I were on Fedora, I have a bunch
  of fine-grained things that figure out how they're supposed to
  allocate resources, and porting them to systemd just to keep working
  in the new world order would be a PITA [1]).
  
  (cgroups have the odd feature that they are per-task, not per thread
  group, and the systemd proposal seems likely to break anything that
  actually wants task granularity.  I may actually want to use this,
  even though it's a bit evil -- my real-time thread groups have
  non-real-time threads.)
 
 Here too, Tejun is pretty keen on removing the ability of splitting up
 threads into cgroups from the kernel, and will only allow this
 per-process. Tejun, please comment!

Yes, again, the biggest issue is how much of low-level cgroup details
become known to individual programs.  Splitting threads into different
cgroup would in most cases mean that the binary itself would become
aware of cgroup and it's akin to burying sysctl knob tunings into
individual binaries.  cgroup is not an interface for each individual
program to fiddle with.  If certain thread-granular control is
absolutely necessary and justifiable, it's something to be added to
the existing thread API, not something to be bolted on using cgroups.

So, I'm quite strongly against allowing splitting threads of
the same process into different cgroups.

Thanks.

-- 
tejun


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Tejun Heo
Hello, Andy.

On Mon, Jun 24, 2013 at 11:49:05AM -0700, Andy Lutomirski wrote:
  I have an idea where it should be headed in the long term but am not
  sure about short-term solution.  Given that the only sort wide-spread
  use case is virt kthreads, maybe it just needs to be special cased for
  now.  Not sure.
 
 I'll be okay (I think) if I can reliably set affinities of these
 threads.  I'm currently doing it with cgroups.
 
 That being said, I don't like the direction that kernel thread magic
 affinity is going.  It may be great for cache performance and reducing
 random bouncing, but I have a scheduling-jitter-sensitive workload and
 I don't care about overall system throughput.  I need the kernel to
 stay the f!k off my important cpus, and arranging for this to happen
 is becoming increasingly complicated.

Why is it becoming increasingly complicated?  The biggest change
probably was the shared workqueue pool implementation but that was
years ago and workqueue has grown pool attributes recently adding more
properly designed flexibility and, for example, adding default
affinity for !per-cpu workqueues should be pretty easy now.  But
anyways, if it's an issue, it should be examined and properly solved
rather than hacking up a hacky solution with cgroup.

 cgroups are most certainly something that a binary can be aware of.
 It's not like a sysctl knob at all -- it's per process.  I have lots

No, it definitely is not.  Sure it is more granular than sysctl but
that's it.  It exposes control knobs which are directly tied into
kernel implementation details.  It is not a properly designed
programming API by any stretch of imagination.  It is an extreme
failure on the kernel side that that part hasn't been made crystal
clear from the beginning.  I don't know how intentional it was but the
whole thing is completely botched.

cgroup *never* was held to the standard necessary for any widely
available API and many of the controls it exposes are exactly at the
level of sysctls.  As the interface was filesystem, it could evade
scrutiny and with the hierarchical organization also gave the
impression that it's something which can be used directly by
individual applications.  It found a loophole in the way we implement
and police kernel APIs and then exploited it like there's no tomorrow.

We are firmly bound to maintain what already has been exposed from the
kernel side and I'm not gonna break any of them but the free-for-all
cgroup is broken and deprecated.  It's gonna wither and fade away and
any attempt to reverse that will be met with extreme prejudice.

 of binaries that have worked quite well for a couple years that move
 themselves into different cgroups.  I have no problem with a unified
 hierarchy, but I need control of my little piece of the hierarchy.
 
 I don't care if the interface to do so changes, but the basic
 functionality is important.

Whether you care or not is completely irrelevant.  Individual binaries
widely incorporating cgroup details automatically binds the kernel.
It becomes excruciatingly painful to back out after certain point.  I
don't think we're there yet given the overall immaturity and brokeness
of cgroups and it's imperative that we back the hell out as fast as
possible before this insanity spreads any wider.

Thanks.

-- 
tejun


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Tejun Heo
Hello,

On Mon, Jun 24, 2013 at 12:24:38PM -0700, Andy Lutomirski wrote:
 Because more things are becoming per-cpu without the option of moving
 per-cpu work done on behalf of one cpu to another cpu.  RCU is a nice
 exception.

Hmm... but in most cases it's per-cpu on the same cpu that initiated
the task.  If a given CPU is just crunching numbers and IRQ affinity
is properly configured, the CPU shouldn't be bothered too much by
per-cpu work items.  If there are, please let us know.  We can hunt
them down.

 The functionality I care about is that a program can reliably and
 hierarchically subdivide system resources -- think rlimits but
 actually useful.  I, and probably many other things, want this
 functionality.  Yes, the current cgroup interface is awful, but it
 gets one thing right: it's a hierarchy.

And the hierarchy support was completely broken for many resource
controllers up until only several releases ago.

 I would argue that designing a kernel interface that requires exactly
 one userspace component to manage it and ties that one userspace
 component to something that can't easily be deployed everywhere (the
 init system) is as big a cheat as the old approach of sneaking bad
 APIs in through a filesystem was.

In terms of API, it is firmly at the level of sysctl.  That's it.

While I agree that having a proper kernel API for hierarchical
resource management could be nice, that currently is out of scope.
We're already knee-deep in shit with the limited capabilities we're
trying to implement.  Also, I really don't think cgroup is the right
interface for such thing even if we get to that.  It should be part of
the usual process/thread model, not this completely separate thing on
the side.

 IOW, please, when designing this, please specify an API that programs
 are permitted to use, and let that API be reviewed.

cgroup is not that API and it's never gonna be in all likelihood.  As
for systemd vs. non-systemd compatibility, I'm afraid I don't have a
good answer.  This is still all in a pretty early phase and the
proper abstractions and APIs are being figured out.  Hopefully, we'll
converge on a mostly compatible high-level abstraction which can be
presented regardless of the actual base system implementation.

Thanks.

-- 
tejun


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Tejun Heo
Hello,

On Mon, Jun 24, 2013 at 04:01:07PM -0700, Andy Lutomirski wrote:
 So what is cgroup for?  That is, what's the goal for what the new API
 should be able to do?

It is for controlling and distributing resources.  That part doesn't
change.  It's just not built to be used directly by individual
applications.  It's an admin tool just like sysctl - be that admin
a human or the userland base system.

There's a huge chasm between something which can be generally used by
normal applications and something which is restricted to admins and
base systems in terms of interface generality and stability, security,
how the abstractions fit together with the existing APIs and so on.
cgroup firmly belongs to the latter.  It still serves the same purpose
but isn't, in a way, developed enough to be used directly by
individual applications and I'm not even sure we want or need to
develop it to such a level.

Thanks.

-- 
tejun


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Tejun Heo
Hello,

On Mon, Jun 24, 2013 at 4:38 PM, Andy Lutomirski l...@amacapital.net wrote:
 Now I'm confused.  I thought that support for multiple hierarchies was
 going away.  Is it here to stay after all?

It is going to be deprecated but also stay around for quite a while.
That said, I didn' t mean to use multiple hierarchies. I was saying
that if you build a sub-hierarchy in the unified hierarchy, you're
likely to get away with it in most cases.

Thanks.

--
tejun