Re: [lxc-devel] cgroup management daemon

2013-12-04 Thread Tejun Heo
Hello, Tim.

On Tue, Dec 03, 2013 at 08:53:21PM -0800, Tim Hockin wrote:
 If this daemon works as advertised, we will explore moving all write
 traffic to use it.  I still have concerns that this can't handle read
 traffic at the scale we need.

At least from the kernel side, cgroup doesn't and won't have any
problem with direct reads.

 Tejun,  I am not sure why chown came back into the conversation.  This
 is a replacement for that.

I guess I'm just confused because of the mentions of chown.  If it
isn't about giving unmoderated write access to untrusted domains,
everything should be fine.

Thanks!

-- 
tejun



Re: [lxc-devel] cgroup management daemon

2013-12-04 Thread Serge Hallyn
Quoting Tim Hockin (thoc...@google.com):
 If this daemon works as advertised, we will explore moving all write
 traffic to use it.  I still have concerns that this can't handle read
 traffic at the scale we need.
 
 Tejun,  I am not sure why chown came back into the conversation.  This
 is a replacement for that.

Because the daemon is chowning directories and files.  That's how
the daemon decides whether clients have access.

-serge



Re: [lxc-devel] cgroup management daemon

2013-12-04 Thread Serge Hallyn
Quoting Victor Marmol (vmar...@google.com):
 I thought we were going to use chown in the initial version to enforce the
 ownership/permissions on the hierarchy. Only the cgroup manager has access
 to the hierarchy, but it tries to access the hierarchy as the user that
 sent the request. It was only meant to be a 'for now' solution while the
 real one rolls out. It may also have gotten thrown out since last I heard :)

Actually that part wasn't meant as a 'for now' solution.  It can of
course be thrown away in favor of having the daemon store all this
information, but I'm seeing no advantages to that right now.

There are other things which the daemon can eventually try to keep
track of, if we don't decide they belong in a higher layer.

-serge



Re: [lxc-devel] cgroup management daemon

2013-12-04 Thread Tejun Heo
On Wed, Dec 04, 2013 at 09:54:37AM -0600, Serge Hallyn wrote:
 Quoting Tim Hockin (thoc...@google.com):
  If this daemon works as advertised, we will explore moving all write
  traffic to use it.  I still have concerns that this can't handle read
  traffic at the scale we need.
  
  Tejun,  I am not sure why chown came back into the conversation.  This
  is a replacement for that.
 
 Because the daemon is chowning directories and files.  That's how
 the daemon decides whether clients have access.

Ah, okay, so the manager is just using filesystem metadata for
bookkeeping.  That should be fine.  Please note that the cgroup filesystem
also supports xattrs, and AFAIK systemd is already making use of them.

Thanks.

-- 
tejun



Re: [lxc-devel] cgroup management daemon

2013-12-03 Thread Tejun Heo
Hello, guys.

Sorry about the delay.

On Mon, Nov 25, 2013 at 10:43:35PM +, Serge E. Hallyn wrote:
 Additionally, Tejun has specified that we do not want users to be
 too closely tied to the cgroupfs implementation.  Therefore
 commands will be just a hair more general than specifying cgroupfs
 filenames and values.  I may go so far as to avoid specifying
 specific controllers, as AFAIK there should be no redundancy in
 features.  On the other hand, I don't want to get too general.
 So I'm basing the API loosely on the lmctfy command line API.

One of the reasons for not exposing knobs as-is is that the knobs we
currently have aren't consistent.  The weight values have different
ranges, some combinations of values don't make much sense, and so on.
The user can cope with it but it'd probably be better to expose
something which doesn't lead to mistakes too easily.

 The above addresses
 * creating cgroups
 * chowning cgroups
 * setting cgroup limits
 * moving tasks into cgroups
   . but does not address a 'cgexec group -- command' type of behavior.
 * To handle that (specifically for upstart), recommend that r do:
   if (!pid) {
       request_reclassify(cgroup, getpid());
       do_execve();
   }
   . alternatively, the daemon could, if kernel is new enough, setns to
     the requestor's namespaces to execute a command in a new cgroup.
     The new command would be daemonized to that pid namespace's pid 1.
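For concreteness, a minimal sketch of that pattern - request_reclassify()
here is a hypothetical stand-in for whatever client call issues the dbus
request, and error handling is elided:

    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical client helper: asks cgmanager over dbus to move
     * the given pid into the named cgroup; returns 0 on success. */
    extern int request_reclassify(const char *cgroup, pid_t pid);

    pid_t spawn_in_cgroup(const char *cgroup, char *const argv[])
    {
        pid_t pid = fork();
        if (pid == 0) {
            /* child: move ourselves into the target cgroup, then exec */
            if (request_reclassify(cgroup, getpid()) < 0)
                _exit(1);
            execvp(argv[0], argv);
            _exit(127);    /* exec failed */
        }
        return pid;        /* parent: waitpid() as usual; -1 if fork failed */
    }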

So, IIUC, cgroup hierarchy management - creation and removal of
cgroups and assignment of tasks - will go through the manager, while
configuring control knobs will be delegated to the cgroup owner, right?

Hmmm... the plan is to allow delegating task assignments in the
sub-hierarchy but require CAP_X for writes to knobs (not reads).  This
stems from the fact that, especially with unified hierarchy, those
operations will be cgroup-core proper operations which are gonna be
relatively safer and that task organization in the subhierarchy and
monitoring knobs are likely to be higher-frequency operations than
enabling and configuring controllers.

As I communicated multiple times before, delegating write access to
control knobs to an untrusted domain has always been a security risk and
is likely to continue to remain so.  Also, organizationally, a
cgroup's control knobs belong to the parent not the cgroup itself.
That probably is why you were thinking about putting an extra cgroup
inbetween for isolation, but the root problem there is that those
knobs belong to the parent, not the directory itself.

Security is in most part logistics - it's about getting all the
details right, and we don't either design or implement each knob with
security in mind and DoSing them has always been pretty easy, so I
don't think delegating write accesses to knobs is a good idea.

If you, for whatever reason, can trust the delegatee, which I believe
is the case for google, it's fine.  If you're trying to delegate to a
container which you don't have any control over, it isn't a good idea.

Another thing to consider is that, due to both the fundamental characteristics
of hierarchy and implementation issues, things will become expensive
if nesting gets beyond several layers (if controllers are enabled,
that is) and the controllers in general will be implemented and
optimized with limited level of nesting in mind.  IOW, building, say,
8 level deep hierarchy in the host and then doing the same thing
inside the container with controllers enabled won't make a very happy
system.  It probably is something to keep in mind when laying out how
the whole thing eventually would look like.

 Long-term we will want the cgroup manager to become more intelligent -
 to place its own limits on clients, to address cpu and device hotplug,
 etc.  Since we will not be doing that in the first prototype, the daemon
 will not keep any state about the clients.

Isn't the above conflicting with chowning control knobs?

Thanks.

-- 
tejun



Re: [lxc-devel] cgroup management daemon

2013-12-03 Thread Tejun Heo
Ooh, can you also please cc Li Zefan lize...@huawei.com when
replying?

Thanks.

-- 
tejun



Re: [lxc-devel] cgroup management daemon

2013-12-03 Thread Tejun Heo
Hello, Tim.

On Mon, Nov 25, 2013 at 08:58:09PM -0800, Tim Hockin wrote:
 Thanks for this!  I think it helps a lot to discuss now, rather than
 over nearly-done code.
 
 On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn se...@hallyn.com wrote:
  Additionally, Tejun has specified that we do not want users to be
  too closely tied to the cgroupfs implementation.  Therefore
  commands will be just a hair more general than specifying cgroupfs
  filenames and values.  I may go so far as to avoid specifying
  specific controllers, as AFAIK there should be no redundancy in
  features.  On the other hand, I don't want to get too general.
  So I'm basing the API loosely on the lmctfy command line API.
 
 I'm torn here.  While I agree in principle with Tejun, I am concerned
 that this agent will always lag new kernel features or that the thin
 abstraction you want to provide here does not easily accommodate some
 of the more ... oddball features of one cgroup interface or another.

Yeah, that's the trade-off but cgroupfs is a kernel API.  It shouldn't
change or grow rapidly once things settle down.  As long as there's
a not-too-crazy way to step aside when such a rare case arises, I think
the pros outweigh the cons.

 This agent is the very bottom of the stack, and should probably not do
 much by way of abstraction.  I think I'd rather let something like
 lmctfy provide the abstraction more holistically, and relegate this
 agent to very simple plumbing and policy.  It could be as simple as
 providing read/write/etc ops to specific control files.  It needs to
 handle event_fd, too, I guess.  This has the nice side-effect of
 always being current on kernel features :)

The level of abstraction is definitely something debatable.  Please
note that the existing event_fd based mechanism won't grow any new
users (BTW, event_control is one of the DoS vectors if you give write
access to it) and all new notifications will be using inotify.

Thanks.

-- 
tejun



Re: [lxc-devel] cgroup management daemon

2013-12-03 Thread Tejun Heo
Hello,

On Tue, Nov 26, 2013 at 09:19:18AM -0800, Victor Marmol wrote:
   From my discussions with Tejun, he wanted to move to using inotify so it
   may still be an fd we pass around.
 
  Hm, would that just be inotify on the memory.max_usage_in_bytes
   file, or inotify on a specific fd you've created which is
  associated with any threshold you specify?  The former seems
  less ideal.
 
 
 Tejun can comment more, but I think it is still TBD.

It's likely the former with configurable cadence or per-knob (not
per-opener) configurable thresholds.  max_usage_in_bytes is a special
case here as all other knobs can simply generate an event on each
transition.  If event (de)muxing is necessary, it probably should be
done from userland.

Thanks.

-- 
tejun



Re: [lxc-devel] cgroup management daemon

2013-12-03 Thread Serge Hallyn
Quoting Tejun Heo (t...@kernel.org):
 Hello, guys.
 
 Sorry about the delay.
 
 On Mon, Nov 25, 2013 at 10:43:35PM +, Serge E. Hallyn wrote:
  Additionally, Tejun has specified that we do not want users to be
  too closely tied to the cgroupfs implementation.  Therefore
  commands will be just a hair more general than specifying cgroupfs
  filenames and values.  I may go so far as to avoid specifying
  specific controllers, as AFAIK there should be no redundancy in
  features.  On the other hand, I don't want to get too general.
  So I'm basing the API loosely on the lmctfy command line API.
 
 One of the reasons for not exposing knobs as-is is that the knobs we
 currently have aren't consistent.  The weight values have different
 ranges, some combinations of values don't make much sense, and so on.
 The user can cope with it but it'd probably be better to expose
 something which doesn't lead to mistakes too easily.

For the moment, for prototype (github.com/hallyn/cgmanager), I'm just
going with filenames/values.

When the bulk of the work is done, we can (a) introduce
a thin abstraction layer over the key/values, and/or (b) whitelist
some of the filenames and filter some values.
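For illustration, (b) could start as small as a per-key whitelist
consulted before any set-value request is forwarded to cgroupfs (the
keys below are placeholders, not a vetted list):

    #include <stdbool.h>
    #include <string.h>

    /* Placeholder whitelist; a real one would be vetted per controller. */
    static const char *const safe_keys[] = {
        "memory.limit_in_bytes",
        "cpu.shares",
        "cpuset.cpus",
        NULL,
    };

    bool key_allowed(const char *key)
    {
        for (int i = 0; safe_keys[i]; i++)
            if (strcmp(key, safe_keys[i]) == 0)
                return true;
        return false;
    }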

I know the upstart folks don't want to have to wait long for a
specification...  I'll hopefully make a final decision on this next
week.

  The above addresses
  * creating cgroups
  * chowning cgroups
  * setting cgroup limits
  * moving tasks into cgroups
  . but does not address a 'cgexec group -- command' type of behavior.
 * To handle that (specifically for upstart), recommend that r do:
   if (!pid) {
       request_reclassify(cgroup, getpid());
       do_execve();
   }
   . alternatively, the daemon could, if kernel is new enough, setns to
     the requestor's namespaces to execute a command in a new cgroup.
     The new command would be daemonized to that pid namespace's pid 1.
 
 So, IIUC, cgroup hierarchy management - creation and removal of
 cgroups and assignment of tasks - will go through the manager, while
 configuring control knobs will be delegated to the cgroup owner, right?

Not sure what you mean, but I think the answer is no.  Everything
goes through the manager.  The manager doesn't try to enforce that,
but by default the cgroup filesystems will only be mounted in the
manager's private mnt_ns, and containers at least will not be
allowed to mount cgroup fstype.

 Hmmm... the plan is to allow delegating task assignments in the
 sub-hierarchy but require CAP_X for writes to knobs (not reads).  This
 stems from the fact that, especially with unified hierarchy, those
 operations will be cgroup-core proper operations which are gonna be
 relatively safer and that task organization in the subhierarchy and
 monitoring knobs are likely to be higher-frequency operations than
 enabling and configuring controllers.

Should be ok for this.

 As I communicated multiple times before, delegating write access to
 control knobs to an untrusted domain has always been a security risk and
 is likely to continue to remain so.  Also, organizationally, a

Then that will need to be addressed with per-key blacklisting and/or
per-value filtering in the manager.

Which is my way of saying:  can we please have a list of the security
issues so we can handle them?  :)  (I've asked several times before
but haven't seen a list or anyone offering to make one)

 cgroup's control knobs belong to the parent not the cgroup itself.

After thinking awhile I think this makes perfect sense.  I haven't
implemented set_value yet, and when I do I think I'll implement this
guideline.

 That probably is why you were thinking about putting an extra cgroup
 inbetween for isolation, but the root problem there is that those
 knobs belong to the parent, not the directory itself.

Yup.

 Security is in most part logistics - it's about getting all the
 details right, and we don't either design or implement each knob with
 security in mind and DoSing them has always been pretty easy, so I
 don't think delegating write accesses to knobs is a good idea.
 
 If you, for whatever reason, can trust the delegatee, which I believe
 is the case for google, it's fine.  If you're trying to delegate to a
 container which you don't have any control over, it isn't a good idea.
 
 Another thing to consider is that, due to both the fundamental characteristics
 of hierarchy and implementation issues, things will become expensive
 if nesting gets beyond several layers (if controllers are enabled,
 that is) and the controllers in general will be implemented and
 optimized with limited level of nesting in mind.  IOW, building, say,
 8 level deep hierarchy in the host and then doing the same thing
 inside the container with controllers enabled won't make a very happy

Yes, I very much want to avoid that.

 system.  It probably is something to keep in mind when laying out how
 the whole thing eventually would look like.
 
  Long-term we will want 

Re: [lxc-devel] cgroup management daemon

2013-12-03 Thread Tejun Heo
Hello, Serge.

On Tue, Dec 03, 2013 at 06:03:44PM -0600, Serge Hallyn wrote:
  As I communicated multiple times before, delegating write access to
  control knobs to an untrusted domain has always been a security risk and
  is likely to continue to remain so.  Also, organizationally, a
 
 Then that will need to be addressed with per-key blacklisting and/or
 per-value filtering in the manager.
 
 Which is my way of saying:  can we please have a list of the security
 issues so we can handle them?  :)  (I've asked several times before
 but haven't seen a list or anyone offering to make one)

Unfortunately, for now, please consider everything blacklisted.  Yes,
it is true that some knobs should be mostly safe but given the level
of changes we're going through and the difficulty of properly auditing
anything for delegation to untrusted environment, I don't feel
comfortable at all about delegating through chown.  It is an
accidental feature which happened just because it uses filesystem as
its interface and it is nowhere near the top of the todo list.  It
has never worked properly and won't in any foreseeable future.

  cgroup's control knobs belong to the parent not the cgroup itself.
 
 After thinking awhile I think this makes perfect sense.  I haven't
 implemented set_value yet, and when I do I think I'll implement this
 guideline.

I'm kinda confused here.  You say *everything* is gonna go through the
manager and then talk about chowning directories.  Don't the two
conflict?

   Long-term we will want the cgroup manager to become more intelligent -
   to place its own limits on clients, to address cpu and device hotplug,
   etc.  Since we will not be doing that in the first prototype, the daemon
   will not keep any state about the clients.
  
  Isn't the above conflicting with chowning control knobs?
 
 Not sure what you mean by this.
 
 To be clear what I'm talking about is having the client be able to say
 grant 50% of cpus, then when more cpus are added, the actual cpuset
 gets recalculated.  This may well forever stay outside of the cgmanager
 scope.  It may be more appropriate to put that logic into the lmctfy
 layer.

Yes, something like that would be nice but if you give out raw access
to the control knobs by chowning them, I just don't see how that would
be implementable.  What am I missing here?

Thanks.

-- 
tejun



Re: [lxc-devel] cgroup management daemon

2013-12-03 Thread Tejun Heo
And can somebody please fix up lxc-devel so that it doesn't generate
a "your message awaits moderator approval" notification on *each*
message?  :(

-- 
tejun



Re: [lxc-devel] cgroup management daemon

2013-12-03 Thread Serge Hallyn
Quoting Tejun Heo (t...@kernel.org):
 Hello, Serge.
 
 On Tue, Dec 03, 2013 at 06:03:44PM -0600, Serge Hallyn wrote:
   As I communicated multiple times before, delegating write access to
   control knobs to an untrusted domain has always been a security risk and
   is likely to continue to remain so.  Also, organizationally, a
  
  Then that will need to be addressed with per-key blacklisting and/or
  per-value filtering in the manager.
  
  Which is my way of saying:  can we please have a list of the security
  issues so we can handle them?  :)  (I've asked several times before
  but haven't seen a list or anyone offering to make one)
 
 Unfortunately, for now, please consider everything blacklisted.  Yes,
 it is true that some knobs should be mostly safe but given the level
 of changes we're going through and the difficulty of properly auditing
 anything for delegation to untrusted environment, I don't feel
 comfortable at all about delegating through chown.  It is an
 accidental feature which happened just because it uses filesystem as
  its interface and it is nowhere near the top of the todo list.  It
 has never worked properly and won't in any foreseeable future.
 
   cgroup's control knobs belong to the parent not the cgroup itself.
  
  After thinking awhile I think this makes perfect sense.  I haven't
  implemented set_value yet, and when I do I think I'll implement this
  guideline.
 
 I'm kinda confused here.  You say *everything* is gonna go through the
  manager and then talk about chowning directories.  Don't the two
 conflict?

No.  I expect the user - except in the google case - to either have
access to no cgroupfs mounts, or readonly mounts.

-serge



Re: [lxc-devel] cgroup management daemon

2013-12-03 Thread Tim Hockin
If this daemon works as advertised, we will explore moving all write
traffic to use it.  I still have concerns that this can't handle read
traffic at the scale we need.

Tejun,  I am not sure why chown came back into the conversation.  This
is a replacement for that.

On Tue, Dec 3, 2013 at 6:31 PM, Serge Hallyn serge.hal...@ubuntu.com wrote:
 Quoting Tejun Heo (t...@kernel.org):
 Hello, Serge.

 On Tue, Dec 03, 2013 at 06:03:44PM -0600, Serge Hallyn wrote:
   As I communicated multiple times before, delegating write access to
   control knobs to an untrusted domain has always been a security risk and
   is likely to continue to remain so.  Also, organizationally, a
 
  Then that will need to be addressed with per-key blacklisting and/or
  per-value filtering in the manager.
 
  Which is my way of saying:  can we please have a list of the security
  issues so we can handle them?  :)  (I've asked several times before
  but haven't seen a list or anyone offering to make one)

 Unfortunately, for now, please consider everything blacklisted.  Yes,
 it is true that some knobs should be mostly safe but given the level
 of changes we're going through and the difficulty of properly auditing
 anything for delegation to untrusted environment, I don't feel
 comfortable at all about delegating through chown.  It is an
 accidental feature which happened just because it uses filesystem as
  its interface and it is nowhere near the top of the todo list.  It
 has never worked properly and won't in any foreseeable future.

   cgroup's control knobs belong to the parent not the cgroup itself.
 
  After thinking awhile I think this makes perfect sense.  I haven't
  implemented set_value yet, and when I do I think I'll implement this
  guideline.

 I'm kinda confused here.  You say *everything* is gonna go through the
  manager and then talk about chowning directories.  Don't the two
 conflict?

 No.  I expect the user - except in the google case - to either have
 access to no cgroupfs mounts, or readonly mounts.

 -serge



Re: [lxc-devel] cgroup management daemon

2013-12-03 Thread Victor Marmol
I thought we were going to use chown in the initial version to enforce the
ownership/permissions on the hierarchy. Only the cgroup manager has access
to the hierarchy, but it tries to access the hierarchy as the user that
 sent the request. It was only meant to be a 'for now' solution while the
real one rolls out. It may also have gotten thrown out since last I heard :)


On Tue, Dec 3, 2013 at 8:53 PM, Tim Hockin thoc...@google.com wrote:

 If this daemon works as advertised, we will explore moving all write
 traffic to use it.  I still have concerns that this can't handle read
 traffic at the scale we need.

 Tejun,  I am not sure why chown came back into the conversation.  This
 is a replacement for that.

 On Tue, Dec 3, 2013 at 6:31 PM, Serge Hallyn serge.hal...@ubuntu.com
 wrote:
  Quoting Tejun Heo (t...@kernel.org):
  Hello, Serge.
 
  On Tue, Dec 03, 2013 at 06:03:44PM -0600, Serge Hallyn wrote:
As I communicated multiple times before, delegating write access to
 control knobs to an untrusted domain has always been a security risk
 and
is likely to continue to remain so.  Also, organizationally, a
  
   Then that will need to be addressed with per-key blacklisting and/or
   per-value filtering in the manager.
  
   Which is my way of saying:  can we please have a list of the security
   issues so we can handle them?  :)  (I've asked several times before
   but haven't seen a list or anyone offering to make one)
 
  Unfortunately, for now, please consider everything blacklisted.  Yes,
  it is true that some knobs should be mostly safe but given the level
  of changes we're going through and the difficulty of properly auditing
  anything for delegation to untrusted environment, I don't feel
  comfortable at all about delegating through chown.  It is an
  accidental feature which happened just because it uses filesystem as
   its interface and it is nowhere near the top of the todo list.  It
  has never worked properly and won't in any foreseeable future.
 
cgroup's control knobs belong to the parent not the cgroup itself.
  
   After thinking awhile I think this makes perfect sense.  I haven't
   implemented set_value yet, and when I do I think I'll implement this
   guideline.
 
  I'm kinda confused here.  You say *everything* is gonna go through the
  manager and then talk about chowning directories.  Don't the two
  conflict?
 
  No.  I expect the user - except in the google case - to either have
  access to no cgroupfs mounts, or readonly mounts.
 
  -serge



Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Tim Hockin
At the start of this discussion, some months ago, we offered to
co-devel this with Lennart et al.  They did not seem keen on the idea.

If they have an established DBUS protocol spec, we should consider
adopting it instead of a new one, but we CAN'T just play follow the
leader and do whatever they do, change whenever they feel like
changing.

It would be best if we could get a common DBUS api specc'ed and all
agree to it.  Serge, do you feel up to that?

On Mon, Nov 25, 2013 at 6:18 PM, Michael H. Warfield m...@wittsend.com wrote:
 Serge...

 You have no idea how much I dread mentioning this (well, after
 LinuxPlumbers, maybe you can) but...  You do realize that some of this
 is EXACTLY what the systemd crowd was talking about there in NOLA back
 then.  I sat in those session grinding my teeth and listening to
 comments from some others around me about when systemd might subsume
 bash or even vi or quake.

 Somehow, you and others have tagged me as a systemd expert but I am
 far from it and even you noted that Lennart and I were on the edge of a
 physical discussion when I made some off the cuff remarks there about
 systemd design during my talk.  I personally rank systemd in the same
 category as NetworkMangler (err, NetworkManager) in its propensity for
 committing inexplicable random acts of terrorism and changing its
 behavior from release to release to release.  I'm not a fan and I'm not
 an expert, but I have to be involved with it and watch the damned thing
 like a trapped rat, like it or not.

 Like it or not, we can not go off on divergent designs.  As much as they
 have delusions of taking over the Linux world, they are still going to
 be a major factor and this sort of thing needs to be coordinated.  We
 are going to need exactly what you are proposing whether we have systemd
 in play or not.  IF we CAN kick it to the curb, when we need to, we
 still need to know how to without tearing shit up and breaking shit that
 thinks it's there.  Ideally, it shouldn't matter if systemd were in
 play or not.

 All I ask is that we not get too far off track that we have a major
 architectural divergence here.  The risk is there.

 Mike


 On Mon, 2013-11-25 at 22:43 +, Serge E. Hallyn wrote:
 Hi,

 as i've mentioned several times, I want to write a standalone cgroup
 management daemon.  Basic requirements are that it be a standalone
 program; that a single instance running on the host be usable from
 containers nested at any depth; that it not allow escaping one's
 assigned limits; that it not allow subjugating tasks which do not
 belong to you; and that, within your limits, you be able to parcel
 those limits to your tasks as you like.

 Additionally, Tejun has specified that we do not want users to be
 too closely tied to the cgroupfs implementation.  Therefore
 commands will be just a hair more general than specifying cgroupfs
 filenames and values.  I may go so far as to avoid specifying
 specific controllers, as AFAIK there should be no redundancy in
 features.  On the other hand, I don't want to get too general.
 So I'm basing the API loosely on the lmctfy command line API.

 One of the driving goals is to enable nested lxc as simply and safely as
 possible.  If this project is a success, then a large chunk of code can
 be removed from lxc.  I'm considering this project a part of the larger
 lxc project, but given how central it is to systems management that
 doesn't mean that I'll consider anyone else's needs as less important
 than our own.

 This document consists of two parts.  The first describes how I
 intend the daemon (cgmanager) to be structured and how it will
 enforce the safety requirements.  The second describes the commands
 which clients will be able to send to the manager.  The list of
 controller keys which can be set is very incomplete at this point,
 serving mainly to show the approach I was thinking of taking.

 Summary

 Each 'host' (identified by a separate instance of the linux kernel) will
 have exactly one running daemon to manage control groups.  This daemon
 will answer cgroup management requests over a dbus socket, located at
 /sys/fs/cgroup/manager.  This socket can be bind-mounted into various
 containers, so that one daemon can support the whole system.

 Programs will be able to make cgroup requests using dbus calls, or
 indirectly by linking against lmctfy which will be modified to use the
 dbus calls if available.

 Outline:
   . A single manager, cgmanager, is started on the host, very early
 during boot.  It has very few dependencies, and requires only
 /proc, /run, and /sys to be mounted, with /etc ro.  It will mount
 the cgroup hierarchies in a private namespace and set defaults
 (clone_children, use_hierarchy, sane_behavior, release_agent?) It
 will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
   . A client (requestor 'r') can make cgroup requests over
 /sys/fs/cgroup/manager using dbus calls.  Detailed privilege
 

Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Tim Hockin
Thanks for this!  I think it helps a lot to discuss now, rather than
over nearly-done code.

On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn se...@hallyn.com wrote:
 Additionally, Tejun has specified that we do not want users to be
 too closely tied to the cgroupfs implementation.  Therefore
 commands will be just a hair more general than specifying cgroupfs
 filenames and values.  I may go so far as to avoid specifying
 specific controllers, as AFAIK there should be no redundancy in
 features.  On the other hand, I don't want to get too general.
 So I'm basing the API loosely on the lmctfy command line API.

I'm torn here.  While I agree in principle with Tejun, I am concerned
that this agent will always lag new kernel features or that the thin
abstraction you want to provide here does not easily accommodate some
of the more ... oddball features of one cgroup interface or another.

This agent is the very bottom of the stack, and should probably not do
much by way of abstraction.  I think I'd rather let something like
lmctfy provide the abstraction more holistically, and relegate this
agent to very simple plumbing and policy.  It could be as simple as
providing read/write/etc ops to specific control files.  It needs to
handle event_fd, too, I guess.  This has the nice side-effect of
always being current on kernel features :)

 Summary

 Each 'host' (identified by a separate instance of the linux kernel) will
 have exactly one running daemon to manage control groups.  This daemon
 will answer cgroup management requests over a dbus socket, located at
 /sys/fs/cgroup/manager.  This socket can be bind-mounted into various
 containers, so that one daemon can support the whole system.

 Programs will be able to make cgroup requests using dbus calls, or
 indirectly by linking against lmctfy which will be modified to use the
 dbus calls if available.

 Outline:
   . A single manager, cgmanager, is started on the host, very early
 during boot.  It has very few dependencies, and requires only
 /proc, /run, and /sys to be mounted, with /etc ro.  It will mount
 the cgroup hierarchies in a private namespace and set defaults
 (clone_children, use_hierarchy, sane_behavior, release_agent?) It
 will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).

Where does the config come from?  How do I specify which hierarchies I
want and where, and which flags?

   . A client (requestor 'r') can make cgroup requests over
 /sys/fs/cgroup/manager using dbus calls.  Detailed privilege
 requirements for r are listed below.
   . The client request will pertain to an existing or new cgroup A.  r's
 privilege over the cgroup must be checked.  r is said to have
 privilege over A if A is owned by r's uid, or if A's owner is mapped
 into r's user namespace, and r is root in that user namespace.

Problem with this definition.  Being owned-by is not the same as
has-root-in.  Specifically, I may choose to give you root in your own
namespace, but you sure as heck can not increase your own memory
limit.

   . The client request may pertain to a victim task v, which may be moved
 to a new cgroup.  In that case r's privilege over both the cgroup
 and v must be checked.  r is said to have privilege over v if v
 is mapped in r's pid namespace, v's uid is mapped into r's user ns,
 and r is root in its userns.  Or if r and v have the same uid
 and v is mapped in r's pid namespace.
   . r's credentials will be taken from the socket's peercred, ensuring that
 pid and uid are translated.
   . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
 translated global pid.  It will then read UID(v) from /proc/PID(v)/status,
 which is the global uid, and check /proc/PID(r)/uid_map to see whether
 UID is mapped there.
   . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
 the kernel translate it for the reader.  Only 'move task v to cgroup
 A' will require a SCM_CREDENTIAL to be sent.
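As a sketch, the client side of that credential pass might look like the
following (assuming a connected unix socket to the manager, which must
set SO_PASSCRED on its end; passing a pid other than your own requires
CAP_SYS_ADMIN):

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Send PID(v) as SCM_CREDENTIALS; the kernel translates the pid
     * into the receiver's (cgmanager's) pid namespace. */
    int send_victim_pid(int sock, pid_t v)
    {
        struct ucred cred = { .pid = v, .uid = getuid(), .gid = getgid() };
        char dummy = 'v', cbuf[CMSG_SPACE(sizeof(cred))];
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_CREDENTIALS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(cred));
        memcpy(CMSG_DATA(cmsg), &cred, sizeof(cred));
        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }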

 Privilege requirements by action:
 * Requestor of an action (r) over a socket may only make
   changes to cgroups over which it has privilege.
 * Requestors may be limited to a certain #/depth of cgroups
   (to limit memory usage) - DEFER?
 * Cgroup hierarchy is responsible for resource limits
 * A requestor must either be uid 0 in its userns with victim mapped
   into its userns, or the same uid and in same/ancestor pidns as the
   victim
 * If r requests creation of cgroup '/x', /x will be interpreted
   as relative to r's cgroup.  r cannot make changes to cgroups not
   under its own current cgroup.

Does this imply that r in a lower level (farther from root) of the
hierarchy can not make requests of higher levels of the hierarchy
(closer to root), even though they have permissions as per the
definition of privilege?

How do we reconcile this pseudo-virtualization with /proc/self/cgroup
which DOES expose raw paths?

   

Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Serge E. Hallyn
Quoting Tim Hockin (thoc...@google.com):
 What are the requirements/goals around performance and concurrency?
 Do you expect this to be a single-threaded thing, or can we handle
 some number of concurrent operations?  Do you expect to use threads or
 processes?

The cgmanager should be pretty dumb, so I would expect it to be
quite fast.  I don't have any specific perf goals though.  If you
have requirements I'm very interested to hear them.  I should be
able to tell pretty soon how far short I fall.

By default I'd expect to run with a single thread, but I don't
imagine one thread can serve a busy 1024-cpu system very well.
Unless you have guidance right now, I think I'd like to get
started with the basic functionality and see how it measures
up to your requirements.  I should add perf counters from the
start so we can figure out where bottlenecks (if any) are and
how to handle them.

Otherwise I could start out with a basic numcpus/10 threadpool
and have the main thread do socket i/o and parcel access
verification and vfs work out to the threadpool, but I'd rather
first know where the problems lie.

 Can you talk about logging - what and where?

When started under upstart, anything we print out goes to
/var/log/upstart/cgmanager.log.  Would be nice to keep it
that simple.  We could log requests by r to do something
it is not allowed to do, but it seems to me the failed
attempts cause no harm, while overflowing the logs can.

Did you have anything in mind?  Did you want logging to help
detect certain conditions for system optimization, or just
for failure notices and security violations?

 How will we handle event_fd?  Pass a file-descriptor back to the caller?

The only thing currently supporting eventfd is memory threshold,
right?  I haven't tested whether this will work or not, but
ideally the caller would open the eventfd fd, pass it, the
cgroup name, controller file to be watched, and the args to
cgmanager;  cgmanager confirms read access, opens the
controller fd, makes the request over cgroup.event_control,
then passes the controller fd back to the caller and closes
its own copy.
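
Concretely, I'd expect something like the following on the cgmanager
side (v1 memcg interface; paths illustrative and error handling elided):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    /* Arm a memory-threshold notification via cgroup.event_control and
     * return the eventfd the caller will read. */
    int arm_mem_threshold(const char *cgdir, long long threshold)
    {
        char path[256], line[64];
        int efd = eventfd(0, 0);
        snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", cgdir);
        int ufd = open(path, O_RDONLY);
        snprintf(path, sizeof(path), "%s/cgroup.event_control", cgdir);
        int cfd = open(path, O_WRONLY);
        /* registration format: "<event_fd> <control_fd> <args>" */
        snprintf(line, sizeof(line), "%d %d %lld", efd, ufd, threshold);
        write(cfd, line, strlen(line));
        close(cfd);    /* registration persists after close... */
        close(ufd);
        return efd;    /* ...and read(efd, &u64, 8) blocks until crossed */
    }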

I'm also not sure whether the cgroup interface is going to be
offering a new feature to replace eventfd, since it wants
people to stop using cgroupfs...  Tejun?

 That's all I can come up with for now.



Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Victor Marmol
On Tue, Nov 26, 2013 at 8:12 AM, Serge E. Hallyn se...@hallyn.com wrote:

 Quoting Tim Hockin (thoc...@google.com):
  What are the requirements/goals around performance and concurrency?
  Do you expect this to be a single-threaded thing, or can we handle
  some number of concurrent operations?  Do you expect to use threads or
  processes?

 The cgmanager should be pretty dumb, so I would expect it to be
 quite fast.  I don't have any specific perf goals though.  If you
 have requirements I'm very interested to hear them.  I should be
 able to tell pretty soon how far short I fall.

 By default I'd expect to run with a single thread, but I don't
 imagine one thread can serve a busy 1024-cpu system very well.
 Unless you have guidance right now, I think I'd like to get
 started with the basic functionality and see how it measures
 up to your requirements.  I should add perf counters from the
 start so we can figure out where bottlenecks (if any) are and
 how to handle them.

 Otherwise I could start out with a basic numcpus/10 threadpool
 and have the main thread do socket i/o and parcel access
 verification and vfs work out to the threadpool, but I'd rather
 first know where the problems lie.


From Rohit's talk at Linux Plumbers:

http://www.linuxplumbersconf.net/2013/ocw//system/presentations/1239/original/lmctfy%20(1).pdf

The goal is O(1000) reads and O(100) writes per second.



  Can you talk about logging - what and where?

 When started under upstart, anything we print out goes to
 /var/log/upstart/cgmanager.log.  Would be nice to keep it
 that simple.  We could log requests by r to do something
 it is not allowed to do, but it seems to me the failed
 attempts cause no harm, while the potential for overflowing
 logs can.

 Did you have anything in mind?  Did you want logging to help
 detect certain conditions for system optimization, or just
 for failure notices and security violations?

  How will we handle event_fd?  Pass a file-descriptor back to the caller?

 The only thing currently supporting eventfd is memory threshold,
 right?  I haven't tested whether this will work or not, but
 ideally the caller would open the eventfd fd, pass it, the
 cgroup name, controller file to be watched, and the args to
 cgmanager;  cgmanager confirms read access, opens the
 controller fd, makes the request over cgroup.event_control,
 then passes the controller fd back to the caller and closes
 its own copy.

 I'm also not sure whether the cgroup interface is going to be
 offering a new feature to replace eventfd, since it wants
 people to stop using cgroupfs...  Tejun?


From my discussions with Tejun, he wanted to move to using inotify so it
may still be an fd we pass around.


  That's all I can come up with for now.



Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Serge E. Hallyn
Quoting Tim Hockin (thoc...@google.com):
 At the start of this discussion, some months ago, we offered to
 co-devel this with Lennart et al.  They did not seem keen on the idea.
 
 If they have an established DBUS protocol spec,

see http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/
and http://man7.org/linux/man-pages/man5/systemd.cgroup.5.html

  we should consider
 adopting it instead of a new one, but we CAN'T just play follow the
 leader and do whatever they do, change whenever they feel like
 changing.

Right.  And if we suspect that the APIs will always be at least
subtly different, then keeping them obviously visually different
seems to have some benefit, i.e.

    systemctl set-property httpd.service CPUShares=500 MemoryLimit=500M
vs
    dbus-send cgmanager set-value http.server cpushares:500 \
        memorylimit:500M swaplimit:1G

rather than have admins try to remember "now why did that not work
here... oh yeah, MemoryLimit over here should be Memorylimit" or whatever.

Then again if lmctfy is the layer which admins will use, then it
doesn't matter as much.

 It would be best if we could get a common DBUS api specc'ed and all
 agree to it.  Serge, do you feel up to that?

Not sure what you mean - I'll certainly send the API to these lists as
the code is developed, and will accept all feedback that I get.  My only
requirements are that the requirements I've listed in the document
be feasible, and be feasible back to, say, 3.2 kernels.  So that is
why we must send an scm-cred for the pid to move into a cgroup.  (With
3.12 we may have alternatives, accepting a vpid as a simple dbus message
and setns()ing into the requestor's pidns to echo the pid into the
cgroup.tasks file.)
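
A rough sketch of that setns() alternative (needs CAP_SYS_ADMIN; the
pidns change only takes effect for later children, hence the fork;
error handling and restoring the daemon's own pidns are elided):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Enter the requestor's pid namespace so that vpid, written to the
     * tasks file, is resolved in that namespace. */
    int classify_vpid(pid_t requestor, pid_t vpid, const char *tasks_path)
    {
        char nspath[64];
        int status, nsfd;

        snprintf(nspath, sizeof(nspath), "/proc/%d/ns/pid", requestor);
        nsfd = open(nspath, O_RDONLY);
        if (nsfd < 0 || setns(nsfd, CLONE_NEWPID) < 0)
            return -1;
        close(nsfd);
        if (fork() == 0) {    /* child is created in requestor's pidns */
            FILE *f = fopen(tasks_path, "w");
            if (!f || fprintf(f, "%d\n", vpid) < 0)
                _exit(1);
            fclose(f);
            _exit(0);
        }
        wait(&status);
        return WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -1;
    }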

-serge



Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Victor Marmol
On Tue, Nov 26, 2013 at 8:41 AM, Serge E. Hallyn se...@hallyn.com wrote:

 Quoting Victor Marmol (vmar...@google.com):
  On Tue, Nov 26, 2013 at 8:12 AM, Serge E. Hallyn se...@hallyn.com
 wrote:
 
   Quoting Tim Hockin (thoc...@google.com):
What are the requirements/goals around performance and concurrency?
Do you expect this to be a single-threaded thing, or can we handle
 some number of concurrent operations?  Do you expect to use threads or
 processes?
  
   The cgmanager should be pretty dumb, so I would expect it to be
   quite fast.  I don't have any specific perf goals though.  If you
   have requirements I'm very interested to hear them.  I should be
   able to tell pretty soon how far short I fall.
  
   By default I'd expect to run with a single thread, but I don't
   imagine one thread can serve a busy 1024-cpu system very well.
   Unless you have guidance right now, I think I'd like to get
   started with the basic functionality and see how it measures
   up to your requirements.  I should add perf counters from the
   start so we can figure out where bottlenecks (if any) are and
   how to handle them.
  
   Otherwise I could start out with a basic numcpus/10 threadpool
   and have the main thread do socket i/o and parcel access
   verification and vfs work out to the threadpool, but I'd rather
   first know where the problems lie.
  
 
  From Rohit's talk at Linux plumbers:
 
 
 http://www.linuxplumbersconf.net/2013/ocw//system/presentations/1239/original/lmctfy%20(1).pdf
 
  The goal is O(1000) reads and O(100) writes per second.

 Cool, thanks.  I can try and get a sense next week of how far off the
 mark I am for reads.

Can you talk about logging - what and where?
  
   When started under upstart, anything we print out goes to
   /var/log/upstart/cgmanager.log.  Would be nice to keep it
   that simple.  We could log requests by r to do something
   it is not allowed to do, but it seems to me the failed
   attempts cause no harm, while the potential for overflowing
   logs can.
  
   Did you have anything in mind?  Did you want logging to help
   detect certain conditions for system optimization, or just
   for failure notices and security violations?
  
How will we handle event_fd?  Pass a file-descriptor back to the
 caller?
  
   The only thing currently supporting eventfd is memory threshold,
   right?  I haven't tested whether this will work or not, but
   ideally the caller would open the eventfd fd, pass it, the
   cgroup name, controller file to be watched, and the args to
   cgmanager;  cgmanager confirms read access, opens the
   controller fd, makes the request over cgroup.event_control,
   then passes the controller fd back to the caller and closes
   its own copy.
  
   I'm also not sure whether the cgroup interface is going to be
   offering a new feature to replace eventfd, since it wants
   people to stop using cgroupfs...  Tejun?
  
 
  From my discussions with Tejun, he wanted to move to using inotify so it
  may still be an fd we pass around.

 Hm, would that just be inotify on the memory.max_usage_in_bytes
  file, or inotify on a specific fd you've created which is
 associated with any threshold you specify?  The former seems
 less ideal.


Tejun can comment more, but I think it is still TBD.


 -serge



Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Tim Hockin
On Mon, Nov 25, 2013 at 9:47 PM, Serge E. Hallyn se...@hallyn.com wrote:
 Quoting Tim Hockin (thoc...@google.com):
 Thanks for this!  I think it helps a lot to discuss now, rather than
 over nearly-done code.

 On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn se...@hallyn.com wrote:
  Additionally, Tejun has specified that we do not want users to be
  too closely tied to the cgroupfs implementation.  Therefore
  commands will be just a hair more general than specifying cgroupfs
  filenames and values.  I may go so far as to avoid specifying
  specific controllers, as AFAIK there should be no redundancy in
  features.  On the other hand, I don't want to get too general.
  So I'm basing the API loosely on the lmctfy command line API.

 I'm torn here.  While I agree in principle with Tejun, I am concerned
 that this agent will always lag new kernel features or that the thin
 abstraction you want to provide here does not easily accommodate some
 of the more ... oddball features of one cgroup interface or another.

 This agent is the very bottom of the stack, and should probably not do
 much by way of abstraction.  I think I'd rather let something like
 lmctfy provide the abstraction more holistically, and relegate this

 If lmctfy is an abstraction layer, that should keep Tejun happy, and
 it could keep me out of the resource naming game, which makes me happy :)

 agent to very simple plumbing and policy.  It could be as simple as
 providing read/write/etc ops to specific control files.  It needs to
 handle event_fd, too, I guess.  This has the nice side-effect of
 always being current on kernel features :)

  Summary
 
  Each 'host' (identified by a separate instance of the linux kernel) will
  have exactly one running daemon to manage control groups.  This daemon
  will answer cgroup management requests over a dbus socket, located at
  /sys/fs/cgroup/manager.  This socket can be bind-mounted into various
  containers, so that one daemon can support the whole system.
 
  Programs will be able to make cgroup requests using dbus calls, or
  indirectly by linking against lmctfy which will be modified to use the
  dbus calls if available.
 
  Outline:
. A single manager, cgmanager, is started on the host, very early
  during boot.  It has very few dependencies, and requires only
  /proc, /run, and /sys to be mounted, with /etc ro.  It will mount
  the cgroup hierarchies in a private namespace and set defaults
  (clone_children, use_hierarchy, sane_behavior, release_agent?) It
  will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).

 Where does the config come from?  How do I specify which hierarchies I
 want and where, and which flags?

 That'll have to be in a file in /etc (which can be mounted readonly).
 There should be no surprises there so I've not thought about the format.

. A client (requestor 'r') can make cgroup requests over
  /sys/fs/cgroup/manager using dbus calls.  Detailed privilege
  requirements for r are listed below.
 . The client request will pertain to an existing or new cgroup A.  r's
  privilege over the cgroup must be checked.  r is said to have
  privilege over A if A is owned by r's uid, or if A's owner is mapped
  into r's user namespace, and r is root in that user namespace.

 Problem with this definition.  Being owned-by is not the same as
 has-root-in.  Specifically, I may choose to give you root in your own
 namespace, but you sure as heck can not increase your own memory
 limit.

 1. If you don't want me to change the value at all, then just don't map
 A's owner into the namespace.  I'm uid 10 which is root in my namespace,
 but I only have privilege over other uids mapped into my namespace.

I think I understand this, but it is subtle.  Maybe some examples would help?

 2. I've considered never allowing changes to your own cgroup.  So if you're
 in /a/b, you can create /a/b/c and modify c's settings, but you can't modify
 b's.  OTOH, that isn't strictly necessary - if we did allow it, then you
 could simply clamp /a/b's memory to what you want, and stick me in /a/b/c,
 so I can't escape the memory limit you wanted.

This is different from what we do internally, but it's an interesting
semantic.  I'm wary of how much we want to make this API about
enforcement of policy vs simple enactment.  In other words, semantics
that diverge from UNIX ownership might be more complicated to
understand than they are worth.

 3. I've not considered having the daemon track resource limits - i.e. creating
 a cgroup and saying give it 100M swap, and if it asks, let it increase that
 to 200M.  I'd prefer that be done incidentally through (1) and (2).  Do you
 feel that would be insufficient?

I think this is a higher-level issue that should not be addressed here.

 Or maybe your question is something different and I'm missing it?

My point was that I, as machine admin, create a memory cgroup of 100
MB for you and put you in it.   I also 

Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Tim Hockin
On Tue, Nov 26, 2013 at 8:12 AM, Serge E. Hallyn se...@hallyn.com wrote:
 Quoting Tim Hockin (thoc...@google.com):
 What are the requirements/goals around performance and concurrency?
 Do you expect this to be a single-threaded thing, or can we handle
 some number of concurrent operations?  Do you expect to use threads or
 processes?

 The cgmanager should be pretty dumb, so I would expect it to be
 quite fast.  I don't have any specific perf goals though.  If you
 have requirements I'm very interested to hear them.  I should be
 able to tell pretty soon how far short I fall.

If we're limiting this to write traffic only, I think our perf goals
are fairly relaxed.  As long as you don't develop it to preclude
threading or multi-processing, we can adapt later.  I would like to
see at least a mention to this effect.  We also need to beware DoS
(accidental or otherwise) - perhaps we should force round-robin
service of pending-requests, or something.

 By default I'd expect to run with a single thread, but I don't
 imagine one thread can serve a busy 1024-cpu system very well.
 Unless you have guidance right now, I think I'd like to get
 started with the basic functionality and see how it measures
 up to your requirements.  I should add perf counters from the
 start so we can figure out where bottlenecks (if any) are and
 how to handle them.

 Otherwise I could start out with a basic numcpus/10 threadpool
 and have the main thread do socket i/o and parcel access
 verification and vfs work out to the threadpool, but I'd rather
 first know where the problems lie.

Agree.  Correct first, then fast :)

 Can you talk about logging - what and where?

 When started under upstart, anything we print out goes to
 /var/log/upstart/cgmanager.log.  Would be nice to keep it
 that simple.  We could log requests by r to do something
 it is not allowed to do, but it seems to me the failed
 attempts cause no harm, while the potential for overflowing
 logs can.

I agree that we don't want to overflow logs.

 Did you have anything in mind?  Did you want logging to help
 detect certain conditions for system optimization, or just
 for failure notices and security violations?

When something goes amiss, we have to try to figure out what happened -
how far did a request get?  Logging every change is probably
important.  Logging failures could be downsampled and rate-limited,
something like 1 failure log per second or something.
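
A downsampling scheme like that is tiny to implement.  A minimal sketch
(illustrative only, not cgmanager code; names invented) of one failure
log per second, with a summary of how many were suppressed:

    /* Log at most one failure per second; report how many similar
     * failures were suppressed in between. */
    #include <stdio.h>
    #include <time.h>

    static void log_failure_ratelimited(const char *msg)
    {
        static time_t last;            /* second we last logged in */
        static unsigned long dropped;  /* failures suppressed since then */
        time_t now = time(NULL);

        if (now == last) {
            dropped++;
            return;
        }
        if (dropped)
            fprintf(stderr, "cgmanager: (%lu similar failures suppressed)\n",
                    dropped);
        fprintf(stderr, "cgmanager: %s\n", msg);
        last = now;
        dropped = 0;
    }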

 How will we handle event_fd?  Pass a file-descriptor back to the caller?

 The only thing currently supporting eventfd is memory threshold,
 right?  I haven't tested whether this will work or not, but
 ideally the caller would open the eventfd fd, pass it, the
 cgroup name, controller file to be watched, and the args to
 cgmanager;  cgmanager confirms read access, opens the
 controller fd, makes the request over cgroup.event_control,
 then passes the controller fd back to the caller and closes
 its own copy.

 I'm also not sure whether the cgroup interface is going to be
 offering a new feature to replace eventfd, since it wants
 people to stop using cgroupfs...  Tejun?

 That's all I can come up with for now.
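
For what it's worth, the fd hand-back described above (open the
controller fd, pass it to the caller, close our own copy) is ordinary
SCM_RIGHTS fd passing over the unix socket.  A minimal sketch of the
daemon's sending side, error handling trimmed:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send an open file descriptor to the peer of a connected unix
     * socket.  The kernel installs a duplicate in the receiver, so
     * the sender can close its own copy afterwards. */
    static int send_fd(int sock, int fd)
    {
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = {
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = cbuf,
            .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }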



Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Tim Hockin
On Tue, Nov 26, 2013 at 8:37 AM, Serge E. Hallyn se...@hallyn.com wrote:
 Quoting Tim Hockin (thoc...@google.com):
 At the start of this discussion, some months ago, we offered to
 co-devel this with Lennart et al.  They did not seem keen on the idea.

 If they have an established DBUS protocol spec,

 see http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/
 and http://man7.org/linux/man-pages/man5/systemd.cgroup.5.html

  we should consider
 adopting it instead of a new one, but we CAN'T just play follow the
 leader and do whatever they do, change whenever they feel like
 changing.

 Right.  And if we suspect that the APIs will always be at least
 subtly different, then keeping them obviously visually different
 seems to have some benefit.  (i.e.
 systemctl set-property httpd.service CPUShares=500 MemoryLimit=500M
 vs
 dbus-send cgmanager set-value httpd.service cpushares:500 
 memorylimit:500M swaplimit:1G
 ) rather than have admins try to remember now why did that not work
 here, oh yeah, MemoryLimit over here should be Memorylimit or whatever.

 Then again if lmctfy is the layer which admins will use, then it
 doesn't matter as much.

 It would be best if we could get a common DBUS api specc'ed and all
 agree to it.  Serge, do you feel up to that?

 Not sure what you mean - I'll certainly send the API to these lists as

What I meant was whether it is worth opening a discussion with the
systemd folks on a common lowest-level DBUS interface.  But it looks
like their work is already a bit higher level, so it's probably moot.

 the code is developed, and will accept all feedback that I get.  My only
 requirements are that the requirements I've listed in the document
 be feasible, and be feasible back to, say, 3.2 kernels.  So that is
 why we must send an scm-cred for the pid to move into a cgroup.  (With
 3.12 we may have alternatives, accepting a vpid as a simple dbus message
 and setns()ing into the requestor's pidns to echo the pid into the
 cgroup.tasks file.)
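
The 3.12-era alternative sketched in that parenthetical might look
roughly like this (purely illustrative; setns() on a pid namespace only
takes effect for children, which the fork() below relies on, and a real
daemon would also need to restore its own namespace state):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Enter the requestor's pid namespace so that a pid it sent as a
     * plain integer (a "vpid") resolves correctly when written to the
     * cgroup tasks file. */
    static int move_vpid_to_cgroup(pid_t requestor, pid_t vpid,
                                   const char *tasks_path)
    {
        char nspath[64];
        int nsfd, status;
        pid_t child;

        snprintf(nspath, sizeof(nspath), "/proc/%d/ns/pid", requestor);
        nsfd = open(nspath, O_RDONLY);
        if (nsfd < 0)
            return -1;
        if (setns(nsfd, CLONE_NEWPID) < 0) {  /* affects children only */
            close(nsfd);
            return -1;
        }
        close(nsfd);
        child = fork();              /* child runs in r's pid namespace */
        if (child == 0) {
            FILE *f = fopen(tasks_path, "w");
            if (!f)
                _exit(1);
            fprintf(f, "%d\n", vpid);  /* resolved in r's pidns */
            _exit(fclose(f) == 0 ? 0 : 1);
        }
        if (child < 0 || waitpid(child, &status, 0) < 0)
            return -1;
        return WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -1;
    }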

 -serge



Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Serge E. Hallyn
Quoting Tim Hockin (thoc...@google.com):
 On Mon, Nov 25, 2013 at 9:47 PM, Serge E. Hallyn se...@hallyn.com wrote:
  Quoting Tim Hockin (thoc...@google.com):
...
 . A client (requestor 'r') can make cgroup requests over
   /sys/fs/cgroup/manager using dbus calls.  Detailed privilege
   requirements for r are listed below.
 . The client request will pertain to an existing or new cgroup A.  r's
   privilege over the cgroup must be checked.  r is said to have
   privilege over A if A is owned by r's uid, or if A's owner is mapped
   into r's user namespace, and r is root in that user namespace.
 
  Problem with this definition.  Being owned-by is not the same as
  has-root-in.  Specifically, I may choose to give you root in your own
  namespace, but you sure as heck can not increase your own memory
  limit.
 
  1. If you don't want me to change the value at all, then just don't map
  A's owner into the namespace.  I'm uid 100000, which is root in my namespace,
  but I only have privilege over other uids mapped into my namespace.
 
 I think I understand this, but it is subtle.  Maybe some examples would help?

When you create a user namespace, at first it is empty, and you are 'nobody'
(-1).  Then magically some uids from the host, say 100000-101999, are mapped
into your namespace, to uids 0-1999.

Now assume you're uid 0 inside that namespace.  You have privilege over your
uids, 0-1999, which are 100000-101999 on the host.

If cgroup file A is owned by host uid 0, then the owner is not mapped into
the user namespace.  uid 0 inside the namespace only gets the world access
rights to that file.

If cgroup file A is owned by host uid 100100, then uid 0 in the
namespace has access to that file by virtue of being root, and uid 100
in the namespace (100100 on the host) has access to the file by virtue
of being the owner.
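
In code, the "is this owner mapped into r's user namespace" test is just
a scan of r's uid_map.  A minimal sketch (not cgmanager code; helper
name invented), using the /proc format of "id-inside-ns id-outside-ns
length":

    #include <stdbool.h>
    #include <stdio.h>
    #include <sys/types.h>

    /* Return true if host_uid appears in the requestor's uid_map,
     * i.e. is visible inside the requestor's user namespace. */
    static bool uid_mapped_into_userns(pid_t requestor, uid_t host_uid)
    {
        char path[64];
        unsigned long inside, outside, count;
        bool found = false;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/uid_map", requestor);
        f = fopen(path, "r");
        if (!f)
            return false;
        while (fscanf(f, "%lu %lu %lu", &inside, &outside, &count) == 3) {
            if (host_uid >= outside && host_uid < outside + count) {
                found = true;
                break;
            }
        }
        fclose(f);
        return found;
    }

With the example above, host uid 100100 falls inside the 100000-101999
mapping, while host uid 0 does not.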

  2. I've considered never allowing changes to your own cgroup.  So if you're
  in /a/b, you can create /a/b/c and modify c's settings, but you can't modify
  b's.  OTOH, that isn't strictly necessary - if we did allow it, then you
  could simply clamp /a/b's memory to what you want, and stick me in /a/b/c,
  so I can't escape the memory limit you wanted.
 
 This is different from what we do internally, but it's an interesting
 semantic.  I'm wary of how much we want to make this API about
 enforcement of policy vs simple enactment.  In other words, semantics
 that diverge from UNIX ownership might be more complicated to
 understand than they are worth.

The semantics I gave are exactly the user namespace semantics.  If you're
not using a user namespace then they simply do not apply, and you are back
to strict UNIX ownership semantics that you want.  But allowing 'root' in
a user namespace to have privilege over uids, without having any privilege
outside its own namespace, must be honored for this to be usable by lxc.

Like I said, on the bright side, if you don't want to care about user
namespaces, then everything falls back to strict unix semantics - so if
you don't want to care, you don't have to care.

  3. I've not considered having the daemon track resource limits - i.e. creating
  a cgroup and saying give it 100M swap, and if it asks, let it increase that
  to 200M.  I'd prefer that be done incidentally through (1) and (2).  Do you
  feel that would be insufficient?
 
 I think this is a higher-level issue that should not be addressed here.
 
  Or maybe your question is something different and I'm missing it?
 
 My point was that I, as machine admin, create a memory cgroup of 100
 MB for you and put you in it.   I also give you root-in-namespace.
 You must not be able to change 100 MB to 200 MB.  From your (1) you
 are saying that system UID 0 owns the cgroup and is NOT mapped into
 your namespace.  Therefore your definition holds.  I think I can buy
 that.
 
 . The client request may pertain to a victim task v, which may be moved
   to a new cgroup.  In that case r's privilege over both the cgroup
   and v must be checked.  r is said to have privilege over v if v
   is mapped in r's pid namespace, v's uid is mapped into r's user ns,
   and r is root in its userns.  Or if r and v have the same uid
   and v is mapped in r's pid namespace.
 . r's credentials will be taken from socket's peercred, ensuring that
   pid and uid are translated.
 . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
   translated global pid.  It will then read UID(v) from /proc/PID(v)/status,
   which is the global uid, and check /proc/PID(r)/uid_map to see whether
   UID is mapped there.
 . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
   the kernel translate it for the reader.  Only 'move task v to cgroup
   A' will require a SCM_CREDENTIAL to be sent.
  
   Privilege requirements by action:
   * Requestor of an action (r) over a socket may only make
  changes to cgroups over which it has privilege.
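
The "move task v" request above relies on the kernel translating the pid
in transit; on the client side that is a sendmsg() carrying
SCM_CREDENTIALS ancillary data.  A rough sketch (illustrative; note the
kernel only lets a sender claim a pid other than its own if it has
privilege, e.g. CAP_SYS_ADMIN, and the receiver must set SO_PASSCRED):

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Client-side sketch: send PID(v) as SCM_CREDENTIALS so the daemon
     * receives it translated into its own pid namespace. */
    static int send_victim_pid(int sock, pid_t victim)
    {
        char dummy = 'p';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        char cbuf[CMSG_SPACE(sizeof(struct ucred))];
        struct msghdr msg = {
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = cbuf,
            .msg_controllen = sizeof(cbuf),
        };
        struct ucred cred = {
            .pid = victim,
            .uid = getuid(),
            .gid = getgid(),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_CREDENTIALS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(struct ucred));
        memcpy(CMSG_DATA(cmsg), &cred, sizeof(cred));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }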

Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Tim Hockin
lmctfy literally supports .. as a container name :)

On Tue, Nov 26, 2013 at 12:58 PM, Serge E. Hallyn se...@hallyn.com wrote:
 ...

Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Serge E. Hallyn
Quoting Tim Hockin (thoc...@google.com):
 lmctfy literally supports .. as a container name :)

So is ../.. ever used, or does no one ever do anything beyond ..?



Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Victor Marmol
I think most of our usecases have only wanted to know about the parent, but
I can see people wanting to go further. Would it be much different to
support both? I feel like it'll be simpler to support all if we go that
route.


On Tue, Nov 26, 2013 at 1:28 PM, Serge E. Hallyn se...@hallyn.com wrote:

 Quoting Tim Hockin (thoc...@google.com):
  lmctfy literally supports .. as a container name :)

 So is ../.. ever used, or does no one ever do anything beyond ..?



Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Tim Hockin
I see three models:

1) Don't virtualize the cgroup path.  This is what lmctfy does,
though we have discussed changing to:

2) Virtualize to an administrative root - I get to tell you where
your root is, and you can't see anything higher than that.

3) Virtualize to CWD root - you can never go up, just down.


#1 seems easy, but exposes a lot.  #3 is restrictive and fairly easy -
could we live with that?  #2 seems ideal, but it's not clear to me how
to actually implement it.

On Tue, Nov 26, 2013 at 1:31 PM, Victor Marmol vmar...@google.com wrote:
 I think most of our usecases have only wanted to know about the parent, but
 I can see people wanting to go further. Would it be much different to
 support both? I feel like it'll be simpler to support all if we go that
 route.


 On Tue, Nov 26, 2013 at 1:28 PM, Serge E. Hallyn se...@hallyn.com wrote:

 Quoting Tim Hockin (thoc...@google.com):
  lmctfy literally supports .. as a container name :)

  So is ../.. ever used, or does no one ever do anything beyond ..?





Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Serge E. Hallyn
I was planning on doing #3, but since you guys need to access .., my
plan is to have 'a/b' refer to $cwd/a/b while /a/b is the absolute
path, and allow read and eventfd but no write to any parent dirs.
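
A sketch of that resolution rule (hypothetical helper, not actual
cgmanager code): walk the requested path and measure how far it dips
below its starting point, so writes to ancestors can be refused while
reads and eventfd registration are still allowed:

    #include <stdio.h>
    #include <string.h>

    /* Compute how far a requested path dips below its starting point.
     * 'a/b' is taken relative to the requestor's own cgroup; a
     * negative result means the request touches an ancestor at some
     * point, which per the plan above is fine for read/eventfd but
     * not for write. */
    static int cgroup_request_min_depth(const char *req)
    {
        char tmp[4096];
        char *tok, *save;
        int depth = 0, min_depth = 0;

        snprintf(tmp, sizeof(tmp), "%s", req);
        for (tok = strtok_r(tmp, "/", &save); tok;
             tok = strtok_r(NULL, "/", &save)) {
            if (strcmp(tok, "..") == 0)
                depth--;
            else if (strcmp(tok, ".") != 0)
                depth++;
            if (depth < min_depth)
                min_depth = depth;
        }
        return min_depth;
    }

A write request would be refused whenever this returns a negative
value; read and eventfd requests on ancestors would pass.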

Quoting Tim Hockin (thoc...@google.com):
 ...



Re: [lxc-devel] cgroup management daemon

2013-11-25 Thread Marian Marinov
On 11/26/2013 12:43 AM, Serge E. Hallyn wrote:
 Hi,

 as i've mentioned several times, I want to write a standalone cgroup
 management daemon.  Basic requirements are that it be a standalone
 program; that a single instance running on the host be usable from
 containers nested at any depth; that it not allow escaping one's
 assigned limits; that it not allow subjugating tasks which do not
 belong to you; and that, within your limits, you be able to parcel
 those limits to your tasks as you like.

 Additionally, Tejun has specified that we do not want users to be
 too closely tied to the cgroupfs implementation.  Therefore
 commands will be just a hair more general than specifying cgroupfs
 filenames and values.  I may go so far as to avoid specifying
 specific controllers, as AFAIK there should be no redundancy in
 features.  On the other hand, I don't want to get too general.
 So I'm basing the API loosely on the lmctfy command line API.

 One of the driving goals is to enable nested lxc as simply and safely as
 possible.  If this project is a success, then a large chunk of code can
 be removed from lxc.  I'm considering this project a part of the larger
 lxc project, but given how central it is to systems management that
 doesn't mean that I'll consider anyone else's needs as less important
 than our own.

 This document consists of two parts.  The first describes how I
 intend the daemon (cgmanager) to be structured and how it will
 enforce the safety requirements.  The second describes the commands
 which clients will be able to send to the manager.  The list of
 controller keys which can be set is very incomplete at this point,
 serving mainly to show the approach I was thinking of taking.

 Summary

 Each 'host' (identified by a separate instance of the linux kernel) will
 have exactly one running daemon to manage control groups.  This daemon
 will answer cgroup management requests over a dbus socket, located at
 /sys/fs/cgroup/manager.  This socket can be bind-mounted into various
 containers, so that one daemon can support the whole system.

 Programs will be able to make cgroup requests using dbus calls, or
 indirectly by linking against lmctfy which will be modified to use the
 dbus calls if available.

 Outline:
. A single manager, cgmanager, is started on the host, very early
  during boot.  It has very few dependencies, and requires only
  /proc, /run, and /sys to be mounted, with /etc ro.  It will mount
  the cgroup hierarchies in a private namespace and set defaults
  (clone_children, use_hierarchy, sane_behavior, release_agent?) It
  will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
. A client (requestor 'r') can make cgroup requests over
  /sys/fs/cgroup/manager using dbus calls.  Detailed privilege
  requirements for r are listed below.
 . The client request will pertain to an existing or new cgroup A.  r's
  privilege over the cgroup must be checked.  r is said to have
  privilege over A if A is owned by r's uid, or if A's owner is mapped
  into r's user namespace, and r is root in that user namespace.
 . The client request may pertain to a victim task v, which may be moved
  to a new cgroup.  In that case r's privilege over both the cgroup
  and v must be checked.  r is said to have privilege over v if v
  is mapped in r's pid namespace, v's uid is mapped into r's user ns,
  and r is root in its userns.  Or if r and v have the same uid
  and v is mapped in r's pid namespace.
. r's credentials will be taken from socket's peercred, ensuring that
  pid and uid are translated.
. r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
  translated global pid.  It will then read UID(v) from /proc/PID(v)/status,
  which is the global uid, and check /proc/PID(r)/uid_map to see whether
  UID is mapped there.
. dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
  the kernel translate it for the reader.  Only 'move task v to cgroup
  A' will require a SCM_CREDENTIAL to be sent.
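
The socket setup in the first bullet is plain AF_UNIX fare; a minimal
sketch (helper name invented, and details such as the tmpfs mount
omitted), with SO_PASSCRED enabled since the later steps rely on
translated peer credentials:

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Bind and listen on the manager socket, e.g.
     * open_manager_socket("/sys/fs/cgroup/cgmanager"). */
    static int open_manager_socket(const char *path)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        int one = 1;
        int sock = socket(AF_UNIX, SOCK_STREAM, 0);

        if (sock < 0)
            return -1;
        snprintf(addr.sun_path, sizeof(addr.sun_path), "%s", path);
        unlink(path);   /* clear a stale socket from a prior run */
        if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(sock, SOMAXCONN) < 0 ||
            setsockopt(sock, SOL_SOCKET, SO_PASSCRED, &one,
                       sizeof(one)) < 0) {
            close(sock);
            return -1;
        }
        return sock;
    }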

 Privilege requirements by action:
  * Requestor of an action (r) over a socket may only make
changes to cgroups over which it has privilege.
  * Requestors may be limited to a certain #/depth of cgroups
(to limit memory usage) - DEFER?
  * Cgroup hierarchy is responsible for resource limits
  * A requestor must either be uid 0 in its userns with victim mapped
  into its userns, or the same uid and in same/ancestor pidns as the
victim
  * If r requests creation of cgroup '/x', /x will be interpreted
as relative to r's cgroup.  r cannot make changes to cgroups not
under its own current cgroup.
  * If r is not in the initial user_ns, then it may not change settings
in its own cgroup, only descendants.  (Not strictly necessary -
we could require the use of extra cgroups when wanted, as lxc does
 

Re: [lxc-devel] cgroup management daemon

2013-11-25 Thread Stéphane Graber
On Tue, Nov 26, 2013 at 02:03:16AM +0200, Marian Marinov wrote:
 ...

Re: [lxc-devel] cgroup management daemon

2013-11-25 Thread Marian Marinov
On 11/26/2013 02:11 AM, Stéphane Graber wrote:
 ...

Re: [lxc-devel] cgroup management daemon

2013-11-25 Thread Stéphane Graber
On Tue, Nov 26, 2013 at 03:35:22AM +0200, Marian Marinov wrote:
 ...

Re: [lxc-devel] cgroup management daemon

2013-11-25 Thread Michael H. Warfield
Serge...

You have no idea how much I dread mentioning this (well, after
LinuxPlumbers, maybe you can) but...  You do realize that some of this
is EXACTLY what the systemd crowd was talking about there in NOLA back
then.  I sat in those sessions grinding my teeth and listening to
comments from some others around me about when systemd might subsume
bash or even vi or quake.

Somehow, you and others have tagged me as a systemd expert but I am
far from it and even you noted that Lennart and I were on the edge of a
physical discussion when I made some off the cuff remarks there about
systemd design during my talk.  I personally rank systemd in the same
category as NetworkMangler (err, NetworkManager) in its propensity for
committing inexplicable random acts of terrorism and changing its
behavior from release to release to release.  I'm not a fan and I'm not
an expert, but I have to be involved with it and watch the damned thing
like a trapped rat, like it or not.

Like it or not, we can not go off on divergent designs.  As much as they
have delusions of taking over the Linux world, they are still going to
be a major factor and this sort of thing needs to be coordinated.  We
are going to need exactly what you are proposing whether we have systemd
in play or not.  IF we CAN kick it to the curb, when we need to, we
still need to know how to without tearing shit up and breaking shit that
thinks it's there.  Ideally, it shouldn't matter whether systemd were in
play or not.

All I ask is that we not get so far off track that we end up with a major
architectural divergence here.  The risk is there.

Mike


On Mon, 2013-11-25 at 22:43 +0000, Serge E. Hallyn wrote: 
 Hi,
 
 ...

Re: [lxc-devel] cgroup management daemon

2013-11-25 Thread Stéphane Graber
Haha,

I was wondering how long it'd take before we got the first comment about
systemd's own cgroup manager :)

To try and keep this short, there are a lot of cases where systemd's
plan of having an in-pid1 manager, as practical as it is for them, just
isn't going to work for us.

I believe our design makes things a bit cleaner by not having it tied to
any specific init system or feature, and by having a relatively low-level,
very simple API that people can use as a building block for anything
that wants to manage cgroups.

At this point in time, there's no hard limitation against having more than
one process writing to the cgroup hierarchy, as much as some people may
want this to change. I very much doubt it'll happen any time soon and
until then, even if not perfectly adequate, there won't be any problem
running both systemd's manager and our own.

There's also the possibility, if someone felt sufficiently strongly about
this to contribute patches, of having our manager talk to systemd's if
present and go through their manager instead of accessing cgroupfs
itself. That's assuming systemd offers a sufficiently low level API that
could be used for that without bringing an unreasonable amount of
dependencies to our code.


I don't want this thread to turn into some kind of flamewar or similarly
overheated discussion about systemd vs everyone else, so I'll just state
that from my point of view (and I suspect that of the group who worked
on this early draft), systemd's manager while perfect for grouping and
resource allocation for systemd units and user sessions doesn't quite
fit our bill with regard to supporting multiple level of full
distro-agnostic containers using nesting and mixing user namespaces.
It also has what as a non-systemd person I consider a big drawback of
being built into an init system which quite a few major distributions
don't use (specifically those distros that account for the majority of
LXC's users).

I think there's room for two implementations and competition (even if we
have slightly different goals) is a good thing and will undoubtedly help
both projects consider use cases they didn't think of, leading to a better
solution for everyone. And if some day one of the two wins or we can
somehow converge into a solution that works for everyone, that'd be
great. But our discussions at Linux Plumbers and other conferences have
shown that this isn't going to happen now, so it's best to stop arguing
and instead get some stuff done.

On Mon, Nov 25, 2013 at 09:18:04PM -0500, Michael H. Warfield wrote:
 ...

Re: [lxc-devel] cgroup management daemon

2013-11-25 Thread Michael H. Warfield
On Mon, 2013-11-25 at 21:43 -0500, Stéphane Graber wrote: 
  ...
 I think there's room for two implementations and competition (even if we
 have slightly different goals) is a good thing and will undoubtedly help
  both projects consider use cases they didn't think of, leading to a better
 solution for everyone. And if some day one of the two wins or we can
 somehow converge into a solution that works for everyone, that'd be
 great. But our discussions at Linux Plumbers and other conferences have
 shown that this isn't going to happen now, so it's best to stop arguing
 and instead get some stuff done.

Concur.  And, as you know, I'm not a fan or supporter of that camp.  I
just want to make sure everyone is aware of all the gorillas in the room
before the fecal flakes hit the rapidly whirling blades.

That being said, I think this is a laudable goal.  If we do it right, it
may well become the standard they have to adhere to.

Regards,
Mike


Re: [lxc-devel] cgroup management daemon

2013-11-25 Thread Serge E. Hallyn
Quoting Tim Hockin (thoc...@google.com):
 Thanks for this!  I think it helps a lot to discuss now, rather than
 over nearly-done code.
 
 On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn se...@hallyn.com wrote:
  Additionally, Tejun has specified that we do not want users to be
  too closely tied to the cgroupfs implementation.  Therefore
  commands will be just a hair more general than specifying cgroupfs
  filenames and values.  I may go so far as to avoid specifying
  specific controllers, as AFAIK there should be no redundancy in
  features.  On the other hand, I don't want to get too general.
  So I'm basing the API loosely on the lmctfy command line API.
 
 I'm torn here.  While I agree in principle with Tejun, I am concerned
 that this agent will always lag new kernel features or that the thin
 abstraction you want to provide here does not easily accommodate some
 of the more ... oddball features of one cgroup interface or another.
 
 This agent is the very bottom of the stack, and should probably not do
 much by way of abstraction.  I think I'd rather let something like
 lmctfy provide the abstraction more holistically, and relegate this

If lmctfy is an abstraction layer, that should keep Tejun happy, and
it could keep me out of the resource naming game, which makes me happy :)

 agent to very simple plumbing and policy.  It could be as simple as
 providing read/write/etc ops to specific control files.  It needs to
 handle event_fd, too, I guess.  This has the nice side-effect of
 always being current on kernel features :)
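
To make that "simple plumbing" concrete: once the privilege checks have
passed, a write op need be little more than the following (paths and
helper name illustrative only):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Thin plumbing: write one value into one control file, e.g.
     * cgroup_set_value("/sys/fs/cgroup/memory/a/b",
     *                  "memory.limit_in_bytes", "100M"). */
    static int cgroup_set_value(const char *cgpath, const char *file,
                                const char *value)
    {
        char path[4096];
        int fd, ret = 0;

        snprintf(path, sizeof(path), "%s/%s", cgpath, file);
        fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;
        if (write(fd, value, strlen(value)) != (ssize_t)strlen(value))
            ret = -1;
        close(fd);
        return ret;
    }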
 
  Summary
 
  Each 'host' (identified by a separate instance of the linux kernel) will
  have exactly one running daemon to manage control groups.  This daemon
  will answer cgroup management requests over a dbus socket, located at
  /sys/fs/cgroup/manager.  This socket can be bind-mounted into various
  containers, so that one daemon can support the whole system.
 
  Programs will be able to make cgroup requests using dbus calls, or
  indirectly by linking against lmctfy which will be modified to use the
  dbus calls if available.
 
  Outline:
. A single manager, cgmanager, is started on the host, very early
  during boot.  It has very few dependencies, and requires only
  /proc, /run, and /sys to be mounted, with /etc ro.  It will mount
  the cgroup hierarchies in a private namespace and set defaults
  (clone_children, use_hierarchy, sane_behavior, release_agent?) It
  will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
 
 Where does the config come from?  How do I specify which hierarchies I
 want and where, and which flags?

That'll have to be in a file in /etc (which can be mounted readonly).
There should be no surprises there so I've not thought about the format.

. A client (requestor 'r') can make cgroup requests over
  /sys/fs/cgroup/manager using dbus calls.  Detailed privilege
  requirements for r are listed below.
  . The client request will pertain to an existing or new cgroup A.  r's
  privilege over the cgroup must be checked.  r is said to have
  privilege over A if A is owned by r's uid, or if A's owner is mapped
  into r's user namespace, and r is root in that user namespace.
 
 Problem with this definition.  Being owned-by is not the same as
 has-root-in.  Specifically, I may choose to give you root in your own
 namespace, but you sure as heck can not increase your own memory
 limit.

1. If you don't want me to change the value at all, then just don't map
A's owner into the namespace.  I'm uid 100000, which is root in my namespace,
but I only have privilege over other uids mapped into my namespace.

2. I've considered never allowing changes to your own cgroup.  So if you're
in /a/b, you can create /a/b/c and modify c's settings, but you can't modify
b's.  OTOH, that isn't strictly necessary - if we did allow it, then you
could simply clamp /a/b's memory to what you want, and stick me in /a/b/c,
so I can't escape the memory limit you wanted.

3. I've not considered having the daemon track resource limits - i.e. creating
a cgroup and saying give it 100M swap, and if it asks, let it increase that
to 200M.  I'd prefer that be done incidentally through (1) and (2).  Do you
feel that would be insufficient?
 
Or maybe your question is something different and I'm missing it?

 . The client request may pertain to a victim task v, which may be moved
  to a new cgroup.  In that case r's privilege over both the cgroup
  and v must be checked.  r is said to have privilege over v if v
  is mapped in r's pid namespace, v's uid is mapped into r's user ns,
  and r is root in its userns.  Or if r and v have the same uid
  and v is mapped in r's pid namespace.
. r's credentials will be taken from socket's peercred, ensuring that
  pid and uid are translated.
. r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
   translated global pid.  It will then read UID(v) from /proc/PID(v)/status,
   which is the global uid, and check /proc/PID(r)/uid_map to see whether
   UID is mapped there.
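
The peercred lookup mentioned above is essentially a one-liner on Linux;
a sketch:

    #define _GNU_SOURCE
    #include <sys/socket.h>

    /* Fetch the requestor's credentials, already translated by the
     * kernel into the daemon's namespaces, from the connected socket. */
    static int get_peer_creds(int sock, struct ucred *cred)
    {
        socklen_t len = sizeof(*cred);
        return getsockopt(sock, SOL_SOCKET, SO_PEERCRED, cred, &len);
    }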