subject:"Re\: \[lxc\-devel\] cgroup management daemon"

Hello, guys.

Sorry about the delay.

On Mon, Nov 25, 2013 at 10:43:35PM +, Serge E. Hallyn wrote:
 Additionally, Tejun has specified that we do not want users to be
 too closely tied to the cgroupfs implementation.  Therefore
 commands will be just a hair more general than specifying cgroupfs
 filenames and values.  I may go so far as to avoid specifying
 specific controllers, as AFAIK there should be no redundancy in
 features.  On the other hand, I don't want to get too general.
 So I'm basing the API loosely on the lmctfy command line API.

One of the reasons for not exposing knobs as-is is that the knobs we
currently have aren't consistent.  The weight values have different
ranges, some combinations of values don't make much sense, and so on.
The user can cope with it but it'd probably be better to expose
something which doesn't lead to mistakes too easily.

 The above addresses
 * creating cgroups
 * chowning cgroups
 * setting cgroup limits
 * moving tasks into cgroups
   . but does not address a 'cgexec group -- command' type of behavior.
 * To handle that (specifically for upstart), recommend that r do:
   if (!pid) {
 request_reclassify(cgroup, getpid());
 do_execve();
   }
   . alternatively, the daemon could, if kernel is new enough, setns to
 the requestor's namespaces to execute a command in a new cgroup.
 The new command would be daemonized to that pid namespaces' pid 1.

So, IIUC, cgroup hierarchy management - creation and removal of
cgroups and assignments of tasks will go through while configuring
control knobs will be delegated to the cgroup owner, right?

Hmmm... the plan is to allow delegating task assignments in the
sub-hierarchy but require CAP_X for writes to knobs (not reads).  This
stems from the fact that, especially with unified hierarchy, those
operations will be cgroup-core proper operations which are gonna be
relatively safer and that task organizations in the subhierarchy and
monitoring knobs are likely to be higher frequency operation than
enabling and configuring controllers.

As I communicated multiple times before, delegating write access to
control knobs to untrusted domain has always been a security risk and
is likely to continue to remain so.  Also, organizationally, a
cgroup's control knobs belong to the parent not the cgroup itself.
That probably is why you were thinking about putting an extra cgroup
inbetween for isolation, but the root problem there is that those
knobs belong to the parent, not the directory itself.

Security is in most part logistics - it's about getting all the
details right, and we don't either design or implement each knob with
security in mind and DoSing them has always been pretty easy, so I
don't think delegating write accesses to knobs is a good idea.

If you, for whatever reason, can trust the delegatee, which I believe
is the case for google, it's fine.  If you're trying to delegate to a
container which you don't have any control over, it isn't a good idea.

Another thing to consider is due to both the fundamental characterics
of hierarchy and implementation issues, things will become expensive
if nesting gets beyond several layers (if controllers are enabled,
that is) and the controllers in general will be implemented and
optimized with limited level of nesting in mind.  IOW, building, say,
8 level deep hierarchy in the host and then doing the same thing
inside the container with controllers enabled won't make a very happy
system.  It probably is something to keep in mind when laying out how
the whole thing eventually would look like.

 Long-term we will want the cgroup manager to become more intelligent -
 to place its own limits on clients, to address cpu and device hotplug,
 etc.  Since we will not be doing that in the first prototype, the daemon
 will not keep any state about the clients.

Isn't the above conflicting with chowning control knobs?

Thanks.

-- 
tejun

--
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET,  PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351iu=/4140/ostg.clktrk
___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

Ooh, can you also please cc Li Zefan lize...@huawei.com when
replying?

Thanks.

-- 
tejun

--
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET,  PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351iu=/4140/ostg.clktrk
___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

Hello, Tim.

On Mon, Nov 25, 2013 at 08:58:09PM -0800, Tim Hockin wrote:
 Thanks for this!  I think it helps a lot to discuss now, rather than
 over nearly-done code.
 
 On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn se...@hallyn.com wrote:
  Additionally, Tejun has specified that we do not want users to be
  too closely tied to the cgroupfs implementation.  Therefore
  commands will be just a hair more general than specifying cgroupfs
  filenames and values.  I may go so far as to avoid specifying
  specific controllers, as AFAIK there should be no redundancy in
  features.  On the other hand, I don't want to get too general.
  So I'm basing the API loosely on the lmctfy command line API.
 
 I'm torn here.  While I agree in principle with Tejun, I am concerned
 that this agent will always lag new kernel features or that the thin
 abstraction you want to provide here does not easily accommodate some
 of the more ... oddball features of one cgroup interface or another.

Yeah, that's the trade-off but cgroupfs is a kernel API.  It shouldn't
change or grow rapidly once things settle down.  As long as there's
not too crazy way to step-aside when such rare case arises, I think
pros outweight cons.

 This agent is the very bottom of the stack, and should probably not do
 much by way of abstraction.  I think I'd rather let something like
 lmctfy provide the abstraction more holistically, and relegate this
 agent to very simple plumbing and policy.  It could be as simple as
 providing read/write/etc ops to specific control files.  It needs to
 handle event_fd, too, I guess.  This has the nice side-effect of
 always being current on kernel features :)

The level of abstraction is definitely something debatable.  Please
note that the existing event_fd based mechanism won't grow any new
users (BTW, event_control is one of the dos vectors if you give write
access to it) and all new notifications will be using inotify.

Thanks.

-- 
tejun

--
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET,  PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351iu=/4140/ostg.clktrk
___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

Hello,

On Tue, Nov 26, 2013 at 09:19:18AM -0800, Victor Marmol wrote:
   From my discussions with Tejun, he wanted to move to using inotify so it
   may still be an fd we pass around.
 
  Hm, would that just be inotify on the memory.max_usage_in_bytes
  file, of inotify on a specific fd you've created which is
  associated with any threshold you specify?  The former seems
  less ideal.
 
 
 Tejun can comment more, but I think it is still TBD.

It's likely the former with configurable cadence or per-knob (not
per-opener) configurable thresholds.  max_usage_in_bytes is a special
case here as all other knobs can simply generate an event on each
transition.  If event (de)muxing is necessary, it probably should be
done from userland.

Thanks.

-- 
tejun

--
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET,  PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351iu=/4140/ostg.clktrk
___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

2013-12-03 Thread Serge Hallyn

Quoting Tejun Heo (t...@kernel.org):
 Hello, guys.
 
 Sorry about the delay.
 
 On Mon, Nov 25, 2013 at 10:43:35PM +, Serge E. Hallyn wrote:
  Additionally, Tejun has specified that we do not want users to be
  too closely tied to the cgroupfs implementation.  Therefore
  commands will be just a hair more general than specifying cgroupfs
  filenames and values.  I may go so far as to avoid specifying
  specific controllers, as AFAIK there should be no redundancy in
  features.  On the other hand, I don't want to get too general.
  So I'm basing the API loosely on the lmctfy command line API.
 
 One of the reasons for not exposing knobs as-is is that the knobs we
 currently have aren't consistent.  The weight values have different
 ranges, some combinations of values don't make much sense, and so on.
 The user can cope with it but it'd probably be better to expose
 something which doesn't lead to mistakes too easily.

For the moment, for prototype (github.com/hallyn/cgmanager), I'm just
going with filenames/values.

When the bulk of the work is done, we can either (or both) (a) introduce
a thin abstraction layer over the key/values, or/and (b) whitelist
some of the filenames and filter some values.

I know the upstart folks don't want to have to wait long for a
specification...  I'll hopefully make a final decision on this next
week.

  The above addresses
  * creating cgroups
  * chowning cgroups
  * setting cgroup limits
  * moving tasks into cgroups
. but does not address a 'cgexec group -- command' type of behavior.
  * To handle that (specifically for upstart), recommend that r do:
if (!pid) {
  request_reclassify(cgroup, getpid());
  do_execve();
}
. alternatively, the daemon could, if kernel is new enough, setns to
  the requestor's namespaces to execute a command in a new cgroup.
  The new command would be daemonized to that pid namespaces' pid 1.
 
 So, IIUC, cgroup hierarchy management - creation and removal of
 cgroups and assignments of tasks will go through while configuring
 control knobs will be delegated to the cgroup owner, right?

Not sure what you mean, but I think the answer is no.  Everything
goes through the manager.  The manager doesn't try to enforce that,
but by default the cgroup filesystems will only be mounted in the
manager's private mnt_ns, and containers at least will not be
allowed to mount cgroup fstype.

 Hmmm... the plan is to allow delegating task assignments in the
 sub-hierarchy but require CAP_X for writes to knobs (not reads).  This
 stems from the fact that, especially with unified hierarchy, those
 operations will be cgroup-core proper operations which are gonna be
 relatively safer and that task organizations in the subhierarchy and
 monitoring knobs are likely to be higher frequency operation than
 enabling and configuring controllers.

Should be ok for this.

 As I communicated multiple times before, delegating write access to
 control knobs to untrusted domain has always been a security risk and
 is likely to continue to remain so.  Also, organizationally, a

Then that will need to be address with per-key blacklisting and/or
per-value filtering in the manager.

Which is my way of saying:  can we please have a list of the security
issues so we can handle them?  :)  (I've asked several times before
but haven't seen a list or anyone offering to make one)

 cgroup's control knobs belong to the parent not the cgroup itself.

After thinking awhile I think this makes perfect sense.  I haven't
implemented set_value yet, and when I do I think I'll implement this
guideline.

 That probably is why you were thinking about putting an extra cgroup
 inbetween for isolation, but the root problem there is that those
 knobs belong to the parent, not the directory itself.

Yup.

 Security is in most part logistics - it's about getting all the
 details right, and we don't either design or implement each knob with
 security in mind and DoSing them has always been pretty easy, so I
 don't think delegating write accesses to knobs is a good idea.
 
 If you, for whatever reason, can trust the delegatee, which I believe
 is the case for google, it's fine.  If you're trying to delegate to a
 container which you don't have any control over, it isn't a good idea.
 
 Another thing to consider is due to both the fundamental characterics
 of hierarchy and implementation issues, things will become expensive
 if nesting gets beyond several layers (if controllers are enabled,
 that is) and the controllers in general will be implemented and
 optimized with limited level of nesting in mind.  IOW, building, say,
 8 level deep hierarchy in the host and then doing the same thing
 inside the container with controllers enabled won't make a very happy

Yes, I very much want to avoid that.

 system.  It probably is something to keep in mind when laying out how
 the whole thing eventually would look like.
 
  Long-term we will want

Re: [lxc-devel] cgroup management daemon

Hello, Serge.

On Tue, Dec 03, 2013 at 06:03:44PM -0600, Serge Hallyn wrote:
  As I communicated multiple times before, delegating write access to
  control knobs to untrusted domain has always been a security risk and
  is likely to continue to remain so.  Also, organizationally, a
 
 Then that will need to be address with per-key blacklisting and/or
 per-value filtering in the manager.
 
 Which is my way of saying:  can we please have a list of the security
 issues so we can handle them?  :)  (I've asked several times before
 but haven't seen a list or anyone offering to make one)

Unfortunately, for now, please consider everything blacklisted.  Yes,
it is true that some knobs should be mostly safe but given the level
of changes we're going through and the difficulty of properly auditing
anything for delegation to untrusted environment, I don't feel
comfortable at all about delegating through chown.  It is an
accidental feature which happened just because it uses filesystem as
its interface and it is no where near the top of the todo list.  It
has never worked properly and won't in any foreseeable future.

  cgroup's control knobs belong to the parent not the cgroup itself.
 
 After thinking awhile I think this makes perfect sense.  I haven't
 implemented set_value yet, and when I do I think I'll implement this
 guideline.

I'm kinda confused here.  You say *everything* is gonna go through the
manager and then talks about chowning directories.  Don't the two
conflict?

   Long-term we will want the cgroup manager to become more intelligent -
   to place its own limits on clients, to address cpu and device hotplug,
   etc.  Since we will not be doing that in the first prototype, the daemon
   will not keep any state about the clients.
  
  Isn't the above conflicting with chowning control knobs?
 
 Not sure what you mean by this.
 
 To be clear what I'm talking about is having the client be able to say
 grant 50% of cpus, then when more cpus are added, the actual cpuset
 gets recalculated.  This may well forever stay outside of the cgmanager
 scope.  It may be more appropriate to put that logic into the lmctfy
 layer.

Yes, something like that would be nice but if you give out raw access
to the control knobs by chowning them, I just don't see how that would
be implementable.  What am I missing here?

Thanks.

-- 
tejun

--
Sponsored by Intel(R) XDK 
Develop, test and display web and hybrid apps with a single code base.
Download it for free now!
http://pubads.g.doubleclick.net/gampad/clk?id=111408631iu=/4140/ostg.clktrk
___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

And can somebody please fix up lxc-devel so that it doesn't generate
your message awaits moderator approval notification on *each*
message?  :(

-- 
tejun

--
Sponsored by Intel(R) XDK 
Develop, test and display web and hybrid apps with a single code base.
Download it for free now!
http://pubads.g.doubleclick.net/gampad/clk?id=111408631iu=/4140/ostg.clktrk
___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

2013-12-03 Thread Serge Hallyn

Quoting Tejun Heo (t...@kernel.org):
 Hello, Serge.
 
 On Tue, Dec 03, 2013 at 06:03:44PM -0600, Serge Hallyn wrote:
   As I communicated multiple times before, delegating write access to
   control knobs to untrusted domain has always been a security risk and
   is likely to continue to remain so.  Also, organizationally, a
  
  Then that will need to be address with per-key blacklisting and/or
  per-value filtering in the manager.
  
  Which is my way of saying:  can we please have a list of the security
  issues so we can handle them?  :)  (I've asked several times before
  but haven't seen a list or anyone offering to make one)
 
 Unfortunately, for now, please consider everything blacklisted.  Yes,
 it is true that some knobs should be mostly safe but given the level
 of changes we're going through and the difficulty of properly auditing
 anything for delegation to untrusted environment, I don't feel
 comfortable at all about delegating through chown.  It is an
 accidental feature which happened just because it uses filesystem as
 its interface and it is no where near the top of the todo list.  It
 has never worked properly and won't in any foreseeable future.
 
   cgroup's control knobs belong to the parent not the cgroup itself.
  
  After thinking awhile I think this makes perfect sense.  I haven't
  implemented set_value yet, and when I do I think I'll implement this
  guideline.
 
 I'm kinda confused here.  You say *everything* is gonna go through the
 manager and then talks about chowning directories.  Don't the two
 conflict?

No.  I expect the user - except in the google case - to either have
access to no cgroupfs mounts, or readonly mounts.

-serge

--
Sponsored by Intel(R) XDK 
Develop, test and display web and hybrid apps with a single code base.
Download it for free now!
http://pubads.g.doubleclick.net/gampad/clk?id=111408631iu=/4140/ostg.clktrk
___
lxc-devel mailing list
lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

2013-12-03 Thread Tim Hockin

If this daemon works as advertised, we will explore moving all write
traffic to use it.  I still have concerns that this can't handle read
traffic at the scale we need.

Tejun,  I am not sure why chown came back into the conversation.  This
is a replacement for that.

On Tue, Dec 3, 2013 at 6:31 PM, Serge Hallyn serge.hal...@ubuntu.com wrote:
 Quoting Tejun Heo (t...@kernel.org):
 Hello, Serge.

 On Tue, Dec 03, 2013 at 06:03:44PM -0600, Serge Hallyn wrote:
   As I communicated multiple times before, delegating write access to
   control knobs to untrusted domain has always been a security risk and
   is likely to continue to remain so.  Also, organizationally, a
 
  Then that will need to be address with per-key blacklisting and/or
  per-value filtering in the manager.
 
  Which is my way of saying:  can we please have a list of the security
  issues so we can handle them?  :)  (I've asked several times before
  but haven't seen a list or anyone offering to make one)

 Unfortunately, for now, please consider everything blacklisted.  Yes,
 it is true that some knobs should be mostly safe but given the level
 of changes we're going through and the difficulty of properly auditing
 anything for delegation to untrusted environment, I don't feel
 comfortable at all about delegating through chown.  It is an
 accidental feature which happened just because it uses filesystem as
 its interface and it is no where near the top of the todo list.  It
 has never worked properly and won't in any foreseeable future.

   cgroup's control knobs belong to the parent not the cgroup itself.
 
  After thinking awhile I think this makes perfect sense.  I haven't
  implemented set_value yet, and when I do I think I'll implement this
  guideline.

 I'm kinda confused here.  You say *everything* is gonna go through the
 manager and then talks about chowning directories.  Don't the two
 conflict?

 No.  I expect the user - except in the google case - to either have
 access to no cgroupfs mounts, or readonly mounts.

 -serge

--
Sponsored by Intel(R) XDK 
Develop, test and display web and hybrid apps with a single code base.
Download it for free now!
http://pubads.g.doubleclick.net/gampad/clk?id=111408631iu=/4140/ostg.clktrk
___
lxc-devel mailing list
lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

2013-12-03 Thread Victor Marmol

I thought we were going to use chown in the initial version to enforce the
ownership/permissions on the hierarchy. Only the cgroup manager has access
to the hierarchy, but it tries to access the hierarchy as the user that
sent the request. It was only meant to be a for now solution while the
real one rolls out. It may also have gotten thrown out since last I heard :)


On Tue, Dec 3, 2013 at 8:53 PM, Tim Hockin thoc...@google.com wrote:

 If this daemon works as advertised, we will explore moving all write
 traffic to use it.  I still have concerns that this can't handle read
 traffic at the scale we need.

 Tejun,  I am not sure why chown came back into the conversation.  This
 is a replacement for that.

 On Tue, Dec 3, 2013 at 6:31 PM, Serge Hallyn serge.hal...@ubuntu.com
 wrote:
  Quoting Tejun Heo (t...@kernel.org):
  Hello, Serge.
 
  On Tue, Dec 03, 2013 at 06:03:44PM -0600, Serge Hallyn wrote:
As I communicated multiple times before, delegating write access to
control knobs to untrusted domain has always been a security risk
 and
is likely to continue to remain so.  Also, organizationally, a
  
   Then that will need to be address with per-key blacklisting and/or
   per-value filtering in the manager.
  
   Which is my way of saying:  can we please have a list of the security
   issues so we can handle them?  :)  (I've asked several times before
   but haven't seen a list or anyone offering to make one)
 
  Unfortunately, for now, please consider everything blacklisted.  Yes,
  it is true that some knobs should be mostly safe but given the level
  of changes we're going through and the difficulty of properly auditing
  anything for delegation to untrusted environment, I don't feel
  comfortable at all about delegating through chown.  It is an
  accidental feature which happened just because it uses filesystem as
  its interface and it is no where near the top of the todo list.  It
  has never worked properly and won't in any foreseeable future.
 
cgroup's control knobs belong to the parent not the cgroup itself.
  
   After thinking awhile I think this makes perfect sense.  I haven't
   implemented set_value yet, and when I do I think I'll implement this
   guideline.
 
  I'm kinda confused here.  You say *everything* is gonna go through the
  manager and then talks about chowning directories.  Don't the two
  conflict?
 
  No.  I expect the user - except in the google case - to either have
  access to no cgroupfs mounts, or readonly mounts.
 
  -serge

--
Sponsored by Intel(R) XDK 
Develop, test and display web and hybrid apps with a single code base.
Download it for free now!
http://pubads.g.doubleclick.net/gampad/clk?id=111408631iu=/4140/ostg.clktrk___
lxc-devel mailing list
lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

At the start of this discussion, some months ago, we offered to
co-devel this with Lennart et al.  They did not seem keen on the idea.

If they have an established DBUS protocol spec, we should consider
adopting it instead of a new one, but we CAN'T just play follow the
leader and do whatever they do, change whenever they feel like
changing.

It would be best if we could get a common DBUS api specc'ed and all
agree to it.  Serge, do you feel up to that?

On Mon, Nov 25, 2013 at 6:18 PM, Michael H. Warfield m...@wittsend.com wrote:
 Serge...

 You have no idea how much I dread mentioning this (well, after
 LinuxPlumbers, maybe you can) but...  You do realize that some of this
 is EXACTLY what the systemd crowd was talking about there in NOLA back
 then.  I sat in those session grinding my teeth and listening to
 comments from some others around me about when systemd might subsume
 bash or even vi or quake.

 Somehow, you and others have tagged me as a systemd expert but I am
 far from it and even you noted that Lennart and I were on the edge of a
 physical discussion when I made some off the cuff remarks there about
 systemd design during my talk.  I personally rank systemd in the same
 category as NetworkMangler (err, NetworkManager) in its propensity for
 committing inexplicable random acts of terrorism and changing its
 behavior from release to release to release.  I'm not a fan and I'm not
 an expert, but I have to be involved with it and watch the damned thing
 like a trapped rat, like it or not.

 Like it or not, we can not go off on divergent designs.  As much as they
 have delusions of taking over the Linux world, they are still going to
 be a major factor and this sort of thing needs to be coordinated.  We
 are going to need exactly what you are proposing whether we have systemd
 in play or not.  IF we CAN kick it to the curb, when we need to, we
 still need to know how to without tearing shit up and breaking shit that
 thinks it's there.  Ideally, it shouldn't matter if systemd where in
 play or not.

 All I ask is that we not get too far off track that we have a major
 architectural divergence here.  The risk is there.

 Mike


 On Mon, 2013-11-25 at 22:43 +, Serge E. Hallyn wrote:
 Hi,

 as i've mentioned several times, I want to write a standalone cgroup
 management daemon.  Basic requirements are that it be a standalone
 program; that a single instance running on the host be usable from
 containers nested at any depth; that it not allow escaping ones
 assigned limits; that it not allow subjegating tasks which do not
 belong to you; and that, within your limits, you be able to parcel
 those limits to your tasks as you like.

 Additionally, Tejun has specified that we do not want users to be
 too closely tied to the cgroupfs implementation.  Therefore
 commands will be just a hair more general than specifying cgroupfs
 filenames and values.  I may go so far as to avoid specifying
 specific controllers, as AFAIK there should be no redundancy in
 features.  On the other hand, I don't want to get too general.
 So I'm basing the API loosely on the lmctfy command line API.

 One of the driving goals is to enable nested lxc as simply and safely as
 possible.  If this project is a success, then a large chunk of code can
 be removed from lxc.  I'm considering this project a part of the larger
 lxc project, but given how central it is to systems management that
 doesn't mean that I'll consider anyone else's needs as less important
 than our own.

 This document consists of two parts.  The first describes how I
 intend the daemon (cgmanager) to be structured and how it will
 enforce the safety requirements.  The second describes the commands
 which clients will be able to send to the manager.  The list of
 controller keys which can be set is very incomplete at this point,
 serving mainly to show the approach I was thinking of taking.

 Summary

 Each 'host' (identified by a separate instance of the linux kernel) will
 have exactly one running daemon to manage control groups.  This daemon
 will answer cgroup management requests over a dbus socket, located at
 /sys/fs/cgroup/manager.  This socket can be bind-mounted into various
 containers, so that one daemon can support the whole system.

 Programs will be able to make cgroup requests using dbus calls, or
 indirectly by linking against lmctfy which will be modified to use the
 dbus calls if available.

 Outline:
   . A single manager, cgmanager, is started on the host, very early
 during boot.  It has very few dependencies, and requires only
 /proc, /run, and /sys to be mounted, with /etc ro.  It will mount
 the cgroup hierarchies in a private namespace and set defaults
 (clone_children, use_hierarchy, sane_behavior, release_agent?) It
 will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
   . A client (requestor 'r') can make cgroup requests over
 /sys/fs/cgroup/manager using dbus calls.  Detailed privilege

Re: [lxc-devel] cgroup management daemon

Thanks for this!  I think it helps a lot to discuss now, rather than
over nearly-done code.

On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn se...@hallyn.com wrote:
 Additionally, Tejun has specified that we do not want users to be
 too closely tied to the cgroupfs implementation.  Therefore
 commands will be just a hair more general than specifying cgroupfs
 filenames and values.  I may go so far as to avoid specifying
 specific controllers, as AFAIK there should be no redundancy in
 features.  On the other hand, I don't want to get too general.
 So I'm basing the API loosely on the lmctfy command line API.

I'm torn here.  While I agree in principle with Tejun, I am concerned
that this agent will always lag new kernel features or that the thin
abstraction you want to provide here does not easily accommodate some
of the more ... oddball features of one cgroup interface or another.

This agent is the very bottom of the stack, and should probably not do
much by way of abstraction.  I think I'd rather let something like
lmctfy provide the abstraction more holistically, and relegate this
agent to very simple plumbing and policy.  It could be as simple as
providing read/write/etc ops to specific control files.  It needs to
handle event_fd, too, I guess.  This has the nice side-effect of
always being current on kernel features :)

 Summary

 Each 'host' (identified by a separate instance of the linux kernel) will
 have exactly one running daemon to manage control groups.  This daemon
 will answer cgroup management requests over a dbus socket, located at
 /sys/fs/cgroup/manager.  This socket can be bind-mounted into various
 containers, so that one daemon can support the whole system.

 Programs will be able to make cgroup requests using dbus calls, or
 indirectly by linking against lmctfy which will be modified to use the
 dbus calls if available.

 Outline:
   . A single manager, cgmanager, is started on the host, very early
 during boot.  It has very few dependencies, and requires only
 /proc, /run, and /sys to be mounted, with /etc ro.  It will mount
 the cgroup hierarchies in a private namespace and set defaults
 (clone_children, use_hierarchy, sane_behavior, release_agent?) It
 will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).

Where does the config come from?  How do I specify which hierarchies I
want and where, and which flags?

   . A client (requestor 'r') can make cgroup requests over
 /sys/fs/cgroup/manager using dbus calls.  Detailed privilege
 requirements for r are listed below.
   . The client request will pertain an existing or new cgroup A.  r's
 privilege over the cgroup must be checked.  r is said to have
 privilege over A if A is owned by r's uid, or if A's owner is mapped
 into r's user namespace, and r is root in that user namespace.

Problem with this definition.  Being owned-by is not the same as
has-root-in.  Specifically, I may choose to give you root in your own
namespace, but you sure as heck can not increase your own memory
limit.

   . The client request may pertain a victim task v, which may be moved
 to a new cgroup.  In that case r's privilege over both the cgroup
 and v must be checked.  r is said to have privilege over v if v
 is mapped in r's pid namespace, v's uid is mapped into r's user ns,
 and r is root in its userns.  Or if r and v have the same uid
 and v is mapped in r's pid namespace.
   . r's credentials will be taken from socket's peercred, ensuring that
 pid and uid are translated.
   . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
 translated global pid.  It will then read UID(v) from /proc/PID(v)/status,
 which is the global uid, and check /proc/PID(r)/uid_map to see whether
 UID is mapped there.
   . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
 the kernel translate it for the reader.  Only 'move task v to cgroup
 A' will require a SCM_CREDENTIAL to be sent.

 Privilege requirements by action:
 * Requestor of an action (r) over a socket may only make
   changes to cgroups over which it has privilege.
 * Requestors may be limited to a certain #/depth of cgroups
   (to limit memory usage) - DEFER?
 * Cgroup hierarchy is responsible for resource limits
 * A requestor must either be uid 0 in its userns with victim mapped
   ito its userns, or the same uid and in same/ancestor pidns as the
   victim
 * If r requests creation of cgroup '/x', /x will be interpreted
   as relative to r's cgroup.  r cannot make changes to cgroups not
   under its own current cgroup.

Does this imply that r in a lower-level (farter from root) of the
hierarchy can not make requests of higher levels of the hierarchy
(closer to root), even though they have permissions as per the
definition of privilege?

How do we reconcile this pseudo-virtualization with /proc/self/cgroup
which DOES expose raw paths?

Re: [lxc-devel] cgroup management daemon

Quoting Tim Hockin (thoc...@google.com):
 What are the requirements/goals around performance and concurrency?
 Do you expect this to be a single-threaded thing, or can we handle
 some number of concurrent operations?  Do you expect to use threads of
 processes?

The cgmanager should be pretty dumb, so I would expect it to be
quite fast.  I don't have any specific perf goals though.  If you
have requirements I'm very interested to hear them.  I should be
able to tell pretty soon how far short I fall.

By default I'd expect to run with a single thread, but I don't
imagine one thread can serve a busy 1024-cpu system very well.
Unless you have guidance right now, I think I'd like to get
started with the basic functionality and see how it measures
up to your requirements.  I should add perf counters from the
start so we can figure out where bottlenecks (if any) are and
how to handle them.

Otherwise I could start out with a basic numcpus/10 threadpool
and have the main thread do socket i/o and parcel access
verification and vfs work out to the threadpool, but I'd rather
first know where the problems lie.

 Can you talk about logging - what and where?

When started under upstart, anything we print out goes to
/var/log/upstart/cgmanager.log.  Would be nice to keep it
that simple.  We could log requests by r to do something
it is not allowed to do, but it seems to me the failed
attempts cause no harm, while the potential for overflowing
logs can.

Did you have anything in mind?  Did you want logging to help
detect certain conditions for system optimization, or just
for failure notices and security violations?

 How will we handle event_fd?  Pass a file-descriptor back to the caller?

The only thing currently supporting eventfd is memory threshold,
right?  I haven't tested whether this will work or not, but
ideally the caller would open the eventfd fd, pass it, the
cgroup name, controller file to be watched, and the args to
cgmanager;  cgmanager confirms read access, opens the
controller fd, makes the request over cgroup.event_control,
then passes the controller fd back to the caller and closes
its own copy.

I'm also not sure whether the cgroup interface is going to be
offering a new feature to replace eventfd, since it wants
people to stop using cgroupfs...  Tejun?

 That's all I can come up with for now.

--
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET,  PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351iu=/4140/ostg.clktrk
___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Victor Marmol

On Tue, Nov 26, 2013 at 8:12 AM, Serge E. Hallyn se...@hallyn.com wrote:

Quoting Tim Hockin (thoc...@google.com):
What are the requirements/goals around performance and concurrency?
Do you expect this to be a single-threaded thing, or can we handle
some number of concurrent operations? Do you expect to use threads of
processes?

The cgmanager should be pretty dumb, so I would expect it to be
quite fast. I don't have any specific perf goals though. If you
have requirements I'm very interested to hear them. I should be
able to tell pretty soon how far short I fall.

By default I'd expect to run with a single thread, but I don't
imagine one thread can serve a busy 1024-cpu system very well.
Unless you have guidance right now, I think I'd like to get
started with the basic functionality and see how it measures
up to your requirements. I should add perf counters from the
start so we can figure out where bottlenecks (if any) are and
how to handle them.

Otherwise I could start out with a basic numcpus/10 threadpool
and have the main thread do socket i/o and parcel access
verification and vfs work out to the threadpool, but I'd rather
first know where the problems lie.

From Rohit's talk at Linux plumbers:

http://www.linuxplumbersconf.net/2013/ocw//system/presentations/1239/original/lmctfy%20(1).pdf

The goal is O(1000) reads and O(100) writes per second.

Can you talk about logging - what and where?

When started under upstart, anything we print out goes to
/var/log/upstart/cgmanager.log. Would be nice to keep it
that simple. We could log requests by r to do something
it is not allowed to do, but it seems to me the failed
attempts cause no harm, while the potential for overflowing
logs can.

Did you have anything in mind? Did you want logging to help
detect certain conditions for system optimization, or just
for failure notices and security violations?

How will we handle event_fd? Pass a file-descriptor back to the caller?

The only thing currently supporting eventfd is memory threshold,
right? I haven't tested whether this will work or not, but
ideally the caller would open the eventfd fd, pass it, the
cgroup name, controller file to be watched, and the args to
cgmanager; cgmanager confirms read access, opens the
controller fd, makes the request over cgroup.event_control,
then passes the controller fd back to the caller and closes
its own copy.

I'm also not sure whether the cgroup interface is going to be
offering a new feature to replace eventfd, since it wants
people to stop using cgroupfs... Tejun?

From my discussions with Tejun, he wanted to move to using inotify so it
may still be an fd we pass around.

That's all I can come up with for now.

--
Rapidly troubleshoot problems before they affect your business. Most IT
organizations don't have a clear picture of how application performance
affects their revenue. With AppDynamics, you get 100% visibility into your
Java,.NET, PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351iu=/4140/ostg.clktrk___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

Quoting Tim Hockin (thoc...@google.com):
 At the start of this discussion, some months ago, we offered to
 co-devel this with Lennart et al.  They did not seem keen on the idea.
 
 If they have an established DBUS protocol spec,

see http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/
and http://man7.org/linux/man-pages/man5/systemd.cgroup.5.html

  we should consider
 adopting it instead of a new one, but we CAN'T just play follow the
 leader and do whatever they do, change whenever they feel like
 changing.

Right.  And if we suspect that the APIs will always be at least
subtly different, then keeping them obviously visually different
seems to have some benefit.  (i.e. 
systemctl set-property httpd.service CPUShares=500 MemoryLimit=500M
vs
dbus-send cgmanager set-value http.server cpushares:500 
memorylimit:500M swaplimit:1G
) rather than have admins try to remember now why did that not work
here, oh yeah, MemoryLimit over here should be Memorylimit or whatever.

Then again if lmctfy is the layer which admins will use, then it
doesn't matter as much.

 It would be best if we could get a common DBUS api specc'ed and all
 agree to it.  Serge, do you feel up to that?

Not sure what you mean - I'll certainly send the API to these lists as
the code is developed, and will accept all feedback that I get.  My only
requirements are that the requirements I've listed in the document
be feasible, and be feasible back to, say, 3.2 kernels.  So that is
why we must send an scm-cred for the pid to move into a cgroup.  (With
3.12 we may have alterntives, accepting a vpid as a simple dbus message
and setns()ing into the requestor's pidns to echo the pid into the
cgroup.tasks file.)

-serge

--
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET,  PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351iu=/4140/ostg.clktrk
___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Victor Marmol

On Tue, Nov 26, 2013 at 8:41 AM, Serge E. Hallyn se...@hallyn.com wrote:

Quoting Victor Marmol (vmar...@google.com):
On Tue, Nov 26, 2013 at 8:12 AM, Serge E. Hallyn se...@hallyn.com
wrote:

From Rohit's talk at Linux plumbers:

http://www.linuxplumbersconf.net/2013/ocw//system/presentations/1239/original/lmctfy%20(1).pdf

The goal is O(1000) reads and O(100) writes per second.

Cool, thanks. I can try and get a sense next week of how far off the
mark I am for reads.

Can you talk about logging - what and where?

Did you have anything in mind? Did you want logging to help
detect certain conditions for system optimization, or just
for failure notices and security violations?

How will we handle event_fd? Pass a file-descriptor back to the
caller?

I'm also not sure whether the cgroup interface is going to be
offering a new feature to replace eventfd, since it wants
people to stop using cgroupfs... Tejun?

From my discussions with Tejun, he wanted to move to using inotify so it
may still be an fd we pass around.

Hm, would that just be inotify on the memory.max_usage_in_bytes
file, of inotify on a specific fd you've created which is
associated with any threshold you specify? The former seems
less ideal.

Tejun can comment more, but I think it is still TBD.

-serge

Re: [lxc-devel] cgroup management daemon

On Mon, Nov 25, 2013 at 9:47 PM, Serge E. Hallyn se...@hallyn.com wrote:
 Quoting Tim Hockin (thoc...@google.com):
 Thanks for this!  I think it helps a lot to discuss now, rather than
 over nearly-done code.

 On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn se...@hallyn.com wrote:
  Additionally, Tejun has specified that we do not want users to be
  too closely tied to the cgroupfs implementation.  Therefore
  commands will be just a hair more general than specifying cgroupfs
  filenames and values.  I may go so far as to avoid specifying
  specific controllers, as AFAIK there should be no redundancy in
  features.  On the other hand, I don't want to get too general.
  So I'm basing the API loosely on the lmctfy command line API.

 I'm torn here.  While I agree in principle with Tejun, I am concerned
 that this agent will always lag new kernel features or that the thin
 abstraction you want to provide here does not easily accommodate some
 of the more ... oddball features of one cgroup interface or another.

 This agent is the very bottom of the stack, and should probably not do
 much by way of abstraction.  I think I'd rather let something like
 lmctfy provide the abstraction more holistically, and relegate this

 If lmctfy is an abstraction layer that should keep Tejun happy, and
 it could keep me out of the resource naming game which makes me happy :)

 agent to very simple plumbing and policy.  It could be as simple as
 providing read/write/etc ops to specific control files.  It needs to
 handle event_fd, too, I guess.  This has the nice side-effect of
 always being current on kernel features :)

  Summary
 
  Each 'host' (identified by a separate instance of the linux kernel) will
  have exactly one running daemon to manage control groups.  This daemon
  will answer cgroup management requests over a dbus socket, located at
  /sys/fs/cgroup/manager.  This socket can be bind-mounted into various
  containers, so that one daemon can support the whole system.
 
  Programs will be able to make cgroup requests using dbus calls, or
  indirectly by linking against lmctfy which will be modified to use the
  dbus calls if available.
 
  Outline:
. A single manager, cgmanager, is started on the host, very early
  during boot.  It has very few dependencies, and requires only
  /proc, /run, and /sys to be mounted, with /etc ro.  It will mount
  the cgroup hierarchies in a private namespace and set defaults
  (clone_children, use_hierarchy, sane_behavior, release_agent?) It
  will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).

 Where does the config come from?  How do I specify which hierarchies I
 want and where, and which flags?

 That'll have to be in a file in /etc (which can be mounted readonly).
 There should be no surprises there so I've not thought about the format.

. A client (requestor 'r') can make cgroup requests over
  /sys/fs/cgroup/manager using dbus calls.  Detailed privilege
  requirements for r are listed below.
. The client request will pertain an existing or new cgroup A.  r's
  privilege over the cgroup must be checked.  r is said to have
  privilege over A if A is owned by r's uid, or if A's owner is mapped
  into r's user namespace, and r is root in that user namespace.

 Problem with this definition.  Being owned-by is not the same as
 has-root-in.  Specifically, I may choose to give you root in your own
 namespace, but you sure as heck can not increase your own memory
 limit.

 1. If you don't want me to change the value at all, then just don't map
 A's owner into the namespace.  I'm uid 10 which is root in my namespace,
 but I only have privilege over other uids mapped into my namespace.

I think I understand this, but it is subtle.  Maybe some examples would help?

 2. I've considered never allowing changes to your own cgroup.  So if you're
 in /a/b, you can create /a/b/c and modify c's settings, but you can't modify
 b's.  OTOH, that isn't strictly necessary - if we did allow it, then you
 could simply clam /a/b's memory to what you want, and stick me in /a/b/c,
 so I can't escape the memory limit you wanted.

This is different from what we do internally, but it's an interesting
semantic.  I'm wary of how much we want to make this API about
enforcement of policy vs simple enactment.  In other words, semantics
that diverge from UNIX ownership might be more complicated to
understand than they are worth.

 3. I've not considered having the daemon track resource limits - i.e. creating
 a cgroup and saying give it 100M swap, and if it asks, let it increase that
 to 200M.  I'd prefer that be done incidentally through (1) and (2).  Do you
 feel that would be insufficient?

I think this is a higher-level issue that should not be addressed here.

 Or maybe your question is something different and I'm missing it?

My point was that I, as machine admin, create a memory cgroup of 100
MB for you and put you in it.   I also

Re: [lxc-devel] cgroup management daemon

On Tue, Nov 26, 2013 at 8:12 AM, Serge E. Hallyn se...@hallyn.com wrote:
 Quoting Tim Hockin (thoc...@google.com):
 What are the requirements/goals around performance and concurrency?
 Do you expect this to be a single-threaded thing, or can we handle
 some number of concurrent operations?  Do you expect to use threads of
 processes?

 The cgmanager should be pretty dumb, so I would expect it to be
 quite fast.  I don't have any specific perf goals though.  If you
 have requirements I'm very interested to hear them.  I should be
 able to tell pretty soon how far short I fall.

If we're limiting this to write traffic only, I think our perf goals
are fairly relaxed.  As longs as you don't develop it to preclude
threading or multi-processing, we can adapt later.  I would like to
see at least a mention to this effect.  We also need to beware DoS
(accidental or otherwise) - perhaps we should force round-robin
service of pending-requests, or something.

 By default I'd expect to run with a single thread, but I don't
 imagine one thread can serve a busy 1024-cpu system very well.
 Unless you have guidance right now, I think I'd like to get
 started with the basic functionality and see how it measures
 up to your requirements.  I should add perf counters from the
 start so we can figure out where bottlenecks (if any) are and
 how to handle them.

 Otherwise I could start out with a basic numcpus/10 threadpool
 and have the main thread do socket i/o and parcel access
 verification and vfs work out to the threadpool, but I'd rather
 first know where the problems lie.

Agree.  Correct first, then fast :)

 Can you talk about logging - what and where?

 When started under upstart, anything we print out goes to
 /var/log/upstart/cgmanager.log.  Would be nice to keep it
 that simple.  We could log requests by r to do something
 it is not allowed to do, but it seems to me the failed
 attempts cause no harm, while the potential for overflowing
 logs can.

I agree that we don't want to overflow logs.

 Did you have anything in mind?  Did you want logging to help
 detect certain conditions for system optimization, or just
 for failure notices and security violations?

When something goes amiss, we have to ty to figure out what happened -
how far did a request get?  Logging every change is probably
important.  Logging failures could be downsampled and rate-limited,
something like 1 failure log per second or something.

 How will we handle event_fd?  Pass a file-descriptor back to the caller?

 The only thing currently supporting eventfd is memory threshold,
 right?  I haven't tested whether this will work or not, but
 ideally the caller would open the eventfd fd, pass it, the
 cgroup name, controller file to be watched, and the args to
 cgmanager;  cgmanager confirms read access, opens the
 controller fd, makes the request over cgroup.event_control,
 then passes the controller fd back to the caller and closes
 its own copy.

 I'm also not sure whether the cgroup interface is going to be
 offering a new feature to replace eventfd, since it wants
 people to stop using cgroupfs...  Tejun?

 That's all I can come up with for now.

--
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET,  PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351iu=/4140/ostg.clktrk
___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

On Tue, Nov 26, 2013 at 8:37 AM, Serge E. Hallyn se...@hallyn.com wrote:
 Quoting Tim Hockin (thoc...@google.com):
 At the start of this discussion, some months ago, we offered to
 co-devel this with Lennart et al.  They did not seem keen on the idea.

 If they have an established DBUS protocol spec,

 see http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/
 and http://man7.org/linux/man-pages/man5/systemd.cgroup.5.html

  we should consider
 adopting it instead of a new one, but we CAN'T just play follow the
 leader and do whatever they do, change whenever they feel like
 changing.

 Right.  And if we suspect that the APIs will always be at least
 subtly different, then keeping them obviously visually different
 seems to have some benefit.  (i.e.
 systemctl set-property httpd.service CPUShares=500 MemoryLimit=500M
 vs
 dbus-send cgmanager set-value http.server cpushares:500 
 memorylimit:500M swaplimit:1G
 ) rather than have admins try to remember now why did that not work
 here, oh yeah, MemoryLimit over here should be Memorylimit or whatever.

 Then again if lmctfy is the layer which admins will use, then it
 doesn't matter as much.

 It would be best if we could get a common DBUS api specc'ed and all
 agree to it.  Serge, do you feel up to that?

 Not sure what you mean - I'll certainly send the API to these lists as

What I meant was whether it is worth opening a discussion with the
systemd folks on a common lowest-level DBUS interface.  But it looks
like their work is already a bit higher level, so it's probably moot.

 the code is developed, and will accept all feedback that I get.  My only
 requirements are that the requirements I've listed in the document
 be feasible, and be feasible back to, say, 3.2 kernels.  So that is
 why we must send an scm-cred for the pid to move into a cgroup.  (With
 3.12 we may have alterntives, accepting a vpid as a simple dbus message
 and setns()ing into the requestor's pidns to echo the pid into the
 cgroup.tasks file.)

 -serge

--
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET,  PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351iu=/4140/ostg.clktrk
___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

Quoting Tim Hockin (thoc...@google.com):
 On Mon, Nov 25, 2013 at 9:47 PM, Serge E. Hallyn se...@hallyn.com wrote:
  Quoting Tim Hockin (thoc...@google.com):
...
 . A client (requestor 'r') can make cgroup requests over
   /sys/fs/cgroup/manager using dbus calls.  Detailed privilege
   requirements for r are listed below.
 . The client request will pertain an existing or new cgroup A.  r's
   privilege over the cgroup must be checked.  r is said to have
   privilege over A if A is owned by r's uid, or if A's owner is mapped
   into r's user namespace, and r is root in that user namespace.
 
  Problem with this definition.  Being owned-by is not the same as
  has-root-in.  Specifically, I may choose to give you root in your own
  namespace, but you sure as heck can not increase your own memory
  limit.
 
  1. If you don't want me to change the value at all, then just don't map
  A's owner into the namespace.  I'm uid 10 which is root in my namespace,
  but I only have privilege over other uids mapped into my namespace.
 
 I think I understand this, but it is subtle.  Maybe some examples would help?

When you create a user namespace, at first it is empty, and you are 'nobody'
(-1).  Then magically some uids from the host, say 10-101999, are mapped
into your namespace, to uids 0-1999.

Now assume you're uid 0 inside that namespace.  You have privilege over your
uids, 0-999, which are 10-101999 on the host.

If cgroup file A is owned by host uid 0, then the owner is not mapped into
the user namespace.  uid 0 inside the namespace only gets the world access
rights to that file.

If cgroup file A is owned by host uid 100100, then uid 0 in the
namespace has access to that file by virtue of being root, and uid 100
in the namespace (100100 on the host) has access to the file by virtue
of being the owner.

  2. I've considered never allowing changes to your own cgroup.  So if you're
  in /a/b, you can create /a/b/c and modify c's settings, but you can't modify
  b's.  OTOH, that isn't strictly necessary - if we did allow it, then you
  could simply clam /a/b's memory to what you want, and stick me in /a/b/c,
  so I can't escape the memory limit you wanted.
 
 This is different from what we do internally, but it's an interesting
 semantic.  I'm wary of how much we want to make this API about
 enforcement of policy vs simple enactment.  In other words, semantics
 that diverge from UNIX ownership might be more complicated to
 understand than they are worth.

The semantics I gave are exactly the user namespace semantics.  If you're
not using a user namespace then they simply do not apply, and you are back
to strict UNIX ownership semantics that you want.  But allowing 'root' in
a user namespace to have privilege over uids, without having any privilege
outside its own namespace, must be honored for this to be usable by lxc.

Like I said, on the bright side, if you don't want to care about user
namespaces, then everything falls back to strict unix semantics - so if
you don't want to care, you don't have to care.

  3. I've not considered having the daemon track resource limits - i.e. 
  creating
  a cgroup and saying give it 100M swap, and if it asks, let it increase that
  to 200M.  I'd prefer that be done incidentally through (1) and (2).  Do you
  feel that would be insufficient?
 
 I think this is a higher-level issue that should not be addressed here.
 
  Or maybe your question is something different and I'm missing it?
 
 My point was that I, as machine admin, create a memory cgroup of 100
 MB for you and put you in it.   I also give you root-in-namespace.
 You must not be able to change 100 MB to 200 MB.  From your (1) you
 are saying that system UID 0 owns the cgroup and is NOT mapped into
 your namespace.  Therefore your definition holds.  I think I can buy
 that.
 
 . The client request may pertain a victim task v, which may be moved
   to a new cgroup.  In that case r's privilege over both the cgroup
   and v must be checked.  r is said to have privilege over v if v
   is mapped in r's pid namespace, v's uid is mapped into r's user ns,
   and r is root in its userns.  Or if r and v have the same uid
   and v is mapped in r's pid namespace.
 . r's credentials will be taken from socket's peercred, ensuring that
   pid and uid are translated.
 . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
   translated global pid.  It will then read UID(v) from 
   /proc/PID(v)/status,
   which is the global uid, and check /proc/PID(r)/uid_map to see 
   whether
   UID is mapped there.
 . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
   the kernel translate it for the reader.  Only 'move task v to cgroup
   A' will require a SCM_CREDENTIAL to be sent.
  
   Privilege requirements by action:
   * Requestor of an action (r) over a socket may only make
 changes to cgroups over

Re: [lxc-devel] cgroup management daemon

lmctfy literally supports .. as a container name :)

On Tue, Nov 26, 2013 at 12:58 PM, Serge E. Hallyn se...@hallyn.com wrote:
 Quoting Tim Hockin (thoc...@google.com):
 On Mon, Nov 25, 2013 at 9:47 PM, Serge E. Hallyn se...@hallyn.com wrote:
  Quoting Tim Hockin (thoc...@google.com):
 ...
 . A client (requestor 'r') can make cgroup requests over
   /sys/fs/cgroup/manager using dbus calls.  Detailed privilege
   requirements for r are listed below.
 . The client request will pertain an existing or new cgroup A.  r's
   privilege over the cgroup must be checked.  r is said to have
   privilege over A if A is owned by r's uid, or if A's owner is mapped
   into r's user namespace, and r is root in that user namespace.
 
  Problem with this definition.  Being owned-by is not the same as
  has-root-in.  Specifically, I may choose to give you root in your own
  namespace, but you sure as heck can not increase your own memory
  limit.
 
  1. If you don't want me to change the value at all, then just don't map
  A's owner into the namespace.  I'm uid 10 which is root in my 
  namespace,
  but I only have privilege over other uids mapped into my namespace.

 I think I understand this, but it is subtle.  Maybe some examples would help?

 When you create a user namespace, at first it is empty, and you are 'nobody'
 (-1).  Then magically some uids from the host, say 10-101999, are mapped
 into your namespace, to uids 0-1999.

 Now assume you're uid 0 inside that namespace.  You have privilege over your
 uids, 0-999, which are 10-101999 on the host.

 If cgroup file A is owned by host uid 0, then the owner is not mapped into
 the user namespace.  uid 0 inside the namespace only gets the world access
 rights to that file.

 If cgroup file A is owned by host uid 100100, then uid 0 in the
 namespace has access to that file by virtue of being root, and uid 100
 in the namespace (100100 on the host) has access to the file by virtue
 of being the owner.

  2. I've considered never allowing changes to your own cgroup.  So if you're
  in /a/b, you can create /a/b/c and modify c's settings, but you can't 
  modify
  b's.  OTOH, that isn't strictly necessary - if we did allow it, then you
  could simply clam /a/b's memory to what you want, and stick me in /a/b/c,
  so I can't escape the memory limit you wanted.

 This is different from what we do internally, but it's an interesting
 semantic.  I'm wary of how much we want to make this API about
 enforcement of policy vs simple enactment.  In other words, semantics
 that diverge from UNIX ownership might be more complicated to
 understand than they are worth.

 The semantics I gave are exactly the user namespace semantics.  If you're
 not using a user namespace then they simply do not apply, and you are back
 to strict UNIX ownership semantics that you want.  But allowing 'root' in
 a user namespace to have privilege over uids, without having any privilege
 outside its own namespace, must be honored for this to be usable by lxc.

 Like I said, on the bright side, if you don't want to care about user
 namespaces, then everything falls back to strict unix semantics - so if
 you don't want to care, you don't have to care.

  3. I've not considered having the daemon track resource limits - i.e. 
  creating
  a cgroup and saying give it 100M swap, and if it asks, let it increase 
  that
  to 200M.  I'd prefer that be done incidentally through (1) and (2).  Do 
  you
  feel that would be insufficient?

 I think this is a higher-level issue that should not be addressed here.

  Or maybe your question is something different and I'm missing it?

 My point was that I, as machine admin, create a memory cgroup of 100
 MB for you and put you in it.   I also give you root-in-namespace.
 You must not be able to change 100 MB to 200 MB.  From your (1) you
 are saying that system UID 0 owns the cgroup and is NOT mapped into
 your namespace.  Therefore your definition holds.  I think I can buy
 that.

 . The client request may pertain a victim task v, which may be moved
   to a new cgroup.  In that case r's privilege over both the cgroup
   and v must be checked.  r is said to have privilege over v if v
   is mapped in r's pid namespace, v's uid is mapped into r's user ns,
   and r is root in its userns.  Or if r and v have the same uid
   and v is mapped in r's pid namespace.
 . r's credentials will be taken from socket's peercred, ensuring that
   pid and uid are translated.
 . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
   translated global pid.  It will then read UID(v) from 
   /proc/PID(v)/status,
   which is the global uid, and check /proc/PID(r)/uid_map to see 
   whether
   UID is mapped there.
 . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
   the kernel translate it for the reader.  Only 'move task v to cgroup
   A' will require a

Re: [lxc-devel] cgroup management daemon

Quoting Tim Hockin (thoc...@google.com):
 lmctfy literally supports .. as a container name :)

So is ../.. ever used, or does noone every do anything beyond ..?

--
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET,  PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351iu=/4140/ostg.clktrk
___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

2013-11-26 Thread Victor Marmol

I think most of our usecases have only wanted to know about the parent, but
I can see people wanting to go further. Would it be much different to
support both? I feel like it'll be simpler to support all if we go that
route.


On Tue, Nov 26, 2013 at 1:28 PM, Serge E. Hallyn se...@hallyn.com wrote:

 Quoting Tim Hockin (thoc...@google.com):
  lmctfy literally supports .. as a container name :)

 So is ../.. ever used, or does noone every do anything beyond ..?

--
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET,  PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351iu=/4140/ostg.clktrk___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon

I see three models:

1) Don't virtualize the cgroup path.  This is what lmctfy does,
though we have discussed changing to:

2) Virtualize to an administrative root - I get to tell you where
your root is, and you can't see anythign higher than that.

3) Virtualize to CWD root - you can never go up, just down.


#1 seems easy, but exposes a lot.  #3 is restrictive and fairly easy -
could we live with that?  #2 seems ideal, but it's not clear to me how
to actually implement it.

On Tue, Nov 26, 2013 at 1:31 PM, Victor Marmol vmar...@google.com wrote:
 I think most of our usecases have only wanted to know about the parent, but
 I can see people wanting to go further. Would it be much different to
 support both? I feel like it'll be simpler to support all if we go that
 route.


 On Tue, Nov 26, 2013 at 1:28 PM, Serge E. Hallyn se...@hallyn.com wrote:

 Quoting Tim Hockin (thoc...@google.com):
  lmctfy literally supports .. as a container name :)

 So is ../.. ever used, or does noone every do anything beyond ..?



--
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET,  PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351iu=/4140/ostg.clktrk
___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Re: [lxc-devel] cgroup management daemon